Firstly never try and implement a locking algorithm on GPU. It will deadlock and stall. This is because a GPU is a SIMD device and the threads don't execute independently as on a CPU. A GPU executes a set threads called a WARP/WaveFront synchronously. So if one thread in the Wave Front is stalled it stalls all other threads in the Wave Front. If the unlock thread is in a stalled Wave Front it will NOT execute and unlock the the mutex.
Atomic operations are OK.
What you should consider is a lock free approach. See this paper for an explanation and sample CUDA code: http://www.cse.iitk.ac.in/users/mainakc/pub/icpads2012.pdf/
It describes lock free hash tables, linked list and skip lists with some sample CUDA code.
A suggested approach is to create a two level data structure.
The first level is a lock free skip list. Each skip list entry has the second level structure of a lock free linked list for duplicate values. And an atomic count of the number of entries.
The insert method
1) Generate 64 bucket key 2) Find key in Skip list 3) If not found insert into the skip list 4) Insert data in to linked list 5) increment the atomic counter for this bucket
After insertion prefix sum all the counters of the skip list buckets so you find the offset of the out put.