Question

I have an application where I split the processing load among the GPUs on a user's system. Basically, there is a CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread.

Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval -- here the application is using a single GPU.

[Image: profiler timeline of a processing interval on a single GPU]

As you can see, a big portion of the GPU processing time is consumed by the two sorting operations, for which I am using the Thrust library (thrust::sort_by_key). Also, it looks like thrust::sort_by_key issues a few cudaMalloc calls under the hood before it starts the actual sort.
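For context, the sort calls in question look roughly like this (the key/value types and names here are placeholders, not the actual application code):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    void sort_interval(thrust::device_vector<unsigned int>& keys,
                       thrust::device_vector<float>&        values)
    {
        // Each call allocates (and later frees) its temporary device storage
        // internally -- this is where the hidden cudaMalloc calls come from.
        thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    }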

Now consider the same processing interval where the application has spread the processing load over two GPUs:

[Image: profiler timeline of the same processing interval split across two GPUs]

In a perfect world you would expect the 2 GPU processing interval to be exactly half that of the single GPU (because each GPU is doing half the work). As you can see, this is not the case, partially because the cudaMalloc calls seem to take longer when they are made simultaneously (sometimes 2-3 times longer) due to some sort of contention issue. I don't see why this needs to be the case, because the memory allocation spaces for the 2 GPUs are completely independent, so there shouldn't be a system-wide lock on cudaMalloc -- a per-GPU lock would be more reasonable.

To prove my hypothesis that the issue is with simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (one per GPU), each calling cudaMalloc several times. I first ran this program so that the separate threads do not call cudaMalloc at the same time:

[Image: profiler timeline of the test program with staggered cudaMalloc calls]

You can see that it takes ~175 microseconds per allocation. Next, I ran the program with the threads calling cudaMalloc simultaneously:

[Image: profiler timeline of the test program with simultaneous cudaMalloc calls]

Here, each call took ~538 microseconds, or roughly 3 times longer than in the previous case! Needless to say, this is slowing down my application tremendously, and it stands to reason the issue would only get worse with more than 2 GPUs.
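For reference, a minimal sketch of such a reproducer (thread counts, allocation sizes, and names are illustrative, not the exact test program; it assumes a C++11-capable host compiler for std::thread) might look like this:

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // Each thread binds to its own GPU and performs a burst of cudaMalloc calls.
    void alloc_worker(int device, int num_allocs, size_t bytes)
    {
        cudaSetDevice(device);
        std::vector<void*> ptrs(num_allocs, nullptr);
        for (int i = 0; i < num_allocs; ++i)
            cudaMalloc(&ptrs[i], bytes);      // the region timed in the profiler
        for (int i = 0; i < num_allocs; ++i)
            cudaFree(ptrs[i]);
    }

    int main()
    {
        // Start both workers together for the simultaneous case, or join the
        // first before starting the second for the staggered case.
        std::thread t0(alloc_worker, 0, 10, size_t(200) << 20);
        std::thread t1(alloc_worker, 1, 10, size_t(200) << 20);
        t0.join();
        t1.join();
        return 0;
    }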

I have noticed this behavior on both Linux and Windows. On Linux I am using NVIDIA driver version 319.60, and on Windows version 327.23. I am using CUDA toolkit 5.5.

Possible Reason: I am using a GTX 690 in these tests. This card is basically two 680-like GPUs housed in a single unit. This is the only "multi-GPU" setup I've run, so perhaps the cudaMalloc issue has something to do with some hardware dependence between the 690's two GPUs?


Solution 2

To summarize the problem and give a possible solution:

The cudaMalloc contention probably stems from driver-level contention (possibly due to the need to switch device contexts, as talonmies suggested), and one could avoid this extra latency in performance-critical sections by cudaMalloc-ing temporary buffers beforehand.

It looks like I probably need to refactor my code so that I am not calling any sorting routine that calls cudaMalloc under the hood (in my case thrust::sort_by_key). The CUB library looks promising in this regard. As a bonus, CUB also exposes a CUDA stream parameter to the user, which could also serve to boost performance.

See CUB (CUDA UnBound) equivalent of thrust::gather for some details on moving from thrust to CUB.
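As a rough illustration of the pattern (the buffer names and key/value types here are placeholders), cub::DeviceRadixSort::SortPairs lets you size and allocate the temporary storage once during initialization and then reuse it for every sort, optionally on a per-GPU stream:

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    // One-time setup per GPU: query the temporary storage size and allocate it
    // outside the performance-critical processing interval.
    void setup_sort(unsigned int* d_keys_in, unsigned int* d_keys_out,
                    float* d_vals_in, float* d_vals_out, int num_items,
                    void** d_temp, size_t* temp_bytes)
    {
        *d_temp = NULL;
        *temp_bytes = 0;
        // With a null temp pointer, SortPairs only computes the required size.
        cub::DeviceRadixSort::SortPairs(*d_temp, *temp_bytes,
                                        d_keys_in, d_keys_out,
                                        d_vals_in, d_vals_out, num_items);
        cudaMalloc(d_temp, *temp_bytes);
    }

    // Inside each processing interval: no allocations, just the sort itself,
    // issued on this GPU's stream.
    void run_sort(void* d_temp, size_t temp_bytes,
                  unsigned int* d_keys_in, unsigned int* d_keys_out,
                  float* d_vals_in, float* d_vals_out, int num_items,
                  cudaStream_t stream)
    {
        cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                        d_keys_in, d_keys_out,
                                        d_vals_in, d_vals_out, num_items,
                                        0, int(sizeof(unsigned int)) * 8, stream);
    }

Sizing the buffer for the largest num_items you expect means the query and the cudaMalloc only have to happen once per GPU.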

UPDATE:

I backed out the calls to thrust::sort_by_key in favor of cub::DeviceRadixSort::SortPairs. Doing this shaved milliseconds off my per-interval processing time. Also, the multi-GPU contention issue has resolved itself -- offloading to 2 GPUs almost drops the processing time by 50%, as expected.

OTHER TIPS

I will preface this with a disclaimer: I'm not privy to the internals of the NVIDIA driver, so this is somewhat speculative.

The slowdown you are seeing is just driver-level contention caused by multiple threads calling device malloc simultaneously. Device memory allocation requires a number of OS system calls, as does driver-level context switching. There is a non-trivial amount of latency in both operations. It is probable that the extra time you see when two threads try to allocate memory simultaneously is caused by the additional driver latency of switching from one device to another throughout the sequence of system calls required to allocate memory on both devices.

I can think of a few ways you should be able to mitigate this:

  • You could reduce the system call overhead of Thrust's memory allocation to zero by writing your own custom Thrust memory allocator for the device that works off a slab of memory allocated during initialisation (see the sketch after this list). This would get rid of all of the system call overhead within each sort_by_key, but the effort of writing your own user memory manager is non-trivial. On the other hand, it leaves the rest of your Thrust code intact.
  • You could switch to an alternative sort library and manage the allocation of temporary memory yourself. If you do all the allocation in an initialization phase, the cost of the one-time memory allocations can be amortized to almost zero over the life of each thread.
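A minimal sketch of the first idea, assuming a Thrust version with execution policies (1.7 or later); the slab_allocator name, the slab size, and the lack of alignment tuning or thread safety are all simplifications for illustration:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/system/cuda/execution_policy.h>
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <new>

    // Toy slab allocator: one cudaMalloc at construction, bump-pointer
    // allocations afterwards, reset() between processing intervals.
    struct slab_allocator
    {
        typedef char value_type;

        char*  base;
        size_t capacity;
        size_t offset;

        explicit slab_allocator(size_t bytes) : base(0), capacity(bytes), offset(0)
        {
            cudaMalloc((void**)&base, capacity);
        }
        ~slab_allocator() { cudaFree(base); }

        char* allocate(std::ptrdiff_t num_bytes)
        {
            // Keep sub-allocations 256-byte aligned.
            size_t aligned = (size_t(num_bytes) + 255) & ~size_t(255);
            if (offset + aligned > capacity) throw std::bad_alloc();
            char* p = base + offset;
            offset += aligned;
            return p;
        }

        // Thrust calls this when it is done with its scratch space; the slab
        // itself is only released in the destructor.
        void deallocate(char*, size_t) {}

        void reset() { offset = 0; }
    };

    // Usage: route Thrust's temporary allocations through the slab.
    //   slab_allocator alloc(64 << 20);   // e.g. a 64 MB slab per GPU
    //   thrust::sort_by_key(thrust::cuda::par(alloc),
    //                       keys.begin(), keys.end(), values.begin());
    //   alloc.reset();                    // before the next interval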

In multi-GPU CUBLAS-based linear algebra codes I have written, I combined both ideas and wrote a standalone user-space device memory manager which works off a one-time allocated device memory pool. I found that removing all of the overhead cost of intermediate device memory allocations yielded a useful speed-up. Your use case might benefit from a similar strategy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow