Question

I have an application where I split the processing load among the GPUs on a user's system. Basically, there is a CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread.

Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval -- here the application is using a single GPU.

[Image: profiler timeline of a processing interval on a single GPU]

As you can see, a big portion of the GPU processing time is consumed by the two sorting operations, for which I am using the Thrust library (thrust::sort_by_key). Also, it looks like thrust::sort_by_key issues a few cudaMalloc calls under the hood before it starts the actual sort.
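For context, the sort calls in question look roughly like this (the key/value types and names here are placeholders, not the actual application code):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    void sort_interval(thrust::device_vector<unsigned int>& keys,
                       thrust::device_vector<float>&        values)
    {
        // Each call allocates (and later frees) its temporary device storage
        // internally -- this is where the hidden cudaMalloc calls come from.
        thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    }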

Now consider the same processing interval where the application has spread the processing load over two GPUs:

[Image: profiler timeline of the same processing interval split across two GPUs]

In a perfect world you would expect the 2 GPU processing interval to be exactly half that of the single GPU (because each GPU is doing half the work). As you can see, this is not the case, partially because the cudaMalloc calls seem to take longer when they are made simultaneously (sometimes 2-3 times longer) due to some sort of contention issue. I don't see why this needs to be the case, because the memory allocation spaces for the 2 GPUs are completely independent, so there shouldn't be a system-wide lock on cudaMalloc -- a per-GPU lock would be more reasonable.

To prove my hypothesis that the issue is with simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (one per GPU), each calling cudaMalloc several times. I first ran this program so that the separate threads do not call cudaMalloc at the same time:

[Image: profiler timeline of the test program with staggered cudaMalloc calls]

You can see that it takes ~175 microseconds per allocation. Next, I ran the program with the threads calling cudaMalloc simultaneously:

[Image: profiler timeline of the test program with simultaneous cudaMalloc calls]

Here, each call took ~538 microseconds, or roughly 3 times longer than in the previous case! Needless to say, this is slowing down my application tremendously, and it stands to reason the issue would only get worse with more than 2 GPUs.
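For reference, a minimal sketch of such a reproducer (thread counts, allocation sizes, and names are illustrative, not the exact test program; it assumes a C++11-capable host compiler for std::thread) might look like this:

    #include <cuda_runtime.h>
    #include <thread>
    #include <vector>

    // Each thread binds to its own GPU and performs a burst of cudaMalloc calls.
    void alloc_worker(int device, int num_allocs, size_t bytes)
    {
        cudaSetDevice(device);
        std::vector<void*> ptrs(num_allocs, nullptr);
        for (int i = 0; i < num_allocs; ++i)
            cudaMalloc(&ptrs[i], bytes);      // the region timed in the profiler
        for (int i = 0; i < num_allocs; ++i)
            cudaFree(ptrs[i]);
    }

    int main()
    {
        // Start both workers together for the simultaneous case, or join the
        // first before starting the second for the staggered case.
        std::thread t0(alloc_worker, 0, 10, size_t(200) << 20);
        std::thread t1(alloc_worker, 1, 10, size_t(200) << 20);
        t0.join();
        t1.join();
        return 0;
    }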

I have noticed this behavior on both Linux and Windows. On Linux I am using NVIDIA driver version 319.60, and on Windows version 327.23. I am using CUDA toolkit 5.5.

Possible Reason: I am using a GTX 690 in these tests. This card is basically two 680-like GPUs housed in a single unit. This is the only "multi-GPU" setup I've run, so perhaps the cudaMalloc issue has something to do with some hardware dependence between the 690's two GPUs?


Solution 2

To summarize the problem and give a possible solution:

The cudaMalloc contention probably stems from driver-level contention (possibly due to the need to switch device contexts, as talonmies suggested), and one could avoid this extra latency in performance-critical sections by cudaMalloc-ing temporary buffers beforehand.

It looks like I probably need to refactor my code so that I am not calling any sorting routine that calls cudaMalloc under the hood (in my case thrust::sort_by_key). The CUB library looks promising in this regard. As a bonus, CUB also exposes a CUDA stream parameter to the user, which could also serve to boost performance.

See CUB (CUDA UnBound) equivalent of thrust::gather for some details on moving from thrust to CUB.
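As a rough illustration of the pattern (the buffer names and key/value types here are placeholders), cub::DeviceRadixSort::SortPairs lets you size and allocate the temporary storage once during initialization and then reuse it for every sort, optionally on a per-GPU stream:

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    // One-time setup per GPU: query the temporary storage size and allocate it
    // outside the performance-critical processing interval.
    void setup_sort(unsigned int* d_keys_in, unsigned int* d_keys_out,
                    float* d_vals_in, float* d_vals_out, int num_items,
                    void** d_temp, size_t* temp_bytes)
    {
        *d_temp = NULL;
        *temp_bytes = 0;
        // With a null temp pointer, SortPairs only computes the required size.
        cub::DeviceRadixSort::SortPairs(*d_temp, *temp_bytes,
                                        d_keys_in, d_keys_out,
                                        d_vals_in, d_vals_out, num_items);
        cudaMalloc(d_temp, *temp_bytes);
    }

    // Inside each processing interval: no allocations, just the sort itself,
    // issued on this GPU's stream.
    void run_sort(void* d_temp, size_t temp_bytes,
                  unsigned int* d_keys_in, unsigned int* d_keys_out,
                  float* d_vals_in, float* d_vals_out, int num_items,
                  cudaStream_t stream)
    {
        cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                        d_keys_in, d_keys_out,
                                        d_vals_in, d_vals_out, num_items,
                                        0, int(sizeof(unsigned int)) * 8, stream);
    }

Sizing the buffer for the largest num_items you expect means the query and the cudaMalloc only have to happen once per GPU.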

UPDATE:

I backed out the calls to thrust::sort_by_key in favor of cub::DeviceRadixSort::SortPairs. Doing this shaved milliseconds off my per-interval processing time. Also, the multi-GPU contention issue has resolved itself -- offloading to 2 GPUs almost drops the processing time by 50%, as expected.

OTHER TIPS

I will preface this with a disclaimer: I'm not privy to the internals of the NVIDIA driver, so this is somewhat speculative.

The slowdown you are seeing is just driver-level contention caused by multiple threads calling device malloc simultaneously. Device memory allocation requires a number of OS system calls, as does driver-level context switching. There is a non-trivial amount of latency in both operations. It is probable that the extra time you see when two threads try to allocate memory simultaneously is caused by the additional driver latency of switching from one device to another throughout the sequence of system calls required to allocate memory on both devices.

I can think of a few ways you should be able to mitigate this:

  • You could reduce the system call overhead of Thrust's memory allocation to zero by writing your own custom Thrust memory allocator for the device that works off a slab of memory allocated during initialisation (see the sketch after this list). This would get rid of all of the system call overhead within each sort_by_key, but the effort of writing your own user memory manager is non-trivial. On the other hand, it leaves the rest of your Thrust code intact.
  • You could switch to an alternative sort library and manage the allocation of temporary memory yourself. If you do all the allocation in an initialization phase, the cost of the one-time memory allocations can be amortized to almost zero over the life of each thread.
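A minimal sketch of the first idea, assuming a Thrust version with execution policies (1.7 or later); the slab_allocator name, the slab size, and the lack of alignment tuning or thread safety are all simplifications for illustration:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/system/cuda/execution_policy.h>
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <new>

    // Toy slab allocator: one cudaMalloc at construction, bump-pointer
    // allocations afterwards, reset() between processing intervals.
    struct slab_allocator
    {
        typedef char value_type;

        char*  base;
        size_t capacity;
        size_t offset;

        explicit slab_allocator(size_t bytes) : base(0), capacity(bytes), offset(0)
        {
            cudaMalloc((void**)&base, capacity);
        }
        ~slab_allocator() { cudaFree(base); }

        char* allocate(std::ptrdiff_t num_bytes)
        {
            // Keep sub-allocations 256-byte aligned.
            size_t aligned = (size_t(num_bytes) + 255) & ~size_t(255);
            if (offset + aligned > capacity) throw std::bad_alloc();
            char* p = base + offset;
            offset += aligned;
            return p;
        }

        // Thrust calls this when it is done with its scratch space; the slab
        // itself is only released in the destructor.
        void deallocate(char*, size_t) {}

        void reset() { offset = 0; }
    };

    // Usage: route Thrust's temporary allocations through the slab.
    //   slab_allocator alloc(64 << 20);   // e.g. a 64 MB slab per GPU
    //   thrust::sort_by_key(thrust::cuda::par(alloc),
    //                       keys.begin(), keys.end(), values.begin());
    //   alloc.reset();                    // before the next interval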

In multi-GPU CUBLAS-based linear algebra codes I have written, I combined both ideas and wrote a standalone user-space device memory manager which works off a one-time allocated device memory pool. I found that removing all of the overhead cost of intermediate device memory allocations yielded a useful speed-up. Your use case might benefit from a similar strategy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow