Question

If I have three gpus and I need to transfer a huge buffer to all three of them, will it make any difference if I use a CUDA stream for each one of them so that their copy engines can perform the transfers simultaneously? I mean: the PCI-E bus to reach all three of them is the same, isn't it?

Solution

PCIe scaling is one of the areas covered in this textbook on CUDA, using a number of different processor architectures.

Yes, you need to use a separate stream for each transfer; this moves you away from the default stream (stream zero), which serializes operations. You will still hit various bandwidth limits, but the transfers will run concurrently and you do get a speedup over performing them sequentially.
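A minimal sketch of the idea (not code from the book): one stream per GPU, with the copies issued via cudaMemcpyAsync from pinned host memory, which is required for the copy engines to overlap the transfers. The three-GPU count and the 256 MiB buffer size are illustrative assumptions, and error checking is trimmed for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int numGpus = 3;                        // assumed GPU count
    const size_t BUF_BYTES = 256ull << 20;        // 256 MiB, illustrative only

    // Pinned (page-locked) host memory is required for truly asynchronous copies.
    char* hostBuf;
    cudaMallocHost((void**)&hostBuf, BUF_BYTES);

    char*        devBuf[numGpus];
    cudaStream_t stream[numGpus];

    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);                         // streams and allocations are per-device
        cudaMalloc((void**)&devBuf[i], BUF_BYTES);
        cudaStreamCreate(&stream[i]);
        // Enqueue the copy without waiting for it to finish,
        // so all three transfers can be in flight at once.
        cudaMemcpyAsync(devBuf[i], hostBuf, BUF_BYTES,
                        cudaMemcpyHostToDevice, stream[i]);
    }

    // Wait for all transfers to complete.
    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }

    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamDestroy(stream[i]);
        cudaFree(devBuf[i]);
    }
    cudaFreeHost(hostBuf);
    printf("All transfers done\n");
    return 0;
}
```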

However, you will be limited by the ability of the processor/memory/PCIe controller to feed concurrent (PCIe 2.0) 5 GB/s streams. As long as adding more cards does not reduce the number of PCIe lanes available to each, you usually see a significant benefit. Generally this works well for two cards, but the gains drop away rapidly beyond three cards as bandwidth issues get in the way. In particular, with more than two cards you are unlikely to have the full 16 PCIe lanes available per card on many systems.
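If you want to quantify that limit on your own system, a simple host-side timing of the batched copies gives the aggregate bandwidth actually achieved. This sketch assumes the hostBuf/devBuf/stream variables from the previous example; if the host, memory, or PCIe controller cannot feed all cards at full rate, the result will fall well below numGpus times the per-card bandwidth.

```cuda
#include <cuda_runtime.h>
#include <chrono>

double measureAggregateBandwidth(char* hostBuf, char** devBuf,
                                 cudaStream_t* stream, int numGpus,
                                 size_t bytes) {
    auto start = std::chrono::high_resolution_clock::now();

    // Enqueue one copy per GPU so they run concurrently.
    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaMemcpyAsync(devBuf[i], hostBuf, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
    }
    // Wait for every transfer to finish before stopping the clock.
    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }

    double seconds = std::chrono::duration<double>(
        std::chrono::high_resolution_clock::now() - start).count();

    // Total bytes moved across all GPUs, divided by wall-clock time, in GB/s.
    return (double)numGpus * bytes / seconds / 1e9;
}
```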

The Nsight tool is very good at displaying timelines that show what is going on with the transfers, as well as the actual transfer rates achieved, so I suggest you give it a try to see what is really happening.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow