Question

If I have three gpus and I need to transfer a huge buffer to all three of them, will it make any difference if I use a CUDA stream for each one of them so that their copy engines can perform the transfers simultaneously? I mean: the PCI-E bus to reach all three of them is the same, isn't it?

Solution

PCIe scaling is one of the areas covered in this textbook on CUDA, using a number of different processor architectures.

Yes, you need to use a separate stream for each transfer; this moves you away from the default stream (stream zero), which serializes operations. You will still hit various bandwidth limits, but the transfers will run concurrently and you do get a speedup over performing them sequentially.
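A minimal sketch of the idea (not code from the book): one stream per GPU, with the copies issued via cudaMemcpyAsync from pinned host memory, which is required for the copy engines to overlap the transfers. The three-GPU count and the 256 MiB buffer size are illustrative assumptions, and error checking is trimmed for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int numGpus = 3;                        // assumed GPU count
    const size_t BUF_BYTES = 256ull << 20;        // 256 MiB, illustrative only

    // Pinned (page-locked) host memory is required for truly asynchronous copies.
    char* hostBuf;
    cudaMallocHost((void**)&hostBuf, BUF_BYTES);

    char*        devBuf[numGpus];
    cudaStream_t stream[numGpus];

    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);                         // streams and allocations are per-device
        cudaMalloc((void**)&devBuf[i], BUF_BYTES);
        cudaStreamCreate(&stream[i]);
        // Enqueue the copy without waiting for it to finish,
        // so all three transfers can be in flight at once.
        cudaMemcpyAsync(devBuf[i], hostBuf, BUF_BYTES,
                        cudaMemcpyHostToDevice, stream[i]);
    }

    // Wait for all transfers to complete.
    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }

    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamDestroy(stream[i]);
        cudaFree(devBuf[i]);
    }
    cudaFreeHost(hostBuf);
    printf("All transfers done\n");
    return 0;
}
```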

However, you will be limited by the ability of the processor/memory/PCIe controller to feed concurrent (PCIe 2.0) 5 GB/s streams. As long as adding more cards does not reduce the number of PCIe lanes available to each, you usually see a significant benefit. Generally this works well for two cards, but the gains drop away rapidly beyond three cards as bandwidth issues get in the way. In particular, with more than two cards you are unlikely to have the full 16 PCIe lanes available per card on many systems.
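If you want to quantify that limit on your own system, a simple host-side timing of the batched copies gives the aggregate bandwidth actually achieved. This sketch assumes the hostBuf/devBuf/stream variables from the previous example; if the host, memory, or PCIe controller cannot feed all cards at full rate, the result will fall well below numGpus times the per-card bandwidth.

```cuda
#include <cuda_runtime.h>
#include <chrono>

double measureAggregateBandwidth(char* hostBuf, char** devBuf,
                                 cudaStream_t* stream, int numGpus,
                                 size_t bytes) {
    auto start = std::chrono::high_resolution_clock::now();

    // Enqueue one copy per GPU so they run concurrently.
    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaMemcpyAsync(devBuf[i], hostBuf, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
    }
    // Wait for every transfer to finish before stopping the clock.
    for (int i = 0; i < numGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }

    double seconds = std::chrono::duration<double>(
        std::chrono::high_resolution_clock::now() - start).count();

    // Total bytes moved across all GPUs, divided by wall-clock time, in GB/s.
    return (double)numGpus * bytes / seconds / 1e9;
}
```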

The Nsight tool is very good at displaying timelines that show what is going on with the transfers, as well as the actual transfer rates achieved, so I suggest you give it a try to see what is really happening.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow