Measuring effective bandwidth on CUDA

https://stackoverflow.com/questions/14948275

10-03-2022

Question

I want to know how to calculate the total effective memory bandwidth for:

cublasSdot(handle, M, devPtrA, 1, devPtrB, 1, &curesult);

where that function belongs to cublas_v2.h.

The function runs in 0.46 ms, and each vector is 10000 * sizeof(float) bytes.

Is the bandwidth ((10000 * 4) / 10^9) / 0.00046 = 0.086 GB/s?

I'm asking because I don't know what happens inside the cublasSdot function, and I don't know whether I need to know that.

Solution

In your case, the size of the input data is 10000 * 4 * 2 bytes, since you have 2 input vectors, and the size of the output data is 4 bytes. The effective bandwidth is therefore (80004 / 10^9) / 0.00046, which is about 0.174 GB/s.
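The arithmetic above can be written out as a small host-side helper. This is only a sketch: the function names `sdot_bytes` and `effective_bandwidth_gbs` are made up for illustration, `sizeof(float)` is assumed to be 4, and 1 GB is taken as 10^9 bytes, as in the question.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by a single-precision dot product of two n-element vectors:
// both input vectors are read once and one float result is written.
// (Hypothetical helper, not part of cuBLAS.)
constexpr std::size_t sdot_bytes(std::size_t n) {
    return 2 * n * sizeof(float) + sizeof(float);
}

// Effective bandwidth in GB/s (1 GB = 10^9 bytes) for a time given in seconds.
inline double effective_bandwidth_gbs(std::size_t bytes, double elapsed_s) {
    assert(elapsed_s > 0.0);  // a zero or negative time makes no sense here
    return static_cast<double>(bytes) / 1e9 / elapsed_s;
}
```

Calling `effective_bandwidth_gbs(sdot_bytes(10000), 0.46e-3)` reproduces the figure for the numbers in the question. Note that the time is passed in seconds, so 0.46 ms becomes 0.46e-3.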

Basically, cublasSdot() does little beyond the computation itself. Profiling shows that cublasSdot() launches 2 kernels to compute the result. An additional 4-byte device-to-host memory transfer is also performed if the pointer mode is CUBLAS_POINTER_MODE_HOST, which is the default mode for the cuBLAS library.

Other tips

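If the kernel time is plugged into the formula in ms, a multiplication factor of 1000 is necessary to get a per-second figure. The calculation in the question already converts 0.46 ms to 0.00046 s, so no further factor applies: 40000 bytes in 0.46 ms is about 0.087 GB/s (roughly 87 MB/s), not 86 GB/s.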

As an example, refer to the Matrix Transpose sample provided by NVIDIA at http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf

The entire code is on the last page. There, the effective bandwidth is computed as 2.*1000*mem_size/(1024*1024*1024)/(time in ms). The factor 1000 converts ms to s, and dividing by 1024*1024*1024 gives the result in units of 2^30 bytes per second.
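As a rough sketch, the quoted formula can be wrapped in a function and compared with the decimal GB/s convention used in the question. The function names are hypothetical, and reading the factor 2 as "one read plus one write of `mem_size` bytes" is an interpretation of the transpose sample, not something stated in the text above.

```cpp
#include <cassert>
#include <cstddef>

// The formula as quoted: time in milliseconds, result in units of
// 2^30 bytes per second. The factor 2 is assumed to count one read
// and one write of mem_size bytes. (Hypothetical helper name.)
inline double transpose_bandwidth_gibs(std::size_t mem_size, double time_ms) {
    assert(time_ms > 0.0);
    return 2.0 * 1000.0 * static_cast<double>(mem_size)
           / (1024.0 * 1024.0 * 1024.0) / time_ms;
}

// The same traffic expressed in decimal GB/s (1 GB = 10^9 bytes),
// converting milliseconds to seconds explicitly.
inline double transpose_bandwidth_gbs(std::size_t mem_size, double time_ms) {
    assert(time_ms > 0.0);
    const double seconds = time_ms / 1000.0;
    return 2.0 * static_cast<double>(mem_size) / 1e9 / seconds;
}
```

The two results differ by about 7% for the same inputs, so it is worth stating which definition of "GB" a reported figure uses.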

Licensed under: CC-BY-SA with attribution
