In your case, the size of the input data is 10000 * 4 * 2 since you have 2 input vectors, and the size of the output data is 4. The effective bandwidth should be about 0.172 GB/s.
Basically cublasSdot()
does nothing much more than computing.
Profile result shows cublasSdot()
invokes 2 kernels to compute the result. An extra 4-bytes device-to-host mem transfer is also invoked if the pointer mode is CUBLAS_POINTER_MODE_HOST
, which is the default mode for cublas lib.