Retaining dot product on GPGPU using CUBLAS routine

https://stackoverflow.com/questions/12400477

01-07-2021
|

Question

I am writing a code to compute dot product of two vectors using CUBLAS routine of dot product but it returns the value in host memory. I want to use the dot product for further computation on GPGPU only. How can I make the value reside on GPGPU only and use it for further computations without making an explicit copy from CPU to GPGPU?

Solution

~~You can't, exactly, using CUBLAS.~~ As per talonmies' answer, starting with the CUBLAS V2 api (CUDA 4.0) the return value can be a device pointer. Refer to his answer. But if you are using the V1 API it's a single value, so it's pretty trivial to pass it as an argument to a kernel that uses it—you don't need an explicit cudaMemcpy (but there is one implied in order to return a host value).

Starting with the Tesla K20 GPU and CUDA 5, you will be able to call CUBLAS routines from device kernels using CUDA Dynamic Parallelism. This means you would be able to call cublasSdot (for example) from inside a __global__ kernel function, and your result would therefore be returned on the GPU.

OTHER TIPS

You can do this in CUBLAS as long as you use the "V2" API. The newer API includes a function cublasSetPointerMode which you can use to set the API to assume that all routines which return a scalar value will be passed a device pointer rather than a host pointer. This is discussed in Section 2.4 of the latest CUBLAS documentation. For example:

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void)
{
    const int nvals = 10;
    const size_t sz = sizeof(double) * (size_t)nvals;
    double x[nvals], y[nvals];
    double *x_, *y_, *result_;
    double result=0., resulth=0.;

    for(int i=0; i<nvals; i++) {
        x[i] = y[i] = (double)(i)/(double)(nvals);
        resulth += x[i] * y[i];
    }

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);
    
    cudaMalloc( (void **)(&x_), sz);
    cudaMalloc( (void **)(&y_), sz);
    cudaMalloc( (void **)(&result_), sizeof(double) );

    cudaMemcpy(x_, x, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(y_, y, sz, cudaMemcpyHostToDevice);

    cublasDdot(h, nvals, x_, 1, y_, 1, result_);

    cudaMemcpy(&result, result_, sizeof(double), cudaMemcpyDeviceToHost);

    printf("%f %f\n", resulth, result);

    cublasDestroy(h);
    return 0;
}

Using CUBLAS_POINTER_MODE_DEVICE makes cublasDdot assume that result_ is a device pointer, and there is no attempt made to copy the result back to the host. Note that this makes routines like dot asynchronous, so you might need to keep on eye on synchronization between device and host.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow