Question

I am writing a program in CUDA and I am trying to reduce the overhead of data transfer. I use the cuBLAS library for matrix multiplications, and I have to send 30,000,000 numbers whose values range from 0 to 255.

Right now I'm sending them as floats, since I want my final product to be a float, which ends up being quite costly considering they could fit into a byte.

Is there a way to send them as bytes and cast them to floats while using the cuBLAS library or any other fast math library? Or to tell the GPU to interpret them as floats somehow?


Solution

You could cudaMemcpy an array of unsigned char from host to device, and also allocate an array of float on the device using cudaMalloc. Then write a custom kernel that copies from the byte array to the float array:

__global__ void byteToFloat(float *out, const unsigned char *in, int n)
{
    // Grid-stride loop: each thread converts elements i, i + stride, ...
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    for (; i < n; i += gridDim.x * blockDim.x)
        out[i] = in[i];   // implicit unsigned char -> float conversion
}
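
For completeness, the host side allocates both buffers, copies the raw bytes across, and launches the kernel before handing the float array to cuBLAS. A minimal sketch, with error checking omitted; names such as uploadAsBytes and h_bytes are illustrative assumptions, not part of the original answer:

#include <cuda_runtime.h>

// Copies n bytes to the device and expands them into a freshly
// allocated float array, which is returned via d_floats_out.
void uploadAsBytes(const unsigned char *h_bytes, float **d_floats_out, int n)
{
    unsigned char *d_bytes = nullptr;
    float *d_floats = nullptr;

    cudaMalloc(&d_bytes, n * sizeof(unsigned char));  // staging buffer for the raw bytes
    cudaMalloc(&d_floats, n * sizeof(float));         // destination later used by cuBLAS

    // Transfers n bytes instead of n * sizeof(float) bytes
    cudaMemcpy(d_bytes, h_bytes, n * sizeof(unsigned char), cudaMemcpyHostToDevice);

    // Expand to float on the device
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    byteToFloat<<<blocks, threads>>>(d_floats, d_bytes, n);

    cudaFree(d_bytes);         // the byte staging buffer is no longer needed
    *d_floats_out = d_floats;  // pass this pointer to the cuBLAS call
}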

If your data on the host is already stored as floats, then this might be slower than copying the floats. Try it and see. But if your array is already of unsigned char type, then you will need to do this conversion somewhere anyway, so the above is likely to be efficient.

Note that for best performance you should probably try to overlap copy and compute if possible (but that's outside the scope of this question: see the CUDA Best Practices Guide and Programming Guide for information on cudaMemcpyAsync).
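
As a rough illustration of that overlap, here is a sketch only: the stream count, the chunking, and names like uploadOverlapped are assumptions, and the host buffer must be allocated with cudaMallocHost (pinned memory) for the asynchronous copies to actually overlap with kernel execution.

#include <cuda_runtime.h>

// Splits the upload into chunks and issues each chunk's copy and
// conversion on its own stream, so the copy of one chunk can overlap
// with the byteToFloat conversion of an earlier chunk.
void uploadOverlapped(const unsigned char *h_pinned,
                      unsigned char *d_bytes, float *d_floats, int n)
{
    const int nStreams = 4;                       // assumed chunk/stream count
    const int chunk = (n + nStreams - 1) / nStreams;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        if (offset >= n) break;
        int count = (offset + chunk <= n) ? chunk : (n - offset);

        // Copy this chunk asynchronously...
        cudaMemcpyAsync(d_bytes + offset, h_pinned + offset,
                        count * sizeof(unsigned char),
                        cudaMemcpyHostToDevice, streams[s]);

        // ...and convert it in the same stream, so later copies can
        // overlap with the conversion of earlier chunks.
        byteToFloat<<<(count + 255) / 256, 256, 0, streams[s]>>>(
            d_floats + offset, d_bytes + offset, count);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
}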

Licensed under: CC-BY-SA with attribution