Question

First of all, I'm new to CUDA and trying to learn, so maybe I'm doing something wrong. I wanted to compare CUDA performance against the equivalent function implemented with Intel intrinsics, expecting that CUDA would yield a better result.

To my surprise, though, that's not what I'm seeing. My function is extremely simple: I just add two vectors and store the result in a third one. My CUDA code is as basic as it gets; in the setup function I have:

void cudaAddVectors(float* vectorA, float* vectorB, float* sum, int numElements)
{
    //
    // Allocate the memory on the device
    //
    float* dvA;
    float* dvB;
    float* dvC;

    cudaMalloc((void**)&dvA, numElements * sizeof(float));
    cudaMalloc((void**)&dvB, numElements * sizeof(float));
    cudaMalloc((void**)&dvC, numElements * sizeof(float));

    //
    // Copy the host vectors to device vectors
    //
    cudaMemcpy(dvA, vectorA, numElements * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dvB, vectorB, numElements * sizeof(float), cudaMemcpyHostToDevice);

    //
    // Perform the sum on the device and time it
    //
    deviceSumLink(dvA, dvB, dvC, numElements);

    //
    // Now get the results back to the host
    //
    cudaMemcpy(sum, dvC, numElements * sizeof(float), cudaMemcpyDeviceToHost);

    // Cleanup and go home
    cudaFree(dvA);
    cudaFree(dvB);
    cudaFree(dvC);
}
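
These runtime calls all return a cudaError_t; a minimal sketch of checking them with the standard cudaGetErrorString helper (the checkCuda name here is just illustrative, not part of the code above):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Sketch: abort with a readable message if a CUDA runtime call fails.
static void checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess)
    {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// e.g. checkCuda(cudaMalloc((void**)&dvA, numElements * sizeof(float)), "cudaMalloc dvA");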

Then the device code is run with either blocks or threads, like so:

void deviceSumLink(float* a, float* b, float* c, int numElements)
{
    //deviceSum<<<numElements, 1>>>(a,b,c);
    deviceSumThreads<<<1, numElements>>>(a,b,c);
}

And the actual code running on the device:

__global__ void deviceSum(float* a, float* b, float* c)
{
    int index = blockIdx.x;
    c[index] = a[index] + b[index];
}

or

__global__ void deviceSumThreads(float* a, float* b, float* c)
{
    int index = threadIdx.x;
    c[index] = a[index] + b[index];
}

I timed both the Intel version of this and the CUDA version, summing vectors of different sizes and verifying that both produced correct results. For the CUDA calls I'm timing only the deviceSumLink call, not the memory setup and everything else, but regardless of how the kernels are invoked, the Intel intrinsics version (using 8-element arrays) just blows the CUDA version out of the water. Basically, the Intel SIMD version of the function is something like 10x faster!
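
A kernel-only measurement like that can be done with CUDA events, for example (just a sketch; the timeDeviceSum helper name is illustrative, and the cudaEventSynchronize is needed because kernel launches return to the host immediately):

#include <cuda_runtime.h>

void deviceSumLink(float* a, float* b, float* c, int numElements);  // wrapper shown below

// Sketch: time only the kernel with CUDA events, so the measurement covers the
// kernel's actual execution on the GPU and not just the launch overhead.
float timeDeviceSum(float* dvA, float* dvB, float* dvC, int numElements)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    deviceSumLink(dvA, dvB, dvC, numElements);   // the call being measured
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}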

I did not expect this, so I attribute it to being a complete newbie in CUDA. What am I doing wrong? I thought CUDA was supposed to be much faster at this kind of thing; I must not be using it right or something.

If you have some insight, I'd appreciate the comments!

Thx!


Solution

Using only 1 block or only 1 thread per block to add vectors won't fully utilize the GPU, and neither approach works for large vectors because of the limits on the number of threads per block and the number of blocks per grid.

To correctly add two large vectors and get maximum performance, you need a kernel like this:

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

and invoke it with the following threads/blocks settings:

int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
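
Applied to the code in the question, deviceSumLink would then just compute the grid size and launch the vectorAdd kernel above; a sketch:

void deviceSumLink(float* a, float* b, float* c, int numElements)
{
    // One thread per element, rounded up to whole blocks of 256 threads.
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, numElements);
}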

Please refer to this CUDA sample for more details.

http://docs.nvidia.com/cuda/cuda-samples/#vector-addition
