Question

I am just beginning to play with CUDA, so I tried out a textbook vector addition code. However, when I configure the kernel launch to add only the first half of the vector, the second half also gets added! This behavior stops when I include a Thrust library header.

I am totally confused. Please see the code below:

#include <iostream>
#include <cstdlib>   // malloc/free
using namespace std;

__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC)
{
    //printf("gridDim.x is %d \n",gridDim.x);
    int tid = blockIdx.x * blockDim.x + threadIdx.x;    
//  printf("tid is %d \n",tid);
    d_resultC[tid] = d_dataA[tid] + d_dataB[tid];
}

int main() 
{
    const int ARRAY_SIZE = 8*1024;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    float *h_dataA, *h_dataB, *h_resultC;
    float *d_dataA, *d_dataB, *d_resultC;

    h_dataA     = (float *)malloc(ARRAY_BYTES);
    h_dataB     = (float *)malloc(ARRAY_BYTES);
    h_resultC   = (float *)malloc(ARRAY_BYTES);

    for(int i=0; i<ARRAY_SIZE;i++){
        h_dataA[i]=i+1;
        h_dataB[i]=2*(i+1);
    };

    cudaMalloc((void **)&d_dataA,ARRAY_BYTES);
    cudaMalloc((void **)&d_dataB,ARRAY_BYTES);
    cudaMalloc((void **)&d_resultC,ARRAY_BYTES);

    cudaMemcpy(d_dataA, h_dataA,ARRAY_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dataB, h_dataB,ARRAY_BYTES, cudaMemcpyHostToDevice);

    cout << h_resultC[0] << endl;
    cout << h_resultC[ARRAY_SIZE-1] << endl;

    dim3 dimBlock(ARRAY_SIZE/8,1,1);
    dim3 dimGrid(1,1,1);

    VecAdd<<<dimGrid,dimBlock>>>(d_dataA, d_dataB, d_resultC);

    cout << h_resultC[0] << endl;
    cout << h_resultC[ARRAY_SIZE-1] << endl;

    cudaMemcpy(h_resultC, d_resultC, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    cout << h_resultC[0] << endl;
    cout << h_resultC[ARRAY_SIZE-1] << endl;

    return 0;
}

Solution

Have you launched it first with ARRAY_SIZE threads and then with half of them (or an eighth)?

You are not initializing d_resultC, so it probably still contains the results of previous executions. That would explain the behavior, though it may not be the actual cause.

Add a cudaMemset on d_resultC and tell us what happens.
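
For illustration, here is a minimal sketch of that check, reusing the variable names, sizes, and launch configuration from the question:

// Zero the result buffer so stale data from a previous run
// cannot show up as "already added" values.
cudaMemset(d_resultC, 0, ARRAY_BYTES);

VecAdd<<<dimGrid, dimBlock>>>(d_dataA, d_dataB, d_resultC);

// cudaMemcpy on the default stream waits for the kernel to finish.
cudaMemcpy(h_resultC, d_resultC, ARRAY_BYTES, cudaMemcpyDeviceToHost);

// Elements the kernel never touched should now print as 0.
cout << h_resultC[0] << endl;
cout << h_resultC[ARRAY_SIZE - 1] << endl;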

OTHER TIPS

I can't say for sure why your kernel is processing more elements than expected. It processes one element per thread, so the number of elements processed should definitely be blockDim.x*gridDim.x.

I do want to point out, though, that it is good practice to write kernels that use "grid-stride loops" so they aren't so dependent on the block and thread counts. The performance cost is negligible, and since the best blocking parameters differ from one GPU to another, this keeps the launch configuration flexible if you are performance-sensitive.

http://cudahandbook.to/15QbFWx

So you should add a count parameter (the number of elements to process), then write something like:

__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC, int N)
{
    // Grid-stride loop: each thread starts at its global index and
    // strides by the total number of threads in the grid until N is covered.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < N;
             i += blockDim.x * gridDim.x) {
        d_resultC[i] = d_dataA[i] + d_dataB[i];
    }
}
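
With that kernel, adding only the first half of the vector is just a matter of passing the right count. A sketch using the sizes from the question (the block size of 256 is an arbitrary but reasonable choice):

const int N = ARRAY_SIZE / 2;                          // process only the first half
dim3 dimBlock(256, 1, 1);
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x, 1, 1); // enough blocks to cover N

VecAdd<<<dimGrid, dimBlock>>>(d_dataA, d_dataB, d_resultC, N);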

As others mentioned above, this may be caused by data left over from your previous run. Not freeing the memory you allocated may be the reason for this odd situation. I think you should free the allocated arrays on the host using free and also free the memory on the GPU using cudaFree, as sketched below.
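
A sketch of the cleanup the program is missing, using the same variable names as in the question:

// Release device allocations
cudaFree(d_dataA);
cudaFree(d_dataB);
cudaFree(d_resultC);

// Release host allocations
free(h_dataA);
free(h_dataB);
free(h_resultC);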

I also strongly recommend allocating the host memory with cudaMallocHost instead of malloc, and freeing it at the end of the program with cudaFreeHost. Pinned (page-locked) host memory gives you faster copies. See here: cudaMallocHost
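
As a sketch, the malloc calls in the question would then become something like this (assuming ARRAY_BYTES is defined as above):

float *h_dataA, *h_dataB, *h_resultC;

// Pinned (page-locked) host allocations allow faster DMA transfers
cudaMallocHost((void **)&h_dataA, ARRAY_BYTES);
cudaMallocHost((void **)&h_dataB, ARRAY_BYTES);
cudaMallocHost((void **)&h_resultC, ARRAY_BYTES);

// ... use the buffers exactly like the malloc'd ones ...

cudaFreeHost(h_dataA);
cudaFreeHost(h_dataB);
cudaFreeHost(h_resultC);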

In any case, don't forget to free heap memory in a C/C++ program, whether it uses CUDA or not.

Licensed under: CC-BY-SA with attribution