Problem

I'm trying to measure the execution times of the parallel and sequential versions correctly, but I am in doubt about the following.

Suppose we have the following code:

    //get the time
    clock_t start,finish;
    double totaltime;
    start = clock(); 

    double *d_A, *d_B, *d_X;

    cudaMalloc((void**)&d_A, sizeof(double) * Width * Width);
    cudaMalloc((void**)&d_B, sizeof(double) * Width);
    cudaMalloc((void**)&d_X, sizeof(double) * Width);

    cudaMemcpy(d_A, A, sizeof(double) * Width * Width, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, sizeof(double) * Width, cudaMemcpyHostToDevice);  


    do_parallel_matmul<<<dimB, dimT>>>(d_A, d_B, d_X, Width);   
    

    cudaMemcpy(X, d_X, sizeof(double) * Width, cudaMemcpyDeviceToHost);

    finish = clock();
    
    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;   

    printf("%f", totaltime);

This time is much longer than the sequential time, measured as follows:

    clock_t start,finish;
    double totaltime;
    start = clock();

    do_seq_matmult();

    finish = clock();

    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;

    printf("%f", totaltime);

So I don't know if I should only measure the CUDA kernel time as follows:

    clock_t start,finish;
    double totaltime;
    start = clock();

    do_parallel_matmul<<<dimB, dimT>>>(d_A, d_B, d_X, Width);

    finish = clock();

    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;

    printf("%f", totaltime);

and avoid memory copies between host and device...

I'm asking the above because I have to submit a comparison between parallel and sequential executions... but if I include the memory copies in the CUDA timing, there isn't much of a difference between CUDA and C...

EDIT:

    void do_seq_matmult(const double *A, const double *X, double *resul, const int tam)
    {
        *resul = 0;
        for(int i = 0; i < tam; i++)
        {
            for(int j = 0; j < tam; j++)
            {
                if(i != j)
                    *resul += A[i * tam + j] * X[j];
            }
        }
    }

    __global__ void do_parallel_matmul( double * mat_A,
                                        double * vec,
                                        double * rst,
                                        int dim)
    {
        int rowIdx = threadIdx.x + blockIdx.x * blockDim.x; // Get the row index
        int aIdx;
        while( rowIdx < dim)
        {
            rst[rowIdx] = 0; // clean the value at first
            for (int i = 0; i < dim; i++)
            {
                aIdx = rowIdx * dim + i; // Get the index for the element a_{rowIdx, i}
                rst[rowIdx] += (mat_A[aIdx] * vec[i]); // do the multiplication
            }
            rowIdx += gridDim.x * blockDim.x;
        }
        __syncthreads();
    }

Solution

Some thoughts:

  1. It is not fair to time device memory allocation on the GPU side while not timing the corresponding host memory allocation on the CPU side.

  2. If cudaMalloc((void**)&d_A, sizeof(double) * Width * Width); is the first CUDA call, it will also include the CUDA context creation, which can be a significant overhead (the cudaEvent sketch after this list warms the context up before timing).

  3. Timing cudaMemcpy is not a fair CPU/GPU comparison, because this time depends on the PCI-e bandwidth of the system. On the other hand, if you see the kernel as an accelerator from the CPU's point of view, then you do need to include the memcpy. To reach peak PCI-e bandwidth, use page-locked (pinned) memory, as in the streams sketch after this list.

  4. If your application is going to run the multiplication several times, then you can hide most of the memcpy cost by overlapping the copies with kernel execution (also shown in the streams sketch). This works even better on a Tesla unit, which has dual DMA engines.

  5. Timing the kernel itself requires you to synchronize the CPU with the GPU before stopping the timer; otherwise you only time the kernel launch, not the execution, because launching a kernel from the CPU is asynchronous. If you want to time kernel execution on the GPU, use cudaEvents, as in the cudaEvent sketch after this list.

  6. Run many threads on the GPU to get a fair comparison; with too few threads the GPU is not saturated.

  7. Improve the kernel; you can do better.
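
A minimal cudaEvent sketch for point 5 (and for keeping point 2's context creation out of the timed region), assuming the kernel, device buffers and launch configuration from the question are already set up:

    #include <stdio.h>
    #include <cuda_runtime.h>

    void time_kernel(double *d_A, double *d_B, double *d_X, int Width,
                     dim3 dimB, dim3 dimT)
    {
        cudaFree(0); // warm-up: force context creation before timing starts

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);                // recorded on the GPU timeline
        do_parallel_matmul<<<dimB, dimT>>>(d_A, d_B, d_X, Width);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);               // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }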
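
For points 3 and 4, a streams sketch of page-locked host memory combined with two streams, so that uploading one half of A overlaps with the kernel working on the other half. do_parallel_matmul_rows is a hypothetical row-restricted variant of the question's kernel, included only to make the sketch self-contained:

    #include <string.h>
    #include <cuda_runtime.h>

    // hypothetical variant of the question's kernel that processes rows
    // [row0, row0 + nRows) only, so each chunk can start as soon as its
    // part of the matrix has arrived on the device
    __global__ void do_parallel_matmul_rows(const double *mat_A, const double *vec,
                                            double *rst, int dim, int row0, int nRows)
    {
        int row = row0 + threadIdx.x + blockIdx.x * blockDim.x;
        if (row < row0 + nRows)
        {
            double sum = 0.0;
            for (int i = 0; i < dim; i++)
                sum += mat_A[row * dim + i] * vec[i];
            rst[row] = sum;
        }
    }

    void overlapped_matmul(const double *A, const double *B, double *X, int Width)
    {
        // page-locked host buffer: reaches peak PCI-e bandwidth and is required
        // for cudaMemcpyAsync to overlap with kernel execution (point 3)
        double *h_A;
        cudaMallocHost((void**)&h_A, sizeof(double) * Width * Width);
        memcpy(h_A, A, sizeof(double) * Width * Width);

        double *d_A, *d_B, *d_X;
        cudaMalloc((void**)&d_A, sizeof(double) * Width * Width);
        cudaMalloc((void**)&d_B, sizeof(double) * Width);
        cudaMalloc((void**)&d_X, sizeof(double) * Width);
        cudaMemcpy(d_B, B, sizeof(double) * Width, cudaMemcpyHostToDevice);

        cudaStream_t streams[2];
        cudaStreamCreate(&streams[0]);
        cudaStreamCreate(&streams[1]);

        int half = Width / 2;
        for (int i = 0; i < 2; i++)
        {
            int row0  = i * half;
            int nRows = (i == 0) ? half : Width - half;
            // the copy queued on stream 1 can overlap the kernel on stream 0 (point 4)
            cudaMemcpyAsync(d_A + (size_t)row0 * Width, h_A + (size_t)row0 * Width,
                            sizeof(double) * nRows * Width,
                            cudaMemcpyHostToDevice, streams[i]);
            do_parallel_matmul_rows<<<(nRows + 255) / 256, 256, 0, streams[i]>>>(
                d_A, d_B, d_X, Width, row0, nRows);
        }
        cudaDeviceSynchronize();
        cudaMemcpy(X, d_X, sizeof(double) * Width, cudaMemcpyDeviceToHost);

        cudaFreeHost(h_A);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_X);
        cudaStreamDestroy(streams[0]);
        cudaStreamDestroy(streams[1]);
    }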

Other tips

You are using the wrong function for your measurements: clock() measures the time your process has spent on the CPU, not the wall-clock time.

Take a look at the High Precision Timer lib; it uses OS-specific timing functions to measure time.

It uses a set of functions which can give you microsecond precision.

If you're on Windows, use QueryPerformanceFrequency and QueryPerformanceCounter; on Linux, use gettimeofday() (see the sketch below).

It's very lightweight and easy to use, and it is available for both Windows and Linux.
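
A minimal wall-clock timing sketch for Linux with gettimeofday(), assuming nothing beyond the standard headers; it measures elapsed real time rather than the CPU time that clock() reports:

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval start, finish;
        gettimeofday(&start, NULL);

        /* ... the work to be timed goes here ... */

        gettimeofday(&finish, NULL);

        /* elapsed wall-clock seconds, with microsecond resolution */
        double seconds = (double)(finish.tv_sec - start.tv_sec)
                       + (double)(finish.tv_usec - start.tv_usec) / 1e6;
        printf("%f\n", seconds);
        return 0;
    }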
