Question

Does anybody know why device vector allocation takes so long on the first run when the program is compiled in Debug mode? In my particular case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC 2010), the first run of the Debug build takes over 40 seconds, while subsequent runs (with no recompilation) take about 10 times less; for the Release build, vector allocation on the device takes just over 1 second.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <cstdio>   // printf
#include <ctime>    // clock
#include <iostream> // std::cin

int main(void) {
    clock_t t; 

    t = clock();
    thrust::host_vector<int> h_vec(100);
    clock_t dt = clock() - t;
    printf ("allocation on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    dt = clock() - t;
    printf ("initialization on host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::device_vector<int> d_vec(100); // First run of the Debug build takes over 40 seconds here...
    dt = clock() - t;
    printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    d_vec[0] = h_vec[0];
    dt = clock() - t;
    printf ("copy one to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    d_vec = h_vec;
    dt = clock() - t;
    printf ("copy all to device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::sort(d_vec.begin(), d_vec.end());
    dt = clock() - t;
    printf ("sort on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    dt = clock() - t;
    printf ("copy to host - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    t = clock();
    for(int i=0; i<10; i++)
        printf("%d\n", h_vec[i]);
    dt = clock() - t;
    printf ("output - %f sec.\n", (float)dt/CLOCKS_PER_SEC);

    std::cin.ignore();
    return 0;
}

Solution

Most of the time you are measuring for the first vector instantiation isn't the cost of the vector allocation and initialisation; it is the overhead associated with the CUDA runtime and driver. I would guess that if you changed your code to something like this:

int main(void) {
    clock_t t; 

    ....

    cudaFree(0); // forces context establishment and absorbs the lazy runtime initialisation overhead

    t = clock();
    thrust::device_vector<int> d_vec(100); // First run of the Debug build takes over 40 seconds here...
    dt = clock() - t;
    printf ("allocation on device - %f sec.\n", (float)dt/CLOCKS_PER_SEC);


    .....

You should see that the vector allocation time you measure becomes the same between the first and second runs, even though the wall-clock time of the whole program still shows a big difference.

I don't have a good explanation for why there is such a large difference in startup time between the first and second runs, but if I were to hazard a guess, it is that some driver-level JIT recompilation is being performed on the first run, and the driver caches the resulting code for subsequent runs. One thing to check is that you are compiling code for the correct architecture for your GPU; that would eliminate driver recompilation as a source of the time difference.
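As a quick way to see which architecture to target, something like the following minimal sketch (using the CUDA runtime API; device 0 is assumed) prints the device's compute capability, which should match the compute_XX,sm_XX values you build for:

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    // Query properties of device 0; a Quadro 3000M should report 2.1,
    // so building for compute_20/sm_20 avoids a driver JIT step.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}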

The nvprof utility can provide you with an API trace and timings. You might want to run it and see where in the API call sequence the time difference arises. It isn't beyond the realm of possibility that you are seeing the effects of some sort of driver bug, but without more information it is impossible to say.
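For example (the executable name below is just a placeholder), an API trace can be obtained with:

nvprof --print-api-trace ThrustTest.exe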

OTHER TIPS

It looks like in my case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC 2010) the problem is solved by changing the project's CUDA C/C++ / Code Generation option from compute_10,sm_10 to compute_20,sm_20, which targets the newer GPU architecture. So I've got my happiness for today :)
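For reference, that project setting roughly corresponds to passing architecture flags like these when invoking nvcc directly (the file names here are placeholders):

nvcc -gencode arch=compute_20,code=sm_20 -o thrust_test thrust_test.cu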

Licensed under: CC-BY-SA with attribution