Question

I'm trying to compile a block of example code that uses Thrust, as a way to learn some CUDA.

I'm using Visual Studio 2010, and I've gotten other examples to compile. However, when I compile this example, it takes upwards of 10 minutes. I've selectively commented out lines and figured out that it's the thrust::sort line that takes forever (with that one line commented out, it compiles in about 5 seconds).

I found a post somewhere that mentioned sort being slow to compile in Thrust, and that this was a deliberate trade-off made by the Thrust development team (it's 3x faster at runtime, but takes longer to compile). But that post was from late 2008.

Any idea why this is taking so long?

Also, I'm compiling on a machine with the following specs, so it's not a slow machine:

i7-2600K @ 4.5 GHz
16 GB DDR3 @ 1833 MHz
RAID 0 of 6 GB/s 1 TB drives

As requested, this is the build command that Visual Studio appears to be invoking:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -G0 --keep-dir "Debug\" -maxrregcount=32 --machine 64 --compile -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MTd " -o "Debug\kernel.obj" "C:\Users\Rob\Desktop\VS2010Test\VS2010Test\VS2010Test\kernel.cpp" -clean

Example

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
int main(void)
{
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;
    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());
    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}
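
For completeness, the commenting-out experiment can also be driven from the compile line by guarding the sort call with a macro. The following is just a sketch of the same example; SKIP_SORT is a made-up name for this post, not anything defined by Thrust or nvcc:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>
int main(void)
{
    // same setup as above: 16M random ints generated on the host and copied to the device
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    thrust::device_vector<int> d_vec = h_vec;
#ifndef SKIP_SORT
    // building with -DSKIP_SORT leaves this instantiation out entirely
    thrust::sort(d_vec.begin(), d_vec.end());
#endif
    return 0;
}

Comparing compile times with and without -DSKIP_SORT should show the same gap described above, without having to edit the source each time.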

Solution

The compiler in CUDA 3.2 was not optimized for compiling long, complex programs such as sort in debug mode (i.e. nvcc -G0). You will find that CUDA 4.0 is much faster in this case. Removing the -G0 option should also cut compilation time by a significant fraction.
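
For reference, this is the build string from the question with the -G0 option removed (the trailing -clean is dropped as well, since that switch only deletes outputs rather than compiling anything). Every path and remaining flag is copied verbatim from the original invocation, so treat this as an illustration of the suggestion rather than a recommended set of flags:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" --keep-dir "Debug\" -maxrregcount=32 --machine 64 --compile -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MTd " -o "Debug\kernel.obj" "C:\Users\Rob\Desktop\VS2010Test\VS2010Test\VS2010Test\kernel.cpp"

In Visual Studio this change is typically made through the project's CUDA build-rule properties rather than by hand-editing the command line.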

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow