nvcc - different block sizes depending on arch at compile time

https://stackoverflow.com/questions/19847756

29-07-2022

Question

I have a kernel which shows highest performance for different block sizes when running on Kepler and Fermi hardware. I would like, at compile-time, to check the current architecture target and define a THREADS_PER_BLOCK macro to i) launch the kernel with; ii) determine the number of blocks necessary; iii) statically set the shared memory size in the kernel.

The code below demonstrates what I am attempting to do. Suppose I am targeting GK104 hardware, and hence use nvcc -arch=sm_30. This will still result in THREADS_PER_BLOCK = 256 in the host code, since __CUDA_ARCH__ is not defined for the host code compilation. (I understand, from e.g. this answer, why it can't work this way.)

#if __CUDA_ARCH__ >= 300
#define THREADS_PER_BLOCK 512
#else
#define THREADS_PER_BLOCK 256
#endif

__global__ void some_kernel(int* a, int* b) {
    __shared__ int sm_data[THREADS_PER_BLOCK];
    // Do something.
}

int main(void) {
    // Initialize data.
    // Calculate blocks based on THREADS_PER_BLOCK, problem size and some max.
    some_kernel<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b);
    return 0;
}

I could check device properties at run-time and use dynamic shared memory, but would like to know if this can be hard-coded at compile-time without e.g. having to manually add a -DFERMI or -DKEPLER and setting THREADS_PER_BLOCK based on that. NB: Any users of this code will be compiling it themselves, almost certainly for one architecture, so this isn't an unreasonable option. It just seems superfluous in light of passing the -arch= flag.

Solution

The nvcc compiler does not detect locally available GPUs; unless told otherwise, it always targets the same default architecture (SM 1.0 at the time of writing). If it did detect the local GPU, building on different systems could produce quite confusing behavior.

To compile for the available device, you either need to ask the user to specify the SM version or run some detection code at build time. I'm convinced that it is easier to put the hardware detection code into the runtime and then configure your kernel launch as desired.
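
One way to sketch the runtime approach while keeping the shared memory statically sized is to make the block size a template parameter and pick the instantiation after querying the device. This is only an illustration, not the asker's actual code: the helper names (pick_threads_per_block, blocks_for, launch, launch_for_device), the extra n parameter, and the placeholder kernel body are all assumptions, and the 512/256 split is simply the one from the question.

```cuda
#include <cuda_runtime.h>

// Block size is a template parameter, so the shared array is still
// sized at compile time for each instantiation.
template <int THREADS>
__global__ void some_kernel(const int* a, int* b, int n) {
    __shared__ int sm_data[THREADS];
    int i = blockIdx.x * THREADS + threadIdx.x;
    // Placeholder work: stage the input through shared memory.
    sm_data[threadIdx.x] = (i < n) ? a[i] : 0;
    __syncthreads();
    if (i < n) b[i] = sm_data[threadIdx.x];
}

// Hypothetical policy from the question: 512 threads per block on
// compute capability 3.x and newer (Kepler+), 256 otherwise (Fermi).
int pick_threads_per_block(int sm_major) {
    return sm_major >= 3 ? 512 : 256;
}

// Number of blocks needed to cover n elements (rounds up).
int blocks_for(int n, int threads) {
    return (n + threads - 1) / threads;
}

// Launch one specific instantiation and report any launch error.
template <int THREADS>
cudaError_t launch(const int* d_a, int* d_b, int n) {
    some_kernel<THREADS><<<blocks_for(n, THREADS), THREADS>>>(d_a, d_b, n);
    return cudaGetLastError();
}

// Query the device at run time and dispatch to the matching instantiation.
cudaError_t launch_for_device(int device, const int* d_a, int* d_b, int n) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;
    switch (pick_threads_per_block(prop.major)) {
        case 512: return launch<512>(d_a, d_b, n);
        default:  return launch<256>(d_a, d_b, n);
    }
}
```

With this layout, host code would call something like launch_for_device(0, d_a, d_b, n) in place of the direct kernel launch. Both instantiations are compiled regardless of the -arch flag, so the cost is a slightly larger binary in exchange for not needing a -DFERMI or -DKEPLER style switch.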

Licensed under: CC-BY-SA with attribution
