Question

I would need to test some GPUs by testing their throughput in different scenarios.

This includes a simple 64b addition (a 32-bit add with carry):

__device__ void add(unsigned int *data, bool flag){
    unsigned int index = threadIdx.x;
    unsigned int result;

    asm ("{\n\t"
         "add.cc.u32   %1, %1, %1;\n\t"
         "addc.u32     %0, 0, 0;\n\t"
         "}"
         : "=r"(result), "+r"(index));

    if(flag)
        data[threadIdx.x] = result;
}

64b modulo:

__device__ void mod(){
    unsigned int index = threadIdx.x;
    asm ("{\n\t"
         ".reg .u64 t1;\n\t"
         "cvt.u64.u32   t1, %0;\n\t"
         "rem.u64       t1, t1, t1;\n\t"
         "}"
         : : "r"(index));
}

and 64b mul+mod:

__device__ void mulmod(){
    unsigned int index = threadIdx.x;
    asm ("{\n\t"
         ".reg .u64 t1;\n\t"
         ".reg .u64 t2;\n\t"
         "mul.wide.u32  t1, %0, %0;\n\t"
         "cvt.u64.u32   t2, %0;\n\t"
         "rem.u64       t1, t1, t2;\n\t"
         "}"
         : : "r"(index));
}

I think that any memory access would be totally useless for my intent, so I would like to use the thread-indexing variables as inputs instead.

And since I am not writing the results anywhere, I don't need to care about register usage and can launch as many threads as possible (as allowed by each GPU).

I would like to know:

  • if this is the proper way to do it

  • is there any particular thread configuration, beyond maximizing the thread count, that yields the best throughput?


Solution

The answer to your first "sub question" is no, this isn't the proper way to do it, because none of the functions you have written will actually be emitted by the compiler.

You can see more details in my answer to the question I linked above, but the short version is that compiler-level dead-code elimination will remove any code that does not contribute to a value written to memory. So your functions must return a value, and you must use that return value in such a way that the compiler cannot deduce that the call to your device function is redundant and eliminate it.
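One common pattern (a sketch, not the only way) is to route the result through a store that is guarded by a kernel parameter the compiler cannot evaluate at compile time. The kernel and function names below are illustrative, and the unroll count is arbitrary:

```cuda
// Device function returning the carry of a doubled 32-bit value,
// i.e. the high word of a 64-bit addition x + x.
__device__ unsigned int add64(unsigned int x)
{
    unsigned int hi;
    asm("{\n\t"
        "add.cc.u32  %1, %1, %1;\n\t"
        "addc.u32    %0, 0, 0;\n\t"
        "}"
        : "=r"(hi), "+r"(x));
    return hi;
}

// Benchmark kernel: `flag` is passed as false at launch time, so the
// store never executes, but the compiler cannot prove that and must
// keep the arithmetic that produces `v`.
__global__ void bench(unsigned int *data, bool flag)
{
    unsigned int v = threadIdx.x;
    #pragma unroll 64               // repeat to amortise per-thread overhead
    for (int i = 0; i < 64; ++i)
        v = add64(v);
    if (flag)                       // false at run time
        data[threadIdx.x] = v;      // keeps the computation "live"
}
```

Note that chaining `v = add64(v)` makes each iteration depend on the previous one, which measures latency; for pure throughput you would keep several independent chains per thread.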

Beyond that, you need enough active warps per SM to amortise the instruction-scheduling latency of the architecture, so that you measure the instruction throughput of your device functions rather than the latency of the instruction scheduler and pipeline.
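As a host-side sketch of how such a kernel might be timed (assuming a `__global__` benchmark kernel hypothetically named `bench` with an unused-store flag; the grid and block sizes are placeholders you would tune per GPU):

```cuda
// Timing sketch using CUDA events. Launch many blocks of a reasonable
// size (e.g. 256 threads) so every SM carries multiple resident warps.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
bench<<<1024, 256>>>(d_data, false);   // flag = false: store never runs
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// throughput ~ (total threads * iterations per thread) / (ms * 1e-3) ops/s
```

Running several launch configurations and keeping the best result is a common way to find the occupancy sweet spot empirically.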

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow