Pregunta

What is the meaning of declaring register arrays in CUDA with volatile qualifier?

When I tried with volatile keyword with a register array, it removed the number of spilled register memory to local memory. (i.e. Force the CUDA to use registers instead of local memory) Is this the intended behavior?

I did not find any information about the usage of volatile with regard to register arrays in CUDA documentation.

Here is the ptxas -v output for both versions

With volatile qualifier

    __volatile__ float array[32];

ptxas -v output

ptxas info    : Compiling entry function '_Z2swPcS_PfiiiiS0_' for 'sm_20'
ptxas info    : Function properties for _Z2swPcS_PfiiiiS0_
88 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 47 registers, 16640 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]

Without volatile qualifier

    float array[32];

ptxas -v output

ptxas info    : Compiling entry function '_Z2swPcS_PfiiiiS0_' for 'sm_20'
ptxas info    : Function properties for _Z2swPcS_PfiiiiS0_
96 bytes stack frame, 100 bytes spill stores, 108 bytes spill loads
ptxas info    : Used 51 registers, 16640 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]
¿Fue útil?

Solución

The volatile qualifier specifies to the compiler that all references to a variable (read or write) should result in a memory reference and those references must be in the order specified in the program. The use of the volatile qualifier is illustrated in Chapter 12 of the Shane Cook book, "CUDA Programming".

The use of volatile will avoid some optimizations the compiler can do and so change the number of used registers used. The best way to understand what volatile is actually doing is to disassemble the relevant __global__ function with and without the qualifier.

Consider indeed the following kernel functions

__global__ void volatile_test() {

   volatile float a[3];

   for (int i=0; i<3; i++) a[i] = (float)i;
}

__global__ void no_volatile_test() {

   float a[3];

   for (int i=0; i<3; i++) a[i] = (float)i;
}

Disassembling the above kernel functions one obtains

code for sm_20
      Function : _Z16no_volatile_testv
.headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)" 
/*0000*/        MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/        EXIT ;                 /* 0x8000000000001de7 */


      Function : _Z13volatile_testv
.headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/        MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */   
/*0008*/        ISUB R1, R1, 0x10;     /* 0x4800c00040105d03 */   R1 = address of a[0]
/*0010*/        MOV32I R2, 0x3f800000; /* 0x18fe000000009de2 */   R2 = 1
/*0018*/        MOV32I R0, 0x40000000; /* 0x1900000000001de2 */   R0 = 2
/*0020*/        STL [R1], RZ;          /* 0xc8000000001fdc85 */
/*0028*/        STL [R1+0x4], R2;      /* 0xc800000010109c85 */   a[0] = 0;
/*0030*/        STL [R1+0x8], R0;      /* 0xc800000020101c85 */   a[1] = R2 = 1;
/*0038*/        EXIT ;                 /* 0x8000000000001de7 */   a[2] = R0 = 2;

As you can see, when NOT using the volatile keyword, the compiler realizes that a is set but never used (indeed, the compiler returns the following warning: variable "a" was set but never used) and there is practically no disassembled code.

Opposite to that, when using the volatile keyword, all references to a are translated to memory references (write in this case).

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top