CUDA stack frame size increase by forceinline

Question

The main reason to reduce the stack frame is that the stack is allocated in local memory which resides in off-chip device memory. This makes the access to the stack (if not cached) slow.

To show this, let me make a simple example. Consider the case:

__device__ __noinline__ void func(float* d_a, float* test, int tid) {
    d_a[tid]=test[tid]*d_a[tid];
}

__global__ void kernel_function(float* d_a) {
    float test[16];
    test[threadIdx.x] = threadIdx.x;
    func(d_a,test,threadIdx.x);
}

Note that the __device__ function is declared __noinline__. In this case

ptxas : info : Function properties for _Z15kernel_functionPf
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 36 bytes cmem[0]

i.e., we have 64 bytes of stack frame. The corresponding disassembled code is

MOV R1, c[0x1][0x100];
ISUB R1, R1, 0x40;
S2R R6, SR_TID.X;                    R6 = ThreadIdx.x
MOV R4, c[0x0][0x20];
IADD R5, R1, c[0x0][0x4];
I2F.F32.U32 R2, R6;                  R2 = R6 (integer to float conversion)              
ISCADD R0, R6, R1, 0x2;
STL [R0], R2;                        stores R2 to test[ThreadIdx.x]                                
CAL 0x50; 
EXIT ;                               __device__ function part
ISCADD R2, R6, R5, 0x2;
ISCADD R3, R6, R4, 0x2;
LD R2, [R2];                         loads d_a[tid]
LD R0, [R3];                         loads test[tid]
FMUL R0, R2, R0;                     d_a[tid] = d_a[tid]*test[tid]
ST [R3], R0;                         store the new value of d_a[tid] to global memory
RET ;

As you can see, test is stored and loaded from global memory, forming the stack frame (it is 16 floats = 64 bytes).

Now change the device function to

__device__ __forceinline__ void func(float* d_a, float* test, int tid) {
    d_a[tid]=test[tid]*d_a[tid];
}

that is, change the __device__ function from __noinline__ to __forceinline__. In this case, we have

ptxas : info : Compiling entry function '_Z15kernel_functionPf' for 'sm_20'
ptxas : info : Function properties for _Z15kernel_functionPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

i.e., we have an empty stack frame now. Indeed, the disassembled code becomes:

MOV R1, c[0x1][0x100];               
S2R R2, SR_TID.X;                    R2 = ThreadIdx.x
ISCADD R3, R2, c[0x0][0x20], 0x2;    
I2F.F32.U32 R2, R2;                  R2 = R2 (integer to float conversion)
LD R0, [R3];                         R2 = d_a[ThreadIdx.x] (load from global memory)
FMUL R0, R2, R0;                     d_a[ThreadIdx.x] = d_a[ThreadIdx.x] * ThreadIdx.x
ST [R3], R0;                         stores the new value of d_a[ThreadIdx.x] to global memory
EXIT ;

As you can see, forcing the inlining enables the compiler to perform proper optimizations so that now test is fully discarded from the code.

In the above example, __forceinline__ has an effect that is opposite to what you are experiencing, which also shows that, without any further information, the first question cannot be answered.

CUDA stack frame size increase by __forceinline__

CUDA stack frame size increase by forceinline