The maximum number of registers per thread is documented here.
It is 63 for cc 2.x and 3.0, 124 for cc 1.x, and 255 for cc 3.5.
The compiler may have decided that 63 registers is enough and that it has no use for additional registers. Registers can be reused, so having a lot of local variables doesn't necessarily mean that the register count per thread has to be high.
My suggestion would be to use the nvcc -maxrregcount option to specify various limits, and then use the -Xptxas -v option to have the compiler tell you how many registers each kernel is using when ptxas compiles the PTX.
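As a rough sketch of that experiment, something like the following could be used. The file name regs.cu, the saxpy kernel, and the -arch value are all assumptions made up for illustration; substitute your own kernel and target architecture. Besides the compile-time report from -Xptxas -v, the runtime call cudaFuncGetAttributes can also return the register count of a compiled kernel.

```cuda
// Compile twice and compare the ptxas report (file name and -arch are
// placeholders for illustration):
//   nvcc -arch=sm_35 -Xptxas -v regs.cu -o regs
//   nvcc -arch=sm_35 -maxrregcount=32 -Xptxas -v regs.cu -o regs
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one being investigated.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main()
{
    // Ask the runtime how many registers the compiled kernel uses per thread.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, saxpy);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread: %d\n", attr.numRegs);
    // Non-zero local memory can indicate that registers were spilled.
    printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);
    return 0;
}
```

If you set -maxrregcount below what the kernel would otherwise use, expect the ptxas output to start reporting spill stores and loads, since the excess values get moved to local memory.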