I figured out what the problem was.
After fixing a few bugs I got the "Too Many Resources Requested for Launch" error. Unrolling a loop needs extra registers per thread, and I was running out of registers, hence the error and the failed kernel launch. I needed 22 registers per thread, and I have 1024 threads per block.
Entering my data into the CUDA_Occupancy_calculator showed me that 1 block per SM is scheduled, leaving me with 32768 registers for a whole block on the compute capability 2.0 device.
22 registers * 1024 threads = 22528 registers < 32768, which should have worked. But I was compiling with nvcc -arch sm_13, which applies the C.C. 1.3 limit of 16384 registers per SM.
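As a rough sketch of that arithmetic (the helper names are my own and not part of the CUDA toolkit, and it ignores the hardware's register allocation granularity, so treat it as an estimate rather than what the driver actually computes):

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not CUDA API calls.
// Register file sizes per SM for the two compute capabilities discussed.
constexpr std::uint32_t kRegsPerSmCc13 = 16384;  // compute capability 1.3
constexpr std::uint32_t kRegsPerSmCc20 = 32768;  // compute capability 2.0

// Total registers one block needs, assuming every thread uses the same count.
constexpr std::uint32_t registers_per_block(std::uint32_t regs_per_thread,
                                            std::uint32_t threads_per_block) {
    return regs_per_thread * threads_per_block;
}

// True if a single block fits in the register file of one SM.
constexpr bool block_fits(std::uint32_t regs_per_thread,
                          std::uint32_t threads_per_block,
                          std::uint32_t regs_per_sm) {
    return registers_per_block(regs_per_thread, threads_per_block) <= regs_per_sm;
}

// The numbers from my case: 22 registers per thread, 1024 threads per block.
static_assert(registers_per_block(22, 1024) == 22528, "22 * 1024 registers");
static_assert(block_fits(22, 1024, kRegsPerSmCc20), "fits within the C.C. 2.0 limit");
static_assert(!block_fits(22, 1024, kRegsPerSmCc13), "exceeds the C.C. 1.3 limit");
```

The checks run at compile time, so the file only compiles if the block fits under the C.C. 2.0 limit and does not fit under the C.C. 1.3 limit.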
I recompiled with nvcc -arch sm_20, taking advantage of the 32768 registers, more than enough for the needed 22528, and it works fine now. Thanks to everyone, I learned about kernel errors.