As @sgar91 pointed out, the problem is that the compilation target does not match the actual GPU.
To be specific: you have -code sm_20 in your options, which makes the compiler build a binary for sm_20 only, with no PTX embedded in it. Without PTX, the code cannot be JIT-compiled for your devices (compute capability > 2.0), and hence your GPU operations fail. You should use -code compute_20 or one or more -gencode arguments (see the nvcc manual for more examples).
Some examples:
$ nvcc test.cu -o test -arch compute_20 -code compute_20
$ nvcc test.cu -o test -gencode="arch=compute_20,code=\"compute_20,sm_20,sm_30\""
$ nvcc test.cu -o test -gencode="arch=compute_20,code=\"sm_20,sm_21\"" -gencode="arch=compute_30,code=\"compute_30,sm_30\""
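To choose targets that match your hardware, it helps to check the compute capability the device actually reports. A minimal sketch, assuming device 0 is the GPU you run on (the device index is an assumption, adjust it for multi-GPU machines):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;  // assumption: the first GPU is the one of interest
    cudaDeviceProp prop;

    // Query the properties of the chosen device.
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // major.minor is the compute capability, e.g. 3.0 corresponds to sm_30.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```

The reported major.minor pair tells you which sm_XX (or compute_XX for PTX) should appear in your -code or -gencode options.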
Rather than just asserting on the result of your CUDA API call, you should report the actual error, since that would have been helpful here.
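One common way to report the error is a small checking macro around each runtime call. The sketch below is only an illustration: CUDA_CHECK and the noop kernel are hypothetical names, not something from your code, and exiting on failure is just one possible policy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call, its location, and the
// runtime's error string, then exit.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s failed: %s\n",             \
                    __FILE__, __LINE__, #call,                    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical empty kernel, used only to exercise a launch.
__global__ void noop() {}

int main() {
    float *d = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d, 16 * sizeof(float)));

    noop<<<1, 1>>>();
    // Kernel launches return no status directly; fetch it explicitly.
    // A binary built for the wrong architecture typically surfaces here.
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

With a check like this, an architecture mismatch shows up as a readable message from cudaGetErrorString instead of a bare assertion failure.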