thrust 1.7 tabulate on CUDA device fails

Question

The problem appears to be fixed in the thrust master branch at the moment. This master branch currently identifies itself as Thrust v1.8.

I ran your code with CUDA 6RC (appears to be what you are using) and I was able to duplicate your observation.

I then updated to the master branch, and removed the __CUDA_ARCH__ macro from your code, and I got the expected results (host and device tabulations match).

Note that according to the programming guide, the __CUDA_ARCH__ macro is only defined when it's used in code that is being compiled by the device code compiler. It is officially undefined in host code. Therefore it's acceptable to use it as follows in host code:

#ifdef __CUDA_ARCH__

but not as you are using it. Yes, I understand the behavior is different between thrust v1.7 and thrust master in this regard, but that appears to (also) be a thrust issue, that has been fixed in the master branch.

Both of these issues I expect would be fixed whenever the next version of thrust gets incorporated into an official CUDA drop. Since we are very close to CUDA 6.0 official release, I'd be surprised if these issues were fixed in CUDA 6.0.

Further notes about the tabulate issue:

One workaround would be to update thrust to master
Issue doesn't appear to be specific to thrust::tabulate in my testing. Many thrust functions that I tested seem to fail in that when used with thrust::device and raw pointers, they fail to write values correctly (seem to write all zeroes), but they do seem to be able to read values correctly (e.g. thrust::reduce seems to work)
Another possible workaround is to wrap your raw pointers with thrust::device_ptr<> using thrust::device_ptr_cast<>(). That seemed to work for me as well.