The problem appears to be fixed in the thrust master branch at the moment. This master branch currently identifies itself as Thrust v1.8.
I ran your code with CUDA 6RC (appears to be what you are using) and I was able to duplicate your observation.
I then updated to the master branch, and removed the __CUDA_ARCH__
macro from your code, and I got the expected results (host and device tabulations match).
Note that according to the programming guide, the __CUDA_ARCH__
macro is only defined when it's used in code that is being compiled by the device code compiler. It is officially undefined in host code. Therefore it's acceptable to use it as follows in host code:
#ifdef __CUDA_ARCH__
but not as you are using it. Yes, I understand the behavior is different between thrust v1.7 and thrust master in this regard, but that appears to (also) be a thrust issue, that has been fixed in the master branch.
Both of these issues I expect would be fixed whenever the next version of thrust gets incorporated into an official CUDA drop. Since we are very close to CUDA 6.0 official release, I'd be surprised if these issues were fixed in CUDA 6.0.
Further notes about the tabulate issue:
- One workaround would be to update thrust to master
- Issue doesn't appear to be specific to
thrust::tabulate
in my testing. Many thrust functions that I tested seem to fail in that when used withthrust::device
and raw pointers, they fail to write values correctly (seem to write all zeroes), but they do seem to be able to read values correctly (e.g.thrust::reduce
seems to work) - Another possible workaround is to wrap your raw pointers with
thrust::device_ptr<>
usingthrust::device_ptr_cast<>()
. That seemed to work for me as well.