Question

The new thrust::tabulate function works for me on the host but not on the device. The device is a K20x with compute capability 3.5. The host is an Ubuntu machine with 128GB of memory. Help?

I think that the unified addressing is not the problem since I can sort a unifiedly addressed array on the device.

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/tabulate.h> 
#include <thrust/version.h> 

using namespace std;

// Print an expression's name then its value, possible followed by a
// comma or endl.  Ex: cout << PRINTC(x) << PRINTN(y);
#define PRINT(arg)  #arg "=" << (arg)
#define PRINTC(arg)  #arg "=" << (arg) << ", "
#define PRINTN(arg)  #arg "=" << (arg) << endl

//   Execute an expression and check for CUDA errors.
#define CE(exp) {                       \
cudaError_t e; e = (exp);                       \
if (e != cudaSuccess) { \
   cerr << #exp << " failed at line " << __LINE__ << " with error " << cudaGetErrorString(e) << endl; \
   exit(1); \
} \
}

const int N(10);

int main(void) {
  int major = THRUST_MAJOR_VERSION;
  int minor = THRUST_MINOR_VERSION;
  cout << "Thrust v" << major << "." << minor 
   << ", CUDA_VERSION: " << CUDA_VERSION << ", CUDA_ARCH: " << __CUDA_ARCH__ 
   << endl;
  cout << PRINTN(N);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  if (!prop.unifiedAddressing) {
cerr << "Unified addressing not available." << endl;
exit(1);
  }
  cudaGetDeviceProperties(&prop, 0);
  if (!prop.canMapHostMemory) {
cerr << "Can't map host memory." << endl;
exit(1);
  }
  cudaSetDeviceFlags(cudaDeviceMapHost);

  int *p, *q;
  CE(cudaHostAlloc(&p, N*sizeof(int), cudaHostAllocMapped));
  CE(cudaHostAlloc(&q, N*sizeof(int), cudaHostAllocMapped));

  thrust::tabulate(thrust::host, p, p+N, thrust::negate<int>());
  thrust::tabulate(thrust::device, q, q+N, thrust::negate<int>());

  for (int i=0; i<N; i++) 
cout << PRINTC(i) << PRINTC(p[i]) << PRINTN(q[i]);
}

Output:

 Thrust v1.7, CUDA_VERSION: 6000, CUDA_ARCH: 0
 N=10
 i=0, p[i]=0, q[i]=0
 i=1, p[i]=-1, q[i]=0
 i=2, p[i]=-2, q[i]=0
 i=3, p[i]=-3, q[i]=0
 i=4, p[i]=-4, q[i]=0
 i=5, p[i]=-5, q[i]=0
 i=6, p[i]=-6, q[i]=0
 i=7, p[i]=-7, q[i]=0
 i=8, p[i]=-8, q[i]=0
 i=9, p[i]=-9, q[i]=0

The following does not add any info content to my post but is required before stackoverflow will accept it: Much of the program is error checking and version checking.

Was it helpful?

Solution

The problem appears to be fixed in the thrust master branch at the moment. This master branch currently identifies itself as Thrust v1.8.

I ran your code with CUDA 6RC (appears to be what you are using) and I was able to duplicate your observation.

I then updated to the master branch, and removed the __CUDA_ARCH__ macro from your code, and I got the expected results (host and device tabulations match).

Note that according to the programming guide, the __CUDA_ARCH__ macro is only defined when it's used in code that is being compiled by the device code compiler. It is officially undefined in host code. Therefore it's acceptable to use it as follows in host code:

#ifdef __CUDA_ARCH__

but not as you are using it. Yes, I understand the behavior is different between thrust v1.7 and thrust master in this regard, but that appears to (also) be a thrust issue, that has been fixed in the master branch.

Both of these issues I expect would be fixed whenever the next version of thrust gets incorporated into an official CUDA drop. Since we are very close to CUDA 6.0 official release, I'd be surprised if these issues were fixed in CUDA 6.0.

Further notes about the tabulate issue:

  1. One workaround would be to update thrust to master
  2. Issue doesn't appear to be specific to thrust::tabulate in my testing. Many thrust functions that I tested seem to fail in that when used with thrust::device and raw pointers, they fail to write values correctly (seem to write all zeroes), but they do seem to be able to read values correctly (e.g. thrust::reduce seems to work)
  3. Another possible workaround is to wrap your raw pointers with thrust::device_ptr<> using thrust::device_ptr_cast<>(). That seemed to work for me as well.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top