Question

I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems in doing so in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).

In a few words, I put the results of my kernel in a bi-dimensional array of shorts, where the first dimension of such array is the number of threads, so that each one can write in different locations.

Here is an example:

CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter

// Invoke kernel with pointer_dev as parameter. Now it should contain some results

CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel

cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // It seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction

What am I doing wrong?

EDIT:

Apparently, there are some limitations that don't allow device-allocated memory to be copied back to the host, as stated in this thread (and many more): link

Any workaround? I'm using CUDA Toolkit v5.0


Solution

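The steps below copy a two-dimensional array of integers from the device to the host. Every row is allocated from the host with cuMemAlloc, so the host knows each row pointer and can copy the rows back; memory allocated inside the kernel cannot be read with cuMemcpyDtoH.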

  1. First, create a host-side array of device pointers with one entry per row (here blockSizeX rows), and allocate device memory for each row.

    CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
    for (int i = 0; i < blockSizeX; i++)
    {
        hostDevicePointers[i] = new CUdeviceptr();
        cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
    }
    
  2. Allocate device memory for the table of row pointers, and copy the row pointers from the host to the device.

    CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
    cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
    cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);
    
  3. Launch the kernel.

    kernelLauncher.call(........, hostDevicePointersArray);
    
  4. Transfer the output from the device to host.

    // Each row is a separate device allocation of `size` ints, so copy row by row.
    int[] hostOutputData = new int[size];
    int sum = 0; // accumulator (assumed not to be declared earlier)
    for (int i = 0; i < blockSizeX; i++)
    {
        cuMemcpyDtoH(Pointer.to(hostOutputData), hostDevicePointers[i], size * Sizeof.INT);
        for (int j = 0; j < size; j++)
        {
            sum = sum + hostOutputData[j];
        }
    }
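
As an alternative that is not part of the steps above, the pointer table can be avoided by using one flat, row-major buffer. Assuming the kernel writes element (row, col) to position row * size + col, a single cuMemAlloc and a single cuMemcpyDtoH are enough. The plain-Java sketch below only shows the host-side index arithmetic and how to split the flat result back into rows. FlatLayout, index and unflatten are hypothetical names for illustration, not JCuda API, and the input array stands in for data copied back from the device.

```java
import java.util.Arrays;

// Sketch of a flat (row-major) layout for a rows x cols result buffer.
// FlatLayout, index and unflatten are hypothetical names, not JCuda API.
class FlatLayout {
    // Position of element (row, col) in a row-major flat array.
    static int index(int row, int col, int cols) {
        return row * cols + col;
    }

    // Split a flat host array (as filled by one device-to-host copy)
    // back into one Java array per row.
    static int[][] unflatten(int[] flat, int rows, int cols) {
        if (flat.length != rows * cols) {
            throw new IllegalArgumentException("flat.length must equal rows * cols");
        }
        int[][] result = new int[rows][];
        for (int r = 0; r < rows; r++) {
            result[r] = Arrays.copyOfRange(flat, r * cols, (r + 1) * cols);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for data copied back from the device: 2 rows of 3 values.
        int[] flat = {1, 2, 3, 4, 5, 6};
        int[][] rows = unflatten(flat, 2, 3);
        System.out.println(Arrays.deepToString(rows)); // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```

With this layout the host array would have blockSizeX * size elements, and the kernel would receive one device pointer instead of an array of pointers.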
    
Licensed under: CC-BY-SA with attribution