JCuda: copy multidimensional array from device to host

https://stackoverflow.com/questions/18469142

cuda
jcuda

26-06-2022
|

Question

I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems in doing so in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).

In a few words, I put the results of my kernel in a bi-dimensional array of shorts, where the first dimension of such array is the number of threads, so that each one can write in different locations.

Here an example:

CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter

// Invoke kernel with pointer_dev as parameter. Now it should contain some results

CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel

cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // Its seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction

What am I doing wrong?

EDIT:

Apparently, there are some limitations that doesn't allow device allocated memory to be copied back to host, as stated in this (and many more) threads: link

Any workaround? I'm using CUDA Toolkit v5.0

Solution

Here we are copying a two dimensional array of integers from the device to host.

First, create a single dimensional array with size equal to size of another single dimension array (here blockSizeX).

CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
for (int i = 0; i < blockSizeX; i++)
{
    hostDevicePointers[i] = new CUdeviceptr();
    cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
}

Allocate device memory for the array of pointers that point to the other array, and copy array pointers from the host to the device.

CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);

Launch the kernel.

kernelLauncher.call(........, hostDevicePointersArray);

Transfer the output from the device to host.

int hostOutputData[] = new int[numberofelementsInArray * blockSizeX];
cuMemcpyDtoH(Pointer.to(hostOutputData), hostDevicePointers[i], numberofelementsInArray * blockSizeX * Sizeof.INT);

for (int j = 0; j < size; j++)
{
    sum = sum + hostOutputData[j];
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow