Problem

I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn’t matter here, so I won’t explain much about the way it works. What I think does matter, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated. I use VS2010 with CUDA 5.0, and my device is compute capability 2.1. The original host-only code can be downloaded here → http://files.getwebb.org/view-cre62u4d.html
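For context, the general pattern for such an allocation on the host looks roughly like this (a sketch of the idiom only, not the actual code from the download; the array name and dimension variables are made up):

double ***w;                                          // e.g. w[layer][neuron][input]
w = (double ***)malloc(nLayers * sizeof(double **));
for (int layer = 0; layer < nLayers; layer++) {
    w[layer] = (double **)malloc(nNeurons * sizeof(double *));
    for (int i = 0; i < nNeurons; i++)
        w[layer][i] = (double *)malloc(nInputs * sizeof(double));
}

Every level of such a structure lives behind its own pointer, which is what makes it awkward to move to the device in one piece.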

Main points of the code:

  1. patterns from adult.data are loaded into memory, using the Data structure defined in “pattern.h” (a sketch of its likely layout follows this list).
  2. several multi-dimensional arrays are allocated
  3. the algorithm is run over the patterns, using the arrays allocated just before.
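For reference, judging from the host-to-device copy code shown further down, the Data structure presumably looks roughly like this (the field names are taken from that code; the real declaration in “pattern.h” may differ):

typedef struct {
    int N;       // number of patterns
    int n;       // number of variables per pattern (14 here)
    int *t;      // target values, one per pattern
    double *x;   // inputs, flattened as N * n doubles
} Data;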

If you want to try to run the code, don’t forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this works perfectly: the “MSE” is improving. For those who have no clue what this algorithm does, let’s simply say that it learns how to predict a target value based on the 14 variables present in the patterns. The “MSE” decreases, meaning that the algorithm makes fewer mistakes after each “epoch”.

I have spent a really long time trying to run this code on the device, and I’m still unsuccessful. My last attempt was simply to copy the code that initializes the arrays and runs the algorithm into one big kernel, which failed again. That code can be downloaded here → http://files.getwebb.org/view-cre62u4c.html

To be precise, here are the differences with the original host-only code:

  • f() and fder(), which are used by the algorithm, become device functions (a sketch of what that looks like follows this list).
  • parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
  • the “w” array is initialized using a fixed value (0.5), not rand() anymore
  • a Data structure is allocated in device memory, and the data are sent to device memory after they have been loaded from adult.data into host memory.
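The first point only amounts to adding the __device__ qualifier; as a sketch (the real bodies of f() and fder() are in the downloadable code, the sigmoid here is just a placeholder):

__device__ double f(double x)    { return 1.0 / (1.0 + exp(-x)); }          // activation (placeholder body)
__device__ double fder(double x) { double y = f(x); return y * (1.0 - y); }  // its derivative (placeholder body)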

I think I made the minimal set of modifications needed to make the code run in a kernel. The “kernel_check_learningData” kernel shows some information about the patterns loaded in device memory, proving that the following code, which sends the patterns from the host to the device, did work:

Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
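// Allocating device storage for the Data structure and its t and x arrays.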
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
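I haven’t checked the return values of these calls; wrapping each one in an error check, as sketched below, would be one way to rule out possibility 1 further down (the CUDA_CHECK macro is just an illustration, it is not part of the original code):

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call) do {                                       \
    cudaError_t err__ = (call);                                     \
    if (err__ != cudaSuccess) {                                     \
        fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",             \
                cudaGetErrorString(err__), __FILE__, __LINE__);     \
        exit(EXIT_FAILURE);                                         \
    }                                                               \
} while (0)

// Example: every allocation and copy gets checked the same way.
CUDA_CHECK(cudaMalloc((void**)&dev_t, data.N * sizeof(int)));
CUDA_CHECK(cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice));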

It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.

I see two possibilities:

  1. the code sending the patterns into device memory is bugged, despite the fact that it seems to work properly, and provokes a bug much later, at the beginning of the forward phase.
  2. the CUDA API is not behaving as it should!

I have been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.

Thanks.


Solution

Here's the problem in your code, and why it works in 64-bit machine mode but not in 32-bit machine mode.

In your backpropagation kernel, in the forward path, you have a sequence of code like this:

/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) {            // for all neurons i of layer 0
    a[0][i] = x[ data->n * pat + i ];   // a[0][i] = input i
}

In 32-bit machine mode (a Win32 project, so --machine 32 is being passed to nvcc), the failure occurs on iteration i=7, when the write of a[0][7] occurs; this write is out of bounds. At this point a[0][7] is intended to hold a double value, but the indexing places us outside the allocation.

By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:

cuda-memcheck test_bp

assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.

So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:

a[0] = (double *)malloc( N[0] * sizeof(double *) );
                                              ^ oops!!

a[0][] is intended to hold double data, but you're allocating pointer storage. As it turns out, on a 64-bit machine the two types of storage are the same size (8 bytes), so it happens to work. But on a 32-bit machine a double pointer is 4 bytes whereas double data is 8 bytes, so when we index through this array taking data strides of 8 bytes, we eventually run off the end of the allocation. With the 14 inputs of this data set, the malloc returns 14 * 4 = 56 bytes, and the write of a[0][7] starts at byte offset 7 * 8 = 56, the first byte past the end of the allocation, which is why the failure shows up at i=7.

Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:

a[layer] = (double *)malloc( N[layer] * sizeof(double) );  

which is correct. I see that the original "host-only" code seems to contain this error too, so there may be a latent defect in that code as well.
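The fix is simply to give layer 0 the same treatment:

a[0] = (double *)malloc( N[0] * sizeof(double) );   // doubles, not pointers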

You will still need to address the kernel running time in some fashion, to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.
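As a sketch of one common way to stay under the watchdog limit (the kernel name and nEpochs are hypothetical, and the <<<1, 1>>> launch mirrors the single-threaded approach you are using now): launch one shorter kernel per epoch and synchronize in between, so that no single launch exceeds the default TDR timeout of roughly two seconds, while the arrays stay resident in device memory between launches.

for (int epoch = 0; epoch < nEpochs; epoch++) {
    kernel_backprop_epoch<<<1, 1>>>(dev_data);    // hypothetical per-epoch kernel
    cudaError_t err = cudaDeviceSynchronize();    // wait, and surface any error from this launch
    if (err != cudaSuccess) {
        fprintf(stderr, "epoch %d failed: %s\n", epoch, cudaGetErrorString(err));
        break;
    }
}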
