Code running perfectly on host, put in a kernel, fails for mysterious reasons

Question

Here's the problem in your code, and why it works in 64-bit machine mode but not in 32-bit machine mode.

In your backpropagation kernel, in the forward path, you have a sequence of code like this:

/*
 * for layer = 0
 */
for (i = 0; i < N[0]; i++) {             // for all neurons i of layer 0
    a[0][i] = x[data->n * pat + i];      // a[0][i] = input i
}

In 32-bit machine mode (a Win32 project, where --machine 32 is being passed to nvcc), the failure occurs on iteration i=7, at the write to a[0][7]; this write is out of bounds. a[0][7] is intended to hold a double value, but the indexing is placing us outside the allocated storage.

By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:

cuda-memcheck test_bp

assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.

So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:

a[0] = (double *)malloc( N[0] * sizeof(double *) );
                                              ^ oops!!

a[0][] is intended to hold double data, but you're allocating storage sized for pointers. On a 64-bit machine the two sizes happen to be the same (8 bytes each), so it ends up working. But on a 32-bit machine, a pointer to double is 4 bytes whereas double data is 8 bytes, so the allocation is only half as large as it needs to be. When we index through this array in strides of 8 bytes, we run past the end of the requested storage about halfway through the loop.
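To make the arithmetic concrete, here is a minimal host-side sketch in plain C. The helper names bytes_allocated and first_oob_index are hypothetical, made up for this illustration, and the element sizes are passed in as parameters so that both the 32-bit and the 64-bit case can be evaluated on any machine.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes requested when each of the n elements is sized as elem_size. */
size_t bytes_allocated(size_t n, size_t elem_size) {
    return n * elem_size;
}

/*
 * First index i for which writing an access_size-byte element at byte
 * offset i * access_size would run past an allocation of alloc_bytes.
 * A result >= n means a loop over 0..n-1 stays inside the allocation.
 */
size_t first_oob_index(size_t alloc_bytes, size_t access_size) {
    assert(access_size > 0);            /* guard the division below */
    return alloc_bytes / access_size;   /* integer division rounds down */
}
```

With an illustrative layer size of 14 (a value chosen for this example, not taken from your code), first_oob_index(bytes_allocated(14, 4), 8) is 7: with 4-byte pointers, the write to element 7 is the first one past the requested storage. With 8-byte pointers, first_oob_index(bytes_allocated(14, 8), 8) is 14, so no index in 0..13 overruns. The index at which a tool actually reports the fault can differ, since allocators may round the requested size up.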

Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:

a[layer] = (double *)malloc( N[layer] * sizeof(double) );

which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code too.
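One common way to guard against this kind of mismatch is to size the allocation from the pointer being assigned, rather than repeating the type name by hand. The sketch below is plain C that also compiles as C++ (hence the cast on malloc); copy_inputs is a hypothetical helper written for this illustration, not something from your code.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate a buffer of n doubles and copy n inputs into it.
 * sizeof *a0 follows the declared type of a0, so the element size
 * cannot drift away from the pointer's type.
 * Returns NULL if the allocation fails; the caller must free() the result.
 */
double *copy_inputs(const double *x, size_t n) {
    size_t i;
    double *a0;

    assert(x != NULL);
    a0 = (double *)malloc(n * sizeof *a0);
    if (a0 == NULL) {
        return NULL;
    }
    for (i = 0; i < n; i++) {
        a0[i] = x[i];   /* each write is sizeof(double) bytes wide */
    }
    return a0;
}
```

Written this way, the allocation for layer 0 would be correct in both 32-bit and 64-bit builds, because the size expression never mentions a pointer type.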

You will still need to address the kernel running time in some fashion to avoid the Windows TDR event if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.