Paralleling a simple algorithm on GPU with CUDA

Question

I would suggest adding proper cuda error checking to your code. I believe your kernel is making out-of-bounds accesses and failing.
run your code with cuda-memcheck, as it will help identify why the kernel is failing.
These are fairly large allocations to make on the stack:
```
const int
  w = 4000,
  h = 2000;
unsigned char
  in[w*h],
  out[w*h];
```
roughly 8MB each. That can be a problem; it may be system-dependent. It's usually better to do large allocations via dynamic allocation e.g. malloc. On my particular system, I get a seg fault due to these large stack variables not being allocated correctly.
Your kernel is missing an appropriate "thread check". At first I thought you were doing a good job with this:
```
if(
     i < w              //first row
  ||   i >= (w * (h - 1)) // last row
  || !(i % w)             // first column
  ||  (i % w + 1 == w)    // last column
)
```
but this is a problem:
```
out[i] = 0;
return;
```
If you comment out the out[i] = 0; line, you'll have better luck. Alternatively, if you don't like commenting it out, you could do:
```
if (i < (w*h)) out[i] = 0;
```
The problem is your grid launch parameters necessarily create "extra threads":
```
dim3 threadsPerBlock(1024); //Max
dim3 numBlocks(w*h/threadsPerBlock.x + 1);
```
If you have a proper thread check (which you almost do...), then it's not a problem. But you can't let those extra threads write to invalid locations.

To explain thread per block and number of blocks, working through the arithmetic may be useful. A cuda kernel launch has an associated grid. The grid is simply all the threads associated with a kernel launch. The threads will be divided into blocks. So the grid is equal to the number of blocks launched times the threads per block. How many is that in your case? This line says you are asking for 1024 threads per block:

    dim3 threadsPerBlock(1024); //Max

The number of blocks you are launching is given by:

    dim3 numBlocks(w*h/threadsPerBlock.x + 1);

The arithmetic is:

    (w=4000)*(h=2000)/1024 = 7812.5 = 7812   (note this is an *integer* divide)

Then we add 1. So you are launching 7813 blocks. How many threads is that?

    (7813 blocks)*(1024 threads per block) = 8000512 threads

But you only need (and only want) 8000000 threads (= w * h) So you need a thread check to prevent the extra 512 threads from trying to access out[i]. But your thread check is broken in this respect.

As a final note, the most obvious way to me to make this code run faster would be to exploit the data-reuse in adjacent operations via shared memory. But get your code working correctly first.