When I compile and run your program I get a segfault. This is due to the last parameter you are passing to the kernel (N_d):
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N_d)
Since N is a scalar quantity, the kernel is expecting to use it directly, rather than as a pointer. So when you pass a pointer to device data (N_d), the process of setting up the kernel generates a seg fault (in host code!) as it attempts to access the value of N, which should instead be passed directly:
call pi_darts<<<grid, tblock>>>(x_d, y_d, pi_parts_d, N)
When I make that change to the code you have posted, I then get actual printed output (instead of a seg fault): an array of ones and zeroes (256 ones followed by 144 zeroes, for a total of N = 400 values), followed by the calculated PI value, which happens to be 2.56 in this case (4*256/400) since you have made the kernel basically a dummy kernel.
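In CUDA Fortran, that usually means the kernel declares the scalar dummy argument with the value attribute. Your kernel isn't reproduced above, so the names and real kind here are assumptions based on the launch statement, but the declarations would look roughly like this:

attributes(global) subroutine pi_darts(x, y, results, n)
  integer, value :: n                  ! scalar received by value, so the host passes N, not N_d
  real, dimension(n) :: x, y, results  ! device arrays corresponding to x_d, y_d, pi_parts_d
  ...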
This line of code is also probably not doing what you want:
grid = dim3(N/tBlock%x,1,1)
With N = 400 and tBlock%x = 256 (from the previous code lines), the result of the integer division is 1 (i.e. grid ends up as (1,1,1), which amounts to one threadblock). But you really want to launch 2 threadblocks, so as to cover the entire range of your data set (N = 400 elements). There are a number of ways to fix this, but for simplicity let's just always add 1 to the calculation:
grid = dim3((N/tBlock%x)+1,1,1)
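As an aside, a common alternative (not from your posted code) is the ceiling-division idiom, which also covers the full range but avoids launching a spare, completely idle block when N happens to be an exact multiple of the block size:

grid = dim3((N + tBlock%x - 1)/tBlock%x, 1, 1)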
Under these circumstances, when we launch grids that are larger (in terms of total threads) than our data set size (512 threads but only 400 data elements in this example), it's customary to put a thread check near the beginning of our kernel (in this case, after the initialization of id) to prevent out-of-bounds accesses, like so:
if (id .lt. N) then
(and a corresponding endif at the very end of the kernel code). This way, only the threads that correspond to actual valid data are allowed to do any work.
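Putting the pieces together, a minimal sketch of the guarded kernel might look like the following. The index arithmetic and the hit test are assumptions (your actual kernel body isn't shown here), so adapt the comparison to however id is computed in your code: with a 1-based id the check is id .le. n, with a 0-based id it is id .lt. n.

attributes(global) subroutine pi_darts(x, y, results, n)
  implicit none
  integer, value :: n
  real, dimension(n) :: x, y, results
  integer :: id

  ! global thread index (1-based in this sketch)
  id = (blockIdx%x - 1) * blockDim%x + threadIdx%x

  ! thread check: only threads that map to valid data do any work
  if (id .le. n) then
    ! count the dart as a hit if it landed inside the unit circle
    if ((x(id)*x(id) + y(id)*y(id)) .le. 1.0) then
      results(id) = 1.0
    else
      results(id) = 0.0
    end if
  end if
end subroutine pi_darts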
With the above changes, your code should be essentially functional, and you should be able to revert your kernel code to the proper statements and start to get an estimate of PI.
Note that you can check the CUDA API calls for error return codes, and you can also run your code with cuda-memcheck to get an idea of whether the kernel is making out-of-bounds accesses. Neither of these would have helped with this particular seg fault, however.
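For completeness, here is a rough sketch of what post-launch error checking can look like in CUDA Fortran (this assumes the cudafor module and an integer istat variable; it is not part of your posted code):

istat = cudaGetLastError()              ! catches launch/configuration errors
if (istat .ne. cudaSuccess) then
  write(*,*) 'kernel launch error: ', cudaGetErrorString(istat)
end if
istat = cudaDeviceSynchronize()         ! catches errors that occur while the kernel runs
if (istat .ne. cudaSuccess) then
  write(*,*) 'kernel execution error: ', cudaGetErrorString(istat)
end if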