CUDA kernel for Conway's Game of Life

https://stackoverflow.com/questions/4438286

09-10-2019

Question

I'm trying to calculate the number of transitions that would be made in a run of Conway's Game of Life (GOL) on a p x q matrix over n iterations. For instance, given 1 iteration with the initial state being 1 blinker (shown below), there would be 5 transitions (2 births, 1 survival, 2 deaths from underpopulation). I've already got this working on the CPU, but I'd like to convert the logic to run using CUDA. Below is what I want to port to CUDA.

[Image: a single blinker]

Code:

    static void gol() // call this once per iteration
    {
        int[] tempGrid = new int[rows * cols]; // next generation; grid holds the current state (initially the starting conditions)
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                tempGrid[i * cols + j] = grid[i * cols + j];
            }
        }

        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                int numNeighbors = neighbors(i, j); // finds # of neighbors

                if (grid[i * cols + j] == 1 && numNeighbors > 3)
                {
                    tempGrid[i * cols + j] = 0;
                    overcrowding++;
                }
                else if (grid[i * cols + j] == 1 && numNeighbors < 2)
                {
                    tempGrid[i * cols + j] = 0;
                    underpopulation++;
                }
                else if (grid[i * cols + j] == 1 && numNeighbors > 1) // 2 or 3 neighbors; more than 3 was handled above
                {
                    tempGrid[i * cols + j] = 1;
                    survival++;
                }
                else if (grid[i * cols + j] == 0 && numNeighbors == 3)
                {
                    tempGrid[i * cols + j] = 1;
                    birth++;
                }
            }
        }

        grid = tempGrid;
    }

Solution

Your main bottleneck is going to be access to global (main) memory, so I'd suggest that you pick a largish thread block size based on the hardware you have available. 256 threads (16x16) is a good choice for cross-hardware compatibility. Each of those thread blocks calculates the results for a slightly smaller section of the board: if you use 16x16, each block calculates the results for a 14x14 section, since it also has to load a one-element border around that section. (The reason to use a 16x16 block to calculate a 14x14 chunk rather than a 16x16 chunk is memory read coalescing.)

Divide the board up into (say) 14x14 chunks; that is your grid, organized however you see fit, but most likely something like (board_width / 14, board_height / 14), rounded up so that the chunks cover the whole board.
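The chunking above can be sketched as a launch configuration. This is only an illustration: the names `BLOCK`, `TILE`, `chunksFor` and `gridFor` are hypothetical and do not come from the answer.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 16;         // threads per block edge (16x16 = 256 threads)
constexpr int TILE  = BLOCK - 2;  // cells computed per block edge (14), leaving a one-cell border

// Number of TILE-wide chunks needed to cover n cells, rounding up.
inline int chunksFor(int n) { return (n + TILE - 1) / TILE; }

// Hypothetical helper: grid dimensions (in blocks) for a board of cols x rows cells.
inline dim3 gridFor(int cols, int rows)
{
    return dim3(chunksFor(cols), chunksFor(rows));
}
```

For a board 100 cells wide and 50 cells high, this gives an 8x4 grid of blocks, since 100 / 14 and 50 / 14 round up to 8 and 4. A kernel would then be launched with `dim3 block(BLOCK, BLOCK)` and `gridFor(cols, rows)`.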

Within the kernel, have each thread load its element into shared memory, then call __syncthreads(). Then have the threads for the middle 14x14 elements calculate the new value (using the values stored in shared memory) and write it back into global memory. The use of shared memory helps minimize global reads and writes. This is also the reason to make your thread block as big as possible: the edges and corners are "wasted" global memory accesses, since the values fetched there only get used 3 times (edges) or 1 time (corners), not 9 times.
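A minimal sketch of such a kernel follows, under several assumptions that are not in the original posts: cells outside the board count as dead (the question's `neighbors()` is not shown), the four transition counts live in a hypothetical `counters` array updated with `atomicAdd`, and the names `golStep` and `runGol` are made up for illustration. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

constexpr int BLOCK = 16;         // threads per block edge
constexpr int TILE  = BLOCK - 2;  // cells computed per block edge (14)

// Indices into the counters array (hypothetical layout).
enum { OVERCROWDING = 0, UNDERPOPULATION = 1, SURVIVAL = 2, BIRTH = 3 };

__global__ void golStep(const int* in, int* out, int rows, int cols,
                        unsigned int* counters)
{
    __shared__ int tile[BLOCK][BLOCK];

    int tx = threadIdx.x, ty = threadIdx.y;
    // Board coordinates of this thread's cell. Adjacent blocks overlap by
    // the one-cell border, hence the stride of TILE and the -1 offset.
    int col = (int)blockIdx.x * TILE + tx - 1;
    int row = (int)blockIdx.y * TILE + ty - 1;

    bool inBoard = row >= 0 && row < rows && col >= 0 && col < cols;
    // Assumption: cells outside the board are dead.
    tile[ty][tx] = inBoard ? in[row * cols + col] : 0;
    __syncthreads();

    // Only the middle 14x14 threads compute; border threads just loaded data.
    bool interior = tx >= 1 && tx <= TILE && ty >= 1 && ty <= TILE;
    if (!interior || !inBoard) return;

    int n = 0;  // live neighbors, read from shared memory only
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                n += tile[ty + dy][tx + dx];

    int alive = tile[ty][tx];
    int next = alive;
    if (alive && n > 3)      { next = 0; atomicAdd(&counters[OVERCROWDING], 1u); }
    else if (alive && n < 2) { next = 0; atomicAdd(&counters[UNDERPOPULATION], 1u); }
    else if (alive)          { next = 1; atomicAdd(&counters[SURVIVAL], 1u); }
    else if (n == 3)         { next = 1; atomicAdd(&counters[BIRTH], 1u); }
    out[row * cols + col] = next;
}

// Hypothetical host helper: advances hostGrid by `iterations` steps and
// writes the accumulated transition counts into counts[4].
void runGol(std::vector<int>& hostGrid, int rows, int cols, int iterations,
            unsigned int counts[4])
{
    size_t bytes = hostGrid.size() * sizeof(int);
    int *dIn = nullptr, *dOut = nullptr;
    unsigned int* dCounts = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dCounts, 4 * sizeof(unsigned int));
    cudaMemcpy(dIn, hostGrid.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dCounts, 0, 4 * sizeof(unsigned int));

    dim3 block(BLOCK, BLOCK);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    for (int i = 0; i < iterations; ++i) {
        golStep<<<grid, block>>>(dIn, dOut, rows, cols, dCounts);
        std::swap(dIn, dOut);  // this step's output is the next step's input
    }
    cudaMemcpy(hostGrid.data(), dIn, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(counts, dCounts, 4 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dCounts);
}
```

Using one atomic counter per transition type keeps the sketch short; with large boards, a per-block reduction before the atomic update would likely reduce contention, but that is a separate optimization.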

Other tips

Here's one way you could proceed:

Each thread performs the computation for 1 element of the grid
Each thread first loads one element from the main grid into shared memory
Threads on the edge of the thread block also need to load the boundary elements
Each thread can then make its survival computation based on the contents of shared memory
Each thread then writes its result back to main memory
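The steps above could look roughly like the kernel below. It is a sketch, not a definitive implementation: the names `BS`, `cellOrDead` and `golStepHalo` are hypothetical, cells outside the board are assumed to be dead, and the transition counters from the question are left out (they could be added with `atomicAdd`).

```cuda
#include <cuda_runtime.h>

constexpr int BS = 16;  // threads per block edge; every thread computes one cell

// Assumption: cells outside the board are treated as dead.
__device__ int cellOrDead(const int* in, int row, int col, int rows, int cols)
{
    return (row >= 0 && row < rows && col >= 0 && col < cols)
               ? in[row * cols + col] : 0;
}

__global__ void golStepHalo(const int* in, int* out, int rows, int cols)
{
    __shared__ int tile[BS + 2][BS + 2];  // the block's cells plus a one-cell border

    int tx = threadIdx.x, ty = threadIdx.y;
    int col = (int)blockIdx.x * BS + tx;
    int row = (int)blockIdx.y * BS + ty;

    // Every thread loads its own cell, shifted by 1 to leave room for the border.
    tile[ty + 1][tx + 1] = cellOrDead(in, row, col, rows, cols);

    // Threads on the block edge also load the adjacent border cells.
    if (tx == 0)      tile[ty + 1][0]      = cellOrDead(in, row, col - 1, rows, cols);
    if (tx == BS - 1) tile[ty + 1][BS + 1] = cellOrDead(in, row, col + 1, rows, cols);
    if (ty == 0)      tile[0][tx + 1]      = cellOrDead(in, row - 1, col, rows, cols);
    if (ty == BS - 1) tile[BS + 1][tx + 1] = cellOrDead(in, row + 1, col, rows, cols);

    // Corner threads load the four diagonal border cells.
    if (tx == 0 && ty == 0)
        tile[0][0] = cellOrDead(in, row - 1, col - 1, rows, cols);
    if (tx == BS - 1 && ty == 0)
        tile[0][BS + 1] = cellOrDead(in, row - 1, col + 1, rows, cols);
    if (tx == 0 && ty == BS - 1)
        tile[BS + 1][0] = cellOrDead(in, row + 1, col - 1, rows, cols);
    if (tx == BS - 1 && ty == BS - 1)
        tile[BS + 1][BS + 1] = cellOrDead(in, row + 1, col + 1, rows, cols);
    __syncthreads();

    if (row >= rows || col >= cols) return;  // this thread lies outside the board

    int n = 0;  // live neighbors, read from shared memory only
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                n += tile[ty + 1 + dy][tx + 1 + dx];

    int alive = tile[ty + 1][tx + 1];
    // A cell is alive next step with exactly 3 neighbors, or if alive with 2.
    out[row * cols + col] = (n == 3 || (alive && n == 2)) ? 1 : 0;
}
```

With this layout the grid would be sized by rounding `cols / BS` and `rows / BS` up, and every thread in the block does useful computation, at the cost of a few extra, less regular global reads by the edge and corner threads.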

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow