Question

I'm writing an OpenCL kernel to run a brute-force AI for a puzzle game, but I have a problem with my kernel code and/or the auxiliary function it calls. My kernel code is shown below (I'm confident the inputs are being passed in correctly). The global work size set by clEnqueueNDRangeKernel is 60.

The inputs to the kernel are as follows:

__global char * in //dummy input for testing purposes

__global char * board_in, // a large char array containing 60 boards

__global int * lookup, // an array that I use to quickly get the score of scoring moves

Outputs:

__global char * out, //dummy output for testing

__global int * score_out, //an array of 60 scores: one for each board

__global int * row_out, // an array of 60 rows: one for each board evaluated

__global int * col_out // an array of 60 cols: ...

__kernel void helloworld(__global char* in,
                    __global char* board_in,
                    __global int* lookup,
                    __global char* out, 
                    __global int * score_out,
                    __global int * row_out,
                    __global int * col_out)
{

    int num = get_global_id(0);
    char workingBoard[72];
    int scoreMat[64];
//set up the array for each thread to use
    for(int k=0; k< 72; k++)
    {
        workingBoard[k] = board_in[num*BOARDSIZE+k];
    }
// Make a copy of the score matrix for each thread to use
    for(int j=0; j<64; j++)
    {
        scoreMat[j] = lookup[j];
    }
    int s=0;
    int r=0;
    int c=0;
    findBestMove(workingBoard,scoreMat,&s,1,&r,&c);
    col_out[num] = ?????????
    score_out[num] = ???????????
    row_out[num] = ???????????????
}

The function findBestMove works like this (it's pretty well tested; I've used it in a CPU implementation for a while): it takes a board (char array), a score-lookup array, a pointer to the score of the move, the current depth, and pointers to the row and column. It is supposed to set the score, row, and column. It calls other functions that I define in the same document.

If I run this code snippet on the CPU, I get the proper output:

// workerBoard and lookuparr are set previous to this to be the same as what
//the kernel thread is supposed to have
int s=0;
int r=0;
int c=0;
findBestMove(workerBoard,lookuparr,&s,1,&r,&c);
cout<<s<<","<<r<<","<<c<<endl;

When I run my kernel code, I don't make it past the function call. The function is defined in the same document as the kernel, and doesn't use dynamic memory, function pointers, recursion, or global memory (outside of the kernel args). I do use some #define statements.

I want to set the ???? sections of my kernel to r, c and s, but as mentioned, I don't get that far. Am I making any critical mistakes? (Note: the kernel passes my code checker and AMD's Kernel Analyzer.) Also, I'm pretty new to OpenCL, so any tips are welcome as well. If I can provide any more information to help answer this question, let me know!


Solution

Based on your comments, it seems that the problem is somewhere in your findBestMove function. By the way, if you had an infinite loop, at some point the watchdog would trigger and your driver would most probably crash, leaving you with a black or frozen screen.

So I'd suggest that you comment out all the code in your function and just assign the r, s and c variables a known value, such as the ID of the work-item that handled them, obtained with the get_global_id function. Then, of course, replace the ??? placeholders with:

  col_out[num] = c;
  score_out[num] = s;
  row_out[num] = r;

If you get the expected values, you'll know for sure that the problem is inside the function, and you can start debugging it.
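The stub test described above could look roughly like the sketch below. The parameter list of findBestMove is only inferred from the call site in the question, the kernel name helloworld_stub is made up, and the shim at the top (including emu_global_id) is a hypothetical host-side stand-in that lets the same text compile as plain C for a quick check. An OpenCL compiler defines __OPENCL_VERSION__ and skips the shim.

```c
/* Host-side shim (assumption, not part of OpenCL): only used when this
   file is compiled as plain C instead of by an OpenCL compiler. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t emu_global_id; /* hypothetical: set by a host test loop */
static size_t get_global_id(unsigned int dim) { (void)dim; return emu_global_id; }
#endif

/* Stub with a parameter list inferred from the call in the question.
   The real body is commented out; every output is set to the work-item ID. */
void findBestMove(char *board, int *scoreMat, int *score,
                  int depth, int *row, int *col)
{
    int id = (int)get_global_id(0);
    (void)board; (void)scoreMat; (void)depth; /* unused while debugging */
    *score = id;
    *row = id;
    *col = id;
}

/* Reduced kernel: same call and same write-back as the original. */
__kernel void helloworld_stub(__global int *score_out,
                              __global int *row_out,
                              __global int *col_out)
{
    int num = (int)get_global_id(0);
    char workingBoard[72] = {0};
    int scoreMat[64] = {0};
    int s = 0, r = 0, c = 0;

    findBestMove(workingBoard, scoreMat, &s, 1, &r, &c);

    col_out[num] = c;
    score_out[num] = s;
    row_out[num] = r;
}
```

If the three output arrays read back on the host hold 0 to 59, the kernel launch, the call, and the write-back are working, and the body of findBestMove can be re-enabled a piece at a time.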

Since you asked for some tips, here is one that I think will improve performance (once you've fixed your bug :)): instead of using private memory for your scoreMat array, use local memory. That way you avoid having every work-item read the same data from global memory over and over (which is slow). To fetch the data from global into local memory you can use the async_work_group_copy function.

So in your case you'd have something like this:

__local int scoreMat[64];
// The destination (local) comes first, then the source (global)
event_t ev = async_work_group_copy(scoreMat, lookup, 64, 0);
// Wait to make sure everything is copied
wait_group_events(1, &ev);

You might need to change some more code to take into account that you now use local memory. For example, in OpenCL 1.x a function that receives scoreMat has to declare that parameter as a __local pointer. From the access point of view it works the same way as global memory, but it is much faster.

Note that the difference from what you have is that only one copy is made per work-group, not 60 (one per work-item). Also, the data fetched from global memory is now accessible to all the work-items within a work-group, whereas before each work-item had its own copy. It is important to stress that this sharing is within a work-group only. But since you are using only 60 work-items, you most probably have only one work-group.
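As a rough illustration of the local-memory variant, the sketch below copies the lookup table once per work-group and passes it to a helper through a __local pointer. The names copyDemo, scoreAt and LOOKUP_SIZE are made up for this example, and scoreAt is only a stand-in for findBestMove. The shim at the top is a hypothetical host-side emulation so the text also compiles as plain C; it copies on every call and does not model real work-group sharing.

```c
/* Host-side shim (assumption, not part of OpenCL): skipped by an
   OpenCL compiler, which defines __OPENCL_VERSION__. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#include <string.h>
#define __kernel
#define __global
#define __local
typedef int event_t;
static size_t emu_global_id; /* hypothetical: set by a host test loop */
static size_t get_global_id(unsigned int dim) { (void)dim; return emu_global_id; }
static event_t async_work_group_copy(int *dst, const int *src,
                                     size_t n, event_t ev)
{
    memcpy(dst, src, n * sizeof *dst); /* emulation: plain synchronous copy */
    return ev;
}
static void wait_group_events(int n, event_t *evs) { (void)n; (void)evs; }
#endif

#define LOOKUP_SIZE 64

/* Stand-in for findBestMove: note the __local qualifier on the pointer,
   which OpenCL 1.x requires when the table lives in local memory. */
int scoreAt(__local const int *scoreMat, int idx)
{
    return scoreMat[idx];
}

__kernel void copyDemo(__global const int *lookup, __global int *score_out)
{
    /* One table per work-group instead of one per work-item. */
    __local int scoreMat[LOOKUP_SIZE];

    /* Destination first, then source; every work-item in the group
       must reach this call with the same arguments. */
    event_t ev = async_work_group_copy(scoreMat, lookup, LOOKUP_SIZE, 0);
    wait_group_events(1, &ev);

    int num = (int)get_global_id(0);
    score_out[num] = scoreAt(scoreMat, num % LOOKUP_SIZE);
}
```

The design point is that the 64 table entries are read from global memory once per work-group, after which all work-items read them from local memory.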

Licensed under: CC-BY-SA with attribution