Question

I use PyOpenCL 2013.1, and my code crashes on an NVIDIA GPU, an AMD CPU and an AMD GPU, but works on an Intel CPU.

On the NVIDIA GPU, the error is raised by queue.finish() after calling the kernel:

LogicError: clFinish failed: invalid command queue

I located the cause at line 48 in the following snippet.

1:  typedef struct {
2:    int global_index;
3:    int local_index;
4:    float speed_limit;
5:    float width;
6:  } segment_t;
7:  
8:  typedef struct {
9:      int item_count;
10:     segment_t first_item;
11: } segment_list_t;
12: 
13: void explode_segment_list_t(segment_list_t* list, segment_t** array)
14: {
15:     array[0] = &(list->first_item);
16: }
17: 
18: 
19: 
20: /*
21:  * ro_data is read-only array of 3316 byte (829 int)
22:  * wo_data is write-only array of 3316 byte (829 int)
23:  */
24: __kernel void test_kernel(global int* ro_data, global int* wo_data)
25: {
26:     unsigned int i = get_global_id(0);
27:     
28:     // copy uncasted, primitive types
29:     for(int index = 0; index < ro_data[0]; index++)
30:         wo_data[index] = ro_data[index];                 // this works
31:     
32:     // access casted local struct
33:     int temp[829] = {0};
34:     segment_list_t* casted_temp_list = (segment_list_t*)temp;
35:     casted_temp_list->item_count = 1337;                 // this works
36:     // do more tests
37:     segment_t* casted_temp_array;
38:     explode_segment_list_t(casted_temp_list, &casted_temp_array);
39:     casted_temp_array[1].global_index = 1;
40:     casted_temp_array[2].global_index = 2;               // this works
41:     
42:     // copy local data to global data
43:     for(int index = 0; index < ro_data[0]; index++)
44:         wo_data[index] = temp[index];                    // this works
45:         
46:     // access casted global memory
47:     segment_list_t* casted_wo_data = (segment_list_t*)wo_data;
48:     casted_wo_data->item_count = 42;                     // this fails on GPU but works on CPU
49:     
50: }

Executable pyopencl.capture_call here

An ugly, memory-wasting workaround would be to allocate a local array, copy the data into it and then cast that array. But I'm sure I did something wrong here... but what?

Thanks for your help!

EDIT: On AMD devices (CPU and GPU) it fails with a more informative message:

 error: invalid type conversion
      segment_list_t* casted_wo_data = (segment_list_t*)wo_data;

Solution

The error message from the AMD SDK helped to find the cause.

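As described in a forum posting, the pointer type you cast to must carry the same address space qualifier as the source pointer. wo_data is declared global, but in OpenCL C 1.x a pointer declared inside a function without a qualifier defaults to the private address space, so the unqualified cast tries to change the address space, which is not allowed.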

In this example:

segment_list_t* casted_wo_data = (segment_list_t*)wo_data;

has to be replaced with:

global segment_list_t* casted_wo_data = (global segment_list_t*)wo_data;
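
For reference, here is a trimmed-down sketch of the kernel with the fix applied, wrapped in a small PyOpenCL host script. It is not the original test case: the kernel body is shortened, the `const` on `ro_data`, the helper name `run_on_device`, the single work item and the host-side buffer setup are all assumptions made for illustration. Running `run_on_device` requires pyopencl, numpy and a working OpenCL device.

```python
# Kernel source with the address space qualifier kept in the cast.
KERNEL_SRC = r"""
typedef struct {
    int global_index;
    int local_index;
    float speed_limit;
    float width;
} segment_t;

typedef struct {
    int item_count;
    segment_t first_item;
} segment_list_t;

__kernel void test_kernel(__global const int* ro_data, __global int* wo_data)
{
    // Copy the input; ro_data[0] holds the number of ints to copy.
    for (int index = 0; index < ro_data[0]; index++)
        wo_data[index] = ro_data[index];

    // The cast keeps the __global qualifier of wo_data.
    __global segment_list_t* casted_wo_data = (__global segment_list_t*)wo_data;
    casted_wo_data->item_count = 42;
}
"""


def run_on_device(n_ints=829):
    """Build and run the kernel (hypothetical helper, not from the question)."""
    # Imported lazily so the kernel source can be inspected without an OpenCL setup.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    program = cl.Program(ctx, KERNEL_SRC).build()

    # Assumed input: first int is the element count, the rest are zeros.
    ro_host = np.zeros(n_ints, dtype=np.int32)
    ro_host[0] = n_ints
    wo_host = np.empty_like(ro_host)

    mf = cl.mem_flags
    ro_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=ro_host)
    wo_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=wo_host.nbytes)

    # One work item is enough for this test.
    program.test_kernel(queue, (1,), None, ro_buf, wo_buf)
    cl.enqueue_copy(queue, wo_host, wo_buf)
    queue.finish()
    return wo_host
```

If the kernel runs, the first element of the returned array should be the 42 written through the casted pointer. Note that a helper like explode_segment_list_t, which takes unqualified pointers, still only accepts private data; to use it on wo_data its parameters would need the global qualifier as well.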

OTHER TIPS

I dunno what you did wrong.

But I can tell you that the way you are doing things is not the proper way to go and it will cause you a lot of trouble.

You should declare the inputs and outputs of your kernel as segment_t and segment_list_t. You can declare these typedefs in the kernel code and in the Python code, then create the buffers of the appropriate types. This way you don't need to cast anything, and you avoid any possible out-of-bounds errors.
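
As a rough illustration of mirroring the typedefs on the Python side, the sketch below describes the two structs with the standard-library ctypes module. The class names, the `make_segment_list_type` helper and the sample values are made up for this example, and it assumes the device uses 4-byte int and float with no padding. With PyOpenCL, numpy structured dtypes are the more usual route; ctypes is used here only to keep the sketch dependency-free.

```python
import ctypes


class Segment(ctypes.Structure):
    """Host-side mirror of segment_t from the kernel source."""
    _fields_ = [
        ("global_index", ctypes.c_int32),
        ("local_index", ctypes.c_int32),
        ("speed_limit", ctypes.c_float),
        ("width", ctypes.c_float),
    ]


def make_segment_list_type(capacity):
    """Mirror of segment_list_t with room for `capacity` segments (hypothetical helper)."""
    class SegmentList(ctypes.Structure):
        _fields_ = [
            ("item_count", ctypes.c_int32),
            ("items", Segment * capacity),
        ]
    return SegmentList


# The question uses buffers of 3316 bytes (829 ints).
BUFFER_BYTES = 3316
capacity = (BUFFER_BYTES - ctypes.sizeof(ctypes.c_int32)) // ctypes.sizeof(Segment)
SegmentList = make_segment_list_type(capacity)

# Fill the list by field name instead of by raw int offsets.
host_list = SegmentList()
host_list.item_count = 2
host_list.items[0].global_index = 7
host_list.items[1].speed_limit = 50.0

# The raw bytes are what would be uploaded to the device buffer.
raw = bytes(host_list)

# Viewing the same bytes as plain ints shows where the fields land.
as_ints = (ctypes.c_int32 * (BUFFER_BYTES // 4)).from_buffer_copy(raw)
```

Under these assumptions the header plus 207 segments fill the 3316-byte buffer exactly, so checking `ctypes.sizeof(SegmentList)` against the buffer size on the host catches layout mismatches before the kernel ever runs.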

Licensed under: CC-BY-SA with attribution