OpenCL: how to copy contiguous chunk of global mem to private mem?

https://stackoverflow.com/questions/21109638

27-09-2022
Question

How can I copy a contiguous chunk of constant global memory to (a contiguous chunk of) private memory? I need something like memcpy, but then for copying bytes between the different OpenCL address spaces. I know the size of the chunk and data is stored contiguously in global and local/private mem, so, in general, this should be possible, right?

In my specific problem, I have a constant global array of a struct type containing ints, floats, and even another struct type. To avoid pulling each member of the structs separately from global mem (which is slow), I'd like to have a copy of a complete array element in private memory. Unfortunately, doing something like privatestruct = globalstruct[i] does not result in a deep copy of the complete struct.

Of course I'm not the first to ask this or a similar question, so there are a couple of threads on Stack Overflow discussing related issues. However, practically all answers suggest using async_work_group_copy, which cannot be the generic answer since it is defined only for clean built-in data types, not mixed structs, structs of structs, or any (e.g. bit-wise) user-defined memory interpretation. And it's meant for local mem anyway.

Thanks a lot for any suggestions!!

Solution

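1) Make your struct size a multiple of 4 bytes. For example, if it is 125 bytes long, you can add three char members (or a char pad[3]) to get a 128-byte struct. Note that a char3 would not do here: in OpenCL a 3-component vector has the size and alignment of the 4-component one, so char3 occupies 4 bytes.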

2) Reorder the struct so that the biggest members (or those whose size is a multiple of 4 bytes) come first, at the "head", and the smaller, odd-sized ones come last, at the "tail". This reduces padding, so the struct needs fewer memory access operations.

3) As DarkZeros mentioned, you may try to fetch the struct with async_work_group_copy (casting to a long16 or similar if the struct is too big) and then carry the values over to private memory element-wise. There are many cache lines available for this, so copying from local to private should be fast enough. (Don't forget to synchronize before and after the transfers.)

4) Pack small variables into bigger ones until they fill a cache line, so that the line's bandwidth is not wasted under heavy cache usage.

However, if you need to copy "a single" struct to "all cores" of a thread group, you can copy it element-wise, because some newer GPUs have broadcasting technology, which can be the fastest option in such scenarios. If your algorithm does get a speedup, please report it later, as a multiplier.

5) Sometimes heavy branching can ruin performance and mask those memory latencies when benchmarking (in a bad way, of course).
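
Points 1) and 2) can be checked on the host before any kernel is written. The following is only a minimal sketch in plain C, assuming a typical ABI with 4-byte int and float; the struct names, their members, and the helper words_needed are hypothetical and not taken from the question. Real host code would normally use the cl_int, cl_float and cl_uchar types from the OpenCL headers so that the layout matches the kernel side.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof, size_t */
#include <stdint.h>  /* fixed-width integer types */

/* Hypothetical nested struct standing in for the "struct inside a struct". */
typedef struct {
    float x, y, z;               /* 12 bytes */
} Inner;

/* Members in arbitrary order: on typical ABIs the compiler inserts
   3 padding bytes after each single-byte member. */
typedef struct {
    uint8_t flag;
    int32_t id;
    uint8_t tag;
    Inner   pos;
    float   weight;
    uint8_t kind;
} Unordered;

/* Same members, 4-byte-aligned ones first, single bytes at the tail,
   plus one explicit padding byte so the size is a multiple of 4. */
typedef struct {
    Inner   pos;
    int32_t id;
    float   weight;
    uint8_t flag;
    uint8_t tag;
    uint8_t kind;
    uint8_t pad;                 /* explicit tail padding */
} Ordered;

/* Fail the build if the layout drifts from what the kernel expects. */
static_assert(sizeof(Ordered) % 4 == 0, "size must be a multiple of 4");
static_assert(offsetof(Ordered, flag) == 20, "bytes must follow the 4-byte members");

/* Number of 4-byte words needed to move `bytes` bytes. */
static size_t words_needed(size_t bytes) {
    return (bytes + 3) / 4;
}
```

Under the stated assumption, sizeof(Unordered) is 32 and sizeof(Ordered) is 24, so words_needed gives 8 versus 6 four-byte words per array element. The static assertions are there so that a layout mismatch shows up at compile time rather than as corrupted data in the kernel.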

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow