Question

I have 2D data structures of the form:

{{t1,1; ...; t1,500}, ..., {t31,1; ...; t31,500}}

thus making around 15,000 items per data structure (the range 1-500 can vary a lot and is only indicative; the 31 is fixed). I get 2000 of these data structures every 100 ms; they represent measurement data.

The tx,y are simple integers. It is guaranteed that, e.g., t1,1 < t1,2, but it is not in general true that t1,1 < t2,1.

We want to merge the tx,y of each 2D structure into an (ordered!) 1D structure, for all 2000 structures arriving in each batch.
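
To make the layout concrete, a miniature example (my own made-up numbers, 3 rows of 4 instead of 31 rows of ~500):

```cpp
// Each row is sorted internally; rows are not sorted relative to each other.
int rows[3][4] = {
    { 1,  4,  9, 12},   // t1,*
    { 0,  5,  6, 20},   // t2,*
    { 2,  3, 10, 11},   // t3,*
};
// The desired merged (ordered!) 1D structure:
// { 0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 20 }
```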

We currently have a fairly fast sequential CPU merge algorithm, but we're looking into GPGPU possibilities. A single batch of 15,000 items probably isn't worth the overhead on its own, which I can confirm with initial OpenCL sorting algorithms. However, are there straightforward strategies for lifting a GPGPU kernel that operates on 1 dataset (in this context, one 2D data structure) to run in parallel on 2000 datasets, so as to fully utilise the GPU?

I have read about concurrent kernels, but current platforms only allow around 4-16 concurrent kernels. Concurrent kernels are also a much more generalized solution, since they allow different kinds of kernels, whereas in this case we need the same kernel run over 2000 data structures: something like SKMD, single kernel multiple data (*). However, I'm unaware of parallelism at this level, or am I really missing something obvious about fully utilising a GPU in this context?
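
To give an idea of the kind of lifting I mean, a deliberately naive sketch (hypothetical names and layout, not our production code): a single kernel launch where every thread performs the complete 31-way merge for one dataset, so the 2000 datasets alone provide the parallelism:

```cuda
#include <cuda_runtime.h>

#define NUM_ROWS 31   // fixed by the problem statement

// values:     all rows of all datasets concatenated back to back
// rowOffsets: start index of every row; numDatasets * NUM_ROWS + 1 entries
// out:        merged output, one contiguous region per dataset
__global__ void mergeOnePerThread(const int *values, const int *rowOffsets,
                                  int *out, int numDatasets)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d >= numDatasets) return;

    int cur[NUM_ROWS], end[NUM_ROWS];
    for (int r = 0; r < NUM_ROWS; ++r) {
        cur[r] = rowOffsets[d * NUM_ROWS + r];
        end[r] = rowOffsets[d * NUM_ROWS + r + 1];
    }
    int o = rowOffsets[d * NUM_ROWS];            // dataset d's output region
    for (;;) {
        int best = -1;
        for (int r = 0; r < NUM_ROWS; ++r)       // pick the smallest head
            if (cur[r] < end[r] &&
                (best < 0 || values[cur[r]] < values[cur[best]]))
                best = r;
        if (best < 0) break;                     // all rows exhausted
        out[o++] = values[cur[best]++];
    }
}
// launch, e.g.: mergeOnePerThread<<<(2000 + 255) / 256, 256>>>(...)
```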

I'm desperate for any help, so suggestions based on either proprietary or open standards are welcome.

(*) Yes, I made that up to give an idea, hopefully.


Solution

Answering my own question:

So in GPU jargon, doing many instances of the same task 'x' is called segmented 'x' (*); in this case I would have liked a segmented merge. So far there are no production-ready algorithms for that, therefore I will be using segmented sort. It is still very fast, even though I'm throwing overboard the fact that the inner arrays are already sorted, and I can now do 2000 of the datasets in parallel.
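
For illustration, a minimal sketch (my own, with assumed names) of how a segmented sort can even be emulated with plain Thrust when the values fit in 32 bits: pack the dataset index into the high half of a 64-bit key, sort once globally, and strip the index again. One sort call then orders all 2000 datasets at once, each within its own segment:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <cstdint>

// seg[i] = which of the 2000 datasets element i belongs to,
// val[i] = the measurement itself (assumed to fit in an unsigned 32-bit int)
void segmentedSort(thrust::device_vector<uint32_t>& seg,
                   thrust::device_vector<uint32_t>& val)
{
    thrust::device_vector<uint64_t> packed(val.size());
    thrust::transform(seg.begin(), seg.end(), val.begin(), packed.begin(),
                      [] __device__ (uint32_t s, uint32_t v) {
                          return (uint64_t(s) << 32) | v;  // segment id in high bits
                      });
    thrust::sort(packed.begin(), packed.end());            // one global sort
    thrust::transform(packed.begin(), packed.end(), val.begin(),
                      [] __device__ (uint64_t k) {
                          return uint32_t(k & 0xffffffffu); // strip the segment id
                      });
}
```

(The device lambdas require compiling with nvcc's --extended-lambda flag; a dedicated segmented sort such as the one in moderngpu avoids the key packing altogether.)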

An alternative would be to use localitySort, which does make use of the fact that there are regions that are already sorted. However, lifting this function to operate over 2000 datasets eludes me for now; perhaps pre- and post-processing, combined with the variant of localitySort that sorts keys and values (which is also available), will yield results.
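
One concrete form the pre/post-processing could take (a sketch under my own assumptions, not tested at our data rates): carry the dataset index along as the value, sort everything by measurement in one pass, then stable-sort by dataset index. Stability preserves the order established by the first pass, so every dataset comes out contiguous and internally sorted. The first pass is where a locality-aware sort could exploit the presorted inner arrays; Thrust's general stable sort stands in for it here:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// measurements: all 2000 datasets concatenated
// datasetId:    for each element, which dataset (0..1999) it came from
void sortAllDatasets(thrust::device_vector<int>& measurements,
                     thrust::device_vector<int>& datasetId)
{
    // Pass 1: global sort by measurement value, dragging dataset ids along.
    // (A locality sort could be substituted here to exploit the sorted rows.)
    thrust::stable_sort_by_key(measurements.begin(), measurements.end(),
                               datasetId.begin());
    // Pass 2: stable sort by dataset id. Stability keeps each dataset's
    // measurements in the sorted order from pass 1, so each dataset ends up
    // contiguous and sorted.
    thrust::stable_sort_by_key(datasetId.begin(), datasetId.end(),
                               measurements.begin());
}
```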

(*) This information reached me via the very helpful Udacity lectures about GPU programming.
