Answering my own question:
So in GPU jargon, performing many instances of the same task 'x' in one batched call is called segmented 'x' *); in this case I would have liked a segmented merge. So far there is no production-ready algorithm for that, therefore I will use segmented sort instead. It is still very fast, but I am throwing overboard the fact that the inner arrays are already sorted; in exchange I can now process 2000 of the datasets in parallel.
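To make the idea concrete, here is a minimal CPU sketch with NumPy (names like `seg_ids` are my own, not from any GPU library): tag every element with the id of the dataset it belongs to, then one lexicographic sort over (segment id, value) sorts all segments independently in a single call. This is the semantics a GPU segmented sort provides; a real implementation does it far more efficiently on-device.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2000 datasets (segments) of varying length, flattened into one array.
seg_lengths = rng.integers(5, 50, size=2000)
values = rng.random(seg_lengths.sum())

# Segment id for every element, e.g. [0, 0, 0, 1, 1, 2, ...].
seg_ids = np.repeat(np.arange(len(seg_lengths)), seg_lengths)

# One stable lexicographic sort over (segment id, value) sorts every
# segment independently -- the essence of a segmented sort.
order = np.lexsort((values, seg_ids))
sorted_values = values[order]
```

Because the sort is stable and keyed primarily on the segment id, the segments stay contiguous and each one comes out internally sorted.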
An alternative would be localitySort, which does exploit the fact that there are regions which are already sorted. However, lifting this function to operate over 2000 datasets eludes me for now; perhaps some pre- and post-processing, combined with the key-value variant of localitySort (which is also available), will yield results.
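One way the pre/post-processing could look (a sketch of my own, not a tested localitySort recipe): pack the dataset id into the high bits of a composite key, so that a single global sort both keeps datasets apart and merges the presorted runs inside each one. Since each inner array is already sorted, the composite key stream is a sequence of ascending runs, which is exactly the input pattern a locality-aware sort is fast on. Emulated on the CPU with NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)

n_datasets = 4
runs_per_dataset = 2  # each dataset holds two presorted inner arrays

# Build the flattened input: per dataset, two sorted runs back to back.
chunks, ids = [], []
for d in range(n_datasets):
    for _ in range(runs_per_dataset):
        run = np.sort(rng.integers(0, 2**20, size=rng.integers(4, 10)))
        chunks.append(run)
        ids.append(np.full(len(run), d, dtype=np.uint64))
values = np.concatenate(chunks).astype(np.uint64)
seg_ids = np.concatenate(ids)

# Pre-processing: pack (dataset id, value) into one 64-bit key.
keys = (seg_ids << np.uint64(32)) | values

# Stand-in for the key-value locality sort on the GPU.
order = np.argsort(keys, kind="stable")

# Post-processing: mask the dataset id back off; each dataset's
# runs are now merged, and the datasets stay contiguous.
merged = keys[order] & np.uint64(0xFFFFFFFF)
```

The same `order` permutation can then be applied to a payload array, which is where the key-and-value variant comes in.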
*) I learned this from the very helpful Udacity lectures on GPU programming.