I am not familiar with those implementations, but here is the inner function in one of my implementations, for integers only:
//-------------------------------------------------------------------------------------
//! sort the source array based on b-th byte and store the result in destination array
//! and keep index (how to go from the sorted array to the un-sorted)
template<typename T, typename SS, typename SD> inline
void radix_sort_byte(size_t b, array<T, SS>& src, array<T, SD>& dest,
size_array& ind_src, size_array& ind_dest)
{
b *= 8;
size_t B = 256, N = src.size();
size_array bytes = (src >> b) & 0xff; // current byte of each element
size_array count(B, size_t(0)); // occurrences of each element
++count[bytes];
if(count[0] == N) // all bytes are zero; order remains unchanged
{ dest = src; ind_dest = ind_src; return; }
size_array index = shift(cumsum(count), 1); // index-list for each element
size_array pos(N); // position of each element in the destination array
for(size_t i=0; i<N; i++) pos[i] = index[bytes[i]]++;
dest[pos] = src; // place elements in the destination array
ind_dest[pos] = ind_src; // place indices
}
It is not directly readable because it uses lots of auxiliary structures and functions, but the idea is that you keep a separate array with the indices. Once you have the position of elements in the destination array (pos), the last two lines update the value array and index array in exactly the same way.
I guess you can apply the same idea to any implementation, but you'd have to modify the code.