A fast, rank-based radix sort for floats?

Question 1

You can do either of the following:

Expand each item to include its original index (this can be done during the first counting pass). The index bits travel with the item but are ignored when extracting digits for sorting.
Store indices in the buckets instead of values, and look up the value through its index each time a digit is required.

The first option takes more space but has better locality of reference; the second saves space at the cost of an indirect memory access for every digit lookup.
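As a rough illustration of the second option, here is a minimal sketch that moves only indices between buckets and reads each key through its index. It assumes 32-bit unsigned keys and four 8-bit passes; the name argsort_radix is made up for this example and does not come from any library.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the permutation that sorts `keys` ascending.
// Only indices are moved; keys are looked up through the index each time
// a digit is needed (the second option described above).
std::vector<std::size_t> argsort_radix(const std::vector<std::uint32_t>& keys)
{
    const std::size_t n = keys.size();
    std::vector<std::size_t> idx(n), tmp(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});   // idx = 0, 1, ..., n-1

    for (unsigned shift = 0; shift < 32; shift += 8) {
        // start[d + 1] first holds the number of keys whose current digit is d
        std::array<std::size_t, 257> start{};
        for (std::size_t i : idx)
            ++start[((keys[i] >> shift) & 0xFFu) + 1];

        // prefix sum: start[d] becomes the first output slot of bucket d
        for (std::size_t d = 0; d < 256; ++d)
            start[d + 1] += start[d];

        // stable scatter of the indices into their buckets
        for (std::size_t i : idx)
            tmp[start[(keys[i] >> shift) & 0xFFu]++] = i;

        idx.swap(tmp);
    }
    return idx;   // keys[idx[0]] <= keys[idx[1]] <= ...
}
```

The keys array is never modified, so the returned permutation can be used directly to reorder any parallel data.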

Question 2

It is fairly straightforward to make any comparison sort index-based. Such a sort is a series of comparisons and swaps, so route both through an index array:

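// data to be sorted is in data[ 0 .. n ], i.e. n + 1 elements
std::vector<int> index( n + 1 );  // needs <vector>; int index[ n + 1 ] would be a non-standard VLA in C++
for( int i = 0; i <= n; i++ ) index[i] = i;
// To compare two elements, compare data[index[j]] < data[index[k]]
// To swap two elements, swap index[j] <=> index[k]; data itself never moves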
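The same recipe can be sketched with the standard library, assuming the data is a std::vector<float>. The helper names sorted_order and ranks_from_order are invented for this example; the second one inverts the permutation to get the rank of each original element, which is presumably what a rank-based sort is after.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sorts indices instead of values.
// order[r] is the position in `data` of the r-th smallest element.
std::vector<std::size_t> sorted_order(const std::vector<float>& data)
{
    std::vector<std::size_t> index(data.size());
    std::iota(index.begin(), index.end(), std::size_t{0});
    // Comparisons go through the index; only the indices are swapped.
    std::stable_sort(index.begin(), index.end(),
                     [&data](std::size_t j, std::size_t k) { return data[j] < data[k]; });
    return index;
}

// Hypothetical helper: inverts the permutation, so rank[i] is the
// position that data[i] would take in the sorted sequence.
std::vector<std::size_t> ranks_from_order(const std::vector<std::size_t>& order)
{
    std::vector<std::size_t> rank(order.size());
    for (std::size_t r = 0; r < order.size(); ++r)
        rank[order[r]] = r;
    return rank;
}
```

std::stable_sort is used here so that equal values keep their original relative order; std::sort would also work if that does not matter.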

Question 3

I am not familiar with those implementations, but here is the inner function in one of my implementations, for integers only:

//-------------------------------------------------------------------------------------
//! sort the source array based on b-th byte and store the result in destination array
//! and keep index (how to go from the sorted array to the un-sorted)

template<typename T, typename SS, typename SD> inline
void radix_sort_byte(size_t b, array<T, SS>& src, array<T, SD>& dest,
             size_array& ind_src, size_array& ind_dest)
{
    b *= 8;
    size_t B = 256, N = src.size();

    size_array bytes = (src >> b) & 0xff;  // current byte of each element
    size_array count(B, size_t(0));  // occurrences of each byte value
    ++count[bytes];

    if(count[0] == N)  // all bytes are zero; order remains unchanged
        { dest = src; ind_dest = ind_src; return; }

    size_array index = shift(cumsum(count), 1);  // index-list for each element
    size_array pos(N);  // position of each element in the destination array
    for(size_t i=0; i<N; i++) pos[i] = index[bytes[i]]++;

    dest[pos] = src;  // place elements in the destination array
    ind_dest[pos] = ind_src;  // place indices
}

The code is not easy to read in isolation because it relies on many auxiliary structures and functions, but the idea is to keep a separate array of indices alongside the values. Once the position of each element in the destination array (pos) is known, the last two lines apply exactly the same permutation to the value array and the index array.

I guess the same idea can be applied to any radix sort implementation, but you would have to modify its code so that the indices move together with the values.
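For readers without the auxiliary array library, the byte pass can be approximated in plain C++ as follows. This is only a sketch under stated assumptions: keys are 32-bit unsigned integers, the containers are std::vector, and the names radix_pass and radix_sort_indexed are invented here. It is a stand-in for the routine above, not the same code, and it omits the all-zero-byte shortcut.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the byte pass above: stable counting sort of
// src on byte b, writing values to dest and their indices to ind_dest.
void radix_pass(std::size_t b,
                const std::vector<std::uint32_t>& src, std::vector<std::uint32_t>& dest,
                const std::vector<std::size_t>& ind_src, std::vector<std::size_t>& ind_dest)
{
    const unsigned shift = static_cast<unsigned>(8 * b);

    std::array<std::size_t, 256> count{};            // occurrences of each byte value
    for (std::uint32_t v : src)
        ++count[(v >> shift) & 0xFFu];

    std::size_t sum = 0;                             // exclusive prefix sum:
    for (std::size_t& c : count) {                   // count[d] becomes the first
        const std::size_t n = c;                     // destination slot for byte d
        c = sum;
        sum += n;
    }

    for (std::size_t i = 0; i < src.size(); ++i) {
        const std::size_t pos = count[(src[i] >> shift) & 0xFFu]++;
        dest[pos] = src[i];                          // place the value
        ind_dest[pos] = ind_src[i];                  // place its index the same way
    }
}

// Hypothetical driver: sorts `values` in place and returns, for each sorted
// position, the index the element had in the unsorted input.
std::vector<std::size_t> radix_sort_indexed(std::vector<std::uint32_t>& values)
{
    const std::size_t n = values.size();
    std::vector<std::uint32_t> tmp(n);
    std::vector<std::size_t> ind(n), ind_tmp(n);
    std::iota(ind.begin(), ind.end(), std::size_t{0});

    // Four byte passes, ping-ponging between the two buffer pairs so the
    // final result ends up back in `values` and `ind`.
    for (std::size_t b = 0; b < 4; b += 2) {
        radix_pass(b,     values, tmp, ind, ind_tmp);
        radix_pass(b + 1, tmp, values, ind_tmp, ind);
    }
    return ind;
}
```

Because each pass is stable, equal keys keep their original relative order, so the returned index array is well defined even when the input contains duplicates.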