How to get a sorted subvector out of a sorted vector, fast

https://stackoverflow.com/questions/4308912

29-09-2019

Question

I have a data structure like this:

struct X {
  float value;
  int id;
};

and a vector of those, of size N (think 100000), sorted by value. It stays constant during the execution of the program:

std::vector<X> values;

Now, I want to write a function

void subvector(std::vector<X> const& values, 
               std::vector<int> const& ids, 
               std::vector<X>& out /*, 
               helper data here */);

that fills the out parameter with a sorted subset of values, given by the passed ids (size M < N, about 0.8 times N), and does so fast. Memory is not an issue, and this will be done repeatedly, so building lookup tables (the helper data from the function parameters) or doing anything else that happens only once is entirely OK.

My solution so far:
1. Build a lookup table lut mapping id -> offset in values (preparation, done once, so it does not count towards the per-call runtime)
2. Create std::vector<X> tmp of size N, filled with invalid ids (linear in N)
3. For each id, copy values[lut[id]] to tmp[lut[id]] (linear in M)
4. Loop over tmp, copying the valid items to out (linear in N)

This is linear in N (as N is bigger than M), but the temporary variable and the repeated copying bug me. Is there a way to do it quicker than this? Note that M will be close to N, so things that are O(M log N) are unfavourable.
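The steps above can be sketched roughly as follows. This is only an illustration, not the code behind the ideone link: it assumes ids are unique and lie in [0, N), so the lookup table can be a plain vector, and the names build_lut and kInvalidId are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct X {
    float value;
    int id;
};

// Hypothetical sentinel marking an unused slot in the scratch vector.
const int kInvalidId = -1;

// Built once: lut[id] is the offset of that id in `values`.
// Assumption: ids are unique and lie in [0, values.size()).
std::vector<std::size_t> build_lut(std::vector<X> const& values) {
    std::vector<std::size_t> lut(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(values[i].id >= 0 &&
               static_cast<std::size_t>(values[i].id) < values.size());
        lut[static_cast<std::size_t>(values[i].id)] = i;
    }
    return lut;
}

void subvector(std::vector<X> const& values,
               std::vector<int> const& ids,
               std::vector<X>& out,
               std::vector<std::size_t> const& lut) {
    // Scratch vector of size N with every slot marked invalid (linear in N).
    std::vector<X> tmp(values.size(), X{0.0f, kInvalidId});

    // Copy each requested element to its original offset (linear in M).
    for (int id : ids) {
        std::size_t const pos = lut[static_cast<std::size_t>(id)];
        tmp[pos] = values[pos];
    }

    // Collect the valid slots in order; they are still sorted by value (linear in N).
    out.clear();
    out.reserve(ids.size());
    for (X const& x : tmp) {
        if (x.id != kInvalidId) {
            out.push_back(x);
        }
    }
}
```

Call build_lut(values) once up front and pass the result to every subvector call.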

Edit: http://ideone.com/xR8Vp is a sample implementation of the algorithm described above, to make the desired output clear and to prove that it is doable in linear time. The question is about whether the temporary variable can be avoided, or whether the whole thing can be sped up in some other way. Something that is not linear is not faster :).

Solution

An alternative approach you could try is to use a hash table instead of a vector to look up ids in:

void subvector(std::vector<X> const& values, 
               std::unordered_set<int> const& ids, 
               std::vector<X>& out) {

    out.clear();
    out.reserve(ids.size());
    for(std::vector<X>::const_iterator i = values.begin(); i != values.end(); ++i) {
        if(ids.find(i->id) != ids.end()) {
            out.push_back(*i);
        }
    }
}

This runs in linear time, since unordered_set::find takes constant expected time (assuming that we have no problems hashing ints). However, I suspect it might not be as fast in practice as the approach you described initially using vectors.
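If the caller still has the ids as a std::vector<int>, as in the question's signature, the set has to be built first. A minimal sketch of such a wrapper is below; the name subvector_from_id_list is hypothetical, and the assert is only a debug-build sanity check.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

struct X {
    float value;
    int id;
};

// Hypothetical wrapper that keeps the question's std::vector<int> parameter.
void subvector_from_id_list(std::vector<X> const& values,
                            std::vector<int> const& ids,
                            std::vector<X>& out) {
    // Building the set costs expected O(M); duplicate ids collapse into one entry.
    std::unordered_set<int> const wanted(ids.begin(), ids.end());

    out.clear();
    out.reserve(wanted.size());
    for (X const& x : values) {
        if (wanted.count(x.id) != 0) {
            out.push_back(x);
        }
    }

    // A single pass over sorted input leaves the output sorted by value.
    assert(std::is_sorted(out.begin(), out.end(),
                          [](X const& a, X const& b) { return a.value < b.value; }));
}
```

Building the set on every call adds hashing work proportional to M, which is one reason this may lose to the plain-vector approach in practice.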

OTHER TIPS

Since your vector is sorted, and you want a subset of it sorted the same way, I assume we can just slice out the chunk you want without rearranging it.

Why not just use find_if() twice? Once to find the start of the range you want, and once to find the end of the range. This gives you the begin and end iterators of the subvector. Construct a new vector from those iterators; one of the vector constructor overloads takes two iterators.

That, or the partition algorithm, should work.
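A minimal sketch of the two find_if() calls, under the assumption this answer makes: the wanted elements form one contiguous run, here described by a value range [lo, hi). The helper name slice_by_value and its parameters are made up for the illustration. It does not cover the case where the ids select elements scattered across the vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct X {
    float value;
    int id;
};

// Hypothetical helper: copies the elements with lo <= value < hi
// out of a vector that is sorted by value.
std::vector<X> slice_by_value(std::vector<X> const& values, float lo, float hi) {
    assert(lo <= hi);

    // First call: start of the range.
    auto first = std::find_if(values.begin(), values.end(),
                              [lo](X const& x) { return x.value >= lo; });
    // Second call: end of the range, searching onward from the start.
    auto last = std::find_if(first, values.end(),
                             [hi](X const& x) { return x.value >= hi; });

    // The iterator-pair constructor copies [first, last).
    return std::vector<X>(first, last);
}
```

Because the vector is sorted, std::lower_bound could locate the same two positions in logarithmic rather than linear time.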

If I understood your problem correctly, you are actually trying to create a sorting algorithm that is linear in the number of inputs M. That is NOT possible.

Your current approach is to keep a sorted list of all possible values and scan it. This takes time linear in the number of possible values N (theoretically, given that the map lookup takes O(1) time).

The best you could do is to sort the values you found from the map with a fast sorting method (O(M log M), e.g. quicksort, mergesort etc.) for small values of M, and maybe do the linear scan for bigger values of M. For example, if N is 100000 and M is 100, it is much faster to just use a sorting algorithm.
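A rough sketch of the sorting variant for small M. It assumes a lookup table lut (id -> offset in values) has been built beforehand and that every id is a valid index into it; the function name is hypothetical. It sorts the offsets rather than the elements, which is a small variation that reproduces the order of values exactly, even among equal values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct X {
    float value;
    int id;
};

// Hypothetical O(M log M) variant: no pass over all N elements.
// Assumption: lut[id] is the offset of `id` in `values`.
void subvector_by_sorting(std::vector<X> const& values,
                          std::vector<int> const& ids,
                          std::vector<X>& out,
                          std::vector<std::size_t> const& lut) {
    // Translate each id into its offset in the sorted vector (linear in M).
    std::vector<std::size_t> positions;
    positions.reserve(ids.size());
    for (int id : ids) {
        assert(id >= 0 && static_cast<std::size_t>(id) < lut.size());
        positions.push_back(lut[static_cast<std::size_t>(id)]);
    }

    // Sorting the offsets orders the selection by value, since `values` is sorted.
    std::sort(positions.begin(), positions.end());

    out.clear();
    out.reserve(positions.size());
    for (std::size_t pos : positions) {
        out.push_back(values[pos]);
    }
}
```

With M close to N, as in the question, the log factor makes this slower than the linear scan; it only pays off when M is much smaller than N.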

I hope you can understand what I say. If you still have questions I will try to answer them :)

edit: (comment) I will further explain what I mean. Say you know that your numbers range from 1 to 100. You have them sorted somewhere (actually they are "naturally" sorted) and you want to get a subset of them in sorted form. If it were possible to do this faster than O(N) or O(M log M), sorting algorithms would just use this method to sort.

For example, given the set of numbers {5,10,3,8,9,1,7}, and knowing that they are a subset of the sorted set of numbers {1,2,3,4,5,6,7,8,9,10}, you still can't sort them faster than O(N) (N = 10) or O(M log M) (M = 7).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow