Question

I have two sorted arrays, Haystack and Needles. I need to iterate over Needles, and each time find the first point in Haystack with a value larger than that Needle, in order to perform the next step.

For example:

double dHaystack[] = { 1.2, 2.6, 7.0, 9.3, 19.4 };
double dNeedles[]  = { 1.4, 6.4, 6.5, 7.0, 10.3 };

//  expected indices     0    1    1    2    3

So the index I should get is the last index whose value is equal to or lower than the needle value (one before the first element that is larger).

The obvious approach is just to iterate from the beginning of the haystack for each needle, or to iterate onward from the last found index (as Needles is also sorted).

But part of my brain is shouting "bisection!". Would a bisection actually be faster here, since the compiler will find it harder to optimise than a simple block read and iteration? Would it need an incredibly long Haystack to be worthwhile?

Solution

You need to compare the costs of the two approaches:

n * lg(m) < n + m,

where n is the size of Needles and m is the size of Haystack. The left side is the cost of binary-searching the haystack once per needle; the right side is the cost of a single merge-style pass over both sorted arrays. Bisection only pays off when the inequality holds, so it all depends on the particular values of n and m.
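For a sense of the n + m side of that trade-off, here is a minimal sketch of the single merge-style pass, assuming both arrays are sorted ascending and every needle is at least as large as the first haystack value (as in the question's example); the function name is illustrative:

#include <cstddef>
#include <iostream>

void mergeWalk(const double* haystack, std::size_t m,
               const double* needles, std::size_t n) {
  std::size_t h = 0;  // never moves backwards, so the whole run is O(n + m)
  for (std::size_t i = 0; i < n; ++i) {
    // Advance until haystack[h] is the first value larger than the needle.
    while (h < m && haystack[h] <= needles[i]) ++h;
    // h - 1 is then the last index whose value is <= the needle.
    std::cout << needles[i] << " at index " << (h - 1) << '\n';
  }
}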

OTHER TIPS

Use std::upper_bound, which is O(log n) for random access iterators and provides exactly what you need in the shortest and simplest code.

Before you worry about minute performance, test your current code (and maybe test alternatives) instead of making assumptions. In particular, note you can start searching (the first parameter to upper_bound) from the last found index on each iteration.

#include <algorithm>
#include <iostream>

// Available in Boost, C++0x, and many other places.  Implementation copied
// here for the sake of the example.
template<class T, int N>
T* end(T (&a)[N]) {
  return a + N;
}

void example() {
  double haystack[] = {1.2, 2.6, 7.0, 9.3, 19.4};
  double needles[] = {1.4, 6.4, 6.5, 7.0, 10.3};
  double *begin = haystack;
  for (double *n = needles; n != end(needles); ++n) {
    // First element strictly greater than *n, searching only the part of
    // the haystack at or after the previous hit (both arrays are sorted).
    double *found = std::upper_bound(begin, end(haystack), *n);
    if (found == end(haystack)) break;
    // upper_bound yields the first larger element; subtract one to get the
    // question's expected "last index whose value is <= the needle".
    std::cout << *n << " at index " << (found - haystack - 1) << '\n';
    begin = found;
  }
}

The obvious approach is just to iterate... onward from the last found index (as Needles is also sorted).

Yes.

But part of my brain is shouting "bisection!". Would a bisection actually be faster here, since the compiler will find it harder to optimise than a simple block read and iteration? Would it need an incredibly long Haystack to be worthwhile?

I don't think compiler optimisation is the issue (optimisation just removes unnecessary work) so much as the amount of actual, inherent work. If both sets are similar in size, I'd stick with the obvious approach. If the haystack is massively larger than the needle set, then bisection or even interpolation search might yield slightly better performance. Unless this is crucial to your app you're unlikely to notice the difference, and if it is, you should benchmark, particularly as you can presumably get a working implementation up quickly using std::set and upper_bound or lower_bound (I can never remember which I'll need; I don't use them often enough), maybe using the last position as a hint at the starting location if your library supports that.
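For illustration, here is a minimal sketch of that std::set idea (a std::multiset stands in for std::set in case the haystack contains duplicates; the names and setup are assumptions for the example, not from the original answer):

#include <iostream>
#include <set>

int main() {
  std::multiset<double> haystack = {1.2, 2.6, 7.0, 9.3, 19.4};
  double needles[] = {1.4, 6.4, 6.5, 7.0, 10.3};
  for (double needle : needles) {
    // The member upper_bound finds the first element strictly greater
    // than the needle in O(log m) time.
    auto it = haystack.upper_bound(needle);
    if (it == haystack.end()) break;
    std::cout << needle << " -> first larger value " << *it << '\n';
  }
}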

std::upper_bound will give you the iterator to the first element strictly greater than the value, or the end of the collection if no element qualifies.

upper_bound takes iterators for begin and end, end being one past the last element of the collection. If you are iterating through an increasing list of search values, you of course need not run through the entire collection each time: your "begin" can shift further to the right with each search.
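A minimal sketch of those semantics, using the question's values:

#include <algorithm>
#include <iostream>

int main() {
  double haystack[] = {1.2, 2.6, 7.0, 9.3, 19.4};
  double* last = haystack + 5;
  // The first element strictly greater than 7.0 is 9.3:
  std::cout << *std::upper_bound(haystack, last, 7.0) << '\n';
  // Nothing is greater than 25.0, so the end iterator comes back:
  std::cout << (std::upper_bound(haystack, last, 25.0) == last) << '\n';
}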

Of course, with a haystack of just 5 elements it doesn't really matter which search algorithm you use, but if it got very large, a linear search could be very slow, particularly if there were very few needles.

This is a situation where it really does matter what both sizes are. For example, if your search space N is large but the number of items being searched for (M) is small, then O(M log N) really is a lot smaller: with M = 20 and N = 16K, log N is 14 and M log N is 280, compared with O(M + N), which in this case is about 16K. If M is approximately the same size as N, then O(M log N) is effectively a lot worse than O(M + N).

Therefore, depending on the sizes of your collections, you can choose which algorithm to use.
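To make that concrete, here is a minimal sketch of such a size-based choice; the threshold is just the M log N versus M + N comparison above, and the function name is illustrative:

#include <cmath>
#include <cstddef>

// Bisect when M * log2(N) beats the cost of one linear merge pass, M + N
// (M = number of needles, N = haystack size, as in the discussion above).
bool preferBisection(std::size_t needleCount, std::size_t haystackSize) {
  if (haystackSize < 2) return false;  // degenerate haystack: just scan it
  return needleCount * std::log2(static_cast<double>(haystackSize))
         < static_cast<double>(needleCount + haystackSize);
}

With M = 20 and N = 16384 this returns true (280 < 16404); with M equal to N it returns false for any N above 4.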

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow