I am trying to figure out the most efficient way to compute a top-k query over some aggregation of data, let's say an array. I used to think the best way was to run through the array while maintaining a heap or balanced binary tree of size k and using that to answer the top-k query. Now I have run across the selection algorithm, which supposedly runs even faster.

I understand how the selection algorithm works and how to implement it; I am just a little confused about how it runs in O(n). I feel like, for it to run in O(n), you would have to be extremely lucky. If you keep picking a random pivot and partitioning around it, it could very well happen that you end up essentially sorting almost the entire array before stumbling upon your kth index.

Are there any optimizations, such as not picking the pivot at random? Or is my heap/tree method good enough for most cases?
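Concretely, this is what I mean by the heap approach, as a minimal Python sketch using heapq (the function name top_k is just mine):

    import heapq

    def top_k(arr, k):
        # Min-heap holding the k largest elements seen so far.
        heap = []
        for x in arr:
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:
                # x beats the smallest of the current top k: replace it.
                heapq.heapreplace(heap, x)
        return sorted(heap, reverse=True)  # the k largest, in descending order

Each element costs O(log k), so the whole pass is O(n log k).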


Solution

What you're talking about there is quickselect, also known as Hoare's selection algorithm.

It does have O(n) average-case performance, but its worst-case performance is O(n²).
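For concreteness, here is a minimal quickselect sketch (Python, random pivot, Lomuto partition; the function name and the 0-based k are my own choices, not any library API):

    import random

    def quickselect(arr, k):
        """Return the k-th smallest element of arr, with 0 <= k < len(arr)."""
        arr = list(arr)  # work on a copy
        lo, hi = 0, len(arr) - 1
        while True:
            if lo == hi:
                return arr[lo]
            # Choose a random pivot and partition [lo, hi] around it.
            p = random.randint(lo, hi)
            arr[p], arr[hi] = arr[hi], arr[p]
            pivot, store = arr[hi], lo
            for i in range(lo, hi):
                if arr[i] < pivot:
                    arr[i], arr[store] = arr[store], arr[i]
                    store += 1
            arr[store], arr[hi] = arr[hi], arr[store]  # pivot ends up at index store
            # Only the side containing index k needs further work.
            if k == store:
                return arr[store]
            elif k < store:
                hi = store - 1
            else:
                lo = store + 1

The key difference from quicksort is that last step: quicksort would recurse into both sides of the partition, quickselect descends into only one.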

Like quicksort, the quickselect has good average performance, but is sensitive to the pivot that is chosen. If good pivots are chosen, meaning ones that consistently decrease the search set by a given fraction, then the search set decreases in size exponentially and by induction (or summing the geometric series) one sees that performance is linear, as each step is linear and the overall time is a constant times this (depending on how quickly the search set reduces). However, if bad pivots are consistently chosen, such as decreasing by only a single element each time, then worst-case performance is quadratic: O(n²).
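To make the "summing the geometric series" point concrete: if every pivot discards at least half of the remaining elements, the total work is at most about n + n/2 + n/4 + ... < 2n comparisons, i.e. linear. If every pivot peels off only a single element, the work is n + (n-1) + (n-2) + ... ≈ n²/2, which is where the quadratic worst case comes from.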

In terms of choosing pivots:

The easiest solution is to choose a random pivot, which yields almost certain linear time. Deterministically, one can use median-of-3 pivot strategy (as in quicksort), which yields linear performance on partially sorted data, as is common in the real world. However, contrived sequences can still cause worst-case complexity; David Musser describes a "median-of-3 killer" sequence that allows an attack against that strategy, which was one motivation for his introselect algorithm.
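A median-of-3 pivot choice is only a few lines; this helper (my own sketch, not from any particular library) could stand in for the random pivot index in the quickselect sketch above:

    def median_of_3_index(arr, lo, hi):
        # Index of whichever of arr[lo], arr[mid], arr[hi] holds the median value.
        mid = (lo + hi) // 2
        a, b, c = arr[lo], arr[mid], arr[hi]
        if a <= b <= c or c <= b <= a:
            return mid
        if b <= a <= c or c <= a <= b:
            return lo
        return hi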

One can assure linear performance even in the worst case by using a more sophisticated pivot strategy; this is done in the median of medians algorithm. However, the overhead of computing the pivot is high, and thus this is generally not used in practice. One can combine basic quickselect with median of medians as fallback to get both fast average case performance and linear worst-case performance; this is done in introselect.
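The introselect idea is easy to sketch, even if a production version differs in the details: run quickselect, count the partitioning rounds, and switch to a guaranteed strategy once that budget is exceeded. In the toy sketch below the fallback is simply sorting the remaining slice (O(n log n)), whereas real introselect falls back to median of medians to keep the worst case linear; the depth budget of roughly 2·log2(n) is also just an illustrative choice:

    import random

    def introselect(arr, k):
        """k-th smallest (0-based). Sketch only; see the caveats above."""
        arr = list(arr)
        depth_budget = 2 * max(1, len(arr)).bit_length()
        lo, hi = 0, len(arr) - 1
        while lo < hi:
            if depth_budget == 0:
                # Too many partitioning rounds: fall back to a guaranteed method.
                return sorted(arr[lo:hi + 1])[k - lo]
            depth_budget -= 1
            # Same random-pivot Lomuto partition as plain quickselect.
            p = random.randint(lo, hi)
            arr[p], arr[hi] = arr[hi], arr[p]
            pivot, store = arr[hi], lo
            for i in range(lo, hi):
                if arr[i] < pivot:
                    arr[i], arr[store] = arr[store], arr[i]
                    store += 1
            arr[store], arr[hi] = arr[hi], arr[store]
            if k == store:
                return arr[store]
            lo, hi = (store + 1, hi) if k > store else (lo, store - 1)
        return arr[lo]

C++'s std::nth_element is typically implemented along these lines.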

(quotes from Wikipedia)

So you're fairly likely to get O(n) performance with random pivots, but, if k is small and n is large, or if you're just unlucky, the O(n log k) solution using a size-k heap or BST could outperform this.
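If you do go the heap route in Python, heapq already packages it up, so you don't have to write the loop yourself:

    import heapq

    data = [5, 1, 9, 3, 7, 2, 8]
    print(heapq.nlargest(3, data))  # [9, 8, 7], maintained with a small heap internally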

We can't tell you with certainty which one will be faster in any given case; it depends on (1) the exact implementations, (2) the machine it's run on, (3) the exact sizes of n and k, and (4) the actual data. The O(n log k) solution should be sufficient for most purposes.
