Question

Consider the task of finding the top-k elements in a set of N independent and identically distributed floating point values. Using a priority queue / heap — a size-k min-heap whose head is the worst element kept so far — we can iterate once over all N elements and maintain a top-k set by the following operations:

  • if the element x is "worse" than the heap's head: discard x ⇒ complexity O(1)

  • if the element x is "better" than the heap's head: remove the head and insert x ⇒ complexity O(log k)

The worst-case time complexity of this approach is obviously O(N log k), but what about the average time complexity? Due to the iid assumption, the probability of the O(1) operation increases over time, and we rarely have to perform the costly O(log k) operation, especially for k << N.
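For concreteness, the scheme described above might look like this in Python (a minimal sketch; `top_k` and the parameter choices are illustrative, not part of the question):

```python
import heapq
import random

def top_k(values, k):
    """Return the k largest values, counting O(log k) heap adjustments.

    Sketch of the scheme above: a min-heap holds the current top-k,
    so its head is the worst element kept so far.
    Assumes len(values) >= k.
    """
    heap = values[:k]              # seed the heap with the first k elements
    heapq.heapify(heap)
    adjustments = 0
    for x in values[k:]:
        if x > heap[0]:            # x is "better" than the heap's head
            heapq.heapreplace(heap, x)   # pop head and insert x: O(log k)
            adjustments += 1
        # otherwise: discard x in O(1)
    return sorted(heap, reverse=True), adjustments

random.seed(0)
data = [random.random() for _ in range(10_000)]
best, adj = top_k(data, 10)
```

Counting `adjustments` here makes the question concrete: for iid inputs it stays far below N, which is exactly what the average-case analysis should quantify.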

Is this average time complexity documented in any citable reference? What is the average time complexity? If you have a citable reference for your answer, please include it.


Solution

Consider the i'th largest element and a particular permutation of the input. It will be inserted into the k-sized heap if and only if at most k-1 of the (i - 1) larger elements appear before it in the permutation.

The probability of that heap insertion happening is 1 if i <= k, and k/i if i > k (the element and the i - 1 larger ones are equally likely to arrive in any relative order, and it must be among the first k of them).
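A quick Monte Carlo check of that probability (the function name and the values n=50, k=5, i=20 are illustrative choices, not from the answer):

```python
import heapq
import random

def insertion_probability(n, k, i, trials=20_000):
    """Estimate P(the i-th largest element triggers a heap insertion)
    over random permutations of n distinct values."""
    hits = 0
    target = n - i  # the value whose rank from the top is i (0..n-1 scale)
    for _ in range(trials):
        perm = random.sample(range(n), n)   # a uniformly random permutation
        heap = []                           # min-heap of the k largest so far
        for x in perm:
            if len(heap) < k:
                heapq.heappush(heap, x)
                inserted = True
            elif x > heap[0]:
                heapq.heapreplace(heap, x)
                inserted = True
            else:
                inserted = False
            if x == target:
                hits += inserted
                break   # the target's fate is decided on its arrival
    return hits / trials

random.seed(1)
est = insertion_probability(n=50, k=5, i=20)
# theory predicts k/i = 5/20 = 0.25
```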

From this, you can compute the expected number of heap adjustments using linearity of expectation:

E = Σ(i = 1 to k) 1 + Σ(i = k+1 to n) k/i = k + k·(H(n) − H(k)) = k·(1 + H(n) − H(k)),

where H(n) is the n'th harmonic number.
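Evaluating the closed form numerically (the helper names and the values n=10000, k=10 are illustrative):

```python
from math import log

def harmonic(n):
    """n-th harmonic number H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / j for j in range(1, n + 1))

def expected_adjustments(n, k):
    """Expected number of heap insertions: k * (1 + H(n) - H(k))."""
    return k * (1 + harmonic(n) - harmonic(k))

e = expected_adjustments(10_000, 10)
# for k << n, H(n) - H(k) is close to ln(n/k), so e ~ k * (1 + ln(n/k))
approx = 10 * (1 + log(10_000 / 10))
```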

Since H(n) − H(k) ≈ ln(n/k), this is approximately k log(n) for k << n. Each adjustment costs O(log k), and every element still pays an O(1) comparison, so the average total cost is O(n + k log k log n).

Licensed under: CC-BY-SA with attribution