Question

It's well-known that the worst-case runtime for heapsort is Ω(n lg n), but I'm having trouble seeing why this is. In particular, the first step of heapsort (making a max-heap) takes time Θ(n). This is then followed by n heap deletions. I understand why each heap deletion takes time O(lg n); rebalancing the heap involves a bubble-down operation that takes time O(h) in the height of the heap, and h = O(lg n). However, what I don't see is why this second step should take Ω(n lg n). It seems like any individual heap dequeue wouldn't necessarily cause the node moved to the top to bubble all the way down the tree.

My question is - does anyone know of a good lower-bound proof for the best-case behavior of heapsort?

Was it helpful?

Solution

So I did a bit of digging myself and it looks like this result actually is fairly recent! The first lower-bound proof I can find is from 1992, though heapsort itself was invented in 1964.

The formal lower-bound proof is due to Schaffer and Sedgewick's "The Analysis of Heapsort" paper. Here's a slightly paraphrased version of the proof that omits some of the technical details.

To begin, let's suppose that n = 2k - 1 for some k, which guarantees that we have a complete binary heap. I'll show how to handle this case separately later on. Because we have 2k - 1 elements, the first pass of heapsort will, in Θ(n), build up a heap of height k. Now, consider the first half of the dequeues from this heap, which removes 2k-1 nodes from the heap. The first key observation is that if you take the starting heap and then mark all of the nodes here that actually end up getting dequeued, they form a subtree of the heap (i.e. every node that get dequeued has a parent that also gets dequeued). You can see this because if this weren't the case, then there would be some node whose (larger) parent didn't get dequeued though the node itself was dequeued, meaning that the values are out of order.

Now, consider how the nodes of this tree are distributed across the heap. If you label the levels of the heap 0, 1, 2, ..., k - 1, then there will be some number of these nodes in levels 0, 1, 2, ..., k - 2 (that is, everything except the bottom level of the tree). In order for these nodes to get dequeued from the heap, then they have to get swapped up to the root, and they only get swapped up one level at a time. This means that one way to lower-bound the runtime of heapsort would be to count the number of swaps necessary to bring all of these values up to the root. In fact, that's exactly what we're going to do.

The first question we need to answer is - how many of the largest 2k-1 nodes are not in the bottom level of the heap? We can show that this is no greater than 2k-2 by contradiction. Suppose that there are at least 2k-2 + 1 of the largest nodes in the bottom level of the heap. Then each of the parents of those nodes must also be large nodes in level k - 2. Even in the best case, this means that there must be at least 2k-3 + 1 large nodes in level k - 2, which then means that there would be at least 2k-4 + 1 large nodes in level k - 3, etc. Summing up over all of these nodes, we get that there are 2k-2 + 2k-3 + 2k-4 + ... + 20 + k large nodes. But this value is strictly greater than 2k-1, contradicting the fact that we're working with only 2k-1 nodes here.

Okay... we now know that there are at most 2k-2 large nodes in the bottom layer. This means that there must be at least 2k-2 of the large nodes in the first k-2 layers. We now ask - what is the sum, over all of these nodes, of the distance from that node to the root? Well, if we have 2k-2 nodes positioned somewhere in a complete heap, then at most 2k-3 of them can be in the first k - 3 levels, and so there are at least 2k-2 - 2k-3 = 2k-3 heavy nodes in level k - 2. Consequently, the total number of swaps that need to be performed are at least (k - 2) 2k-3. Since n = 2k-1, k = Θ(lg n), and so this value is Θ(n lg n) as required.

OTHER TIPS

Simple observation answer is this: The items in heap are:

1
2
4
8
...
2^[log(n/4)]
and last level has between (1..2^[log(n/2)]) ==> (1,[n/2]) item, (by [] I mean Ceiling not roof)

for example if you have 7 item:

1
2
4

and if you have 8 item:

1
2
4
1

There is 2 different heap tree, first at least n/4 - 1 items of a heap are in last level, or not, so there is at least n/4 - 1 item in level before last one, In the first case it takes O((n/4 - 1) * log(n/2)) to remove last level items from heap, and in the second case it takes O((n/4 - 1) * log(n/4)) to remove items from pre last level. so in both case it takes Ω(n log(n)) just for n/4 - 1 items, so it's a lower bound (easily can say it's tight lower bound).

Here is a solution that uses CLRS terms:
We start with a max-heap that is a complete binary tree with n elements.
We can say that in a complete binary there are n/2 leaves and n/2 inner nodes.
n/2 iterations of HEAP-SORT remove the largest n/2 elements from the heap.
Let S be the set of the largest n/2 elements.
There can be at most n/4 elements from S in the leaves since there must be additional n/4 of them in the inner nodes.
Let L be these n/4 largest elements from S that are in the leaves.
So if there are n/4 elements from S at level 0 (the leaves level) then there must be at least n/8 of them at level 1.
Let P be these n/8 elements from S that are at level 1.
n/2 iterations of HEAP-SORT may give the elements from L a short cut to the root and then out of the heap, but the elements from P must make all the way to the root before they are removed from the heap.
So there are at least (n/8)(lgn-1) operations, which gives us a running time of Ω(nlgn).
Now for the case of a max-heap that doesn't have all its leaves at level 0.
Let k be the number of its leaves at level 0.
After k iterations of HEAP-SORT, we are left with a max-heap that is a complete binary tree with height lgn-1.
We can continue our proof the same way.
Now for the case when there are less than n/4 leaves from S.
Let k be the number of elements from S that are in the leaves at level 0.
If k <= n/8 then there must be at least n/8 elements from S at level 1.
This is because there can be a total of n/4 elements above level 1.
We continue the proof the same way.
If k>n/8 then there must be at least n/16 elements from S that are at level 1.
We continue the proof the same way.
We conclude that the running time of HEAP-SORT is Ω(nlgn).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top