Question

What follows is my algorithm for doing this in what I believe to be $O(n)$ time, together with my proof of that bound. My professor disagrees that it runs in $O(n)$ time and instead thinks that it runs in $\Omega(n^2)$ time. I would appreciate any comments regarding the proof itself, or the style (i.e., my ideas may be clear but the presentation may not be).

The original question:

Given $n$ numbers, find the $m \leq n$ largest among them in time $o(n \log n)$. You may not assume anything else about $m$.

My answer:

  1. Sort the first $m$ elements of the array. This takes $O(1)$ time, as the cost depends only on $m$, not on $n$.
  2. Store them in a linked list (maintaining the sorted order). This also takes $O(1)$ time, for the same reason as above.
  3. For every other element in the array, test whether it is greater than the least element of the linked list. This takes $O(n)$ time, as $n$ comparisons must be done.
  4. If the number is in fact greater, delete the first element of the linked list (the lowest one) and insert the new number at the location that keeps the list in sorted order. This takes $O(1)$ time because it is bounded above by a constant ($m$), as the list does not grow.
  5. Therefore, the total complexity of the algorithm is $O(n)$. (A sketch of these steps in code follows below.)
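For concreteness, here is a minimal Python transcription of steps 1–4 (a sketch; a plain list kept in sorted order stands in for the linked list, with `bisect.insort` performing the sorted insertion):

```python
import bisect

def m_largest_list(a, m):
    """Maintain the m largest elements seen so far in sorted order."""
    window = sorted(a[:m])            # steps 1-2: sort the first m elements
    for x in a[m:]:                   # step 3: scan the remaining n - m elements
        if x > window[0]:             # compare against the least kept element
            window.pop(0)             # step 4: evict the smallest...
            bisect.insort(window, x)  # ...and insert x, keeping sorted order
    return window
```

Each update touches at most $m$ list elements, which is the constant bound that step 4 appeals to.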

I am aware that using a red-black tree as opposed to a linked list is more efficient in constant terms (the upper bound per update becomes $\log_2 m$ as opposed to $m$), and the problem of keeping a pointer to the lowest element of the tree (to facilitate the comparisons) is eminently doable; it just didn't occur to me at the time.

What is my proof missing? Is there a more standard way of presenting it (even if it is incorrect)?

Solution

Here is an $O(n)$ algorithm solving the problem.

  1. Use the worst-case $O(n)$ selection algorithm (median of medians) to determine the $(n-m+1)$-th order statistic. Let $k$ be this number; it is the smallest of the $m$ largest numbers we are trying to determine.

  2. Now partition the array around the pivot $k$ using the QuickSort partition function. This step takes $O(n)$ too.

  3. Output the $m$ largest numbers: these are $k$ together with all of the numbers in the upper subarray generated by the partition in step 2.
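For illustration, here is a minimal Python sketch of this algorithm. It uses a randomized pivot, so it runs in expected $O(n)$ time rather than the guaranteed $O(n)$ of median-of-medians selection, but the structure (select, then read off the upper subarray) is the same; all names here are illustrative, not from a particular library.

```python
import random

def partition(a, lo, hi, pivot_index):
    """Partition a[lo..hi] around a[pivot_index]; return the pivot's final index."""
    a[pivot_index], a[hi] = a[hi], a[pivot_index]
    pivot, store = a[hi], lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    return store

def quickselect(a, k):
    """Rearrange a in place so that a[k] holds the k-th smallest element (0-indexed)."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        p = partition(a, lo, hi, random.randint(lo, hi))
        if p == k:
            return
        if p < k:
            lo = p + 1
        else:
            hi = p - 1

def m_largest_select(a, m):
    """Return the m largest elements of a, in no particular order."""
    a = list(a)
    quickselect(a, len(a) - m)  # step 1: a[n-m] is the (n-m+1)-th order statistic k
    return a[len(a) - m:]       # steps 2-3: k and the upper subarray around it
```

For example, `m_largest_select([5, 1, 9, 3, 7], 2)` returns `[7, 9]`.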

OTHER TIPS

Your algorithm takes $\Theta(n + mn)$ time. I suspect that your professor is looking for something that takes $O(n+ n\log m)$ time, which should be possible, maybe by using a heap...
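For illustration, a minimal sketch of the heap idea in Python (the function name is mine; the standard library's `heapq.nlargest(m, a)` implements essentially the same strategy):

```python
import heapq

def m_largest_heap(a, m):
    """Keep a min-heap of the m largest elements seen so far: O(m + n log m)."""
    heap = list(a[:m])
    heapq.heapify(heap)                 # O(m)
    for x in a[m:]:
        if x > heap[0]:                 # heap[0] is the smallest of the kept m
            heapq.heapreplace(heap, x)  # pop the smallest and push x in O(log m)
    return heap
```

This is exactly the questioner's algorithm with the linked list replaced by a min-heap, which turns the $\Theta(m)$ cost per update into $\Theta(\log m)$.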

The source of your disagreement with the professor is that he or she doesn't appear to treat $m$ as a constant, despite how the question is worded. If it isn't one, then the $\Theta(m)$ cost per update is a lot worse than $\Theta(\log m)$.

Correctness
Missing from your presentation is the loop invariant needed to establish correctness: the linked list holds the largest $m$ elements encountered so far. Thus, by the end of your algorithm, you have tested all the elements, and so the list holds the largest $m$ elements of the array. Your algorithm is still correct, but stating the purpose of the linked list at the beginning of the description makes its correctness more explicit.

Running Time
$\log_{10}(10 \mbox{ billion}) = 10$ (with base 2, it's about 33), which is a heck of a lot smaller than 10 million. In the example given, $m$ is many times larger than $\log n$, and so I don't think you can safely assume that $m$ is constant.

You should provide an algorithm that is $o(n \log n)$ whenever $m$ is sufficiently small relative to $n$. Replacing your linked list and linear search with a balanced binary search tree or a min-heap achieves a run-time of $O(m \log m + n \log m) = O(n \log m)$, which is $o(n \log n)$ provided $\log m = o(\log n)$ (for example, when $m$ is polylogarithmic in $n$); if $m = \Theta(n^c)$ for some constant $c > 0$, the run-time is $\Theta(n \log n)$.
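A sketch of the balanced-BST variant, assuming the third-party `sortedcontainers` package (its `SortedList` stands in for the balanced binary search tree; the comparisons and evictions mirror the original algorithm):

```python
from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

def m_largest_bst(a, m):
    """Keep the m largest elements seen so far in sorted order: O(n log m)."""
    window = SortedList(a[:m])  # O(m log m) to build
    for x in a[m:]:
        if x > window[0]:       # window[0] is the smallest kept element
            window.pop(0)       # evict it in O(log m)
            window.add(x)       # insert x in O(log m)
    return list(window)
```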

In case you're not familiar with the notation, the intuition behind $o(n \log n)$ is $< O(n \log n)$, but see the corresponding question on cs.SE for details of the notation.
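(For completeness, the formal definition: $f(n) = o(g(n))$ means that $\lim_{n \to \infty} f(n)/g(n) = 0$; equivalently, for every constant $c > 0$ we have $f(n) < c \cdot g(n)$ for all sufficiently large $n$.)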

Licensed under: CC-BY-SA with attribution