Question

I'd like to get the largest 100 elements out from a list of at least 100000000 numbers.

I could sort the entire list and just take the last 100 elements from the sorted list, but that would be very expensive in terms of both memory and time.

Is there any existing easy, pythonic way of doing this?

What I want is following function instead of a pure sort. Actually I don't want waste time to sort the elements I don't care.

For example, this is the function I'd like to have:

getSortedElements(100, lambda x,y:cmp(x,y))

Note this requirement is only for performance perspective.

Was it helpful?

Solution

The heapq module in the standard library offers the nlargest() function to do this:

top100 = heapq.nlargest(100, iterable [,key])

It won't sort the entire list, so you won't waste time on the elements you don't need.

OTHER TIPS

Selection algorithms should help here.

A very easy solution is to find the 100th biggest element, then run through the list picking off elements that are bigger than this element. That will give you the 100 biggest elements. This is linear in the length of the list; this is best possible.

There are more sophisticated algorithms. A heap, for example, is very amenable to this problem. The heap based algorithm is n log k where n is the length of the list and k is the number of largest elements that you want to select.

There's a discussion of this problem on the Wikipedia page for selection algorithms.

Edit: Another poster has pointed out that Python has a built in solution to this problem. Obviously that is far easier than rolling your own, but I'll keep this post up in case you would like to learn about how such algorithms work.

You can use a Heap data structure. A heap will not necessarily be ordered, but it is a fairly fast way to keep semi-ordered data, and it has the benefit of the smallest item always being the first element in the heap.

A heap has two basic operations that will help you: Add and Replace.

Basically what you do is add items to it until you get to a 100 items (your top N number per your question). Then after that, you replace the first item with every new item, as long as the new item is bigger than the first item.

Whenever you replace the first item with something bigger, the internal code in the heap will adjust the heap contents so that if the new item is not the smallest, it will bubble up into the heap, and the smallest item will "bubble down" to the first element, ready to be replaced along the way.

The best way to do this is to maintain a heap sorted priority queue that you pop off of once it has 100 entries in it.

While you don't care if the results are sorted it is intuitively obvious you will get this for free. In order to know you have the top 100, you need to order your current list of top numbers in order via some efficient data structure. That structure will know the minimum, the maximum, and the relative position of each element in some natural way that you can assert it's position next to it's neighbors.

As has been mentioned in python you would use heapq. In java PriorityQueue: http://java.sun.com/javase/6/docs/api/java/util/PriorityQueue.html

Here is a solution I have used that is independent of libraries and that will work in any programming language that has arrays:

Initialisation:

Make an array of 100 elements and initialise all elements
with a low value (less than any value in your input list).

Initialise an integer variable to 0 (or any value in
[0;99]), say index_minvalue, that will point to the
current lowest value in the array.

Initialise a variable, say minvalue, to hold the current 
lowest value in the array.

For each value, say current_value, in the input list:

if current_value > minvalue

  Replace value in array pointed to by index_minvalue
  with current_value

  Find new lowest value in the array and set index_minvalue to
  its array index. (linear search for this will be OK as the array
  is quickly filled up with large values)

  Set minvalue to current_value

else
  <don't do anything!>

minvalue will quickly get a high value and thus most values in the input list will only need to be compared to minvalue (the result of the comparison will mostly be false).

For the algorithms weenies in the audience: you can do this with a simple variation on Tony Hoare's algorithm Find:

find(topn, a, i, j)
   pick a random element x from a[i..j]
   partition the subarray a[i..j] (just as in Quicksort) 
     into subarrays of elements <x, ==x, >x
   let k be the position of element x
   if k == 0 you're finished
   if k > topn, call find(topn, a, i, k)
   if k < topn, call find(topn-k, k, j)

This algorithm puts the largest topn elements into the first topn elements of array a, without sorting them. Of course, if you want them sorted, or for sheer simplicity, a heap is better, and calling the library function is better still. But it's a cool algorithm.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top