Question

I have a file with a large amount of data, and I want to sort it holding only a fraction of the data in memory at any given time.

I've noticed that merge sort is popular for external sorting, but I'm wondering if it can be done with a heap (min or max). Basically my goal is to get the top 10 items in a 100-item list (the numbers are arbitrary) while never holding more than 10 items in memory.

I mostly understand heaps, and understand that heapifying the data would put it in the appropriate order, from which I could just take the last fraction of it as my solution, but I can't figure out how to do this without an I/O for every freakin' item.

Ideas?

Thanks! :D

Was it helpful?

Solution

Using heapsort requires lots of seek operations in the file, both when building the heap initially and when removing the top element. For that reason, it's not a good idea.

However, you can use a variation of mergesort where every heap element is a sorted list. The size of the lists is determined by how much you want to keep in memory. You create these lists from the input file by loading chunks of data, sorting them and then writing them to a temporary file. Then you treat every file as one list: read the first element from each and build a heap from those. Whenever you remove the top element, you take the next element from the list it came from and restore the heap condition if necessary.
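Here is a minimal sketch of that two-phase approach in Python, assuming one number per line; the function name, the `chunk_size` parameter, and the file layout are just illustrative. `heapq.merge` performs exactly the heap-of-sorted-lists merge described above:

import heapq
import itertools
import tempfile

def external_sort(in_path, out_path, chunk_size=1000):
    # Phase 1: load chunks that fit in memory, sort each one,
    # and write it out as a sorted run in a temporary file.
    runs = []
    with open(in_path) as f:
        while True:
            chunk = [int(s) for s in itertools.islice(f, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+")
            run.write("\n".join(map(str, chunk)) + "\n")
            run.seek(0)
            runs.append(run)

    # Phase 2: k-way merge. heapq.merge holds only the current front
    # element of each run in a heap, so memory stays at O(number of runs).
    with open(out_path, "w") as out:
        for value in heapq.merge(*(map(int, run) for run in runs)):
            out.write(f"{value}\n")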

There is one aspect, though, that makes these facts about sorting irrelevant: you say you want to determine the top 10 elements. For that, you can indeed use an in-memory heap. Take an element from the file, push it onto the heap, and if the size of the heap exceeds 10, remove the lowest element. To make it more efficient, only push an element if the heap's size is below 10 or the element is above the current lowest one, which you then replace before re-heapifying. Keeping the top ten in a heap means you only scan through the file once; everything else is done in memory. A binary tree instead of a heap would also work and probably be similarly fast; for a number as small as 10, you could even keep an array and bubble new elements into place.
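As a sketch of that loop in Python (the one-number-per-line format and the `top_k` name are just assumptions), a `heapq` min-heap keeps the lowest of the current top ten at the root, which is exactly the element to compare against and replace:

import heapq

def top_k(path, k=10):
    heap = []  # min-heap holding the k largest values seen so far
    with open(path) as f:
        for line in f:
            value = int(line)
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                # New value beats the smallest of the current top k:
                # replace the root and restore the heap condition.
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)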

Note: I'm assuming that 10 and 100 were just examples. If your numbers are really that low, any discussion about efficiency is probably moot, unless you're doing this operation several times per second.

Other tips

Yes, you can use a heap to find the top-k items in a large file, holding only the heap + an I/O buffer in memory.

The following will obtain the min-k items by making use of a max-heap of length k. You could read the file sequentially, doing an I/O for every item, but it will generally be much faster to load the data in blocks into an auxiliary buffer of length b. The method runs in O(n*log(k)) operations using O(k + b) space.

while (file not empty)
    read block from file              // one sequential read of b items
    for (i = all items in block)
        if (heap.count() < k)
            heap.push(item[i])        // heap not full yet: keep the item
        else if (item[i] < heap.root())
            heap.pop_root()           // evict the largest of the k smallest
            heap.push(item[i])
        endif
    endfor
endwhile
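A runnable rendering of this pseudocode in Python (the names `min_k` and `block` are mine): `heapq` only provides a min-heap, so the usual trick for the max-heap this method calls for is to store negated values:

import heapq
from itertools import islice

def min_k(in_path, k=10, block=4096):
    heap = []  # negated values: the root is the largest of the k smallest
    with open(in_path) as f:
        while True:
            # One buffered read of up to `block` items, as in the pseudocode;
            # Python's file object buffers underneath in any case.
            lines = list(islice(f, block))
            if not lines:
                break
            for s in lines:
                v = int(s)
                if len(heap) < k:
                    heapq.heappush(heap, -v)
                elif v < -heap[0]:
                    heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)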

Heaps require lots of nonsequential access. Mergesort is great for external sorting because it does a whole lot of sequential access.

Sequential access is a hell of a lot faster on the kinds of disks that spin, because the head doesn't need to move. Sequential access will probably also be much faster than heapsort's access pattern on solid-state disks, because they do accesses in blocks that are probably considerably larger than a single item in your file.

By using merge sort, the merge step only ever needs the two values currently being compared, one from the front of each run, so you only have to hold those two comparison values in a buffer as you move through the data until it is sorted.
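For illustration, a minimal sketch of one such merge pass in Python, assuming two already-sorted input files with one number per line (all names here are placeholders); only the two front values are ever compared in memory:

def merge_files(path_a, path_b, out_path):
    with open(path_a) as a, open(path_b) as b, open(out_path, "w") as out:
        la, lb = a.readline(), b.readline()
        while la and lb:
            # Compare just the two front values, one from each run.
            if int(la) <= int(lb):
                out.write(la)
                la = a.readline()
            else:
                out.write(lb)
                lb = b.readline()
        # One run is exhausted; copy the rest of the other unchanged.
        while la:
            out.write(la)
            la = a.readline()
        while lb:
            out.write(lb)
            lb = b.readline()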

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow