Question

I apologize if this has been asked before, but I couldn't find any posts with the wording I chose. I'm preparing for interviews, and I've been reading up on external sorting. For instance, if you want to sort several hard disks' worth of 32-bit integers, you could do a counting sort, using 64-bit counters to count the 32-bit integers; then, for every possible 32-bit integer value, you would have a counter representing it. You can also use external merge sort for similar tasks, taking O(n log n) time instead of the linear time of the counting approach. However, I've been thinking about a case that is probably very common, but I can't think of the best way to handle it: adding new data to a collection of sorted files, possibly spread across many hard disks.

If the data were in memory, one could use a heap (priority queue) to accomplish an insert in O(log n) time. However, we can't build a heap out of hard disk space. With a sorted list, you would use an O(log n) binary search to find the data's place, then bump the rest of the data backward or forward, or you might not have to shift anything at all, depending on the implementation of the container (arrays, linked lists, etc.). In the hard disk world, though, reads and writes are much more expensive than in RAM, so inserting data somewhere and then shifting (rewriting) the rest of the data seems prohibitively expensive. Are there any techniques for this that any of you could recommend? I'd be happy to read up on it myself; I just couldn't find the right way to word my question to find any information. Thank you!


Solution

I'd say read through the file of your sorted data, read the file of new values you want to sort and add to it, bump the counters, and simply overwrite the sorted data file with the newly calculated one. Sequential reading is far cheaper on modern disk systems than random reading, and you need a counter position for every int you find anyway, so a single sequential pass over the entire volume will be less time-consuming than ~32 reads of a single sector per number in the file to be added.
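To make that concrete, here is a minimal sketch of the idea, assuming the sorted data is already stored as a file of 2^32 little-endian 64-bit counters and the new values arrive as raw little-endian unsigned 32-bit ints. The file names, chunk sizes, and on-disk format are my own illustration, not something specified in the answer.

```python
# Sketch: merge a batch of new 32-bit values into an existing counter file.
# Assumes counts.bin holds 2^32 little-endian 64-bit counters and new.bin
# holds raw little-endian unsigned 32-bit ints (illustrative formats).
import struct

COUNTER = 8               # bytes per 64-bit counter
CHUNK_COUNTERS = 1 << 20  # counters handled per sequential chunk

def merge_into_counters(counts_path="counts.bin", new_path="new.bin",
                        out_path="counts.new.bin"):
    # 1. Tally the new values in memory with one pass over the smaller file.
    tally = {}
    with open(new_path, "rb") as f:
        while chunk := f.read(4 * 65536):
            for (v,) in struct.iter_unpack("<I", chunk):
                tally[v] = tally.get(v, 0) + 1
    items = sorted(tally.items())

    # 2. Stream the counter file sequentially, bump the affected counters,
    #    and write the result out: one sequential read plus one sequential
    #    write, instead of a random seek per inserted value.
    i = 0
    base = 0
    with open(counts_path, "rb") as src, open(out_path, "wb") as dst:
        while chunk := src.read(COUNTER * CHUNK_COUNTERS):
            n = len(chunk) // COUNTER
            counters = list(struct.unpack("<%dQ" % n, chunk))
            # Apply every tallied value that falls inside this chunk.
            while i < len(items) and items[i][0] < base + n:
                value, count = items[i]
                counters[value - base] += count
                i += 1
            dst.write(struct.pack("<%dQ" % n, *counters))
            base += n
```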

Also, I'd say sorting 32-bit ints is best done with the result kept in the form of counters, especially at an extra-large scale like "several hard disks": you can expect at least one hit in almost every bucket of the 32-bit space, so storing 2^32 64-bit counters can be smaller than storing, say, 2^33 32-bit zeroes, then 2^32 ones, and so on.
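For a rough sense of scale (my arithmetic, not part of the original answer): 2^32 counters at 8 bytes each come to 32 GiB regardless of how much data you have, while storing the raw 4-byte values takes more than 32 GiB as soon as you hold more than 2^33 (about 8.6 billion) of them, which is exactly the "several hard disks" regime described here.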

Other tips

If you look up 'external sorting' here (or elsewhere) you will find discussions of what you are describing. external-sorting is a tag here as well.

In the hard disk world, though, reads and writes are much more expensive than in RAM, so inserting data somewhere then shifting (rewriting) the rest of the data seems prohibitively expensive.

External sorting is for cases where you do not have sufficient memory (or enough "per process", in most cases) to do it internally. It's not uncommon to have data sets too large to hold in memory at once, so you accept the higher runtime cost of I/O-bound sorting.

Even if you have space in memory to hold the file, if you insert a set of numbers whose smallest element is k, you are going to have to rewrite all the numbers in the file that are larger than k. There is no way around this: they all have to shift by at least one position.

If you are looking to exploit the fact that most of the array is already sorted, and you have the space in memory to do so, then sorting the inserted elements and merging them with the portion of the on-disk list that is larger than the smallest insertion is a good, fast way to do this. For example:

DISK:

1 2 3 4 5 6 8 10 11 12

Insertions: 9 7 13

Sort the insertions:

7 9 13

Find the subset of the sorted list on disk that applies: 8 10 11 12

Merge the elements in (as in mergesort):

7 8 9 10 11 12 13

Copy them back to disk:

1 2 3 4 5 6 7 8 9 10 11 12 13
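A minimal sketch of this tail-merge, assuming the on-disk list is a file of sorted little-endian unsigned 32-bit ints and the batch of insertions fits in memory; the file format and function names are my own illustration.

```python
# Sketch: insert a small sorted batch by rewriting only the tail of the file.
import heapq
import os
import struct

INT = 4  # bytes per stored value

def read_at(f, index):
    # Read the value at a given index in the sorted file.
    f.seek(index * INT)
    return struct.unpack("<I", f.read(INT))[0]

def insert_batch(path, insertions):
    insertions = sorted(insertions)
    with open(path, "r+b") as f:
        n = os.path.getsize(path) // INT

        # Binary search for the first on-disk element >= smallest insertion;
        # everything before that position is left untouched.
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if read_at(f, mid) < insertions[0]:
                lo = mid + 1
            else:
                hi = mid
        start = lo

        # Read the affected tail, merge it with the insertions (as in
        # mergesort), and write the merged run back starting at `start`.
        f.seek(start * INT)
        tail = [v for (v,) in struct.iter_unpack("<I", f.read())]
        merged = list(heapq.merge(tail, insertions))
        f.seek(start * INT)
        f.write(struct.pack("<%dI" % len(merged), *merged))
```

With the example above (1..6, 8, 10, 11, 12 on disk and insertions 9, 7, 13), `start` lands on the 8, and only the run 8 10 11 12 is rewritten as 7 8 9 10 11 12 13.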

If, on the other hand, the memory you have available is much smaller than the total size of the list, other techniques might be advisable. For example:

1 2 3 4 ... 1000 1002 1003 ... 999,998 1,000,000 ...

as your list on disk and

1001, 999,999

as your insertions. In this situation, you will want to go through each on-disk element, calculate the number of elements in the insertion list that are smaller than it, and shift it by that amount. In this simple example the naive count is very fast: you can see that 1,000,000 needs to jump two places. If the number of insertions is comparatively large, you can sort your insertions and then binary search them for each element of your larger array to find how many insertions precede it, which tells you how many positions each item has to move. So, the corresponding shift values for the list above would be:

0 0 0 0 ... 0 1 1 ... 1 2

Hopefully you can see a fairly obvious way to decide when to write each of your insertion elements out to disk, as sketched below.
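Here is a sketch of that idea under the same assumed file layout as before: stream the big sorted file once, and for each element binary search the small, sorted in-memory insertion list to see how many insertions precede it, writing any not-yet-emitted insertions out just before it. The names and file formats are illustrative assumptions.

```python
# Sketch: limited-memory insert via per-element shift counts (binary search
# over the sorted insertion list), streaming the big file once.
import bisect
import struct

INT = 4  # bytes per stored value

def merge_with_shift_counts(src_path, insertions, dst_path):
    insertions = sorted(insertions)
    emitted = 0  # how many insertions have already been written out
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(INT * 65536):
            for (x,) in struct.iter_unpack("<I", chunk):
                # Number of insertions smaller than x == how far x shifts;
                # any not-yet-written ones among them go out before x.
                shift = bisect.bisect_left(insertions, x)
                while emitted < shift:
                    dst.write(struct.pack("<I", insertions[emitted]))
                    emitted += 1
                dst.write(struct.pack("<I", x))
        # Insertions larger than every on-disk element land at the end.
        for v in insertions[emitted:]:
            dst.write(struct.pack("<I", v))
```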

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow