Pregunta

I have understand what external sort does, what it for; yet I have an issue at my mind about merging extreme case.

external sorting the first answer explains how external sort merging works. But what if:

assume that we have 10 units memory size and we want to sort 50 units file

first we slice the file into 5 runs( each of them 10 units ) and sort them individually

second we have to merge them with 4-way merge

and 10/4 = 2.5 ~ 2; we take 2 units(chunk) from each run, put them into memory and we start merging;

Then the actual question is: what if the second and the third chunk of the (assume) third run have

smaller elements than the first chunks of the other runs ? Will the merging process be successful ?

If I have mistakes about what I understand, any explanation will be helpful.

¿Fue útil?

Solución

Well, there's no problem having smaller/greater elements in any of the files. Here is an example of the external sort process:

Your initial data:

data = [2, 5, 3, 7, 1, 6, 4, 8, 9]

Considering you have only 3 units of memory, you'd have the following shards, and results of sorting:

d1 = [2, 5, 3] -> sorting -> d1 = [2, 3, 5]
d2 = [7, 1, 6] -> sorting -> d2 = [1, 6, 7]
d3 = [4, 8, 9] -> sorting -> d3 = [4, 8, 9]

As you have three available units, you can read from the three shards at the same time, so, you'd have:

d = [], d1 = [2, 3, 5], d2 = [1, 6, 7], d3 = [4, 8, 9] -> min(d1, d2, d3) = 1
d = [1], d1 = [2, 3, 5], d2 = [6, 7], d3 = [4, 8, 9] -> min(d1, d2, d3) = 2
d = [1, 2], d1 = [3, 5], d2 = [6, 7], d3 = [4, 8, 9] -> min(d1, d2, d3) = 3
d = [1, 2, 3], d1 = [5], d2 = [6, 7], d3 = [4, 8, 9] -> min(d1, d2, d3) = 4
d = [1, 2, 3, 4], d1 = [5], d2 = [6, 7], d3 = [8, 9] -> min(d1, d2, d3) = 5
d = [1, 2, 3, 4, 5], d1 = [], d2 = [6, 7], d3 = [8, 9] -> min(d1, d2, d3) = 6
d = [1, 2, 3, 4, 5, 6], d1 = [], d2 = [7], d3 = [8, 9] -> min(d1, d2, d3) = 7
d = [1, 2, 3, 4, 5, 6, 7], d1 = [], d2 = [], d3 = [8, 9] -> min(d1, d2, d3) = 8
d = [1, 2, 3, 4, 5, 6, 7, 8], d1 = [], d2 = [], d3 = [9] -> min(d1, d2, d3) = 9
d = [1, 2, 3, 4, 5, 6, 7, 8, 9], d1 = [], d2 = [], d3 = [] -> []

What you may be concerned about is when you have limitation enough to allow you not to read at least one element from each file, or even if the decision was simply to read more elements from a given file, leaving behind another file to be read.

This is the same as the process above, with the only difference that, after reading, say, two files, and merging the data between them, you'd have to read from the third file and from the last file generated, which is the merge of files 1 and 2.

Since both the third file and the last file generated are surely sorted, you are able to sequentially scan the data from both files, merging the entries into an unique result.

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top