Pregunta

When we externally merge sort a large file, we split it into small ones, sort those, and then merge them back into a large sorted file.

When merging, we can either do many 2-way merge passes, or one multi-way merge.

I am wondering which approach is better? and why?

¿Fue útil?

Solución

One multi-way merge is generally better. Consider three small files:

a1
a2
a3

and

b1
b2
b3

and finally

c1
c2
c3

If you do a merge with a and b, we're left with (say)

a1
b1
a2
b2
b3
a3

and

c1
c2
c3

A final merge would create the sorted list, but notice how in this final merge we have to visit the a and b items again. It's this re-merging that is wasteful in cascading two-way merges.

What you can do instead is a single multi-way merge. However, be careful how you do it. Specifically, avoid the naive double-loop that scans each cursor to see which has the minimum value. Use a min-heap instead. This will bring the complexity back down to O(n log n).

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top