Pregunta

I have to sort a large amount of data that can not fit in memory, and one thing could do this I know is "external sort". But I am wondering is that possible to mmap this large data file, and use 'qsort' as it is a 'normal data array'? If that's feasible, what's the differences with 'external sort'?

¿Fue útil?

Solución

If the file will fit in a contiguous mapping in your address space, you can do this. If it won't, you can't.

As to the differences:

  • if the file just about fits, and then you add some more data, the mmap will fail. A normal external sort won't suddenly stop working because you have a little more data.
  • if you don't map it with MAP_PRIVATE, sorting will mutate the original file. A normal external sort won't (necessarily)
  • if you do map it with MAP_PRIVATE, you could crash at any time if the VM doesn't have room to duplicate the whole file. Again, a strictly external sort's memory requirements don't scale linearly with the data size.

tl;dr

It is possible, it may fail unpredictably and unrecoverably, you almost certainly shouldn't do it.

Otros consejos

It should definitely work if the data fits in address space (almost certainly does on 64-bit machines; might or might not on 32-bit ones), but performance will depend on a lot on the underlying algorithm used by qsort and its data locality properties. One issue to consider is whether it's the number of elements that's huge, or whether each element is large on disk. In the latter case, you'd be better off doing the mmap, but allocating a separate array of pointers to each element, then sorting the pointer array with a comparison function that compares what they point to. This will drastically reduce the number of times data gets moved around in memory, but it will take a little work at the end if you want to store the output back to the same file.

Yes, this is possible as long as you have fixed-length records in the file and the file fits within a range of contiguous VM addresses, and in fact this can be considered a naive approach to external sorting. It may not be the fastest algorithm in town, though, since qsort implementations will not be tuned for this use case.

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top