Question

Given large datasets that don't fit in memory, is there any library or API to sort them in Java? The implementation would presumably be similar to the Linux sort utility.

Solution

Java provides a general-purpose sorting routine which can be used as part of the larger solution to your problem. A common approach to sorting data that's too large to fit in memory (usually called an external merge sort) is this:

1) Read as much data as will fit into main memory, let's say 1 GB

2) Sort that 1 GB in memory (here's where you'd use Java's built-in sort, e.g. Collections.sort or Arrays.sort; for objects this is actually a merge sort variant rather than a quicksort, but that makes no difference here)

3) Write that sorted 1 GB to disk as "chunk-1"

4) Repeat steps 1-3 until you've gone through all the data, saving each sorted chunk in a separate file. So if your original data was 9 GB, you will now have 9 sorted chunks of data labeled "chunk-1" through "chunk-9"

5) You now just need a final merge step (the merge phase of a merge sort) to combine the 9 sorted chunks into a single fully sorted data set. The merge works very efficiently against these pre-sorted chunks because it only ever needs one element per chunk in memory. It opens 9 file readers (one for each chunk), plus one file writer (for output). It then compares the first data element from each reader and selects the smallest value, which is written to the output file. The reader that the selected value came from advances to its next data element, and the 9-way comparison to find the smallest value is repeated, again writing the result to the output file. This repeats until all data has been read from all the chunk files.

6) Once step 5 has finished reading all the data you are done: your output file now contains a fully sorted data set.

With this approach you could easily write a generic "megasort" utility of your own that takes a filename and maxMemory parameter and efficiently sorts the file by using temp files. I'd bet you could find at least a few implementations out there for this, but if not you can just roll your own as described above.
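As a rough illustration, the steps above could be sketched in Java roughly as follows. This is only a minimal sketch under a few assumptions, not a tuned implementation: the class name ExternalSort and the maxLinesPerChunk parameter are made up for this example (a line count stands in for the maxMemory parameter, since measuring memory use precisely is harder), the data is assumed to be UTF-8 text with one record per line, and records are compared as plain strings. It also uses a PriorityQueue to pick the smallest head element instead of comparing all the chunk heads on every step; the result is the same, it just scales better when there are many chunks.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical utility: sorts the lines of a text file using temporary chunk files.
final class ExternalSort {

    // One open chunk file plus the line currently at its front.
    private static final class ChunkCursor implements Comparable<ChunkCursor> {
        final BufferedReader reader;
        String current;

        ChunkCursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.current = reader.readLine();
        }

        // Moves to the next line; returns false once the chunk is exhausted.
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }

        @Override
        public int compareTo(ChunkCursor other) {
            return current.compareTo(other.current);
        }
    }

    static void sort(Path input, Path output, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(input, maxLinesPerChunk);
        try {
            mergeChunks(chunks, output);
        } finally {
            for (Path chunk : chunks) {
                Files.deleteIfExists(chunk); // clean up the temp files
            }
        }
    }

    // Steps 1-4: read a bounded number of lines, sort them, write them out, repeat.
    private static List<Path> splitIntoSortedChunks(Path input, int maxLinesPerChunk)
            throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer)); // last, possibly smaller chunk
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines); // Java's built-in in-memory sort
        Path chunk = Files.createTempFile("chunk-", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Step 5: k-way merge; the heap always holds the front line of each open chunk.
    private static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(reader);
                ChunkCursor cursor = new ChunkCursor(reader);
                if (cursor.current != null) {
                    heap.add(cursor);
                }
            }
            while (!heap.isEmpty()) {
                ChunkCursor smallest = heap.poll();
                out.write(smallest.current);
                out.newLine();
                if (smallest.advance()) {
                    heap.add(smallest); // re-insert with its new front line
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```

You would call it with something like ExternalSort.sort(Paths.get("big.txt"), Paths.get("big-sorted.txt"), 1_000_000), where the file names and the chunk size are placeholders. Pick the chunk size so that one chunk's worth of lines fits comfortably in your heap.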

OTHER TIPS

The most common way to handle large datasets is either in memory (you can buy a server with 1 TB of RAM these days) or in a database.

If you are not going to use a database (or buy more memory), you can write it yourself fairly easily.

There are libraries that perform Map-Reduce operations and may help, but they may add more complexity than they save.

Licensed under: CC-BY-SA with attribution