Question

In my day-to-day work I'm often working with datasets that contain millions of rows, sometimes 100s of millions, and occasionally over 1 billion. These datasets often need to be sorted. The keys are almost always large integer values (usually 9 digits). Sometimes the datasets have composite keys of a 9-digits and 3-digits.

I was wondering if it would be possible to implement a (LSD-first) radix-sort macro in SAS that could be used instead of PROC SORT to reduce time spent sorting these datasets. I've already tuned the sorts to use compression where appropriate, keep only the relevant fields (or use tagsort), size field length's appropriately, don't sort unnecessarily etc etc...

The hardware I'm using has limitations - let's assume that I only have 2G of memory available to SAS so the solution can't require putting all of the key values in a temporary array in memory (at least not all at one time).

Would the solution offer a performance improvement over proc sort? Has anyone already implemented something like this or had experience with it? Am I wasting my time?

Was it helpful?

Solution

I found some examples of a radix sort implemented (using temporary arrays) here:

http://listserv.uga.edu/cgi-bin/wa?A2=ind9901b&L=sas-l&P=11902

Using this example and some pretty loose assumptions it appears as though the new threaded sort in SAS (versions 9+) will outperform a single threaded radix sort in clock time. In CPU time the radix sort still performs better but that doesn't really count for much. So I'm going to stop pursuing this further.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top