Question

Working in C#, I would like to write an efficient sorting algorithm that would take as input a text file containing unsorted list of server and path combinations and output a sorted file.

As an exercise, I am working under assumption that the input data size will exceed available memory, so I am thinking of reading the file into memory a chunk at a time, doing a Quick sort (or a Heap sort, maybe?), outputting sorted chunks to temporary files, then doing a merge sort to produce the final output.

The format of the input file is up to my discretion. It can be just a list of UNC paths (server and path as single string) or it can be a CSV with servers and paths as separate fields.

My question is whether there is any benefit to be had from having server and path be separate entities in my data structure and evaluating them separately?

Having server and path separate would eliminate having to compare the server names during the path comparison run, but require additional run to sort by server and, given the available memory constraint, would require me to somehow cache the sorted server lists, increasing disk IO overhead.

Is there some technique I can leverage to optimize performance of such an application by providing server and path as separate fields in my input?

Any other optimization techniques that I might consider given the nature of the dataset?

EDIT: This is a one-time task. I do not need to later look up the entries

Was it helpful?

Solution

I am thinking of reading the file into memory a chunk at a time, doing a Quick sort (or a Heap sort, maybe?), outputting sorted chunks to temporary files, then doing a merge sort to produce the final output.

That's a perfectly reasonable plan.

An alternate solution would be: create an on-disk b-tree, and insert all your data one record at a time into the b-tree. You never need to have more than a few pages of the b-tree in memory and you can read the records one at a time from the unsorted list. Once it's in the b-tree, read it back out in order.

Having server and path separate would eliminate having to compare the server names during the path comparison run, but require additional run to sort by server and, given the available memory constraint, would require me to somehow cache the sorted server lists, increasing disk IO overhead.

OK.

My question is whether there is any benefit to be had from having server and path be separate entities in my data structure and evaluating them separately?

You just said what the pros and cons are. You've already listed them. Why are you asking this question if you already know the answer?

Is there some technique I can leverage to optimize performance of such an application by providing server and path as separate fields in my input?

Probably, yes.

How can I know for sure?

Write the code both ways and run it. The one that is better will be observed to be better.

Any other optimization techniques that I might consider given the nature of the dataset?

Your question and speculations are premature.

Start by setting a performance goal.

Then implement the code as clearly and correctly as you can.

Then carefully measure to see if you met your goal.

If you did, knock off early and go to the beach.

If you did not, get a profiler and use it to analyze your program to find the worst-performing part. Then optimize that part.

Keep doing that until either you meet your goal, or you give up.

OTHER TIPS

I'm certainly not going to out-answer Eric Lippert, but from a novice's perspective I wonder if you're not looking for the most complex answer first. You don't need to read the file into memory all at once with File.ReadLines...so your input is one line at a time. Use the Uri object for quick parsing of each string into it's component parts: host and path.

If you are thinking of a OO approach, then how about a 'serverUri' object that implements IComparable and having a SortedList of path strings. Make a SortedList of the serverUri objects, so that part of the string is stored only once, and for each path with that server uri, add it to the sub collection. Viola...its all sorted...spit it out to disk.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top