Question

I have an application which needs to loop through all the lines in text files that are gigabytes in size. Some of these files have tens or hundreds of millions of lines.

An example of my current (and synchronous) reading looks something like this:

  using (FileStream stream = new FileStream(args[0], FileMode.Open, FileAccess.Read, FileShare.Read)) {
    using (StreamReader streamReader = new StreamReader(stream)) {
      string line;
      while ((line = streamReader.ReadLine()) != null) { // null marks end of file; IsNullOrEmpty would stop at the first blank line
        //do stuff with the line string...
      }
    }
  }

I have read a bit about the .NET asynchronous I/O streaming methods, and I am after some help with two specific questions regarding this issue.

First, will I get a performance boost by asynchronously reading these files if I need the entirety of each line? The lines are usually short but of varying lengths, and there is no relationship between the lines in a file.

Second, how do I convert the code above into an async read, so I can still process the file line by line, as I do now?


Solution

Instead of making the line reads async, you might try making the file reads async. That is, wrap all of the code from your question in a single worker delegate:

    static void Main(string[] args)
    {
        WorkerDelegate worker = new WorkerDelegate(Worker);
        // Used for thread and result management.
        List<IAsyncResult> results = new List<IAsyncResult>();
        List<WaitHandle> waitHandles = new List<WaitHandle>();

        foreach (string file in Directory.GetFiles(args[0], "*.txt"))
        {
            // Start a new thread.
            IAsyncResult res = worker.BeginInvoke(file, null, null);
            // Store the IAsyncResult for that thread.
            results.Add(res);
            // Store the wait handle.
            waitHandles.Add(res.AsyncWaitHandle);
        }

        // Wait for all the threads to complete.
        WaitHandle.WaitAll(waitHandles.ToArray(), -1, false); // for < .Net 2.0 SP1 Compatibility

        // Gather all the results.
        foreach (IAsyncResult res in results)
        {
            try
            {
                worker.EndInvoke(res);
                // object result = worker.EndInvoke(res); // For a worker with a result.
            }
            catch (Exception ex)
            {
                // Something happened in the thread.
            }
        }
    }

    delegate void WorkerDelegate(string fileName);
    static void Worker(string fileName)
    {
        // Your code.
        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            using (StreamReader streamReader = new StreamReader(stream))
            {
                string line;
                while ((line = streamReader.ReadLine()) != null) // null marks end of file; IsNullOrEmpty would stop at the first blank line
                {
                    //do stuff with the line string...
                }
            }
        }
    }

OTHER TIPS

The async pattern is BeginRead()/EndRead().

Whether or not you get a boost depends a lot on what else is going on at the time you're doing the reads. Is there something else your app can do while waiting on the reads? If not then going async won't help much...
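To illustrate the BeginRead()/EndRead() pattern mentioned above, here is a minimal sketch; the temp file, buffer size, and anonymous callback are stand-ins of my own, not from the original answer.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;

class BeginReadSketch
{
    static void Main()
    {
        // Hypothetical sample file so the sketch is self-contained.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "line one\nline two\n");

        byte[] buffer = new byte[4096]; // arbitrary block size
        ManualResetEvent done = new ManualResetEvent(false);

        // The final 'true' requests asynchronous (overlapped) I/O.
        FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, buffer.Length, true);

        // Start the read; the callback runs once the block is available.
        stream.BeginRead(buffer, 0, buffer.Length, delegate(IAsyncResult ar)
        {
            int bytesRead = stream.EndRead(ar); // completes the operation
            Console.Write(Encoding.UTF8.GetString(buffer, 0, bytesRead));
            stream.Close();
            done.Set();
        }, null);

        // The calling thread is free to do other work while the read is pending.
        done.WaitOne();
        File.Delete(path);
    }
}
```

Note that BeginRead hands you a block of bytes, not a line; you would still have to split the buffer into lines yourself, which is part of why the solution above moves the whole file loop onto a worker instead.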

Asynchronous reads will just end up making the head seek more for each block. You'll get a better performance boost from a good defrag of the files on the filesystem and using synchronous read.

As already pointed out, dispatching the line processing to other threads should give a boost (especially on multi-core CPUs).
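One way to do that dispatch is with the ThreadPool and a simple outstanding-work counter; this is a sketch of my own, and the ProcessLine handler and sample file are placeholders.

```csharp
using System;
using System.IO;
using System.Threading;

class LineDispatchSketch
{
    static int pending = 1; // one token held by the reader loop itself
    static ManualResetEvent allDone = new ManualResetEvent(false);

    static void Main()
    {
        // Hypothetical input file so the sketch is self-contained.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new string[] { "alpha", "beta", "gamma" });

        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null) // null marks end of file
            {
                Interlocked.Increment(ref pending);
                ThreadPool.QueueUserWorkItem(ProcessLine, line);
            }
        }

        Done();            // release the reader's token
        allDone.WaitOne(); // block until every queued line has been handled
        File.Delete(path);
    }

    static void ProcessLine(object state)
    {
        string line = (string)state;
        // do stuff with the line string...
        Done();
    }

    static void Done()
    {
        if (Interlocked.Decrement(ref pending) == 0)
            allDone.Set();
    }
}
```

Queueing one work item per line is too fine-grained for hundreds of millions of lines; batching a few thousand lines per work item keeps the scheduling overhead down.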

If performance is super-critical, I would recommend investigating interop for FILE_FLAG_SEQUENTIAL_SCAN. See details here.

Better still, write a tiny C++ app that scans through the file with that flag on to see whether it improves performance.
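For what it's worth, .NET 2.0 and later can pass that flag without interop: the managed FileStream constructor accepts FileOptions.SequentialScan, which maps to FILE_FLAG_SEQUENTIAL_SCAN on Windows (and is simply a hint, ignored where the OS doesn't support it). A minimal sketch, with the buffer size and sample file being my own choices:

```csharp
using System;
using System.IO;

class SequentialScanSketch
{
    static void Main()
    {
        // Hypothetical sample file so the sketch is self-contained.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new string[] { "one", "two", "three" });

        // SequentialScan hints the OS cache manager to read ahead aggressively
        // and drop pages behind the read position -- ideal for one front-to-back pass.
        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                                  FileShare.Read, 64 * 1024,
                                                  FileOptions.SequentialScan))
        using (StreamReader reader = new StreamReader(stream))
        {
            string line;
            long count = 0;
            while ((line = reader.ReadLine()) != null)
                count++;
            Console.WriteLine("{0} lines", count);
        }
        File.Delete(path);
    }
}
```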

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow