Question

My code is:

    int linenumber = File.ReadLines(path).Count();

but it takes a long time (about 20 seconds) for files around 1 GB in size.

Does anyone know a better way to solve this problem?

Update 6:

I have tested your solutions on a file of about 870 MB:

- Method 1 (my code): 13 seconds
- Method 2 (from MarcinJuraszek & Locke, the same code): 12 seconds
- Method 3 (from Richard Deeming): 19 seconds
- Method 4 (from user2942249): 13 seconds
- Method 5 (from Locke): 13 seconds, the same for lineBuffer = {4096, 8192, 16384, 32768}
- Method 6 (from Locke, edit 2): 9 seconds with a 32 KB buffer, 10 seconds with a 64 KB buffer

As I said in my comment, there is a native-code application that opens this file on my PC in 5 seconds, so this is not about hard-drive speed.

Compiling the MSIL to native code made no obvious difference.

Conclusion: for now, Locke's method 2 is faster than the other methods, so I marked his post as the answer. The question stays open if anyone finds a better idea.

I gave +1 to the friends who helped me solve the problem. Thanks for your help; I'm still interested in better ideas.

Best regards, Smart Man


Solution

Here are a few ways this can be accomplished quickly:

StreamReader:

int lineCount = 0;
using (var sr = new StreamReader(path))
{
    // ReadLine() returns null at end of stream; empty lines still count
    while (sr.ReadLine() != null)
        lineCount++;
}

FileStream:

int lineCount = 0;
var lineBuffer = new byte[65536]; // 64 KB
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
       FileShare.Read, lineBuffer.Length))
{
    int readBuffer = 0;
    while ((readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0)
    {
        for (int i = 0; i < readBuffer; i++)
        {
            if (lineBuffer[i] == 0x0D) // carriage return ('\r'); lines ending only in '\n' are not counted
                lineCount++;
        }
    }
}

Multithreading:

Arguably the number of threads shouldn't affect the read speed, but real-world benchmarking can sometimes prove otherwise. Try different buffer sizes and see whether you get any gains at all with your setup. *This method contains a race condition; use with caution.

int lineCount = 0;
var tasks = new Task[Environment.ProcessorCount]; // 1 per core
var fileLock = new ReaderWriterLockSlim();
int bufferSize = 65536; // 64 KB

using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
        FileShare.Read, bufferSize, FileOptions.RandomAccess))
{
    for (int i = 0; i < tasks.Length; i++)
    {
        tasks[i] = Task.Factory.StartNew(() =>
            {
                int readBuffer = 0;
                var lineBuffer = new byte[bufferSize];

                // Note: the read lock is shared, so reads from the same stream
                // can still overlap; this is the race condition mentioned above.
                while ((fileLock.TryEnterReadLock(10) &&
                       (readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0))
                {
                    fileLock.ExitReadLock();
                    for (int n = 0; n < readBuffer; n++)
                        if (lineBuffer[n] == 0xD) // carriage return
                            Interlocked.Increment(ref lineCount);
                }
            });
    }
    Task.WaitAll(tasks);
}

OTHER TIPS

This is hardware dependent. One question is what the best buffer size is, perhaps something equal to or greater than the disk sector size; after experimenting myself, I've found it's usually best to let the system determine that. If speed really is a concern, you can drop down to the Win32 API (ReadFile/CreateFile) and specify various flags and parameters such as asynchronous I/O, no buffering, or sequential reads, which may or may not improve performance; you'll have to profile and see what works best on your system. In .NET you may also be able to pin the buffer for better performance, although pinning memory in a GC environment has other ramifications; it can be acceptable if you don't keep the buffer pinned for too long.

    const int bufsize = 4096;
    int lineCount = 0;
    Byte[] buffer = new Byte[bufsize];
    using (System.IO.FileStream fs = new System.IO.FileStream(@"C:\data\log\20111018.txt", FileMode.Open, FileAccess.Read, FileShare.None, bufsize))
    {
        int totalBytesRead = 0;
        int bytesRead;
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0) {
            int i = 0;
            while (i < bytesRead)
            {
                switch (buffer[i])
                {
                    case 10: // line feed ('\n')
                        {
                            lineCount++;
                            i++;
                            break;
                        }
                    case 13: // carriage return ('\r'), possibly followed by '\n'
                        {
                            int index = i + 1;
                            if (index < bytesRead)
                            {
                                if (buffer[index] == 10)
                                {
                                    // CR+LF pair: count once and skip the LF.
                                    lineCount++;
                                    i += 2;
                                }
                                else
                                {
                                    // Bare CR with no following LF: count it as a line end.
                                    lineCount++;
                                    i++;
                                }
                            }
                            else
                            {
                                // CR is the last byte of this read; if an LF follows in the
                                // next read, it will be counted by case 10 instead.
                                i++;
                            }
                            break;
                        }
                    default:
                        {
                            i++;
                            break;
                        }
                }
            }
            totalBytesRead += bytesRead;
        }
        // A non-empty file with no line terminators still counts as one line.
        if ((totalBytesRead > 0) && (lineCount == 0))
            lineCount++;
    }
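
For the sequential-read hint mentioned above, .NET exposes FileOptions.SequentialScan, the managed counterpart of Win32's FILE_FLAG_SEQUENTIAL_SCAN. The following is a minimal, unbenchmarked sketch of how it might be combined with a simple line-feed count; the identifiers and the 64 KB buffer size are placeholders to profile, not recommendations:

    // Minimal sketch: open the file with a sequential-scan hint and count '\n' bytes.
    // FileOptions.SequentialScan corresponds to Win32 FILE_FLAG_SEQUENTIAL_SCAN.
    const int bufSize = 65536; // 64 KB; worth profiling on your own hardware
    int lfCount = 0;
    var readBuf = new byte[bufSize];
    using (var fs = new FileStream(@"C:\data\log\20111018.txt", FileMode.Open, FileAccess.Read,
                                   FileShare.Read, bufSize, FileOptions.SequentialScan))
    {
        int n;
        while ((n = fs.Read(readBuf, 0, readBuf.Length)) > 0)
        {
            for (int i = 0; i < n; i++)
                if (readBuf[i] == 0x0A) // line feed ('\n')
                    lfCount++;
        }
    }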

As your tests showed, changes in code aren't going to have a significant effect on the speed. The bottleneck is your disk reading the data, not the C# code processing it.

If you want to speed up this task, buy a faster hard drive: one with a higher RPM, or better yet a solid-state drive. Alternatively, you could consider RAID 0, which could potentially improve your disk read speeds.

Another option would be to use multiple hard drives and break the file up so that each drive stores one portion; you can then parallelize the work, with one task handling the portion on each drive, as in the sketch below. (Note that parallelizing the work when you only have one disk won't help anything, and is more likely to actually hurt.)
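
A rough sketch of that idea, assuming the file has already been split on line boundaries into parts stored on different physical drives (the part paths below are placeholders), might look like this; any of the faster counting methods above could of course be substituted for File.ReadLines().Count() inside each task:

    // Hypothetical: each part of the file lives on a different physical drive,
    // so the reads can genuinely proceed in parallel.
    // Requires System.IO, System.Linq and System.Threading.Tasks.
    string[] partPaths = { @"D:\parts\log.part1", @"E:\parts\log.part2" }; // placeholders

    var countTasks = partPaths
        .Select(p => Task.Run(() => File.ReadLines(p).Count()))
        .ToArray();

    Task.WaitAll(countTasks);
    int totalLines = countTasks.Sum(t => t.Result);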

Assuming that building a string to represent each line is what's taking the time, something like this might help:

public static int CountLines1(string path)
{
   int lineCount = 0;
   bool skipNextLineBreak = false;
   bool startedLine = false;
   var buffer = new char[16384];
   int readChars;

   using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length))
   using (var reader = new StreamReader(stream, Encoding.UTF8, false, buffer.Length, false))
   {
      while ((readChars = reader.Read(buffer, 0, buffer.Length)) > 0)
      {
         for (int i = 0; i < readChars; i++)
         {
            switch (buffer[i])
            {
               case '\n':
               {
                  if (skipNextLineBreak)
                  {
                     skipNextLineBreak = false;
                  }
                  else
                  {
                     lineCount++;
                     startedLine = false;
                  }
                  break;
               }
               case '\r':
               {
                  lineCount++;
                  skipNextLineBreak = true;
                  startedLine = false;
                  break;
               }
               default:
               {
                  skipNextLineBreak = false;
                  startedLine = true;
                  break;
               }
            }
         }
      }
   }

   return startedLine ? lineCount + 1 : lineCount;
}

Edit 2:
It's true what they say about "assume"! The overhead of calling .Read() for each character outweighs the savings from not creating a string for each line. Even updating the code to read a block of characters at a time is still slower than the original method.
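
For context, the per-character version referred to here would have looked roughly like the sketch below (a reconstruction for illustration, not the code actually posted); the cost is one method call per character:

    // Illustration only: one Read() call per character. The per-call overhead
    // is what makes this slower than reading blocks into a buffer.
    int lineCount = 0;
    using (var reader = new StreamReader(path))
    {
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == '\n')
                lineCount++;
        }
    }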

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow