Question

I am trying to let users upload large files. Before I upload a file, I want to chunk it up. Each chunk needs to be a C# object; the reason is logging. It's a long story, but I need to create actual C# objects that represent each file chunk. Regardless, I'm trying the following approach:

public static List<FileChunk> GetAllForFile(byte[] fileBytes)
{
  List<FileChunk> chunks = new List<FileChunk>();
  if (fileBytes.Length > 0)
  {
    FileChunk chunk = new FileChunk();
    for (int i = 0; i < (fileBytes.Length / 512); i++)
    {
      chunk.Number = (i + 1);
      chunk.Offset = (i * 512);
      chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();

      chunks.Add(chunk);
      chunk = new FileChunk();
    }
  }
  return chunks;
}

Unfortunately, this approach seems to be incredibly slow. Does anyone know how I can improve the performance while still creating objects for each chunk?

thank you

Solution

I suspect this is going to hurt a little:

chunk.Bytes = fileBytes.Skip(chunk.Offset).Take(512).ToArray();

Try this instead:

byte[] buffer = new byte[512];
Buffer.BlockCopy(fileBytes, chunk.Offset, buffer, 0, 512);
chunk.Bytes = buffer;

(Code not tested)

The reason your code is likely slow is that Skip doesn't do anything special for arrays (though it could). That means every pass through your loop re-iterates the first 512*n items in the array, which gives you O(n^2) performance where you should just be seeing O(n).
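
To make that concrete, here is a sketch of the whole method rewritten around Buffer.BlockCopy (untested; it assumes the FileChunk shape from the question, and unlike the original loop it also keeps a trailing partial chunk):

public static List<FileChunk> GetAllForFile(byte[] fileBytes)
{
  List<FileChunk> chunks = new List<FileChunk>();
  int offset = 0;
  int number = 1;
  while (offset < fileBytes.Length)
  {
    // Copy at most 512 bytes; the final chunk may be shorter.
    int size = Math.Min(512, fileBytes.Length - offset);
    byte[] buffer = new byte[size];
    Buffer.BlockCopy(fileBytes, offset, buffer, 0, size);

    FileChunk chunk = new FileChunk();
    chunk.Number = number++;
    chunk.Offset = offset;
    chunk.Bytes = buffer;
    chunks.Add(chunk);

    offset += size;
  }
  return chunks;
}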

OTHER TIPS

Try something like this (untested code):

public static List<FileChunk> GetAllForFile(string fileName)
{
  var chunks = new List<FileChunk>();
  using (FileStream stream = new FileStream(fileName, FileMode.Open))
  {
      int i = 0;
      while (stream.Position < stream.Length)
      {
          var chunk = new FileChunk();
          chunk.Number = i;
          chunk.Offset = (i * 512);

          // Read directly from the file into the chunk's buffer.
          var buffer = new byte[512];
          int bytesRead = stream.Read(buffer, 0, 512);
          if (bytesRead < 512)
          {
              Array.Resize(ref buffer, bytesRead);
          }
          chunk.Bytes = buffer;

          chunks.Add(chunk);
          i++;
      }
  }
  return chunks;
}

The above code skips several steps in your process, preferring to read the bytes from the file directly.

Note that, if the file is not an even multiple of 512, the last chunk will contain less than 512 bytes.

Same as Robert Harvey's answer, but using a BinaryReader so that I don't need to specify an offset. If you use a BinaryWriter on the other end to reassemble the file, you won't need the Offset member of FileChunk.

public static List<FileChunk> GetAllForFile(string fileName) {
    var chunks = new List<FileChunk>();
    using (FileStream stream = new FileStream(fileName, FileMode.Open)) {
        BinaryReader reader = new BinaryReader(stream);
        int i = 0;
        bool eof = false;
        while (!eof) {
            var chunk = new FileChunk();
            chunk.Number = i;
            chunk.Offset = (i * 512);
            chunk.Bytes = reader.ReadBytes(512);
            // ReadBytes returns fewer than 512 bytes at the end of the file;
            // skip the empty chunk you'd otherwise get when the file length is an exact multiple of 512.
            if (chunk.Bytes.Length > 0) {
                chunks.Add(chunk);
                i++;
            }
            if (chunk.Bytes.Length < 512) { eof = true; }
        }
    }
    return chunks;
}
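
For illustration, here is a minimal, untested sketch of reassembling the file on the other end with a BinaryWriter (WriteAllToFile is just an illustrative name; it assumes the chunks are still in order, which is why the Offset member isn't needed):

public static void WriteAllToFile(List<FileChunk> chunks, string fileName) {
    using (FileStream stream = new FileStream(fileName, FileMode.Create))
    using (BinaryWriter writer = new BinaryWriter(stream)) {
        // The chunks are written back in sequence, so no seeking by Offset is required.
        foreach (var chunk in chunks) {
            writer.Write(chunk.Bytes);
        }
    }
}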

Have you thought about what you're going to do to compensate for packet loss and data corruption?
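
If you do decide to guard against corruption, one simple option is to hash each chunk as you create it and verify the hash after upload. This is only a sketch; the Checksum member it fills in is hypothetical and not part of the FileChunk shown in the question:

// Hypothetical: assumes FileChunk has a byte[] Checksum member added to it.
public static void AddChecksum(FileChunk chunk) {
    using (var sha = System.Security.Cryptography.SHA256.Create()) {
        chunk.Checksum = sha.ComputeHash(chunk.Bytes);
    }
}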

Since you mentioned that loading is taking a long time, I would use asynchronous file reading to speed up the loading process. The hard disk is usually the slowest component of a computer. Google uses asynchronous reads and writes in Google Chrome to improve its load times. I had to do something like this in C# in a previous job.

The idea would be to spawn several asynchronous requests over different parts of the file. Then, when a request completes, take the byte array and create your FileChunk objects 512 bytes at a time. There are several benefits to this:

  1. If you run this in a separate thread, the whole program won't be stuck waiting for the large file to load.
  2. You can process a byte array, creating FileChunk objects, while the hard disk is still fulfilling read requests on other parts of the file.
  3. You will save RAM if you limit the number of pending read requests. This reduces page faulting to the hard disk and uses the RAM and CPU cache more efficiently, which speeds up processing further.

You would want to use the following methods in the FileStream class.

[HostProtectionAttribute(SecurityAction.LinkDemand, ExternalThreading = true)]
public virtual IAsyncResult BeginRead(
    byte[] buffer,
    int offset,
    int count,
    AsyncCallback callback,
    Object state
)

public virtual int EndRead(
    IAsyncResult asyncResult
)

Also, this is what you will do with the IAsyncResult in your callback:

// Extract the FileStream (state) out of the IAsyncResult object
FileStream fs = (FileStream) ar.AsyncState;

// Get the result
Int32 bytesRead = fs.EndRead(ar);
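
To tie those pieces together, here is a rough, untested sketch of the pattern: open the stream for asynchronous access, issue a BeginRead, and slice the bytes into FileChunk objects inside the callback. The names BeginChunkingFile and OnReadComplete are just illustrative:

public static void BeginChunkingFile(string fileName)
{
    // Open the stream with useAsync so BeginRead is truly asynchronous.
    var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read,
                                FileShare.Read, 4096, true);
    var buffer = new byte[stream.Length];

    // Kick off the asynchronous read; the callback runs when it completes.
    stream.BeginRead(buffer, 0, buffer.Length, OnReadComplete,
                     Tuple.Create(stream, buffer));
}

private static void OnReadComplete(IAsyncResult ar)
{
    // Extract the state (the stream and its buffer) out of the IAsyncResult.
    var state = (Tuple<FileStream, byte[]>)ar.AsyncState;
    FileStream stream = state.Item1;
    byte[] buffer = state.Item2;

    // Get the result; a single read is not guaranteed to fill the whole buffer.
    int bytesRead = stream.EndRead(ar);
    stream.Dispose();

    // Slice the bytes that were read into 512-byte FileChunk objects.
    var chunks = new List<FileChunk>();
    for (int offset = 0, number = 0; offset < bytesRead; offset += 512, number++)
    {
        int size = Math.Min(512, bytesRead - offset);
        var bytes = new byte[size];
        Buffer.BlockCopy(buffer, offset, bytes, 0, size);
        chunks.Add(new FileChunk { Number = number, Offset = offset, Bytes = bytes });
    }
    // Hand the chunks off to the upload and logging code from here.
}

A fuller version would issue several overlapping BeginRead calls at different offsets (seeking before each one) and cap how many are pending at once, as described in the list above.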

Here is some reference material for you to read: a code sample of working with Asynchronous File I/O Models, and the MS documentation reference for Asynchronous File I/O.

Licensed under: CC-BY-SA with attribution