Combining FileStream and MemoryStream to avoid disk accesses/paging while receiving gigabytes of data?

StackOverflow https://stackoverflow.com/questions/19687295

Question

I'm receiving a file as a stream of byte[] data packets (the total size isn't known in advance) that I need to store somewhere before processing it; processing starts immediately after the file has been fully received (I can't process it on the fly). The total file size can vary from as small as 10 KB to over 4 GB.

  • One option for storing the received data is a MemoryStream, i.e. a sequence of MemoryStream.Write(bufferReceived, 0, count) calls to store the received packets. This is very simple, but it will obviously throw an OutOfMemoryException for large files.
  • An alternative option is a FileStream, i.e. FileStream.Write(bufferReceived, 0, count). This way, no out-of-memory exceptions will occur, but what I'm unsure about is the performance cost of disk writes (which I don't want to occur as long as plenty of memory is still available). I'd like to avoid disk access as much as possible, but I don't know of a way to control this. Both options are sketched below.
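
A minimal sketch of the two options, assuming a hypothetical in-memory packet source (in reality the byte[] packets would arrive from the network):

using System.IO;

class ReceiveSketch
{
    // Stand-in packet source; in reality the packets arrive from the
    // network and the total size is unknown in advance.
    static readonly byte[][] packets = { new byte[] { 1, 2, 3 }, new byte[] { 4, 5 } };

    static void Main()
    {
        // Option 1: MemoryStream. Simple, but a multi-GB file will
        // eventually throw an OutOfMemoryException.
        using (var ms = new MemoryStream())
        {
            foreach (var p in packets)
                ms.Write(p, 0, p.Length);
        }

        // Option 2: FileStream. No OOM risk; the writes land in the OS
        // file system cache first, not necessarily on the disk itself.
        using (var fs = new FileStream(Path.GetTempFileName(), FileMode.Create))
        {
            foreach (var p in packets)
                fs.Write(p, 0, p.Length);
        }
    }
}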

I did some testing, and most of the time there seems to be little performance difference between, say, 10,000 consecutive calls of MemoryStream.Write() vs. FileStream.Write(), but a lot seems to depend on the buffer size and the total amount of data in question (i.e. the number of writes). Obviously, MemoryStream's internal array reallocation is also a factor.

  1. Does it make sense to use a combination of MemoryStream and FileStream, i.e. write to a MemoryStream by default, but once the total amount of data received exceeds e.g. 500 MB, write the rest to a FileStream; then read in chunks from both streams for processing (first process the 500 MB from the MemoryStream, dispose it, then read from the FileStream)? See the sketch after this list.

  2. Another solution is a custom memory stream implementation that doesn't require contiguous address space for its internal array allocation (i.e. a linked list of memory streams); this way, at least in 64-bit environments, out-of-memory exceptions should no longer be an issue. Con: extra work, more room for mistakes.
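
A rough sketch of idea 1, under the question's own assumptions (a 500 MB threshold, process the in-memory head first, then the on-disk tail). SplitBuffer is a hypothetical name, not an existing class:

using System;
using System.IO;

// Hypothetical write-only buffer: the first 500 MB stays in memory,
// everything past the threshold spills to a temp file on disk.
class SplitBuffer : IDisposable
{
    const long Threshold = 500L * 1024 * 1024;   // 500 MB, as in the question
    readonly MemoryStream head = new MemoryStream();
    FileStream tail;

    public void Write(byte[] buffer, int offset, int count)
    {
        // Fill the in-memory head up to the threshold...
        int toMemory = (int)Math.Min(count, Math.Max(0L, Threshold - head.Length));
        if (toMemory > 0)
            head.Write(buffer, offset, toMemory);

        // ...and route the rest of this and all later packets to disk.
        if (toMemory < count)
        {
            if (tail == null)
                tail = new FileStream(Path.GetTempFileName(), FileMode.Create);
            tail.Write(buffer, offset + toMemory, count - toMemory);
        }
    }

    // Process the in-memory part first, then stream the rest from disk.
    public void ProcessAll(Action<byte[], int> process)
    {
        var chunk = new byte[64 * 1024];
        int n;
        head.Position = 0;
        while ((n = head.Read(chunk, 0, chunk.Length)) > 0)
            process(chunk, n);
        if (tail != null)
        {
            tail.Position = 0;
            while ((n = tail.Read(chunk, 0, chunk.Length)) > 0)
                process(chunk, n);
        }
    }

    public void Dispose()
    {
        head.Dispose();
        if (tail != null) tail.Dispose();
    }
}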

So: how do FileStream and MemoryStream reads/writes behave in terms of disk access and memory caching, i.e. the data size/performance balance? I would expect that as long as enough RAM is available, FileStream would internally read/write from memory (the file system cache) anyway, with virtual memory taking care of the rest. But I don't know how often FileStream will explicitly access the disk when being written to.

Any help would be appreciated.


Solution

No, trying to optimize this doesn't make any sense. Windows itself already caches file writes; they are buffered by the file system cache. So your test is about right: both MemoryStream.Write() and FileStream.Write() actually write to RAM and have no significant performance difference. The file system driver lazily writes the data to disk in the background.

The RAM used for the file system cache is whatever is left over after processes have claimed their RAM needs. By using a MemoryStream, you reduce the effectiveness of the file system cache. In other words, you trade one for the other without benefit. You're in fact worse off: you use double the amount of RAM.

Don't try to help; this is already heavily optimized inside the operating system.
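
For completeness: if you ever need the bytes physically on disk at a known point (e.g. before acknowledging receipt), FileStream lets you force the write-behind cache to flush. A minimal sketch, not something this answer recommends doing routinely:

using System.IO;

var payload = new byte[] { 1, 2, 3 };
using (var fs = new FileStream("data.bin", FileMode.Create))
{
    fs.Write(payload, 0, payload.Length); // lands in the file system cache
    fs.Flush(flushToDisk: true);          // forces the cached data to disk
}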

Other tips

Since recent versions of Windows enable write caching by default, I'd say you could simply use FileStream and let Windows manage when or if anything actually is written to the physical hard drive.

If these files don't stick around after you've received them, you should probably write the files to a temp directory and delete them when you're done with them.
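
A sketch of that temp-file pattern, using FileOptions.DeleteOnClose so the cleanup happens automatically when the stream is disposed:

using System.IO;

var path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
using (var temp = new FileStream(path,
    FileMode.Create, FileAccess.ReadWrite, FileShare.None,
    4096, FileOptions.DeleteOnClose))
{
    // write the received packets here, rewind, process...
}   // the OS deletes the file here, so there's no cleanup step to forget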

Use a FileStream constructor that allows you to define the buffer size. For example:

// 64 KB buffer: fewer, larger calls into the file system.
using (var outputFile = new FileStream("filename",
    FileMode.Create, FileAccess.Write, FileShare.None, 65536))
{
    // write the received packets here
}

The default buffer size is 4K. Using a 64K buffer reduces the number of calls into the file system. A larger buffer will reduce the number of writes, but each write starts to take longer. Empirical data (many years of working with this stuff) indicates that 64K is a very good choice.

As somebody else pointed out, the file system will likely do further caching, and do the actual disk write in the background. It's highly unlikely that you'll receive data faster than you can write it to a FileStream.

Licensed under: CC-BY-SA with attribution