Question

On a modern system can local hard disk write speeds be improved by compressing the output stream?

This question comes from a case I'm working on where a program serially generates and dumps around 1-2 GB of text logging data to a raw text file on the hard disk, and I think it is I/O bound. Would I expect to decrease runtimes by compressing the data before it goes to disk, or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?

I know this would be affected by how much CPU is being used to generate the data, so rules of thumb on how much idle CPU time would be needed would be welcome.


I recall a video of a talk where someone used compression to improve read speeds for a database, but IIRC compressing is a lot more CPU-intensive than decompressing.


Solution

This depends on lots of factors and I don't think there is one correct answer. It comes down to this:

Given the CPU bandwidth you have available to dedicate to this purpose, can you compress the raw data faster than your disk's raw write speed multiplied by the compression ratio you are achieving (or the speed multiple you are trying to get)?
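Restating that condition as an inequality (my paraphrase, not the answer's own notation):

\[
\text{compression throughput (raw bytes/s)} \;\ge\; \text{disk write speed (bytes/s)} \times \text{compression ratio}
\]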

Given today's relatively high write rates in the tens of MB/s, this is a pretty high hurdle to get over. To the point of some of the other answers, you would likely need easily compressible data, and you would just have to benchmark it with a few sanity-check experiments to find out.

As for the point about additional cores, a specific opinion (guess!?): if you thread the compression of the data and keep the core(s) fed, then given the high compression ratio of text, it is likely such a technique would bear some fruit. But this is just a guess. In a single-threaded application alternating between disk writes and compression operations, it seems much less likely to me.

OTHER TIPS

Yes, yes, yes, absolutely.

Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it, time a huge fwrite or something.) Let's say 100 MB/s. Now take your CPU speed in megahertz; let's say 3 GHz = 3000 MHz. Divide the CPU speed by the disk write speed: that is the number of cycles the CPU sits idle per byte written, which you can spend on compression instead. In this case 3000/100 = 30 cycles per byte.
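As an aside, a rough way to get that write-speed number is to time one big fwrite. This is just an untested sketch; the file name and buffer size are arbitrary placeholders, and the OS page cache can still flatter the result:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Rough estimate of contiguous write speed: time one large fwrite. */
int main(void)
{
    const size_t size = 256u * 1024 * 1024;      /* 256 MB test buffer */
    char *buf = malloc(size);
    if (!buf) return 1;
    memset(buf, 'x', size);

    FILE *f = fopen("write_test.tmp", "wb");
    if (!f) { free(buf); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fwrite(buf, 1, size, f);
    fclose(f);               /* flushes libc buffers; the OS cache may still help */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("~%.1f MB/s\n", size / (1024.0 * 1024.0) / secs);

    free(buf);
    remove("write_test.tmp");
    return 0;
}
```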

If you had an algorithm that could compress your data to 80% of its original size (a 1.25:1 ratio), for an effective 125 MB/s write speed, you would have 24 cycles per byte to run it in, and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.

We do this all the time when reading optical media.

If you have an idle second core it's even easier. Just hand off the log buffer to that core's thread and it can take as long as it likes to compress the data since it's not doing anything else! The only tricky bit is you want to actually have a ring of buffers so that you don't have the producer thread (the one making the log) waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.
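A minimal single-producer/single-consumer sketch of that ring of buffers, using pthreads; the slot count, buffer size, and file names are illustrative choices, and the compression step is left as a placeholder:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define RING_SLOTS 4
#define BUF_SIZE   (64 * 1024)

typedef struct { char data[BUF_SIZE]; size_t len; } log_buf;

static log_buf ring[RING_SLOTS];
static int head = 0, tail = 0, count = 0, done = 0;   /* all guarded by lock */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Producer: copy a finished chunk of log text into the next free slot.
 * It only waits when every slot is full, never for the slot being written out. */
void submit_chunk(const char *text, size_t len)
{
    pthread_mutex_lock(&lock);
    while (count == RING_SLOTS)
        pthread_cond_wait(&not_full, &lock);
    memcpy(ring[head].data, text, len);
    ring[head].len = len;
    head = (head + 1) % RING_SLOTS;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Consumer: compress and write each chunk at its own pace.
 * (Compression would go where the fwrite is; this sketch writes raw data.) */
void *writer_thread(void *arg)
{
    FILE *out = arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (count == 0 && done) { pthread_mutex_unlock(&lock); return NULL; }
        log_buf *b = &ring[tail];
        pthread_mutex_unlock(&lock);

        fwrite(b->data, 1, b->len, out);

        pthread_mutex_lock(&lock);    /* free the slot only after we're done with it */
        tail = (tail + 1) % RING_SLOTS;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
}

int main(void)
{
    FILE *out = fopen("run.log", "wb");
    if (!out) return 1;
    pthread_t writer;
    pthread_create(&writer, NULL, writer_thread, out);

    for (int i = 0; i < 1000; i++) {
        char line[64];
        int n = snprintf(line, sizeof line, "event %d: something happened\n", i);
        submit_chunk(line, (size_t)n);
    }

    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
    pthread_join(writer, NULL);
    fclose(out);
    return 0;
}
```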

Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.

For speed, gzip/zlib compression at lower quality levels is pretty fast; if that's not fast enough you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send output through gzip.
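Something like this untested sketch, where the output file name and log contents are just placeholders; gzip runs in its own process, so it can sit on whatever core is free:

```c
#include <stdio.h>

/* Pipe the log stream through gzip (-1 = fastest level) in a child process. */
int main(void)
{
    FILE *log = popen("gzip -1 > run.log.gz", "w");
    if (!log) return 1;

    for (int i = 0; i < 1000000; i++)
        fprintf(log, "event %d: something happened\n", i);

    pclose(log);   /* waits for gzip to finish flushing */
    return 0;
}
```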

For what it is worth, Sun's ZFS filesystem can have on-the-fly compression enabled to decrease the amount of disk I/O without a significant increase in overhead, as an example of this in practice.

The Filesystems and Storage Lab at Stony Brook published a rather extensive performance (and energy) evaluation of file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.

The results depend on

  • the compression algorithm and settings used,
  • the file workload and
  • the characteristics of your machine.

For example, in the measurements from the paper, on a textual workload in a server environment, using lzop with low compression effort is faster than a plain write, but bzip and gz aren't.

In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.

CPU speed has grown at a faster rate than hard drive access speed. Even back in the '80s, many compressed files could be read off the disk and uncompressed in less time than it took to read the original (uncompressed) file. That will not have changed.

Generally though, these days the compression/de-compression is handled at a lower level than you would be writing, for example in a database I/O layer.

As for the usefulness of a second core: it only counts if the system will also be doing a significant number of other things, and your program would have to be multi-threaded to take advantage of the additional CPU.

Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.
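A hypothetical before/after sketch of what that looks like (the record layout, field names, and file names are made up for illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* Fixed-size binary record instead of a formatted text line:
 * less data hits the disk and there is no number-to-text conversion per entry. */
typedef struct {
    uint64_t timestamp_us;
    uint32_t event_id;
    double   value;
} log_record;

void log_binary(FILE *f, uint64_t ts, uint32_t id, double v)
{
    log_record r = { ts, id, v };
    fwrite(&r, sizeof r, 1, f);   /* ~24 bytes with padding, no formatting cost */
}

void log_text(FILE *f, uint64_t ts, uint32_t id, double v)
{
    /* The equivalent text line is typically a few times larger and
     * pays for the formatting on every call. */
    fprintf(f, "%llu event=%u value=%f\n", (unsigned long long)ts, id, v);
}

int main(void)
{
    FILE *bin = fopen("run.bin", "wb");
    FILE *txt = fopen("run.txt", "w");
    if (!bin || !txt) return 1;
    for (uint32_t i = 0; i < 1000; i++) {
        log_binary(bin, 1000000u * i, i, i * 0.5);
        log_text(txt, 1000000u * i, i, i * 0.5);
    }
    fclose(bin);
    fclose(txt);
    return 0;
}
```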

Windows already supports file compression in NTFS, so all you have to do is set the "Compressed" flag in the file attributes. You can then measure whether it was worth it or not.
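From the command line, `compact /c run.log` (file name is an example) does this; programmatically, a sketch along these lines using the Win32 FSCTL_SET_COMPRESSION control code should set the same attribute, though I haven't benchmarked it:

```c
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

/* Enable NTFS compression on an existing file (the "Compressed" attribute). */
int main(void)
{
    HANDLE h = CreateFileA("run.log", GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "open failed: %lu\n", GetLastError());
        return 1;
    }

    USHORT format = COMPRESSION_FORMAT_DEFAULT;
    DWORD  returned = 0;
    if (!DeviceIoControl(h, FSCTL_SET_COMPRESSION,
                         &format, sizeof format, NULL, 0, &returned, NULL))
        fprintf(stderr, "FSCTL_SET_COMPRESSION failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}
```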

If it's just text, then compression could definitely help. Just choose a compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2" and both have parameters that you can tweak to favor speed or compression ratio.
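For instance, zlib's gzFile interface lets you ask for the fastest level directly; a minimal sketch (the output file name and log contents are placeholders):

```c
#include <zlib.h>
#include <stdio.h>

/* Write the log through zlib's gzFile interface at compression level 1
 * ("wb1"), trading ratio for speed. */
int main(void)
{
    gzFile out = gzopen("run.log.gz", "wb1");
    if (!out) return 1;

    for (int i = 0; i < 1000000; i++)
        gzprintf(out, "event %d: something happened\n", i);

    gzclose(out);
    return 0;
}
```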

If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.

If you have an idle 2 GHz core, and a relatively fast 100 MB/s streaming hard drive, halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data. With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
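That 10-cycles-per-byte figure is just arithmetic (my reconstruction of the estimate): to halve the run time on N raw bytes, the disk only sees N/2 bytes, and the compressor on the idle 2 GHz core has to finish inside that write window:

\[
t_{\text{write}} = \frac{N/2}{100\ \text{MB/s}}, \qquad
\frac{\text{cycles}}{\text{byte}} \le \frac{2\times10^{9}\ \text{Hz}\cdot t_{\text{write}}}{N}
= \frac{2\times10^{9}}{2\times10^{8}} = 10 .
\]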

I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte, and compresses typical English text about 2:1. At the upper end (20 instructions per byte), you're right on the edge between I/O bound and CPU bound. At the middle and lower end, you're still I/O bound, so there are a few cycles available (not many) for a slightly more sophisticated compressor to ponder the data a little longer.

If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.) then you have even more time for a more sophisticated compressor to ponder the data.

You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed. That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.

I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.

If this particular machine is often IO bound, another way to speed it up is to install a RAID array. That would give a speedup to every program and every kind of data (even incompressible data).

For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.

The nearly as popular RAID 5 configuration, with the same 4 total disks, gives a speedup of nearly 3x.

It is relatively straightforward to set up a RAID array with a speed 8x the speed of a single drive.

High compression ratios, on the other hand, are apparently not so straightforward. Compression of "merely" 6.30 to one would give you a cash prize for breaking the current world record for compression (Hutter Prize).

This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible.

However, as Shog9 commented:

Rules of thumb aren't going to help you here. It's your disk, your CPU, and your data. Set up a test case and measure throughput and CPU load with and without compression - see if it's worth the tradeoff.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow