Possible to implement journaling with a single fsync per commit?

https://stackoverflow.com/questions/3800108

25-09-2019
|

Question

Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?

The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).

If it makes a difference, mainly wondering about ext3/ext4 as the context.

Solution

Note that linux's and mac os's fsync and fdatasync are incorrect by default. Windows is correct by default, but can emulate linux for benchmarking purposes.

Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.

If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]

Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.

Finally, real systems use group commit, and do < 1 sync write per commit with concurrent workloads.

OTHER TIPS

There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.

If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow