Question

Suppose I'm renaming a file and the electricity goes off right in the middle. Naively, it looks like the file could be in some “half-renamed” state. Maybe the name would have half the old name and half the new name. Or maybe the file could disappear altogether because it no longer exists under the old name but it doesn't exist yet under the new name.

How can a filesystem protect against loss of power while a file is renamed? Are these just theoretical techniques or are they used in practice?

Solution

Most filesystems and filesystem implementations are designed to be resilient to a loss of power. When you rename a file, on most operating systems, you can expect to always see the file either under its old name or under its new name. If there is any intermediate stage where the file is, so to speak, half-renamed, the operating system hides it from you, even if the machine crashes at a bad time. Renaming a file is, on most filesystems, an atomic operation.
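Applications commonly rely on this guarantee with the write-to-temporary-then-rename pattern: write the new content to a scratch file, then rename it over the destination, so a crash leaves either the complete old file or the complete new file. A minimal Python sketch (the filename and helper name are invented for the example; `os.replace` is Python's portable atomic-rename wrapper, atomic with respect to crashes on most journaling filesystems):

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    # Write to a temporary file in the same directory (hence the same
    # filesystem), then rename it over the destination.  A crash leaves
    # either the old file or the new file, never a half-written one.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force the data to disk before the rename
        os.replace(tmp, path)      # rename(2) on POSIX; atomic replace
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write("config.txt", b"new contents\n")
```

Note the `fsync` before the rename: without it, some filesystems could commit the rename before the file's contents reach the disk.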

The precise method for ensuring atomicity varies from filesystem to filesystem and may also depend on other system components such as the storage hardware and the disk driver. In this answer I'll describe some common techniques at a high level. If you look at an actual filesystem implementation, you're likely to find that things are a lot more complicated.

Most storage media guarantee that writes are atomic at the level of a block. A typical block size is 512 bytes; the guarantee is often provided by the disk firmware. Here are some common techniques for detecting or preventing data corruption due to a power failure:

  • One possibility to protect against power failures is to have a small power reserve that is sufficient to complete the writing of the current block.
  • Another method is that the disk stores the block content followed by a checksum of the block: if the new content of the block is only half-written, the checksum will be wrong, and the block content is lost but the operating system will not re-read bad data.
  • A third method is to maintain a dirty bit: flip it from 0 to 1, do the actual write, then flip the bit back to 0, so that if the dirty bit is 0 then the block content is known to be good.

Let's assume for a moment that the file name is stored entirely within one block. Then what the filesystem driver needs to do to rename the file is to update that block with the new name.

  • If the storage medium guarantees that block writes are atomic, then the filesystem driver simply writes the new block content. If the write goes through, the file will have its new name; if a power failure happens, the file will retain its old name, meaning that the power failure effectively happened before the instruction to rename the file was carried out.
  • If writing a block may damage its existing content, the filesystem driver instead writes a new block containing the new name at a fresh location. Once that block has been written successfully, the driver updates the “parent” block that records the location of the name block so that it points to the new location. The parent block's own location must then be updated in its parent, and so on.
    Such a chain has to stop somewhere. One way to stop is to have a block with two possible locations L1 and L2. Maintain a version number for the block, stored inside the block. Increment the version number each time the block is written. At any time, whichever of L1 and L2 has the highest version number is the real location.
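The two-location scheme can be sketched as a toy in Python (the `disk` dictionary, location names and contents are invented for the example):

```python
# A "root" block kept at two fixed locations, L1 and L2.  Each copy carries
# a version number; the copy with the higher version wins, and writes always
# overwrite the *older* copy, so a torn write can only damage the copy we
# were already prepared to lose.

disk = {  # location -> (version, content)
    "L1": (1, "root points to directory block #7"),
    "L2": (0, "stale"),
}

def current_root():
    # Whichever of L1 and L2 has the highest version is the real location.
    return max(("L1", "L2"), key=lambda loc: disk[loc][0])

def write_root(content):
    live = current_root()
    stale = "L2" if live == "L1" else "L1"
    version = disk[live][0] + 1
    # If power fails during this write, only the stale copy is damaged;
    # the intact lower-versioned copy is still found on reboot.
    disk[stale] = (version, content)

write_root("root points to directory block #9")
print(current_root())
```

A real implementation would also checksum each copy, since a torn write could otherwise leave a higher version number attached to garbage content.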

In general, renaming a file may involve rewriting several blocks, and the filesystem must take care about the order in which these blocks are written. For efficiency, it is often useful to reorder writes. This is partly a consequence of caching and buffering: blocks are not written to disk each time they are modified, but kept in memory for a little while in case they are modified again. Another reason to reorder writes is that it is more efficient to group consecutive writes together (write blocks 42, 43 and 44 in that order, even if 43 was modified first, then 42, then 44). Managing this reordering is a very complex part of typical filesystems, because it matters for both correctness and performance.
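The buffering-and-reordering idea can be sketched as a toy buffer cache in Python (the class and its layout are invented for the example; a real cache must also honour ordering barriers between dependent blocks):

```python
# Toy buffer cache: modified blocks are kept in memory, and a flush writes
# them out sorted by block number rather than in modification order, to
# favour sequential disk access.

class BufferCache:
    def __init__(self, disk):
        self.disk = disk          # block number -> bytes
        self.dirty = {}           # modified blocks not yet on disk

    def write(self, blockno, data):
        self.dirty[blockno] = data   # no disk I/O yet

    def flush(self):
        # Write in ascending block order: 42, 43, 44 -- even though 43
        # was modified first.
        order = []
        for blockno in sorted(self.dirty):
            self.disk[blockno] = self.dirty[blockno]
            order.append(blockno)
        self.dirty.clear()
        return order

cache = BufferCache(disk={})
cache.write(43, b"modified first")
cache.write(42, b"modified second")
cache.write(44, b"modified third")
order = cache.flush()
print(order)   # -> [42, 43, 44]
```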

One method that is used by many modern filesystems is journaling (in the database world, this is called a log). In addition to the actual data blocks, the filesystem maintains a list of the operations that it performs, stored in strict chronological order. The journal consists of records such as “rename the file from oldname to newname, affecting blocks B1, B2 and B3”. If a power failure interrupts the writing of a journal record, the operation simply hasn't taken place. The information in the journal duplicates the actual data, which is written to its final location as performance permits. If the data wasn't written to its final location because of a power failure, the operating system performs a journal replay at boot time: it reads the journal entries, checks whether each operation has taken place, and performs the operation if necessary. The operating system also periodically marks journal entries as completed once it knows that the corresponding data has been successfully written; in many filesystems, this allows old journal entries to be erased, so that the journal keeps to a fixed size. There are even filesystems (log-structured filesystems) that consist solely of a journal, so that a file is defined by pointers to the journal entries that wrote its contents; this isn't very common, though.
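Journal replay can be sketched as a toy in Python (the record format, directory structure and names are invented for the example; real journals also checksum each record so a torn final record can be discarded):

```python
# Toy journal replay: each record describes an operation; on boot, records
# not yet marked completed are re-applied in order.

journal = [
    {"op": "rename", "old": "a.txt", "new": "b.txt", "completed": True},
    {"op": "rename", "old": "b.txt", "new": "c.txt", "completed": False},
]
directory = {"b.txt": "inode 12"}   # on-disk state: second rename not applied

def replay(journal, directory):
    for record in journal:
        if record["completed"]:
            continue                 # already safely on disk
        if record["op"] == "rename" and record["old"] in directory:
            directory[record["new"]] = directory.pop(record["old"])
        record["completed"] = True   # replaying this rename twice is a no-op
    return directory

print(replay(journal, directory))
```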

OTHER TIPS

Search for something called a "Journaling File System", of which variants are used on many popular operating systems.

The basics of a journaling system are:

  1. Record intended change to the journal
  2. Make the change

This provides basic corruption protection in the case of any kind of failure (including loss of power). If an interruption happens before or during Step 1, it is simply as though the operation never took place. If it happens before or during Step 2, the system can replay what was recorded in Step 1 at a later time.

There's a bit of book-keeping here. You may wish to look at the ext4 filesystem as a concrete example (though it is quite advanced; perhaps a simpler example exists).

RAID is a related concept, but its error-correction modes are geared more towards the loss of a particular drive: the array can rebuild the lost drive's contents from the remaining members. The individual disks, or the logical array, likely still use a form of journaling. You can dig further under RAID or the generic term "forward error correction" (FEC).

Licensed under: CC-BY-SA with attribution