Question

In the context of a modern filesystem such as btrfs or ZFS, both of which checksum every piece of data written, is there any additional value in a file format storing internal checksums?

I also note the case where a file is transferred across a network. TCP does its own checksumming, so again, is it necessary for the file itself to contain a checksum?

Finally, in the case of backups and archives, it is usual for archive files (tarballs etc.) to be stored with a sidecar file containing a hash. Where the archive file is intended as a distribution method, a cryptographically secure sidecar hash file is required.

So when should a file format do its own checksums?


Solution

The other thing you haven't considered is that files typically don't just exist on disk:

  • They are copied across networks in various ways and under various circumstances.
  • They are copied from one storage medium to another, or even within the same medium.

Each time a file is copied, the bits could get corrupted ...

Now some of these representation or data movement schemes have (or can have) mechanisms to detect corruption. But this doesn't apply to all of them, and someone receiving a file cannot tell whether previous storage / movement schemes that touched the file do error detection. Also, you don't know how good the error detection is. For example, will it detect 2 bits flipped?

Therefore, if the file content warrants error detection, including error detection as part of the file format is a reasonable thing to do. (Indeed, if you don't then you ought to use some kind of external checksumming mechanism, independent of the file system's error detection, etcetera.)
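As a minimal sketch of what "error detection as part of the file format" can look like, the Python snippet below appends a SHA-256 digest to the payload when writing and re-verifies it when reading. The container layout (payload followed by a fixed-size digest) and the function names are invented for illustration, not taken from any real format.

```python
import hashlib
from pathlib import Path

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes

def write_with_checksum(path: str, payload: bytes) -> None:
    """Store the payload followed by its SHA-256 digest."""
    digest = hashlib.sha256(payload).digest()
    Path(path).write_bytes(payload + digest)

def read_with_checksum(path: str) -> bytes:
    """Return the payload, raising if the embedded digest no longer matches."""
    raw = Path(path).read_bytes()
    payload, stored = raw[:-DIGEST_SIZE], raw[-DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != stored:
        raise ValueError("checksum mismatch: file corrupted in storage or transit")
    return payload
```

Any bit flipped anywhere between the write and the read, on whichever disk, filesystem or network hop, will make the verification fail, regardless of whether those intermediate layers did their own checking.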

The other thing to note is that while disks, networks, network protocols, file systems, RAM and so on often implement some kind of error detection, they don't always do this. And when they do, they tend to use a mechanism that is optimized for speed rather than high integrity. High integrity tends to be computationally expensive.

A file format where integrity matters cannot assume that something else is taking care of the problem.
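To make the speed-versus-integrity trade-off concrete, this snippet computes both a CRC-32 (the kind of fast check typically used by hardware and transport layers) and a SHA-256 digest over the same buffer; the buffer size is arbitrary and only meant to show the relative cost.

```python
import hashlib
import time
import zlib

data = b"\x00" * (64 * 1024 * 1024)  # 64 MiB of example data

t0 = time.perf_counter()
crc = zlib.crc32(data)                  # 32-bit check: cheap, but weak against larger or crafted changes
t1 = time.perf_counter()
sha = hashlib.sha256(data).hexdigest()  # 256-bit digest: slower, far stronger detection guarantees
t2 = time.perf_counter()

print(f"CRC-32:  {crc:#010x}       ({t1 - t0:.3f}s)")
print(f"SHA-256: {sha[:16]}...  ({t2 - t1:.3f}s)")
```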

(Then there is the issue that you may want / need to detect deliberate file tampering. For that you need something more than simple checksums or even (just) cryptohashes. You need something like digital signatures.)
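For the tampering case, a checksum or plain hash is not enough, because whoever alters the content can simply recompute it; a digital signature ties the content to a private key. A minimal sketch using Ed25519 via the third-party `cryptography` package (assumed to be installed) could look like this:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The file author signs the content with a private key they keep secret.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

content = b"important file content"
signature = private_key.sign(content)  # shipped alongside or inside the file

# A recipient holding only the public key can detect any modification.
try:
    public_key.verify(signature, content)
    print("signature valid")
except InvalidSignature:
    print("file was tampered with")
```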

TL;DR - checksums in file formats are not redundant.

Other tips

Checksums improve data integrity only on a statistical basis, so it depends on how much assurance your data needs. You can never reach 100%, because the checksum itself can be altered (however unlikely) in a way that still matches the data it is supposed to protect. The one rule is that the more protection your data needs, the more algorithmic overhead you have to add. It behaves like a sigmoid curve: moving to the right increases the algorithmic effort, but you never reach 100% security at the top.

(N.B. I am never sure whether the right word is safety or security, but you can probably guess what I mean.)
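As a rough illustration of that "never 100%" point: under the simplifying (and optimistic) assumption that corruption scrambles the data uniformly at random, an n-bit checksum lets a corrupted file slip through undetected with probability of about 2^-n, so wider checks buy more assurance at more computational cost but never reach zero.

```python
# Approximate probability that random corruption goes undetected by an n-bit check,
# assuming uniformly random damage (a simplification of real failure modes).
for name, bits in [("CRC-16", 16), ("CRC-32", 32), ("SHA-256", 256)]:
    print(f"{name:>7}: ~2^-{bits} = {2.0 ** -bits:.3e}")
```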

Reworked answer following discussion in comments

Checksums in file formats

The checksum in a file format has a different purpose than checksums in the file system. It aims at verifying the integrity of the data at application level. It can detect:

  • accidental corruption of content (e.g. accidental bit flips in file I/O operations, on the storage device, or during network transfer)
  • potential inconsistencies (e.g. file was edited manually or modified without sufficient knowledge of its structure)
  • intentional corruption and fraud (e.g. banking formats provide for more complex checksums that make it harder for a fraudster to hack in manual changes; see the keyed-check sketch below).

Checksums don't guarantee the authenticity of data (for that there are digital signatures), but they reduce the risk of altered application data.
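The distinction between integrity and authenticity can be illustrated with a keyed check: anyone can recompute a plain SHA-256 after editing the data, but a keyed MAC (here Python's standard-library `hmac`, used purely as an illustration of the kind of "more complex checksum" mentioned above, not as what banking formats actually do) cannot be recomputed without the secret key.

```python
import hashlib
import hmac

data = b"transfer 100.00 EUR to account 123"
key = b"shared secret between sender and verifier"  # illustrative only

plain_digest = hashlib.sha256(data).hexdigest()                 # attacker can recompute this after editing
keyed_digest = hmac.new(key, data, hashlib.sha256).hexdigest()  # recomputing requires the key

def verify(received_data: bytes, received_mac: str) -> bool:
    expected = hmac.new(key, received_data, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_mac)

print(verify(data, keyed_digest))                                  # True
print(verify(b"transfer 999.00 EUR to account 123", keyed_digest)) # False: tampering detected
```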

Checksums in file systems

At very large scale (e.g. in a datacenter), accidental corruption is not a question of if it happens, but of when it happens:

  • Hard disks had, as of 2013, a failure rate of about 1 flipped bit per 10^16 bits read/written. RAM similarly shows an uncorrected failure roughly every 10^14 bits (see the rough arithmetic after this list).
  • Silent data corruption can also occur due to cosmic radiation affecting the chips, electromagnetic interference with signal transmission, and other external physical phenomena.
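To put those rates in perspective, here is a back-of-the-envelope calculation (the daily throughput figure is an assumption chosen for illustration) showing how "very unlikely per bit" becomes "routine per day" at scale:

```python
BIT_ERROR_RATE = 1e-16         # ~1 flipped bit per 10^16 bits, as cited above
DAILY_THROUGHPUT_BYTES = 1e15  # assume a site moving about 1 PB per day

bits_per_day = DAILY_THROUGHPUT_BYTES * 8
expected_errors_per_day = bits_per_day * BIT_ERROR_RATE
print(f"expected corrupted bits per day: {expected_errors_per_day:.1f}")  # ~0.8
```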

This explains the rationale for checksums in filesystems:

  • protect data at storage level against accidental corruption, independently of the content format:

    As an example, ZFS creator Jeff Bonwick stated that the fast database at Greenplum, which is a database software company specializing in large-scale data warehousing and analytics, faces silent corruption every 15 minutes
    Wikipedia article (link above)

  • protect file system metadata against accidental corruption (or tampering attempts), because the loss of critical information such as inode references could have an even more dramatic effect than corruption of data in individual files (e.g. the instant loss of thousands of files)

    Some file systems, such as Btrfs, HAMMER, ReFS, and ZFS, use internal data and metadata checksumming to detect silent data corruption. In addition, if a corruption is detected and the file system uses integrated RAID mechanisms that provide data redundancy, such file systems can also reconstruct corrupted data in a transparent way.
    Wikipedia article (link above)

Multilayer protection

The physical protection in the hardware layer (ECC, CRC, RAID...), the filesystem or network-protocol checksums in the system layers, and the content-embedded checksum in the application layer complement each other, and each protects against different phenomena (e.g. a filesystem checksum does not protect against an intentional write).

Licensed under: CC-BY-SA with attribution