Question

I am working on transferring files over the network. There is zero tolerance for data loss during the transfers. I've been asked to compute the SHA256 values of the original and the copied file to verify that the contents are the same. So far I have compared hashes after copying and pasting a file and letting Windows Explorer append "-copy" to the filename. I have also tried renaming the copy again, as well as removing the file extension. In every case the hashes match. I've also written code that alters file attributes (just the last-write time and the creation time), and this does not seem to affect the hash either.
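
For reference, the attribute experiment looked roughly like this (a simplified sketch, not my exact code; the paths are placeholders, and I'm only touching the last-write/access times here, since changing the creation time needs a Windows-specific API call):

    import hashlib
    import os
    import shutil
    from pathlib import Path

    src = Path(r"C:\data\original.bin")       # placeholder paths
    dst = Path(r"C:\data\original-copy.bin")

    shutil.copyfile(src, dst)                 # copy the file contents
    os.utime(dst, (1_000_000_000, 1_000_000_000))  # overwrite access/last-write times

    # Hash both files; the digests match despite the altered timestamps.
    print(hashlib.sha256(src.read_bytes()).hexdigest())
    print(hashlib.sha256(dst.read_bytes()).hexdigest())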

Checksum result of copying and pasting a file (Explorer appends "-copy" to the name):

E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7

Checksum result of renaming the "-copy" file in Explorer:

E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7

Checksum result of changing the file extension:

E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7
E7273D248F191A0F914837A21BE39D229D790CA242D38651BAA06DAC9EBB63F7

What part(s) of the file are used when the hash is created?

OK, "zero tolerance" was a bit much; if the hash doesn't match, the file will have to be resent.


Solution

The entire binary contents of the file are streamed through the hashing algorithm. File metadata (such as the name, dates, and attributes) plays no part.
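
Concretely, "streaming the contents" looks something like this (an illustrative Python sketch; the chunk size is arbitrary):

    import hashlib

    def file_sha256(path: str, chunk_size: int = 65536) -> str:
        """Feed a file's bytes, and nothing else, through SHA-256."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

Only the bytes returned by read() ever reach the hash; the filename, timestamps, and attributes are never part of the input, which is why all of your experiments produce the same digest.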

OTHER TIPS

First, a general recommendation: don't do this. Use rsync or something similar to do bulk file transfers. Rsync has years of optimisations and debugging behind it, has countless options to control how (and whether) the copying happens, and is available on Windows. Don't waste time building something that has already been built.

But if you must…

Hashing algorithms generally care about bytes, not files. When you apply SHA256 to a file, you are simply reading its bytes and passing them through the algorithm.
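
You can see this directly: delivering the same bytes in different-sized pieces yields the same digest, because the algorithm only ever sees the byte stream (illustrative snippet):

    import hashlib

    data = b"the quick brown fox"

    # One-shot hash of the whole buffer.
    one_shot = hashlib.sha256(data).hexdigest()

    # The same bytes fed incrementally, as if read from a file in chunks.
    h = hashlib.sha256()
    h.update(data[:5])
    h.update(data[5:])

    assert h.hexdigest() == one_shot  # identical: only the bytes matter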

If you want to hash paths, permissions, etc., you should do this at the directory level, because these things constitute the "contents" of a directory. There is no standard byte-level representation of directories, so you'll have to make one up yourself. Something that looks like a directory listing in sorted order usually suffices; see the sketch below. And make sure that each entry contains the hash of the corresponding thing, be it a file or another directory. This way, the hash of the directory uniquely specifies not only the name and attributes of each child but, recursively, the entire contents of the subdirectory.
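
A minimal sketch of one such made-up encoding (the tab-separated listing format here is my own invention; any unambiguous, sorted format works, and you could extend each entry with permissions or other attributes you care about):

    import hashlib
    from pathlib import Path

    def tree_sha256(path: Path, chunk_size: int = 65536) -> str:
        """Hash a file's bytes; hash a directory as a sorted listing of
        'name<TAB>child-hash' lines (a simple Merkle-tree construction)."""
        h = hashlib.sha256()
        if path.is_dir():
            for child in sorted(path.iterdir(), key=lambda p: p.name):
                h.update(f"{child.name}\t{tree_sha256(child)}\n".encode("utf-8"))
        else:
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
        return h.hexdigest()

Because each entry embeds the child's hash, a change anywhere in the subtree changes the root digest.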

Note: the fact that identical files have the same hash can actually work in your favour, by avoiding transmission of the second file once the system realises that a file with the same hash is already present at the destination. Of course, you would have to code for this explicitly. But also note that doing so can allow super-cheap syncing when files have been moved or copied, since they will have the same hash as before. Only affected directories (from the immediate parent(s) to the root) will have different hash values.
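
As a toy sketch of that deduplication idea (in-memory only; a real system would exchange digests over the network before sending any bytes):

    import hashlib

    store: dict[str, bytes] = {}  # digest -> contents already at the destination

    def transfer(data: bytes) -> bool:
        """Send a file only if the destination doesn't already hold its bytes.
        Returns True if the bytes actually had to be transmitted."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in store:
            return False          # same content already present: skip the send
        store[digest] = data      # otherwise "transmit" and record the digest
        return True

    assert transfer(b"report contents") is True    # first copy is sent
    assert transfer(b"report contents") is False   # duplicate is skipped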

Finally, a minor quibble: there is no such thing as zero tolerance. Forget whether SHA256 collisions will happen in the lifetime of the Universe. A gamma ray can flip the bit that says, "These two files don't match!" Such bit flips happen exceedingly rarely, but more often than you might think. In a noisy quantum universe, we should avoid talking in absolutes.
