Fastest and Lightweight Hashing Algorithm for Large Files & 512 KB Chunks [C, Linux, Mac, Windows]

StackOverflow https://stackoverflow.com/questions/13646574

  •  03-12-2021

Question

I'm working on a project that involves computing hashes for files. The project is a file backup service, so when a file is uploaded from the client to the server, I need to check whether that file is already on the server. I generate a CRC-32 hash of the file and send that hash to the server to check whether the file is already there.

If the file is not on the server, I send it as 512 KB chunks (for deduplication), and I have to calculate a hash for each 512 KB chunk. Files may be several GB in size, and multiple clients will connect to the server, so I really need a fast, lightweight hashing algorithm for files. Any ideas?
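To make the workflow above concrete, here is a minimal sketch in C that reads a file in 512 KB pieces and produces both a whole-file CRC-32 and one CRC-32 per chunk in a single pass. The names `hash_file_chunks`, `chunk_cb`, `crc32_update` and `CHUNK_SIZE` are hypothetical, and CRC-32 appears only because it is what the question currently uses; any other hash with an update-style interface could be dropped in instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (512u * 1024u)  /* 512 KB, as in the question */

/* Hypothetical hook: called once per chunk, e.g. to send the hash to the server. */
typedef void (*chunk_cb)(size_t index, size_t len, uint32_t crc, void *user);

static uint32_t crc_table[256];
static int crc_table_ready = 0;

/* Build the lookup table for the reflected CRC-32 polynomial 0xEDB88320. */
static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        crc_table[i] = c;
    }
    crc_table_ready = 1;
}

/* zlib-style interface: pass 0 on the first call, then the previous result.
 * The lazy table init is not thread-safe; call it once up front if threads are used. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    if (!crc_table_ready)
        crc32_init();
    crc ^= 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}

/* Reads fp from its current position in CHUNK_SIZE pieces.
 * Stores the whole-file CRC in *file_crc and reports each chunk's CRC via cb
 * (cb may be NULL). Returns 0 on success, -1 on allocation or read error. */
int hash_file_chunks(FILE *fp, uint32_t *file_crc, chunk_cb cb, void *user)
{
    assert(fp != NULL && file_crc != NULL);

    unsigned char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL)
        return -1;

    uint32_t whole = 0;
    size_t index = 0;
    size_t n;
    while ((n = fread(buf, 1, CHUNK_SIZE, fp)) > 0) {
        whole = crc32_update(whole, buf, n);          /* running file hash */
        if (cb != NULL)
            cb(index, n, crc32_update(0, buf, n), user);  /* per-chunk hash */
        index++;
    }

    int failed = ferror(fp);
    free(buf);
    if (failed)
        return -1;

    *file_crc = whole;
    return 0;
}
```

The file should be opened in binary mode (`fopen(path, "rb")`) so that Windows does not translate line endings. Only one 512 KB buffer is allocated per file being hashed, which keeps memory use flat regardless of file size.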

P.S.: I have already seen some hashing-algorithm questions on StackOverflow, but none of the answers really compares hashing algorithms for exactly this kind of task. I bet this will be useful for a lot of people.


Solution

Actually, CRC32 has neither the best speed nor the best distribution.

This is to be expected: CRC32 is pretty old by today's standards. It was created in an era when CPUs were not 32/64 bits wide and did not execute out of order (OoO-Ex), and when distribution properties mattered less than error detection. All of these requirements have changed since.

To evaluate the speed and distribution properties of hash algorithms, Austin Appleby created the excellent SMHasher package. A short summary of results is presented here. I would advise selecting an algorithm with a Q.Score of 10 (perfect distribution).
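SMHasher is the thorough way to compare candidates. For a quick sanity check on your own hardware, a rough throughput measurement over a 512 KB buffer could look like the sketch below. The helper `hash_throughput_mb_s` and the `hash_fn` type are hypothetical names, and 64-bit FNV-1a is only a simple stand-in to give the harness something to call, not a recommendation; plug in whichever candidate you want to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK_SIZE (512u * 1024u)  /* same chunk size as in the question */

/* Hypothetical common signature for the hash functions under test. */
typedef uint64_t (*hash_fn)(const void *data, size_t len);

/* 64-bit FNV-1a, used here purely as a placeholder candidate. */
uint64_t fnv1a_64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Hashes one in-memory chunk `rounds` times and returns MB/s of CPU time,
 * or -1.0 if the run failed or was too short for clock() to resolve. */
double hash_throughput_mb_s(hash_fn fn, int rounds)
{
    assert(fn != NULL);
    if (rounds <= 0)
        return -1.0;

    unsigned char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL)
        return -1.0;
    memset(buf, 0xA5, CHUNK_SIZE);

    volatile uint64_t sink = 0;  /* keeps the compiler from discarding the calls */
    clock_t start = clock();
    for (int i = 0; i < rounds; i++) {
        buf[0] = (unsigned char)i;        /* vary the input slightly each round */
        sink ^= fn(buf, CHUNK_SIZE);
    }
    clock_t end = clock();
    (void)sink;
    free(buf);

    double secs = (double)(end - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return -1.0;
    return ((double)rounds * CHUNK_SIZE / (1024.0 * 1024.0)) / secs;
}
```

Calling `hash_throughput_mb_s(fnv1a_64, 200)` and comparing the figure with what the disk or network can deliver gives a first idea of whether hashing is the bottleneck at all. The number reflects only raw speed on a warm buffer; it says nothing about distribution quality, which is what SMHasher's score covers.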

OTHER TIPS

You say you are using CRC-32 but want a faster hash. CRC-32 is very basic and pretty fast; I would expect the I/O time to be much longer than the hash time. You also want a hash that will not have collisions, that is, two different files or 512 KB chunks getting the same hash value. You could look at any of the cryptographic hashes, such as MD5 (do not use it for secure applications) or SHA-1.

If you are only using CRC-32 to check whether a file is a duplicate, you are going to get false duplicates, because different files can have the same CRC-32. You had better use SHA-1; CRC-32 and MD5 are both too weak.
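How likely such false duplicates are can be estimated with the birthday bound. The figures below are a back-of-the-envelope sketch, assuming hash values are uniformly distributed, only accidental (not deliberately crafted) collisions matter, and an example store of 1 TiB split into 512 KiB chunks; none of these numbers come from the question itself.

```latex
% n = number of hashed items, b = hash width in bits
P(\text{collision}) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2 \cdot 2^{b}}\right)

% Assumed example: 1 TiB of data in 512 KiB chunks
n = \frac{2^{40}}{2^{19}} = 2^{21} \approx 2.1 \times 10^{6}

% 32-bit hash (CRC-32), using n(n-1) \approx n^2
\frac{n^{2}}{2 \cdot 2^{32}} = \frac{2^{42}}{2^{33}} = 2^{9} = 512
\quad\Rightarrow\quad P \approx 1

% 160-bit hash (SHA-1)
\frac{n^{2}}{2 \cdot 2^{160}} = \frac{2^{42}}{2^{161}} = 2^{-119}
\quad\Rightarrow\quad P \approx 2^{-119}
```

With a 32-bit hash, the chance of at least one collision already passes 50% at about 1.18 × 2^16 ≈ 77,000 items, so in this example roughly 512 colliding chunk pairs would be expected. With a 160-bit hash, an accidental collision at this scale is negligible.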

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow