Question

I'm hashing a large number of files, and to guard against hash collisions I'm also storing each file's original size. That way, even if there's a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (a colliding file is equally likely to be of any size), or do I need another piece of information (because a colliding file is more likely to have the same length as the original)?

Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?

Solution

Depends on your hash function, but in general, files that are of the same size but different content are less likely to produce the same hash than files that are of different sizes. Still, it would probably be cleaner to simply use a time-tested hash with a larger output space (e.g. MD5 instead of CRC32, or SHA1 instead of MD5) than to bet on a home-grown solution like storing the file size.
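To make "a larger space" concrete, here is a small Python sketch that reports how many output bits each of the hashes mentioned above produces. The function name hash_space_bits is made up for this illustration; the hashlib and zlib calls are from the standard library.

```python
import hashlib
import zlib


def hash_space_bits():
    """Return the output width, in bits, of a few common hashes."""
    return {
        "CRC32": 32,  # zlib.crc32 returns an unsigned 32-bit integer
        "MD5": hashlib.md5().digest_size * 8,
        "SHA-1": hashlib.sha1().digest_size * 8,
        "SHA-256": hashlib.sha256().digest_size * 8,
    }


if __name__ == "__main__":
    for name, bits in hash_space_bits().items():
        print(f"{name:8s} {bits:3d} bits")
    # The CRC32 checksum always fits in 32 bits.
    print(zlib.crc32(b"example file contents") < 2 ** 32)
```

Each extra bit doubles the number of possible hash values, so moving from CRC32 to MD5 adds 96 bits, far more than a stored file size can contribute.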

OTHER TIPS

Hash functions are generally written to evenly distribute the data across all result buckets.

Assume that your files are evenly distributed over a fixed range of available sizes; let's say there are only 1024 (2^10) evenly distributed distinct sizes. Storing the file size then reduces the chance of a collision, at best, by a factor equal to the number of distinct file sizes.

Note: we could assume it's 2^32 evenly distributed and distinct sizes and it still doesn't change the rest of the math.

It is commonly accepted that the probability of two arbitrary different inputs colliding under MD5 (for example) is 1/(2^128).

Unless something specifically built into the hash function says otherwise, the probability P(MD5(X) == MD5(X+1)) for any valid X is the same as for any two distinct random values Y and Z. That is to say, P(MD5(Y) == MD5(Z)) = P(MD5(X) == MD5(X+1)) = 1/(2^128) for any values of X, Y and Z.

Combining this with the 2^10 distinct file sizes means that by storing the file size you gain at most an additional 10 bits that indicate whether two items differ (again, this assumes your file sizes are evenly distributed over all values).

So at the very best, all you are doing is adding another N bytes of storage for <=N bytes' worth of unique values (it can never be >N). You are therefore much better off increasing the number of bytes returned by your hash function, using something such as SHA-1 or SHA-2 instead, as this is more likely to give you evenly distributed values than storing the file size.
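The bit-counting argument can be checked numerically. The sketch below uses the standard birthday-bound approximation, which is not part of the answer above and is brought in here as an assumption, to estimate the chance of at least one collision among a hypothetical one billion files. The function name and the file count are illustrative only.

```python
import math


def collision_probability(num_files: int, hash_bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    n = 10 ** 9  # hypothetical number of files
    cases = [
        ("MD5 (128 bits)", 128),
        ("MD5 + 10 bits of file size", 138),  # best case for stored size
        ("SHA-1 (160 bits)", 160),
        ("SHA-256 (256 bits)", 256),
    ]
    for label, bits in cases:
        print(f"{label:28s} {collision_probability(n, bits):.3e}")
```

Under these assumptions the 10 bits of file size improve the odds by a factor of about 1024, while switching to a 160-bit hash improves them by a factor of about 2^32.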

In short: if MD5 isn't good enough with respect to collisions, use a stronger hash. If the stronger hashes are too slow, use a fast hash with a low chance of collisions, such as MD5, and then use a slower hash such as SHA-1 or SHA256 to reduce the chance of a collision further. But if SHA256 is fast enough and the doubled space isn't a problem, you should probably just use SHA256.
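One possible reading of the "fast hash first, slower hash second" advice, sketched in Python: compute the stronger digest only when the cheaper one already matches. The helper names digest and probably_identical are invented for this example, and the choice of MD5 as the first stage and SHA-256 as the second is an assumption.

```python
import hashlib


def digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def probably_identical(path_a, path_b):
    """Two-stage check: cheap hash first, stronger hash only on a match."""
    if digest(path_a, "md5") != digest(path_b, "md5"):
        return False  # different MD5 means the contents differ
    # MD5 matched: confirm with SHA-256 to shrink the collision chance.
    return digest(path_a, "sha256") == digest(path_b, "sha256")
```

In a real system the first-stage digests would normally be stored rather than recomputed for every comparison; this sketch recomputes them only to stay short.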

Hash functions are designed so that collisions are very difficult to produce; otherwise they wouldn't be effective.
If you do get a hash collision, an extremely unlikely event with a probability of about 1 : number_of_possible_hashes, it tells you nothing about file size.

If you really want to be double-sure about hash collisions, you can calculate two different hashes for the same file - it will be less error-prone than saving hash + file size.
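A minimal sketch of the two-hash idea in Python, reading the file only once and feeding each chunk to both hash objects. The function name dual_digest and the MD5 plus SHA-256 pairing are assumptions made for illustration; any two independent hashes would do.

```python
import hashlib


def dual_digest(path, chunk_size=1 << 20):
    """Return (md5_hex, sha256_hex) for a file, computed in a single pass."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()
```

Two files would then be treated as identical only if both elements of the returned pair match.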

The size of the hash is the same regardless of the size of the original data. As there is only a limited number of possible hashes it is theoretically possible that two files with different sizes may have the same hash. However, this means that it is also possible that two files with the same size may have the same hash.
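The pigeonhole argument is easy to see with a deliberately tiny hash. The Python sketch below truncates SHA-256 to its first byte, giving only 256 possible values (a toy construction invented for this example, not something anyone should use), and then hashes 257 distinct inputs that all have exactly the same size. At least two of them must collide.

```python
import hashlib


def tiny_hash(data: bytes) -> int:
    """First byte of SHA-256: a toy hash with only 256 possible values."""
    return hashlib.sha256(data).digest()[0]


def find_same_size_collision(count=257):
    """Return two distinct 4-byte inputs with the same tiny_hash, if any."""
    seen = {}
    for i in range(count):
        data = i.to_bytes(4, "big")  # every input is exactly 4 bytes long
        h = tiny_hash(data)
        if h in seen:
            return seen[h], data
        seen[h] = data
    return None  # unreachable for count > 256, by the pigeonhole principle


if __name__ == "__main__":
    first, second = find_same_size_collision()
    print(first.hex(), second.hex(), tiny_hash(first) == tiny_hash(second))
```

The same reasoning applies to real hashes, only with a vastly larger number of possible values.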

The whole point of the family of cryptographic hashes (MD5, SHA-x, etc.) is to make collisions vanishingly unlikely. The notion is that official legal processes are prepared to depend on it being impractical to manufacture a collision on purpose (although deliberate collisions have since been demonstrated for MD5 and SHA-1, so that guarantee now holds only for newer hashes such as SHA-256). So, really, it's a bad use of space and CPU time to add a belt to the suspenders of these hashes.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow