How can I detect identical files without comparing them against each other?

https://stackoverflow.com/questions/5016947

14-11-2019

Question

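I'm building a website where users can upload content. As always, I'm aiming for world domination, so I'd like to avoid storing the same file twice, for example when a user tries to upload the same file a second time (after renaming it, or simply having forgotten that she uploaded it before).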

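My current approach is to have the database keep track of every uploaded file, storing the following information about each one:

File size in bytes
MD5 sum of the file contents
SHA1 sum of the file contents

There is then a unique index over these three columns. I use two hashes to minimize the risk of false positives.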
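So my question is really: what is the probability that two different ("real-world") files have the same size and identical MD5 and SHA1 hashes?
Or: is there a smarter way to (un)complicate things?
(I understand the probability may depend on the file size.)
Thanks!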

Solution

The probability of two real-world files of the same size having the same SHA1 hash is zero for all practical purposes. Some weaknesses in SHA1 have been found, but creating a file from a SHA1 hash and a size (1) is incredibly expensive in terms of computing power and (2) produces either garbage or the original file.

Adding MD5 to the mix is total overkill. If you don't trust SHA-1, then a better option is to switch to SHA-2.
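As a rough illustration of the SHA-2 suggestion, here is a minimal sketch that computes a SHA-256 digest of a file in fixed-size chunks, so a large upload never has to fit in memory. The function name `sha256_of_file` and the 1 MiB default chunk size are assumptions made for this example, not something from the answer.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical helper: the file is read in `chunk_size` byte pieces
    so memory use stays constant regardless of file size.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string could be stored in place of the SHA1 column; swapping `hashlib.sha256` for `hashlib.sha1` would give the original scheme.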

If you're really paranoid, try comparing files with identical (size, SHA1) signatures byte for byte. That does, however, mean reading both files in their entirety whenever they are equal.
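The paranoid check could look something like the sketch below, which only makes sense to run after the (size, SHA1) signatures have already matched. The function name `is_true_duplicate` and its parameters are hypothetical; `filecmp.cmp` is from the Python standard library.

```python
import filecmp


def is_true_duplicate(new_path, stored_path):
    """Compare two files by content (hypothetical helper).

    shallow=False tells filecmp to compare the actual bytes rather than
    trusting matching os.stat() signatures. If the contents are equal,
    both files are read to the end, which is the cost mentioned above.
    """
    return filecmp.cmp(new_path, stored_path, shallow=False)
```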

Other tips

I believe storing both MD5 and SHA1 hashes adds unnecessary complexity and is not good design. Storing the tuple (SHA1, file size) is more than good enough. Especially if you're starting a new community site, I'd safely use that solution and only build something cleverer once it becomes a problem. As the saying goes, premature optimization is the root of all evil, and it's arguable whether adding a second hash would even count as "optimizing".
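A minimal sketch of the (SHA1, file size) approach, using an in-memory SQLite database so it runs on its own. The table name `uploads`, the column names, and the helper `register_upload` are all assumptions for illustration; a real site would use its own schema and database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE uploads (
        id         INTEGER PRIMARY KEY,
        size_bytes INTEGER NOT NULL,
        sha1       TEXT    NOT NULL,
        UNIQUE (sha1, size_bytes)  -- the database enforces deduplication
    )
    """
)


def register_upload(conn, size_bytes, sha1_hex):
    """Record an upload; return True if it is new, False if a duplicate.

    Hypothetical helper: relies on the UNIQUE constraint raising
    sqlite3.IntegrityError when the (sha1, size_bytes) pair already exists.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO uploads (size_bytes, sha1) VALUES (?, ?)",
                (size_bytes, sha1_hex),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Letting the unique constraint reject the insert, rather than doing a SELECT first, avoids a race when the same file is uploaded twice at almost the same time.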

edit: I did not quantify the odds of getting an MD5+SHA1 collision. I'd say they are zero. By a crude back-of-the-envelope calculation, the odds of two different files of arbitrary size having an identical (SHA1, MD5) tuple are 2^-288, which is zero as far as I'm concerned. Requiring identical file size as well reduces that even further.
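The 2^-288 figure presumably comes from multiplying the two per-hash probabilities. Spelled out, and assuming (as an idealization) that SHA1 and MD5 behave like independent, uniformly random functions with 160-bit and 128-bit outputs:

```latex
P(\text{SHA1 and MD5 both collide})
  \approx 2^{-160} \cdot 2^{-128}
  = 2^{-288}
  \approx 2.0 \times 10^{-87}
```

This is the chance for one given pair of files that were not deliberately constructed to collide; it says nothing about an attacker who crafts collisions on purpose.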

You can use Broder's implementation of the Rabin fingerprinting algorithm. It is faster to compute than SHA1 and MD5, and it has a proven bound on the probability of accidental collisions. However, it is not considered safe against malicious attacks: it is possible for someone to purposefully alter the file in question without changing the fingerprint itself. If you just want to check the similarity of files, it is a pretty good solution.

C# implementation, not tested:

http://www.developpez.net/forums/d863959/dotnet/general-dotnet/contribuez/algorithm-rabin-fingerprint/
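For readers who want the idea without the C# code, here is a minimal, unoptimized Python sketch of a Rabin-style fingerprint: the file's bits are treated as a polynomial over GF(2) and reduced modulo a fixed degree-64 polynomial. This is not Broder's table-driven implementation, and it is not the code behind the link. The polynomial x^64 + x^4 + x^3 + x + 1 is an assumption for the example (it is commonly listed as irreducible); Rabin's scheme proper picks the polynomial at random. The name `rabin_fingerprint` is hypothetical.

```python
# Assumed modulus: x^64 + x^4 + x^3 + x + 1, written as a bit mask.
POLY = (1 << 64) | 0x1B


def rabin_fingerprint(data: bytes) -> int:
    """Return (data as a GF(2) polynomial) mod POLY, as an int below 2**64."""
    fp = 0
    for byte in data:
        for shift in range(7, -1, -1):      # most significant bit first
            fp = (fp << 1) | ((byte >> shift) & 1)
            if fp >> 64:                    # degree reached 64: reduce
                fp ^= POLY
    return fp
```

Because the reduction is linear, leading zero bytes do not change the result, which is one reason to pair such a fingerprint with the file size. The bit-by-bit loop is slow in Python; practical implementations process a byte or more at a time using precomputed tables.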

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow