Checking for document duplicates and similar documents in a document management application

https://stackoverflow.com/questions/1728977

19-09-2019
|

Question

Update: I have now written a PHP extension called php_ssdeep for the ssdeep C API to facilitate fuzzy hashing and hash comparisons in PHP natively. More information can be found over at my blog. I hope this is helpful to people.

I am involved in writing a custom document management application in PHP on a Linux box that will store various file formats (potentially 1000's of files) and we need to be able to check whether a text document has been uploaded before to prevent duplication in the database.

Essentially when a user uploads a new file we would like to be able to present them with a list of files that are either duplicates or contain similar content. This would then allow them to choose one of the pre-existing documents or continue uploading their own.

Similar documents would be determined by looking through their content for similar sentances and perhaps a dynamically generated list of keywords. We can then display a percentage match to the user to help them find the duplicates.

Can you recommend any packages for this process and any ideas of how you might have done this in the past?

The direct duplicate I think can be done by getting all the text content and

Stripping whitespace
Removing punctuation
Convert to lower or upper case

then form an MD5 hash to compare with any new documents. Stripping those items out should help prevent dupes not being found if the user edits a document to add in extra paragraph breaks for example. Any thoughts?

This process could also potentially run as a nightly job and we could notify the user of any duplicates when they next login if the computational requirement is too great to run in realtime. Realtime would be preferred however.

Solution

I have found a program that does what its creator, Jesse Kornblum, calls "Fuzzy Hashing". Very basically it makes hashes of a file that can be used to detect similar files or identical matches.

The theory behind it is documented here: Identifying almost identical files using context triggered piecewise hashing

ssdeep is the name of the program and it can be run on Windows or Linux. It was intended for use in forensic computing, but it seems suited enough to our purposes. I have done a short test on an old Pentium 4 machine and it takes about 3 secs to go through a hash file of 23MB (hashes for just under 135,000 files) looking for matches against two files. That time includes creating hashes for the two files I was searching against as well.

OTHER TIPS

I'm working on a similar problem in web2project and after asking around and digging, I came to the conclusion of "the user doesn't care". Having duplicate documents doesn't matter to the user as long as they can find their own document by its own name.

That being said, here's the approach I'm taking:

Allow a user to upload a document associating it with whichever Projects/Tasks they want;
The file should be renamed to prevent someone getting at it via http.. or better stored outside the web root. The user will still see their filename in the system and if they download it, you can set the headers with the "proper" filename;
At some point in the future, process the document to see if there are duplicates.. at this point though, we are not modifying the document. After all, there could be important reasons the whitespace or capitalization is changed;
If there are dupes, delete the new file and then link to the old one;
If there aren't dupes, do nothing;
Index the file for search terms - depending on the file format, there are lots of options, even for Word docs;

Throughout all of this, we don't tell the user it was a duplicate... they don't care. It's us (developers, db admins, etc) that care.

And yes, this works even if they upload a new version of the file later. First, you delete the reference to the file, then - just like in garbage collection - you only delete the old file if there are zero references to it.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow