Question

I have a large PostgreSQL database containing documents. Every document is represented as a row in a table. When a new document is added to the database I need to check for duplicates. But I can't just use a SELECT to find an exact match: two documents can vary slightly and still be considered duplicates, for example if some minor fields differ and all other fields are equal.

I researched this problem and found a method to solve it: calculate a MinHash signature for every document and build an inverted index to query similar documents from the database. But I can't understand how to map MinHash onto a relational database.

As I understand it, a MinHash signature is a list of N hashes, where N is the number of attributes. Similarity is calculated as follows:

# Given two signatures Sa and Sb, each containing N hashes,
# similarity is the fraction of positions where the hashes are equal.
def minhash_similarity(Sa, Sb):
    N = len(Sa)
    number_of_equal_hashes = 0
    for ix in range(N):
        if Sa[ix] == Sb[ix]:
            number_of_equal_hashes += 1
    return float(number_of_equal_hashes) / N

This is simple if you already have the two signatures; the problem is to find all documents (with their corresponding signatures) in the database whose similarity to the new document is greater than or equal to some value.

For example, I can create a table with multiple columns like this:

| minhash0 | minhash1 | minhash2 | docid |

Each minhashX column corresponds to the MinHash of one of the document's attributes, and docid is the document's identifier. I can query similar records this way:

select * from invidx
where ((case when minhash0 = minhash2search0 then 1 else 0 end) +
       (case when minhash1 = minhash2search1 then 1 else 0 end) +
       (case when minhash2 = minhash2search2 then 1 else 0 end))::float / N > THRESHOLD

where minhash2searchX are the MinHashes of the new document and THRESHOLD is the minimal similarity. But this approach requires a full scan. Is there any method to speed up this algorithm?


Solution

To take advantage of an inverted index, I'd suggest a full-text search engine for your purposes, e.g. Lucene or Solr (which is based on Lucene).

You can construct a "document" (in Lucene terms) whose fields hold the MinHashes of your documents (db records). Note that you can index numeric fields as well (you just need to describe the field types in the schema). Also, you have to store the primary key of each document, to map Lucene "documents" back to the records in your db.
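
For illustration only, here is a minimal sketch of indexing such "documents" into Solr from Python. It assumes the pysolr client, a core named docs, and Solr's dynamic *_l (long) fields; none of these names come from the original setup.

# Hypothetical sketch: one Solr "document" per db record, storing the db
# primary key plus one numeric field per attribute MinHash.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/docs")  # core name is a placeholder

def index_record(docid, minhashes):
    # minhashes: list of per-attribute MinHash values for this db record
    doc = {"id": docid}
    doc.update({"minhash{}_l".format(i): h for i, h in enumerate(minhashes)})
    solr.add([doc])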

Index the entire collection of your documents.

To find documents similar to a given document, you have to calculate the MinHash for each field and query Lucene for similar documents:

field1:MinHash1 OR field2:MinHash2 OR ...
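
As a rough sketch, continuing the assumed pysolr setup and placeholder field names from above, such an OR query could be assembled and run like this:

# Hypothetical sketch: build the OR query from the new document's MinHashes
# and fetch the highest-ranked candidates.
def find_similar(minhashes, rows=10):
    query = " OR ".join(
        "minhash{}_l:{}".format(i, h) for i, h in enumerate(minhashes))
    return solr.search(query, rows=rows)  # results come back ranked by score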

The more fields a document matches, the higher its rank will be. So you can take a few documents with the highest rank and decide whether they are really similar in your case.

Also, boosting individual fields may be useful for you.

OTHER TIPS

Your hash table should contain two columns:

| minhash | docid |

It should be indexed on minhash.
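
A minimal sketch of that table and index, assuming PostgreSQL accessed through psycopg2; the table name, column types, and connection string are illustrative:

# Hypothetical schema for the two-column MinHash table described above.
import psycopg2

conn = psycopg2.connect("dbname=docs")  # connection parameters are placeholders
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS minhash_index (
            minhash bigint  NOT NULL,
            docid   integer NOT NULL
        )""")
    cur.execute("CREATE INDEX IF NOT EXISTS minhash_index_minhash_idx "
                "ON minhash_index (minhash)")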

When a new document arrives, you search on each of its minhashes in turn, querying the table to find prior documents sharing that minhash. You build up a tally of how many minhashes are shared by these prior documents, and then discard all those with fewer than (e.g.) 50% of the minhashes shared. This efficiently yields the set of all documents that are at least (estimated) 50% similar.
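
A sketch of that lookup-and-tally step, reusing the assumed minhash_index table and a psycopg2 cursor from the previous sketch:

# Tally how many of the new document's minhashes each prior document shares,
# then keep only those at or above the chosen fraction.
from collections import Counter

def find_candidates(cur, new_minhashes, min_fraction=0.5):
    tally = Counter()
    for h in new_minhashes:
        cur.execute("SELECT docid FROM minhash_index WHERE minhash = %s", (h,))
        for (docid,) in cur.fetchall():
            tally[docid] += 1
    needed = min_fraction * len(new_minhashes)
    return [docid for docid, shared in tally.items() if shared >= needed]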

Finally you insert new rows for each of the new document's minhashes.
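
And the final insert step, under the same assumptions:

# Add one (minhash, docid) row per minhash of the newly accepted document.
def index_new_document(cur, docid, minhashes):
    cur.executemany(
        "INSERT INTO minhash_index (minhash, docid) VALUES (%s, %s)",
        [(h, docid) for h in minhashes])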

Using Lucene or Solr is a bad solution. It will require a lot more storage, be more complex to implement, and be vastly less efficient. Yes, you could get Lucene to index your minhashes and run a query as stemm suggests. This will return every document that shares even a single minhash, which could be tens or hundreds of thousands, depending on your data size. You would then have to individually compare each one of these to your incoming document using the "Similarity" feature, which would be very slow.

Lucene does offer a "MoreLikeThis" feature to find documents sharing certain keywords, but this would miss many similar documents that a minhash approach would find.

Licensed under: CC-BY-SA with attribution