Question

I have a large PostgreSQL database containing documents. Every document is represented as a row in a table. When a new document is added to the database I need to check for duplicates. But I can't just use a SELECT to find an exact match: two documents can vary slightly and still be considered duplicates, for example if some minor fields differ and all other fields are equal.

I researched this problem and found a method to solve it: calculate a MinHash signature for every document and build an inverted index to query similar documents from the database. But I can't understand how to map MinHash onto a relational database.

As I understand it, a MinHash signature is a list of N hashes, where N is the number of attributes. Similarity is calculated as follows:

# Given two signatures Sa and Sb, each containing N hashes,
# similarity is the fraction of positions where the hashes are equal.
def minhash_similarity(Sa, Sb):
    N = len(Sa)
    number_of_equal_hashes = 0
    for ix in range(N):
        if Sa[ix] == Sb[ix]:
            number_of_equal_hashes += 1
    return float(number_of_equal_hashes) / N

This is simple if you already have the two signatures; the problem is to find all documents (with their corresponding signatures) in the database whose similarity to the new document is greater than or equal to some value.

For example, I can create a table with multiple columns like this:

| minhash0 | minhash1 | minhash2 | docid |

Each minhashX column corresponds to the MinHash of one of the document's attributes, and docid is the document's identifier. I can query similar records this way:

select * from invidx
where ((case when minhash0 = minhash2search0 then 1 else 0 end) +
       (case when minhash1 = minhash2search1 then 1 else 0 end) +
       (case when minhash2 = minhash2search2 then 1 else 0 end))::float / N > THRESHOLD

where minhash2searchX are the MinHashes of the new document and THRESHOLD is the minimal similarity. But this approach requires a full scan. Is there any method to speed up this algorithm?


Solution

To take advantage of an inverted index, I'd suggest a full-text search engine for your purposes, e.g. Lucene or Solr (which is based on Lucene).

You can construct a "document" (in Lucene terms) whose fields hold the MinHashes of your documents (db records). Note that you can index numeric fields as well (you just need to describe the field types in the schema). Also, you have to store the primary key of each document, to map Lucene "documents" back to the records in your db.
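
For illustration only, here is a minimal sketch of indexing such "documents" into Solr from Python. It assumes the pysolr client, a core named docs, and Solr's dynamic *_l (long) fields; none of these names come from the original setup.

# Hypothetical sketch: one Solr "document" per db record, storing the db
# primary key plus one numeric field per attribute MinHash.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/docs")  # core name is a placeholder

def index_record(docid, minhashes):
    # minhashes: list of per-attribute MinHash values for this db record
    doc = {"id": docid}
    doc.update({"minhash{}_l".format(i): h for i, h in enumerate(minhashes)})
    solr.add([doc])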

Index the entire collection of your documents.

To find documents similar to a given document, you have to calculate the MinHash for each field and query Lucene for similar documents:

field1:MinHash1 OR field2:MinHash2 OR ...
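
As a rough sketch, continuing the assumed pysolr setup and placeholder field names from above, such an OR query could be assembled and run like this:

# Hypothetical sketch: build the OR query from the new document's MinHashes
# and fetch the highest-ranked candidates.
def find_similar(minhashes, rows=10):
    query = " OR ".join(
        "minhash{}_l:{}".format(i, h) for i, h in enumerate(minhashes))
    return solr.search(query, rows=rows)  # results come back ranked by score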

The more fields a document matches, the higher its rank will be. So you can take a few documents with the highest rank and decide whether they are really similar in your case.

Also, boosting individual fields may be useful for you.

OTHER TIPS

Your hash table should contain two columns:

| minhash | docid |

It should be indexed on minhash.
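
A minimal sketch of that table and index, assuming PostgreSQL accessed through psycopg2; the table name, column types, and connection string are illustrative:

# Hypothetical schema for the two-column MinHash table described above.
import psycopg2

conn = psycopg2.connect("dbname=docs")  # connection parameters are placeholders
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS minhash_index (
            minhash bigint  NOT NULL,
            docid   integer NOT NULL
        )""")
    cur.execute("CREATE INDEX IF NOT EXISTS minhash_index_minhash_idx "
                "ON minhash_index (minhash)")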

When a new document arrives, you search on each of its minhashes in turn, querying the table to find prior documents sharing that minhash. You build up a tally of how many minhashes are shared by these prior documents, and then discard all those with fewer than (e.g.) 50% of the minhashes shared. This efficiently yields the set of all documents that are at least (estimated) 50% similar.
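
A sketch of that lookup-and-tally step, reusing the assumed minhash_index table and a psycopg2 cursor from the previous sketch:

# Tally how many of the new document's minhashes each prior document shares,
# then keep only those at or above the chosen fraction.
from collections import Counter

def find_candidates(cur, new_minhashes, min_fraction=0.5):
    tally = Counter()
    for h in new_minhashes:
        cur.execute("SELECT docid FROM minhash_index WHERE minhash = %s", (h,))
        for (docid,) in cur.fetchall():
            tally[docid] += 1
    needed = min_fraction * len(new_minhashes)
    return [docid for docid, shared in tally.items() if shared >= needed]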

Finally you insert new rows for each of the new document's minhashes.
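
And the final insert step, under the same assumptions:

# Add one (minhash, docid) row per minhash of the newly accepted document.
def index_new_document(cur, docid, minhashes):
    cur.executemany(
        "INSERT INTO minhash_index (minhash, docid) VALUES (%s, %s)",
        [(h, docid) for h in minhashes])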

Using Lucene or Solr is a bad solution. It will require a lot more storage, be more complex to implement, and be vastly less efficient. Yes, you could get Lucene to index your minhashes and run a query as stemm suggests. This will return every document that shares even a single minhash, which could be tens or hundreds of thousands, depending on your data size. You would then have to individually compare each one of these to your incoming document using the "Similarity" feature, which would be very slow.

Lucene does offer a "MoreLikeThis" feature to find documents sharing certain keywords, but this would miss many similar documents that a minhash approach would find.

Licensed under: CC-BY-SA with attribution