Question

I am working on building a database of timing and address information of restaurants those are extracted from multiple web sites. As information for same restaurants may be present in multiple web sites. So in the database I will have some nearly duplicate copies.

As the number of restaurants is large say, 100000. Then for each new entry I have to do order of 100000^2 comparison to check if any restaurant information with nearly similar name is already present. So I am asking whether there is any efficient approach better than that is possible. Thank you.

Was it helpful?

Solution 2

One approach is to structure your similarity function such that you can look up a small set of existing restaurants to compare your new restaurant against. This lookup would use an index in your database and should be quick.

How to define the similarity function is the tricky part :) Usually you can translate each record to a series of tokens, each of which is looked up in the database to find the potentially similar records.

Please see this blog post, which I wrote to describe a system I built to find near duplicates in crawled data. It sounds very similar to what you want to do and since your use case is smaller, I think your implementation should be simpler.

OTHER TIPS

Basically, you're looking for a record linkage tool. These tools can index records, then for each record quickly locate a small set of potential candidates, then do more detailed comparison on those. That avoids the O(n^2) problem. They also have support for cleaning your data before comparison, and more sophisticated comparators like Levenshtein and q-grams.

The record linkage page on Wikipedia used to have a list of tools on it, but it was deleted. It's still there in the version history if you want to go look for it.

I wrote my own tool for this, called Duke, which uses Lucene for the indexing, and has the detailed comparators built in. I've successfully used it to deduplicate 220,000 hotels. I can run that deduplication in a few minutes using four threads on my laptop.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top