Question

I went through the existing questions but couldn't find anything relevant.

I have a file with millions of records containing each person's first name, last name, address1, address2, country code, and date of birth. I would like to check my list of customers against this file on a daily basis (both my customer list and the file are updated daily).

For first name and last name I would like a fuzzy match (maybe Lucene's FuzzyQuery, or Levenshtein distance with a 90% match), and for the remaining fields, country and date of birth, I want an exact match.

I am new to Lucene, but judging by the number of posts, it looks like this is possible.

My questions are:

  • How should I index my input file? I need to build an index on the combination of FN, LN, country, and DOB and use it for search.
  • How can I use Lucene's fuzzy query here?

Is there any other way I can implement this?


Solution

Rushik, here are a few ideas:

  • Consider using Solr. It is much easier to get started with than bare Lucene.
  • Build a Lucene/Solr index of the file. One document per person appears to be enough, if you use a multi-valued field or two different fields for the addresses.
  • Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
  • Store the country code as a "keyword" (an untokenized field that is matched exactly). If you only require an exact match for date of birth, you may do the same. For range queries, you will need another representation.
  • I assume your customer list is smaller than the file. A possible policy would be to index the changes in the file daily (here a unique id is really handy; otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for each entry in your updated customer list.
  • What you describe is a BooleanQuery whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmatically or with the query parser.
  • Consider using Soundex for names, as described here.
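
As a rough illustration of the BooleanQuery described above, the sketch below builds the equivalent query string in Lucene's classic query parser syntax, where a leading `+` marks a required clause and a trailing `~N` makes a term fuzzy with at most N edits. It uses only the Java standard library. The field names (`firstName`, `lastName`, `country`, `dob`) and the class name are hypothetical, and it assumes each name is a single token. Note that in recent Lucene versions a fuzzy query is limited to an edit distance of 2, so a "90% match" has to be expressed as a number of edits rather than a percentage.

```java
import java.util.regex.Pattern;

/** Sketch: builds a query string in Lucene's classic query parser syntax. */
final class PersonQueryStrings {

    // Characters the classic query parser treats as syntax (assumed list).
    private static final Pattern SPECIAL =
            Pattern.compile("([+\\-!(){}\\[\\]^\"~*?:\\\\/&|])");

    private PersonQueryStrings() {}

    /** Backslash-escapes query syntax characters inside a single term. */
    static String escape(String term) {
        return SPECIAL.matcher(term).replaceAll("\\\\$1");
    }

    /**
     * Required fuzzy clauses for the names, required exact clauses for
     * country and date of birth. Field names are hypothetical.
     */
    static String build(String firstName, String lastName,
                        String country, String dob, int maxEdits) {
        if (maxEdits < 0 || maxEdits > 2) {
            throw new IllegalArgumentException("maxEdits must be 0, 1 or 2");
        }
        return "+firstName:" + escape(firstName) + "~" + maxEdits
                + " +lastName:" + escape(lastName) + "~" + maxEdits
                + " +country:" + escape(country)
                + " +dob:" + escape(dob);
    }
}
```

For example, `PersonQueryStrings.build("John", "Smith", "US", "19800115", 2)` yields `+firstName:John~2 +lastName:Smith~2 +country:US +dob:19800115`. When the string is handed to a query parser, the analyzer configured for each field still applies, so the keyword fields need an analyzer that leaves their values unchanged. Lucene also ships its own escaping helper (`QueryParser.escape`), which would normally be preferable to the hand-written one here.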

OTHER TIPS

Some academic papers on this subject are well worth reading (google for the free PDFs):

  • A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
  • Overview of Record Linkage and Current Research Directions (2006)
  • A Parallel Open Source Data Linkage System (2004)
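
For a feel of what such record-linkage approaches do without Lucene, here is a minimal sketch in plain Java. It groups the file's records by the exact-match fields (often called blocking), then compares names only within the matching group using a normalized Levenshtein similarity. The `Person` shape, the `country|dob` key format, and the threshold are assumptions for illustration, not something taken from the papers above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch: exact-match blocking plus fuzzy name comparison. */
final class NameMatcher {

    /** Hypothetical record shape. */
    static final class Person {
        final String firstName, lastName, country, dob;
        Person(String firstName, String lastName, String country, String dob) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.country = country;
            this.dob = dob;
        }
    }

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus distance divided by the longer length. */
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / longest;
    }

    /** Exact-match fields combined into one lookup key. */
    static String blockKey(Person p) {
        return p.country + "|" + p.dob;
    }

    /** Groups the file's records by country and date of birth. */
    static Map<String, List<Person>> buildBlocks(List<Person> file) {
        Map<String, List<Person>> blocks = new HashMap<>();
        for (Person p : file) {
            blocks.computeIfAbsent(blockKey(p), k -> new ArrayList<>()).add(p);
        }
        return blocks;
    }

    /** Returns records in the customer's block whose names both pass the threshold. */
    static List<Person> findMatches(Person customer,
                                    Map<String, List<Person>> blocks,
                                    double threshold) {
        List<Person> hits = new ArrayList<>();
        List<Person> block =
                blocks.getOrDefault(blockKey(customer), Collections.<Person>emptyList());
        for (Person c : block) {
            if (similarity(customer.firstName, c.firstName) >= threshold
                    && similarity(customer.lastName, c.lastName) >= threshold) {
                hits.add(c);
            }
        }
        return hits;
    }
}
```

With this definition of similarity, a 90% threshold tolerates a single edit only in names of ten or more characters: "Jonathan" versus "Jonathon" scores 0.875 and would be rejected. A lower threshold, or an absolute edit limit for short names, may fit real name data better.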

You should also consider the following libraries/frameworks:

(Answered for future visitors.)

Licensed under: CC-BY-SA with attribution