Question

I've been tasked with finding a solution to help identify possible duplicates when inserting new user records into a MySQL database. Does anyone know of any cheap, preferably free, solutions?

The sort of duplicates I need to detect are common shortened forms of English names (such as Bill for William, Bob for Robert, etc.), along with misspellings as well as plain duplicates. I've read a bit about using Lucene, but it seems to be more for full-text searching and I'm unsure whether it supports the duplicate name matching I'm after.


Solution

This might be better placed on Stack Overflow. This isn't something you'll accomplish with MySQL. What you're talking about is referred to as 'stemming' in search, similar to matching different conjugations of a regular word, e.g. run => runs, ran.

I don't know of any such applications for proper names offhand, but when you find one, it will sit alongside your primary application to "normalize" the name before inserting the record into your database, whether that is MySQL, SQL Server, MongoDB, or whatever. The DB technology is irrelevant, as your task is outside the scope of storing data/documents.
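To illustrate the "normalize before insert" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `NICKNAMES` table is a tiny hand-written sample (a real one would be much larger and loaded from a file or table), and the function names are hypothetical, not part of any library.

```python
from __future__ import annotations

# Hypothetical nickname table mapping short forms to a canonical name.
# This is a tiny hand-written sample, not a complete list.
NICKNAMES = {
    "bill": "william",
    "will": "william",
    "bob": "robert",
    "rob": "robert",
}


def normalize_first_name(name: str) -> str:
    """Return a canonical lowercase form of a first name."""
    key = name.strip().lower()
    # Unknown names fall through unchanged (apart from trimming and lowercasing).
    return NICKNAMES.get(key, key)


def is_possible_duplicate(first: str, last: str,
                          existing: set[tuple[str, str]]) -> bool:
    """Check a new record against already-normalized (first, last) pairs."""
    candidate = (normalize_first_name(first), last.strip().lower())
    return candidate in existing
```

The application would call `normalize_first_name` before the insert and store the canonical form in its own column next to the name as entered, so the duplicate check becomes an ordinary equality lookup in whichever database is used. This only covers known short forms and exact matches; misspellings would need a separate fuzzy comparison.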

Lucene would be a better tool for your task, but I couldn't speak to its prepackaged ability to stem names the way you want.

Edit

After thinking about it, I think I misspoke when I said Lucene would be a "better" approach in and of itself for what you want. My understanding is that stemmers exist outside of core Lucene and proxy a search for "bob" into ("bob" or "robert") to feed into the Lucene engine.
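The expansion step described here, turning "bob" into ("bob" or "robert") before the search engine sees it, could look roughly like the following Python sketch. The `VARIANTS` table and both function names are hypothetical, and the `OR` syntax in the generated string is only meant to resemble a typical search query; this does not call Lucene or any other library.

```python
# Hypothetical table of canonical names and their short forms.
VARIANTS = {
    "robert": ["bob", "rob"],
    "william": ["bill", "will"],
}


def expand_name(term):
    """Return the search term followed by every known variant of it."""
    key = term.strip().lower()
    for canonical, nicknames in VARIANTS.items():
        if key == canonical or key in nicknames:
            result = [key]
            # Add the canonical name and its nicknames, skipping repeats.
            for name in [canonical] + nicknames:
                if name not in result:
                    result.append(name)
            return result
    # Unknown names are searched as-is.
    return [key]


def to_or_query(term):
    """Build an OR-style query string from the expanded names."""
    names = expand_name(term)
    return "(" + " OR ".join('"{}"'.format(n) for n in names) + ")"


print(to_or_query("bob"))  # ("bob" OR "robert" OR "rob")
```

The same lookup table can serve both approaches: normalizing at insert time, or expanding at search time as shown here.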

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange