Question

I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the levenshtein distance. However, the problem with cities is that they often have prefixes and suffixes which are common to the country they are in.

For example:

Boulleville vs. Boscherville

These are almost certainly different cities. However, because they both end with "ville" (and both begin with "Bo") they have a rather small Levenstein distance.

*I am looking for a string distance algorithm that takes into account the position of the character to minimize the effect of prefixes and suffixes by weighting letters in the middle of the word higher than letters at the ends of the word. *

I could probably write something myself but I would find it hard to believe that no one has yet published a suitable algorithm.

Was it helpful?

Solution 2

A pretty simple way to do it would be to just remove the common prefix and suffix before doing the distance calculation. The absolute distance between the resulting strings will be the same as with the full strings, but when the shorter length is taken into account the distance looks much greater.

Also keep in mind that in general even grevious misspellings get the first letter right. It's highly likely, then, that Cowville and Bowville are different cities, even though their L. distance is only 1.

You can make your job a lot easier by, at least at first, not doing the distance calculation if two words start with the different letters. They're likely to be different. Concentrate first on removing duplicates of words that start with the same letters. If, after that, you still have a large number of potential duplicates, you can refine your distance threshold to more closely examine words that start with different letters.

OTHER TIPS

This is similar to stemming in Natural Language Programming.

In that field, the stem of a word is found before performing further analysis, e.g.

run => run
running => run
runs => run

(of course things like ran do not stem to run. For that one can use a lemmatizer. But I digress...). Even though stemming is far from perfect in NLP, it works remarkably well.

In your case, it may work well to stem the city using rules specific to city names before applying Levenstein. I'm not aware of a stemmer implementation for cities, but the rules seem on the surface to be fairly simple.

You might start with a list of prefixes and a list of suffixes (including any common variant / typo spellings) and simply remove such a prefix / suffix before checking the Levenstein distance.

On a side note, if you have additional address information (such as a street address or zip/postal code), there exists address normalization software for many countries that will find the best match based on address-specific algorithms.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top