Question

When you misspell a word in Google ("appples" for example), it comes up with the now familiar, "Did you mean: apples" suggestion for you.

Excluding Google's ability to guess your intentions based on relevance of search results, how can I develop a list of words that sound the same?

The words don't have to be English and don't have to exist. So, for example, if I give the input "hole", I would get back a list including words like "whole", "hola", "whore", "role", "molar", etc.

I am guessing there might be something online that can develop this list, but I couldn't find anything. If there is not a site and if it can be done using Perl, is there a CPAN module that can help me do this?


Solution

You can start by learning about the module Text::Soundex. Soundex is a simple algorithm that maps a word to a four-character code (the first letter followed by three digits). I got Soundex out of Sedgewick (ex Knuth) long ago, used it to generate longer, untruncated keys, and suggested lists of corrections for 0- and 1-letter substitutions. I applied this to large databases of census and postal data.
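To make the idea concrete, here is a minimal sketch of the usual American Soundex rules in Python. It is not the Text::Soundex source and may differ from that module in edge cases; the function name `soundex` and the `length` parameter (pass `None` for an untruncated key, as mentioned above) are my own choices for illustration.

```python
# Letter-to-digit table used by American Soundex.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word, length=4):
    """Return a Soundex-style key; length=None keeps the full, untruncated key."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    out = [first.upper()]
    prev = CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so the same code may appear again
    key = "".join(out)
    if length is None:
        return key
    return (key + "0" * length)[:length]


print(soundex("hole"), soundex("hola"), soundex("whole"))
```

With these rules "hole" and "hola" both come out as H400, but "whole" becomes W400, because Soundex always keeps the first letter. So on its own it will miss some of the matches asked for in the question.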

OTHER TIPS

If you are truly looking for words that sound the same, and not just search suggestions, you can look at phonetic algorithms. Soundex and Metaphone/Double Metaphone are two very common ones, and there are implementations of each in most popular languages.

These algorithms reduce a word down to a "key" that approximates its pronunciation. If you start with a corpus of words and build a data structure mapping each key to the words that evaluate to it, you can take an arbitrary string, evaluate it down to its key, and then look up the other words that evaluate to the same key in your data structure (probably a hash table of lists or similar).

This isn't perfect, because you'd need to find a big corpus of words to seed your dataset with, but it would work.
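The key-to-words lookup could be sketched roughly like this in Python. Everything here is illustrative: `crude_key` is a deliberately simple stand-in I made up (it is not Soundex or Metaphone), and `build_index`, `sound_alikes` and the tiny corpus are hypothetical names and data. In practice you would pass a real phonetic function as `key_fn`.

```python
from collections import defaultdict


def crude_key(word):
    """Stand-in phonetic key: first letter plus later consonants, repeats collapsed.

    This is NOT Soundex or Metaphone, only a placeholder for a real algorithm.
    """
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for ch in letters[1:]:
        if ch in "aeiouyhw":
            continue  # drop vowels and the weak consonants h, w
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def build_index(words, key_fn=crude_key):
    """Map each key to the list of corpus words that evaluate to it."""
    index = defaultdict(list)
    for word in words:
        index[key_fn(word)].append(word)
    return index


def sound_alikes(query, index, key_fn=crude_key):
    """Return the corpus words that share the query's key, excluding the query."""
    return [w for w in index.get(key_fn(query), []) if w != query]


corpus = ["hole", "hola", "hall", "whole", "role", "molar"]
index = build_index(corpus)
print(sound_alikes("hole", index))  # prints ['hola', 'hall']
```

The quality of the results depends entirely on the key function and on how large the seed corpus is. The index itself is just a hash table of lists.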

On the other hand, if you simply want search suggestions or alternate spellings, there are easier ways to go about it.

Hope that was helpful.

Licensed under: CC-BY-SA with attribution