English misspeling correction sequences

https://stackoverflow.com/questions/12451097

02-07-2021
|

Question

I am doing a bit of search engine. One of the features is an attempt to correct spelling in nothing is found. I replace the following phonetic sequences: ph<->f, ee <-> i, oo<->u, ou<->o (colour<->color). Where I can find a full list of things like that for English? Thank you.

Solution

You might want to start here (Wikipedia on Soundex) and then start tracing through the "see also" links. (Metaphone has a list of replacements, for example.)

OTHER TIPS

If you're creating search engine, you have to realize that there are plenty of web pages, which contains incorrectly spelled words. But, of course, you need any strategy to make that pages searchable too. So there are no general rules to implement spelling corrector (because of correctness becomes relative concept in web). But there are some tricks how to do that in practice :-)

I'd suggest you to use n-gram index + Levenstein distance (or any similar distance) to correct spelling.

Strings with small Levenstein distance are presumably the variations of the same word.

Assume you want to correct word "fantoma". If you have large amount of words - it would be very cost to iterate through dictionary and calculate distance to each word. So you have to find words with presumably small distance to "fantoma" very fast.

The main idea is while crawling and indexing web-pages - perform indexing of n-grams (for example - bigrams) into separate index. Split each word into n-grams, and add it to n-gram index:

1) Split each word from dictionary, 
   for example: "phantom" -> ["ph", "ha", "an", "nt", "to", "om"]

2) Create index:
   ...
   "ph" -> [ "phantom", "pharmacy", "phenol", ... ]
   "ha" -> [ "phantom", "happy" ... ]
   "an" -> [ "phantom", "anatomy", ... ]
   ...

Now - you have index, and you may quickly find candidates for your words.

For example:

1) "fantoma" -> ["fa", "an", "nt", "to", "om", "ma"]
2) get lists of words for each n-gram from index, 
   and extract most frequent words from these lists - these words are candidates
3) calculate Levenstein distance to each candidate, 
   the word with smallest distance is probably spell-corrected variant of searched word.

I'd suggest you to look through the book "Introduction to information retrieval".

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow