Question

look at this example in ms word; I deliberately misspelled the word 'complemet' to show you what I mean;

enter image description here

I wonder how ms word chooses the most similar words to what I typed(algorithm I mean)

this is not the case of spell checking but finding the most similar words(like the results in picture)

I want to implement an algorithm so that I can find the most similar words to what user already typed;

Was it helpful?

Solution

This is called the Levensthein Distance. It is a measure of how much two random words differ in terms of deletion, insertion, and replacements of single characters.

(For efficiency's sake, you probably don't want to compare any word to your entire word list. You might want to do some quick culling with one or more other methods to cull down the possible alternatives.)

(Edit)

That was fun! :) Just to see how it worked I implemented this in C, using OSX's default words list and the C version of the algorithm on Wikibooks. Here are the top 10 hits for your 'complment':

'complment' -> LD=compliment(1)
   LD=complement(1)
   LD=component(2)
   LD=couplement(2)
   LD=comment(2)
   LD=compellent(2)
   LD=competent(2)
   LD=compilement(2)
   LD=complacent(2)
   LD=complaint(2)

The comparison routine kept a small array of 'best so far' matches, and when the array filled up, the highest values were discarded. A full distance calculation against every word in the list (235,886 words) took 0.370s.

I added a quick culling routine that checked if each letter in the input occurred at least once in the compare word (a simple bit test), as well as each other letter in turn. That slashed the time taken into a third: 0.150s.

A few random other words I tried (not all possible solutions shown):

'unforutntately' -> LD=unfortunately(3) LD=infortunately(4) LD=fortunately(5)
'abcacadabra' -> LD=abracadabra(1) LD=barracuda(7)
'athtahn' -> LD=Ethan(3) LD=thawn(3) LD=Pathan(3) LD=attaghan(3)
'jongware' ->

... the last one produced no matches at all. Only after removing my One-Character-Off routine I got

'jongware' -> LD=nonglare(2)
   LD=congiary(3)
   LD=henware(3)
   LD=hogward(3)
   LD=honeyware(3)

Oh well.

(Further edit) Since you wrote

this is not the case of spell checking but finding the most similar words

I ran it again with 'compliment' spelled correctly. This is the result for that:

'compliment' -> LD=compliment(0)
  LD=complement(1)
  LD=complimenter(2)
  LD=compliant(2)
  LD=complicant(2)
  LD=complacent(2)
  LD=couplement(2)

As you see, the first value is '0' -- an exact match -- and the other words are 'similar'.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top