Question

I have a trie that I've built from a dictionary of words. I want to use this for spell checking( and suggest closest matches in the dictionary , maybe for a given number of edits x). I'm thinking I'd use levenshtein distance between the target word and words in my dictionary, but is there a smart way to traverse the trie without actually running the edit distance logic over each word separately? How should I do the traversal and the edit distance matching?

For e.g, if I have words MAN, MANE, I should be able to reuse the edit distance computation on MAN in MANE. Otherwise the Trie wouldnt serve any purpose

Was it helpful?

Solution 2

Try computing for each tree node an array A where A[x] the smallest edit distance to be at that position in the trie after matching the first x letters of the target word.

You can then stop examining any nodes if every element in the array is greater than your target distance.

For example, with a trie containing MAN and MANE and an input BANE:

Node 0 representing '', A=[0,1,2,3,4]
Node 1 representing 'M', A=[1,1,2,3,4]
Node 2 representing 'MA', A=[2,1,1,2,3]
Node 3 representing 'MAN' A=[3,2,2,1,2]
Node 4 representing 'MANE' A=[4,3,2,2,1]

The smallest value for A[end] is 1 reached with the word 'MANE' so this is the best match.

OTHER TIPS

I think you should instead give a try to bk-trees; it's a data structure that fits well spell-checking as it will allow you to compute efficiently the edit distance with the words of your dictionary.

This link gives a good insight into BK-trees applied to spell-checking

There is a smart way to get every element that is not quite a Levenstein distance since the following algorithm does not incorporate transpositions.

Assuming we have the Tree structure, we can implement a recursive search of the tree. Your recursive search assumes we start with a cost-row representing the cost of deleting every letter. As we recursively search the tree, the information we have is

  • You are at node n, that has been indexed in your Trie structure by letter l.
  • You are considering a distance from a word w
  • Your current path assumes a previous cost-row up to this point, we wish to update this to form a new cost row for this node n.

We want to update our cost-row at the letter you are considering in accordance with 4 situations; l is the next letter in the word (cost row remains the same), the letter needs to be inserted (new cost +1), a letter has been deleted (cost of previous step +1), and the letter replaces a previous word (new cost +1).

The cost of proceeding down this path on your tree is the minimum of these costs. At this point, if your at a point in the Trie structure defining a word, append it to a list, and then recursively search all children for more words assuming the current cost is within a defined maximum cost. An implementation in Python can be found in another post:

https://stackoverflow.com/a/62823597/8249836

I also have this in C for piping. Since the algorithm is pretty fast even for high edit distances (< len of word) one may use a fast efficient implementation of the Levenstein distance to correct this method.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top