Question

I am designing a text processing program that will generate a list of keywords from a long itemized text document and combine entries for words that are similar in meaning. There are metrics for that, but I have run into a new issue: dealing with words that are not in the dictionary I am using.

I am currently using nltk and python, but my issue here is of a much more abstract nature. Given a word that is not in a dictionary, what would be an efficient way of resolving it to a word that is within your dictionary? My only current solution involves running through all the words in the dictionary and picking the one with the shortest Levenshtein (edit) distance from the input word.

Obviously this is a very slow and impractical method, and I don't actually need the absolute best match from within the dictionary; any word that is in the dictionary and reasonably close will do. Efficiency matters more to me here, but a basic level of accuracy is still needed.

Any ideas on how to generally resolve some unknown word to a known one in a dictionary?

Solution

It looks like you need a spelling corrector to match words in your dictionary. The code below works and is adapted from the blog post http://norvig.com/spell-correct.html written by Peter Norvig:

import re, collections

def words(text):
    # Extract lowercase word tokens from raw text.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Count word frequencies; unseen words get a baseline count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Build the frequency model from the training text (open(), not the
# Python 2-only file()).
NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away from word.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Strings two edits away that are actually in the dictionary.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer the word itself, then known words one edit away, then two;
    # ties are broken by picking the most frequent candidate.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

big.txt is a large text file that serves as your dictionary: every word it contains is treated as known, and the word frequencies are used to rank candidate corrections.
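As a quick sanity check (assuming big.txt is in place and contains ordinary English text):

print(correct('speling'))  # likely 'spelling', assuming that word occurs in big.txt
print(correct('xzqwv'))    # likely returned unchanged: nothing within two edits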

OTHER TIPS

Your task sounds like it's really just non-word spelling correction, so a relatively straightforward solution would be to use an existing spell checker like aspell with a custom dictionary.

A quick and dirty approach would be to just use a phonetic mapping like metaphone (which is one of the algorithms used by aspell). For each phonetic code derived from your dictionary, choose a representative word (e.g., the most frequent word in the group) to suggest as the correction, and pick a default correction for the case where no matches are found. But you'd probably get better results using aspell.
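As a rough sketch of that idea, assuming the third-party jellyfish library for the metaphone codes and a word-frequency dict like NWORDS above (the 'UNKNOWN_WORD' default is just a placeholder):

import collections
import jellyfish  # third-party package providing metaphone()

def build_phonetic_index(word_counts):
    # Group dictionary words by phonetic code, then keep the most
    # frequent word in each group as that code's representative.
    groups = collections.defaultdict(list)
    for word in word_counts:
        groups[jellyfish.metaphone(word)].append(word)
    return {code: max(ws, key=word_counts.get) for code, ws in groups.items()}

def phonetic_correct(word, index, default='UNKNOWN_WORD'):
    # Map an unseen word to the representative sharing its phonetic
    # code, or fall back to the default when the code was never seen.
    return index.get(jellyfish.metaphone(word), default)

The lookup is a single hash probe per unknown word, which is why this is fast; the price is that it only catches errors that preserve the word's rough pronunciation.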

If you do want to calculate edit distances, you can do it relatively quickly by storing the dictionary and possible edit operations in tries; see Brill and Moore (2000). If you have a decent-sized corpus of spelling errors and their corrections and can implement Brill and Moore's whole approach, you would probably beat aspell by quite a bit, but it sounds like aspell (or any spell checker that lets you create your own dictionary) is sufficient for your task.
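In its simplest form (plain Levenshtein distance over a trie, not Brill and Moore's full error model), the trie idea might look like this sketch: shared prefixes are processed once, and whole branches are pruned when every cell of the distance row already exceeds the threshold.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set to the full word at terminal nodes

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, word, max_cost):
    # Walk the trie, carrying one row of the Levenshtein DP table per node.
    first_row = list(range(len(word) + 1))
    results = []

    def recurse(node, letter, prev_row):
        row = [prev_row[0] + 1]
        for col in range(1, len(word) + 1):
            row.append(min(row[col - 1] + 1,      # insertion
                           prev_row[col] + 1,     # deletion
                           prev_row[col - 1] + (word[col - 1] != letter)))  # substitution
        if node.word is not None and row[-1] <= max_cost:
            results.append((node.word, row[-1]))
        if min(row) <= max_cost:  # prune hopeless branches
            for next_letter, child in node.children.items():
                recurse(child, next_letter, row)

    for letter, child in root.children.items():
        recurse(child, letter, first_row)
    return results

With trie = build_trie(NWORDS), search(trie, 'speling', 2) returns every dictionary word within edit distance 2, so you can take any acceptable hit instead of scanning the whole dictionary.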

Hopefully this answer is not too vague:

1) It sounds like you might need to look at your tokenisation and phrase chunking layers first. This is where you should discard symbolic phrase chunks before submitting them to any fuzzy spell checking.

2) I would still recommend edit distance to come up with alternatives to any 'mis-spelt' tokens after that, but this may return a list of equally close possibilities.

3) When you have your list of possibilities, you could then use co-occurrence algorithms to select the most likely candidate from the list. I only have a Java example of some software that could help (http://www.linguatools.de/disco/disco_en.html#was). You can submit a word, and it will return the definitive co-occurring words for that word. You can then compare this list to the context of your 'mis-spelt' word and select the candidate with the most overlap among all the edit-distance alternatives, as in the sketch below.
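A minimal sketch of that overlap test, assuming you have already retrieved a set of co-occurring words per candidate (e.g., from a tool like DISCO; the cooccur mapping here is hypothetical):

def best_by_context(candidates, context_words, cooccur):
    # candidates: equally close dictionary words from the edit-distance step
    # context_words: tokens surrounding the misspelt token in your document
    # cooccur: candidate word -> set of words that typically co-occur with it
    context = set(context_words)
    return max(candidates, key=lambda w: len(cooccur.get(w, set()) & context))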

I do not see a reason to use Levenshtein distance to find a word similar in meaning. Edit distance looks only at surface form (you want to map "bus" to "truck", not to "bush").

The correct solution depends on what you want to do next.

Unless you really need the information in those unknown words, I would simply map all of them to a single generic "UNKNOWN_WORD" item.

Obviously you can cluster the unknown words by their context and other features (say, whether they start with a capital letter). For context clustering: since you are interested in meaning, I would use a larger window around those words (say +/- 50 words) and probably a simple bag-of-words model. Then you simply find the known word whose vector in this space is closest to the unknown word, using some distance metric (say, cosine). Let me know if you need more information about this.
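A minimal sketch of that comparison, using the +/- 50-word window and cosine similarity suggested above (the known_vectors mapping, from each known word to its aggregated context vector over the corpus, is assumed to be precomputed the same way):

import collections
import math

def context_vector(tokens, i, window=50):
    # Bag of words in a +/- window around position i, excluding the token itself.
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return collections.Counter(tokens[lo:i] + tokens[i + 1:hi])

def cosine(a, b):
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def closest_known(unknown_vec, known_vectors):
    # Return the known word whose context vector is most similar.
    return max(known_vectors, key=lambda w: cosine(unknown_vec, known_vectors[w]))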

Licensed under: CC-BY-SA with attribution