Question

I am using NLTK to analyze a corpus that has been OCRed. I'm new to NLTK. Most of the OCR is good -- but sometimes I come across lines that are plainly junk. For instance: oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5

I want to identify (and filter out) such lines from my analysis.

How do NLP practitioners handle this situation? Something like: if 70% of the words in the sentence are not in WordNet, discard it? Or if NLTK can't identify the part of speech for 80% of the words, discard it? What algorithms work for this? Is there a "gold standard" way to do this?

Solution

Using n-grams is probably your best option. You can use Google n-grams, or the n-gram tools built into NLTK. The idea is to create a language model and see what probability it assigns to any given sentence. You can define a probability threshold and remove every sentence that falls below it. Any reasonable language model will give a very low probability to the example sentence.

If you think that some words may be only slightly corrupted, you may try spelling correction before testing with the n-grams.
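As a rough illustration of that pre-processing step, here is a minimal sketch that snaps each token to its closest match in a word list using the standard library's difflib. The vocabulary, the function names and the 0.8 cutoff are all assumptions made for this example, and a real spelling corrector would use a much larger word list.

```python
import difflib

# Hypothetical vocabulary; in practice this would be a large word list.
VOCAB = ["this", "is", "a", "standard", "english", "sentence"]


def correct_token(token, vocab, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if none is close."""
    word = token.lower()
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line, vocab, cutoff=0.8):
    """Correct every whitespace-separated token in a line."""
    return " ".join(correct_token(tok, vocab, cutoff) for tok in line.split())


print(correct_line("Ths is a standrd English sentense", VOCAB))
# prints: this is a standard english sentence
```

Tokens with no close match (such as "wmnondmam") are returned unchanged, so truly junk lines still look like junk to the language model afterwards.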

EDIT: here is some sample NLTK code for doing this:

# Written for NLTK 2.x: NgramModel was removed in NLTK 3
# (newer releases provide language models in nltk.lm instead).
from nltk import NgramModel
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2  # bigram model

# Lidstone smoothing (gamma = 0.2) so unseen bigrams get a small nonzero probability
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

def sentenceprob(sentence):
    """Sum the model's logprob score over every bigram in the sentence."""
    bigrams = ngrams(sentence.split(), n)
    tot = 0
    for grams in bigrams:
        # Score the last word of the n-gram given the words before it
        score = lm.logprob(grams[-1], grams[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"

print(sentenceprob(sentence1))
print(sentenceprob(sentence2))

The results look like:

$ python lmtest.py
42.7436688972
158.850086668

Lower is better: logprob returns a negative log probability, so a lower total means a more likely sentence. (Of course, you can play with the parameters.)
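Because each n-gram adds to the total, longer lines get higher totals regardless of quality, so a single cutoff is easier to apply to a per-n-gram average. The following is a minimal pure-Python sketch of that idea, not the NLTK code above: it assumes a character-trigram model (rather than word bigrams) with Lidstone smoothing so that it works with a tiny training text. TRAIN_TEXT, THRESHOLD and all function names are hypothetical choices for illustration.

```python
import math
from collections import Counter

# Tiny stand-in training text (hypothetical); a real filter would train on a
# large clean corpus such as the Brown corpus.
TRAIN_TEXT = (
    "this is a standard english sentence and this is another standard "
    "english sentence that the model can learn from the model is trained "
    "on clean text and then used to score each line"
)
N = 3            # character n-gram order
GAMMA = 0.2      # Lidstone smoothing constant
THRESHOLD = 3.0  # bits per trigram; hypothetical, tune on your own data


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(text, n):
    """Count n-grams and their (n-1)-character contexts."""
    grams = Counter(char_ngrams(text, n))
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    alphabet_size = len(set(text.lower())) + 1  # one extra bin for unseen characters
    return grams, contexts, alphabet_size


def avg_neg_logprob(line, model, n=N, gamma=GAMMA):
    """Average negative log2 probability per n-gram (lower means more English-like)."""
    grams, contexts, alphabet_size = model
    line_grams = char_ngrams(line, n)
    if not line_grams:
        return float("inf")  # lines shorter than n characters count as junk
    total = 0.0
    for gram in line_grams:
        prob = (grams[gram] + gamma) / (contexts[gram[:-1]] + gamma * alphabet_size)
        total -= math.log(prob, 2)
    return total / len(line_grams)


def filter_lines(lines, model, threshold=THRESHOLD):
    """Keep only lines whose average cost is at or below the threshold."""
    return [ln for ln in lines if avg_neg_logprob(ln, model) <= threshold]


model = train(TRAIN_TEXT, N)
lines = [
    "This is a standard English sentence",
    "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5",
]
for ln in lines:
    print(round(avg_neg_logprob(ln, model), 2), ln)
print(filter_lines(lines, model))
```

With a model this small the threshold only separates text that closely resembles the training data from obvious junk; on a real corpus you would pick the cutoff by inspecting scores of lines you have labelled by hand.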

License: CC-BY-SA with attribution
Not affiliated with StackOverflow