How to check for unreadable OCRed text with NLTK

Question

Using n-grams is probably your best option. You can use Google n-grams, or the n-gram tools built into NLTK. The idea is to build a language model and see what probability it assigns to each sentence. Pick a probability threshold and discard every sentence that scores below it. Any reasonable language model will give a very low probability to the example sentence.
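
To make the thresholding idea concrete, here is a minimal, dependency-free sketch. It is not the NLTK approach itself: the toy training sentences, the add-k smoothing constant, the function names, and the threshold of 2.5 are all assumptions picked for illustration. It scores each sentence by its average negative log2 probability per bigram (a cost, so a low probability shows up as a high cost), which keeps long and short sentences comparable.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and their left-hand contexts over whitespace tokens."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = s.lower().split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def avg_cost(sentence, model, k=0.2):
    """Average negative log2 probability per bigram, with add-k smoothing."""
    contexts, bigrams, vocab = model
    tokens = sentence.lower().split()
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("inf")  # nothing to score; treat as unreadable
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    total = 0.0
    for prev, word in pairs:
        p = (bigrams[(prev, word)] + k) / (contexts[prev] + k * vocab_size)
        total -= math.log(p, 2)
    return total / len(pairs)

# Toy training data (assumption); a real model needs a large corpus.
model = train_bigram_model([
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog sat on the mat",
])

THRESHOLD = 2.5  # hypothetical cutoff; tune it on your own data
candidates = ["the cat sat on the mat", "oomfi ow Ba wmnondmam BE wBwHo"]
kept = [s for s in candidates if avg_cost(s, model) <= THRESHOLD]
```

With a model trained on real text, the same filter applies unchanged; only the threshold needs tuning against a few known-good and known-bad sentences.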

If you think some words may be only slightly corrupted, you can try spelling correction before testing with the n-grams.
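
One rough way to do that correction step is to snap each unknown token to its closest vocabulary word using the standard library's difflib. This is only a sketch under assumptions: the function name `correct_tokens`, the tiny vocabulary, and the 0.8 similarity cutoff are made up for illustration, and a real setup would use a proper word list or a dedicated spelling corrector.

```python
import difflib

def correct_tokens(sentence, vocabulary, cutoff=0.8):
    """Replace each unknown token with its closest vocabulary word, if any."""
    known = set(vocabulary)
    corrected = []
    for token in sentence.split():
        if token in known:
            corrected.append(token)
            continue
        # get_close_matches returns the best candidates scoring >= cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        # Leave the token alone when nothing is similar enough.
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

# Hypothetical vocabulary; in practice, use the language model's word list.
vocabulary = ["this", "is", "a", "standard", "english", "sentence"]
fixed = correct_tokens("this is a standrd engl1sh sentence", vocabulary)
```

Tokens that are pure noise stay unchanged, so the n-gram test afterwards still rejects sentences that are mostly garbage.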

EDIT: here is some sample NLTK code for doing this (written for Python 2 and NLTK 2.x; NgramModel is not available in NLTK 3):

import math
from nltk import NgramModel
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

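def sentenceprob(sentence):
    # Sum the model's score over every bigram in the sentence.
    # lm.logprob returns a negative log probability, so the total is a cost.
    # Tokens are scored as written; no case folding is applied.
    bigrams = ngrams(sentence.split(), n)
    tot = 0
    for grams in bigrams:
        # Last token is the word, the preceding tokens are its context.
        score = lm.logprob(grams[-1], grams[:-1])
        tot += score
    return tot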

sentence1 = "This is a standard English sentence"
sentence2 = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"

print(sentenceprob(sentence1))
print(sentenceprob(sentence2))

The results look like:

$ python lmtest.py
  42.7436688972
  158.850086668

The scores are summed negative log probabilities, so lower is better. (Of course, you can play with the parameters, such as n and the Lidstone smoothing value.)
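
For newer NLTK releases, a rough equivalent can be sketched with the nltk.lm package, which provides a Lidstone-smoothed model. The tiny inline training corpus and the name `sentence_cost` are assumptions made so the example runs without downloading anything; for real use you would train on a large tokenized corpus (for example the Brown news sentences). The numbers it produces will not match the output above.

```python
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.util import ngrams

n = 2

# Toy training data (assumption): already tokenized, lowercased sentences.
train_sents = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog sat on the mat".split(),
]

# Build padded training n-grams and the vocabulary stream, then fit.
train_data, vocab = padded_everygram_pipeline(n, train_sents)
lm = Lidstone(0.2, n)  # gamma=0.2, as in the code above
lm.fit(train_data, vocab)

def sentence_cost(sentence):
    """Summed negative log2 probability over the sentence's bigrams."""
    tot = 0.0
    for grams in ngrams(sentence.split(), n):
        # logscore returns log2 P(word | context); negate it to get a cost.
        tot -= lm.logscore(grams[-1], list(grams[:-1]))
    return tot

good = sentence_cost("the cat sat on the mat")
bad = sentence_cost("oomfi ow Ba wmnondmam BE wBwHo")
```

As before, lower is better, and unknown words are mapped to the model's unknown-word token rather than raising an error.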