Question

I'd like to identify keywords in scanned documents that may contain OCR errors. Given a list of keywords and, for the scanned document, confidence values for each character and its alternatives, how can I develop an algorithm for reliably identifying the keywords?

For OCR I am using Tesseract, which provides confidence values for each character and its best alternatives. So for every word I have a list like this:

 Word=order
 [0] o (93%) [alts: 0 (90%), c (83%), e (82%)]
 [1] r (96%)
 [2] d (96%)
 [3] e (90%) [alts: a (75%)]
 [4] r (95%) 

Another example, this time including OCR errors:

 Word=PaYmeHI (Payment would be correct)
 [0] P (81%) [alts: p (78%), D (68%)]
 [1] a (76%) [alts: 3 (73%), e (63%), ö (61%)]
 [2] Y (87%) [alts: V (86%)]
 [3] m (83%) 
 [4] E (71%) [alts: € (79%), 9 (72%), % (67%), e (65%), B (64%), G (64%)]
 [5] H (76%) [alts: n (83%), D (74%), N (70%), ü (69%)]
 [6] I (75%) [alts: t (69%), { (67%), 1 (65%), i (61%)]

As you can see, Tesseract doesn't always choose the result with the highest percentage (see positions 4 and 5).

From skimming through the results it appears that most characters with a value above 90% are correct. However, bad results don't necessarily contain the correct character in the list of alternatives (see [2], which should be a lowercase y).

Currently I am getting a list of candidates by using the Levenshtein distance and the string length. Furthermore, I am excluding keywords where lev2 > 3. This threshold is just hardcoded, as I am still looking for a good way to determine it.

      int lev = getLevenshteinDistance(keyword, s);                              // case-sensitive edit distance
      int lev2 = getLevenshteinDistance(keyword.toLowerCase(), s.toLowerCase()); // case-insensitive edit distance
      int len = Math.abs(keyword.length() - s.length());                         // length difference
      int x = lev + lev2 + len;                                                  // combined distance used for ranking

I am sorting the list of keywords by x to get the most likely results.

So first, I am looking for a way to determine a good threshold based on the OCR result and the string length. A short string will require a lower threshold than a longer one, and so will a solid OCR result. Take the example from above: for the word order, lev2 <= 1 would be sufficient, whereas for payment at least lev2 <= 3 should be allowed.

Second, how can I decide whether one of the remaining candidates actually matches the word? When lev == 0 and the confidence values of all characters are >= 90, it is obvious. But considering bad OCR results, what algorithm can I develop that also takes the alternative OCR choices into account?


Solution

I have been thinking about something similar for a project of mine; I haven't got any good answers yet, but here are some thoughts:

I think the question we're trying to answer is this:

Does this document (the OCR result) contain the term 'order'?

Idea 1

The OCR document contains terms with some 'score' ...

So in your example, the document contains:

  • order with score = sum(93,96,96,90,95)/5 = 94
  • 0rder with score = sum(90,96,96,90,95)/5 = 93
  • crder with score = sum(83,96,96,90,95)/5 = 92
  • erder with score = sum(82,96,96,90,95)/5 = 91
  • ordar with score = sum(93,96,96,75,95)/5 = 91
  • 0rdar with score = sum(90,96,96,75,95)/5 = 90
  • crdar with score = sum(83,96,96,75,95)/5 = 89
  • erdar with score = sum(82,96,96,75,95)/5 = 88

Now that we have a score for each candidate, we can get a score for the document given some query (using the Levenshtein distance for now ...).

The score for the document given the keyword "order" is the average of

  • ((3 - min(lev(order, order), 3)) * 0.33) * 94,
  • ((3 - min(lev(0rder, order), 3)) * 0.33) * 93,
  • ((3 - min(lev(crder, order), 3)) * 0.33) * 92,
  • ...,
  • ...

If this score is higher than some threshold, the document is deemed to match 'order'.
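
Here is a minimal Java sketch of Idea 1 (all class, record and method names are made up for illustration): it expands the per-character alternatives into candidate terms, scores each candidate by its average character confidence, and then scores the document for a keyword with the Levenshtein-based weight from the formula above.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Sketch of Idea 1: expand the per-character alternatives into candidate
     * terms, score each by its average character confidence, and score the
     * document for a keyword with a Levenshtein-based weight.
     */
    public class Idea1Sketch {

        /** One recognised character with its confidence (0..100). */
        record Guess(char ch, int confidence) {}

        /** A candidate term with its average-confidence score. */
        record Candidate(String term, double score) {}

        /**
         * positions.get(i) holds the best guess plus the alternatives for
         * character i, e.g. [o 93, 0 90, c 83, e 82] for the first character
         * of "order".
         */
        static List<Candidate> generateCandidates(List<List<Guess>> positions) {
            List<Candidate> result = new ArrayList<>();
            expand(positions, 0, new StringBuilder(), 0, result);
            return result;
        }

        private static void expand(List<List<Guess>> positions, int pos,
                                   StringBuilder prefix, int confidenceSum,
                                   List<Candidate> out) {
            if (pos == positions.size()) {
                out.add(new Candidate(prefix.toString(),
                        (double) confidenceSum / positions.size()));
                return;
            }
            for (Guess g : positions.get(pos)) {
                prefix.append(g.ch());
                expand(positions, pos + 1, prefix, confidenceSum + g.confidence(), out);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }

        /** Document score for a keyword: average of weight(lev) * candidateScore. */
        static double documentScore(String keyword, List<Candidate> candidates) {
            double sum = 0;
            for (Candidate c : candidates) {
                int lev = levenshtein(keyword, c.term());
                double weight = (3 - Math.min(lev, 3)) * 0.33;  // as in the formula above
                sum += weight * c.score();
            }
            return sum / candidates.size();
        }

        /** Plain dynamic-programming Levenshtein distance. */
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }
    }

With the example data above, documentScore("order", candidates) is dominated by the high-scoring candidates close to "order", and the result can then be compared against the threshold.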

Idea 2

We can improve the OCR results with some language models

Compute score for each term as follows:

term   | ocr_score | ngram score          | combined score
-------+-----------+----------------------+---------------
order  | 94        | score(ord, rde, der) | ocr * ngram
0rder  | 93        | score(0rd, rde, der) | ocr * ngram
crder  | 92        | score(crd, rde, der) | ocr * ngram
erder  | 91        | score(erd, rde, der) | ...
ordar  | 91        | score(ord, rda, der) | ...
0rdar  | 90        | score(0rd, rda, der) | ...
crdar  | 89        | score(crd, rda, der) | ...
erdar  | 88        | score(erd, rda, der) | ...

Where score(ord) = trigram probability of 'ord'

Google Books, for example, gives the trigram probability of any trigram (see: http://books.google.com/ngrams/chart?content=ord&corpus=0&smoothing=3&year_start=1970&year_end=2000)

We could also compute unigram, bigram, and quadgram scores; we could then compute a score based on the "unigram" probability of the words themselves, on bigrams of words, and so on; we could also apply some purely analytic language models.

So we now have several scores for each 'candidate term', and we combine them all, with some weight for each score, to get a combined score for the term.
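
A small Java sketch of the combination step, assuming a character-trigram probability table has already been loaded from some corpus (the trigramProb map and all other names here are hypothetical):

    import java.util.Map;

    /**
     * Sketch of Idea 2: combine the OCR score of a candidate term with a
     * character-trigram language-model score.
     */
    public class Idea2Sketch {

        /** Hypothetical trigram probabilities, e.g. derived from a corpus. */
        private final Map<String, Double> trigramProb;

        Idea2Sketch(Map<String, Double> trigramProb) {
            this.trigramProb = trigramProb;
        }

        /** Geometric mean of the trigram probabilities of a term. */
        double ngramScore(String term) {
            if (term.length() < 3) return smoothed(term);
            double logSum = 0;
            int count = 0;
            for (int i = 0; i + 3 <= term.length(); i++) {
                logSum += Math.log(smoothed(term.substring(i, i + 3)));
                count++;
            }
            return Math.exp(logSum / count);
        }

        /** Tiny floor so unseen trigrams don't zero out the whole term. */
        private double smoothed(String gram) {
            return trigramProb.getOrDefault(gram, 1e-9);
        }

        /** Combined score = ocr * ngram, as in the table above. */
        double combinedScore(double ocrScore, String term) {
            return ocrScore * ngramScore(term);
        }
    }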

Idea 3

OK, so the above leads to an explosion of terms/scores, which is compute-intensive; so we use some magic to build a probabilistic DFA for each term based on ideas 1 & 2. The document then contains probabilistic DFAs rather than terms. The Lucene folks have done some work to build Levenshtein DFAs and to check quickly whether DFA1 and DFA2 match...
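
For the record, a hedged sketch of what using Lucene's Levenshtein automata could look like (it needs the lucene-core artifact; class names and behaviour may differ between Lucene versions, and Lucene only supports edit distances up to 2):

    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.CharacterRunAutomaton;
    import org.apache.lucene.util.automaton.LevenshteinAutomata;

    /**
     * Builds an automaton that accepts every string within edit distance 2
     * of the keyword and runs candidate terms through it.
     */
    public class LevenshteinDfaSketch {

        public static void main(String[] args) {
            Automaton automaton = new LevenshteinAutomata("payment", false).toAutomaton(2);
            CharacterRunAutomaton matcher = new CharacterRunAutomaton(automaton);

            System.out.println(matcher.run("payment"));  // true  (distance 0)
            System.out.println(matcher.run("paymeni"));  // true  (distance 1)
            System.out.println(matcher.run("PaYmeHI"));  // false (case-sensitive distance is 4)
        }
    }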

Other tips

First of all, I think your program is giving you P(observation|symbol), not P(symbol|observation). By Bayes' rule, P(symbol|observation) is proportional to P(observation|symbol) * P(symbol).

For example, for the e in payment, although the probability of the observed pattern given the symbol was highest for the euro sign, the prior probability of a euro sign occurring is very small. Therefore, the character is most likely 'e', not the euro sign.

Therefore, my suggestion would be to sum log( P(observation|symbol) * P(symbol) ) over the characters of each possible word and pick the word that maximizes this value.

Furthermore, instead of using a plain P(symbol), you can get a more accurate estimate by using the context.
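
A minimal Java sketch of this suggestion, assuming the per-position confidences from Tesseract have been normalised to probabilities and using a hypothetical table of symbol priors (all names here are made up):

    import java.util.List;
    import java.util.Map;

    /**
     * Treats the OCR confidences as P(observation|symbol), multiplies by a
     * prior P(symbol), and sums the logs over all character positions of a
     * keyword; the keyword with the highest total wins.
     */
    public class BayesianKeywordScorer {

        /** P(symbol): hypothetical prior character frequencies. */
        private final Map<Character, Double> symbolPrior;

        /** P(observation|symbol) per position, built from Tesseract's choices (0..1). */
        private final List<Map<Character, Double>> observationLikelihood;

        BayesianKeywordScorer(Map<Character, Double> symbolPrior,
                              List<Map<Character, Double>> observationLikelihood) {
            this.symbolPrior = symbolPrior;
            this.observationLikelihood = observationLikelihood;
        }

        /** Sum of log(P(observation|symbol) * P(symbol)) over the keyword's characters. */
        double score(String keyword) {
            if (keyword.length() != observationLikelihood.size()) {
                return Double.NEGATIVE_INFINITY;   // crude handling of length mismatches
            }
            double logScore = 0;
            for (int i = 0; i < keyword.length(); i++) {
                char symbol = keyword.charAt(i);   // case folding is left to the caller
                // Small floors so symbols Tesseract never proposed don't zero the product.
                double pObsGivenSymbol = observationLikelihood.get(i).getOrDefault(symbol, 0.01);
                double pSymbol = symbolPrior.getOrDefault(symbol, 0.001);
                logScore += Math.log(pObsGivenSymbol * pSymbol);
            }
            return logScore;
        }

        /** Picks the keyword that maximizes the score. */
        String bestKeyword(List<String> keywords) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String k : keywords) {
                double s = score(k);
                if (s > bestScore) {
                    bestScore = s;
                    best = k;
                }
            }
            return best;
        }
    }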

Licensed under: CC-BY-SA with attribution