سؤال

I have an image that contains textual information and:

  1. I extract/crop a small image from it
  2. I am using an OCR to extract the text from the small image
  3. Check if the extracted value matches a pattern ( Float , date ...) if so
  4. I store the value in database

the problem is : sometimes the ocr extracts a value with some symbols in it so it doesn't match the pattern example : for the pattern date i have :

pattern = "(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d"

the value from the image is

12/02/2014

but the OCR extracted :

12? /02 -2014

i want to get the similarity between the pattern and the value extracted ( to treat it lately ) is there a way to do that without changing the pattern ?

هل كانت مفيدة؟

المحلول

A specific regex cannot be used to match a pattern with ambiguities without modifications that allow such ambiguities. For example, if you would like to allow an insertion of extra characters in arbitrary spots of the string being matched, the regex pattern would need to have provisions that much these arbitrary characters. This makes the pattern ugly very quickly: for example, while a pattern for matching an int is very simple,

\\d+

the same pattern that allows non-digits in between would look like this:

(\\d\\D*)+

This gets uglier and uglier as the pattern gets larger, so this approach is not very good.

I would recommend replacing a pattern-based matching with an algorithm that implements a variation of Levenshtein distance.

The original Levenshtein distance algorithm takes two strings, and returns the number of modifications that need to be done on one string in order to get the other one. Your algorithm should take a string and a pattern. The pattern should use some sort of a designator for digits (say, #) and treat all other characters "literally", as string characters. You would modify the indicator function used in the algorithm to return zero when you send it a # and any digit, and 1 otherwise.

Take a look at the implementation with two matrix rows, it is the most space-efficient. The indicator function is implemented on this line:

var cost = (s[i] == t[j]) ? 0 : 1;

Changing it to

int cost = (s[i] == t[j] || (Character.isDigit(s[i]) && t[j] == '#')) ? 0 : 1;

would allow you to "match" digits. Your code could also remove all whitespace from the string before doing the match.

You could decide on the quality of the match by checking the Levenshtein distance. A distance of zero shows a perfect match; a distance of one or two is pretty good for short patterns; a distance of five or more is probably unacceptable.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top