Question

I am using Microsoft MODI in VB6 to OCR an image. (I know about other OCR tools such as Tesseract, but I find MODI more accurate than the others.)

The image to be OCRed is like this

(image not available)

and the text that I get after OCR looks like this:

Text1
Text2
Text3
Number1
Number2
Number3

The problem here is that the correspondence between text in opposite columns is not maintained. How can I map Number1 to Text1?

I can only think of a solution like this.

MODI provides co-ordinates of all the OCRed words like this

LeftPos = Img.Layout.Words(0).Rects(0).Left
TopPos = Img.Layout.Words(0).Rects(0).Top

So, to align the words on the same line, we can match the TopPos of each word and then sort them by LeftPos, which gives the complete line. I looped through all the words, stored their text along with their left and top positions in a MySQL table, and then ran this query:

SELECT group_concat(word ORDER BY `left` SEPARATOR ' ')
FROM test_copy
GROUP BY `top`

My problem is that the Top positions are not exactly the same for each word; obviously there will be a difference of a couple of pixels.

I tried adding DIV 5 to merge words that are within a 5-pixel range, but that doesn't work in some cases. I also tried doing it in Node.js by calculating a tolerance for each word and then sorting by LeftPos, but I still feel this is not the best way to do it.
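
For illustration, here is a small Python sketch of one way fixed-size bucketing can fail. The pixel values are made up, not taken from the actual scan. Integer division by 5 puts the bucket boundaries at fixed positions, so two words only one pixel apart can end up in different groups when they straddle a boundary:

```python
# Hypothetical Top values for two words on the same visual line.
text1_top = 99
number1_top = 100

# Equivalent of MySQL's `top DIV 5`: integer division into fixed 5-pixel buckets.
bucket_text1 = text1_top // 5
bucket_number1 = number1_top // 5

same_line = bucket_text1 == bucket_number1
print(bucket_text1, bucket_number1, same_line)  # 19 20 False
```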

Update: The JS code does the job, except for the case where Number1 has a 5-pixel difference and Text2 has no corresponding word on that line.

Is there any better idea to do this?

Solution

I'm not 100% sure how you identify the words that are in your "left" column, but once you have such a word identified you can find the other words on its line by projecting not just the Top coordinate but the whole rectangle across (both top and bottom), and then determining the overlap (intersection) with the other words. Note the area marked in red below.

[image: Horizontal projection]

This is the tolerance you can use to detect if something is on the same line. If something overlaps by only a pixel then it is probably from a lower or higher line. But if it overlaps by, say, 50% or more of the height of Text1, then it is likely on the same line.
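
As a rough sketch of that overlap test, the Python snippet below computes how much of the anchor word's height another word covers. The function name, the sample coordinates, and the 0.5 threshold are illustrative assumptions; it also assumes image coordinates where Top is smaller than Bottom.

```python
def vertical_overlap_fraction(anchor_top, anchor_bottom, word_top, word_bottom):
    """Return the share of the anchor's height that the other word overlaps."""
    anchor_height = anchor_bottom - anchor_top
    # Intersection of the two vertical ranges; zero or negative means no overlap.
    overlap = min(anchor_bottom, word_bottom) - max(anchor_top, word_top)
    if anchor_height <= 0 or overlap <= 0:
        return 0.0
    return overlap / anchor_height


# The anchor (think Text1) spans rows 100..120 in this made-up example.
same_line = vertical_overlap_fraction(100, 120, 104, 124)  # 16 of 20 rows overlap
next_line = vertical_overlap_fraction(100, 120, 119, 139)  # 1 of 20 rows overlaps

print(same_line >= 0.5)  # True
print(next_line >= 0.5)  # False
```

Measuring against the anchor's height rather than a fixed pixel count keeps the tolerance proportional to the font size.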


Example SQL to find all words in the "line" based on a top and bottom coord

select 
    word.id, word.Top, word.Left, word.Right, word.Bottom 
from 
    word
where 
    -- two vertical ranges intersect when each one starts before the other
    --   ends; this also catches a taller word that fully spans the anchor
    word.Top <= @leftColWordBottom
    and word.Bottom >= @leftColWordTop

Example pseudo VB6 code to calculate the lines as well.

'assume words is a collection of WordInfo objects with an Id, Top, 
'   Left, Bottom, Right properties filled in, and a LineAnchorWordId 
'   property that has not been set yet.

'get the words in left-to-right order
Set wordsLeftToRight = SortLeftToRight(words)

'also get the words in top-to-bottom order
Set wordsTopToBottom = SortTopToBottom(words)

'pass through identifying a line "anchor", that being the left-most 
'   word that starts (and defines) a line
For Each anchorWord In wordsLeftToRight

    'check if the word has been mapped to a line yet by checking if 
    '   its anchor property has been set yet.  This assumes 0 is not 
    '   a valid id, use -1 instead if needed
    If anchorWord.LineAnchorWordId = 0 Then

        'now locate every word on this line, as bounded by the 
        '   anchorWord.  every word determined to be on this line 
        '   gets its LineAnchorWordId property set to the Id of the 
        '   anchorWord
        For Each lineWord In wordsTopToBottom

            If lineWord.Bottom < anchorWord.Top Then

                'skip it, it is above the line (but keep searching down
                '   because we haven't reached the anchorWord location yet)

            ElseIf lineWord.Top > anchorWord.Bottom Then

                'skip it, it is below the line, and exit the search 
                '   early since all the rest will also be below the line
                Exit For

            ElseIf lineWord.LineAnchorWordId = 0 Then

                'only claim words that no earlier anchor has already taken
                If OverlapsWithinTolerance(anchorWord, lineWord) Then
                    lineWord.LineAnchorWordId = anchorWord.Id
                End If

            End If

        Next lineWord

    End If

Next anchorWord

'at this point, every word has been assigned a LineAnchorWordId, 
'   and every word on the same line will have a matching LineAnchorWordId
'   value.  If stored in a DB you can now group them by LineAnchorWordId 
' and sort them by their Left coord to get your output.
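
For readers who want something runnable, here is a minimal Python sketch of the same anchor-based grouping. The `Word` tuple, the `group_into_lines` name, the sample coordinates, and the 50% threshold are all assumptions made for illustration, not MODI output. For brevity it scans every word for each anchor instead of using the top-to-bottom early exit.

```python
from collections import namedtuple

# Hypothetical word record: text plus the rectangle edges needed here.
Word = namedtuple("Word", "text left top bottom")


def group_into_lines(words, min_fraction=0.5):
    """Group words into lines using the left-most unassigned word as anchor."""
    order = sorted(range(len(words)), key=lambda i: words[i].left)
    assigned = set()
    lines = []
    for a in order:
        if a in assigned:
            continue  # already claimed by an earlier anchor
        anchor = words[a]
        height = max(anchor.bottom - anchor.top, 1)
        line = []
        for i in order:  # left-to-right, so each line comes out sorted
            if i in assigned:
                continue
            w = words[i]
            overlap = min(anchor.bottom, w.bottom) - max(anchor.top, w.top)
            if i == a or overlap / height >= min_fraction:
                assigned.add(i)
                line.append(w)
        lines.append(line)
    lines.sort(key=lambda line: line[0].top)  # order lines top to bottom
    return [" ".join(w.text for w in line) for line in lines]


sample = [
    Word("Text1", 10, 100, 120), Word("Text2", 10, 130, 150),
    Word("Text3", 10, 160, 180), Word("Number1", 200, 105, 125),
    Word("Number2", 200, 128, 148), Word("Number3", 200, 163, 183),
]
result = group_into_lines(sample)
print(result)  # ['Text1 Number1', 'Text2 Number2', 'Text3 Number3']
```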
Licensed under: CC-BY-SA with attribution