Convert OCRed unstructured text into proper text

https://stackoverflow.com//questions/22039207

21-12-2019

Question

I am using MODI from Microsoft in VB6 to OCR images. (I know about other OCR tools such as Tesseract, but I find MODI more accurate than the others.)

The image to be OCRed looks like this:

[image]

The text I get after OCR is as follows:

Text1
Text2
Text3
Number1
Number2
Number3

Here the pairing between each text and the value in the opposite column is lost. How can I map Number1 to Text1?

I can think of a solution like this. MODI provides the coordinates of every OCRed word, like this:

LeftPos = Img.Layout.Words(0).Rects(0).Left
TopPos = Img.Layout.Words(0).Rects(0).Top

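To line up the words on the same row, we can match each word's TopPos and then sort by LeftPos, which gives us a complete line. So I looped through all the words and stored each word's text together with its left and top positions in a MySQL table. Then I ran this query: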

SELECT group_concat(word ORDER BY `left` SEPARATOR ' ')
FROM test_copy
GROUP BY `top`

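My problem is that the top positions are not exactly the same for every word on a line; there will obviously be a few pixels of difference.

I tried adding DIV 5 to merge words that fall within a 5-pixel range, but it does not work in some cases. I also tried doing it in Node.js by calculating a tolerance for each word and then sorting by LeftPos, but I still feel this is not the best way to do it.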
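To make the failure mode concrete, here is a minimal Python sketch of fixed-width bucketing, the same idea as `top DIV 5`. The coordinates are made up for illustration and are not real MODI output. Two words whose tops differ by a single pixel can still land on opposite sides of a bucket boundary:

```python
# Hypothetical top coordinates (in pixels) of two words on the same visual line.
tops = {"Text1": 104, "Number1": 105}

# Fixed-width bucketing: integer-divide each top by the bucket height (5 px).
buckets = {word: top // 5 for word, top in tops.items()}

# The tops are 1 px apart, yet 104 // 5 and 105 // 5 are different buckets.
same_line = buckets["Text1"] == buckets["Number1"]
print(buckets, same_line)
```

Any grouping that rounds a single coordinate to a fixed grid has this boundary problem, regardless of the bucket size chosen.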

Update: the JS code does the job, except in the case where Number1 has a 5-pixel difference and Text2 has no corresponding entry on that line.

Is there a better idea for doing this?

Answer

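I'm not 100% sure how you are identifying the words in the "left" column, but once you have identified such a word you can find the other words on its line by projecting not just the top coordinate but the entire rectangle (both top and bottom) across the page. Then determine the overlap (intersection) with the other words. Note the area marked in red below.

[image: horizontal projection]

The amount of overlap is the tolerance you can use to detect whether something is on the same line. If something overlaps by only a pixel, it is probably from a lower or higher line. But if it overlaps by, say, 50% or more of the height of Text1, then it is likely on the same line.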
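As a rough sketch of that test in Python (the function names, the sample coordinates, and the 0.5 default are illustrative assumptions, not part of MODI), the overlap can be measured as a fraction of the anchor word's height:

```python
def vertical_overlap_ratio(anchor_top, anchor_bottom, top, bottom):
    """Fraction of the anchor's height that the other word's vertical extent covers."""
    height = anchor_bottom - anchor_top
    if height <= 0:
        return 0.0
    # Height of the intersection of the two vertical ranges (negative if disjoint).
    overlap = min(anchor_bottom, bottom) - max(anchor_top, top)
    return max(0, overlap) / height


def on_same_line(anchor_top, anchor_bottom, top, bottom, min_ratio=0.5):
    """True when the overlap reaches the chosen threshold (0.5 is an assumed default)."""
    return vertical_overlap_ratio(anchor_top, anchor_bottom, top, bottom) >= min_ratio


# Hypothetical anchor word spanning y = 100..120.
print(on_same_line(100, 120, 105, 125))  # shifted down 5 px, large overlap
print(on_same_line(100, 120, 119, 140))  # touches by 1 px only
```

Because the ratio is relative to the anchor's height, the same threshold should behave similarly across font sizes, which a fixed pixel tolerance does not.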

Example SQL to find all the words on a "line" based on a top and bottom coordinate:

select 
    word.id, word.Top, word.Left, word.Right, word.Bottom 
from 
    word
where 
    -- the word's vertical range intersects the anchor's range; this also
    -- catches a word tall enough to fully enclose the anchor word
    word.Top <= @leftColWordBottom
    and word.Bottom >= @leftColWordTop

Example pseudo VB6 code to calculate the lines as well:

'assume words is a collection of WordInfo objects with an Id, Top, 
'   Left, Bottom, Right properties filled in, and a LineAnchorWordId 
'   property that has not been set yet.

'get the words in left-to-right order
wordsLeftToRight = SortLeftToRight(words) 

'also get the words in top-to-bottom order
wordsTopToBottom = SortTopToBottom(words) 

'pass through identifying a line "anchor", that being the left-most 
'   word that starts (and defines) a line
For Each anchorWord In wordsLeftToRight

    'check if the word has been mapped to a line yet by checking if 
    '   its anchor property has been set yet.  This assumes 0 is not 
    '   a valid id, use -1 instead if needed
    If anchorWord.LineAnchorWordId = 0 Then 

        'now locate every word on this line, as bounded by the 
        '   anchorWord.  every word determined to be on this line 
        '   gets its LineAnchorWordId property set to the Id of the 
        '   anchorWord
        For Each lineWord In wordsTopToBottom

            If lineWord.Bottom < anchorWord.Top Then

                'skip it, it is above the line (but keep searching down
                '   because we haven't reached the anchorWord location yet)

            ElseIf lineWord.Top > anchorWord.Bottom Then

                'skip it, it is below the line, and exit the search 
                '   early since all the rest will also be below the line
                Exit For

            ElseIf OverlapsWithinTolerance(anchorWord, lineWord) Then

                lineWord.LineAnchorWordId = anchorWord.Id

            End If

        Next lineWord

    End If

Next anchorWord

'at this point, every word has been assigned a LineAnchorWordId, 
'   and every word on the same line will have a matching LineAnchorWordId
'   value.  If stored in a DB you can now group them by LineAnchorWordId 
' and sort them by their Left coord to get your output.
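For readers who want something runnable, here is a minimal Python sketch of the same anchor-based grouping. It is an illustration under stated assumptions rather than a port of any MODI API: the `Word` class, the sample coordinates, and the 50% threshold are all hypothetical, and the check that skips words already claimed by an earlier anchor is an addition made here.

```python
from dataclasses import dataclass


@dataclass
class Word:
    id: int
    text: str
    left: int
    top: int
    bottom: int
    line_anchor_id: int = 0  # 0 means "not assigned to a line yet"


def overlaps_within_tolerance(anchor, word, min_ratio=0.5):
    """True if the word covers at least min_ratio of the anchor's height."""
    height = anchor.bottom - anchor.top
    overlap = min(anchor.bottom, word.bottom) - max(anchor.top, word.top)
    return height > 0 and overlap / height >= min_ratio


def group_into_lines(words):
    by_left = sorted(words, key=lambda w: w.left)
    by_top = sorted(words, key=lambda w: w.top)
    for anchor in by_left:
        if anchor.line_anchor_id != 0:
            continue  # already part of a line started by another anchor
        anchor.line_anchor_id = anchor.id  # the anchor defines its own line
        for word in by_top:
            if word.bottom < anchor.top:
                continue  # above the line, keep scanning downward
            if word.top > anchor.bottom:
                break  # below the line, and so is everything after it
            if word.line_anchor_id == 0 and overlaps_within_tolerance(anchor, word):
                word.line_anchor_id = anchor.id
    # Collect the text of each line left to right, then order lines top to bottom.
    lines = {}
    for word in by_left:
        lines.setdefault(word.line_anchor_id, []).append(word.text)
    anchor_top = {w.id: w.top for w in words}
    return [" ".join(lines[a]) for a in sorted(lines, key=anchor_top.get)]


# Hypothetical coordinates: the number column is offset by a few pixels.
words = [
    Word(1, "Text1", 10, 100, 120), Word(2, "Text2", 10, 130, 150),
    Word(3, "Text3", 10, 160, 180), Word(4, "Number1", 200, 104, 124),
    Word(5, "Number2", 200, 127, 147), Word(6, "Number3", 200, 163, 183),
]
lines = group_into_lines(words)
print(lines)
```

With these sample coordinates each number is grouped with the text beside it even though their top positions differ by 3 or 4 pixels.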

License: CC-BY-SA with attribution
