Converting OCRed unstructured text into properly aligned text
21-12-2019
Question
I am using Microsoft MODI in VB6 to OCR an image. (I know about other OCR tools such as Tesseract, but I find MODI more accurate than the others.)
The image to be OCRed looks like this
and after OCR the text I get looks like this:
Text1
Text2
Text3
Number1
Number2
Number3
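The problem is that each piece of text is no longer kept next to its counterpart in the adjacent column. How can I map Number1 to Text1?
The only solution I can think of is the following.
MODI provides the coordinates of every OCRed word, like this: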
LeftPos = Img.Layout.Words(0).Rects(0).Left
TopPos = Img.Layout.Words(0).Rects(0).Top
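So, to align the words that belong to the same line, we can match each word's TopPos and then sort by LeftPos. That gives us the complete line. I therefore loop through all the words, store each word's text together with its left and top in a MySQL table, and then run this query: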
SELECT group_concat(word ORDER BY `left` SEPARATOR ' ')
FROM test_copy
GROUP BY `top`
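My problem is that the top positions of the words are not exactly equal; there are always a few pixels of difference.
I tried adding DIV 5 to merge words that lie within a 5-pixel range, but that does not work in some cases. I also tried computing a tolerance for each word in node.js and then sorting by LeftPos, but I still feel this is not the best approach.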
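One likely reason a fixed DIV 5 key fails is that the bucket boundaries are fixed, so two words whose tops differ by a single pixel can still land in different buckets. A minimal Python sketch of that effect, using made-up coordinates (the values are illustrative assumptions, not taken from the image):

```python
# Hypothetical top coordinates (in pixels) of two words on the same visual line.
tops = {"Text1": 104, "Number1": 105}

# Integer division by 5 mirrors grouping on `top` DIV 5.
buckets = {word: top // 5 for word, top in tops.items()}

print(buckets)  # {'Text1': 20, 'Number1': 21}
```

The two tops are one pixel apart, yet they get different grouping keys, so the words end up on separate output lines.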
Update: the js code does the job, except in the case where Number1 is off by 5 pixels and Text2 has no counterpart on its line.
Any better ideas?
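Solution
I'm not 100% sure how you would identify the words in the "left" column, but once you have identified such a word, you can find the other words on its line by projecting not just its top coordinate but its whole rectangle (top and bottom) across the page, and then determining the overlap (intersection) with the other words. Note the area marked in red below.
You can use a tolerance on that overlap to decide whether something is on the same line. If something overlaps by only one pixel, it probably belongs to the line above or below. But if it overlaps by, say, 50% or more of the height of "Text1", it is most likely on the same line.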
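The overlap test described above can be sketched in Python roughly as follows. The function name, its parameters, and the 0.5 default are assumptions made for illustration; the answer only suggests "50% or more of the height" of the anchor word as a rule of thumb.

```python
def overlaps_within_tolerance(anchor_top, anchor_bottom, word_top, word_bottom,
                              min_ratio=0.5):
    """Return True if the word's vertical span covers at least `min_ratio`
    of the anchor word's height (hypothetical helper, not a MODI API)."""
    # Height of the intersection of the two vertical spans (negative if disjoint).
    overlap = min(anchor_bottom, word_bottom) - max(anchor_top, word_top)
    anchor_height = anchor_bottom - anchor_top
    if anchor_height <= 0:
        return False
    return overlap >= min_ratio * anchor_height


# Anchor spans 100..120; a word shifted down by 5 px still overlaps by 15 px.
print(overlaps_within_tolerance(100, 120, 105, 125))  # True
# A word that only touches the last pixel row belongs to the next line.
print(overlaps_within_tolerance(100, 120, 119, 140))  # False
```

Measuring the overlap against the anchor's height follows the wording of the answer. If some words are much shorter than the anchor, dividing by the smaller of the two heights may be a better fit.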
Sample SQL to find all the words on a "line" based on the top and bottom coordinates of the left-column word:
select
    word.id, word.Top, word.Left, word.Right, word.Bottom
from
    word
where
    -- the word's vertical extent intersects the left-column word's extent;
    -- this also catches words that are taller than the left-column word
    word.Top <= @leftColWordBottom
    and word.Bottom >= @leftColWordTop
Sample pseudo-VB6 code that works out the lines in the same way:
'assume words is a collection of WordInfo objects with Id, Top,
' Left, Bottom and Right properties filled in, and a LineAnchorWordId
' property that has not been set yet.

'get the words in left-to-right order
wordsLeftToRight = SortLeftToRight(words)

'also get the words in top-to-bottom order (sorted by Top)
wordsTopToBottom = SortTopToBottom(words)

'pass through identifying a line "anchor", that being the left-most
' word that starts (and defines) a line
For Each anchorWord In wordsLeftToRight
    'check whether the word has been mapped to a line yet by checking
    ' whether its anchor property has been set. This assumes 0 is not
    ' a valid id; use -1 instead if needed
    If anchorWord.LineAnchorWordId = 0 Then
        'now locate every word on this line, as bounded by the
        ' anchorWord. Every word determined to be on this line
        ' gets its LineAnchorWordId property set to the Id of the
        ' anchorWord (the anchorWord overlaps itself, so it is included)
        For Each lineWord In wordsTopToBottom
            If lineWord.Bottom < anchorWord.Top Then
                'skip it, it is above the line (but keep searching down
                ' because we haven't reached the anchorWord location yet)
            ElseIf lineWord.Top > anchorWord.Bottom Then
                'it is below the line, so exit the search early since
                ' all the rest will also be below the line
                Exit For
            ElseIf lineWord.LineAnchorWordId = 0 Then
                'only claim words that no earlier anchor has claimed
                If OverlapsWithinTolerance(anchorWord, lineWord) Then
                    lineWord.LineAnchorWordId = anchorWord.Id
                End If
            End If
        Next lineWord
    End If
Next anchorWord

'at this point, every word has been assigned a LineAnchorWordId,
' and every word on the same line will have a matching LineAnchorWordId
' value. If stored in a DB you can now group them by LineAnchorWordId
' and sort them by their Left coord to get your output.
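For comparison, here is a compact runnable sketch of the same anchor idea in Python. It is an illustration under stated assumptions rather than the answer's implementation: the `Word` class, `group_into_lines`, the 0.5 ratio, and all coordinates are hypothetical. The list index stands in for the word Id, and the early-exit optimisation over a top-sorted list is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: int
    top: int
    bottom: int
    anchor: int = -1  # index of this word's line anchor; -1 means unassigned


def group_into_lines(words, min_ratio=0.5):
    """Assign every word to a line anchor, then return the lines as strings,
    ordered top to bottom, with words ordered left to right."""
    # Visit candidate anchors from the left-most word to the right-most.
    for a in sorted(range(len(words)), key=lambda i: words[i].left):
        anchor = words[a]
        if anchor.anchor != -1:
            continue  # already claimed by an earlier anchor
        anchor.anchor = a
        height = anchor.bottom - anchor.top
        for w in words:
            if w.anchor != -1:
                continue  # never steal a word from another line
            overlap = min(anchor.bottom, w.bottom) - max(anchor.top, w.top)
            if height > 0 and overlap >= min_ratio * height:
                w.anchor = a

    # Group by anchor, order lines by their top-most word, words by left edge.
    lines = {}
    for w in words:
        lines.setdefault(w.anchor, []).append(w)
    ordered = sorted(lines.values(), key=lambda ws: min(w.top for w in ws))
    return [" ".join(w.text for w in sorted(ws, key=lambda w: w.left))
            for ws in ordered]


# Made-up coordinates: each number is a few pixels off from its text.
words = [
    Word("Text1", 10, 100, 120), Word("Text2", 10, 140, 160),
    Word("Text3", 10, 180, 200), Word("Number1", 200, 105, 125),
    Word("Number2", 200, 138, 158), Word("Number3", 200, 183, 203),
]
print(group_into_lines(words))
# ['Text1 Number1', 'Text2 Number2', 'Text3 Number3']
```

Because membership is decided by overlap with the anchor's rectangle rather than by a fixed pixel bucket, a word that is a few pixels off still joins the right line.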