Get text position with Tesseract 2.04 and Java

https://stackoverflow.com/questions/8390413

28-10-2019

Question

I'm performing OCR on some images using Tesseract 2.04, and now I have to get the precise position of the recognized text. But this version doesn't return that information.

I need this to generate a searchable PDF file. I have already learned how to stamp text in an underlying layer of the PDF, but I need the position at which to stamp it. My first idea is to perform OCR on the PDF, get the text and its position, and then stamp the text into the PDF with the iText API.

Solution

Internally at iText we have also looked into OCR, and it is possible (using Tesseract).

Workflow:

extract all images from the PDF using iText
extract the text (and coordinates, font, etc.) using Tesseract
apply coordinate transformations (since the Tesseract coordinate system and the iText coordinate system are not the same)
add a layer to the PDF (canvas.beginLayer)
draw all text in this layer at the correct position
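
The coordinate transformation step could be sketched roughly as follows. This is only an illustration under assumptions that are not stated in the answer: the OCR bounding box is given in image pixels with the origin at the top-left and y growing downwards, the image covers the whole page without rotation, and the page uses the default PDF user space (origin at the bottom-left, y growing upwards, units in points). The class and method names (`OcrToPdfCoordinates`, `PdfRect`, `toPdf`) are hypothetical and belong to neither iText nor Tesseract. If your OCR output already uses a bottom-left origin, the vertical flip is not needed.

```java
/** Hypothetical helper: maps an OCR bounding box in image pixels to PDF user space. */
final class OcrToPdfCoordinates {

    /** Axis-aligned rectangle in PDF points; (x, y) is the lower-left corner. */
    static final class PdfRect {
        final double x;
        final double y;
        final double width;
        final double height;

        PdfRect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Converts a pixel box (origin top-left, y down) to a PDF rectangle
     * (origin bottom-left, y up).
     *
     * Assumption: the image fills the page exactly, so scaling is simply
     * page size divided by image size, with no offset and no rotation.
     */
    static PdfRect toPdf(int left, int top, int right, int bottom,
                         int imageWidthPx, int imageHeightPx,
                         double pageWidthPt, double pageHeightPt) {
        if (imageWidthPx <= 0 || imageHeightPx <= 0) {
            throw new IllegalArgumentException("image size must be positive");
        }
        double scaleX = pageWidthPt / imageWidthPx;
        double scaleY = pageHeightPt / imageHeightPx;

        double x = left * scaleX;
        // Flip vertically: the bottom edge of the pixel box becomes the
        // lower edge of the PDF rectangle, measured from the page bottom.
        double y = (imageHeightPx - bottom) * scaleY;
        double width = (right - left) * scaleX;
        double height = (bottom - top) * scaleY;

        return new PdfRect(x, y, width, height);
    }
}
```

For example, with a 1000 x 2000 pixel image placed on a 500 x 1000 point page, the pixel box from (100, 200) to (300, 260) maps to a rectangle with lower-left corner (50, 870), 100 points wide and 30 points high. An image that covers only part of the page, or a rotated page, would need an additional offset or a full transformation matrix.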

There are many more optimizations you could do. A short list of suggestions:

correct baseline
correct font
correct spelling mistakes
estimate color
estimate background color
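
As an illustration of the last two suggestions, here is one possible heuristic for estimating the text color and the background color inside a word's bounding box. It is a sketch, not what iText does internally: it splits the pixels of the region at their mean luminance and assumes that the smaller group is the text and the larger group is the background. The class name `RegionColorEstimator` and its methods are hypothetical; only `java.awt.image.BufferedImage` from the standard library is used.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: guesses text and background color of an image region. */
final class RegionColorEstimator {

    /**
     * Returns {textRgb, backgroundRgb} as packed 0xRRGGBB values for the
     * region [left, right) x [top, bottom) in pixel coordinates.
     *
     * Assumption: text covers fewer pixels than the background.
     */
    static int[] estimate(BufferedImage img, int left, int top, int right, int bottom) {
        // First pass: mean luminance of the region.
        long lumaSum = 0;
        long count = 0;
        for (int y = top; y < bottom; y++) {
            for (int x = left; x < right; x++) {
                lumaSum += luma(img.getRGB(x, y));
                count++;
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("empty region");
        }
        double mean = (double) lumaSum / count;

        // Second pass: sum the channels of the darker and the lighter pixels.
        long[] dark = new long[4];   // red, green, blue, pixel count
        long[] light = new long[4];
        for (int y = top; y < bottom; y++) {
            for (int x = left; x < right; x++) {
                int rgb = img.getRGB(x, y);
                long[] acc = luma(rgb) < mean ? dark : light;
                acc[0] += (rgb >> 16) & 0xFF;
                acc[1] += (rgb >> 8) & 0xFF;
                acc[2] += rgb & 0xFF;
                acc[3]++;
            }
        }

        // The smaller group is taken to be the text.
        long[] text = dark[3] <= light[3] ? dark : light;
        long[] background = (text == dark) ? light : dark;
        return new int[] { average(text), average(background) };
    }

    /** Integer approximation of perceived brightness, 0..255. */
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Averages the accumulated channels; an empty group falls back to black. */
    static int average(long[] acc) {
        if (acc[3] == 0) {
            return 0x000000;
        }
        int r = (int) (acc[0] / acc[3]);
        int g = (int) (acc[1] / acc[3]);
        int b = (int) (acc[2] / acc[3]);
        return (r << 16) | (g << 8) | b;
    }
}
```

This will misjudge bold text that fills most of its box, and anti-aliased edges pull both averages towards each other, so treat the result as a starting point only.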

This is not an easy task, but it is certainly possible.

Licensed under: CC-BY-SA with attribution
