Question

I'm running OCR with Tesseract 2.04 on some images, and now I need the precise position of the recognized text. This version doesn't return that information.

I need this to generate a searchable PDF file. I have already learned how to stamp text in a layer underneath the PDF content, but I need the position at which to stamp it. My first idea is to run OCR on the PDF, get the text and its position, and stamp it into the PDF with the iText API.

Solution

Internally at iText we have also looked into OCR, and it is possible (using Tesseract).

Workflow:

  1. Extract all images from the PDF using iText.
  2. Extract the text (and coordinates, font, etc.) from each image using Tesseract.
  3. Apply coordinate transformations, since the Tesseract coordinate system and the iText coordinate system are not the same.
  4. Add a layer to the PDF (canvas.beginLayer).
  5. Draw all the text in this layer at the correct position.
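
The coordinate transformation in step 3 can be sketched as follows. This is only an illustration under some assumptions, not iText or Tesseract code: it assumes the OCR boxes are in image pixels with the origin at the top-left and y growing downward, that PDF user space has its origin at the bottom-left with y growing upward, and that you already know the rectangle (in points) where the image is drawn on the page, without rotation. The names `Box` and `ocr_box_to_pdf` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in PDF points (hypothetical helper type)."""
    left: float
    bottom: float
    width: float
    height: float


def ocr_box_to_pdf(px_left, px_top, px_width, px_height, image_px_size, placement):
    """Map an OCR word box (pixels, top-left origin, y down) to PDF points.

    image_px_size: (width, height) of the OCR'd image in pixels.
    placement: Box where that image is drawn on the page, in points.
    Assumes the image is not rotated or skewed on the page.
    """
    img_w, img_h = image_px_size
    # Scale factors from pixels to points along each axis.
    sx = placement.width / img_w
    sy = placement.height / img_h
    left = placement.left + px_left * sx
    # Flip the y axis: the lower edge of the word is (px_top + px_height)
    # pixels below the top of the image.
    bottom = placement.bottom + (img_h - (px_top + px_height)) * sy
    return Box(left, bottom, px_width * sx, px_height * sy)


# Example: a 1000x2000 px scan drawn at (10, 20) with size 500x1000 pt.
word = ocr_box_to_pdf(100, 200, 300, 50, (1000, 2000), Box(10, 20, 500, 1000))
print(word)
```

The resulting `left` and `bottom` values are where the text would be positioned when drawing it in the layer from steps 4 and 5; the box height gives a rough starting point for the font size.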

There are many more optimizations you could do. A short list of suggestions:

  • correct baseline
  • correct font
  • correct spelling mistakes
  • estimate color
  • estimate background color

This is not an easy task, but it is certainly possible.

Licensed under: CC-BY-SA with attribution