Question

I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is very important to me. Currently Tesseract does not preserve the structure; in fact, it changes the order of the text. My input is the image below.

input

and the output I am getting is as follows:

Someto the left
Someto the left

Some in the middle
Some in the middle

Some with some tab
Some with some tab

Some with some space between them
Some with some space between them

Sometext here
Sometext here

this much
this much

How do I get output with the same structure as in the image?

i.e. as follows:

                                                 Some text here
                                                 Some text here

Some to the left
Some to the left

                    Some in the middle
                    Some in the middle

        Some with some tab
        Some with some tab

Some with some space between them                       this much
Some with some space between them                       this much

Solution

Newer versions of Tesseract (3.04) have an option called preserve_interword_spaces, which should do what you want.

Note that the number of spaces Tesseract detects between words may not be consistent across similar lines. So words that are left-aligned after a run of spaces (as in your example) may not line up in the output. The preserve_interword_spaces option does not attempt anything fancy; it merely preserves the spaces that extraction found. By default, Tesseract collapses runs of spaces into one.

Details on this option are here.
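As a rough illustration of passing this option from a script, here is a minimal Python sketch. It assumes the tesseract executable is on your PATH and that your build accepts the -c VAR=VALUE form for setting a config variable on the command line. The function names and file names are made up for this example.

```python
import subprocess
from pathlib import Path


def build_spaces_command(image_path, output_base):
    """Build a tesseract command line that keeps runs of spaces."""
    return [
        "tesseract", str(image_path), str(output_base),
        "-c", "preserve_interword_spaces=1",
    ]


def ocr_preserving_spaces(image_path, output_base):
    """Run tesseract and return the recognized text.

    Tesseract writes plain-text output to <output_base>.txt.
    """
    subprocess.run(build_spaces_command(image_path, output_base), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Example call (needs tesseract installed; "input.png" is a placeholder):
# print(ocr_preserving_spaces("input.png", "output"))
```

As noted above, the spacing you get back is whatever the extraction found, so columns may still not line up exactly.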

OTHER TIPS

The only reliable way is to enable hOCR output and parse it. It contains the position of each word on the page, in pixels, as in the original image.

You can do it by specifying tessedit_create_hocr 1 in Tesseract's config file, or in whatever API you use.

hOCR is a subset of HTML, and what Tesseract generates isn't always valid XML, so you can either use an HTML parser or write your own, but you can't reliably use an XML parser.
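To make the hOCR approach concrete, here is a minimal sketch using Python's standard-library HTML parser. It assumes each word appears as an element with class ocrx_word whose title attribute starts with "bbox left top right bottom", which is how Tesseract's hOCR output usually looks. The class name HocrWordParser, the function render_layout, and the char_width and line_tolerance values are all made up for this example and would need tuning for a real page.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (left, top, right, bottom, text) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._depth = 0     # open tags inside the current word
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1          # nested tag inside a word, e.g. <strong>
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:          # the word element itself just closed
            text = "".join(self._text).strip()
            if text:
                self.words.append(self._bbox + (text,))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def render_layout(words, char_width=10, line_tolerance=10):
    """Place words on a character grid using their pixel positions."""
    lines = []  # list of (top of first word, [words on that line])
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(word[1] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word[1], [word]))
    rendered = []
    for _, line_words in lines:
        text = ""
        for left, _, _, _, token in sorted(line_words):
            column = left // char_width
            if text:
                # pad to the target column, keeping at least one space
                text += " " * max(1, column - len(text))
            else:
                text = " " * column
            text += token
        rendered.append(text)
    return "\n".join(rendered)
```

Usage would be along the lines of creating a HocrWordParser, calling feed() with the contents of the .hocr file, and passing parser.words to render_layout. Choosing char_width close to the average glyph width in pixels gives the closest match to the original indentation.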

For multicolumn documents in which one wants to preserve a single column of continuous text (e.g., read column 1, then column 2) or documents with photos (e.g., newspaper articles), it's probably worth adjusting the page segmentation method. The default page segmentation method in Tesseract performs "Automatic page segmentation" but NOT "Orientation and script detection (OSD)."

Setting psm to 1 tells Tesseract to use "Automatic page segmentation with OSD." This allows Tesseract to recognize a multicolumn document (rather than treating the page as a single block of text) and helps it avoid trying to OCR non-text blocks like photographs.

For more on page segmentation methods, see: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Here is a sample of the command line syntax to adjust the page segmentation method:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

For more on the syntax, see: https://github.com/tesseract-ocr/tesseract/wiki
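If you call Tesseract from a script, the syntax above can be assembled programmatically. The following is a small Python sketch, not an official wrapper: the function name and file names are hypothetical, and it assumes the tesseract executable is on your PATH. Tesseract 3.x spells the flag -psm, while 4.0 and later use --psm, so the sketch lets you pick either.

```python
import subprocess


def build_psm_command(image_path, output_base, psm=1, lang=None, double_dash=True):
    """Build a tesseract command line with an explicit page segmentation mode.

    double_dash=True emits --psm (Tesseract 4.0+);
    double_dash=False emits -psm (Tesseract 3.x).
    """
    command = ["tesseract", str(image_path), str(output_base)]
    if lang:
        command += ["-l", lang]
    command += ["--psm" if double_dash else "-psm", str(psm)]
    return command


# Example call (needs tesseract installed; file names are placeholders):
# subprocess.run(build_psm_command("page.png", "page_out", psm=1), check=True)
```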

Tesseract's code compresses spaces in its output. You will need to change the code to preserve them. See the post "Tesseract - ambiguity in space and tab".

Adding the --psm 6 option works in my case (command line).

Licensed under: CC-BY-SA with attribution