Question

I've been trying plain Tesseract 3 OCR with different options to read a table of letters on which my students marked one letter per row as their answer to multiple-choice questions, as seen below:

[Image: the table of letters fed to Tesseract]

One of the best outputs was:

EEEEEEEEEEEEEEEEEEEEEEEEE
DDDDDDDDDDDDDDDDDDDDDDDDD
CCCCCCCCCCCCCCCCCCCCCCCCC
BBBBBBBEBBBBBBBBBBBBBBBBB
AAAAAAAAAAAAAAAAAAAAAAAAA
6789012345678901234567890
2222333333333344444444445
EEEEE EEEE EE EEE EEEEEEE
DDDDDD DDD DDDDDDDDDDDD
CCCCCCCCCCCCCCCCCC CCCCC
B BEBE BB BBBBBBBBBBBBBBB
AA AAA AAAAA AAAAAAAA
1234567890123455789012345
OOOOOOOOO1111111111222222

I know I can parse that .txt to get a better result, but Tesseract missed a lot of information and still read letters from some of the painted blocks.

I would like to know what I can do to get better results in this case.

I would also like to have a table with the painted blocks appearing as a different character, for example, for the first and second lines of the image:

01 A B C - E   26 A B C D E
02 A - C D E   27 A B C D E

If you guys have some similar experience, any information will be appreciated! Thanks in advance!


Solution

First, I suggest you preprocess your image, for example by making the dark parts darker and blurring it a little. Feel free to experiment until Tesseract stops seeing letters in the filled-in squares.

Second, you have two options:

  • One, you can enable hOCR output and parse the layout of the scanned letters yourself. hOCR is an HTML-based format that contains the coordinates of all recognized words, so you can work out from them where the rows and columns are.

  • Alternatively, try making Tesseract recognise the layout properly instead of reading it rotated by 90°.
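To illustrate the first option, here is a minimal Python sketch of pulling word boxes out of hOCR and grouping them into rows. Everything in it is an assumption rather than part of what I actually ran: the names `WordCollector`, `parse_hocr` and `group_rows` are made up, the `SAMPLE` markup is hand-written (real Tesseract output nests things slightly differently depending on the version), and the 10-pixel row tolerance is a guess you would need to tune for your scans.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}


class WordCollector(HTMLParser):
    """Collects (text, x0, y0, x1, y1) for every hOCR word that carries a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0      # > 0 while we are inside a word element
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._depth:      # nested tag inside a word (e.g. <strong>)
            self._depth += 1
            return
        a = dict(attrs)
        classes = set((a.get("class") or "").split())
        m = BBOX.search(a.get("title") or "")
        if m and classes & WORD_CLASSES:
            self._bbox = tuple(int(v) for v in m.groups())
            self._text = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(),) + self._bbox)

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def parse_hocr(hocr):
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


def group_rows(words, tolerance=10):
    """Cluster words whose vertical centres are close, then sort each row by x."""
    rows = []
    for w in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        cy = (w[2] + w[4]) / 2
        if rows and abs(cy - rows[-1][0]) <= tolerance:
            rows[-1][1].append(w)
        else:
            rows.append((cy, [w]))
    return [sorted(ws, key=lambda w: w[1]) for _, ws in rows]


# Hand-written sample, not real Tesseract output.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 300 40'>
 <span class='ocrx_word' title='bbox 10 12 60 38; x_wconf 90'>01ABC-E</span>
 <span class='ocrx_word' title='bbox 200 11 260 39; x_wconf 91'>26ABCDE</span>
</span>
<span class='ocr_line' title='bbox 10 50 300 80'>
 <span class='ocrx_word' title='bbox 10 52 30 78'>02A</span>
 <span class='ocrx_word' title='bbox 45 51 75 79'>CDE</span>
 <span class='ocrx_word' title='bbox 200 50 260 80'>27ABCDE</span>
</span>
"""

for row in group_rows(parse_hocr(SAMPLE)):
    print([(text, x0) for text, x0, _, _, _ in row])
```

The horizontal gap between neighbouring boxes in a row (between `02A` and `CDE` in the sample) is what hints at a painted block. In Tesseract 3 the hOCR file can usually be produced by appending the bundled `hocr` config name to the command line.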

Anyway, this is what I did:

1. I ran the image through ImageMagick:

$ convert CDZjN.png -deskew 40% -contrast-stretch 7%x10% -filter lanczos -resize 250% ooo.png

2. I created a config file t.conf for Tesseract, disabling vertical text detection and English dictionary:

textord_tabfind_vertical_text 0
load_system_dawg 0
load_freq_dawg 0
load_punc_dawg 0
load_number_dawg 0
load_unambig_dawg 0
load_bigram_dawg 0
load_fixed_length_dawgs 0

3. I simply ran it:

$ tesseract ooo.png ooo t.conf ; cat ooo.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
01ABC-E 26ABCDE
02A CDE 27ABCDE
o3 BCDE 28ABCDE
o4 BCDE 29ABCDE
o5 BCDE 30ABCDE
06ABCD. 31ABCDE
07A-CDE 32ABCDE
08ABC.E 33ABCDE
o9 BCDE 34ABCDE
10A CDE 35ABCDE
11ABCD 36ABCDE
12ABC E 37ABCDE
13ABC E 38ABCDE
14ABCD 39ABCDE
15 BCDE 40ABCDE
1s BCDE 41ABCDE
17 BCDE 42ABCDE
18ABCD_ 43ABCDE
19AB DE 44ABCDE
20AB DE 45ABCDE
21ABCDE 46ABCDE
22ABCDE 47ABCDE
23ABCDE 48ABCDE
24ABCDE 49ABCDE
25ABCDE 50ABCDE

Not perfect, but passable.
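If you want the layout from the question (painted blocks shown as `-`), text like this can be post-processed. The following is only a rough Python sketch under several assumptions: the function names are invented for illustration, every line holds one left-hand and one right-hand question, the right-hand question number was read as two digits, and any letter that is absent from a cell counts as painted. That last assumption means a letter Tesseract simply missed is indistinguishable from a marked answer. Question numbers are taken from the row position rather than from the OCR text, because digits are sometimes misread (`o3`, `1s`).

```python
import re

CHOICES = "ABCDE"
# Two leading characters (the left number, possibly misread), the left cell,
# then the last whitespace + two digits in the line, then the right cell.
ROW = re.compile(r"^(..)(.*)\s(\d\d)(.*)$")


def cell_to_marks(cell):
    """'ABC-E' or 'A CDE' -> 'A B C - E': letters not seen become '-'."""
    seen = {ch for ch in cell.upper() if ch in CHOICES}
    return " ".join(ch if ch in seen else "-" for ch in CHOICES)


def format_answer_sheet(ocr_text, rows_per_column=25):
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    formatted = []
    for i, line in enumerate(lines):
        m = ROW.match(line)
        if m is None:
            raise ValueError("cannot split row %d: %r" % (i + 1, line))
        left = cell_to_marks(m.group(2))
        right = cell_to_marks(m.group(4))
        formatted.append(
            "%02d %s   %02d %s" % (i + 1, left, i + 1 + rows_per_column, right)
        )
    return formatted


sample = """01ABC-E 26ABCDE
02A CDE 27ABCDE
o3 BCDE 28ABCDE"""

for row in format_answer_sheet(sample):
    print(row)
```

Feed it the contents of ooo.txt only; the "Tesseract Open Source OCR Engine" banner goes to the terminal, not into that file.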

Licensed under: CC-BY-SA with attribution