Python Tesseract can't recognize this font

https://stackoverflow.com/questions/1762565

21-09-2019

Question

I have this image:

I want to read it into a string using Python, which I didn't think would be that hard. I came upon Tesseract, and then a wrapper for calling Tesseract from Python scripts.

So I started reading images, and it did great until I tried to read this one. Am I going to have to train it to read that specific font? Any ideas on what that font is? Or is there a better OCR engine I could use with Python to get this job done?

Edit: Perhaps I could make some sort of vector outline around the numbers, then redraw them at a larger size? The larger the images are, the better Tesseract seems to read them (no surprise lol).

Solution

Just train the engine on the 10 digits and a '.'. That should do it. And make sure you convert your image to grayscale before OCRing it.
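The grayscale step, together with the enlarging idea from the question's edit, can be sketched in a few lines. This is only a sketch under assumptions: it uses Pillow, which the thread does not mention, and the function names and the scale factor of 3 are made up for illustration.

```python
def scaled_size(size, factor):
    """Return (width, height) multiplied by factor, rounded to whole pixels."""
    width, height = size
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def preprocess_for_ocr(path, factor=3):
    """Load an image, convert it to grayscale, and enlarge it.

    Assumption: Pillow is installed. It is imported here, not at module
    level, so that scaled_size() can be used without it.
    """
    from PIL import Image

    img = Image.open(path).convert("L")  # "L" is Pillow's 8-bit grayscale mode
    # LANCZOS resampling keeps the enlarged edges reasonably smooth.
    return img.resize(scaled_size(img.size, factor), Image.LANCZOS)
```

The returned image can then be handed to whichever Tesseract wrapper is in use. The best scale factor depends on the source image, so it is worth trying a few values.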

OTHER TIPS

Training is hard and is not really what is needed here. Distinguishing O from 0 and l from 1 is going to be hard, no matter the script. Restricting the OCR to numerical digits greatly simplifies the problem, if the context permits it.
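One way to restrict recognition to digits without retraining is Tesseract's `tessedit_char_whitelist` variable. The sketch below assumes the `pytesseract` wrapper, which may not be the wrapper the question used, and the helper names are hypothetical. Some Tesseract releases reportedly ignore the whitelist in their LSTM engine, so check the result against your installed version.

```python
def digits_only_config(allowed="0123456789.", psm=7):
    """Build a Tesseract config string that whitelists `allowed`.

    Page segmentation mode 7 treats the image as a single line of text.
    """
    return f"--psm {psm} -c tessedit_char_whitelist={allowed}"


def read_number(image, allowed="0123456789."):
    """OCR `image` (a PIL image or a file path) using only the allowed characters.

    Assumption: pytesseract and the tesseract binary are installed.
    """
    import pytesseract

    text = pytesseract.image_to_string(image, config=digits_only_config(allowed))
    return text.strip()
```

With the whitelist in place the engine has no way to answer "O" or "l", which removes the ambiguity described above.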

My interest in Tesseract is in processing lots of numbers from old government reports. In my case, as in the question, the character set will be something like '0123456789.'. Following a comment by eric_taj on 2007-03-21 in the old (SourceForge) newsgroup for Tesseract, you can modify Templates->IndexFor and Templates->ClassIdFor in classify/intproto.cpp to mask off characters that are not to be allowed. I modified that approach a bit to read the allowed character set at runtime from an environment variable, so that I can adjust the permitted set on the fly.
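The runtime-configurable character set can be imitated from Python without patching Tesseract. Note that this is a different technique from the C++ change described above: it filters the recognized text after the fact, it does not mask classes inside the classifier. The variable name `OCR_ALLOWED_CHARS` and both function names are invented for this sketch.

```python
import os

DEFAULT_ALLOWED = "0123456789."


def allowed_chars(env_var="OCR_ALLOWED_CHARS"):
    """Return the permitted character set, read from the environment at call time."""
    # Fall back to the default when the variable is unset or empty.
    return os.environ.get(env_var) or DEFAULT_ALLOWED


def mask_disallowed(text, allowed):
    """Drop every character of `text` that is not in `allowed`."""
    permitted = set(allowed)
    return "".join(ch for ch in text if ch in permitted)
```

Because the variable is read on every call, the permitted set can be changed between runs (or between calls) without touching the code.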

There has been a lot of traffic on this topic in the Tesseract OCR discussion group lately. You will need to use a "language" of just numbers. Many people have trained the engine that way before. It looks like you're trying to outwit a CAPTCHA data protection scheme... tsk, tsk.

Recognizing a small screen font may be hard for a general-purpose OCR engine, which is optimized for reading large, smooth fonts scanned from paper.

You may be better off trying a specialized screenshot OCR tool such as Textract SDK. It collects all local fonts and provides 100% precise recognition by simply matching character to character.
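The character-to-character matching idea can be illustrated with a toy example. This is not Textract SDK's API or its actual algorithm. It is a minimal sketch that assumes screen text renders every glyph as an identical bitmap, and the tiny 3x5 glyphs below are invented where a real tool would render them from installed fonts.

```python
# Hypothetical 3x5 glyph bitmaps ('#' = ink, '.' = background).
GLYPHS = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "7": ("###", "..#", ".#.", ".#.", ".#."),
}
# Reverse lookup from bitmap to character.
BY_BITMAP = {bitmap: char for char, bitmap in GLYPHS.items()}


def split_glyphs(rows):
    """Split a one-line bitmap into glyph bitmaps at fully blank columns."""
    width = len(rows[0])
    glyphs, start = [], None
    for x in range(width + 1):
        # The position past the last column counts as blank to close the final glyph.
        blank = x == width or all(row[x] == "." for row in rows)
        if not blank and start is None:
            start = x
        elif blank and start is not None:
            glyphs.append(tuple(row[start:x] for row in rows))
            start = None
    return glyphs


def match_text(rows):
    """Recognize each glyph by exact bitmap lookup; '?' marks an unknown glyph."""
    return "".join(BY_BITMAP.get(glyph, "?") for glyph in split_glyphs(rows))
```

Exact lookup only works when the rendering is pixel-identical to the template, which is why this approach suits screenshots and not scanned paper.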

That looks like the Eurostile font. Yes, you will have to train for each different font used in your source images.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow