Adding Blackletter Font Support to Tesseract OCR Engine

https://stackoverflow.com/questions/9107020

21-04-2021

Question

I'm working on getting the Lincoln font to work in Tesseract, and I'm getting abysmal results, even after going through the wildly complicated training process.

This is what the font looks like, so yeah, it's a bit tricky:

Lincoln sample

I've carefully made a training image, and then used that to make a box file. The training image is here (25MB!). The image is 300 DPI, and has representative characters nicely spaced out vertically and horizontally.

I made a box file for the training image, and it worked properly. I've verified that it's correct using a box file editor.

I took this box file/tif file, and used it to create training data. I did likewise with the 30 or so other sample images/fonts provided by Tesseract.

I created the unicharset file.

I created a font_properties file. There's no guidance on the site about when fraktur should be used. So I've tried it both this way (fraktur on for Lincoln):

eng.lincoln.box 0 0 0 0 1

And this way (fraktur off):

eng.lincoln.box 0 0 0 0 0

And finally, I've tried this with and without dictionary files. When I used dictionary files, they were the wordmap from my search engine, Sphinx, and they have about 15K common words and about 20K uncommon ones.

In all cases, when I try to OCR the first couple lines of this file (3MB), the quality is abysmal. Rather than getting:

United States Court of Appeals 
for the Federal Circuit

I get:

OniteiJ %tates C0urt of QppeaIs
for the jfeI1eraICircuit

Why?

Solution

I think you'll need a lot more samples (letters) and better training images (clean background, grayscale, 300 DPI, etc.). Try training with only one font (for instance, Lincoln) first. You can use the jTessBoxEditor tool to generate your training images and edit the box files.
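As a complement to checking the box file by eye, a small script can flag malformed boxes and count how many samples each character has. This is only a sketch: it assumes the Tesseract 3.x box format (one glyph per line as `symbol left bottom right top page`), and the function name `check_box_lines`, the image size, and the sample coordinates are made up for illustration.

```python
from collections import Counter


def check_box_lines(lines, image_width, image_height):
    """Return (problems, counts) for an iterable of box-file lines.

    problems: list of (line_number, message) for lines that look wrong.
    counts:   Counter of how many valid boxes exist per symbol.
    """
    problems = []
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split()
        if len(parts) != 6:
            problems.append((number, "expected 6 fields"))
            continue
        symbol = parts[0]
        try:
            left, bottom, right, top, _page = map(int, parts[1:])
        except ValueError:
            problems.append((number, "non-integer coordinate"))
            continue
        inside = 0 <= left < right <= image_width and 0 <= bottom < top <= image_height
        if not inside:
            problems.append((number, "box is empty or outside the image"))
            continue
        counts[symbol] += 1
    return problems, counts


# Hypothetical sample: the third box has left > right, so it is reported.
sample = [
    "U 100 200 160 280 0",
    "n 170 200 210 250 0",
    "i 220 200 210 260 0",
]
problems, counts = check_box_lines(sample, image_width=2550, image_height=3300)
print(problems)
print(counts.most_common())
```

Characters with only one or two boxes in `counts` are likely candidates for adding more samples.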

Once you master the training process, you can add other fonts to your training. You can test the resulting language data by running OCR on the training image itself; the recognition rate there should be high.
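One rough way to put a number on "recognition rate" is to compare the OCR output against the known text. The sketch below uses Python's `difflib` similarity ratio, which is a convenient approximation and not a formal character error rate; the helper name `similarity` is an assumption, and the two strings are the expected and actual output quoted in the question.

```python
from difflib import SequenceMatcher


def similarity(expected, recognized):
    """Return a similarity score in [0, 1] between ground truth and OCR text."""
    # autojunk=False keeps difflib from discarding frequent characters
    # when the compared texts are long.
    return SequenceMatcher(None, expected, recognized, autojunk=False).ratio()


expected = "United States Court of Appeals\nfor the Federal Circuit"
recognized = "OniteiJ %tates C0urt of QppeaIs\nfor the jfeI1eraICircuit"

print(f"similarity: {similarity(expected, recognized):.1%}")
```

Running the same comparison on the training image's own text gives a baseline: if the score is low even there, the problem is in the training data and not in the test document.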

The font names in font_properties should be the bare font name, without the language prefix or the .box extension, like:

lincoln 0 0 0 0 1
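To make the meaning of the five flags explicit, here is a minimal sketch that builds such a line. It assumes the Tesseract 3.x layout `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`; the helper name `font_properties_line` is hypothetical.

```python
def font_properties_line(font_name, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Build one font_properties entry: name followed by five 0/1 flags."""
    if not font_name or any(ch.isspace() for ch in font_name):
        raise ValueError("font name must be non-empty and contain no spaces")
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font_name] + [str(int(flag)) for flag in flags])


print(font_properties_line("lincoln", fraktur=True))  # lincoln 0 0 0 0 1
```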

OTHER TIPS

I am not a Tesseract expert, but I have evaluated nearly every OCR engine available, and my comments are based on years of experience analysing OCR errors.

Just wondering why your image has speckles in the background and not a pure white background. I don't know how Tesseract or the training tool works but the background could be causing some problems.
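If the speckles do turn out to matter, one common cleanup is a small median filter, which removes isolated dark pixels from a light background. The following is a pure-Python sketch on a grayscale image stored as a list of rows; in practice an image library would do this faster, and the function name `median_filter` is an assumption. A median filter can also erode very thin strokes, so the result should be checked visually on a blackletter font.

```python
from statistics import median


def median_filter(pixels):
    """Apply a 3x3 median filter to a grayscale image (rows of ints 0-255)."""
    height, width = len(pixels), len(pixels[0])
    cleaned = [row[:] for row in pixels]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            cleaned[y][x] = int(median(window))
    return cleaned


# A white 5x5 patch with a single black speckle in the middle.
speckled = [[255] * 5 for _ in range(5)]
speckled[2][2] = 0
cleaned = median_filter(speckled)
print(cleaned[2][2])  # 255
```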

Just reading the sample page is difficult and requires a large amount of concentration. Characters such as F and I are very similar, as are U and N. Tesseract, like many OCR engines, presumably uses several different techniques to recognise a character, and there is not a whole lot of difference between many of these characters in terms of the strokes and curves used in the font.

These characters, especially the uppercase ones, would confuse many different matching algorithms simply because they are so different from standard Latin / Roman type characters. This shows in your results: most of the capital letters (U, S, A, and F) were misrecognized.

Licensed under: CC-BY-SA with attribution
