Question

I have to OCR a table from a PDF document. I wrote a simple Python+OpenCV script to extract the individual cells, but then a new problem arose: the text is antialiased and of poor quality, so Tesseract's recognition rate is very low. I've tried preprocessing the images with adaptive thresholding, but the results weren't much better. I tried a trial version of ABBYY FineReader and it does give fine output, but I don't want to use non-free software. I wonder whether some preprocessing would solve the issue, or whether it is necessary to write and train a different OCR system.


Solution

If you look closely at your antialiased text samples, you'll notice that the edges contain a lot of red and blue:

[image: enlarged view of antialiased text]

This suggests that the antialiasing was done by your computer, which used subpixel rendering to optimise the output for your LCD monitor.
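One way to exploit this, before touching the PDF at all: on a standard RGB-stripe panel, each colour channel of the screenshot samples a different horizontal subpixel position, so interleaving the channels roughly triples the horizontal resolution. Here is a minimal sketch with OpenCV and NumPy (the filenames are placeholders, and a panel with BGR stripe order would need the red and blue channels swapped):

import cv2
import numpy as np

# Load the screenshot with its colour channels intact (OpenCV uses BGR order).
img = cv2.imread("cell.png")
b, g, r = cv2.split(img)

# On an RGB-stripe LCD, the red, green and blue subpixels of each pixel sit
# at three adjacent horizontal positions, so interleaving the channels
# recovers roughly 3x the horizontal resolution.
h, w = r.shape
hires = np.empty((h, w * 3), dtype=img.dtype)
hires[:, 0::3] = r
hires[:, 1::3] = g
hires[:, 2::3] = b

cv2.imwrite("cell_3x_horizontal.png", hires)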

If so, it should be quite easy to extract the text at a higher resolution. For example, you can use ImageMagick to extract images from PDF files at 300 dpi by using a command line like the following:

convert -density 300 source.pdf output.png
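If the conversion needs to happen inside your existing Python script, the same command can be wrapped with subprocess, as in this sketch (note that on ImageMagick 7 the binary is magick rather than convert):

import subprocess

# Rasterise the PDF at 300 dpi before OCR; "convert" is ImageMagick's CLI.
subprocess.run(
    ["convert", "-density", "300", "source.pdf", "output.png"],
    check=True,
)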

You could even try loading the PDF in your favourite viewer and copying the text directly to the clipboard.
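If the PDF carries a real text layer rather than scanned page images, you can also pull the text out programmatically and skip OCR entirely. A minimal sketch using pypdf, one common library for this (the filename is a placeholder):

from pypdf import PdfReader

# Only works when the PDF contains embedded text, not scanned images.
reader = PdfReader("source.pdf")
for page in reader.pages:
    print(page.extract_text())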


Addendum:

I tried converting your sample text back into its original pixels and applying the scaling technique mentioned in the comments. Here are the results:

Original image:
[image: original image]

After scaling to 300% and applying a simple threshold:
[image: scaled and thresholded image]

After smart scaling and thresholding:
[image: smart scaled and thresholded image]

As you can see, some of the letters are still a bit malformed, but I think there's a better chance of reading this with Tesseract.
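For reference, here is a minimal OpenCV sketch of the plain scale-then-threshold step. The 300% factor matches the example above, but the filename, the choice of cubic interpolation, and Otsu's threshold are assumptions, not necessarily what produced these images:

import cv2

# Load one table cell as greyscale.
img = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)

# Upscale 300%; cubic interpolation keeps stroke edges smoother than
# nearest-neighbour would.
scaled = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)

# Otsu's method picks the binarisation threshold automatically.
_, binary = cv2.threshold(scaled, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("cell_for_tesseract.png", binary)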

Licensed under: CC-BY-SA with attribution