OCR: How to improve accuracy - existing libraries for removing non-text 'furniture', shapes, etc to avoid confusing OCR?

StackOverflow https://stackoverflow.com/questions/2448106

Question

I want to remove rectangles etc that enclose text in a screenshot image, so that I can perform optical character recognition to get accurate text from the screenshot.

Background:

I am doing this to extract data from a legacy application for use with other applications. This is the only way to get at the data, as the associated files are in a closed, proprietary binary format.

I will be using AutoItScript to drive the application to show data in its UI, then I will screenshot this and feed this to tesseract.

I've already had some success in automating the UI, and have been able to use tesseract to get plain ascii text out of the bitmap.

There are several AutoItScript forum articles discussing its use with tesseract/OCR, but none that specifically address my question. http://www.autoitscript.com/forum/index.php?s=6c32c3ece12756e635a619cdf175eff9&showforum=2

What I need to do

There are thin, 1-pixel-wide rectangles that closely enclose some of the text. When the image is fed to tesseract, it reads parts of these rectangles as characters; for example, a vertical side of a rectangle is recognised as the letter I.

Any thoughts on how to remove the rectangles, or best practices?

I'm asking if there is a generic command line based toolset to overwrite rectangles, for example, in .png files. I could then pass the .png through this, then pass it to tesseract.

Details on the tesseract release/setup I've used are as follows:

Go here: http://code.google.com/p/tesseract-ocr/downloads/list. For the basic English character set, which is enough to get Tesseract up and running and recognising bitmapped text as ASCII text, use tesseract-2.00.eng.tar.gz (the current version at the time of writing is listed as "English language data for Tesseract (2.00 and up) Jul 2007 989 KB 84845").

Related questions I have already looked at on Stack Overflow

In these, my question is not completely answered or a commercial solution is being sold. I do not want to consider a commercial solution at this stage.


Solution

There's probably not going to be a free, off-the-shelf solution for this, but coding your own shouldn't be too hard: it's probably safe to assume that a rectangle will never be a valid character in your font's alphabet, so it can be removed safely. It also helps that all your rectangle borders are exactly one pixel wide.

So search for a contiguous horizontal line that is joined to another, parallel line of the same length by exactly two vertical lines. Repeat the search until you have found all the rectangles in the image, then render them all transparent with Graphics.DrawRectangle and Pens.Transparent. Don't render any rectangle transparent until you've finished searching, otherwise you risk wiping out parts of overlapping rectangles before you've found them. This is just a starter suggestion; I haven't implemented or debugged this algorithm.
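
The search described above can be sketched in a few lines. This is a minimal, untested-on-real-screenshots sketch in Python rather than .NET, and it rests on several assumptions that are not in the original answer: the image has already been thresholded into a grid of booleans (True meaning a dark "ink" pixel), rectangle sides are at least `min_side` pixels long so that ordinary glyph strokes are ignored, and each top edge shows up as one unbroken run of ink. The names `horizontal_runs`, `find_rectangles`, `erase_rectangles` and `min_side` are made up for illustration. Note also that it clears the border pixels to background instead of drawing with a transparent pen; as far as I know, drawing with a fully transparent colour under GDI+'s default alpha blending leaves the existing pixels unchanged, so painting over in the background colour is the safer reading of "render them transparent".

```python
def horizontal_runs(row, min_len):
    """Yield (start, end) inclusive column indices of each maximal run of
    ink pixels in one row that is at least min_len pixels long."""
    start = None
    # A trailing False closes any run that reaches the right-hand edge.
    for x, ink in enumerate(list(row) + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            if x - start >= min_len:
                yield start, x - 1
            start = None


def find_rectangles(grid, min_side=8):
    """Return (x0, y0, x1, y1) for every outline whose top edge is a
    horizontal run joined to a parallel bottom edge by two vertical lines.

    min_side is an assumed threshold, not something from the answer."""
    height = len(grid)
    found = []
    for y0 in range(height):
        for x0, x1 in horizontal_runs(grid[y0], min_side):
            y1 = y0 + 1
            # Follow both vertical sides downwards while they stay unbroken.
            while y1 < height and grid[y1][x0] and grid[y1][x1]:
                tall_enough = y1 - y0 + 1 >= min_side
                if tall_enough and all(grid[y1][x0:x1 + 1]):
                    found.append((x0, y0, x1, y1))  # bottom edge reached
                    break
                y1 += 1
    return found


def erase_rectangles(grid, rects):
    """Return a copy of grid with the one-pixel border of each rectangle
    cleared. Run this only after the search has finished, so overlapping
    rectangles are not damaged before they are found."""
    cleaned = [list(row) for row in grid]
    for x0, y0, x1, y1 in rects:
        for x in range(x0, x1 + 1):
            cleaned[y0][x] = False
            cleaned[y1][x] = False
        for y in range(y0, y1 + 1):
            cleaned[y][x0] = False
            cleaned[y][x1] = False
    return cleaned
```

Calling `erase_rectangles(grid, find_rectangles(grid))` keeps the two phases separate, as suggested above. Loading the .png into such a grid and writing the cleaned result back out before passing it to tesseract is left out here; any imaging library that gives per-pixel access should do. A solid filled block of ink would also match this test, so an extra check that the interior is mostly empty may be needed for real screenshots.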

Licensed under: CC-BY-SA with attribution