Question

I don't need to know what the text says, and I won't be dealing with distortion like a CAPTCHA. I just want to know whether any of a batch of images contain text.

This is something that will be running on a couple of idle Linux servers, and a cron job will process a large batch of images multiple times a day.

One of the things I want to do in the process is discard any images with text in them. I don't mind some false positives, but I would like to get as close as possible to a zero-percent failure rate at identifying images with text that should be discarded.


Solution

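Tesseract OCR is what Google uses for Google Books. Give it a try. Since you don't care what the text says, you can ignore the recognized words and only check whether it returns any characters at all.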
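To illustrate, here is a minimal sketch of how a batch check might look. It assumes the `tesseract` command-line tool is installed and on the PATH, and it calls it as `tesseract <image> stdout` so the recognized text is printed rather than written to a file. The function names `looks_like_text` and `image_has_text` and the `min_chars` threshold are my own, hypothetical choices, not part of Tesseract, and the threshold would need tuning on your own images.

```python
import subprocess


def looks_like_text(ocr_output: str, min_chars: int = 3) -> bool:
    """Return True if the OCR output has at least `min_chars` letters or digits.

    `min_chars` is a hypothetical threshold; a low value favors false
    positives over missed text.
    """
    alnum_count = sum(ch.isalnum() for ch in ocr_output)
    return alnum_count >= min_chars


def image_has_text(path: str, min_chars: int = 3) -> bool:
    """Run the `tesseract` CLI on one image and test what it recognized."""
    result = subprocess.run(
        ["tesseract", path, "stdout"],  # "stdout" sends the text to standard output
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # Assumption: an image Tesseract cannot process is discarded too,
        # because false positives are acceptable and missed text is not.
        return True
    return looks_like_text(result.stdout, min_chars)
```

From the cron job you could call `image_has_text(path)` for each file and discard the ones that return True. OCR can still miss stylized or very small text, so it is worth checking the results against a hand-labeled sample before relying on it.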

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow