Question

I recently put together an interface for scanning and uploading searchable documents to KnowledgeTree, our document management system. We have access to plenty of separate tools for different parts of this process, but I wanted to combine everything into one interface to keep things simple for the users.

Here's the platform:

#    OS: Ubuntu Desktop 10.04
#    GUI Toolkit: wxPython
#    OCR package: Tesseract 3.00 (compiled executable)

And here is the basic process:

#    1. Retrieve individual page images from scanner
#    2. Call Tesseract OCR executable to produce HOCR data for each page
#    3. Run extracted words against English dictionary to guess if page orientation is correct
#        3a. If word matches are below threshold, rotate page 90 degrees and try again
#    4. Detect document type and retrieve metadata from HOCR data
#    5. Merge scanned pages and HOCR data into a finished PDF
#    6. Upload PDF and attached metadata to document management system through KnowledgeTree's API
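
For reference, the orientation check in steps 3 and 3a amounts to roughly the following. This is a minimal sketch rather than the actual application code: `word_match_ratio`, `find_orientation`, the `ocr_page` callback, and the 0.5 threshold are illustrative names and values, and the real OCR call is left out so the logic can be read on its own.

```python
import re


def word_match_ratio(text, dictionary):
    """Return the fraction of alphabetic tokens in `text` found in `dictionary`.

    `dictionary` is assumed to be a set of lowercase English words.
    """
    words = re.findall(r"[A-Za-z]{2,}", text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.lower() in dictionary)
    return hits / len(words)


def find_orientation(ocr_page, dictionary, threshold=0.5):
    """Try each rotation until the word-match ratio reaches `threshold`.

    `ocr_page(angle)` is a hypothetical callback that rotates the page by
    `angle` degrees, runs OCR, and returns the recognised text.
    Returns (angle, ratio); if no rotation reaches the threshold, the best
    scoring rotation is returned instead.
    """
    best_angle, best_ratio = 0, -1.0
    for angle in (0, 90, 180, 270):
        ratio = word_match_ratio(ocr_page(angle), dictionary)
        if ratio >= threshold:
            return angle, ratio
        if ratio > best_ratio:
            best_angle, best_ratio = angle, ratio
    return best_angle, best_ratio
```

Because the loop stops at the first rotation that passes, a correctly oriented page costs one OCR pass, while a page that never passes costs four.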

It works beautifully, except that step 2 is extremely slow on certain types of documents. It rolls right through basic fixed-width text reports, but throw a few logos, lines, and other unreadable content in there, and it can sometimes spend minutes on a single page. Not to mention the fact that it could repeat that up to 4 times if it tries to reorient it. In comparison, the software packaged with the scanner uses ABBYY OCR, and can crunch 50+ pages in less than a minute, taking care of page layout and text orientation almost perfectly (I realize that's why ABBYY costs money). Unfortunately, using this scanning software is more complex for the users, and only covers steps 1-3 on its own.

My question is whether I should be approaching this differently, maybe by separating the OCR/upload from the scanning interface completely, or if there are any OCR packages or other solutions I'm overlooking that could be integrated into a Python application. Would the fact that I'm calling an external application to do the work cause performance issues?
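
One way to check whether the external call itself is the bottleneck is to time each invocation and run several pages at once. The sketch below assumes a modern Python 3 and takes each command as a ready-made argument list (whatever the application already uses to invoke Tesseract); `run_ocr`, `ocr_pages`, the worker count, and the timeout are hypothetical names and values.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def run_ocr(cmd, timeout=60):
    """Run one external OCR command and time it.

    Returns (returncode, elapsed_seconds). The return code is None if the
    command was killed for exceeding `timeout` seconds.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None
    return code, time.monotonic() - start


def ocr_pages(commands, workers=4, timeout=60):
    """Run one command per page on a small thread pool, keeping page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: run_ocr(cmd, timeout), commands))
```

Threads are enough here because each worker just waits on a child process. The timeout also puts an upper bound on pages that would otherwise take minutes.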

Whatever I do here, it's important that I have control over step 4, since requiring the users to manually set the type and metadata for each uploaded document could be a problem.


Solution

The problem you are having is that Tesseract is an OCR engine, not page layout analysis software. The Tesseract website says that version 3.0 will probably include page layout analysis.

As far as I know, previous versions only respond well when the page contains a single column of text.

I think you need to put in a step 1.5 that would do some layout analysis and try to find blocks of images, logos, and illegible text, so that those regions can be kept away from the OCR engine.
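
A very rough idea of what such a step could look like is below. It is only a sketch under simple assumptions: the page has already been binarised into a list of rows (1 = ink, 0 = background), and anything much larger than a character, or very long and thin, is treated as non-text. The function names and both thresholds are made up for illustration, and real layout analysis is considerably more involved.

```python
from collections import deque


def connected_components(bitmap):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and bitmap[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def non_text_boxes(bitmap, max_char_size=40, max_aspect=10):
    """Flag regions that are too large or too elongated to be a character."""
    flagged = []
    for top, left, bottom, right in connected_components(bitmap):
        box_height = bottom - top + 1
        box_width = right - left + 1
        too_big = box_height > max_char_size or box_width > max_char_size
        too_thin = max(box_height, box_width) / min(box_height, box_width) > max_aspect
        if too_big or too_thin:
            flagged.append((top, left, bottom, right))
    return flagged
```

The flagged boxes could then be painted white before the page goes to Tesseract, so it does not waste time on logos and ruled lines.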

You might want to look at OCRFeeder to see how it approaches this.

Licensed under: CC-BY-SA with attribution