OCR correction with prior transcription?

https://stackoverflow.com//questions/21028274

21-12-2019

Question

I have a range of documents that have been imaged and are available as TIFF, JPEG and PDF files.

Many have been transcribed and the transcriptions checked for accuracy.

I want to create PDFs and wonder if there is a way to OCR the images and correct the result against the verified transcriptions, or to 'insert' the verified transcription during the OCR process?

I have access to Omnipage, Abbyy Finereader and Tesseract but I don't know if what I want to do is at all possible.

Solution

Jack. Thanks for the clarification.

In short, the transcribed data offers little to no benefit to any OCR process you can run easily. The exception is a custom-developed application that does fuzzy per-word lookups, matching OCRed text against specific places in your transcribed data.

In that custom application, you would use regular OCR (any of the ones you named), preferably one that provides coordinates for the recognized text (for example, OCR-IT API with export to XML) or an SDK that gives you object-based access to the text. As part of post-processing, your application could then refer back to the transcribed data. That assumes you have a way to identify where in the transcribed data you are at any moment, or can at least run a full-text search and pick the correct instance when several are found. Your transcribed data probably does not have coordinates linking the text back to the original images it came from.

If similar text is found and there is a character difference, your application could replace (i.e. correct) the OCRed text with the transcribed text. This most likely will not work for handwritten text, because regular OCR will produce noise from it, which is not sufficient even for fuzzy lookup. Once all replacements have been made, your application will need PDF export capability, for which again a library could be used.
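The fuzzy per-word lookup described above could be sketched roughly as follows. This is only an illustration using Python's standard `difflib` module, not a tested pipeline: the function name `correct_ocr_words`, the `min_ratio` threshold, and the assumption that both the OCR output and the transcription are already split into word lists for the same page are all mine. It aligns the two word sequences and swaps in the verified word only where an OCR word lines up with a similar-looking transcribed word.

```python
import difflib


def correct_ocr_words(ocr_words, verified_words, min_ratio=0.6):
    """Return ocr_words with near-miss words replaced by their verified counterparts.

    Hypothetical helper: min_ratio is an arbitrary similarity cut-off that
    would need tuning on real data.
    """
    # Align the two word sequences so we know which OCR word
    # corresponds to which transcribed word.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=verified_words, autojunk=False)
    corrected = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: compare them pairwise.
            for ocr_w, ver_w in zip(ocr_words[i1:i2], verified_words[j1:j2]):
                ratio = difflib.SequenceMatcher(a=ocr_w.lower(), b=ver_w.lower()).ratio()
                # Only trust the transcription when the words look alike;
                # otherwise the alignment is probably wrong (or the OCR is noise).
                corrected.append(ver_w if ratio >= min_ratio else ocr_w)
        elif tag == "insert":
            # Words present only in the transcription have no OCR word
            # (and no coordinates) to attach to, so they are skipped.
            continue
        else:
            # "equal", "delete", or an uneven "replace": keep the OCR words.
            corrected.extend(ocr_words[i1:i2])
    return corrected


ocr = "Tbe quick hrown fox".split()
verified = "The quick brown fox".split()
print(correct_ocr_words(ocr, verified))
```

Because the output keeps one entry per OCR word, the corrected words can be written back to whatever coordinates the OCR engine reported for the originals. Handwritten pages would likely fall below any sensible threshold and pass through unchanged.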

The whole process is complex, and hit-or-miss in some cases, especially around handwritten text. If you have a huge number of these images and transcriptions, it may be worthwhile to spend days (if not weeks) developing such a specialized application to crunch all that data. A cost analysis needs to be performed.

Aside from handwriting, modern top-quality OCR (ABBYY, Nuance, OCR-IT) should produce high-quality text if your images are of high quality. With PDF Text Under Image, any OCR errors will be invisible to readers, since they see the page image rather than the hidden text layer. I would say an expectation of 95-99% accuracy out of the box is realistic. This out-of-the-box option may give you high enough accuracy with little time or expense.

There is one benefit that your transcribed data can provide, especially if that data contains specialized or industry-specific words or proper names that may not be found in a common English dictionary (already included with ABBYY and other OCR software). By making a custom dictionary out of your transcribed data, you can have ABBYY OCR use that dictionary to further improve recognition of those special words with out-of-the-box processing.
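Extracting such a word list from the transcriptions might look something like the sketch below. It is a minimal example under my own assumptions: `build_custom_wordlist` is a hypothetical helper, `common_words` stands in for whatever general English word list you have on hand, and the output is a plain list of words. How a given OCR product imports a custom dictionary depends on that product, so check its documentation for the expected format.

```python
import re
from collections import Counter


def build_custom_wordlist(transcription_text, common_words, min_count=1):
    """Collect words from the transcription that a general dictionary lacks.

    Hypothetical helper: common_words is any iterable of ordinary
    dictionary words; min_count drops words seen fewer times than that.
    """
    # Runs of letters, optionally joined by apostrophes or hyphens.
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", transcription_text)
    counts = Counter(tokens)
    common = {w.lower() for w in common_words}
    return sorted(
        word
        for word, n in counts.items()
        if n >= min_count and word.lower() not in common
    )


text = "The Evdokimov ledger lists Evdokimov and the Abbyy-style folio."
common = {"the", "ledger", "lists", "and", "folio"}
for word in build_custom_wordlist(text, common):
    print(word)
```

Raising `min_count` is one way to keep one-off typos in the transcriptions out of the dictionary, at the cost of dropping rare but genuine names.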

Ilya Evdokimov

Licensed under: CC-BY-SA with attribution
