Is OCR no longer an issue?
22-09-2019
Question
According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation.
My question is: is this true? Is the current state-of-the-art so good that - for a good scan of English text - there aren't any major improvements left to be made?
Or, a less subjective form of this question is: how accurate are modern OCR systems at recognising English text for good quality scans?
Solution
Considered narrowly as the task of breaking a sufficiently high-quality 2D bitmap into rectangles, each containing an identified Latin character from a set of well-behaved, prespecified fonts (cf. Omnifont), it is a solved problem.
Start to vary those parameters (e.g., eccentric or unknown fonts, noisy scans, Asian scripts) and it starts to become somewhat flaky or to require additional input. Many well-known Omnifont systems do not handle ligatures well.
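The narrow framing above (segment the bitmap into character rectangles, then match each rectangle against prespecified glyphs) can be sketched in a few lines. The tiny 3x3 templates and the `segment`/`recognize` helpers below are invented purely for illustration; real omnifont engines use far richer features than raw pixel agreement:

```python
# Toy omnifont-style OCR: split a clean text-line bitmap into character
# rectangles at blank columns, then score each rectangle against
# prespecified font templates. All glyphs here are made up for the demo.

# Each template is a 3x3 binary grid (1 = ink), keyed by character.
TEMPLATES = {
    "I": ((1, 1, 1),
          (0, 1, 0),
          (1, 1, 1)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}

def segment(bitmap):
    """Split a bitmap (list of rows) into rectangles at blank columns."""
    width = len(bitmap[0])
    blank = [all(row[x] == 0 for row in bitmap) for x in range(width)]
    spans, start = [], None
    for x in range(width + 1):
        ink = x < width and not blank[x]
        if ink and start is None:
            start = x                      # entering a character
        elif not ink and start is not None:
            spans.append((start, x))       # leaving a character
            start = None
    return [tuple(tuple(row[a:b]) for row in bitmap) for a, b in spans]

def recognize(bitmap):
    """Label each segmented rectangle with its best-matching template."""
    out = []
    for rect in segment(bitmap):
        # Score = number of agreeing cells; pick the highest-scoring glyph.
        best = max(TEMPLATES, key=lambda ch: sum(
            t == r
            for trow, rrow in zip(TEMPLATES[ch], rect)
            for t, r in zip(trow, rrow)))
        out.append(best)
    return "".join(out)

# A 3-row bitmap spelling "LIT", with one blank column between glyphs.
page = [
    [1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0],
]
print(recognize(page))  # → LIT
```

Everything the answer lists as a complication (unknown fonts, noise, connected scripts) breaks exactly these two steps: noise defeats the blank-column segmentation, and unfamiliar glyph shapes defeat the template match.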
And the other main problem with OCR is making sense of the output. If this were a solved problem, Google Books would give flawless results.
OTHER TIPS
I think that it is indeed a solved problem. Just have a look at the plethora of OCR technology articles for C#, C++, Java, etc.
Of course, the article does stress that the script needs to be typewritten and clear. That makes recognition a relatively trivial task, whereas if you need to OCR scanned pages (noise) or handwriting (diffusion), it can get trickier, as there are more things to tune correctly.