Is OCR no longer an issue?
22-09-2019
Question
According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation.
My question is: is this true? Is the current state-of-the-art so good that - for a good scan of English text - there aren't any major improvements left to be made?
Or, a less subjective form of this question is: how accurate are modern OCR systems at recognising English text for good quality scans?
Solution
Considered narrowly as the task of breaking a sufficiently high-quality 2D bitmap into rectangles, each containing an identified Latin character from a set of well-behaved, prespecified fonts (cf. Omnifont), it is a solved problem.
Start to vary those parameters (e.g., eccentric or unknown fonts, noisy scans, Asian scripts) and it starts to become somewhat flaky or to require additional input. Many well-known Omnifont systems do not handle ligatures well.
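The narrow framing above (segment the bitmap into character rectangles, then match each rectangle against prespecified glyphs) can be sketched in a few lines. The tiny 3x3 templates and the `segment`/`recognize` helpers below are invented purely for illustration; real omnifont engines use far richer features than raw pixel agreement:

```python
# Toy omnifont-style OCR: split a clean text-line bitmap into character
# rectangles at blank columns, then score each rectangle against
# prespecified font templates. All glyphs here are made up for the demo.

# Each template is a 3x3 binary grid (1 = ink), keyed by character.
TEMPLATES = {
    "I": ((1, 1, 1),
          (0, 1, 0),
          (1, 1, 1)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}

def segment(bitmap):
    """Split a bitmap (list of rows) into rectangles at blank columns."""
    width = len(bitmap[0])
    blank = [all(row[x] == 0 for row in bitmap) for x in range(width)]
    spans, start = [], None
    for x in range(width + 1):
        ink = x < width and not blank[x]
        if ink and start is None:
            start = x                      # entering a character
        elif not ink and start is not None:
            spans.append((start, x))       # leaving a character
            start = None
    return [tuple(tuple(row[a:b]) for row in bitmap) for a, b in spans]

def recognize(bitmap):
    """Label each segmented rectangle with its best-matching template."""
    out = []
    for rect in segment(bitmap):
        # Score = number of agreeing cells; pick the highest-scoring glyph.
        best = max(TEMPLATES, key=lambda ch: sum(
            t == r
            for trow, rrow in zip(TEMPLATES[ch], rect)
            for t, r in zip(trow, rrow)))
        out.append(best)
    return "".join(out)

# A 3-row bitmap spelling "LIT", with one blank column between glyphs.
page = [
    [1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0],
]
print(recognize(page))  # → LIT
```

Everything the answer lists as a complication (unknown fonts, noise, connected scripts) breaks exactly these two steps: noise defeats the blank-column segmentation, and unfamiliar glyph shapes defeat the template match.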
And the other main problem with OCR is making sense of the output. If this were a solved problem, Google Books would give flawless results.
OTHER TIPS
I think that it is indeed a solved problem. Just have a look at the plethora of OCR technology articles for C#, C++, Java, etc.
Of course, the article does stress that the script needs to be typewritten and clear. That makes recognition a relatively trivial task, whereas if you need to OCR scanned pages (noise) or handwriting (diffusion), it can get trickier, as there are more things to tune correctly.