Question

How would one go about creating OCR software for Indic languages?

How can one analyze the characters? How can one map them to font data?

I believe I would need some form of tracing of the line patterns, then matching those patterns against font characters.


OTHER TIPS

OCR (Optical Character Recognition) is not an ordinary programming task. It is not just a matter of programming skill; it demands a good understanding of a range of scientific topics.
Here is a general picture of the steps required to accomplish such a task, along with the skill each one demands, so you can follow them if you still want to proceed (a short sketch for each step follows the list):

  1. Preprocessing: An OCR program almost always pre-processes the image to improve its quality as input to the recognition stage. (Skill: image processing)
  2. Character recognition: After applying the required changes to the input image (removing some parts, scaling, applying filters, ...), the program should recognize the characters using one of the many available tools (neural networks, SVM, KNN, ...). (Skill: machine learning, or at least a working knowledge of one of the aforementioned tools)
  3. Postprocessing: The accuracy of the previous step's output can usually be improved further, especially if you inject domain knowledge into the problem, such as forcing the output to match an existing lexicon. (Skill: ML again; KNN, CBR, ...)
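
For step 1, a minimal preprocessing sketch in Python, assuming OpenCV (cv2) is available; the blur kernel size is a placeholder you would tune for your scans:

```python
# Sketch: clean up a scanned page before recognition.
import cv2

def preprocess(path):
    # Load the scan directly as grayscale.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Median blur removes speckle noise while preserving stroke edges.
    img = cv2.medianBlur(img, 3)
    # Otsu's method picks a global threshold automatically, separating
    # ink (foreground, white here) from paper (background, black).
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return binary
```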
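
For step 2, a minimal recognition sketch using one of the tools mentioned above (KNN, via scikit-learn); X_train and y_train are assumed to be your own flattened, size-normalized glyph images and their labels:

```python
# Sketch: classify segmented glyph images with k-nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

def train_classifier(X_train, y_train):
    # X_train: (n_samples, n_pixels) flattened, size-normalized glyphs
    # y_train: the character label for each glyph
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    return clf

# Usage (glyph is one flattened character image):
#   label = clf.predict(glyph.reshape(1, -1))[0]
```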
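
For step 3, a minimal lexicon-forcing sketch in plain Python: each recognized word is snapped to the closest entry in a word list by Levenshtein (edit) distance. The lexicon itself is assumed to come from your own corpus:

```python
# Sketch: snap a recognized word to the closest lexicon entry.
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    # Return the lexicon entry with the smallest edit distance.
    return min(lexicon, key=lambda w: edit_distance(word, w))
```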

I hope this general explanation guides you well; I have tried to keep it as simple as possible.

Gujarati script could be tricky for many existing OCR libraries. A few questions:

  • Do you want to read machine-printed text, or handwriting? These are two separate problem domains.
  • Do you intend to develop/apply the OCR algorithm to a specific set of images/texts? If so, could you post some sample images?
  • What is your end goal? Do you want to scan handwritten texts for machine processing, or read text for industrial applications, or scan forms?
  • What read rate (accuracy) would be acceptable?

I'd suggest that textbooks are still a better starting point than reading a smattering of posts, articles, and papers online. There are two books I recommend for anyone interested in OCR:

Reading in the Brain by Stanislas Dehaene

Character Recognition Systems by Cheriet et al.

The Dehaene book is quite readable, and when reading it you will develop certain notions about how OCR might be developed for your particular application. I think it's typically best, no matter your level of experience, to try to solve a problem with whatever skills you have before you spend too much time reading the work of others. Spend a few days or a few weeks writing a bit of code or at least writing down ideas.

The Cheriet book gives a relatively current overview of work in the field. Even if the math isn't familiar to you, you'll get some idea of what research has been done.

Try first to get a broad overview of what has been done in the field, and what techniques have been tested for scripts similar to Gujarati. Stroke extraction techniques tested for Japanese, Chinese, and related scripts would likely be relevant to Gujarati. To my knowledge the number of existing OCR solutions for scripts such as Gujarati is relatively limited. However, some neural network-based methods could be used to train software on Gujarati characters (or any arbitrary symbols) and then recognize them, assuming the characters to be read are machine printed.
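
To make the neural-network route concrete, here is a hedged sketch using scikit-learn; the bundled digits dataset is only a stand-in for a labelled set of Gujarati glyph images, which you would have to collect yourself:

```python
# Sketch: train a small neural network on labelled glyph images.
# The digits dataset is a placeholder; swap in your own data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("held-out accuracy:", net.score(X_test, y_test))
```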

See if you can find a set of sample images for Gujarati. For a number of languages there are standard image sets or at least common image sets used to test the accuracy of OCR algorithms. If possible, get the original, raw, color or 8-bit grayscale images rather than images that have already been binarized to black and white (0 and 1).

As a start, I'd recommend finding at least one software package that partly solves your problem. Some OCR algorithms recognize outlines, others use neural networks to recognize grayscale patterns, and so on. Once you find a software package whose algorithm is somewhat successful on your image samples, you can identify what type of algorithm it uses and proceed from there.

Tesseract is mentioned frequently. Free's a good price, so you might want to give it a try. https://code.google.com/p/tesseract-ocr/
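
As a quick way to try it, the pytesseract wrapper can run Tesseract from Python. This assumes Tesseract is installed together with its Gujarati language data ('guj'), and sample_page.png is a placeholder for one of your own scans:

```python
# Sketch: run Tesseract on a scanned page via pytesseract.
import pytesseract
from PIL import Image

# 'guj' requires the Gujarati traineddata file to be installed;
# 'sample_page.png' is a placeholder for one of your own scans.
text = pytesseract.image_to_string(Image.open("sample_page.png"), lang="guj")
print(text)
```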

It's been a few years since I looked at the following, but one of these may have a user-trainable font that you could try on machine-printed Gujarati:

  • FineReader by ABBYY.
  • OmniPage by Nuance.

Companies in industrial image processing (a.k.a. "machine vision") offer software packages that implement a variety of OCR algorithms. Although these software packages are typically designed to read a few lines of text on silicon wafers, product packaging, or the like, they may be useful to you because (a) the simple user interfaces may help you test ideas quickly, (b) the packages include many additional image processing tools, (c) there are few limitations on the characters, symbols, or image features you can train, and (d) you may be able to download trial versions that have fully functional OCR tools.

  • Cognex
  • Microscan
  • MvTec (product: HALCON)
  • National Instruments LabVIEW

For machine-printed text, image capture is also important. A good optical system can help improve read accuracy: that could mean selecting a good camera + lens + light, or perhaps choosing a high-quality flatbed scanner.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow