Question

Well, I'm using a compiled .NET version of this OCR engine, which can be found at http://www.pixel-technology.com/freeware/tessnet2/

I have it working; however, the aim is to read license plates, and sadly the engine doesn't translate some letters accurately. For example, here's an image I scanned to determine which characters are problematic:

[image]

Result:

12345B7B9U ABCDEFGHIJKLMNUPIJRSTUVHXYZ

Therefore the following characters are being translated incorrectly:

1, O, Q, W

This doesn't seem too bad; however, on my license plates the results aren't so great:

[plate image] = H4 ODM

[plate image] = LDH IFW

Fake Test

[plate image] = NR4 y2k

As you might be able to tell, I've tried noise reduction, increasing the contrast, and removing pixels that aren't absolute black, with no real improvement.
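For what it's worth, "removing pixels that aren't absolute black" is just fixed-threshold binarization with a very low cutoff. A minimal pure-Python sketch (a real pipeline would use an adaptive threshold such as Otsu's method via an imaging library, not a hard-coded value):

```python
def binarize(gray, threshold=128):
    """Binarize a grayscale image (2D list of 0-255 values): pixels
    darker than the threshold become ink (0), the rest become
    background (255). A very low threshold keeps only near-black
    pixels; a higher one preserves faded strokes as well."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]

faded = [[30, 90, 200], [120, 140, 250]]
hard = binarize(faded, threshold=40)   # only near-black survives
soft = binarize(faded, threshold=150)  # keeps the faded strokes too
```

Raising the threshold (or choosing it adaptively per image) often recovers glyph strokes that a "keep only absolute black" rule throws away.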

Apparently you can teach the engine new fonts, but I think I would need to recompile the library for .NET; it also seems this is done on a Linux OS, which I don't have.

http://www.scribd.com/doc/16747664/Tesseract-Trainingfor-Khmer-LanguageFor-Posting

So I'm stuck as to what to try next. I've written a quick console application purely for testing purposes, if anyone wants to try it. If anyone has any ideas about graphics manipulation or other libraries, I'd appreciate hearing them.


Solution

I used Tesseract via Tessnet2 recently (Tessnet2 is a VS2008 C++ wrapper around Tesseract 2.0 made by Rémy Thomas, if I remember correctly). Let me try to help you with the little knowledge I have concerning this tool:

  • 1st, as I said above, this wrapper is only for Tesseract 2.0, and the newest Tesseract version on Google Code is 3.00 (the code is no longer hosted on SourceForge). There are regular contributors: I saw that a version 3.01 or so is planned. So you don't benefit from the latest enhancements, including page layout analysis, which may help when your license plates are not 100% horizontal.

  • I asked Rémy for a Tessnet2 .NET wrapper around version 3; he doesn't plan one for now. So, as I did, you'll have to do it yourself!

  • So if you want the latest version of the sources, you can download them from the Subversion repository (everything's described on the dedicated site page), and you'll be able to compile them if you have Visual Studio 2008, since the sources contain a VS2008 solution in the vs2008 sub-folder. This solution is made of VS2008 C++ projects, so to get results in C# you'll have to use .NET P/Invoke with the tessDll built by the project. Again, if you need this, I have code examples that may interest you, but you may prefer to stay with C++ and build your own new WinForms projects, for instance!

  • Once you have managed to compile (there shouldn't be any major problems, but tell me if you hit some; I may have met them too :-) ), the output will include several binaries that allow you to do specific training! Again, there is a page specially dedicated to Tesseract 3 training. Thanks to this training, you can:

    • restrict your set of characters, which will automatically remove the punctuation ('/-\' instead of 'A', for instance)

    • indicate the ambiguities you have detected ('D' instead of 'O' as you could see, 'B' instead of '8', etc.), which will be taken into account when you use your training.

  • I also saw that Tesseract's results are better if you restrict the image to the zone where the letters are located (i.e., no face, no landscape around it): in my case, I needed to recognize only a specific zone of card photos taken from a webcam, so I used image processing to restrict the zone. That took a while, of course, but my images came from many different sources, so I had no choice. If you can get images that are restricted to the minimum, that will be great!
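Beyond training, the known ambiguities can also be exploited as a cheap post-processing step when you know which positions in your plate format must be digits and which must be letters. A sketch, where the pattern syntax ('L' = letter, 'D' = digit) and the confusion maps are made up for illustration and would need tuning to your engine's actual mistakes:

```python
# Known confusions, one map per direction. These particular pairs are
# illustrative; build yours from the test-chart results.
TO_DIGIT = {"O": "0", "D": "0", "B": "8", "I": "1", "S": "5"}
TO_LETTER = {"0": "O", "8": "B", "1": "I", "5": "S"}

def fix_plate(text, pattern):
    """Force each OCR'd character to the class its position demands:
    'D' positions must be digits, 'L' positions must be letters,
    anything else (spaces, dashes) is passed through unchanged."""
    out = []
    for ch, kind in zip(text, pattern):
        if kind == "D" and not ch.isdigit():
            out.append(TO_DIGIT.get(ch, ch))
        elif kind == "L" and ch.isdigit():
            out.append(TO_LETTER.get(ch, ch))
        else:
            out.append(ch)
    return "".join(out)

corrected = fix_plate("H4 0DM", "LD LLL")  # -> "H4 ODM"
```

This only works when the plate format is fixed and known, but in that case it cleans up exactly the letter/digit swaps described above without touching the engine at all.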

I hope this was of some help; do not hesitate to send me your remarks and questions!

Other Tips

Hi, I've done lots of OCR with Tesseract, and I have run into some of your problems too. You asked about image-processing tools, and I'd recommend "unpaper" (there are Windows ports too; see Google). It's a nice deskew, unrotate, remove-borders-and-noise program, great to run before OCR.

If your images have a (somewhat) variable background color, I'd recommend the "textcleaner" ImageMagick script. I think it edge-detects and whitens out all the non-edgy stuff.

And if you have complex text, then "ocropus" could be of use. The syntax (on Linux) is: "ocroscript rec-tess "

My setup is: 1. textcleaner, 2. unpaper, 3. ocropus.

With these three steps I can read almost anything: even quite blurry, noisy images taken in uneven lighting, with two columns of tightly packed text, come out very readable. OK, maybe your needs don't involve that much text, but steps 1 and 2 could be of use to you.

I'm currently building a license-plate recognition engine for ispy. I got much better results from Tesseract when I split the license plate into individual characters and built a new image with them laid out vertically, with white space around them, like:

W

4

O

O

M

I think a big problem Tesseract has is that it tries to make words out of horizontal runs of letters and numbers, and in the case of license plates, where letters and numbers are mixed, it will decide that a number is a letter or vice versa. Feeding it an image with the characters spaced vertically makes it treat them as individual characters instead of as text.
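The stacking step itself is simple once the characters are segmented. A sketch operating on raw pixel matrices (in a .NET project you'd do the same with System.Drawing; the glyph values here are fake 0/255 bitmaps):

```python
def stack_vertically(glyphs, pad=2, background=255):
    """Stack segmented character bitmaps (2D lists of pixel values)
    into one tall image, separated by blank rows, so the OCR engine
    sees one character per 'line' instead of one horizontal word."""
    width = max(len(row) for g in glyphs for row in g)
    blank = [[background] * width for _ in range(pad)]
    out = [row[:] for row in blank]
    for g in glyphs:
        for row in g:
            # pad narrower glyphs out to the common width
            out.append(row + [background] * (width - len(row)))
        out.extend(r[:] for r in blank)
    return out

a = [[0, 0], [0, 0]]   # fake 2x2 glyph
b = [[0, 0, 0]]        # fake 1x3 glyph
img = stack_vertically([a, b])
```

The blank padding rows are what convince the engine that each character is its own text line, which is the whole point of the trick described above.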

A great read! http://robotics.usc.edu/publications/downloads/pub/635/

About your skew problem for license plates:

Issue: When OCR input is taken from a hand-held camera or other imaging device whose perspective is not fixed, as it is in a scanner, text lines may get skewed from their original orientation [13]. Based on our experiments, feeding such a rotated image to our OCR engine produces extremely poor results.

Proposed Approach: A skew-detection process is needed before calling the recognition engine. If any skew is detected, an auto-rotation procedure is performed to correct it before processing the text further. While identifying the algorithm to be used for skew detection, we found that many approaches, such as the one mentioned in [13], are based on the assumption that documents have set margins. However, this assumption does not always hold in our application. In addition, traditional methods based on morphological operations and projection methods are extremely slow and tend to fail on camera-captured images. In this work, we choose a more robust approach based on the Branch-and-Bound text-line finding algorithm (RAST algorithm) [25] for skew detection and auto-rotation. The basic idea of this algorithm is to identify each line independently and use the slope of the best-scoring line as the skew angle for the entire text segment. After detecting the skew angle, rotation is performed accordingly. Based on our experiments, we found this algorithm to be highly robust and extremely efficient and fast. However, it suffered from one minor limitation: it failed to detect rotation greater than 30°. We also tried an alternate approach, which could detect any angle of skew up to 90°; however, that approach relied on the presence of some sort of cross in the image. Due to the lack of extensibility, we decided to stick with the RAST algorithm.
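The core idea, using the slope of a fitted text line as the skew angle, can be illustrated without implementing RAST itself. This toy sketch fits a single least-squares line through a text line's ink-pixel coordinates (RAST instead scores many candidate lines under a branch-and-bound search and keeps the best one):

```python
import math

def estimate_skew_degrees(ink_pixels):
    """Estimate a text line's skew from its ink pixel (x, y)
    coordinates via a least-squares line fit; the fitted slope
    is converted to an angle in degrees. A toy stand-in for
    RAST-style best-scoring-line skew detection."""
    n = len(ink_pixels)
    mean_x = sum(x for x, _ in ink_pixels) / n
    mean_y = sum(y for _, y in ink_pixels) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in ink_pixels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in ink_pixels)
    return math.degrees(math.atan(sxy / sxx))

# Pixels lying on a line that rises 1 unit every 10: about 5.7 degrees.
pixels = [(x, x / 10) for x in range(100)]
angle = estimate_skew_degrees(pixels)
```

Once the angle is known, the image is rotated by its negation before OCR. A plain least-squares fit is fragile against outlier noise pixels, which is exactly why the paper prefers a robust scoring scheme.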

Tesseract 3.0x, by default, penalizes character combinations that aren't words, and words that aren't common. The FAQ describes a method to increase its aversion to such nonsense; you might find it helpful to do the inverse and turn off the penalty for rare or nonexistent words, as described here: http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?
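For license plates, where the output is never a dictionary word, that usually means weakening or disabling the dictionary entirely. A Tesseract 3.x config fragment along these lines (variable names as in the 3.x parameter list; check your build's defaults before relying on them):

```
# Don't load the word dictionaries at all, so plate strings
# aren't pulled toward real words
load_system_dawg     F
load_freq_dawg       F

# Alternatively, keep the dictionaries but zero the penalties
# for rare and non-dictionary words
language_model_penalty_non_freq_dict_word 0
language_model_penalty_non_dict_word      0
```

Pass the config file name on the tesseract command line (or set the variables through the API) so the language model stops "correcting" plate text into words.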

ABCocr .NET uses Tesseract 3, so that might be appropriate if you need the latest code under .NET.

If anyone from the future comes across this question: there is a tool called jTessBoxEditor that makes training Tesseract a breeze. You just point it at a folder containing sample images, click a button, and it creates your *.traineddata file for you.

License: CC-BY-SA with attribution
Not affiliated with Stack Overflow