Restrict Preprocessing of Tesseract

Question 1

I was feeding a grayscale (8 bpp) image to Tesseract after my own preprocessing. Given a grayscale image, Tesseract tries to binarize it, i.e. convert it to black and white, and that was giving me bad results. I still don't know why.

But when I first converted my grayscale image to a b/w (1 bpp) image and then fed that image to Tesseract, I got much better results.
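As a rough sketch of that conversion step, assuming the grayscale image is available as a row-major buffer of 8-bit pixels and that a fixed threshold is good enough: the hypothetical helper `to_pbm` below thresholds each pixel and packs the result into a binary PBM (P4) image, which stores one bit per pixel. The function name and the way the threshold is chosen are illustrative assumptions, not something Tesseract requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: threshold an 8 bpp grayscale buffer (row-major,
// width * height bytes) and encode it as a binary PBM (P4) image.
// In PBM a set bit means black, so pixels darker than `threshold` become 1.
std::string to_pbm(const std::vector<std::uint8_t>& gray,
                   std::size_t width, std::size_t height,
                   std::uint8_t threshold) {
    std::string out = "P4\n" + std::to_string(width) + " " +
                      std::to_string(height) + "\n";
    const std::size_t row_bytes = (width + 7) / 8;  // each row is padded to whole bytes
    for (std::size_t y = 0; y < height; ++y) {
        std::vector<std::uint8_t> row(row_bytes, 0);
        for (std::size_t x = 0; x < width; ++x) {
            if (gray[y * width + x] < threshold) {
                // Bits are stored most significant bit first.
                row[x / 8] |= static_cast<std::uint8_t>(0x80u >> (x % 8));
            }
        }
        out.append(row.begin(), row.end());
    }
    return out;
}
```

The returned string can be written to a file in binary mode and passed to Tesseract like any other input image.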

Question 2

Regarding your question of why Tesseract delivers better results with a binary image than with a grayscale image as input:

Tesseract binarizes a grayscale image internally. I haven't figured out yet exactly which method it uses: some sources on the internet mention a local adaptive threshold, others a global Otsu threshold. What is certain is that Tesseract performs character recognition on a binary image, and that its own preprocessing can still fail on specific problems (for example, its layout analysis is not very good). So if you do the preprocessing yourself, give Tesseract a binary image containing only text, and disable all layout analysis in Tesseract, you can achieve better results than by letting Tesseract do everything for you. Since it is a free, open source utility, it has some known drawbacks, which have to be accepted.
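To illustrate doing the binarization yourself, here is a minimal sketch of a global Otsu threshold in plain C++. It is the textbook formulation (choose the gray level that maximizes the between-class variance), not necessarily what Tesseract does internally. The function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a flat buffer of 8-bit gray values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Textbook Otsu: return the gray level t that maximizes the between-class
// variance when pixels <= t are "dark" and pixels > t are "bright".
int otsu_threshold(const std::vector<std::uint8_t>& gray) {
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t v : gray) ++hist[v];

    const double total = static_cast<double>(gray.size());
    double sum_all = 0.0;
    for (int i = 0; i < 256; ++i) sum_all += i * static_cast<double>(hist[i]);

    double weight_dark = 0.0, sum_dark = 0.0, best_var = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weight_dark += static_cast<double>(hist[t]);
        sum_dark += t * static_cast<double>(hist[t]);
        if (weight_dark == 0.0) continue;        // no dark pixels yet
        const double weight_bright = total - weight_dark;
        if (weight_bright == 0.0) break;         // no bright pixels left
        const double mean_dark = sum_dark / weight_dark;
        const double mean_bright = (sum_all - sum_dark) / weight_bright;
        const double diff = mean_dark - mean_bright;
        const double var = weight_dark * weight_bright * diff * diff;
        if (var > best_var) {
            best_var = var;
            best = t;
        }
    }
    return best;
}

// Map every pixel to 0 (black) or 255 (white) using the Otsu threshold.
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray) {
    const int t = otsu_threshold(gray);
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) out[i] = gray[i] > t ? 255 : 0;
    return out;
}
```

A single global threshold like this tends to work on evenly lit scans. For uneven lighting, a local adaptive threshold is usually the better choice.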

If you use Tesseract as a command-line tool, this thread is very useful for the parameters: tesseract command line page segmentation

If you use the Tesseract source code in your own C++ code, you have to initialize Tesseract with the correct parameters. The parameters are described on the Tesseract API page: tesseract API