I am integrating Tesseract OCR in an app. Unfortunately the quality of the recognition is... not that great. The fix seems to be to do some very basic image cleaning before sending the image off for OCR.

Basically I plan to build a small pipeline that does the following:

  1. Crop to a white bounding box, on the assumption that most users will be trying to recognize ordinary black print on a white background (optional)
  2. Convert to black/white
  3. Despeckle to remove artifacts caused by step 2.

I have step 2 down (the easy part), and am looking for input on how to do step 3 and, optionally, step 1.
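To make the three steps concrete, here is a minimal sketch in plain Python. It is an illustration only, not what Tesseract or any particular library does: it assumes the image is already available as a list of rows of grayscale values (0 to 255), and the names `crop_to_white`, `binarize`, `despeckle`, and `clean_for_ocr`, as well as the `threshold` and `min_size` defaults, are hypothetical. Despeckling is done here by erasing small connected blobs of ink, which is one common approach among several.

```python
from collections import deque


def crop_to_white(gray, threshold=128):
    """Step 1 (optional): crop to the bounding box of all light pixels."""
    width = len(gray[0]) if gray else 0
    rows = [y for y, row in enumerate(gray) if any(px >= threshold for px in row)]
    cols = [x for x in range(width) if any(row[x] >= threshold for row in gray)]
    if not rows:
        return []  # no light pixels at all, nothing to crop to
    return [row[cols[0]:cols[-1] + 1] for row in gray[rows[0]:rows[-1] + 1]]


def binarize(gray, threshold=128):
    """Step 2: map each pixel to 1 (ink) or 0 (paper) with a fixed threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary, min_size=3):
    """Step 3: erase 8-connected ink blobs with fewer than min_size pixels."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    out = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill to collect every pixel of this blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 0
    return out


def clean_for_ocr(gray, threshold=128, min_size=3):
    """Run steps 1 to 3 in order and return a binary image."""
    return despeckle(binarize(crop_to_white(gray, threshold), threshold), min_size)
```

A fixed threshold and a fixed minimum blob size are the simplest possible choices; real scans with uneven lighting would likely need an adaptive threshold, and `min_size` has to stay smaller than punctuation marks such as periods or they will be erased along with the noise.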

Solution

Well... It turns out that Martin's suggestion of using ImageMagick is probably the best option in my case.

There's a Core Image (CI) filter that does noise removal, but it's not available on iOS, and I will have to use ImageMagick to convert a PDF to TIFF for OCR anyway, so ImageMagick it is.
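As a rough illustration of what that looks like, the sketch below builds an ImageMagick command line that rasterizes a PDF, converts it to black and white, despeckles it, and writes a TIFF. It assumes the `convert` command-line tool is installed with a PDF delegate (typically Ghostscript); the function names, file names, and the 300 dpi and 50% values are placeholders, not recommendations from this answer. In an iOS app you would presumably call the ImageMagick library directly rather than a command-line tool, so treat this as a desktop prototype of the same operations.

```python
import subprocess


def build_convert_command(src_pdf, dst_tiff, density=300, threshold="50%"):
    """Return the argument list for an ImageMagick PDF-to-TIFF cleanup run."""
    return [
        "convert",                # on ImageMagick 7 the entry point is "magick"
        "-density", str(density), # must come before the input to set raster dpi
        src_pdf,
        "-colorspace", "Gray",    # drop colour information
        "-threshold", threshold,  # pixels above become white, the rest black
        "-despeckle",             # reduce speckle noise
        dst_tiff,
    ]


def pdf_to_clean_tiff(src_pdf, dst_tiff):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_convert_command(src_pdf, dst_tiff), check=True)
```

Keeping the command construction separate from the call makes it easy to print and tweak the options before running them against real scans.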

An alternative is the small image processing library that Chris Greening made. If you don't need the full force of ImageMagick it will do most of the light lifting for you, and some of the heavy lifting too.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow