Extracting and parsing specific layout info from OCR engine

https://stackoverflow.com/questions/8367641

27-10-2019

Question

I'm attempting to parse layout information from OCR engines with PHP, but they are not giving me any details.

I have both Tesseract (with Leptonica) and Cuneiform installed. Supposedly Cuneiform is excellent at detecting layout (i.e. what is text, what is a picture, etc.). The inputs are PNG files containing both text and pictures (obviously the text is part of the image).

They all seem to think I want the output as txt, html, or hOCR, when what I actually want are the coordinates of what the engine thinks is text and what it thinks is an image.

Cuneiform has a "native" output option, the Cuneiform 2000 format. Opening it in Notepad++, I can see that it's compressed. I've tried extracting it with zip and gzip, but neither recognizes it. There's no info on Google about the native Cuneiform format either.

Does anyone have any idea how to extract the layout information from Tesseract or Cuneiform, or a better way to figure out the layout of an image containing text blocks and pictures?

Solution

Have a look at ABBYY FineReader Engine. Its API exposes detailed information about the recognized text, including its coordinates. It's not free, but for business software ABBYY OCR technologies can add serious value to your product.

Since you are working on a web application in PHP, you may want to use the ABBYY OCR Engine web API at www.ocrsdk.com. It's currently in closed beta, so for now it's free to use.
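As an aside on the hOCR output mentioned in the question: hOCR is HTML in which elements carry a `title` attribute of the form `bbox x0 y0 x1 y1`, so coordinates can usually be read from it. Below is a minimal sketch in Python (the same idea ports to PHP's DOM functions). The `SAMPLE` markup, the file name `scan.png`, and the helper names `HocrBoxParser` and `extract_boxes` are made up for illustration. Whether a given engine also emits elements for picture regions is an assumption you would need to check against its actual output.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. title="bbox 40 30 760 60".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxParser(HTMLParser):
    """Collect (class, x0, y0, x1, y1) for every element whose class starts with 'ocr' and has a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        css_class = attr.get("class") or ""
        match = BBOX_RE.search(attr.get("title") or "")
        if css_class.startswith("ocr") and match:
            self.boxes.append((css_class, *map(int, match.groups())))


def extract_boxes(hocr_text):
    """Return the list of bounding boxes found in an hOCR document (hypothetical helper)."""
    parser = HocrBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


# Made-up hOCR fragment, only for demonstration.
SAMPLE = """
<div class='ocr_page' title='image "scan.png"; bbox 0 0 800 600'>
  <div class='ocr_carea' title='bbox 40 30 760 200'>
    <span class='ocr_line' title='bbox 40 30 760 60'>Hello world</span>
  </div>
</div>
"""

if __name__ == "__main__":
    for box in extract_boxes(SAMPLE):
        print(box)
```

Each tuple gives the element class and its pixel rectangle. Areas of the page not covered by any text block are candidates for pictures, though that inference is a heuristic and not something the format guarantees.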

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow