C# solution for rendering PDFs and OCRing the resulting images? [closed]

https://stackoverflow.com/questions/10831963

11-06-2021

Question

I'm looking for a C# solution to import data from PDF documents into our database, in a commercial application. Our customers will want to import arbitrary documents. Ordinarily I'd write this off as a complete impossibility, but the documents they're importing will be in their own set layout.

My plan is to have the PDFs rendered to static images, then allow the users to set up their own templates, which essentially pull out text at predefined pixel-offsets in the PDF, using OCR. For tables, they define a location of the table and a bunch of further values for column and row sizes. We can then apply the template onto that document type.
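As a rough illustration of the table part of such a template, here is a minimal sketch that turns a table origin plus column and row sizes into one pixel rectangle per cell, which could then be handed to an OCR step. The types PixelRect and TableTemplate are hypothetical names made up for this example and do not come from any PDF or OCR library.

```csharp
using System.Collections.Generic;

// Hypothetical types for illustration only; not taken from any PDF or OCR library.
public readonly struct PixelRect
{
    public PixelRect(int x, int y, int width, int height)
    {
        X = x; Y = y; Width = width; Height = height;
    }

    public int X { get; }
    public int Y { get; }
    public int Width { get; }
    public int Height { get; }
}

public sealed class TableTemplate
{
    private readonly int _left;
    private readonly int _top;
    private readonly int[] _columnWidths;
    private readonly int[] _rowHeights;

    // left/top: pixel offset of the table's top-left corner on the rendered page.
    public TableTemplate(int left, int top, int[] columnWidths, int[] rowHeights)
    {
        _left = left;
        _top = top;
        _columnWidths = columnWidths;
        _rowHeights = rowHeights;
    }

    // Yields one rectangle per cell, row by row, left to right.
    public IEnumerable<PixelRect> Cells()
    {
        int y = _top;
        foreach (int height in _rowHeights)
        {
            int x = _left;
            foreach (int width in _columnWidths)
            {
                yield return new PixelRect(x, y, width, height);
                x += width;
            }
            y += height;
        }
    }
}
```

Each rectangle would be cropped from the rendered page image and OCRed on its own. Pixel offsets only stay valid if every page is rendered at the same resolution the template was defined at.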

So, what I'm really looking for is two libraries: one to convert PDFs to images, another to OCR those images.

Requirements:

Is pure-C# or has a supported C# wrapper onto a native DLL.
Doesn't fork out processes - wrappers that essentially just create command line parameters and launch an external executable aren't allowed in this case.
In the case of FOSS, allows us to exempt ourselves from normal FOSS license requirements (i.e. publishing our source code) by paying a license fee.

We certainly don't mind paying for a commercial solution, but we'd rather not get stuck with paying a fee per individual distribution of the software.

I know this is quite a specific requirement set - perhaps enough for some people to deem this question too localised, but I'm hoping that someone can suggest an approach and some libraries that can be helpful to me, as well as others in the future.

Stuff I've looked into for the PDF side:

iTextSharp - Documentation is a book you have to buy, not a good start. Doesn't seem to be much useful documentation regarding turning PDFs into images in the public domain. Licensing is opaque, looks like we have to pay per client we distribute to.
Docotic.Pdf - Text only, no use to us.
pdftohtml - Again, doesn't produce images. Would be a mess to port to C# too.
PdfFileParser - Still not what we need.
GhostScript - Pretty much exactly what we want, but requires forking out to a program.

For the OCR side, I'll probably end up using Tesseract, since the Apache license is permissive and it's got good reviews. If there's an alternative, I'd be interested in that too.

Solution

I think you might want to give Docotic.Pdf another chance.

The library can extract text chunks, words and even individual characters with their bounding rectangles. Please have a look at the sample for extraction of words from PDFs.
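When words come with bounding rectangles, a template region can be applied without OCR at all. The following is a library-agnostic sketch, not Docotic.Pdf code: Word and RegionText are hypothetical types, and it assumes word bounds are given in PDF points (1/72 inch) with a top-left origin, while the template region is given in pixels of a page rendered at a known DPI.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical word type; a real library would supply its own equivalent.
public sealed class Word
{
    public Word(string text, double x, double y, double width, double height)
    {
        Text = text; X = x; Y = y; Width = width; Height = height;
    }

    public string Text { get; }
    public double X { get; }      // left edge, in points (assumed top-left origin)
    public double Y { get; }      // top edge, in points
    public double Width { get; }
    public double Height { get; }
}

public static class RegionText
{
    // One PDF point is 1/72 inch, so pixels at a given DPI convert like this.
    public static double PixelsToPoints(double pixels, double dpi)
    {
        return pixels * 72.0 / dpi;
    }

    // Joins the words whose centre lies inside the pixel region, in reading order.
    public static string Extract(IEnumerable<Word> words,
        double leftPx, double topPx, double widthPx, double heightPx, double dpi)
    {
        double left = PixelsToPoints(leftPx, dpi);
        double top = PixelsToPoints(topPx, dpi);
        double right = left + PixelsToPoints(widthPx, dpi);
        double bottom = top + PixelsToPoints(heightPx, dpi);

        var hits = words
            .Where(w =>
            {
                double cx = w.X + w.Width / 2;
                double cy = w.Y + w.Height / 2;
                return cx >= left && cx <= right && cy >= top && cy <= bottom;
            })
            .OrderBy(w => w.Y)   // naive ordering: top to bottom, then left to right
            .ThenBy(w => w.X)
            .Select(w => w.Text);

        return string.Join(" ", hits);
    }
}
```

Testing the word centre rather than the full rectangle tolerates words that slightly overlap the region edge. The ordering is deliberately naive: words on one visual line whose top edges differ slightly may come out in the wrong order, so a real implementation would probably group words into lines first.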

Also, Docotic.Pdf can create images from PDFs and draw pages on a System.Drawing.Graphics. Please have a look at the "Draw and print Pdf" group of samples.

Disclaimer: I am one of the developers of the library.

OTHER TIPS

I would like to recommend Amyuni PDF Creator .Net for this task.

1st Scenario:
If your PDF files are well defined (no missing font information, etc.), you can extract the text directly from the PDF by specifying a rectangular region in the method GetObjectsInRectangle. You should also use the option acGetRectObjectsOptimize:

Optimize text objects before returning them. That is, combine text objects that are close to each other into a single text object.

2nd Scenario:
If there are images involved that also contain text, rendering the whole page into an image and then applying OCR might be a better choice. You can do this with Amyuni PDF Creator .Net by using the methods ExportToTiff, ExportToJPeg, or RasterizePageRange.

From the documentation:

IacDocument.RasterizePageRange Method
The RasterizePageRange method converts page contents into a color or grey scale image. When archiving documents or performing OCR, it is sometimes preferable for all pages to be stored as images rather than complex text and graphic operations.

Then you can use our OCR add-in, which integrates with Tesseract OCR. Once the pages have been OCRed, you are back in the 1st Scenario and can pull the text out with GetObjectsInRectangle. To apply OCR to your files, use the method OCRPageRange.

void OCRPageRange(int startPage, int EndPage, string Language, acOCROptions Options)

As for licensing, Amyuni PDF Creator .Net provides a royalty-free license per application.

Usual disclaimer applies

Licensed under: CC-BY-SA with attribution
