Performing Optical Character Recognition on PDFs from ColdFusion using a Java or .NET Library?

https://stackoverflow.com/questions/496875

20-08-2019
Question

I want to extract the text from a PDF and then index it with ColdFusion's built-in Verity search so that the contents are searchable.

Are there any libraries that already do this well? Java and .NET libraries are both in scope (Java preferred), since either can be called from CF.

Any insights or experiences would be greatly appreciated... thanks!

Edit: As far as I know, CF can index PDF files when the text is embedded in the PDF. The PDFs I have to deal with contain the text only as scanned images.

Solution

If you are able to run your own software (i.e. on a dedicated server or VPS), you could investigate using Tesseract OCR with cfexecute to convert the PDFs to text.
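As a rough sketch of what that could look like on the Java side: the class below builds and runs two command lines, one that rasterizes the PDF to a TIFF and one that runs Tesseract on the result. The class and method names (PdfOcrSketch, rasterizeCommand, tesseractCommand, ocrPdf) are made up for illustration. Using Ghostscript (gs) for the PDF-to-TIFF step is my own assumption and is not part of the suggestion above. It also assumes both gs and tesseract are on the PATH and that the Tesseract build is recent enough to read compressed, multi-page TIFFs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an existing library class.
class PdfOcrSketch {

    // Assumption: Ghostscript is installed as "gs". Renders every page of
    // the PDF into a single black-and-white TIFF at 300 dpi.
    static List<String> rasterizeCommand(String pdfPath, String tiffPath) {
        return Arrays.asList(
            "gs", "-q", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=tiffg4", "-r300",
            "-sOutputFile=" + tiffPath,
            pdfPath);
    }

    // Tesseract CLI form: tesseract <image> <outputbase>
    // The recognized text is written to <outputbase>.txt.
    static List<String> tesseractCommand(String tiffPath, String outputBase) {
        return Arrays.asList("tesseract", tiffPath, outputBase);
    }

    // Runs one external command and fails loudly on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed (" + exitCode + "): " + command);
        }
    }

    // Full pipeline: PDF -> TIFF -> text. workBase is a path without an
    // extension; workBase + ".tif" and workBase + ".txt" are created.
    static String ocrPdf(String pdfPath, String workBase)
            throws IOException, InterruptedException {
        String tiffPath = workBase + ".tif";
        run(rasterizeCommand(pdfPath, tiffPath));
        run(tesseractCommand(tiffPath, workBase));
        byte[] text = Files.readAllBytes(Paths.get(workBase + ".txt"));
        return new String(text, StandardCharsets.UTF_8);
    }
}
```

The same two command lines could be issued from cfexecute directly instead of going through Java. Either way, the .txt file that Tesseract writes is plain text, so it should be something Verity can index like any other text document.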

OTHER TIPS

Verity should be able to index PDF files by default:

http://livedocs.adobe.com/coldfusion/6/Developing_ColdFusion_MX_Applications_with_CFML/indexSearch2.htm#1142322

Ray Camden has an eight-part series on working with PDFs in ColdFusion 8.

Part 7 of the series covers using DDX to get text out of a PDF.

I'm not sure this will work for your OCR needs, but it may still be worth looking at.

On a semi-related note, I found a very neat post about encoding and reading 2D matrix barcodes in ColdFusion.

http://www.stillnetstudios.com/2007/12/15/2d-barcodes-coldfusion/

This might solve some of my issues in needing to extract encoded information, but I am still after the body of the text.

Regarding Tesseract, I also found a .NET version (tessnet2): http://www.pixel-technology.com/freeware/tessnet2/ If only I could feed in PDFs natively instead of TIFFs... :)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow