Performing Optical Character Recognition on PDFs from ColdFusion using a Java or .NET Library?
20-08-2019
Question
I am looking to take a PDF and extract any text from it. I then want to index that text with ColdFusion's Verity search so the contents can be searched.
Are there any libraries out there that already do this well? I am including Java and .NET libraries in the scope (Java preferred), since they can be called from CF.
Any insights or experiences would be greatly appreciated... thanks!
Edit: As far as I know, CF can index PDF files when the text is embedded in the PDF. The PDFs I have to deal with contain the text as scanned images.
Solution
If you have the ability to run your own software (i.e. a dedicated server or VPS), you could investigate using Tesseract OCR with cfexecute to convert the PDFs to text.
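Since Java libraries are in scope, the same idea can also be sketched on the Java side. What follows is a minimal sketch rather than a tested integration. It assumes Java 11 or later, that the `tesseract` executable is on the server's PATH, and that the scanned PDF page has already been converted to an image, because Tesseract reads image files (such as TIFF) rather than PDFs. The class name `TesseractOcr`, its method names, and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: shells out to the Tesseract command-line tool.
class TesseractOcr {

    // Builds the command "tesseract <image> <outputBase>".
    // Tesseract appends ".txt" to outputBase when writing its result.
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Runs Tesseract on one image and returns the recognized text.
    static String ocr(Path image, Path outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .redirectErrorStream(true) // merge stderr into stdout
                .start();

        // Drain the process output so it cannot block on a full pipe.
        process.getInputStream().readAllBytes();

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // Read the text file that Tesseract wrote next to outputBase.
        Path textFile = Path.of(outputBase.toString() + ".txt");
        return Files.readString(textFile, StandardCharsets.UTF_8);
    }
}
```

Calling `TesseractOcr.ocr(Path.of("scan.tif"), Path.of("scan"))` would write `scan.txt` and return its contents, which could then be handed to the search index. The equivalent cfexecute call would pass the same two arguments to the `tesseract` executable.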
OTHER TIPS
Verity should be able to index PDF files by default.
Ray Camden has an eight-part series on working with PDFs in ColdFusion 8.
Part 7 of the series covers using DDX to get text out of a PDF.
I'm not sure this will work for your OCR needs, but it may still be worth a look.
On a semi-related note, I found a very neat post about encoding and reading 2D matrix barcodes in ColdFusion.
http://www.stillnetstudios.com/2007/12/15/2d-barcodes-coldfusion/
This might solve some of my issues in needing to extract encoded information, but I am still after the body of the text.
Regarding Tesseract, I found a .NET version too (tessnet2): http://www.pixel-technology.com/freeware/tessnet2/ If only I could feed in PDFs natively instead of TIFFs. :)