Question

Is there any Java library, open source or paid, that can make the text in a PDF searchable?

How can I apply OCR to a PDF using PDFBox, so that the PDF text becomes searchable programmatically? I searched a lot and didn't find any solution. Can anyone post code for OCR with PDFBox?

Solution

Try Apache PDFBox.

To extract text, see the PDFBox "Textextraction" page.

OTHER TIPS

Any java library? How to make searchable text using any java library? Open source or Paid.

You can achieve this using Gnostice XtremeDocumentStudio for Java. For more details, follow the link below.

http://www.gnostice.com/nl_article.asp?id=289&t=How_to_convert_scanned_images_to_searchable_PDF_in_Java

FYI, the article demonstrates how to convert a scanned image to a searchable PDF. In fact, the input can be any scanned document (images, PDF, or DOCX).

Disclaimer: I work for Gnostice.

You can use PDFBox to extract images from a PDF file, and then use the OCR system of your choice (for example, Tesseract) to obtain the text. Alternatively, if the PDF is mixed text and images, you can use Ghostscript to create an image of each PDF page, and then run OCR.
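One way to run the OCR step from Java without adding a library is to call the Tesseract command-line tool. The following is only a minimal sketch, assuming the `tesseract` executable is installed and on the PATH and that a page image has already been produced; the class and method names are hypothetical. The trailing `pdf` argument asks Tesseract to write a searchable PDF itself, while `tsv` or `hocr` would give word positions instead.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class TesseractCli {

    // Builds: tesseract <image> <outputBase> -l <lang> <format>
    // <format> is a Tesseract output config such as "pdf", "tsv" or "hocr".
    public static List<String> buildCommand(Path image, Path outputBase,
                                            String lang, String format) {
        return List.of("tesseract", image.toString(), outputBase.toString(),
                "-l", lang, format);
    }

    // Runs Tesseract and returns its exit code (0 means success).
    public static int run(Path image, Path outputBase, String lang, String format)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, lang, format))
                .inheritIO() // show Tesseract's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: TesseractCli <image> <outputBase>");
            return;
        }
        int exit = run(Paths.get(args[0]), Paths.get(args[1]), "eng", "pdf");
        System.out.println("tesseract exited with code " + exit);
    }
}
```

Tesseract appends the extension to the output base on its own, so an output base of `page-1` with the `pdf` format should produce `page-1.pdf`.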

If you then need a searchable PDF file, build a new PDF by writing the text first, and then drawing the image on top of the text. The text will be searchable, but only the image will be visible.
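To make the layering concrete, here is a rough sketch of the page content a PDF writer would emit for this approach: text operators first, then the image operator, so the image is painted last and covers the text. It only builds the operator strings. It assumes the page resources already define a font named `F1` and an image named `Im0` (both names are hypothetical), and it handles plain ASCII text only. In practice a library such as PDFBox would write these operators for you.

```java
import java.util.Locale;

public class SearchablePageStream {

    // Escapes the characters that are special inside a PDF literal string.
    public static String escape(String text) {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    // Text operators that show one word with its baseline starting at (x, y), in points.
    public static String textOp(String word, double x, double y, double fontSize) {
        return String.format(Locale.ROOT, "BT /F1 %.2f Tf %.2f %.2f Td (%s) Tj ET\n",
                fontSize, x, y, escape(word));
    }

    // Image operator that stretches the image Im0 over the whole page.
    public static String imageOp(double pageWidth, double pageHeight) {
        return String.format(Locale.ROOT, "q %.2f 0 0 %.2f 0 0 cm /Im0 Do Q\n",
                pageWidth, pageHeight);
    }

    public static void main(String[] args) {
        StringBuilder content = new StringBuilder();
        // 1. Text goes first, so it ends up underneath.
        content.append(textOp("Hello", 72, 700, 12));
        content.append(textOp("world", 110, 700, 12));
        // 2. The scanned image is painted last and hides the text.
        content.append(imageOp(612, 792));
        System.out.print(content);
    }
}
```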

Note that OCR engines like Tesseract and Google Vision will return positional information for each word, so you will be able to place the text in the correct position.
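Placing the words usually needs a coordinate conversion, because OCR boxes are typically given in image pixels measured from the top-left corner, while PDF pages use points (1/72 inch) measured from the bottom-left corner. The sketch below assumes a box given as left, top, width, and height, as in Tesseract's TSV output, and a known scan resolution; the class and method names are hypothetical.

```java
public class OcrBoxToPdf {

    // Converts a word box from image pixels (origin top-left, y pointing down)
    // to PDF points (origin bottom-left, y pointing up).
    // Returns {x, y, width, height}, where (x, y) is the lower-left corner of the box.
    public static double[] toPdfPoints(int left, int top, int width, int height,
                                       int imageHeightPx, double dpi) {
        double scale = 72.0 / dpi; // points per pixel
        double x = left * scale;
        double y = (imageHeightPx - (top + height)) * scale; // flip the y axis
        return new double[] {x, y, width * scale, height * scale};
    }

    public static void main(String[] args) {
        // A word box on a US Letter page scanned at 300 dpi (2550 x 3300 pixels).
        double[] box = toPdfPoints(300, 150, 600, 60, 3300, 300.0);
        System.out.printf("x=%.1f y=%.1f w=%.1f h=%.1f%n", box[0], box[1], box[2], box[3]);
    }
}
```

The box height in points can serve as a first guess for the font size, so that the hidden text roughly lines up with the word in the image.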

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow