How to make an existing PDF searchable with OCR using a Java library?

Question 1

Question 2

Is there a Java library, open source or paid, that can make the text in an existing PDF searchable?

You can achieve this using Gnostice XtremeDocumentStudio for Java. For more details, follow the link below.

http://www.gnostice.com/nl_article.asp?id=289&t=How_to_convert_scanned_images_to_searchable_PDF_in_Java

FYI, the article demonstrates how to convert a scanned image to a searchable PDF. In fact, the input can be any scanned document (image, PDF, or DOCX).

Disclaimer: I work for Gnostice.

Question 3

You can use PDFBox to extract the images from a PDF file, and then use the OCR system of your choice (for example, Tesseract) to obtain the text. Alternatively, if the PDF mixes text and images, you can use Ghostscript to render each PDF page to an image, and then run OCR on those page images.
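As a rough sketch of the Ghostscript route, the class below builds a command line that renders every page to a PNG and runs it with `ProcessBuilder`. The class and method names are made up for illustration, the executable is assumed to be `gs` on the PATH (on Windows it is usually `gswin64c`), and 300 dpi is only a common starting point for OCR, not a requirement.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper: rasterize each page of a PDF to a PNG via Ghostscript.
final class GhostscriptRasterizer {

    // Builds the argument list. "gs" on the PATH is an assumption.
    // outputPattern may contain %03d, which Ghostscript replaces with the page number.
    static List<String> buildCommand(String inputPdf, String outputPattern, int dpi) {
        return List.of(
            "gs",
            "-dNOPAUSE", "-dBATCH", "-dSAFER", // run non-interactively, then exit
            "-sDEVICE=png16m",                 // 24-bit colour PNG output
            "-r" + dpi,                        // rendering resolution in dpi
            "-sOutputFile=" + outputPattern,
            inputPdf);
    }

    // Runs Ghostscript and returns its exit code (0 means success).
    static int rasterize(String inputPdf, String outputPattern, int dpi)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(inputPdf, outputPattern, dpi))
            .inheritIO() // forward Ghostscript's messages to this console
            .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: GhostscriptRasterizer <input.pdf>");
            return;
        }
        int exitCode = rasterize(args[0], "page-%03d.png", 300);
        System.out.println("Ghostscript exited with code " + exitCode);
    }
}
```

Each resulting `page-NNN.png` can then be handed to the OCR engine. Keep the dpi value around, because it is needed later to map OCR pixel coordinates back onto the PDF page.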

If you then need a searchable PDF file, build a new PDF by writing the text first, and then drawing the image on top of the text. The text will still be searchable and selectable, but because the image covers it, you will only see the image.

Note that OCR engines like Tesseract and Google Vision return positional information for each word, so you will be able to place each word at the position where it appears in the image.
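One detail to watch when placing the text: OCR bounding boxes are normally given in image pixels with the origin at the top-left corner, while PDF user space is measured in points (1/72 inch) with the origin at the bottom-left. The sketch below shows the conversion under those assumptions. The class and method names are hypothetical, and using the bottom edge of the word box as the text baseline is an approximation that ignores descenders.

```java
// Hypothetical helper: map an OCR word box (pixels, top-left origin)
// to PDF user-space coordinates (points, bottom-left origin).
final class OcrToPdfCoords {

    static final double POINTS_PER_INCH = 72.0;

    // Returns {x, y} in points for the lower-left corner of the word box.
    // dpi is the resolution at which the page image was rendered or scanned.
    static double[] toPdfPoint(int leftPx, int topPx, int heightPx,
                               int dpi, double pageHeightPt) {
        double x = leftPx * POINTS_PER_INCH / dpi;
        // Flip the vertical axis: measure up from the bottom of the page.
        double y = pageHeightPt - (topPx + heightPx) * POINTS_PER_INCH / dpi;
        return new double[] {x, y};
    }

    // Rough font size estimate: the height of the word box in points.
    static double estimateFontSize(int heightPx, int dpi) {
        return heightPx * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // A US Letter page (612 x 792 pt) rendered at 300 dpi is 2550 x 3300 px.
        double[] p = toPdfPoint(300, 600, 50, 300, 792.0);
        System.out.println("x=" + p[0] + " y=" + p[1]
            + " size=" + estimateFontSize(50, 300));
    }
}
```

For a word box at left 300 px, top 600 px, and height 50 px on a 300 dpi rendering of a 792 pt tall page, this gives the point (72.0, 636.0) and a font size estimate of 12.0 pt. Those values would then be used as the text offset and font size when writing the hidden text layer.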