How can I extract raw text from PDFs using Apache POI?

https://stackoverflow.com/questions/16910731

30-05-2022

Question

I need to extract raw text from several files, some of which are PDF and some of which are DOC file formats.

I have to use Apache POI to do this. I have found plenty of documentation on working with Word files (extracting text, writing to them, and so on), but I cannot find any documentation on extracting text from a PDF.

Am I wrong in believing that Apache POI has this capability?

If so, can anyone recommend similar Java programs that allow text extraction from multiple file formats?

If not, can anyone point me to the documentation and/or the classes/methods that I should be looking at to do this?

Thank you in advance for any help.

Solution

Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, which PDF isn't.

You'll either want to use Apache PDFBox directly, or use Apache Tika, which handles both Microsoft Office and PDF file formats (amongst many others).
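If you go the "PDFBox directly" route, you still need something that decides which library each file goes to. Here is a minimal sketch of that routing step, using only the JDK. The class name `ExtractorRouter`, the `Backend` enum and the extension-based rule are assumptions made for illustration; they are not part of POI, PDFBox or Tika. The library classes named in the comments (`PDFTextStripper`, `WordExtractor`, `XWPFWordExtractor`, the `Tika` facade) are where I would expect the real extraction calls to go, but check them against the versions you actually use.

```java
import java.util.Locale;

public class ExtractorRouter {

    /** Hypothetical labels for the library that would handle a given file. */
    public enum Backend { PDFBOX, POI, TIKA }

    /** Picks a backend from the file extension alone (no content sniffing). */
    public static Backend backendFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".pdf")) {
            // PDF is not an Office format, so POI cannot read it.
            // With PDFBox the text would typically come from PDFTextStripper.
            return Backend.PDFBOX;
        }
        if (name.endsWith(".doc") || name.endsWith(".docx")) {
            // Word formats are POI territory: WordExtractor for .doc,
            // XWPFWordExtractor for .docx.
            return Backend.POI;
        }
        // Anything else: hand it to Tika and let it detect the type.
        return Backend.TIKA;
    }

    public static void main(String[] args) {
        for (String f : new String[] {"report.pdf", "letter.DOC", "notes.odt"}) {
            System.out.println(f + " -> " + backendFor(f));
        }
    }
}
```

If you use Tika for everything, this routing becomes unnecessary, since Tika detects the file type itself and delegates to PDFBox or POI under the hood.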

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow