Question

I'm using Python PDFTK as part of a PDF text extraction project I'm working on. Does anyone know of any better text extraction libraries I can use?

I'm working in Python, but I'm open to tools in other languages.

I'm looking for alternatives: basically anything that performs as well or better. A few of my PDFs (not encrypted, etc.) just aren't being recognized by the PDFTK extractor, and I'm not making the progress I'd hoped for.

Thanks for your time.


Solution

Try PDFMiner. It is a PDF library that supports a lot of features. It also ships with a command-line tool named pdf2txt.py, and its documentation gives an example of extracting the contents of an encrypted PDF file to a plain text document. See the pdf2txt.py section on the PDFMiner page.
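As a rough illustration of the suggestion above, here is a minimal sketch that calls the pdf2txt.py tool from Python. It assumes PDFMiner is installed and that pdf2txt.py is on your PATH and directly executable (on Windows you may need to invoke it through the Python interpreter instead). The helper names build_pdf2txt_command and pdf_to_text are made up for this example and are not part of PDFMiner; only the -o (output file) and -P (password) options of the script are used.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_pdf2txt_command(pdf_path: str, out_path: str,
                          password: Optional[str] = None) -> List[str]:
    """Build the argument list for PDFMiner's pdf2txt.py script.

    Hypothetical helper for this example, not part of PDFMiner.
    """
    # -o names the file the extracted text is written to.
    cmd = ["pdf2txt.py", "-o", out_path]
    if password:
        # -P supplies the password for an encrypted PDF.
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


def pdf_to_text(pdf_path: str, password: Optional[str] = None) -> str:
    """Run pdf2txt.py on one PDF and return the extracted text.

    Assumes pdf2txt.py is on PATH; raises CalledProcessError if it fails.
    """
    # Write the text next to the PDF, e.g. report.pdf -> report.txt.
    out_path = str(Path(pdf_path).with_suffix(".txt"))
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, password),
                   check=True)
    return Path(out_path).read_text(encoding="utf-8")
```

Calling pdf_to_text("report.pdf") (a placeholder file name) would write report.txt next to the PDF and return its contents. If you use the maintained pdfminer.six fork, it also offers a Python-level extract_text function in pdfminer.high_level, which avoids the subprocess call.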

It also supports CJK languages (subject to installing some dependencies).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow