Alternatives to (Python) PDFtk? [closed]

https://stackoverflow.com/questions/17896951

04-06-2022

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 8 years ago.


I'm using Python PDFTK as part of a PDF text extraction project I'm working on. Does anyone know of any better text extraction libraries I can use?

I'm using Python, but I'm open to tools in other languages too.

I'm also looking for alternatives: basically anything that performs as well or better. A few of my PDFs (not encrypted, etc.) just aren't being handled by the PDFTK extractor, and I'm not making the progress I'm looking for.

Thanks for your time.

Solution

Try PDFMiner. It is a PDF library that supports a lot of features. It also ships a command-line tool named pdf2txt.py, and its documentation includes an example of extracting the contents of an encrypted PDF file to a plain text document. Refer to the pdf2txt.py section on the project page.

It also supports CJK languages (subject to installing some dependencies).
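To give a rough idea of how that tool could be driven from Python, here is a minimal sketch that shells out to pdf2txt.py. It assumes PDFMiner is installed and that pdf2txt.py is on your PATH. The helper names `build_pdf2txt_command` and `extract_text` and the file names are made up for illustration; only the `-t`, `-o` and `-P` flags come from PDFMiner itself.

```python
import shlex
import subprocess


def build_pdf2txt_command(pdf_path, out_path, password=None):
    """Build the argument list for PDFMiner's pdf2txt.py script."""
    # -t text selects plain-text output, -o names the output file.
    cmd = ["pdf2txt.py", "-t", "text", "-o", out_path]
    if password:
        # -P supplies the password for an encrypted PDF.
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


def extract_text(pdf_path, out_path, password=None):
    """Run pdf2txt.py and return the extracted text.

    Assumes pdf2txt.py is installed and directly executable.
    """
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, password), check=True)
    with open(out_path, encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    # Hypothetical file names; this only prints the command, it does not run it.
    print(shlex.join(build_pdf2txt_command("report.pdf", "report.txt")))
```

Calling `extract_text("report.pdf", "report.txt")` would then write the text file and return its contents. If a PDF still comes out empty, it may be a scanned image with no text layer, in which case no text extractor will help without OCR.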

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow