I'm using PDFTK from Python as part of a PDF text extraction project. Does anyone know of a better text extraction library I could use?

I'm working in Python, but I'm open to tools in other languages.

I'm open to any alternative that performs as well or better. A few of my PDFs (not encrypted or otherwise protected) just aren't being recognized by the PDFTK extractor, so I'm not making the progress I'd hoped for.

Thanks for your time.

Solution

Try PDFMiner. It is a PDF library that supports a lot of features. It also ships a command-line tool named pdf2txt.py, and its documentation includes an example of extracting the contents of an encrypted PDF file to a plain text document. Refer to the pdf2txt.py section of the page.

It also supports CJK languages (subject to installing some dependencies).
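
If you'd rather call PDFMiner from your own script than run pdf2txt.py, something like the sketch below should work. It assumes the maintained pdfminer.six fork (installed with `pip install pdfminer.six`), which provides `pdfminer.high_level.extract_text`; older PDFMiner releases may not have that function. The helper name `pdf_to_text` is just something I made up for illustration.

```python
from pathlib import Path


def pdf_to_text(pdf_path, password=""):
    """Return the text of a PDF as a single string.

    Hypothetical helper, not part of PDFMiner itself. Assumes the
    pdfminer.six fork is installed.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so a missing file is reported before pdfminer is needed.
    from pdfminer.high_level import extract_text

    # An empty password is fine for PDFs that are not encrypted.
    return extract_text(str(path), password=password)


if __name__ == "__main__":
    import sys

    # Print the extracted text of every PDF named on the command line.
    for name in sys.argv[1:]:
        print(pdf_to_text(name))
```

From the shell, the rough equivalent is `pdf2txt.py -o output.txt input.pdf`, adding `-P yourpassword` for an encrypted file. If a PDF still comes back empty, it may contain only scanned images, in which case no text extractor will help without OCR.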

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow