I used a Linux command-line tool to convert a set of PDF files to text.

Command:

pdftotext -htmlmeta

This works well for most of my files.

But for a small number of them, it returns a blank text file.

The PDF files that failed were not encrypted, not password-protected, and not read-only.


Solution

Converting PDFs to text is not a well-defined process. It can work very well or not at all, depending on the input PDF.

Why is this? Because a PDF's main job is to describe how a document looks, not what its text says. A PDF can be anything from plain text with positional information to a pure image of the glyphs that make up the text. In the latter case you would need to run OCR on the input to get any text out of it, and tools like pdftotext do not do that. That would explain a blank output file.
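To find out which of your files fall into the "image only" category, you could check whether the extracted text contains anything besides whitespace. Here is a minimal sketch in Python, assuming pdftotext is installed and on your PATH. The function names `has_extractable_text` and `extract_text` are made up for this example, and it deliberately leaves out `-htmlmeta` so that the output is not wrapped in HTML.

```python
import shutil
import subprocess


def has_extractable_text(text: str, min_chars: int = 1) -> bool:
    """Return True if text holds at least min_chars non-whitespace characters."""
    # str.split() with no arguments splits on any whitespace, including the
    # form feed characters that pdftotext emits between pages.
    visible = "".join(text.split())
    return len(visible) >= min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return what it writes to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

A file for which `has_extractable_text(extract_text(path))` is false probably has no text layer and is a candidate for OCR rather than for another run of pdftotext.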

Sometimes the text in a PDF is scattered throughout the file, e.g. because all the regular-font letters are listed first and all the italic letters follow later in the file. Each piece carries positional information, so someone reading the rendered page won't notice, even if regular and italic text are mixed throughout the page. Rearranging this mess into fluent text is a major task that few converters can handle.
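To illustrate the reordering problem, here is a toy sketch in Python. It is not how pdftotext or any particular converter works. It assumes the fragments have already been pulled out of the PDF as (x, y, text) tuples with y growing downward, and the names `Fragment`, `reading_order` and `line_tolerance` are invented for the example.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    x: float  # horizontal position on the page
    y: float  # vertical position, assumed to grow downward
    text: str


def reading_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> str:
    """Rebuild text by grouping fragments into lines, then sorting left to right."""
    lines: List[List[Fragment]] = []
    current_y = None
    # Sort top to bottom first so fragments on the same line end up adjacent.
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if current_y is None or abs(frag.y - current_y) > line_tolerance:
            # Vertical jump larger than the tolerance: start a new line.
            lines.append([])
            current_y = frag.y
        lines[-1].append(frag)
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines
    )


# Stored order: all regular-font fragments first, the italic one last.
stored = [
    Fragment(0, 0, "The"),
    Fragment(80, 0, "fox"),
    Fragment(0, 12, "jumps"),
    Fragment(30, 0, "quick"),
]
print(reading_order(stored))
```

Real pages have columns, tables, rotated text and fragments split in the middle of words, which is why a simple sort like this one breaks down quickly and good converters are rare.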

So I guess all you can do is try some more PDF-to-text converters (some are better than others, and some are better only for certain kinds of input), or see whether you can get the text from a source other than the PDF files.
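Trying several converters in turn can be automated. Here is a minimal sketch in Python, assuming each converter is wrapped in a function that takes a path and returns the extracted text. The names `run_pdftotext` and `first_nonblank` are invented for this example. Only pdftotext with its `-layout` and `-raw` modes is shown, and you would add wrappers for whatever other tools you have installed.

```python
import subprocess
from typing import Callable, Iterable, Optional


def run_pdftotext(pdf_path: str, *flags: str) -> str:
    """Run pdftotext with optional flags; return stdout, or '' on failure."""
    result = subprocess.run(
        ["pdftotext", *flags, pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.stdout if result.returncode == 0 else ""


def first_nonblank(
    pdf_path: str, extractors: Iterable[Callable[[str], str]]
) -> Optional[str]:
    """Return the first extractor result that is not only whitespace."""
    for extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            # A missing or crashing tool should not stop the remaining attempts.
            continue
        if text.strip():
            return text
    return None


# Candidate extractors, tried in this order.
EXTRACTORS = [
    lambda p: run_pdftotext(p),
    lambda p: run_pdftotext(p, "-layout"),
    lambda p: run_pdftotext(p, "-raw"),
]
```

Calling `first_nonblank("some.pdf", EXTRACTORS)` returns `None` when every attempt comes back blank, which is a hint that the file needs OCR or that the text has to come from another source.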

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow