Question

I'm trying to run pdftotext from Python, but my code isn't working. When I run the code below, I expect the content variable to contain the text of the PDF, but all I get is an empty string.

Does anybody know what I'm missing?

import subprocess

def getPDFContent(path):
    path = "/path/to/a valid/pdffile.pdf"

    # stderr is merged into stdout here, so err is always None
    process = subprocess.Popen(["pdftotext", path], shell=False,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    content, err = process.communicate()[0:2]
    return content, err

Solution

By default, pdftotext writes nothing to stdout; instead, it creates a .txt file with the same base name as the PDF. To get the text on stdout, pass - as the second argument (the output file name) in the call to pdftotext:

process = subprocess.Popen(["pdftotext", path, "-"], shell=False, 
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
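
For reference, here is a minimal sketch of the whole function with that fix applied. It assumes Python 3.5 or later (for subprocess.run) and assumes pdftotext emits UTF-8, which may not hold for every build. The name get_pdf_content and the executable parameter are made up for illustration; they are not part of the original code.

```python
import subprocess


def get_pdf_content(path, executable="pdftotext"):
    """Return the text that pdftotext extracts from the PDF at `path`.

    `executable` is a hypothetical parameter added for illustration so the
    binary can be swapped out (for example, in tests).
    """
    # The trailing "-" tells pdftotext to write the text to stdout
    # instead of creating a .txt file next to the PDF.
    result = subprocess.run(
        [executable, path, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # kept separate so errors don't mix with the text
        check=True,              # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping stderr in its own pipe means a failure surfaces as a CalledProcessError (with the message in its stderr attribute) rather than being mixed into the extracted text.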
Licensed under: CC-BY-SA with attribution