PDF to PNG in Python with pdf2cairo

https://stackoverflow.com/questions/17231267

01-06-2022

Question

I have been looking for a good PDF-to-image converter for a long time. I need to convert the PDF to an image in order to print it with Qt. I'm programming in Python/PySide, so if I can convert the PDF to a series of (PNG) images using subprocess, I can print them without problems.

I managed to do this by calling convert.exe from ImageMagick. It works quite well, but it relies on Ghostscript, a big package that I want to avoid because it is more complex to integrate.

I also tried MuPDF (from the makers of Ghostscript), but it does not seem to have stdin and stdout options. That's a pity, because I first have to save my file, open it with MuPDF, convert and save it, and then reload the result in my Python application. It should be possible without all those steps!

Today I started experimenting with Poppler's pdftocairo. I assumed it could convert my (multi-page) PDF to a series of images and pipe them to stdout. Unfortunately it doesn't, and I ran into two problems:

1. It complains that it can only export to stdout when you also use the -singlefile argument. How can I export all pages to stdout?
2. When I export to stdout I get the error: 'Error opening output file fd://0.png\r\n'

Converting a PDF from stdin to image files is no problem at all.

This is my code which also triggers the error about opening the output file:

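import subprocess

# Feed the PDF on stdin and ask for a single PNG on stdout (the two '-' arguments).
with open('test.pdf', 'rb') as pdf:
    p = subprocess.Popen(['pop/pdftocairo.exe', '-singlefile', '-png', '-', '-'],
                         stdin=pdf, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() drains both pipes, so neither one can fill up and block.
    data, error = p.communicate()
print(error)
print(data)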

I downloaded pdftocairo pre-compiled from http://blog.alivate.com.au/poppler-windows/ and the documentation of its command-line options can be found here: http://manpages.ubuntu.com/manpages/precise/man1/pdftocairo.1.html

Hopefully you can help me out to make this work!

Update: As the answers below show, pdftocairo is buggy and does not work correctly when writing to stdout. pdftoppm does work; it returns the converted pages as a single bytes object:

with open('test.pdf', 'rb') as pdf:
    p = subprocess.Popen(['pop/pdftoppm.exe', '-png'],
                         stdin=pdf, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    data, error = p.communicate()

The only thing I still need to do is split the byte object into multiple files.
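One possible way to do that split, as a rough sketch: it assumes pdftoppm writes the pages to stdout as complete PNG files placed back to back, which is worth checking against your own output. Every PNG file starts with a fixed 8-byte signature and ends with an IEND chunk, so the stream can be walked chunk by chunk. The name split_png_stream is made up for this example and is not part of Poppler or any library.

```python
import struct

# Fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def split_png_stream(data):
    """Split bytes holding back-to-back PNG files into a list of PNG byte strings.

    Hypothetical helper: assumes `data` contains nothing but complete PNG files.
    """
    images = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 8] != PNG_SIGNATURE:
            raise ValueError("expected a PNG signature at offset %d" % pos)
        start = pos
        pos += 8
        while True:
            if pos + 8 > len(data):
                raise ValueError("truncated PNG chunk header")
            # Each chunk: 4-byte big-endian payload length, 4-byte type, payload, 4-byte CRC.
            length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
            pos += 8 + length + 4
            if pos > len(data):
                raise ValueError("truncated PNG chunk")
            if chunk_type == b"IEND":
                break  # IEND is always the last chunk of a PNG file
        images.append(data[start:pos])
    return images
```

Each element of the returned list is a complete PNG, so it can be written to its own file (opened in 'wb' mode) or handed on in memory. The sketch does not verify the chunk CRCs; it only uses the length fields to find where each image ends.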

Solution

It's a bug in pdftocairo.

The output filename is first passed to getOutputFilename, which returns the special string fd://0 as placeholder for stdout.

But later that string is passed to getImageFilename, which unconditionally adds an extension to the filename. As a result the later comparison against fd://0 fails, and the program tries to open the literal file fd://0.png instead of using stdout.
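To make the sequence concrete, here is a small Python model of the logic described above. It is only an illustration and not Poppler's actual code (which is C++): the function names mirror the ones mentioned in this answer, while resolve_target and the returned strings are invented for the sketch.

```python
# Placeholder string that stands for stdout, as described above.
STDOUT_PLACEHOLDER = "fd://0"


def get_output_filename(arg):
    # '-' on the command line means "write to stdout".
    return STDOUT_PLACEHOLDER if arg == "-" else arg


def get_image_filename(output_name, extension):
    # The extension is appended unconditionally, even to the placeholder.
    return output_name + "." + extension


def resolve_target(arg, extension):
    # Hypothetical helper modelling the final "stdout or file?" decision.
    name = get_image_filename(get_output_filename(arg), extension)
    if name == STDOUT_PLACEHOLDER:  # never true once the extension was added
        return "stdout"
    return "file: " + name


print(resolve_target("-", "png"))  # prints "file: fd://0.png"
```

In this model the stdout branch can never be reached for image output, which matches the error message quoted in the question.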

Unfortunately, the only thing you can do is file a bug report.

As for exporting a multipage document to stdout, that's not supported at all, and it wouldn't work with file types like PNG or JPEG anyway, because these formats don't support multipage documents. It does work for SVG, PDF, EPS and PS output files, as these formats do support multipage documents (and the filename is processed correctly for them).

Other tips

I thought it would be easier to just use os.system and pass the whole command string. This assumes there are "pdfs" and "imgs" folders; change accordingly.

import glob
import os

for pdf_file in glob.glob(os.path.join("pdfs", "*.pdf")):
    # Output root is imgs/<PDF name without extension>.
    out_root = os.path.join("imgs", os.path.splitext(os.path.basename(pdf_file))[0])
    cmd_str = 'pdftocairo.exe -jpeg "%s" "%s"' % (pdf_file, out_root)
    print(cmd_str)
    os.system(cmd_str)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow