Question

I've written a script in Python that will split a .pdf by chapter/bookmark. Here is essentially the crux of my program:

import os

for start, end in chapters:
    os.system(f'pdftk A=file.pdf cat A{start}-{end} output file2.pdf')

The toolkit works beautifully, but invoking it over and over is obviously not time-efficient. Parsing a 200 MB .pdf takes a solid 15-20 seconds, and repeating that for some 30 individual chapters adds up quickly. More time is spent opening the file than actually writing any data.

Since there doesn't seem to be a built-in way to chain multiple commands within the toolkit, is there any memory trickery I can pull in Python or the CMD that will let me get around this (i.e. keep the .pdf open)? I'll look at another module, too, if you can suggest one (pyPdf has its own slew of problems, though).


Solution

To keep the pdf file in memory, read its bytes into a buffer once and tell pdftk to read from stdin (pdftk accepts - in place of an input filename to mean standard input). Specifically: use the subprocess module instead of os.system and pass the buffered bytes as the input argument. Note that a StringIO buffer can't be handed directly to subprocess as stdin, because subprocess needs a real file descriptor, so feed the bytes through a pipe instead:

import subprocess

pdf_bytes = open('file.pdf', 'rb').read()  # parse from disk only once
subprocess.run(['pdftk', '-', 'cat', '1-10', 'output', 'file2.pdf'], input=pdf_bytes)
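
Looping over chapters then reuses the in-memory bytes on every invocation; a minimal sketch, where the chapter ranges and output names are invented for illustration:

for start, end in [(1, 40), (41, 75)]:  # hypothetical chapter ranges
    subprocess.run(['pdftk', '-', 'cat', f'{start}-{end}',
                    'output', f'chapter_{start}-{end}.pdf'],
                   input=pdf_bytes)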

It will still need to parse the pdf file anew each time, but at least you won't be spinning your hard drive more than you have to. The only really fast way is to use a tool that can do it in one pass (e.g., solve whatever problems you have with pypdf).
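
For reference, a one-pass split with the modern pypdf package (pip install pypdf) might look like the sketch below. The chapter ranges and output names are assumptions for illustration; note pypdf uses 0-based page indices, unlike pdftk's 1-based ranges.

from pypdf import PdfReader, PdfWriter

reader = PdfReader('file.pdf')  # the file is parsed only once
for i, (start, end) in enumerate([(0, 40), (40, 75)], start=1):  # hypothetical 0-based ranges
    writer = PdfWriter()
    for page in reader.pages[start:end]:
        writer.add_page(page)
    with open(f'chapter_{i}.pdf', 'wb') as out:
        writer.write(out)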

OTHER TIPS

If, for example, you have an input.pdf with 20,000 pages and want to split it into files 1.pdf through 20.pdf with 1,000 pages each:

for (( i=0; i<20; i++ )); do n=$(( i*1000 + 1 )); m=$(( (i+1)*1000 )); pdftk input.pdf cat $n-$m output $((i+1)).pdf; done
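
If the page count isn't an exact multiple of 1,000, pdftk's end keyword can pick up the remainder in one final range (the output name here is illustrative):

pdftk input.pdf cat 20001-end output 21.pdf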
Licensed under: CC-BY-SA with attribution