How to split a pdf document containing multiple logical pages on each sheet?

https://stackoverflow.com/questions/14773650

07-03-2022

Question

I want to split a 2x2 pdf document into its original pages. Each page consists of four logical pages which are arranged like in this example.

I'm trying to use Python and pyPdf:

import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader

def ifel(condition, trueVal, falseVal):
    if condition:
        return trueVal
    else:
        return falseVal

input  = PdfFileReader(file(sys.argv[1], "rb"))
output = PdfFileWriter()

for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    (w, h) = p.mediaBox.upperRight

    for j in range(0,4):
        t = copy.copy(p)        
        t.mediaBox.lowerLeft  = (ifel(j%2==1, w/2, 0), ifel(j<2, h/2, 0))
        t.mediaBox.upperRight = (ifel(j%2==0, w/2, w), ifel(j>1, h/2, h))
        output.addPage(t)

output.write(file("out.pdf", "wb"))

Unfortunately, this script does not work as intended because it outputs every fourth logical page four times. As I haven't written anything in python before, I think it's a very basic problem, presumably due to the copy operation. I would really appreciate any help.

Edit: Well, I have done some experiments. I inserted the page width and height manually like in the following:

import copy, sys
from pyPdf import PdfFileWriter, PdfFileReader

def ifel(condition, trueVal, falseVal):
    if condition:
        return trueVal
    else:
        return falseVal

input  = PdfFileReader(file(sys.argv[1], "rb"))
output = PdfFileWriter()

for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    (w, h) = p.mediaBox.upperRight

    for j in range(0,4):
        t = copy.copy(p)        
        t.mediaBox.lowerLeft  = (ifel(j%2==1, 841/2, 0),   ifel(j<2, 595/2, 0))
        t.mediaBox.upperRight = (ifel(j%2==0, 841/2, 841), ifel(j>1, 595/2, 595))
        output.addPage(t)

output.write(file("out.pdf", "wb"))

This code leads to the same wrong result as my original one, but if I now comment out the line (w, h) = p.mediaBox.upperRight, everything works! I can't find any reason for this. The tuple (w, h) is not even used anymore, so how can removing its definition change anything?

Solution

I suspect the problem is that mediaBox is only a magic accessor for a variable that is shared between p and all of its copies t. Therefore, each assignment to t.mediaBox changes that one shared object, and all four copies end up with the same coordinates (those of the last assignment).

The variable behind the mediaBox field is created lazily on the first access to mediaBox. If you comment out the line (w, h) = p.mediaBox.upperRight, p's mediaBox is never accessed before the copies are made, so a separate mediaBox variable is created for each t.
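The effect described above can be reproduced without pyPdf. The following is a minimal sketch, assuming the behavior is as suspected here; Page and Box are hypothetical stand-in classes, not pyPdf's own, and only imitate a lazily created, mutable box behind an accessor:

```python
import copy


class Box:
    """Hypothetical stand-in for a mutable rectangle (not pyPdf's class)."""

    def __init__(self, coords):
        self.coords = list(coords)


class Page:
    """Hypothetical stand-in for a page whose box is created on first access."""

    def __init__(self, raw):
        self._raw = tuple(raw)  # immutable source data
        self._box = None        # cache, filled lazily by the accessor

    @property
    def box(self):
        if self._box is None:
            self._box = Box(self._raw)
        return self._box


def split_four(page, touch_first):
    """Make four shallow copies and write a different value into each box."""
    if touch_first:
        page.box  # first access creates the box on the original page
    copies = [copy.copy(page) for _ in range(4)]
    for j, t in enumerate(copies):
        t.box.coords[0] = j
    return [t.box.coords[0] for t in copies]


# Box created before copying: all shallow copies share it, last write wins.
shared = split_four(Page((0, 0, 841, 595)), touch_first=True)
# Box not created before copying: each copy lazily builds its own box.
separate = split_four(Page((0, 0, 841, 595)), touch_first=False)

print(shared)    # [3, 3, 3, 3]
print(separate)  # [0, 1, 2, 3]
```

The first result matches the symptom in the question: every copy shows the box that was assigned last.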

Two possible solutions for automatically determining the page dimensions:

Get the dimensions after making the copy:

for p in [input.getPage(i) for i in range(0,input.getNumPages())]:

    for j in range(0,4):
        t = copy.copy(p)       
        (w, h) = t.mediaBox.upperRight
        t.mediaBox.lowerLeft  = (ifel(j%2==1, w/2, 0),   ifel(j<2, h/2, 0))
        t.mediaBox.upperRight = (ifel(j%2==0, w/2, w), ifel(j>1, h/2, h))
        output.addPage(t)

Instantiate fresh RectangleObjects to use for the mediaBox variables:

for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
    (w, h) = p.mediaBox.upperRight

    for j in range(0,4):
        t = copy.copy(p)        
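        # Assign a fresh rectangle to the whole mediaBox (not to lowerLeft).
        # RectangleObject takes one four-element list: [left, bottom, right, top].
        # This needs "import pyPdf" in addition to the imports in the question.
        t.mediaBox = pyPdf.generic.RectangleObject([
                         ifel(j%2==1, w/2, 0),
                         ifel(j<2, h/2, 0),
                         ifel(j%2==0, w/2, w),
                         ifel(j>1, h/2, h)])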
        output.addPage(t)
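The coordinate arithmetic shared by both variants can be checked on its own, independent of pyPdf. This is a small sketch in Python 3; quadrant_boxes is a hypothetical helper name, and conditional expressions take the place of the ifel function from the question:

```python
def quadrant_boxes(w, h):
    """Return (left, bottom, right, top) for the four quadrants of a w x h
    sheet, in the order top-left, top-right, bottom-left, bottom-right."""
    boxes = []
    for j in range(4):
        left = w / 2 if j % 2 == 1 else 0    # right column starts at w/2
        bottom = h / 2 if j < 2 else 0       # top row starts at h/2
        right = w / 2 if j % 2 == 0 else w   # left column ends at w/2
        top = h / 2 if j > 1 else h          # bottom row ends at h/2
        boxes.append((left, bottom, right, top))
    return boxes


# Dimensions taken from the question's manual experiment (841 x 595).
for box in quadrant_boxes(841, 595):
    print(box)
```

PDF coordinates have their origin at the lower-left corner, which is why the top row has the larger y values.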

Using copy.deepcopy() instead will cause memory issues for large, complex PDFs.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow