I am processing PDF files with PDFBox and inserting text objects at given coordinates on the pages. The coordinates I receive are relative to the top-left corner, so I look up the media box of the page and calculate the text position from it. However, for some PDFs that consist of scanned images, the inserted text does not end up at the correct location; it is as if the page were much bigger than the media box reports.

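// getX()/getY() return the top-left based coordinates at which the text should be inserted
// getSize() returns the text height
void write(PDDocument doc, PDPage page, PDPageContentStream cs) throws IOException {
    PDRectangle rect = page.findMediaBox();
    float height = rect.getHeight();
    // flip the y coordinate, as PDF user space has its origin at the bottom left
    cs.moveTextPositionByAmount(this.getX(), height - this.getY() - getSize());
}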

The dimensions retrieved from the media box are 595.2 x 841.92. For a text location of 300x420, I therefore expect the text to be inserted near the middle of the page. However, it is inserted far too low and too far to the left. When I open the document in Acrobat Reader and copy the page as an image (since it's scanned), I see that the image dimensions are 2480 x 3508. The location of the inserted text would make sense if the page had those dimensions.

It seems as if the PDF page size changes according to the content of the page, but then why do I still get 595.2 x 841.92 as the page size instead of those dimensions? Should I process every image on the page to find the true dimensions? What am I missing here?

Edit: Sample PDF Document

Edit: This is the code in which I create the PDPageContentStream:

PDDocument doc = PDDocument.load(inputFile);
List<?> allPages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < list.size(); i++) {
    PDFObject obj = (PDFObject) list.get(i);
    for (int j = 0; j < allPages.size(); j++) {
        PDPage page = (PDPage) allPages.get(j);
        PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);
        obj.write(doc, page, contentStream);
        if ("F".equalsIgnoreCase(obj.getPageType())) {
            break;
        }
    }
}

Answer

Unfortunately the OP did not post all relevant code. This answer is therefore partially based on assumptions, in particular that the PDPageContentStream was created without making sure that the default user space coordinate system is still in effect at the position where the new operations are added.

The sample document

The content stream of the first page starts like this:

0.24000 0 0 0.24000 0 0 cm
q
2480 0 0 3508 0 0 cm
/Im5 Do
Q

It first scales the user space coordinate system by .24, saves the graphics state, scales the coordinate system by 2480 (x direction) and 3508 (y direction), draws an image, and finally restores the graphics state.

Consequently, the user space coordinate system is still scaled by .24 afterwards, and coordinates given in the following operations are subject to that factor.
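
As a side note, the factor .24 equals 72/300, which would fit a scan at 300 dpi placed on a page measured in points (1/72 inch). The 300 dpi figure is my assumption; it is not stated anywhere in the document. A minimal sketch of that arithmetic in plain Java (class and method names are made up for illustration):

```java
/** Illustration only: relates scan pixels to PDF points for an assumed resolution. */
class ScanScale {
    static final double POINTS_PER_INCH = 72.0;

    /** Size in points of a run of pixels scanned at the given resolution. */
    static double pixelsToPoints(int pixels, double dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    /** Scale factor a "cm" operation needs so that one unit equals one scan pixel. */
    static double scaleFactor(double dpi) {
        return POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        double dpi = 300; // assumed resolution, inferred from the factor .24
        System.out.println("scale  = " + scaleFactor(dpi));
        System.out.println("width  = " + pixelsToPoints(2480, dpi));
        System.out.println("height = " + pixelsToPoints(3508, dpi));
    }
}
```

With these numbers the image covers the 595.2 x 841.92 media box, so the page size reported by PDFBox is right; only the coordinate system in effect after the image has been drawn is different.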

Immediately following are text objects, e.g. this:

BT
1 0 0 rg
/F0 25 Tf
400 794.9199829102 Td
(JFE14006) Tj
ET 

I assume this is one of the objects added by the OP without taking the non-default user space coordinate system into account, as the coordinates and font size seem appropriate for the default user space coordinate system.
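
To see where such a text object actually ends up, one can apply the .24 scaling to its operands. The following is only a sketch using java.awt.geom.AffineTransform as a stand-in for the PDF matrix arithmetic; the class and method names are hypothetical and nothing here touches PDFBox:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Illustration only: maps coordinates from the scaled user space to default user space. */
class CtmDemo {
    /** Applies a uniform scaling (as set by "s 0 0 s 0 0 cm") to a point. */
    static double[] toDefaultUserSpace(double scale, double x, double y) {
        AffineTransform ctm = AffineTransform.getScaleInstance(scale, scale);
        Point2D p = ctm.transform(new Point2D.Double(x, y), null);
        return new double[] { p.getX(), p.getY() };
    }

    /** Font sizes are subject to the same scaling as coordinates. */
    static double effectiveFontSize(double scale, double fontSize) {
        return scale * fontSize;
    }

    public static void main(String[] args) {
        // Td operands and Tf size taken from the text object quoted above
        double[] p = toDefaultUserSpace(0.24, 400, 794.9199829102);
        System.out.println("x = " + p[0] + ", y = " + p[1]);
        System.out.println("font size = " + effectiveFontSize(0.24, 25));
    }
}
```

The text origin lands at roughly (96, 190.78) in points, measured from the bottom left of a 595.2 x 841.92 page, and the font is drawn at about size 6. That matches the observation that the text appears too far down and to the left.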

(BTW, the referred-to font is not defined in the resources dictionary of the page.)

Solution 1

As the user space coordinate system at the insertion point is scaled by .24, you can counter-scale your own coordinates and sizes (i.e. divide them by .24).

E.g. to draw the text "MIDDLE" at the given text location 300x420 (origin in the upper-left) using a font of size 10, you could do:

PDDocument document = PDDocument.load("0006-sun1-4.pdf");
List<PDPage> allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = allPages.get(0);
PDRectangle pageSize = firstPage.findMediaBox();

PDPageContentStream contentStream = new PDPageContentStream(document, firstPage, true, true);
// text is filled by default, so the non-stroking color determines its color
contentStream.setNonStrokingColor(Color.red);
contentStream.beginText();
contentStream.moveTextPositionByAmount(300/.24f, (pageSize.getUpperRightY() - 420 - 10)/.24f);
contentStream.setFont(PDType1Font.HELVETICA_BOLD, 10/.24f);
contentStream.drawString("MIDDLE");
contentStream.endText();
contentStream.close();

document.save("0006-sun1-4-scaledAdd.pdf");
document.close();
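
If you need this in several places, the arithmetic used in the moveTextPositionByAmount and setFont calls can be factored into a small helper. This is merely a sketch in plain Java under the assumption of a uniform scaling without translation or rotation; the class and method names are made up:

```java
/** Illustration only: counter-scales top-left based positions for a uniformly scaled user space. */
class CounterScale {
    /**
     * @param x          distance from the left page edge, in points
     * @param yFromTop   distance from the top page edge, in points
     * @param textHeight text height, in points
     * @param pageHeight media box height, in points
     * @param scale      uniform scale in effect at the insertion point, e.g. .24f
     * @return operands {x, y} to use inside the scaled coordinate system
     */
    static float[] toScaledUserSpace(float x, float yFromTop, float textHeight,
                                     float pageHeight, float scale) {
        // flip to the bottom-left origin of PDF user space first
        float yFromBottom = pageHeight - yFromTop - textHeight;
        // then divide by the scale to undo its effect
        return new float[] { x / scale, yFromBottom / scale };
    }

    /** Font sizes have to be counter-scaled in the same way. */
    static float toScaledSize(float size, float scale) {
        return size / scale;
    }

    public static void main(String[] args) {
        float[] p = toScaledUserSpace(300, 420, 10, 841.92f, .24f);
        System.out.println("x = " + p[0] + ", y = " + p[1]);
        System.out.println("font size = " + toScaledSize(10, .24f));
    }
}
```

For the values used above this gives operands of about 1250 and 1716.33 and a font size of about 41.67.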

This solution is not optimal, though:

  • as soon as you have another source document (e.g. an updated form), the content stream at the insertion point might have a differently scaled coordinate system;
  • other states of the graphics drawing engine might also not be in their default state expected by you.

Thus:

Solution 2

You can revert all changes to the graphics state by enclosing the existing content stream with a q (save graphics state) and Q (restore graphics state) operator pair.

E.g. as above, to draw the text "MIDDLE" at the given text location 300x420 (origin in the upper-left) using a font of size 10, you could do:

PDDocument document = PDDocument.load("0006-sun1-4.pdf");
List<PDPage> allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = allPages.get(0);
PDRectangle pageSize = firstPage.findMediaBox();

PDStream contents = firstPage.getContents();  
PDFStreamParser parser = new PDFStreamParser(contents.getStream()); 
parser.parse();
List<Object> tokens = parser.getTokens();
tokens.add(0, PDFOperator.getOperator("q"));
tokens.add(PDFOperator.getOperator("Q"));
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
out.close(); // make sure everything is written before the stream is used
firstPage.setContents(updatedStream);

PDPageContentStream contentStream = new PDPageContentStream(document, firstPage, true, true);
// text is filled by default, so the non-stroking color determines its color
contentStream.setNonStrokingColor(Color.red);
contentStream.beginText();
contentStream.moveTextPositionByAmount(300, pageSize.getUpperRightY() - 420 - 10);
contentStream.setFont(PDType1Font.HELVETICA_BOLD, 10);
contentStream.drawString("MIDDLE");
contentStream.endText();
contentStream.close();

document.save("0006-sun1-4-restoredAdd.pdf");
document.close();

(Parsing and rewriting the existing stream is not exactly good style resource-wise, but it is not a real issue for pages which essentially only draw an image.)
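
Why the enclosing q and Q help can be illustrated with a toy model of the graphics state stack. This is only a sketch in plain Java with hypothetical names; it tracks nothing but the current transformation matrix, whereas the real graphics state also holds colors, line widths and so on:

```java
import java.awt.geom.AffineTransform;
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustration only: a graphics state that consists of just the current transformation matrix. */
class GraphicsStateDemo {
    private AffineTransform ctm = new AffineTransform(); // identity = default user space
    private final Deque<AffineTransform> stack = new ArrayDeque<>();

    void save() { stack.push(new AffineTransform(ctm)); } // q
    void restore() { ctm = stack.pop(); }                  // Q

    /** cm: the new matrix is applied to coordinates before the existing one. */
    void concat(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    double scaleX() { return ctm.getScaleX(); }
    boolean isDefault() { return ctm.isIdentity(); }

    /** Replays the start of the sample page, optionally enclosed in q ... Q. */
    static GraphicsStateDemo replay(boolean wrapped) {
        GraphicsStateDemo gs = new GraphicsStateDemo();
        if (wrapped) gs.save();                 // the added q
        gs.concat(0.24, 0, 0, 0.24, 0, 0);
        gs.save();
        gs.concat(2480, 0, 0, 3508, 0, 0);      // the image would be drawn here
        gs.restore();
        if (wrapped) gs.restore();              // the added Q
        return gs;
    }

    public static void main(String[] args) {
        System.out.println("unwrapped, default state: " + replay(false).isDefault());
        System.out.println("wrapped, default state:   " + replay(true).isDefault());
    }
}
```

Without the enclosing pair the .24 scaling is still in effect when new operations are appended; with it, the appended operations start from the default user space again.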

Screenshot of result

Licensed under: CC-BY-SA with attribution