Question

I'm trying to use the iText 5.5 library to manipulate information within a PDF. I want to scan a PDF for attachments and, if it has any, save copies of them to disk (without removing or editing the original file). I'm running into an issue when a PDF has a .joboptions file attached. I'm using the following code:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfString;

public class extractAttachments {

    public extractAttachments(String src, String dir) throws IOException {
        File folder = new File(dir);
        folder.mkdirs();
        PdfReader reader = new PdfReader(src);
        PdfDictionary root = reader.getCatalog();
        PdfDictionary names = root.getAsDict(PdfName.NAMES);
        PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
        PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);
        for (int i = 0; i < filespecs.size(); ) {
            extractAttachment(reader, folder, filespecs.getAsString(i++),
                    filespecs.getAsDict(i++));
        }
    }

    protected void extractAttachment(PdfReader reader, File dir, PdfString name, PdfDictionary filespec)
            throws IOException {
        PRStream stream;
        FileOutputStream fos;
        String filename;
        PdfDictionary refs = filespec.getAsDict(PdfName.EF);
        for (PdfName key : refs.getKeys()) {
            stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));
            filename = filespec.getAsString(key).toString();
            fos = new FileOutputStream(new File(dir, filename));
            fos.write(PdfReader.getStreamBytes(stream));
            fos.flush();
            fos.close();
        }
    }
}

When execution reaches the line PdfArray filespecs = embedded.getAsArray(PdfName.NAMES); the call returns null. I don't care whether the .joboptions file is copied, but I do want the other attachments (if there are any) to be copied. Any ideas on how I can get around this?

Also, if you want to create a PDF with such a .joboptions file, open a PDF document, go to the print menu, and change the printer to "Adobe PDF". Then select Properties, click OK, and in the main print menu click Print. You will be prompted to select a location to save the document, and the new document will have a .joboptions file as an attachment.


Solution

Your code is incomplete: it only understands the most primitive EmbeddedFiles structure. Your sample file has a slightly more complex one, so you need to extend your code to handle such structures as well.

The details

The EmbeddedFiles dictionary is specified to contain a name tree:

EmbeddedFiles name tree (Optional; PDF 1.4) A name tree mapping name strings to file specifications for embedded file streams (see 7.11.4, "Embedded File Streams").

(ISO 32000-1 Table 31 – Entries in the name dictionary)

A name tree shall be constructed of nodes, each of which shall be a dictionary object. Table 36 shows the entries in a node dictionary. The nodes shall be of three kinds, depending on the specific entries they contain. The tree shall always have exactly one root node, which shall contain a single entry: either Kids or Names but not both. If the root node has a Names entry, it shall be the only node in the tree. If it has a Kids entry, each of the remaining nodes shall be either an intermediate node, that shall contain a Limits entry and a Kids entry, or a leaf node, that shall contain a Limits entry and a Names entry.

(ISO 32000-1 Section 7.9.6 - Name Trees)

Your code only understands the variety in which the root node has a Names entry and, therefore, is the only node in the tree:

...
PdfDictionary embedded = names.getAsDict(PdfName.EMBEDDEDFILES);
PdfArray filespecs = embedded.getAsArray(PdfName.NAMES);
...

In your sample PDF file, on the other hand, the EmbeddedFiles dictionary has a Kids entry instead of a Names entry, so embedded.getAsArray(PdfName.NAMES) returns null and your code cannot process the file:

[Image: sample PDF structure with EmbeddedFiles expanded]
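
The recursion needed to cover both layouts (root with Names, or root with Kids leading to leaf nodes) can be sketched independently of iText. This is only a minimal sketch under stated assumptions: NameTreeNode and NameTreeWalker are hypothetical stand-ins, not iText classes, and file specifications are represented as plain strings. In iText 5 terms, the names and kids fields would roughly correspond to what getAsArray(PdfName.NAMES) and getAsArray(PdfName.KIDS) return for a node dictionary, either of which may be null.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a PDF name tree node dictionary (not an iText class). */
class NameTreeNode {
    // Flat [key1, value1, key2, value2, ...] list, modelled on a /Names array; null if absent.
    final List<String> names;
    // Child nodes, modelled on a /Kids array; null if absent.
    final List<NameTreeNode> kids;

    NameTreeNode(List<String> names, List<NameTreeNode> kids) {
        this.names = names;
        this.kids = kids;
    }

    /** Builds a node that only has a Names entry. */
    static NameTreeNode leaf(String... keyValuePairs) {
        return new NameTreeNode(Arrays.asList(keyValuePairs), null);
    }

    /** Builds a node that only has a Kids entry. */
    static NameTreeNode parent(NameTreeNode... children) {
        return new NameTreeNode(null, Arrays.asList(children));
    }
}

/** Hypothetical helper that flattens a name tree into a name-to-value map. */
class NameTreeWalker {
    static Map<String, String> collect(NameTreeNode root) {
        Map<String, String> result = new LinkedHashMap<>();
        collectInto(root, result);
        return result;
    }

    private static void collectInto(NameTreeNode node, Map<String, String> result) {
        if (node == null) {
            return;
        }
        // A node with a Names entry contributes its key/value pairs directly.
        if (node.names != null) {
            for (int i = 0; i + 1 < node.names.size(); i += 2) {
                result.put(node.names.get(i), node.names.get(i + 1));
            }
        }
        // A node with a Kids entry is descended into recursively.
        if (node.kids != null) {
            for (NameTreeNode kid : node.kids) {
                collectInto(kid, result);
            }
        }
    }
}
```

Applied to the extraction code in the question, the same idea would mean calling extractAttachment for every name/filespec pair found in any node's Names array, instead of reading Names from the root node only.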

Licensed under: CC-BY-SA with attribution