Question

I've been trying for a while to extract the PDF documents contained in a PDF package, with no success. I've found no documentation or example code anywhere, but I know it's possible because the Adobe Reader and PDFExpert apps support it. They may have their own parser; I hope it doesn't come to that...

Any hint that points me in the right direction will be greatly appreciated.

Edit: after a long time I went back to working on this and finally figured it out. Special thanks to iPDFDev for pointing me in the right direction!!

Here's the code to obtain each inner CGPDFDocumentRef:

NSURL *url = [NSURL fileURLWithPath:filePath isDirectory:NO];
CGPDFDocumentRef pdf = CGPDFDocumentCreateWithURL((__bridge CFURLRef)url);
CGPDFDictionaryRef catalog = CGPDFDocumentGetCatalog(pdf);

CGPDFDictionaryRef names = NULL;
if (CGPDFDictionaryGetDictionary(catalog, "Names", &names)) {
    CGPDFDictionaryRef embFiles = NULL;
    if (CGPDFDictionaryGetDictionary(names, "EmbeddedFiles", &embFiles)) {
        // At this point you know the file has embedded files
        // (a Package/Portfolio additionally has a /Collection entry in the catalog)

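        // In a flat tree the root node holds /Names directly. Larger trees
        // split the entries across /Kids nodes, which this code does not walk.
        CGPDFArrayRef nameArray = NULL;
        CGPDFDictionaryGetArray(embFiles, "Names", &nameArray);

        // nameArray alternates between a name string and the file
        // specification dictionary from which the pdf can be extracted

        size_t count = nameArray ? CGPDFArrayGetCount(nameArray) : 0;
        for (size_t i = 0; i + 1 < count; i += 2) {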
            CGPDFStringRef name = NULL;
            CGPDFDictionaryRef dict = NULL;

            if (CGPDFArrayGetString(nameArray, i, &name) &&
                CGPDFArrayGetDictionary(nameArray, i+1, &dict)) {
                NSString *_name = [self convertPDFString:name];

                CGPDFDictionaryRef EF = NULL;
                if (CGPDFDictionaryGetDictionary(dict, "EF", &EF)) {
                    CGPDFStreamRef F = NULL;
                    if (CGPDFDictionaryGetStream(EF, "F", &F)) {
                        // format reports how the returned bytes are encoded
                        CGPDFDataFormat format;
                        CFDataRef data = CGPDFStreamCopyData(F, &format);
                        if (data) {
                            CGDataProviderRef provider = CGDataProviderCreateWithCFData(data);

                            // NULL if the embedded file is not a PDF
                            CGPDFDocumentRef _doc = CGPDFDocumentCreateWithProvider(provider);
                            if (_doc) {
                                // save the docRef somewhere (_doc); you own it, so
                                // call CGPDFDocumentRelease when you are done with it
                                // save the pdf name somewhere (_name)
                            }

                            CGDataProviderRelease(provider);
                            CFRelease(data);
                        }
                    }
                }
            }
        }
    }
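}

// The inner documents were built from copied data, so the outer one can be released
CGPDFDocumentRelease(pdf);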

- (NSString *)convertPDFString:(CGPDFStringRef)string {
    // CGPDFStringCopyTextString follows the Copy rule, so hand ownership to ARC.
    // CFBridgingRelease turns a NULL result into nil instead of raising an exception.
    return CFBridgingRelease(CGPDFStringCopyTextString(string));
}

Solution

By PDF packages I assume you mean PDF portfolios. The files in a PDF portfolio are essentially document attachments with some extended attributes, and they are located in the EmbeddedFiles tree. Start with the document catalog dictionary and retrieve the /Names dictionary from it. From the /Names dictionary, if it exists (it is optional), retrieve the /EmbeddedFiles dictionary. If that exists, it is the root of the embedded files tree (a name tree in the PDF specification).
Section 7.9.6 of the PDF specification (available here: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf) describes name trees and gives you the idea of how to parse one. A node holds either a /Names array of alternating key/value pairs or a /Kids array of child nodes.
The tree maps string identifiers to file specification dictionaries (section 7.11.3). From the file specification dictionary, retrieve the value of the /EF key, a dictionary whose /F entry is the embedded file stream (section 7.11.4). The data of that stream is the file content you're looking for.
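The name-tree walk from section 7.9.6 can be sketched in plain C, which also compiles as Objective-C. This is a simplified in-memory model, not CoreGraphics code: the NameTreeNode struct, the NameTreeVisitor callback and the name_tree_visit function are hypothetical names invented for illustration, and the file specification dictionaries are stood in for by plain strings. It only shows the recursion: visit the key/value pairs of a node's /Names array, then descend into its /Kids.

```c
#include <stddef.h>

/* Hypothetical model of one name tree node (ISO 32000-1, section 7.9.6).
   A real node is a PDF dictionary; here /Kids and /Names are plain arrays. */
typedef struct NameTreeNode {
    const struct NameTreeNode *const *kids; /* models /Kids, NULL if absent */
    size_t kidCount;
    const char *const *names; /* models /Names: key, value, key, value, ... */
    size_t nameCount;         /* number of array elements, i.e. 2 * pairs */
} NameTreeNode;

/* Called once for every (key, value) pair found in the tree. */
typedef void (*NameTreeVisitor)(const char *key, const char *value, void *context);

/* Depth-first walk over the tree. Returns the number of pairs visited.
   The visitor may be NULL if only the count is wanted. */
size_t name_tree_visit(const NameTreeNode *node, NameTreeVisitor visit, void *context)
{
    size_t visited = 0;
    if (node == NULL) {
        return 0;
    }

    /* Leaf node (or a root without kids): entries come in pairs.
       The i + 1 bound skips a dangling key in a malformed array. */
    for (size_t i = 0; i + 1 < node->nameCount; i += 2) {
        if (visit != NULL) {
            visit(node->names[i], node->names[i + 1], context);
        }
        visited++;
    }

    /* Root or intermediate node: recurse into every child. */
    for (size_t k = 0; k < node->kidCount; k++) {
        visited += name_tree_visit(node->kids[k], visit, context);
    }
    return visited;
}
```

With CoreGraphics, the same recursion would presumably call CGPDFDictionaryGetArray for "Kids" and "Names" on each node dictionary, and read the children and pairs with CGPDFArrayGetDictionary and CGPDFArrayGetString.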

Licensed under: CC-BY-SA with attribution