Parsing PDF documents with Quartz

https://stackoverflow.com/questions/18281187

24-06-2022

Question

I am trying to parse a PDF document with the Quartz framework and have copied the code snippets from the Apple documentation into my source code. Unfortunately, it does not retrieve any data: it just iterates over the pages, logs the number of the current page to the console, and crashes at the end. Do you have any idea what is wrong with the code?

static void op_MP (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    printf("MP /%s\n", name);
}

static void op_DP (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    NSLog(@"DP /%s\n", name);
}

static void op_BMC (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    NSLog(@"BMC /%s\n", name);
}

static void op_BDC (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    NSLog(@"BDC /%s\n", name);
}

static void op_EMC (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    NSLog(@"EMC /%s\n", name);
}

static void op_TJ (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    NSLog(@"TJ /%s\n", name);
}

- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions
{
    CGPDFDocumentRef myDocument;
    NSString *urlAddress = [[NSBundle mainBundle] pathForResource:@"test" ofType:@"pdf"];
    NSURL *fileUrl = [NSURL fileURLWithPath:urlAddress];
    CFURLRef url = (__bridge CFURLRef)fileUrl;
    myDocument = CGPDFDocumentCreateWithURL(url);

    CFRelease (url);

    if (myDocument == NULL) {
        NSLog(@"can't open `%@'.", fileUrl);
    }
    if (!CGPDFDocumentIsUnlocked (myDocument)) {
        CGPDFDocumentRelease(myDocument);
    }
    else if (CGPDFDocumentGetNumberOfPages(myDocument) == 0) {
        CGPDFDocumentRelease(myDocument);
    }
    else {
        CGPDFOperatorTableRef myTable;
        myTable = CGPDFOperatorTableCreate();

        CGPDFOperatorTableSetCallback (myTable, "MP", &op_MP);
        CGPDFOperatorTableSetCallback (myTable, "DP", &op_DP);
        CGPDFOperatorTableSetCallback (myTable, "BMC", &op_BMC);
        CGPDFOperatorTableSetCallback (myTable, "BDC", &op_BDC);
        CGPDFOperatorTableSetCallback (myTable, "EMC", &op_EMC);
        CGPDFOperatorTableSetCallback (myTable, "Tj", &op_TJ);

        int k;
        CGPDFPageRef myPage;
        CGPDFScannerRef myScanner;
        CGPDFContentStreamRef myContentStream;

        int numOfPages = CGPDFDocumentGetNumberOfPages (myDocument);
        for (k = 0; k < numOfPages; k++) {
            myPage = CGPDFDocumentGetPage (myDocument, k + 1 );
            myContentStream = CGPDFContentStreamCreateWithPage (myPage);
            myScanner = CGPDFScannerCreate (myContentStream, myTable, NULL);
            CGPDFScannerScan (myScanner);
            CGPDFPageRelease (myPage);
            CGPDFScannerRelease (myScanner);
            CGPDFContentStreamRelease (myContentStream);
            NSLog(@"processed page %i",k);
        }
        CGPDFOperatorTableRelease(myTable);
        CGPDFDocumentRelease(myDocument);
    }

    return YES;
}

Solution

I did not run the code, but the first five operators (MP, DP, BMC, BDC, EMC) might not occur in your page content at all. The callbacks also pop the wrong operands: some of these operators take a name operand, EMC takes no operands at all, and Tj takes a string operand, not a name. Since every callback returns early when CGPDFScannerPopName fails, nothing gets logged.
Remove all the CGPDFScannerPopName calls and keep only the logging, and you might get some output. Then look in the PDF specification to see the exact operands for each operator and update your code accordingly.
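
As a rough illustration of what "update your code accordingly" could look like, here is a minimal sketch of three of the callbacks with operands matching the PDF specification: BDC takes a tag name plus a property list, EMC takes nothing, and Tj takes a string. It is untested against the asker's file. The name op_Tj is only a renaming chosen here to match the operator, and the property list of BDC is popped as a generic object on the assumption that it may be either an inline dictionary or a name.

```objc
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>

// BDC: operands are "tag properties". The scanner hands operands back in
// reverse order, so the property list is popped first, then the tag name.
static void op_BDC(CGPDFScannerRef s, void *info)
{
    CGPDFObjectRef properties;  // assumed: inline dictionary or a name
    const char *tag;

    if (!CGPDFScannerPopObject(s, &properties))
        return;
    if (!CGPDFScannerPopName(s, &tag))
        return;

    NSLog(@"BDC /%s", tag);
}

// EMC: no operands, so there is nothing to pop.
static void op_EMC(CGPDFScannerRef s, void *info)
{
    NSLog(@"EMC");
}

// Tj: a single string operand.
static void op_Tj(CGPDFScannerRef s, void *info)
{
    CGPDFStringRef string;

    if (!CGPDFScannerPopString(s, &string))
        return;

    // CGPDFStringCopyTextString follows the Copy rule; CFBridgingRelease
    // hands the result over to ARC.
    NSString *text = CFBridgingRelease(CGPDFStringCopyTextString(string));
    NSLog(@"Tj (%@)", text);
}
```

These would be registered the same way as in the question, for example CGPDFOperatorTableSetCallback (myTable, "Tj", &op_Tj);. Note that the uppercase TJ operator is a different operator that takes an array of strings and numbers, so it would need CGPDFScannerPopArray instead.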

Other suggestions

While I can't offer a fix for the crash in your example code, the last time we needed to do this we based our parser on PDFKitten.

https://github.com/KurtCode/PDFKitten

If you are interested in the parsing code, the interesting stuff is located in Scanner.m:

https://github.com/KurtCode/PDFKitten/blob/master/PDFKitten/Scanner.m

Given the complexity of PDF parsing, I would suggest using this library as a base and building from there. If you need a polished implementation on a deadline, then PSPDFKit is probably the most mature (but expensive) package.

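The crash is caused by CFRelease(url). A plain __bridge cast does not transfer ownership, so url is still managed by ARC and releasing it manually over-releases the object. Delete that line and it will be okay. From the Clang ARC documentation: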

"(__bridge T) op casts the operand to the destination type T. If T is a retainable object pointer type, then op must have a non-retainable pointer type."

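To make the ownership point concrete, here is a small sketch of opening the document without the stray CFRelease. OpenBundledPDF is a hypothetical helper name introduced only for this illustration; the calls inside it are the ones already used in the question. The commented-out variant shows the alternative of explicitly taking ownership with __bridge_retained, in which case the CFRelease is required.

```objc
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>

// Hypothetical helper, not part of the question's code. The caller owns the
// returned document and must balance it with CGPDFDocumentRelease.
static CGPDFDocumentRef OpenBundledPDF(NSString *name)
{
    NSString *path = [[NSBundle mainBundle] pathForResource:name ofType:@"pdf"];
    if (path == nil)
        return NULL;  // resource not found in the bundle

    NSURL *fileUrl = [NSURL fileURLWithPath:path];

    // Option A: plain __bridge. No ownership is transferred, ARC keeps
    // managing fileUrl, and there must be no CFRelease.
    CGPDFDocumentRef doc = CGPDFDocumentCreateWithURL((__bridge CFURLRef)fileUrl);

    // Option B: take ownership explicitly, then balance it yourself.
    // CFURLRef url = (__bridge_retained CFURLRef)fileUrl;
    // CGPDFDocumentRef doc = CGPDFDocumentCreateWithURL(url);
    // CFRelease(url);

    return doc;
}
```

With a helper like this, the question's code would start with CGPDFDocumentRef myDocument = OpenBundledPDF(@"test"); and return early when the result is NULL, before calling CGPDFDocumentIsUnlocked.
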
Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow