Possible bug with load() and parse() methods in PDFBox?

https://stackoverflow.com//questions/20004290

java
pdfbox

20-12-2019
|

Domanda

I tried to use PDFBox on regular .pdf files and it worked fine.

However when I encountered a corrupted .pdf , the code would "freeze" .. not throwing errors or something .. simply the load or parse function take forever to execute

Here is the corrupted file (i have zipped it so that everybody could download it), it is probably not a native pdf file but it was saved as a .pdf extension and it is only 4 Kb.

I am not an expert at all, but I think that this is a bug with PDFBox. According to documentation, both load() and parse() methods are supposed to throw exceptions if they fail. However in case with my file, the code would take forever to execute and not throw exception.

I tried using only load, one can try parse() .. the result is the same

import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class TestTest {

    public static void main(String[] args) throws FileNotFoundException, IOException {
        System.out.println(pdfToText("C:\\..............MYFILE.pdf")); 
        System.out.println("done ! ! !");
    }
    private static String pdfToText(String fileName) throws IOException {
        PDDocument document = null;
        document = PDDocument.load(new File(fileName)); // THIS TAKES FOREVER
        PDFTextStripper stripper = new PDFTextStripper();
        document.close();
        return stripper.getText(document);
    }
}

How to force this code throw an exception or stop executing if the .pdf file is corrupted? Thanks

Soluzione

For implementing simple timeouts for 3rd party libs I often use an implementation like Apache Commons ThreadMonitor:

long timeoutInMillis = 1000;

try {
    Thread monitor = ThreadMonitor.start(timeoutInMillis);  
    // do some work here
    ThreadMonitor.stop(monitor);
} catch (InterruptedException e) {
    // timed amount was reached
}

Example code is from Apache's ThreadMonitor Javadoc. I only use this when the 3rd party API does not provide some timeout mechanism, of course.

However I was forced to tweak this a bit some weeks ago, because this solution does not work well with (3rd party) code that is using Exception masking.

In particular we run into problems with c3p0 which masks all Exceptions (and in particular InterruptedExceptions). Our solution was to tweak the implementation to also check the exception's cause chain for InterruptedExceptions.

Altri suggerimenti

Try this solution:

private static String pdfToText(String fileName) {
    PDDocument document = null;
    try {
        document = PDDocument.load(fileName);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(document);
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
        return null;
    } finally {
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow