Question

I am using iText v5.4.2 and trying to parse images from a PDF file. I get a NullPointerException for certain images in certain PDF files. A PDF file with one "faulty" image can be downloaded here: https://dl.dropboxusercontent.com/u/3585277/LZW_Error.pdf

Here is a simple demo:

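import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class LZWDecodeDemo {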

    public static void main(String[] args) throws Exception {
        LZWDecodeDemo demo = new LZWDecodeDemo();
        demo.parseImages();
    }

    private void parseImages() throws Exception {
        String pathToPdf = "C:\\temp\\LZW_Error.pdf";
        PdfReader reader = new PdfReader(pathToPdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        ImageRenderListener imageRenderListener = new ImageRenderListener();
        parser.processContent(1, imageRenderListener);
    }

    private class ImageRenderListener implements RenderListener {

        public ImageRenderListener() {
            //
        }

        public void beginTextBlock() {
            // nothing
        }

        public void endTextBlock() {
            // nothing
        }

        public void renderImage(ImageRenderInfo imageRenderInfo) {
            try {
                PdfImageObject image = imageRenderInfo.getImage();
                System.out.println("Rendered image :" + image);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public void renderText(TextRenderInfo arg0) {
            // nothing
        }
    }
}

Solution

In your sample file, the issue occurs because the image data ends exactly at the point where the LZW code length has to be increased.

For the image /Im3, the last code carrying image data caused the creation of LZW table entry 511, which implies that the following code should be encoded using 10 bits. Unfortunately, the EOD (End of Data) marker that follows is encoded using only 9 bits.

iText reads that next code the way the specification requires, i.e. using 10 bits. Because the bit following the 9-bit EOD marker in the stream is a 0 bit, iText sees 514 instead of 257 (the EOD marker value). When it then tries to use table entry 514, the NullPointerException occurs; after all, entry 511 had only just been added...

This probably happens because the encoder, knowing that it was at the end of the image data, did not create a table entry after that last code at all. It therefore never noticed that the table size had reached the threshold and kept using 9 bits.
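The arithmetic behind the 514 can be checked with a few lines of Java. This is only an illustrative sketch: the class `EodMisread` and the helper `readCode` are made up for this answer and are not iText code, and the two bytes are a constructed example (a 9-bit EOD marker followed by zero padding), not bytes taken from the sample PDF.

```java
class EodMisread {

    // Reads `width` bits, most significant bit first, starting at bit offset `pos`.
    static int readCode(byte[] data, int pos, int width) {
        int code = 0;
        for (int i = 0; i < width; i++) {
            int bitIndex = pos + i;
            int bit = (data[bitIndex / 8] >> (7 - bitIndex % 8)) & 1;
            code = (code << 1) | bit;
        }
        return code;
    }

    public static void main(String[] args) {
        // 257 written with 9 bits is 1 0000 0001; padded with 0 bits this gives
        // the two bytes 10000000 10000000.
        byte[] tail = { (byte) 0x80, (byte) 0x80 };

        System.out.println(readCode(tail, 0, 9));   // 257, the EOD marker
        System.out.println(readCode(tail, 0, 10));  // 514, one extra 0 bit doubles the value
    }
}
```

Reading one additional 0 bit shifts the 9-bit value left by one position, so 257 becomes 2 * 257 = 514.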

The specification is quite clear on this, cf. section 7.4.4.2 "Details of LZW Encoding" in ISO 32000-1:

Data encoded using the LZW compression method shall consist of a sequence of codes that are 9 to 12 bits long. Each code shall represent a single character of input data (0–255), a clear-table marker (256), an EOD marker (257), or a table entry representing a multiple-character sequence that has been encountered previously in the input (258 or greater).

Initially, the code length shall be 9 bits and the LZW table shall contain only entries for the 258 fixed codes. As encoding proceeds, entries shall be appended to the table, associating new codes with longer and longer sequences of input characters. The encoder and the decoder shall maintain identical copies of this table.

Whenever both the encoder and the decoder independently (but synchronously) realize that the current code length is no longer sufficient to represent the number of entries in the table, they shall increase the number of bits per code by 1. The first output code that is 10 bits long shall be the one following the creation of table entry 511, and similarly for 11 (1023) and 12 (2047) bits. Codes shall never be longer than 12 bits; therefore, entry 4095 is the last entry of the LZW table.

The encoder shall execute the following sequence of steps to generate each output code:

a) Accumulate a sequence of one or more input characters matching a sequence already present in the table. For maximum compression, the encoder looks for the longest such sequence.

b) Emit the code corresponding to that sequence.

c) Create a new table entry for the first unused code. Its value is the sequence found in step (a) followed by the next input character.

Thus, even after emitting the code for the last input characters, a table entry has to be created. And if that table entry is number 511, the first output code to follow, i.e. the EOD marker, has to be 10 bits long.
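The length rule quoted above can be written down as a small helper, seen from the encoder's side. This is a sketch under my reading of the specification; `LzwCodeWidth` and `widthOfNextCode` are hypothetical names, and `highestEntry` stands for the highest table entry the encoder has created so far (257 while only the fixed codes exist).

```java
class LzwCodeWidth {

    // Number of bits for the next output code, given the highest table entry
    // the encoder has created so far (257 means only the fixed codes exist).
    static int widthOfNextCode(int highestEntry) {
        if (highestEntry < 257 || highestEntry > 4095) {
            throw new IllegalArgumentException("table entry out of range: " + highestEntry);
        }
        if (highestEntry >= 2047) return 12;
        if (highestEntry >= 1023) return 11;
        if (highestEntry >= 511) return 10;
        return 9;
    }

    public static void main(String[] args) {
        System.out.println(widthOfNextCode(510)); // 9
        System.out.println(widthOfNextCode(511)); // 10, even if the next code is the EOD marker
    }
}
```

The helper takes no argument that says which code comes next, which is the point: the width depends only on the table state, so an EOD marker following the creation of entry 511 is a 10-bit code like any other.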

That being said, the decode method of iText's LZWDecoder could be hardened by a null test, at least in the else branch of if (code < tableIndex), so that it acts more gracefully, either by throwing a more descriptive exception or even by silently ignoring the issue if there aren't very many input bits left.
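To illustrate what such hardening might look like, here is a minimal standalone decoder. It is not iText's LZWDecoder and does not reuse its code; the class `TolerantLzwSketch`, the exception `LzwFormatException`, the test helper `pack9` and the 8-bit tolerance threshold are all assumptions chosen for this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TolerantLzwSketch {

    /** Hypothetical exception type for this sketch; not part of iText. */
    static class LzwFormatException extends RuntimeException {
        LzwFormatException(String message) { super(message); }
    }

    // Assumption: an invalid code followed by fewer bits than this is treated as a sloppy EOD.
    static final int TOLERATED_TRAILING_BITS = 8;

    static byte[] decode(byte[] data) {
        byte[][] table = new byte[4096][];
        for (int i = 0; i < 256; i++) {
            table[i] = new byte[] { (byte) i };
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int next = 258;  // next free table index
        int width = 9;   // current code length in bits
        int prev = -1;   // previous code, -1 right after a clear-table marker
        int bitPos = 0;
        int totalBits = data.length * 8;

        while (bitPos + width <= totalBits) {
            int code = 0;
            for (int i = 0; i < width; i++, bitPos++) {
                code = (code << 1) | ((data[bitPos / 8] >> (7 - bitPos % 8)) & 1);
            }
            if (code == 257) break;  // EOD marker
            if (code == 256) { next = 258; width = 9; prev = -1; continue; }

            // A code is usable only if its entry exists or is the one being built right now.
            boolean valid = prev < 0 ? code < 256 : code <= next;
            if (!valid) {
                if (totalBits - bitPos < TOLERATED_TRAILING_BITS) break;  // tolerate a bad tail
                throw new LzwFormatException("LZW code " + code + " at bit offset "
                        + (bitPos - width) + " refers to a missing table entry (next free: "
                        + next + ")");
            }
            byte[] entry;
            if (prev < 0) {
                entry = table[code];
            } else {
                byte[] prefix = table[prev];
                entry = code < next ? table[code] : append(prefix, prefix[0]);
                if (next < 4096) table[next++] = append(prefix, entry[0]);
            }
            out.write(entry, 0, entry.length);
            prev = code;
            // The decoder's table lags one entry behind the encoder's, hence 511 and not 512.
            if (next == 511) width = 10;
            else if (next == 1023) width = 11;
            else if (next == 2047) width = 12;
        }
        return out.toByteArray();
    }

    static byte[] append(byte[] prefix, byte b) {
        byte[] result = Arrays.copyOf(prefix, prefix.length + 1);
        result[prefix.length] = b;
        return result;
    }

    // Test helper: packs the given codes as 9-bit values, most significant bit first.
    static byte[] pack9(int... codes) {
        byte[] out = new byte[(codes.length * 9 + 7) / 8];
        int bitPos = 0;
        for (int code : codes) {
            for (int i = 8; i >= 0; i--, bitPos++) {
                if (((code >> i) & 1) != 0) out[bitPos / 8] |= (byte) (0x80 >> (bitPos % 8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Well-formed stream: clear, 'A', 'B', EOD.
        System.out.println(new String(decode(pack9(256, 65, 66, 257)), StandardCharsets.US_ASCII)); // AB
        // Invalid code 300 with only padding bits left: silently treated as end of data.
        System.out.println(new String(decode(pack9(256, 65, 300)), StandardCharsets.US_ASCII)); // A
    }
}
```

Whether silently ending the data is acceptable depends on the application. The threshold only covers the padding of the last byte, so an invalid code in the middle of a stream still raises the descriptive exception.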

Licensed under: CC-BY-SA with attribution