Java: CRC error when using setDictionary for GZIPOutputStream's Deflater

https://stackoverflow.com/questions/9186347

27-04-2021

Question

I'm trying to take a stream of data from standard in, compress it one 128 byte block at a time, and then output it to standard out. (Example: "cat file.txt | java Dict | gzip -d | cmp file.txt", where file.txt just contains some ASCII characters.)

I also need to use a 32 byte dictionary taken from the end of each previous 128 byte block, for each subsequent block. (The first block uses its own first 32 bytes as its dictionary.) When I don't set the dictionary at all, the compression works fine. However, when I do set the dictionary, gzip gives me an error trying to decompress the data: "gzip: stdin: invalid compressed data--crc error".

I've tried adding/changing several parts of the code, but nothing has worked so far, and I haven't had any luck finding solutions with Google.

I've tried...

Adding "def.reset()" before "def.setDictionary(b)" near the bottom of the code does not work.
Only setting the dictionary for blocks after the first block does not work. (Not using a dictionary for the first block.)
Calling updateCRC with the "input" array before or after compressor.write(input, 0, bytesRead) does not work.

I'd really appreciate any suggestions - is there anything obvious I'm missing or doing wrong?

This is what I have in my Dict.java file:

import java.io.*;
import java.util.zip.GZIPOutputStream;

public class Dict {
  protected static final int BLOCK_SIZE = 128;
  protected static final int DICT_SIZE = 32;

  public static void main(String[] args) {
    InputStream stdinBytes = System.in;
    byte[] input = new byte[BLOCK_SIZE];
    byte[] dict = new byte[DICT_SIZE];
    int bytesRead = 0;

    try {
        DictGZIPOuputStream compressor = new DictGZIPOuputStream(System.out);
        bytesRead = stdinBytes.read(input, 0, BLOCK_SIZE);
        if (bytesRead >= DICT_SIZE) {
            System.arraycopy(input, 0, dict, 0, DICT_SIZE);
            compressor.setDictionary(dict);
        }

        do {
            compressor.write(input, 0, bytesRead);
            compressor.flush();

            if (bytesRead == BLOCK_SIZE) {
                System.arraycopy(input, BLOCK_SIZE-DICT_SIZE-1, dict, 0, DICT_SIZE);
                compressor.setDictionary(dict);
            }
            bytesRead = stdinBytes.read(input, 0, BLOCK_SIZE);
        } while (bytesRead > 0);

        compressor.finish();
    }
    catch (IOException e) {e.printStackTrace();}
  }

  public static class DictGZIPOuputStream extends GZIPOutputStream {
    public DictGZIPOuputStream(OutputStream out) throws IOException {
        super(out);
    }

    public void setDictionary(byte[] b) {
        def.setDictionary(b);
    }
    public void updateCRC(byte[] input) {
        crc.update(input);
    }
  }
}

Solution

I don't know exactly how zlib works internally, but as I understand GZIPOutputStream (which DictGZIPOutputStream extends), write() already updates the stream's CRC for the bytes it is given. If you then call updateCRC() on the same bytes, the CRC is updated twice and no longer matches the data, so gzip -d complains: "invalid compressed data--crc error".
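
To see why a second update matters, here is a minimal sketch that uses java.util.zip.CRC32 directly, on the assumption that it behaves like the crc field inside GZIPOutputStream (that field is a CRC32). The class name CrcTwiceDemo and the helper crcAfter are made up for illustration. Feeding the same block twice gives a different checksum from feeding it once, and that different value is what would end up in the gzip trailer:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcTwiceDemo {
    // CRC-32 after feeding `data` to one checksum object `times` times in a row.
    static long crcAfter(byte[] data, int times) {
        CRC32 crc = new CRC32();
        for (int i = 0; i < times; i++) {
            crc.update(data, 0, data.length);
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "hello world, how are you?".getBytes(StandardCharsets.US_ASCII);
        long once = crcAfter(block, 1);   // what write() alone would record
        long twice = crcAfter(block, 2);  // write() plus an extra updateCRC()
        System.out.println("fed once:  " + Long.toHexString(once));
        System.out.println("fed twice: " + Long.toHexString(twice));
        System.out.println("equal: " + (once == twice));
    }
}
```

gzip -d computes its checksum over the decompressed bytes only once, so it cannot agree with a trailer that was built from a doubled update.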

I also noticed that you never close the compressor. When I ran the code pasted above, gzip reported "gzip: stdin: unexpected end of file". So always make sure flush() and close() are called at the end. With that in mind, I have the following:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;


public class Dict
{
    protected static final int BLOCK_SIZE = 128;
    protected static final int DICT_SIZE = 32;

    public static void main(String[] args)
    {
        InputStream stdinBytes = System.in;
        byte[] input = new byte[BLOCK_SIZE];
        byte[] dict = new byte[DICT_SIZE];
        int bytesRead = 0;

        try
        {
            DictGZIPOutputStream compressor = new DictGZIPOutputStream(System.out);
            bytesRead = stdinBytes.read(input, 0, BLOCK_SIZE);

            if (bytesRead >= DICT_SIZE)
            {
                System.arraycopy(input, 0, dict, 0, DICT_SIZE);
            }

            do
            {
                // write() also updates the gzip CRC, so updateCRC() is not called here
                compressor.write(input, 0, bytesRead);

                if (bytesRead == BLOCK_SIZE)
                {
                    // the last DICT_SIZE bytes of this block become the next dictionary
                    System.arraycopy(input, BLOCK_SIZE - DICT_SIZE, dict, 0, DICT_SIZE);
                    compressor.setDictionary(dict);
                }

                bytesRead = stdinBytes.read(input, 0, BLOCK_SIZE);
            }
            while (bytesRead > 0);
            compressor.flush();         
            compressor.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }

    }

    public static class DictGZIPOutputStream extends GZIPOutputStream
    {

        public DictGZIPOutputStream(OutputStream out) throws IOException
        {
            super(out);
        }

        public void setDictionary(byte[] b)
        {
            def.setDictionary(b);
        }

        public void updateCRC(byte[] input)
        {
            crc.update(input);
        }                       
    }

}
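
The effect of close() can be checked in isolation. This is a minimal sketch, assuming an in-memory buffer in place of System.out and GZIPInputStream in place of gzip -d; GzipCloseDemo and its helper names are hypothetical. close() (here via try-with-resources) calls finish(), which writes any pending compressed data plus the gzip trailer holding the CRC and the length:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCloseDemo {
    // Compress `data` into a complete gzip member held in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(data, 0, data.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Leaving the try block closed the stream, so the trailer is in `sink`.
        return sink.toByteArray();
    }

    // Decompress a complete gzip member back into the original bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[128];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hello world, how are you?1e3djw\n".getBytes(StandardCharsets.US_ASCII);
        byte[] restored = gunzip(gzip(original));
        System.out.println("round trip ok: " + Arrays.equals(original, restored));
    }
}
```

With System.out as the sink, close() also flushes and closes standard output, which is acceptable here because nothing else is printed afterwards.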

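The test result at the console (this file.txt is shorter than one 128-byte block, so the setDictionary branch is not reached in this run):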

$ cat file.txt 
hello world, how are you?1e3djw
hello world, how are you?1e3djw adfa asdfas

$ cat file.txt | java Dict | gzip -d | cmp file.txt ; echo $?
0
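
One caveat on the dictionary itself: a stock gzip -d is never told which dictionary the compressor used. In java.util.zip, a preset dictionary is normally paired on both sides, with Deflater.setDictionary when writing and Inflater.setDictionary when reading, using the zlib format rather than gzip. The following is a minimal sketch of that pairing, assuming the reader can be Java code instead of gzip -d; PresetDictionaryDemo and its method names are hypothetical and this is not a drop-in replacement for the program above:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    // Compress `data` in zlib format with `dict` as the preset dictionary.
    static byte[] deflateWithDictionary(byte[] data, byte[] dict) {
        Deflater def = new Deflater();
        def.setDictionary(dict);   // set before any data is compressed
        def.setInput(data);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!def.finished()) {
            int n = def.deflate(buf);
            out.write(buf, 0, n);
        }
        def.end();
        return out.toByteArray();
    }

    // Decompress, handing over `dict` when the stream asks for a dictionary.
    static byte[] inflateWithDictionary(byte[] compressed, byte[] dict) {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0 && inf.needsDictionary()) {
                    inf.setDictionary(dict);   // must be the same bytes the writer used
                    continue;
                }
                if (n == 0 && inf.needsInput()) {
                    break;                     // input ended early; stop instead of spinning
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] dict = "hello world, how are you?".getBytes(StandardCharsets.US_ASCII);
        byte[] data = "hello world, how are you?1e3djw adfa asdfas".getBytes(StandardCharsets.US_ASCII);
        byte[] restored = inflateWithDictionary(deflateWithDictionary(data, dict), dict);
        System.out.println("round trip ok: " + Arrays.equals(data, restored));
    }
}
```

The point of the sketch is only that the reading side has to be given the dictionary explicitly, which a gzip -d pipeline has no step for.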

Licensed under: CC-BY-SA with attribution
