Java compress byte array and base 64 encode to base 64 decode and decompress byte array error: different sized input/output arrays

StackOverflow https://stackoverflow.com/questions/20622837

  •  02-09-2022

Question

My application requires a list of doubles encoded as a byte array with little endian encoding that has been zlib compressed and then encoded as base 64. I wrote up a harness to test my encoding; it wasn't working at first, but I was able to make some progress.

However, I noticed that when I attempt to decompress to a fixed size buffer, I am able to come up with input such that the size of the decompressed byte array is smaller than the original byte array, which obviously isn't right. Coincident with this, the last double in the list disappears. On most inputs, the fixed buffer size reproduces the input. Does anyone know why that would be? I am guessing the error is in the way I am encoding the data, but I can't figure out what is going wrong.

When I try using a ByteArrayOutputStream to handle variable-length output of arbitrary size (which will be important for the real version of the code, as I can't guarantee max size limits), the inflate method of Inflater continuously returns 0. I looked up the documentation and it said this means it needs more data. Since there is no more data, I again suspect my encoding, and guess that it is the same issue causing the previously explained behavior.

In my code I've included an example of data that works fine with the fixed buffer size, as well as data that doesn't work for fixed buffer. Both data sets will cause the variable buffer size error I explained.

Any clues as to what I am doing wrong? Many thanks.

import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;
import org.apache.commons.codec.binary.Base64;

public class BinaryReaderWriter {
    public static void main(String[] args) throws UnsupportedEncodingException, DataFormatException {
        // this input will break the fixed buffer method
        //double[] centroids = {123.1212234143345453223123123, 28464632322456781.23, 3123121.0};

        // this input will work with the fixed buffer method
        double[] centroids = {123.1212234143345453223123123, 28464632322456781.23, 31.0};
        BinaryReaderWriter brw = new BinaryReaderWriter();
        String output = brw.compressCentroids(centroids);
        brw.decompressCentroids(output);
    }

    void decompressCentroids(String encoded) throws DataFormatException {
        byte[] binArray = Base64.decodeBase64(encoded);

        // This block of code is the fixed buffer version
        System.out.println("binArray length " + binArray.length);
        Inflater deCompressor = new Inflater();
        deCompressor.setInput(binArray, 0, binArray.length);
        byte[] decompressed = new byte[1024];
        int decompressedLength = deCompressor.inflate(decompressed);
        deCompressor.end();
        System.out.println("decompressedLength = " + decompressedLength);
        byte[] decompressedData = new byte[decompressedLength];
        for (int i = 0; i < decompressedLength; i++) {
            decompressedData[i] = decompressed[i];
        }

        /*
        // This block of code is the variable buffer version
        ByteArrayOutputStream bos = new ByteArrayOutputStream(binArray.length);
        Inflater deCompressor = new Inflater();
        deCompressor.setInput(binArray, 0, binArray.length);
        byte[] decompressed = new byte[1024];
        while (!deCompressor.finished()) {
            int decompressedLength = deCompressor.inflate(decompressed);
            bos.write(decompressed, 0, decompressedLength);
        }
        deCompressor.end();
        byte[] decompressedData = bos.toByteArray();
        */

        ByteBuffer bb = ByteBuffer.wrap(decompressedData);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        System.out.println("decompressedData length = " + decompressedData.length);
        double[] doubleValues = new double[decompressedData.length / 8];
        for (int i = 0; i < doubleValues.length; i++) {
            doubleValues[i] = bb.getDouble(i * 8);
        }

        for (double dbl : doubleValues) {
            System.out.println(dbl);
        }
    }

    String compressCentroids(double[] centroids) {
        byte[] cinput = new byte[centroids.length * 8];
        ByteBuffer buf = ByteBuffer.wrap(cinput);
        buf.order(ByteOrder.LITTLE_ENDIAN);
        for (double cent : centroids) {
            buf.putDouble(cent);
        }

        byte[] input = buf.array();
        System.out.println("raw length = " + input.length);
        byte[] output = new byte[input.length];
        Deflater compresser = new Deflater();
        compresser.setInput(input);
        compresser.finish();
        int compressedLength = compresser.deflate(output);
        compresser.end();
        System.out.println("Compressed length = " + compressedLength);
        byte[] compressed = new byte[compressedLength];
        for (int i = 0; i < compressedLength; i++) {
            compressed[i] = output[i];
        }

        String encoded = Base64.encodeBase64String(compressed);
        return encoded;
    }
}


Solution

When compressing data, what we are really doing is re-encoding it to increase the entropy per bit. During the re-encoding process we have to add metadata that records how the data was encoded, so that it can be converted back to its original form.

Compression only succeeds when that metadata takes less space than the space saved by re-encoding the data.
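This expansion effect is easy to observe directly with `java.util.zip`: deflating high-entropy bytes produces output slightly larger than the input (the metadata overhead), while low-entropy bytes shrink dramatically. A minimal sketch (class and method names are illustrative):

```java
import java.util.Random;
import java.util.zip.Deflater;

public class ExpansionDemo {
    // Deflate the whole input in one call, with a buffer sized so
    // that nothing is truncated even if the output grows.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        // Generous headroom: zlib adds a header, block headers, and
        // an Adler-32 checksum, so output can exceed input.length.
        byte[] buf = new byte[input.length * 2 + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        byte[] random = new byte[1024];
        new Random(42).nextBytes(random);   // high entropy: incompressible
        byte[] zeros = new byte[1024];      // low entropy: highly compressible

        System.out.println("random 1024 bytes -> " + deflatedSize(random));
        System.out.println("zeros  1024 bytes -> " + deflatedSize(zeros));
    }
}
```

The random input deflates to more than 1024 bytes, while the zero-filled input collapses to a handful of bytes — exactly the trade-off described above.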

Consider Huffman encoding:

Huffman coding is a simple scheme in which a fixed-width character set is replaced by a variable-width one plus a code-length table. The table's size is, for obvious reasons, greater than zero. If all characters appear with a near-equal distribution, the variable-width codes save no space, so the compressed data (codes plus table) ends up larger than the uncompressed data.
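Applied to the question's code: `compressCentroids` allocates its output buffer as `new byte[input.length]` and calls `deflate` only once, so whenever the compressed stream would be larger than the raw doubles (the usual case for high-entropy doubles), the tail of the stream is silently dropped — and a truncated stream is exactly what makes `Inflater.inflate` return 0 asking for more input. A sketch of a loop-based fix, mirroring the variable-buffer idea already in the question (class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZlibStreams {
    // Loop deflate() into a growable buffer so the result is allowed
    // to be larger than the input without being truncated.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            bos.write(chunk, 0, n);
        }
        deflater.end();
        return bos.toByteArray();
    }

    // The symmetric inflate loop; finished() only becomes true once
    // the complete (untruncated) stream has been consumed.
    static byte[] decompress(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!inflater.finished()) {
            int n = inflater.inflate(chunk);
            bos.write(chunk, 0, n);
        }
        inflater.end();
        return bos.toByteArray();
    }
}
```

With a compress step like this, the Base64 stage receives the complete stream, and both the fixed-buffer and variable-buffer decompression paths recover every double.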

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow