Question

I have the following code for compressing and decompressing a string.

public static byte[] compress(String str)
{
    try
    {
        ByteArrayOutputStream obj = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(obj);
        gzip.write(str.getBytes("UTF-8"));
        gzip.close();
        return obj.toByteArray();
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return null;
}

public static String decompress(byte[] bytes)
{
    try
    {
        GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(bytes));
        BufferedReader bf = new BufferedReader(new InputStreamReader(gis, "UTF-8"));
        StringBuilder outStr = new StringBuilder();
        String line;
        while ((line = bf.readLine()) != null)
        {
            outStr.append(line);
        }
        return outStr.toString();
    }
    catch (IOException e)
    {
        return e.getMessage();
    }
}

I compress the string into a byte array on Windows, send the byte array through a socket to a Linux machine, and decompress it there. However, after decompression all my newline characters are gone.
At first I thought the problem was a Windows-to-Linux difference. However, I wrote a simple program on Windows that uses the same code, and the newlines are still gone.
Can anyone shed light on what causes this? I can't figure out any explanation.

Solution

I think the problem is here:

    while ((line = bf.readLine()) != null)
    {
        outStr.append(line);
    }

readLine() sees the newline character but doesn't include it in the value it returns for line.

The problem is worse than you think, perhaps.

readLine() returns all the characters up to, but not including, a line terminator (a line feed, a carriage return, or a carriage return followed by a line feed) OR the end of the file. So you can't tell whether the last line you get had a line terminator at the end or not.
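The ambiguity can be seen with a small sketch. The class and method names here (ReadLineDemo, lines) are made up for illustration; only BufferedReader.readLine() comes from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReadLineDemo {
    // Hypothetical helper: collects every value readLine() returns for the text.
    static List<String> lines(String text) {
        List<String> out = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                out.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three different inputs, one identical result: the terminators are lost.
        System.out.println(lines("a\nb"));    // [a, b]
        System.out.println(lines("a\nb\n"));  // [a, b]
        System.out.println(lines("a\r\nb"));  // [a, b]
    }
}
```

Since all three inputs produce the same list, nothing built only from the returned lines can restore the original text exactly.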

This might not matter, and if so, you can simply add this right after the existing append call inside the loop:

outStr.append('\n');

Be aware that input which did not end with a line terminator will then gain an extra one at the end.

If it does matter, you will need to use read() and append every character you receive. In that case, you might run into the infamous "What's at the end of the line?" problem you allude to: Windows, Linux, and Mac OS use different combinations of carriage-return and line-feed characters to end lines.
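A minimal sketch of the read() approach, assuming the same UTF-8 encoding as the question. The class name GzipRoundTrip is made up, and rethrowing failures as UncheckedIOException is a choice of this sketch rather than something from the original code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {
    static byte[] compress(String str) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so close before toByteArray().
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(str.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String decompress(byte[] bytes) {
        StringBuilder outStr = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(bytes)),
                StandardCharsets.UTF_8))) {
            int c;
            // read() returns one char at a time, or -1 at end of stream.
            // Line terminators arrive like any other character and are kept.
            while ((c = reader.read()) != -1) {
                outStr.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return outStr.toString();
    }

    public static void main(String[] args) {
        String original = "line1\r\nline2\nline3\n";
        System.out.println(original.equals(decompress(compress(original))));
    }
}
```

Because nothing is interpreted as a line, whatever terminators the sender used come back unchanged, and it is up to the receiver to decide whether to normalize them.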

OTHER TIPS

It is not GZIP that is "eating" newlines.

It is this code:

    while ((line = bf.readLine()) != null)
    {
        outStr.append(line);
    }

The readLine() method reads a line (up to and including a line termination sequence) and then returns it without the terminator. You then append it to outStr ... without putting back the line termination that was stripped.

But even if you put a line terminator back, you can't guarantee that it matches the actual line termination sequence that was used ... if you do it that way.

I recommend that you replace the readLine() calls with read() calls; i.e. read and then buffer the data one character at a time. It solves two problems at once. It may even be faster, because you are avoiding the unnecessary overhead of assembling line Strings.
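As a variant of the same idea, the sketch below reads into a char array instead of calling read() once per character, which reduces the number of calls. The buffer size of 4096 and the names GzipChunked and gzip are arbitrary choices for illustration, and no speed-up has been measured here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipChunked {
    static String decompress(byte[] bytes) {
        StringBuilder outStr = new StringBuilder();
        char[] buffer = new char[4096]; // arbitrary chunk size
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(bytes)),
                StandardCharsets.UTF_8)) {
            int n;
            // read(char[], int, int) returns the number of chars read, or -1 at end.
            while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
                outStr.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return outStr.toString();
    }

    // Hypothetical helper used only to produce input for the demo.
    static byte[] gzip(String str) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(str.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String original = "first\r\nsecond\n\nlast";
        System.out.println(original.equals(decompress(gzip(original))));
    }
}
```

No line Strings are assembled along the way, so both the missing newlines and the question of which terminator to restore disappear together.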

Licensed under: CC-BY-SA with attribution