Mysterious bytes in GZIP (bytestream) compression Python3

https://stackoverflow.com/questions/20958298

25-09-2022

Question

I have a file with lots of content (symbol heavy: <>?!""''=:;) that I wish to compress in parts. I read the file, convert it to a bytestream and then compress it. I'd expect the compressed output to look vaguely like: \x1f\x8b\x08\x00\x00\x92\x04 and so on.

However, it comes out more like: \x1f\x8b\x08\x00\x00\xa60v?\x04{?X\x0eDa and so on. Surely I should be getting hex values within the range 00 to ff?

Main snippet of Python3 code:

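import gzip

def compress_lines(path='somefile'):
  compMessages = []  # one compressed blob per line of the file
  with open(path, 'r') as f:
    for line in f:
      message = line.encode('ascii') #Or 'UTF-8' both produce funny results
      compMessages.append(gzip.compress(message)) #Default level is fine here
  return compMessages  # return after the loop, not on the first line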

The only interesting/relevant information I can find is that len(str(lines)) gives a different value from len(lines.encode('ascii')).

Ideas please?

Solution

There's nothing "mysterious" about your output. You're just not reading it correctly. This:

\x1f\x8b\x08\x00\x00\xa60v?\x04{?X\x0eDa

is the same as

\x1f\x8b\x08\x00\x00\xa6\x30\x76\x3f\x04\x7b\x3f\x58\x0e\x44\x61

It's just that the printable ASCII characters (the ones with byte values between 0x20 and 0x7E), such as 0, v, ?, {, D, and a, are displayed as the characters themselves rather than as \x escape codes.
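
As a minimal sketch of that equivalence in Python 3, the two spellings can be written as bytes literals and compared directly. The variable names mixed and escaped are only illustrative:

```python
# Two spellings of the same 16 bytes: one with printable characters shown
# literally, one with every byte written as a \x escape.
mixed = b'\x1f\x8b\x08\x00\x00\xa60v?\x04{?X\x0eDa'
escaped = b'\x1f\x8b\x08\x00\x00\xa6\x30\x76\x3f\x04\x7b\x3f\x58\x0e\x44\x61'

print(mixed == escaped)   # True: both literals denote identical bytes
print(len(mixed))         # 16
print(repr(escaped))      # repr() shows the printable bytes as characters again
```

Note that \x always consumes exactly two hex digits, so \xa60 is the byte 0xa6 followed by the character 0, not a single larger value.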

To verify this, observe the following:

>>> [ord(i) for i in '\x1f\x8b\x08\x00\x00\xa60v?\x04{?X\x0eDa']
[31, 139, 8, 0, 0, 166, 48, 118, 63, 4, 123, 63, 88, 14, 68, 97]

All of the values are between 0 and 255.
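
If an all-hex view of real gzip output is what you are after, a bytes object can produce one directly. The sketch below assumes a made-up input string (sample is hypothetical, not the asker's file) and uses only the standard gzip module plus the built-in list() and bytes.hex():

```python
import gzip

# Hypothetical sample input; any text would do.
sample = 'symbol heavy <>?!""\'\'=:; text\n'
compressed = gzip.compress(sample.encode('utf-8'))

# Iterating over bytes in Python 3 yields integers, each in the range 0..255.
values = list(compressed)
print(all(0 <= v <= 255 for v in values))  # True

# hex() renders every byte as two hex digits, printable or not.
print(compressed.hex())

# The gzip magic number and the deflate method byte come first.
print(compressed[:3] == b'\x1f\x8b\x08')   # True

# Decompressing returns the original text unchanged.
print(gzip.decompress(compressed).decode('utf-8') == sample)  # True
```

The display format never affects the data: whichever way the bytes are printed, decompressing them gives back the original input.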

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow