Python bz2 - text vs. interactive console (data stream)

https://stackoverflow.com/questions/17006515

31-05-2022

Question

I was using bz2 earlier to try to decompress some input. The data I wanted to decode was already compressed, so I pasted it directly into the interactive Python console:

>>> import bz2
>>> bz2.decompress(input)

This worked just fine without any errors. However, I got different results when I tried to extract the text from an HTML file and then decompress it:

file = open("example.html", "r")
contents = file.read()
# Insert code to pull out the text, which is of type 'str'
result = bz2.decompress(parsedString)

I've compared the string I parsed against the original one, and they look identical. Furthermore, when I copy and paste the string I wish to decompress into my .py file (basically enclosing it in double quotes ""), it works fine. I have also tried opening the file with "rb" in the hope that the .html file would be read as binary, but that failed as well.

My questions are: what is the difference between these two strings? They are both of type 'str', so I'm assuming there is an underlying difference I am missing. Furthermore, how would I go about retrieving the bz2 content from the .html in such a way that it will not be treated as an incorrect datastream? Any help is appreciated. Thanks!

Solution

My guess is that the HTML file contains a text representation of the data (escape sequences such as \x80 written out as literal characters) instead of the actual binary data.

For instance take a look at the following code:

>>> t = '\x80'
>>> t
'\x80'

But say I create a text file whose contents are the four characters \x80 and do:

with open('file') as f:
    t = f.read()
print(repr(t))

I would get back:

'\\x80'
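To make the difference between the two strings concrete, here is a small sketch. The variable names are only illustrative, and a raw string stands in for what read() would return from such a file:

```python
# The Python parser processes the escape in source code: one character.
literal = "\x80"

# Text read from a file is not parsed, so the escape stays as four
# characters: backslash, 'x', '8', '0'. A raw string mimics that here.
from_file = r"\x80"

print(len(literal))     # 1
print(len(from_file))   # 4
print(repr(from_file))  # '\\x80'
```

Both values are strings and can look the same when displayed, but only the first one holds the byte value that bz2 expects.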

If this is the case, you could use eval to get the desired result:

result = bz2.decompress(eval('"' + parsedString + '"'))

eval executes arbitrary code, so make sure that you only do this for trusted data.
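If eval is a concern, the escape sequences can also be decoded without executing anything. The following is a minimal sketch rather than a definitive fix. It assumes Python 3 and assumes the extracted text contains only escapes that map to single bytes, such as \x80. The helper names unescape_to_bytes and decompress_escaped are made up for illustration.

```python
import bz2


def unescape_to_bytes(escaped_text):
    # Interpret escapes such as \x80 in the text, then map each resulting
    # character (code point 0-255) back to a single byte via latin-1.
    # Assumption: the text holds no escapes above \xff (for example \u1234),
    # which would make the final encode step fail.
    as_bytes = escaped_text.encode("latin-1")
    return as_bytes.decode("unicode_escape").encode("latin-1")


def decompress_escaped(escaped_text):
    # Hypothetical helper: decode the escapes, then decompress the bytes.
    return bz2.decompress(unescape_to_bytes(escaped_text))


# Round trip with made-up data: compress, write the bytes out as escape
# text (what the file is presumed to contain), then recover the original.
original = b"hello, bz2"
compressed = bz2.compress(original)
escaped = "".join("\\x%02x" % byte for byte in compressed)

print(decompress_escaped(escaped) == original)  # True
```

Unlike eval, this only interprets escape sequences, so a stray quote or a malicious expression in the file is never run as code.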

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow