Question

I was using bz2 earlier to try to decompress an input. The input that I wanted to decode was already in compressed format, so I passed it directly to bz2 in the interactive Python console:

>>> import bz2
>>> bz2.decompress(input)

This worked just fine without any errors. However, I got different results when I tried to extract the text from an HTML file and then decompress it:

import bz2

with open("example.html", "r") as f:
    contents = f.read()
# Insert code to pull the text out of contents into parsedString, which is of type 'str'
result = bz2.decompress(parsedString)

I've compared the string I parsed with the original one, and they look identical. Furthermore, when I copy and paste the string I wish to decompress into my .py file (basically enclosing it in double quotes ""), it works fine. I have also tried opening the file with "rb" in the hope that the .html file would be read as binary, though that failed to work as well.

My questions are: what is the difference between these two strings? They are both of type 'str', so I'm assuming there is an underlying difference I am missing. Furthermore, how would I go about retrieving the bz2 content from the .html in such a way that it will not be treated as an incorrect datastream? Any help is appreciated. Thanks!


Solution

My guess is that the HTML file contains the escaped text representation of the data (for example, the four characters \x80) instead of the actual binary data.

For instance, take a look at the following code:

>>> t = '\x80'
>>> t
'\x80'

But say I create a text file whose contents are the four literal characters \x80 and do:

with open('file') as f:
    t = f.read()
print repr(t)

I would get back:

'\\x80'
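
The difference can also be checked by comparing lengths. The following is a minimal sketch, written in Python 3 syntax rather than the Python 2 used above, and the temporary file name is only an illustration:

```python
import os
import tempfile

raw = b"\x80"      # a single byte with value 0x80
escaped = r"\x80"  # four characters: backslash, 'x', '8', '0'

# Write the escaped text to a temporary file and read it back as bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write(escaped)
    with open(path, "rb") as f:
        from_file = f.read()

print(len(raw))          # 1
print(len(from_file))    # 4
print(from_file == raw)  # False
```

Both values print similarly when inspected, which is why the parsed string can look identical to the working one while still being a different datastream.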

If this is the case, you could use eval to get the desired result:

result = bz2.decompress(eval('"' + parsedString + '"'))

Just make sure that you only do this for trusted data.
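
If the data is not trusted, the escapes can be decoded without eval. This is only a sketch under some assumptions: it targets Python 3, it assumes the text extracted from the HTML uses Python-style backslash escapes and contains no characters above U+00FF, and the helper name unescape_to_bytes is made up for illustration:

```python
import bz2

def unescape_to_bytes(escaped_text):
    """Convert text such as '\\x42\\x5a' into the bytes it describes."""
    # latin-1 maps characters 0-255 to the same byte values, so this round
    # trip only interprets the backslash escapes and never executes code.
    return escaped_text.encode("latin-1").decode("unicode_escape").encode("latin-1")

# Build a stand-in for the text found in the HTML file: every byte of a
# bz2 stream written out as a literal \xNN escape.
original = b"hello world"
compressed = bz2.compress(original)
escaped_text = "".join("\\x%02x" % byte for byte in compressed)

result = bz2.decompress(unescape_to_bytes(escaped_text))
print(result == original)  # True
```

Unlike eval, this cannot run code embedded in the input; malformed escapes raise an exception instead.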

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow