Handle gzipped or bzip2ed downloads without keeping compressed data

https://stackoverflow.com/questions/15576789

29-03-2022

Question

I'd like to download a compressed file (either gzip or bzip2), decompress it, and analyze its contents while the download is still in progress, so that I can show partial results before it finishes. It's a CSV-like file with lots of data; I calculate sums, averages and such for certain columns. The file is big (4 GB) and the decompressed stream is even bigger, so I don't want to keep the whole compressed file on disk or in memory.

I thought it would be possible to combine Python's gzip or bz2 implementations with urllib2:

import csv
import gzip
import urllib2

data_stream = csv.reader(
                  gzip.GzipFile(
                      fileobj=urllib2.urlopen('http://…/somefile.gz')),
                  delimiter='\t')

…but it seems that urlopen's file is not file-like enough for GzipFile. I get a traceback after trying to read from such a stream:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 283, in _read
    pos = self.fileobj.tell()   # Save current position
AttributeError: addinfourl instance has no attribute 'tell'

The bz2 module is even worse: it doesn't allow passing a file object at all.

After looking for some answers, I found this question. The answer there works by basically storing the whole compressed file in memory, which is infeasible for me.

What can I do?

Solution

Use zlib in Python. zlib.decompressobj creates an object that can be fed gzip-compressed data piecemeal; each call to its decompress method returns whatever uncompressed data is available so far. Set wbits to 31 (16 + 15) to decode the gzip format; a value of 15 decodes the zlib format.
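A minimal sketch of that approach follows. It is written for Python 3 rather than the Python 2.7 used in the question, and the helper names iter_decompressed and iter_lines are made up for illustration. It assumes a single-member gzip stream and rows without embedded newlines. Since bz2.BZ2Decompressor offers the same incremental decompress method, the same helper should also cover the bzip2 case. The demo feeds in-memory chunks instead of a real download so it can run offline.

```python
import bz2
import csv
import gzip
import zlib


def iter_decompressed(chunks, decompressor):
    """Yield decompressed bytes as each compressed chunk arrives."""
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # zlib decompression objects have flush(); bz2.BZ2Decompressor does not.
    flush = getattr(decompressor, "flush", None)
    if flush is not None:
        tail = flush()
        if tail:
            yield tail


def iter_lines(byte_blocks, encoding="utf-8"):
    """Reassemble decompressed blocks into complete text lines."""
    pending = b""
    for block in byte_blocks:
        pending += block
        # Everything before the last newline is complete; keep the rest.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode(encoding)
    if pending:
        yield pending.decode(encoding)


# Demo data standing in for a download: a tiny tab-separated file,
# compressed and cut into 7-byte pieces to mimic network chunks.
raw = b"a\t1\nb\t2\nc\t3\n"
gz_bytes = gzip.compress(raw)
gz_chunks = [gz_bytes[i:i + 7] for i in range(0, len(gz_bytes), 7)]

# wbits=31 (16 + zlib.MAX_WBITS) selects the gzip container format.
gz_decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
rows = list(csv.reader(
    iter_lines(iter_decompressed(gz_chunks, gz_decompressor)),
    delimiter="\t"))

# The same helper with the standard bz2 incremental decompressor.
bz_bytes = bz2.compress(raw)
bz_chunks = [bz_bytes[i:i + 7] for i in range(0, len(bz_bytes), 7)]
bz_output = b"".join(iter_decompressed(bz_chunks, bz2.BZ2Decompressor()))
```

For a real download, the chunks could come from something like iter(lambda: response.read(65536), b""), where response is the object returned by urllib.request.urlopen (urllib2.urlopen in Python 2). Only one network chunk and one partial line are held in memory at a time, so partial sums can be updated as rows arrive.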

License: CC-BY-SA with attribution

Not affiliated with StackOverflow