Question

So I have some fairly gigantic .gz files - we're talking 10 to 20 GB each when decompressed.

I need to loop through each line of them, so I'm using the standard:

import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
    #(yadda yadda)
f.close()

However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?

I'm now using something like:

from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
f.close()
#do some looping through the extracted file
#then delete extracted file

This works. But is there a cleaner way?


Solution

I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().

As the documentation explains:

f.readlines() returns a list containing all the lines of data in the file.

Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.

Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.

You almost never want to use readlines. Unless you're using a very old Python, just do this:

for line in f:

A file is an iterable full of lines, just like the list returned by readlines—except that it's not actually a list, it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
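
For example, the whole loop might look like the following. This is just a minimal sketch assuming the file holds plain text: 'rt' mode asks gzip to decode the decompressed bytes to str (Python 3), and process() is a placeholder for whatever you do with each line.

import gzip

# Iterate lazily: only the current line plus gzip's internal buffers
# are in memory at any one time.
with gzip.open(path + myFile, 'rt') as f:   # 'rt' = decompressed text mode
    for line in f:
        process(line)   # placeholder for your real per-line work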

From a quick test, with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, f.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
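
If you want to reproduce that comparison yourself, a rough sketch like this will do (the filename is a placeholder, and the absolute timings will depend entirely on your machine and file):

import gzip, time

start = time.time()
with gzip.open('big_file.gz', 'rt') as f:   # placeholder filename
    for line in f:
        pass
print('lazy iteration:', time.time() - start, 'seconds')

# Warning: on a 10-20 GB file this second test may swap-thrash or get
# OOM-killed, which is exactly the point being made above.
start = time.time()
with gzip.open('big_file.gz', 'rt') as f:
    for line in f.readlines():
        pass
print('readlines():', time.time() - start, 'seconds')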


Since this has come up a dozen more times since this answer was written, I wrote this blog post, which explains a bit more.

Other tips

Have a look at pandas, in particular IO tools. They support gzip compression when reading files and you can read files in chunks. Besides, pandas is very fast and memory efficient.

As I've never tried it, I don't know how well the compression and reading in chunks work together, but it might be worth a try.
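
As a rough sketch of what that could look like, assuming the data is CSV-like (the filename, chunk size, and per-chunk processing are all placeholders):

import pandas as pd

# Read the gzipped CSV in chunks of 100,000 rows; compression can also be
# inferred from the .gz extension instead of being passed explicitly.
for chunk in pd.read_csv('my_data.csv.gz', compression='gzip', chunksize=100000):
    do_something(chunk)   # hypothetical per-chunk work; each chunk is a DataFrame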

Licensed under: CC-BY-SA with attribution