How to test a directory of files for gzip and uncompress gzipped files in Python using zcat?

https://stackoverflow.com/questions/15340982

23-03-2022

Question

I'm in my 2nd week of Python and I'm stuck on a directory of zipped/unzipped logfiles, which I need to parse and process.

Currently I'm doing this:

import glob
import os
import sys
import operator
import zipfile
import zlib
import gzip
import subprocess

if sys.version.startswith("3."):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

for f in glob.glob('logs/*'):
    file = open(f,'rb')        
    new_file_name = f + "_unzipped"
    last_pos = file.tell()

    # test for gzip
    if (file.read(2) == b'\x1f\x8b'):
        file.seek(last_pos)

    #unzip to new file
    out = open( new_file_name, "wb" )
    process = subprocess.Popen(["zcat", f], stdout = subprocess.PIPE, stderr=subprocess.STDOUT)

    while True:
      if process.poll() != None:
        break;

    output = io_method(process.communicate()[0])
    exitCode = process.returncode


    if (exitCode == 0):
      print "done"
      out.write( output )
      out.close()
    else:
      raise ProcessException(command, exitCode, output)

which I've "stitched" together from these SO answers (here) and blog posts (here).

However, it does not seem to work: my test file is 2.5GB and the script has been chewing on it for 10+ minutes. I'm also not sure whether what I'm doing is correct in the first place.

Question:
If I don't want to use the gzip module and need to decompress chunk by chunk (the actual files are >10GB), how do I uncompress and save to a file using zcat and subprocess in Python?

Thanks!

Solution

This should read the first line of every file in the logs subdirectory, unzipping as required:

#!/usr/bin/env python

import glob
import gzip
import subprocess

for f in glob.glob('logs/*'):
  if f.endswith('.gz'):
    # Open a compressed file. Here is the easy way:
    #   file = gzip.open(f, 'rb')
    # Or, here is the hard way:
    proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
    file = proc.stdout
  else:
    # Otherwise, it must be a regular file
    file = open(f, 'rb')

  # Process file, for example:
  print(f, file.readline())
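
The snippet above only reads one line, so here is a rough sketch of the "save to file, chunk by chunk" part of the question. It assumes Python 3 and a GNU-style zcat on the PATH; the names is_gzip, zcat_to_file, GZIP_MAGIC and the 1 MiB chunk size are made up for illustration. It checks the two magic bytes like the question does (rather than the .gz extension) and reads zcat's pipe while zcat is still running. The loop in the question waits on poll() before reading anything, so once the pipe buffer fills up zcat blocks and the loop never finishes on a large file; communicate() would also pull the whole output into memory.

```python
import glob
import shutil
import subprocess

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def zcat_to_file(src, dst, chunk_size=1024 * 1024):
    """Decompress src into dst by streaming zcat's stdout in chunks."""
    cmd = ["zcat", src]
    with open(dst, "wb") as out:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        # Drain the pipe while zcat is writing to it. Only about
        # chunk_size bytes are held in memory at any time.
        with proc.stdout:
            shutil.copyfileobj(proc.stdout, out, chunk_size)
        exit_code = proc.wait()
    if exit_code != 0:
        raise subprocess.CalledProcessError(exit_code, cmd)


if __name__ == "__main__":
    # Same directory layout and output naming as in the question.
    for path in glob.glob("logs/*"):
        if is_gzip(path):
            zcat_to_file(path, path + "_unzipped")
```

If you don't need to look at the chunks in Python at all, you can also pass the open output file as stdout= to Popen and let zcat write to it directly.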

Licensed under: CC-BY-SA with attribution
