Question

I have the following structure for log files:

/var/log/2013-12-24.tar.gz

This archive contains the files:

2013-12-24/{00,01,02...23}.log.gz

I need to parse these files using Python, without extracting any of them to disk or into memory.

Any suggestions?

Solution

This is NOT possible with .tar.gz without decompressing the whole archive, and here is why.

To create a .tar.gz file, you first tar the files into a single .tar file and then gzip the result.

A .tar file CAN be scanned without reading the whole thing, but this only works well if the members of the archive are relatively large. The reason is that tar does NOT have a header that lists all members in one compact place. Instead, each archive member is preceded by a 512-byte header that records the member's size, which tells you where the next header is located. With large archive members you can find the contents of an arbitrary member on disk relatively quickly, but you will have to lseek() once per member along the way.
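To make the header-hopping concrete, here is a minimal sketch that lists the members of an uncompressed .tar by reading only the 512-byte headers and seeking past the data. The function name `list_tar_members` is made up for this illustration, and the parser assumes plain ustar-style headers (short names, octal sizes); it does not handle GNU or pax long-name extensions.

```python
import io

BLOCK = 512  # tar headers and member data are both padded to 512-byte blocks


def list_tar_members(fileobj):
    """Yield (name, size) for each member, reading headers only.

    Assumption: simple ustar-style headers. Long names and other
    extensions are not handled in this sketch.
    """
    while True:
        header = fileobj.read(BLOCK)
        # The archive ends with zero-filled blocks (or simply end of file).
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            return
        name = header[0:100].rstrip(b"\0").decode("utf-8")
        # The size field is an octal number stored as ASCII at offset 124.
        size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)
        yield name, size
        # Skip the member data, rounded up to the next 512-byte boundary.
        padded = (size + BLOCK - 1) // BLOCK * BLOCK
        fileobj.seek(padded, io.SEEK_CUR)
```

The loop performs one seek per member, which is why this approach pays off for a few large members and much less for many small ones.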

But once you apply gzip on top of the .tar, it becomes practically impossible to quickly get the list of all members of the archive, much less their uncompressed contents. A gzip stream can only be decompressed from the beginning, so you are forced to decompress the whole archive even to simply list its members.

Note that the exact same problem exists for the popular .tar.bz2 and .tar.xz formats.

You can fix this by using the zip format. zip has a big advantage over .tar.gz because it DOES have a compact index that lists all archive members in one place. This lets you quickly read that list and extract only the files you need, without having to decompress the entire archive as .tar.gz requires.
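As a rough sketch of what that looks like in Python, assuming the logs were repacked into a zip archive (the question itself uses .tar.gz) that still holds the hourly .log.gz files, the standard zipfile and gzip modules can stream a single member. The function name `iter_zip_log_lines` and the UTF-8 decoding are assumptions made for this example.

```python
import gzip
import zipfile


def iter_zip_log_lines(zip_path, member_name):
    """Yield text lines from one .log.gz member of a zip archive.

    Only the requested member is read; the other members are never
    decompressed. Nothing is written to disk.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # archive.namelist() would return all member names from the
        # zip index without touching the member data.
        with archive.open(member_name) as raw:
            # The member is itself gzip-compressed, so unwrap it as text.
            with gzip.open(raw, mode="rt", encoding="utf-8", errors="replace") as text:
                for line in text:
                    yield line
```

For example, `iter_zip_log_lines("logs.zip", "2013-12-24/13.log.gz")` (hypothetical file names) would touch only the 13:00 log.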

OTHER TIPS

I have found a solution. I'll just give it here for reference:

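import subprocess

file_name = "/var/log/2013-06-10.tar.gz"
# tar -O writes the contents of every member (the hourly .log.gz files) to stdout, back to back
gzip_data = subprocess.Popen(["tar", "-Oxf", file_name], stdout=subprocess.PIPE)
# zcat decompresses the concatenated gzip streams coming from tar
data = subprocess.Popen(["zcat"], stdin=gzip_data.stdout, stdout=subprocess.PIPE)
gzip_data.stdout.close()  # let tar receive SIGPIPE if zcat exits early
for line in data.stdout:
    do_my_process_on(line)  # placeholder for your own handler; each line is bytes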
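
The same idea can be sketched without external processes, using only the standard library. This is an alternative to the pipeline above, not the poster's method: tarfile's stream mode ("r|gz") reads the archive strictly front to back, and each .log.gz member is unwrapped with gzip as it goes by. The function name `iter_log_lines` is invented for this example. Nothing is written to disk, but the whole archive is still decompressed sequentially.

```python
import gzip
import tarfile


def iter_log_lines(archive_path):
    """Yield (member_name, line) for every *.log.gz member, in archive order."""
    # "r|gz" opens the archive as a forward-only stream: no seeking back,
    # and no files are extracted to disk.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".log.gz")):
                continue
            inner = archive.extractfile(member)
            # Each member is itself a gzip file; decompress it on the fly.
            with gzip.GzipFile(fileobj=inner) as log:
                for line in log:
                    yield member.name, line  # line is bytes
```

In stream mode each member has to be consumed before moving on to the next one, which the generator does naturally. Unlike the tar/zcat pipeline, this version also tells you which hourly file each line came from.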
Licensed under: CC-BY-SA with attribution