Question

I have over a million text files compressed into 40 zip files. I also have a list of about 500 model names of phones. I want to find out the number of times a particular model was mentioned in the text files.

Is there any python module which can do a regex match on the files without unzipping it. Is there a simple way to solve this problem without unzipping?

Was it helpful?

Solution

There's nothing that will automatically do what you want.

However, there is a python zipfile module that will make this easy to do. Here's how to iterate over the lines in the file.

#!/usr/bin/python

import zipfile
f = zipfile.ZipFile('myfile.zip')

for subfile in f.namelist():
    print subfile
    data = f.read(subfile)
    for line in data.split('\n'):
        print line

OTHER TIPS

You could loop through the zip files, reading individual files using the zipfile module and running your regex on those, eliminating to unzip all the files at once.

I'm fairly certain that you can't run a regex over the zipped data, at least not meaningfully.

To access the contents of a zip file you have to unzip it, although the zipfile package makes this fairly easy, as you can unzip each file within an archive individually.

Python zipfile module

Isn't it (at least theoretically) possible, to read in the ZIP's Huffman coding and then translate the regexp into the Huffman code? Might this be more efficient than first de-compressing the data, then running the regexp?

(Note: I know it wouldn't be quite that simple: you'd also have to deal with other aspects of the ZIP coding—file layout, block structures, back-references—but one imagines this could be fairly lightweight.)

EDIT: Also note that it's probably much more sensible to just use the zipfile solution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top