Question

I have a Python script where I need to extract the contents of a ZIP file. However, the zip file is over 6GB in size.

There is a lot of information about the zlib and zipfile modules, but I can't find a single approach that works in my case. I have this code:

with zipfile.ZipFile(fname, "r") as z:
    try:
        log.info("Extracting %s " % fname)
        head, tail = os.path.split(fname)
        z.extractall(folder + "/" + tail)
    except zipfile.BadZipfile:
        log.error("Bad Zip file")
    except zipfile.LargeZipFile:
        log.error("Zip file requires ZIP64 functionality but that has not been enabled (i.e., too large)")
    except zipfile.error:
        log.error("Error decompressing ZIP file")

I know that I need to set allowZip64 to True, but I'm unsure how to do this. Yet, even as is, the LargeZipFile exception is not thrown; the BadZipfile exception is thrown instead. I have no idea why.

Also, is this the best approach for extracting a 6GB zip archive?

Update: Modifying the BadZipfile exception to this:

except zipfile.BadZipfile as inst:
    log.error("Bad Zip file")
    print type(inst)     # the exception instance
    print inst.args      # arguments stored in .args
    print inst

shows:

<class 'zipfile.BadZipfile'>
('Bad magic number for file header',)

Update #2:

The full traceback shows

BadZipfile                                Traceback (most recent call last)
<ipython-input-1-8d34a9f58f6a> in <module>()
      6     for member in z.infolist():
      7         print member.filename[-70:],
----> 8         f = z.open(member, 'r')
      9         size = 0
     10         while True:

/Users/brspurri/anaconda/python.app/Contents/lib/python2.7/zipfile.pyc in open(self, name, mode, pwd)
    965             fheader = struct.unpack(structFileHeader, fheader)
    966             if fheader[_FH_SIGNATURE] != stringFileHeader:
--> 967                 raise BadZipfile("Bad magic number for file header")
    968 
    969             fname = zef_file.read(fheader[_FH_FILENAME_LENGTH])

BadZipfile: Bad magic number for file header

Running the code:

import sys
import zipfile

with open(zip_filename, 'rb') as zf:
    z = zipfile.ZipFile(zf, allowZip64=True)
    z.testzip()

doesn't output anything.

Solution

The problem is that you have a corrupted zip file. I can add more details about the corruption below, but first the practical stuff:

You can use this code snippet to tell you which member within the archive is corrupted. However, print z.testzip() would already tell you the same thing. And zip -T or unzip on the command line should also give you that info with the appropriate verbosity.
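
As a minimal sketch of the testzip() check mentioned above: the return value has to be printed or inspected, otherwise a bad member goes unnoticed. The helper name first_corrupt_member is made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
import zipfile


def first_corrupt_member(path):
    """Return the name of the first member that fails its check, or None.

    ZipFile.testzip() reads every member and returns the first bad
    filename; it returns None when nothing looks wrong.
    """
    with zipfile.ZipFile(path, "r", allowZip64=True) as z:
        return z.testzip()
```

Calling print(first_corrupt_member(fname)) should then show either None or the name of the member to skip.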


So, what do you do about it?

Well, obviously, if you can get an uncorrupted copy of the file, do that.

If not, if you want to just skip over the bad file and extract everything else, that's pretty easy—mostly the same code as the snippet linked above:

import sys
import zipfile

with open(sys.argv[1], 'rb') as zf:
    z = zipfile.ZipFile(zf, allowZip64=True)
    for member in z.infolist():
        try:
            z.extract(member)
        except zipfile.error as e:
            # Log the error and the member's name, then carry on with the rest.
            print("Skipping %s: %s" % (member.filename, e))

The Bad magic number for file header exception message means that zipfile was able to successfully open the zipfile, parse its directory, find the information for a member, seek to that member within the archive, and read the header of that member—all of which means you probably have no zip64-related problems here. However, when it read that header, it did not have the expected "magic" signature of PK\003\004. That means the archive is corrupted.
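
To make that check concrete, here is a rough sketch that repeats it by hand: it takes each member's offset from the central directory (ZipInfo.header_offset), seeks there in the raw file, and compares the first four bytes with the signature. The function name find_bad_headers is made up for illustration, and this only looks at the signature, not at the rest of the header or the data.

```python
import zipfile

# The local file header signature, written PK\003\004 in octal above.
LOCAL_HEADER_MAGIC = b"PK\x03\x04"


def find_bad_headers(path):
    """Return names of members whose local header lacks the magic signature."""
    # The central directory tells us where each member's header should be.
    with zipfile.ZipFile(path, "r", allowZip64=True) as z:
        entries = [(info.filename, info.header_offset) for info in z.infolist()]

    bad = []
    with open(path, "rb") as raw:
        for name, offset in entries:
            raw.seek(offset)
            if raw.read(4) != LOCAL_HEADER_MAGIC:
                bad.append(name)
    return bad
```

Printing the offsets of the members it reports is a quick way to see whether the damage clusters around a particular position in the file.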

The fact that the corruption happens to be at exactly 4294967296 implies very strongly that you had a 64-bit problem somewhere along the chain, because that's exactly 2**32.
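
The arithmetic behind that suspicion, as a small sketch: classic (non-ZIP64) zip headers store sizes and offsets in 4-byte unsigned fields, and 4294967296 is the first value that no longer fits in one. The helper names are made up for illustration, and this only shows how a 32-bit field behaves; it does not prove what went wrong with your archive.

```python
import struct

REPORTED_POSITION = 4294967296  # the figure quoted above


def fits_in_uint32(value):
    """True if value can be stored in a 4-byte unsigned field."""
    try:
        struct.pack("<L", value)
        return True
    except struct.error:
        return False


def truncate_to_uint32(value):
    """What is left if only the low 32 bits of value are kept."""
    return value & 0xFFFFFFFF


print(REPORTED_POSITION == 2 ** 32)           # True
print(fits_in_uint32(2 ** 32 - 1))            # True
print(fits_in_uint32(REPORTED_POSITION))      # False
print(truncate_to_uint32(REPORTED_POSITION))  # 0
```

A tool that silently keeps only the low 32 bits would treat that position as 0, which is the kind of bug that produces corruption right at this boundary.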


The command-line zip/unzip tools have some workarounds to deal with common causes of corruption that lead to problems like this. It looks like those workarounds may be working for this archive, given that you get a warning but all of the files are apparently recovered. Python's zipfile library does not have those workarounds, and I doubt you want to write your own zip-handling code…

However, that does open the door for two more possibilities:

First, zip might be able to repair the zipfile for you, using the -F or -FF flag. (Read the manpage, or zip -h, or ask at a site like SuperUser if you need help with that.)
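
If you want to drive that repair from the same script, something like the following might do. It assumes Info-ZIP zip 3.0, where -F and -FF write the repaired archive to a separate file named with --out; check your own zip's manpage before relying on that. The function names are made up for illustration.

```python
import subprocess


def repair_command(broken, repaired, aggressive=False):
    """Build the zip command line for a repair attempt.

    Assumes Info-ZIP zip 3.0: -F tries a normal fix, -FF rescans the
    whole file, and --out names the repaired copy.
    """
    flag = "-FF" if aggressive else "-F"
    return ["zip", flag, broken, "--out", repaired]


def repair(broken, repaired, aggressive=False):
    """Run the repair and return zip's exit status (0 means success)."""
    return subprocess.call(repair_command(broken, repaired, aggressive))
```

The original archive is left untouched, so you can try -F first and fall back to -FF if the result is still unreadable.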

And if all else fails, you can run the unzip tool from Python, instead of using zipfile, like this:

import subprocess
subprocess.check_output(['unzip', fname])

That gives you a lot less flexibility and power than the zipfile module, of course—but you're not using any of that flexibility anyway; you're just calling extractall.
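
One catch with check_output: it raises CalledProcessError on any non-zero exit status, and Info-ZIP's unzip documents status 1 as "warnings were encountered, but processing completed successfully", which sounds like your case. A possible sketch that accepts status 1 is below. The helper names are made up for illustration, and -o (overwrite without prompting) and -d (destination directory) are optional extras not in the one-liner above.

```python
import subprocess


def unzip_command(archive, dest_dir):
    """Build an unzip command line: -o overwrites silently, -d sets the target."""
    return ["unzip", "-o", archive, "-d", dest_dir]


def run_tolerating_warnings(cmd, ok_codes=(0, 1)):
    """Run cmd and return its exit status.

    Raises CalledProcessError only for statuses outside ok_codes, so an
    unzip run that merely warned (status 1) is not treated as a failure.
    """
    code = subprocess.call(cmd)
    if code not in ok_codes:
        raise subprocess.CalledProcessError(code, cmd)
    return code
```

You would call it as run_tolerating_warnings(unzip_command(fname, folder)) and log the returned status, so that a warning still leaves a trace.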

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow