Question

This is related to the question about zip bombs, but with gzip or bzip2 compression in mind, e.g. a web service accepting .tar.gz files.

Python provides a handy tarfile module that is convenient to use, but it does not seem to provide protection against zip bombs.

In python code using the tarfile module, what would be the most elegant way to detect zip bombs, preferably without duplicating too much logic (e.g. the transparent decompression support) from the tarfile module?

And, just to make it a bit less simple: No real files are involved; the input is a file-like object (provided by the web framework, representing the file a user uploaded).


Solution 4

I guess the answer is: there is no easy, ready-made solution. Here is what I use now:

class SafeUncompressor(object):
    """Small proxy class that enables external file object
    support for uncompressed, bzip2 and gzip files. Works transparently, and
    supports a maximum size to avoid zip bombs.
    """
    blocksize = 16 * 1024

    class FileTooLarge(Exception):
        pass

    def __init__(self, fileobj, maxsize=10 * 1024 * 1024):
        self.fileobj = fileobj
        self.name = getattr(self.fileobj, "name", None)
        self.maxsize = maxsize
        self.init()

    def init(self):
        import bz2
        import gzip
        self.pos = 0
        self.fileobj.seek(0)
        self.buf = b""
        self.format = "plain"

        magic = self.fileobj.read(2)
        if magic == b"\x1f\x8b":
            self.format = "gzip"
            self.gzipobj = gzip.GzipFile(fileobj=self.fileobj, mode="rb")
        elif magic == b"BZ":
            raise IOError("bzip2 support in SafeUncompressor disabled, "
                          "as self.bz2obj.decompress is not safe")
            # unreachable until the bzip2 path above is re-enabled:
            self.format = "bz2"
            self.bz2obj = bz2.BZ2Decompressor()
        self.fileobj.seek(0)

    def read(self, size):
        b = [self.buf]
        x = len(self.buf)
        while x < size:
            if self.format == "gzip":
                data = self.gzipobj.read(self.blocksize)
                if not data:
                    break
            elif self.format == "bz2":
                raw = self.fileobj.read(self.blocksize)
                if not raw:
                    break
                # this can already bomb here, to some extent,
                # so bzip2 support is disabled until resolved.
                # Also monitor http://stackoverflow.com/questions/13622706/how-to-protect-myself-from-a-gzip-or-bzip2-bomb for ideas
                data = self.bz2obj.decompress(raw)
            else:
                data = self.fileobj.read(self.blocksize)
                if not data:
                    break
            b.append(data)
            x += len(data)

            if self.pos + x > self.maxsize:
                self.buf = b""
                self.pos = 0
                raise SafeUncompressor.FileTooLarge("Compressed file too large")
        self.buf = b"".join(b)

        buf = self.buf[:size]
        self.buf = self.buf[size:]
        self.pos += len(buf)
        return buf

    def seek(self, pos, whence=0):
        if whence != 0:
            raise IOError("SafeUncompressor only supports whence=0")
        if pos < self.pos:
            self.init()
        self.read(pos - self.pos)

    def tell(self):
        return self.pos

It does not work well for bzip2, so that part of the code is disabled. The reason is that a single call to bz2.BZ2Decompressor.decompress can already produce an unexpectedly large chunk of data.
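The same capping idea can be demonstrated end to end with tarfile's streaming mode ("r|"), which reads strictly sequentially and so only needs a read method on the wrapped object. This is a minimal self-contained sketch, not the class above; CappedReader and the size figures are illustrative:

```python
import gzip
import io
import tarfile


class FileTooLarge(Exception):
    pass


class CappedReader:
    """File-like wrapper that fails once more than maxsize
    decompressed bytes have been read (illustrative name)."""

    def __init__(self, fileobj, maxsize):
        self.fileobj = fileobj
        self.maxsize = maxsize
        self.pos = 0

    def read(self, size=-1):
        data = self.fileobj.read(size)
        self.pos += len(data)
        if self.pos > self.maxsize:
            raise FileTooLarge("decompressed data exceeds maxsize")
        return data


# build a small .tar.gz entirely in memory, like an upload
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    payload = b"hello world"
    info = tarfile.TarInfo("hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

# read it back through the cap; "r|" streams, so no seek/tell needed
capped = CappedReader(gzip.GzipFile(fileobj=buf, mode="rb"), maxsize=64 * 1024)
with tarfile.open(fileobj=capped, mode="r|") as tar:
    member = tar.next()
    content = tar.extractfile(member).read()
print(content)  # b'hello world'
```

Streaming mode avoids implementing seek and tell, at the cost of sequential-only access to the archive members.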

OTHER TIPS

You could use the resource module to limit the resources available to your process and its children.

If you need to decompress in memory, then you could set resource.RLIMIT_AS (or RLIMIT_DATA, RLIMIT_STACK), e.g., using a context manager to automatically restore the limit to its previous value:

import contextlib
import resource

@contextlib.contextmanager
def limit(limit, type=resource.RLIMIT_AS):
    soft_limit, hard_limit = resource.getrlimit(type)
    resource.setrlimit(type, (limit, hard_limit)) # set soft limit
    try:
        yield
    finally:
        resource.setrlimit(type, (soft_limit, hard_limit)) # restore

with limit(1 << 30):  # 1 GB
    # do the thing that might try to consume all memory
    pass

If the limit is reached, MemoryError is raised.

This will determine the uncompressed size of the gzip stream, while using limited memory:

#!/usr/bin/python
import sys
import zlib

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)    # 15+16 selects the gzip format
total = 0
while True:
    buf = z.unconsumed_tail
    if buf == b"":
        buf = f.read(1024)
        if buf == b"":
            break
    got = z.decompress(buf, 4096)  # second argument caps the output size
    if got == b"":
        break
    total += len(got)
print(total)
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")

It will return a slight overestimate of the space required for all of the files in the tar file when extracted. The length includes those files, as well as the tar directory information.

The gzip.py code does not control the amount of data decompressed, except by virtue of the size of the input data. In gzip.py, it reads 1024 compressed bytes at a time. So you can use gzip.py if you're ok with up to about 1056768 bytes of memory usage for the uncompressed data (1032 * 1024, where 1032:1 is the maximum compression ratio of deflate). The solution here uses zlib.decompress with the second argument, which limits the amount of uncompressed data. gzip.py does not.
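The 1032:1 figure and the output-capping second argument can both be seen in a couple of lines (a standalone illustration, not part of the scripts in this answer):

```python
import zlib

# deflate gets close to its maximum ratio (~1032:1) on constant input
comp = zlib.compress(b"\0" * (1 << 20))  # 1 MiB of zeros
print(len(comp))  # on the order of 1 KB

# the second argument caps how much uncompressed data one call produces
d = zlib.decompressobj()
chunk = d.decompress(comp, 4096)
print(len(chunk))  # 4096; the remaining input waits in d.unconsumed_tail
```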

This will accurately determine the total size of the extracted tar entries by decoding the tar format:

#!/usr/bin/python

import sys
import zlib

def decompn(f, z, n):
    """Return n uncompressed bytes, or fewer if at the end of the compressed
       stream.  This only decompresses as much as necessary, in order to
       avoid excessive memory usage for highly compressed input.
    """
    blk = b""
    while len(blk) < n:
        buf = z.unconsumed_tail
        if buf == b"":
            buf = f.read(1024)
        got = z.decompress(buf, n - len(blk))
        blk += got
        if got == b"":
            break
    return blk

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)    # 15+16 selects the gzip format
total = 0
left = 0
while True:
    blk = decompn(f, z, 512)       # one tar block
    if len(blk) < 512:
        break
    if left == 0:
        if blk == b"\0" * 512:     # all-zero block: end-of-archive padding
            continue
        if blk[156] in b"123456":  # links, devices, dirs: no data blocks
            continue
        if blk[124] == 0x80:       # base-256 (binary) size field
            size = 0
            for i in range(125, 136):
                size <<= 8
                size += blk[i]
        else:                      # octal size field
            size = int(blk[124:136].split()[0].split(b"\0")[0], 8)
        if blk[156] not in b"xgXLK":  # skip extended-header pseudo-entries
            total += size
        left = (size + 511) // 512    # data blocks to skip
    else:
        left -= 1
print(total)
if blk != b"":
    print("warning: partial final block")
if left != 0:
    print("warning: tar file ended in the middle of an entry")
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")

You could use a variant of this to scan the tar file for bombs. This has the advantage of finding a large size in the header information before you even have to decompress that data.

As for .tar.bz2 archives, the Python bz2 library (at least as of 3.3) is unavoidably unsafe for bz2 bombs consuming too much memory. The bz2.decompress function does not offer a second argument like zlib.decompress does. This is made even worse by the fact that the bz2 format has a much, much higher maximum compression ratio than zlib due to run-length coding. bzip2 compresses 1 GB of zeros to 722 bytes. So you cannot meter the output of bz2.decompress by metering the input as can be done with zlib.decompress even without the second argument. The lack of a limit on the decompressed output size is a fundamental flaw in the Python interface.

I looked at _bz2module.c in 3.3 to see if there is an undocumented way to use it that avoids this problem. There is no way around it. The decompress function there just keeps growing the result buffer until it can decompress all of the provided input. _bz2module.c needs to be fixed.
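(A note for later readers: since Python 3.5, bz2.BZ2Decompressor.decompress does accept a max_length argument, so the output can now be metered after all. Below is a sketch of a bounded bzip2 reader built on it; the helper name and limits are illustrative, not part of the answer above:)

```python
import bz2
import io

def bounded_bz2_decompress(fileobj, maxsize, chunk=16 * 1024):
    """Decompress a bz2 stream from fileobj, raising ValueError if
    the output would exceed maxsize bytes (illustrative helper)."""
    d = bz2.BZ2Decompressor()
    out = []
    total = 0
    while not d.eof:
        raw = fileobj.read(chunk) if d.needs_input else b""
        if d.needs_input and not raw:
            break  # truncated stream
        # max_length (Python 3.5+) caps the output of this single call
        got = d.decompress(raw, max_length=maxsize - total + 1)
        total += len(got)
        if total > maxsize:
            raise ValueError("decompressed data exceeds maxsize")
        out.append(got)
    return b"".join(out)

demo = bz2.compress(b"sample data")
print(bounded_bz2_decompress(io.BytesIO(demo), maxsize=1024))  # b'sample data'
```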

If you develop for Linux, you can run the decompression in a separate process and use ulimit to limit the memory usage.

import subprocess
# ulimit is a shell builtin, so this needs shell=True; -v takes kilobytes
subprocess.Popen("ulimit -v %d; ./decompression_script.py %s" % (LIMIT, FILE),
                 shell=True)

Keep in mind that decompression_script.py should decompress the whole file in memory, before writing to disk.
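An alternative that avoids the shell is to set the limit in the child itself via preexec_fn (a sketch; the script name is the same placeholder as above, and the 1 GB figure is arbitrary):

```python
import resource
import subprocess

def set_memory_limit():
    # runs in the child between fork() and exec(); caps only the child
    resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))  # 1 GB

# script name is the same placeholder as in the ulimit example
proc = subprocess.Popen(["python3", "./decompression_script.py", "input.tar.gz"],
                        preexec_fn=set_memory_limit,
                        stderr=subprocess.DEVNULL)
proc.wait()
```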

I also need to handle zip bombs in uploaded zipfiles.

I do this by creating a fixed size tmpfs, and unzipping to that. If the extracted data is too large then the tmpfs will run out of space and give an error.

Here are the Linux commands to create a 200 MB tmpfs to unzip to.

sudo mkdir -p /mnt/ziptmpfs
echo 'tmpfs   /mnt/ziptmpfs         tmpfs   rw,nodev,nosuid,size=200M          0  0' | sudo tee -a /etc/fstab
sudo mount /mnt/ziptmpfs
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow