Decode Base64 Gzip in Python

Question 1

It looks like you're decoding the entire thing, including the begin-base64 644 data.xml.gz part, so you're getting a bunch of garbage at the start:

b1 = '''begin-base64 644 data.xml.gz\nH4sIAAAAAAAAA y9a4 lx3Hn
d6fguB7JzNuGZkNigNfdrAGbMAYaXeNfbPolXplYiRSIFu'''

b2 = '''\nH4sIAAAAAAAAA y9a4 lx3Hn
d6fguB7JzNuGZkNigNfdrAGbMAYaXeNfbPolXplYiRSIFu'''

If you run your algorithm on b1, you get something starting with this:

m\xe8"\x9d\xb6\xac{\xae

(I don't know how you lost the m in copying and pasting, but either way, it's not valid.)

If you run it on b2, you get something starting with this:

\x1f\x8b\x08\x00\x00\x00

That's what you want.

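Of course, taking off the '\n' as well gives the same result, since b64decode skips whitespace by default. So most likely the newline is just being used as a delimiter. If that's actually a '\\n' (aka r'\n', a literal backslash followed by the letter n) rather than a real '\n', you have to remove it to get the right answer: the backslash is skipped, but the n is a valid base64 character and shifts everything after it.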
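
To check the whitespace point, here is a minimal sketch (Python 3). It uses a made-up payload built on the spot rather than the data from the question, so the XML string is purely illustrative:

```python
import base64
import gzip

# Illustrative payload (not the question's data): gzip some XML, then base64 it.
payload = base64.b64encode(gzip.compress(b"<data>hello</data>")).decode("ascii")

with_newline = "\n" + payload   # shaped like b2 above
without_newline = payload

# With the default validate=False, characters outside the base64
# alphabet (such as the newline) are discarded before decoding.
decoded_a = base64.b64decode(with_newline)
decoded_b = base64.b64decode(without_newline)

starts_with_gzip_magic = decoded_a[:2] == b"\x1f\x8b"
same_result = decoded_a == decoded_b
```

Both flags should come out True: the leading newline makes no difference, and the decoded bytes begin with the gzip signature.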

Also, you seem to be doing a lot of extra work for no good reason. The data is most likely padded correctly already, although re-padding may be a worthwhile safeguard. But the whole translate(dict(zip(map(ord, u'-_'), u'+/'))) does the same thing as passing an altchars argument to b64decode, only less efficiently and less readably (if it's correct at all). (By the way, if you were using translate as an optimization against the cost of calling replace twice, the conversion to and from Unicode is almost certain to overwhelm the savings. Even if you had profiled and determined that it made a difference, you'd probably want to build the translate map once, up front: both for efficiency, so you don't rebuild it for every string, and, more importantly, for readability.)
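
As a rough illustration of that equivalence, the sketch below (Python 3) decodes the same URL-safe string three ways. The sample bytes are an assumption chosen only because they encode to a string containing both '-' and '_':

```python
import base64

# Made-up sample whose URL-safe encoding contains both '-' and '_'.
raw = b"\xfb\xef\xbe\xff\xff"
urlsafe = base64.urlsafe_b64encode(raw).decode("ascii")

# 1. Manual translation to the standard alphabet, then a plain decode.
table = str.maketrans("-_", "+/")
via_translate = base64.b64decode(urlsafe.translate(table))

# 2. Let b64decode do the substitution through altchars.
via_altchars = base64.b64decode(urlsafe, altchars=b"-_")

# 3. The standard library's dedicated URL-safe decoder.
via_urlsafe = base64.urlsafe_b64decode(urlsafe)
```

All three results should equal raw; the altchars and urlsafe_b64decode forms simply say what they mean in one call.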

Putting it together:

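import base64

data = '''begin-base64 644 data.xml.gz\nH4sIAAAAAAAAA y9a4 lx3Hn
d6fguB7JzNuGZkNigNfdrAGbMAYaXeNfbPolXplYiRSIFu'''
# Drop the header line; everything after the first newline is the payload.
_, data = data.split('\n', 1)
# Remove whitespace so it doesn't skew the padding calculation.
data = ''.join(data.split())
padding_factor = (4 - len(data) % 4) % 4
data += "="*padding_factor
# '-_' tells b64decode to treat '-' and '_' as '+' and '/'.
data_decoded = base64.b64decode(data, '-_')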

Again, if you've got a '\\n' rather than a '\n', change the split line accordingly.
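
Since the end goal is the XML inside the gzip stream, here is a sketch of the full round trip under some assumptions: Python 3, a real '\n' after the header, and a hypothetical helper name (decode_base64_gzip). The sample input is generated in the snippet itself, not taken from the question:

```python
import base64
import gzip


def decode_base64_gzip(text):
    """Hypothetical helper: drop the header line, base64-decode, gunzip."""
    _, body = text.split("\n", 1)           # discard 'begin-base64 ...' line
    body = "".join(body.split())            # remove line breaks in the payload
    body += "=" * (-len(body) % 4)          # restore any stripped padding
    compressed = base64.b64decode(body, altchars=b"-_")
    return gzip.decompress(compressed)


# Build an illustrative input: URL-safe base64, padding stripped, wrapped
# over two lines, with a header line in front.
xml = b"<data><item>1</item></data>"
encoded = base64.urlsafe_b64encode(gzip.compress(xml)).decode("ascii").rstrip("=")
sample = "begin-base64 644 data.xml.gz\n" + encoded[:20] + "\n" + encoded[20:]

result = decode_base64_gzip(sample)
```

Here result should be the original xml bytes. For large inputs, gzip.GzipFile over an io.BytesIO would be an alternative to gzip.decompress.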

Question 2

You need to strip off the beginning of the file, since it is not part of the base64 data. If you know that the literal two-character \n sequence will be part of every file, you can use it as a delimiter:

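index = data.find('\\n')  # literal backslash followed by n (two characters)
if index >= 0:  # find() returns -1 when the delimiter is absent
    data = data[index+2:]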
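
If it is unclear whether the files contain a real newline or the escaped two-character form, a sketch along these lines handles both. The helper name strip_header is hypothetical, and it assumes the header always starts with begin-base64 as in the question:

```python
def strip_header(data):
    """Hypothetical helper: return the text after a 'begin-base64 ...' header.

    Tries the escaped two-character form first, then a real newline.
    Data without a recognizable header is returned unchanged.
    """
    for delimiter in ("\\n", "\n"):
        head, sep, tail = data.partition(delimiter)
        if sep and head.startswith("begin-base64"):
            return tail
    return data


escaped = strip_header("begin-base64 644 data.xml.gz\\nH4sI")
real = strip_header("begin-base64 644 data.xml.gz\nH4sI")
no_header = strip_header("H4sI")
```

Using partition avoids the index arithmetic, and checking the header prefix keeps a line break inside the payload from being mistaken for the delimiter.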