How do I read selected files from a remote Zip archive over HTTP using Python?
Question
I need to read selected files, matching on the file name, from a remote zip archive using Python. I don't want to save the full zip to a temporary file (it's not that large, so I can handle everything in memory).
I've already written the code and it works, and I'm answering this myself so I can search for it later. But since evidence suggests that I'm one of the dumber participants on Stack Overflow, I'm sure there's room for improvement.
Solution
Here's how I did it (grabbing all files ending in ".ranks"):
import urllib2, cStringIO, zipfile

try:
    # Python 2: fetch the archive and wrap the bytes in a seekable buffer
    remotezip = urllib2.urlopen(url)
    zipinmemory = cStringIO.StringIO(remotezip.read())
    zf = zipfile.ZipFile(zipinmemory)
    for fn in zf.namelist():
        if fn.endswith(".ranks"):
            ranks_data = zf.read(fn)
            for line in ranks_data.split("\n"):
                pass  # do something with each line
except urllib2.HTTPError:
    pass  # handle exception
OTHER TIPS
Thanks Marcel for your question and answer (I had the same problem in a different context and encountered the same difficulty with file-like objects not really being file-like)! Just as an update: For Python 3.0, your code needs to be modified slightly:
import io
import urllib.error
import urllib.request
import zipfile

try:
    remotezip = urllib.request.urlopen(url)
    zipinmemory = io.BytesIO(remotezip.read())
    zf = zipfile.ZipFile(zipinmemory)
    for fn in zf.namelist():
        if fn.endswith(".ranks"):
            # zf.read() returns bytes in Python 3, so decode before splitting
            # (UTF-8 is an assumption about the file contents)
            ranks_data = zf.read(fn).decode("utf-8")
            for line in ranks_data.split("\n"):
                pass  # do something with each line
except urllib.error.HTTPError:
    pass  # handle exception
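This will do the job without saving the zip file to disk. Note that the whole archive is still downloaded and held in memory; only the matching members are decompressed.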
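For anyone who wants to try the in-memory part without a network connection, here is a minimal sketch of the same idea wrapped in a function. The names `read_matching_members` and `build_sample_zip` are made up for illustration, and UTF-8 is an assumed encoding; in real use you would pass in the bytes returned by `urlopen(url).read()`.

```python
import io
import zipfile


def read_matching_members(zip_bytes, suffix=".ranks", encoding="utf-8"):
    """Return {member name: list of text lines} for members ending in suffix."""
    results = {}
    # ZipFile needs a seekable file-like object, so wrap the raw bytes.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(suffix):
                # read() returns bytes; the encoding is an assumption.
                text = zf.read(name).decode(encoding)
                results[name] = text.splitlines()
    return results


def build_sample_zip():
    """Build a small archive in memory (stand-in for the downloaded bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("a.ranks", "1 alpha\n2 beta\n")
        zf.writestr("notes.txt", "ignored\n")
    return buffer.getvalue()


if __name__ == "__main__":
    print(read_matching_members(build_sample_zip()))
```

Using `splitlines()` instead of `split("\n")` avoids a trailing empty string when the file ends with a newline.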
Bear in mind that merely decompressing a ZIP file can be a security risk: a small archive from an untrusted source can expand to an enormous size (a "zip bomb") and exhaust memory.
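One way to reduce that risk is to check each member's declared uncompressed size before reading it, and to cap how much is actually read. This is only a sketch: `read_member_checked` and the `MAX_MEMBER_BYTES` limit are hypothetical names and values, and it is not a complete defence against hostile archives.

```python
import io
import zipfile

MAX_MEMBER_BYTES = 10 * 1024 * 1024  # hypothetical limit; tune for your data


def read_member_checked(zf, name, max_bytes=MAX_MEMBER_BYTES):
    """Read one member, refusing it if it is larger than max_bytes."""
    info = zf.getinfo(name)
    # file_size is the uncompressed size declared in the archive directory.
    if info.file_size > max_bytes:
        raise ValueError("%s declares %d bytes, limit is %d"
                         % (name, info.file_size, max_bytes))
    # Read at most max_bytes + 1 so a misleading header cannot force
    # an unbounded allocation.
    with zf.open(info) as member:
        data = member.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("%s expanded past the %d byte limit" % (name, max_bytes))
    return data


if __name__ == "__main__":
    # Small in-memory archive standing in for the downloaded one.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("a.ranks", "1 alpha\n2 beta\n")
    with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as zf:
        print(read_member_checked(zf, "a.ranks"))
```

In the loop from the answers above, this would replace the plain `read(fn)` call on the `ZipFile` object.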