What's going wrong: Tar files are stored interleaved, in the order header, data, header, data, header, data, and so on. When you enumerated the files with getmembers(), you had already read through the entire file to get the headers. Then, when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data block. But you can't seek backward in a network stream without closing and reopening the urllib request.
How to work around it: You'll need to download the file, save a temporary copy to disk or to an in-memory buffer (BytesIO in Python 3), enumerate the files in this temporary copy, and then extract the files you want.
#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile
tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)
# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)
    # Once the entire file has been downloaded, read() returns b''
    # (the empty bytes object), which is a falsy value
    if not s:
        break
    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()
# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file. Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)
# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")
# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
# And when you're done extracting members:
tfile.close()
tmpfile.close()
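
If you'd rather not buffer the whole archive, the interleaved layout described above also allows a single forward pass: open the archive in stream mode ("r|gz") and read each member's data as soon as its header comes by, before moving on to the next one. The following is only a minimal sketch of that idea. The names build_sample_archive, ForwardOnlyStream and read_first_match are made up for illustration, and the sample archive is a small stand-in built in memory instead of the real FTP download.

```python
import io
import tarfile


def build_sample_archive():
    """Build a tiny .tar.gz in memory (hypothetical stand-in for the download)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, payload in [("a/readme.txt", b"hello"),
                              ("a/article.nxml", b"<article/>")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


class ForwardOnlyStream:
    """Hypothetical helper: exposes only read(), like a network response."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def read_first_match(stream, suffix):
    """Return the contents of the first regular file whose name ends with suffix."""
    # "r|gz" is stream mode: tarfile never tries to seek backward.
    with tarfile.open(fileobj=stream, mode="r|gz") as tf:
        # Iterating yields members in archive order; each member's data
        # must be read before advancing to the next header.
        for member in tf:
            if member.isfile() and member.name.endswith(suffix):
                return tf.extractfile(member).read()
    return None


archive_bytes = build_sample_archive()
nxml_text = read_first_match(ForwardOnlyStream(archive_bytes), ".nxml")
```

With a real download you would pass the object returned by urllib.request.urlopen(tarfile_url) in place of ForwardOnlyStream. The trade-off is that you can't call getmembers() or getnames() first and then go back for the data, so this only suits cases where you can decide what to keep as each header arrives.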