Read Contents Tarfile into Python - "seeking backwards is not allowed"

Question 1

What's going wrong: Tar files are stored interleaved. They come in the order header, data, header, data, header, data, etc. When you enumerated the files with getmembers(), you've already read through the entire file to get the headers. Then when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data. But you can't seek backward in a network stream without closing and reopening the urllib request.

How to work around it: You'll need to download the file, save a temporary copy to disk or to a StringIO, enumerate the files in this temporary copy, and then extract the files you want.

#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile

tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)

# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)

    # Once the entire file has been downloaded, tarfile returns b''
    # (the empty bytes) which is a falsey value
    if not s:  
        break

    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()

# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file.  Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)

# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")

# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

# And when you're done extracting members:
tfile.close()
tmpfile.close()

Question 2

I had the same error when trying to requests.get the file, so I extracted all to a tmp directory instead of using BytesIO, or extractfile(member):

# stream == requests.get
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:        
    tarfileobj.extractall(path=t, members=None)
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())

Question 3

A very easy solution to this is to change how tarfile reads the file instead of:

tfile = tarfile.open(tarfile_name)

change to:

with tarfile.open(fileobj=f, mode='r:*') as tar:

and the important part is to put ':' in the mode.

you can check this answer as well to read more about it