Python tarfile - check if file in tar exists outside (i.e., already been extracted)

StackOverflow https://stackoverflow.com/questions/16266651

  •  13-04-2022

Question

I'm new to Stack Overflow, so sorry if this post is redundant, but I haven't found an answer yet. I'm also fairly new to Python. I'd like to extract files from a tar file only if they do not already exist in the directory that contains the tar file. I've tried a number of versions. I think there is some redundancy in the code below, and it doesn't do what I need: it just keeps extracting and overwriting the existing file(s).

Files that need to be extracted will always end in "_B7.TIF". The code currently takes one argument: the full path of the directory that contains the tar file.

import os, shutil, sys, tarfile 
directory = sys.argv[1]

tifFiles = []
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".TIF"):
            # also tried tifFiles.append(file)
            tifFiles.append(file.name)
        elif file.endswith(".tar.gz"):
            tar = tarfile.open(root + "/" + file)
            for item in tar:
                if str(item) in tifFiles:
                    print "{0} has already been unzipped.".format(str(item))
                elif "_B7" in str(item):
                    tar.extract(item, path=root)
shutil.rmtree(root + "\gap_mask")

Here is another version that does not appear to be doing anything. I was trying to simplify...

import os, shutil, sys, tarfile
directory = sys.argv[1]

for root, dirs, files in os.walk(directory):
    if file not in tarfile.getnames() and file.endswith("_B7.TIF"):
        tar.extract(file, path=root)
    else:
        print "File: {0} has already been unzipped.".format(file)
shutil.rmtree(root + "\gap_mask")

Thank you both for your comments/suggestions. They both helped in some way. This code works for me.

import os, shutil, sys, tarfile
folder = sys.argv[1]

listFiles = os.listdir(folder)

try:
    for file in listFiles:
        if file.endswith(".tar.gz"):
            sceneTIF = file[:-7] + "_B7.TIF"
            if os.path.exists(os.path.join(folder,sceneTIF)):
                print sceneTIF, "has already been extracted."
            else:
                tar = tarfile.open(os.path.join(folder,file))
                for item in tar:
                    if "_B7" in str(item):
                        tar.extract(item, path=folder)
    shutil.rmtree(os.path.join(folder,"gap_mask"))
except WindowsError:
    pass

Any thoughts on style/redundancy/ways to make it better? Thomas, your code was not working straight out of the box. I think it was the tarfile.open component. Probably needed tarfile.open(os.path.join(directory, archive)). I only thought of that after reworking the above though. Haven't tested. Thanks again.

Solution

os.walk iterates over directory trees, including sub-directories. From your description, that is not what you want. Also, your first version only records the .TIF files that os.walk lists before a tar file, so any .TIF listed after the archive is never considered when you check whether it already exists.
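To make the difference concrete, here is a small Python 3 sketch that compares the two on a throwaway directory. The helper names `top_level_files` and `walked_files` and the file names are made up for illustration; only the `gap_mask` folder name comes from the question.

```python
import os
import tempfile


def top_level_files(directory):
    """Return the names of regular files directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def walked_files(directory):
    """Return the names of all files os.walk finds, at any depth."""
    found = []
    for root, dirs, files in os.walk(directory):
        found.extend(files)
    return sorted(found)


# Hypothetical layout: one archive at the top level, one TIF in a sub-folder.
with tempfile.TemporaryDirectory() as demo:
    open(os.path.join(demo, "scene.tar.gz"), "w").close()
    os.mkdir(os.path.join(demo, "gap_mask"))
    open(os.path.join(demo, "gap_mask", "mask.TIF"), "w").close()

    print(top_level_files(demo))  # ['scene.tar.gz']
    print(walked_files(demo))     # ['mask.TIF', 'scene.tar.gz']
```

os.listdir only sees the archive, while os.walk also picks up the file inside gap_mask, which is probably not a file you want to compare against.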

It is a lot easier to just check for the existence of files you encounter:

import sys
import os
import tarfile

directory = sys.argv[1]

def extract_nonexisting(archive):
    for name in archive.getnames():
        if os.path.exists(os.path.join(directory, name)):
            print name, "already exists"
        else:
            archive.extract(name, path=directory)

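# os.listdir returns bare names, so join them with the directory before opening.
archives = [name for name in os.listdir(directory) if name.endswith(".tar.gz")]
for archive_name in archives:
    with tarfile.open(os.path.join(directory, archive_name)) as archive:
        extract_nonexisting(archive)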
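
If you only want the "_B7.TIF" members, as described in the question, the same existence check can be combined with a suffix filter. The following is a rough Python 3 sketch, not a tested drop-in replacement: the function name `extract_missing` and its `suffix` parameter are my own invention, and the `filter="data"` argument is an extra safety measure that is only passed when the running Python version supports it.

```python
import os
import sys
import tarfile


def extract_missing(archive_path, dest, suffix="_B7.TIF"):
    """Extract members ending in `suffix` that are not yet present in `dest`.

    Returns the list of member names that were extracted.
    """
    # Newer Python versions support extraction filters; use the "data"
    # filter when it is available and fall back to the old behaviour otherwise.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    extracted = []
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a wanted file.
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            if os.path.exists(os.path.join(dest, member.name)):
                print(member.name, "already exists")
            else:
                archive.extract(member, path=dest, **extra)
                extracted.append(member.name)
    return extracted


if __name__ == "__main__" and len(sys.argv) > 1:
    folder = sys.argv[1]
    for name in os.listdir(folder):
        if name.endswith(".tar.gz"):
            extract_missing(os.path.join(folder, name), folder)
```

Checking each member against the file system means you do not have to guess the TIF name from the archive name, so it should still work if an archive's contents are not named after the archive itself.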
Licensed under: CC-BY-SA with attribution