Question

I have a directory containing duplicate files. Each duplicate has a randomly generated number appended after the .txt extension, which allows the copies to co-exist in the same directory.

aaaabbbbcccc.txt.12345678

aaaabbbbcccc.txt.34567890

qqqqwwwwrrrr.txt.98765432

qqqqwwwwrrrr.txt.54321987

At the end of the day, all I need is one of the two files (with the same name) and the information within it. I am able to retrieve the data from the files myself. I have thousands of files to remove in this directory.

A co-worker has suggested this:

import os

prev_base = None
for rs_file in sorted(os.listdir('.')):
    base_rs_file = rs_file[:-7]
    if base_rs_file == prev_base:
        os.unlink(rs_file)
    else:
        prev_base = base_rs_file

I am not sure I fully understand how this snippet of code actually works. I understand what is happening up to the 'if' statement. Any help would be great.

Thanks, Shane


Solution

See comments inline

import os

# Initialize the variable to None; an empty string "" would also work.
prev_base = None

# Iterate through the files in the directory in sorted order, so that
# duplicates of the same file end up next to each other.
# This assumes the script is run from the directory that contains the files.
for rs_file in sorted(os.listdir('.')):

    # Extract the part of the name that the duplicates share.
    # Example: for qqqqwwwwrrrr.txt.54321987, base_rs_file = qqqqwwwwrrrr.txt
    # There is a problem in the original code here: the suffix (a dot plus
    # 8 digits) is 9 characters long, so the slice should be -9, not -7.
    base_rs_file = rs_file[:-9]

    # Compare to the previous base name. This is why prev_base had to be
    # initialized before the loop; otherwise this test would raise an error.
    if base_rs_file == prev_base:

        # This base name has already been seen, so the file is a duplicate
        # and is removed. os.unlink() removes (deletes) the file at that path.
        os.unlink(rs_file)
    else:

        # This base name has not been seen before, so it is stored in prev_base.
        prev_base = base_rs_file
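
The fixed slice of -9 only works while every suffix is exactly a dot plus 8 digits. As a hedged sketch of a more tolerant variant (not the co-worker's method), the base name can instead be taken as everything before the last dot, and a set of already-seen base names can replace the prev_base comparison. The helper names split_base and duplicates_to_remove are made up for this illustration, and the sketch assumes the random suffix is always purely numeric. It only prints what it would delete, so the result can be checked before anything is removed.

```python
import os


def split_base(name):
    """Return the part of name before the last dot if the suffix is numeric.

    Hypothetical helper: 'a.txt.12345678' gives 'a.txt'; any other name is
    returned unchanged.
    """
    base, _, suffix = name.rpartition(".")
    return base if base and suffix.isdigit() else name


def duplicates_to_remove(names):
    """Return the names whose base name was already seen earlier in sorted order."""
    seen = set()
    to_remove = []
    for name in sorted(names):
        base = split_base(name)
        if base in seen:
            # A file with this base name is already being kept.
            to_remove.append(name)
        else:
            seen.add(base)
    return to_remove


if __name__ == "__main__":
    # Dry run: list the candidates instead of deleting them.
    for name in duplicates_to_remove(os.listdir(".")):
        print("would remove:", name)  # replace with os.unlink(name) once verified
```

Because the names are sorted first, the copy that is kept is always the one whose full name sorts first. Once the printed list looks right, the print call can be swapped for os.unlink(name).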
Licensed under: CC-BY-SA with attribution