To prevent this from being closed, I have narrowed my question to just the bash script.
EDITED QUESTION
I run a small network and made a mistake in a backup routine. I have rsync
running daily, and the way it is set up, a folder renamed on the source can end up duplicated on the backup device.
rsync -varz --no-perms --exclude-from=/path/to/exclude_file --log-file=/path/to/rsync_logs
Recently a user made quite a few changes, and it resulted in a lot of duplication.
What kind of bash script strategies can I use to attack this? I've tried listing the trees recursively, outputting to files, and using diff
to compare them. That has led me to see the impact of the duplication problem. If I could use some kind of automated process to remove the duplicates, that would save me loads of time.
I started by trying something like this:
find /mnt/data/ -maxdepth 2 -mindepth 1 -type d -printf '%f\n' > data.txt
and comparing to:
find /mnt/backup/ -maxdepth 2 -mindepth 1 -type d -printf '%f\n' > backup.txt
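Building on those two listings, one way to automate the comparison (a sketch, not part of the original post; the demo data below just stands in for the real output of the two find commands) is to sort both files and use comm to print the directory names that exist only on the backup:

```shell
# Hypothetical demo data standing in for data.txt and backup.txt;
# in practice these come from the two find commands above.
printf '%s\n' 2008-07-01 2008-08-01 > data.txt
printf '%s\n' 2008-07-01 2008-08-01 7-1-08 8-1-08 > backup.txt

# comm requires sorted input; -13 suppresses lines unique to data.txt
# and lines common to both, leaving only names that exist on the
# backup but not on the source -- the cleanup candidates.
sort -o data.txt data.txt
sort -o backup.txt backup.txt
comm -13 data.txt backup.txt > backup_only.txt
cat backup_only.txt
```

The resulting backup_only.txt is a starting list for review; since renamed folders won't match by name at all, it flags both true orphans and rename-duplicates, so it still needs a content check before anything is deleted.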
An example of my problem is this:
drwxr-xr-x 0 bob staff 0 Jun 25 2009 7-1-08
drwxr-xr-x 0 bob staff 0 Jun 25 2009 2008-07-01
This is an example from the backup drive; the two directories have identical contents. The backup contains both, while the source has only this one:
drwxr-xr-x 0 bob staff 0 Jun 25 2009 2008-07-01
This kind of issue is all throughout the backup drives.
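To confirm that a pair like 7-1-08 and 2008-07-01 really are duplicates before removing one, a small helper can checksum every file in each directory and compare the results. This is a sketch of one possible approach, assuming GNU md5sum is available; the function name and the throwaway demo directories are my own, not from the post:

```shell
#!/bin/sh
# same_contents DIR1 DIR2 -- succeed if both directories hold files with
# identical relative paths and identical contents.  Checksums are taken
# relative to each directory so the differing parent names don't matter.
same_contents() {
    a=$(cd "$1" && find . -type f -exec md5sum {} + | sort) || return 1
    b=$(cd "$2" && find . -type f -exec md5sum {} + | sort) || return 1
    [ "$a" = "$b" ]
}

# Demo with throwaway directories standing in for the real backup paths.
tmp=$(mktemp -d)
mkdir "$tmp/7-1-08" "$tmp/2008-07-01"
echo "report" > "$tmp/7-1-08/file.txt"
echo "report" > "$tmp/2008-07-01/file.txt"
same_contents "$tmp/7-1-08" "$tmp/2008-07-01" && echo "duplicates"
rm -rf "$tmp"
```

Only once a pair passes this check is it safe to script the removal of the wrongly named copy.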
EDIT
I created two lists, diffed them, and then went through manually and reconciled the changes. It wasn't as bad as I originally thought, once I got into it. I gave +1s to both answers here (@Mark Pettit and @ebarrere), because I did end up using pieces from each answer. I ran several find commands in the course of this experiment, and I also altered my rsync script to be more specific. Thank you guys.
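For anyone hitting the same problem: one change that prevents this class of duplication at the source (an assumption on my part about what "more specific" could mean, not necessarily the exact change made here) is rsync's --delete option, which removes entries on the destination that no longer exist on the source, so a renamed folder doesn't leave its old copy behind. It is destructive, so rehearse with --dry-run first:

```shell
# Hypothetical mirror variant of the original command.  --delete removes
# backup entries whose source counterpart is gone; --dry-run shows what
# would happen without touching anything.  Drop --dry-run only after
# reviewing the output.
rsync -varz --no-perms --delete --dry-run \
    --exclude-from=/path/to/exclude_file \
    --log-file=/path/to/rsync_logs \
    /mnt/data/ /mnt/backup/
```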