Question

I am trying to write a bash script that removes duplicate files from a folder, keeping only one copy. The script is the following:

#!/bin/sh

for f1 in `find ./ -name "*.txt"`
do
    if test -f $f1
    then
        for f2 in `find ./ -name "*.txt"`
        do
            if [ -f $f2 ] && [ "$f1" != "$f2" ]
            then
                # if cmp $f1 $f2 &> /dev/null # DOES NOT WORK
                if cmp $f1 $f2
                then
                    rm $f2
                    echo "$f2 purged"
                fi 
            fi
        done
    fi 
done 

I want to redirect the output and stderr to /dev/null to avoid printing them to the screen. But using the commented statement, this script does not work as intended and removes all files but the first.

I'll give more information if needed.

Thanks


Solution 2

&> is bash syntax; you'll need to change the shebang line (first line) to #!/bin/bash (or the appropriate path to bash).

Or if you're really using the Bourne Shell (/bin/sh), then you have to use old-style redirection, i.e.

cmp ... >/dev/null 2>&1

Also, I think &> was only added in a later bash release, so on an older bash you may still need the old-style redirection anyway.
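Applied to the question's loop (keeping #!/bin/sh), the inner test would look roughly like this - a sketch, with the quoting also fixed as discussed in the other answer:

    if cmp "$f1" "$f2" >/dev/null 2>&1   # discard stdout and stderr, POSIX style
    then
        rm "$f2"
        echo "$f2 purged"
    fi

Alternatively, cmp -s suppresses cmp's normal output entirely.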

IHTH

Other tips

A few comments:

First, the:

for f1 in `find ./ -name "*.txt"`
do
    if test -f $f1
    then

is the same as (letting find select only plain files with the .txt extension):

for f1 in `find ./ -type f -name "*.txt"`

Better syntax (use $(...) rather than backticks) is

for f1 in $(find ./ -type f -name "*.txt")

and finally, the whole approach is wrong, because if a filename contains a space, the f1 variable will not get the full path name. So instead of the for, do:

find ./ -type f -name "*.txt" -print | while read -r f1

and, as @Sir Athos pointed out, a filename can even contain \n, so the best is to use

find . -type f -name "*.txt" -print0 | while IFS= read -r -d '' f1

Second:

Use "$f1" instead of $f1 - again, because the $f1 can contain space.

Third:

doing N*N comparisons is not very efficient. You should compute a checksum (md5, or better, sha256) for every txt file. When the checksums are identical, the files are dups.

If you don't trust checksums, simply byte-compare only the files that have identical checksums. Files with different checksums are definitely NOT duplicates. ;)

Computing checksums is slow too, so you should first compare only files with the same size. Files of different sizes are not duplicates...

You can also skip empty txt files - they are all duplicates of each other :).

so the final command can be:

find . -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate

commented:

#find all non-empty files with the txt extension and print their sizes (in bytes)
find . -not -empty -type f -name \*.txt -printf "%s\n" |\

#sort the sizes numerically, and keep only duplicated sizes
sort -rn | uniq -d |\

#for each size that is duplicated, find all files with that size and print their names (paths)
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 |\

#make an md5 checksum for them
xargs -0 md5sum |\

#sort the checksums and keep duplicated files separated with an empty line
sort | uniq -w32 --all-repeated=separate

With that output you can simply review the list and decide which files you want to remove and which you want to keep.
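If you prefer sha256 (as suggested above) over md5, the same pipeline works with sha256sum; the only other change is widening the uniq prefix to the 64-character digest. A sketch, assuming GNU coreutils:

find . -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 | xargs -0 sha256sum |\
sort | uniq -w64 --all-repeated=separate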

Credit to @kobame for this answer: this is really a comment, but posted as an answer for the sake of formatting.

You don't need to call find twice; print out both the size and the filename in a single find command:

find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
# find the files that have duplicate sizes
sort -n | uniq -Dw 8 | 
# strip off the size and get the md5 sum
cut -c 10- | xargs md5sum 

An example

$ cat a.txt
this is file a
$ cat b.txt
this is file b
$ cat c.txt
different contents 
$ cp a.txt d.txt
$ cp b.txt e.txt
$ find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
sort -n | uniq -Dw 8 | cut -c 10- | xargs md5sum 
76fd4c1589ef708d9203f3cf09cfd032  ./a.txt
e2d75fd6a1080efb6230d0608b1f9014  ./b.txt
76fd4c1589ef708d9203f3cf09cfd032  ./d.txt
e2d75fd6a1080efb6230d0608b1f9014  ./e.txt

To keep one and delete the rest, I would pipe the output into:

...  | awk '++seen[$1] > 1 {print $2}' | xargs echo rm
rm ./d.txt ./e.txt

Remove the echo if your testing is satisfactory.

Like many complex pipelines, this one will break on filenames containing spaces or newlines.
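A more robust (if less elegant) alternative that copes with any filename is a bash loop keyed on checksums - a sketch, assuming bash 4 for associative arrays; it keeps the first copy it sees of each checksum and only echoes the rm commands until you drop the echo:

#!/bin/bash

declare -A seen
while IFS= read -r -d '' f
do
    # hash the file contents via stdin so odd filenames cannot mangle md5sum's output
    sum=$(md5sum < "$f")
    sum=${sum%% *}
    if [ -n "${seen[$sum]}" ]
    then
        echo rm -- "$f"     # remove the echo when satisfied
    else
        seen[$sum]=$f
    fi
done < <(find . -not -empty -type f -name '*.txt' -print0)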

All nice answers, so only one short suggestion: you can install and use

fdupes -r .

from the man page:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

Added by @Francesco

fdupes -rf . | xargs rm -f

to remove the dupes (the -f option makes fdupes omit the first occurrence in each set, so it lists only the dupes).
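Note that piping fdupes output through xargs will again misbehave on filenames with spaces or newlines. If your fdupes is recent enough, its built-in delete mode avoids the issue (it keeps the first file of each duplicate set and deletes the rest without prompting):

fdupes -rdN .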

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow