Question

I am trying to write a bash script that removes duplicate files from a folder, keeping only one copy. The script is the following:

#!/bin/sh

for f1 in `find ./ -name "*.txt"`
do
    if test -f $f1
    then
        for f2 in `find ./ -name "*.txt"`
        do
            if [ -f $f2 ] && [ "$f1" != "$f2" ]
            then
                # if cmp $f1 $f2 &> /dev/null # DOES NOT WORK
                if cmp $f1 $f2
                then
                    rm $f2
                    echo "$f2 purged"
                fi 
            fi
        done
    fi 
done 

I want to redirect the output and stderr to /dev/null to avoid printing them to the screen. But using the commented statement, this script does not work as intended and removes all files but the first.

I'll give more information if needed.

Thanks


Solution 2

&> is bash syntax; you'll need to change the shebang line (first line) to #!/bin/bash (or the appropriate path to bash).

Or if you're really using the Bourne Shell (/bin/sh), then you have to use old-style redirection, i.e.

cmp ... >/dev/null 2>&1

Also, I think &> was only added in a later bash release, so on an older bash you may still need the old-style redirection anyway.
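Applied to the question's loop (keeping #!/bin/sh), the inner test would look roughly like this - a sketch, with the quoting also fixed as discussed in the other answer:

    if cmp "$f1" "$f2" >/dev/null 2>&1   # discard stdout and stderr, POSIX style
    then
        rm "$f2"
        echo "$f2 purged"
    fi

Alternatively, cmp -s suppresses cmp's normal output entirely.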

IHTH

Other tips

A few comments:

First, the:

for f1 in `find ./ -name "*.txt"`
do
    if test -f $f1
    then

is the same as (letting find select only plain files with the .txt extension):

for f1 in `find ./ -type f -name "*.txt"`

Better syntax (use $(...) rather than backticks) is

for f1 in $(find ./ -type f -name "*.txt")

and finally, the whole approach is wrong, because if a filename contains a space, the f1 variable will not get the full path name. So instead of the for, do:

find ./ -type f -name "*.txt" -print | while read -r f1

and, as @Sir Athos pointed out, a filename can even contain \n, so the best is to use

find . -type f -name "*.txt" -print0 | while IFS= read -r -d '' f1

Second:

Use "$f1" instead of $f1 - again, because the $f1 can contain space.

Third:

doing N*N comparisons is not very efficient. You should compute a checksum (md5, or better, sha256) for every txt file. When the checksums are identical, the files are dups.

If you don't trust checksums, simply byte-compare only the files that have identical checksums. Files with different checksums are definitely NOT duplicates. ;)

Computing checksums is slow too, so you should first compare only files with the same size. Files of different sizes are not duplicates...

You can also skip empty txt files - they are all duplicates of each other :).

so the final command can be:

find . -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate

commented:

#find all non-empty files with the txt extension and print their sizes (in bytes)
find . -not -empty -type f -name \*.txt -printf "%s\n" |\

#sort the sizes numerically, and keep only duplicated sizes
sort -rn | uniq -d |\

#for each size that is duplicated, find all files with that size and print their names (paths)
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 |\

#make an md5 checksum for them
xargs -0 md5sum |\

#sort the checksums and keep duplicated files separated with an empty line
sort | uniq -w32 --all-repeated=separate

With that output you can simply review the list and decide which files you want to remove and which you want to keep.
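If you prefer sha256 (as suggested above) over md5, the same pipeline works with sha256sum; the only other change is widening the uniq prefix to the 64-character digest. A sketch, assuming GNU coreutils:

find . -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 | xargs -0 sha256sum |\
sort | uniq -w64 --all-repeated=separate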

Credit to @kobame for this answer: this is really a comment, but posted as an answer for the sake of formatting.

You don't need to call find twice; print out both the size and the filename in a single find command:

find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
# find the files that have duplicate sizes
sort -n | uniq -Dw 8 | 
# strip off the size and get the md5 sum
cut -c 10- | xargs md5sum 

An example

$ cat a.txt
this is file a
$ cat b.txt
this is file b
$ cat c.txt
different contents 
$ cp a.txt d.txt
$ cp b.txt e.txt
$ find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
sort -n | uniq -Dw 8 | cut -c 10- | xargs md5sum 
76fd4c1589ef708d9203f3cf09cfd032  ./a.txt
e2d75fd6a1080efb6230d0608b1f9014  ./b.txt
76fd4c1589ef708d9203f3cf09cfd032  ./d.txt
e2d75fd6a1080efb6230d0608b1f9014  ./e.txt

To keep one and delete the rest, I would pipe the output into:

...  | awk '++seen[$1] > 1 {print $2}' | xargs echo rm
rm ./d.txt ./e.txt

Remove the echo if your testing is satisfactory.

Like many complex pipelines, this one will break on filenames containing spaces or newlines.
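A more robust (if less elegant) alternative that copes with any filename is a bash loop keyed on checksums - a sketch, assuming bash 4 for associative arrays; it keeps the first copy it sees of each checksum and only echoes the rm commands until you drop the echo:

#!/bin/bash

declare -A seen
while IFS= read -r -d '' f
do
    # hash the file contents via stdin so odd filenames cannot mangle md5sum's output
    sum=$(md5sum < "$f")
    sum=${sum%% *}
    if [ -n "${seen[$sum]}" ]
    then
        echo rm -- "$f"     # remove the echo when satisfied
    else
        seen[$sum]=$f
    fi
done < <(find . -not -empty -type f -name '*.txt' -print0)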

All nice answers, so only one short suggestion: you can install and use

fdupes -r .

from the man page:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

Added by @Francesco

fdupes -rf . | xargs rm -f

to remove the dupes (the -f option makes fdupes omit the first occurrence in each set, so it lists only the dupes).
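Note that piping fdupes output through xargs will again misbehave on filenames with spaces or newlines. If your fdupes is recent enough, its built-in delete mode avoids the issue (it keeps the first file of each duplicate set and deletes the rest without prompting):

fdupes -rdN .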

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow