Few comments:
First, the:
for f1 in `find ./ -name "*.txt"`
do
if test -f $f1
then
is the same as (find only plain files with the txt
extension)
for f1 in `find ./ -type f -name "*.txt"`
Better syntax (bash only) is
for f1 in $(find ./ -type f -name "*.txt")
and finally the whole is wrong, because if the filename contains a space, the f1
variable will not get the full path name. So instead the for
do:
find ./ -type f -name "*.txt" -print | while read -r f1
and as @Sir Athos pointed out, the filename can contain \n
so the best is to use
find . -type f -name "*.txt" -print0 | while IFS= read -r -d '' f1
Second:
Use "$f1"
instead of $f1
- again, because the $f1
can contain space.
Third:
doing N*N comparisons is not very effective. You should make a checksum (md5 or better sha256) for every txt
file. When the checksum is identical - the files are dups.
If you don't trust checksums, simply compare only files what has identical checksums. Files with different checksum are SURE not duplicates. ;)
Making checksums are slow to, so you should 1st compare ony files with the same size
. Different sized files are not duplicates...
You can skip empty txt files
- they are duplicates all :).
so the final command can be:
find -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find -type f -name \*.txt -size %c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
commented:
#find all non-empty file with the txt extension and print their size (in bytes)
find . -not -empty -type f -name \*.txt -printf "%s\n" |\
#sort the sizes numerically, and keep only duplicated sizes
sort -rn | uniq -d |\
#for each sizes (what are duplicated) find all files with the given size and print their name (path)
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 |\
#make an md5 checksum for them
xargs -0 md5sum |\
#sort the checksums and keep duplicated files separated with an empty line
sort | uniq -w32 --all-repeated=separate
The output now, you can simply edit the output file and decide what want remove and what file want keep.