Question
I am improving a script that lists duplicated files, which I wrote last year (see the second script if you follow the link).
The record separator of the duplicated.log output is the zero byte (\0) instead of the newline (\n). Example:
$> tr '\0' '\n' < duplicated.log
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
32 dir6/video.m4v
32 dir7/video.m4v
(In this example, the five files dir1/index.htm, ... and dir5/index.htm have the same md5sum and their size is 12 bytes. The other two files, dir6/video.m4v and dir7/video.m4v, have the same md5sum and their content size (du) is 32 bytes.)
As each line is terminated by a zero byte (\0) instead of a newline (\n), blank lines are represented as two successive zero bytes (\0\0).
I use the zero byte as line separator because a file path may itself contain newline characters.
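To illustrate why the zero byte is the safer separator, here is a minimal bash sketch. The function name count_nul_records and the demo file name are made up for the illustration; they are not part of the script discussed here.

```shell
#!/bin/bash
# count_nul_records (hypothetical helper): count NUL-terminated records on stdin.
count_nul_records() {
    local n=0 rec
    # -d '' makes read stop at each NUL byte; IFS= and -r keep the record verbatim
    while IFS= read -r -d '' rec; do
        n=$((n + 1))
    done
    printf '%s\n' "$n"
}

# Demo: a single file whose name contains a newline
demo=$(mktemp -d)
touch "$demo/two
lines.txt"
find "$demo" -type f -print0 | count_nul_records   # one file, one record
find "$demo" -type f | wc -l                       # same file, but two lines
rm -rf "$demo"
```

With newline-separated output the single file looks like two entries, whereas the NUL-separated stream keeps it as one record.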
But doing so, I face this issue: how do I 'grep' all duplicates of a specified file from duplicated.log?
(For example, how do I retrieve the duplicates of dir1/index.htm?)
I need:
$> ./youranswer.sh "dir1/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir4/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir7/video.m4v" < duplicated.log | tr '\0' '\n'
32 dir6/video.m4v
32 dir7/video.m4v
I was thinking about something like:
awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte
/filepath/ { print $0 }' duplicated.log
...but filepath may contain slashes (/) and many other symbols (quotes, newlines...).
I may have to use perl to deal with this situation...
I am open to any suggestions, questions, other ideas...
Solution
You're almost there: use the matching operator ~:
awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log
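Two caveats about the command above: ~ treats the value of pattern as a regular expression (so the dot in index.htm matches any character), and -v processes backslash escapes in the assigned value. If the path has to be matched literally, one alternative is to do the grouping in plain bash. The sketch below is only an illustration: the function name print_group is hypothetical, and it assumes each record has the form "SIZE PATH" with a single space after the size, as in the example listing in the question.

```shell
#!/bin/bash
# print_group PATH (hypothetical helper)
# Reads NUL-terminated records on stdin; groups are separated by an empty
# record (i.e. two successive NUL bytes). Prints, NUL-terminated, the whole
# group that contains a record whose path is exactly PATH.
# Assumption: each record looks like "SIZE PATH" (one space after SIZE).
print_group() {
    local target=$1 rec found=0
    local -a group=()
    while IFS= read -r -d '' rec; do
        if [[ -z $rec ]]; then
            # Empty record: end of the current group
            if (( found )); then break; fi
            group=()
            continue
        fi
        group+=("$rec")
        # ${rec#* } drops the size field; the quoted right-hand side is
        # compared as a literal string, not as a pattern
        if [[ ${rec#* } == "$target" ]]; then
            found=1
        fi
    done
    if (( found )); then
        printf '%s\0' "${group[@]}"
    fi
}
```

It would be called like the requested youranswer.sh, for example: print_group "dir1/index.htm" < duplicated.log | tr '\0' '\n'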
OTHER TIPS
I have just realized that I could use the md5sum instead of the pathname, because the new version of my script keeps the md5sum information.
This is the new format I am currently using:
$> tr '\0' '\n' < duplicated.log
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
32 fc191f86efabfca83a94d33aad2f87b4 dir6/video.m4v
32 fc191f86efabfca83a94d33aad2f87b4 dir7/video.m4v
gawk and nawk give the wanted result:
$> awk 'BEGIN { RS="\0\0" }
/89e8a208e5f06c65e6448ddeb40ad879/ { print $0 }' duplicated.log |
tr '\0' '\n'
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
But I am still open to other answers :-)
(this answer is just a workaround)
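The lookup can also start from the path and find the md5sum automatically. Here is a minimal sketch in plain bash, assuming records of the form "SIZE MD5 PATH" separated by single spaces, as in the listing above. The function name same_md5 is made up for this illustration.

```shell
#!/bin/bash
# same_md5 PATH FILE (hypothetical helper)
# FILE holds NUL-terminated records assumed to look like "SIZE MD5 PATH".
# Prints, NUL-terminated, every record sharing its MD5 with PATH.
# Returns 1 if PATH is not listed in FILE.
same_md5() {
    local target=$1 file=$2 rec rest wanted=''
    # First pass: find the md5 recorded for the requested path
    while IFS= read -r -d '' rec; do
        rest=${rec#* }                  # drop the size field
        if [[ ${rest#* } == "$target" ]]; then
            wanted=${rest%% *}          # keep the md5 field only
            break
        fi
    done < "$file"
    [[ -n $wanted ]] || return 1
    # Second pass: print every record carrying that md5
    while IFS= read -r -d '' rec; do
        rest=${rec#* }
        if [[ ${rest%% *} == "$wanted" ]]; then
            printf '%s\0' "$rec"
        fi
    done < "$file"
}
```

For example: same_md5 "dir4/index.htm" duplicated.log | tr '\0' '\n'. Because the md5 is compared as a plain string, this does not depend on the empty records between groups.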
For the curious, below is the new (horrible) script, still under construction...
#!/bin/bash
fifo=$(mktemp -u)
fif2=$(mktemp -u)
dups=$(mktemp -u)
dirs=$(mktemp -u)
menu=$(mktemp -u)
numb=$(mktemp -u)
list=$(mktemp -u)
mkfifo $fifo $fif2
# run processing in background
find . -type f -printf '%11s %P\0' | #print size and filename
tee $fifo | #write in fifo for dialog progressbox
grep -vzZ '^ 0 ' | #ignore empty files
LC_ALL=C sort -z | #sort by size
uniq -Dzw11 | #keep files having same size
while IFS= read -r -d '' line
do #for each file compute md5sum
echo -en "${line:0:11}" "\t" $(md5sum "${line:12}") "\0"
#file size + md5sum + file name + null terminated instead of '\n'
done | #keep the duplicates (same md5sum)
tee $fif2 |
uniq -zs12 -w46 --all-repeated=separate |
tee $dups |
#xargs -d '\n' du -sb 2<&- | #retrieve size of each file
gawk '
function tgmkb(size) {
if(size<1024) return int(size) ; size/=1024;
if(size<1024) return int(size) "K"; size/=1024;
if(size<1024) return int(size) "M"; size/=1024;
if(size<1024) return int(size) "G"; size/=1024;
return int(size) "T"; }
function dirname (path)
{ if(sub(/\/[^\/]*$/, "", path)) return path; else return "."; }
BEGIN { RS=ORS="\0" }
# awk strings are 1-indexed: take the 11-character size field from position 1
!/^$/ { sz=substr($0,1,11); name=substr($0,48); dir=dirname(name); sizes[dir]+=sz; files[dir]++ }
END { for(dir in sizes) print tgmkb(sizes[dir]) "\t(" files[dir] "\tfiles)\t" dir }' |
LC_ALL=C sort -zrshk1 > $dirs &
pid=$!
tr '\0' '\n' <$fifo |
dialog --title "Collecting files having same size..." --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
tr '\0' '\n' <$fif2 |
dialog --title "Computing MD5 sum" --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
wait $pid
DUPLICATES=$( grep -zac -v '^$' $dups) #total number of files concerned
UNIQUES=$( grep -zac '^$' $dups) #number of files, if all redundant are removed
DIRECTORIES=$(grep -zac . $dirs) #number of directories concerned
lins=$(tput lines)
cols=$(tput cols)
cat > $menu <<EOF
--no-shadow
--no-lines
--hline "After selection of the directory, you will choose the redundant files you want to remove"
--menu "There are $DUPLICATES duplicated files within $DIRECTORIES directories.\nThese duplicated files represent $UNIQUES unique files.\nChoose directory to proceed redundant file removal:"
$lins
$cols
$DIRECTORIES
EOF
tr '\n"' "_'" < $dirs |
gawk 'BEGIN { RS="\0" } { print FNR " \"" $0 "\" " }' >> $menu
dialog --file $menu 2> $numb
[[ $? -eq 1 ]] && exit
set -x
dir=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f4- )
md5=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f2 )
grep -zao "$dir/[^/]*$" "$dups" |
while IFS= read -r -d '' line
do
file="${line:47}"
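# read the duplicates file explicitly; without a file operand awk would consume the stdin of the loop
awk 'BEGIN { RS="\0\0" } '"/$md5/"' { print $0 }' "$dups" >> "$list"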
done
echo -e "
fifo $fifo \t dups $dups \t menu $menu
fif2 $fif2 \t dirs $dirs \t numb $numb \t list $list"
#rm -f $fifo $fif2 $dups $dirs $menu $numb