Sed command help to summarize similar log messages
26-06-2021
Question
I'm trying to craft a log file summarisation tool for an application that creates a lot of duplicate entries with only a different suffix to indicate point of execution.
Here's a genericized version:
A text file (infile_grocery.txt) with these contents:
milk skim fruit apple banana
milk skim fruit orange
milk skim fruit mango
milk skim fruit pomegranate
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple
What I'm hoping to get is:
milk skim fruit apple banana, orange, mango, pomegranate
milk 2 percent fruit cherry tomato, peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry, strawberry rhubarb
milk whole fruit pineapple
The command line I've currently cooked up is:
sed -rn "{H;x;s|^(.+) fruit ([^\n]+)\n(.*)\1 fruit (.+)$|\1 fruit \2, \4|;x}; ${x;s/^\n//;p}" infile_grocery.txt
But the results I'm getting are:
milk skim fruit apple banana, mango, strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple
I'm discarding input somehow. Any gurus with a better idea how to structure this?
Solution
This might work for you (GNU sed):
sed ':a;$!N;s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/;ta;P;D' file
Explanation:
:a
is a label that marks the start of a loop.
$!N
appends a newline followed by the next input line to the pattern space, except on the last line.
s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/
collects everything up to the newline into back reference 1 (aka \1). Within that, everything from the beginning of the line up to and including the word fruit goes into back reference 2 (aka \2). Everything following the repeated \2 after the newline goes into back reference 3 (aka \3). The match is replaced with back reference 1, a comma and back reference 3. The space after the comma comes from \3 itself, which begins with the space that followed fruit.
ta
loops back to the label :a if the substitution succeeded.
P
is reached only when the substitution failed. It prints up to and including the first newline in the pattern space.
D
is likewise reached only when the substitution failed. It deletes up to and including the first newline in the pattern space and, if anything remains, restarts the cycle with it without reading a new line.
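To see the command work end to end, here is a small sketch that recreates the question's sample input in a temporary file and runs the one-liner against it. It assumes GNU sed, since the label and branch are written on one line separated by semicolons. The variable names infile and result are only illustrative.

```shell
# Recreate the sample input from the question in a temporary file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
milk skim fruit apple banana
milk skim fruit orange
milk skim fruit mango
milk skim fruit pomegranate
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple
EOF

# Join consecutive lines that share the same "... fruit" prefix (GNU sed).
result=$(sed ':a;$!N;s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/;ta;P;D' "$infile")
printf '%s\n' "$result"

rm -f "$infile"
```

With GNU sed this should print the five summary lines the question asks for. Only consecutive lines are merged, so the two "milk whole fruit pineapple" entries stay separate.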
OTHER TIPS
This is an awk solution.
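awk -F fruit '
# Same prefix as the previous line: append only the part after "fruit"
$1 == x {
    printf ",%s", $2
    next
}
# New prefix: end the previous output line (if any) and start a new one
{
    x = $1
    printf "%s%s", (NR > 1 ? "\n" : ""), $0
}
END {
    # Terminate the final output line
    if (NR > 0) print ""
}' input.txt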
Output
milk skim fruit apple banana, orange, mango, pomegranate
milk 2 percent fruit cherry tomato, peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry, strawberry rhubarb
milk whole fruit pineapple
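The awk answer depends on how -F fruit splits each record, so a quick check of the two fields may help. This is only an illustrative sketch, and the variable names line and fields are made up here.

```shell
# Split one sample line on the separator "fruit" and show both fields in brackets.
line='milk skim fruit orange'
fields=$(printf '%s\n' "$line" | awk -F fruit '{ printf "[%s][%s]", $1, $2 }')
printf '%s\n' "$fields"
# prints [milk skim ][ orange]
```

The first field is the prefix that gets compared against x. The second field keeps its leading space, which is why printing "," followed by $2 yields ", orange".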
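A plain shell loop is another option:
opref=""
nline=""
while IFS= read -r line; do
    # Prefix: everything up to and including "fruit"
    pref=$(printf '%s\n' "$line" | sed 's/\(.*fruit\).*/\1/')
    # Item: everything after "fruit" and the space that follows it
    item=$(printf '%s\n' "$line" | sed 's/.*fruit \(.*\)/\1/')
    if [ "$opref" = "$pref" ]; then
        # Same prefix as the previous line: append the item
        nline="$nline, $item"
    else
        # New prefix: print the finished group, then start a new one
        [ -n "$nline" ] && printf '%s\n' "$nline"
        nline=$line
    fi
    opref=$pref
done < input_file
# Print the last accumulated group, which the loop itself never outputs
[ -n "$nline" ] && printf '%s\n' "$nline"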
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow