Sed command help to summarize similar log messages

https://stackoverflow.com/questions/11934271

26-06-2021

Question

I'm trying to craft a log file summarisation tool for an application that creates a lot of duplicate entries with only a different suffix to indicate point of execution.

Here's a genericized version: a text file (infile_grocery.txt) with these contents:

milk skim fruit apple banana
milk skim fruit orange
milk skim fruit mango
milk skim fruit pomegranate
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple

What I'm hoping to get is:

milk skim fruit apple banana, orange, mango, pomegranate
milk 2 percent fruit cherry tomato, peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry, strawberry rhubarb
milk whole fruit pineapple

The command line I've currently cooked up is:

sed -rn "{H;x;s|^(.+) fruit ([^\n]+)\n(.*)\1 fruit (.+)$|\1 fruit \2, \4|;x}; ${x;s/^\n//;p}" infile_grocery.txt

But the results I'm getting are:

milk skim fruit apple banana, mango, strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple

I'm discarding input somehow. Any gurus with a better idea how to structure this?

Solution

This might work for you (GNU sed):

sed ':a;$!N;s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/;ta;P;D' file

Explanation:

:a defines a label to loop back to.
$!N appends a newline followed by the next input line to the pattern space, except on the last line.
s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/ captures everything up to the newline as back reference 1 (aka \1). Within this, everything from the beginning of the line up to and including the word fruit is captured as back reference 2 (aka \2). After the newline, the second line must begin with the same text as \2; whatever follows it is captured as back reference 3 (aka \3). The whole match is replaced with back reference 1, a comma, and back reference 3 (which already begins with the space that followed fruit).
ta jumps back to the label :a if the substitution succeeded.
P runs if the substitution failed, and prints up to and including the first newline in the pattern space.
D then deletes up to and including the first newline in the pattern space and, if any text remains, restarts the cycle without reading new input.
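As a rough check of the steps above, the same script can be split into one -e expression per command, which avoids relying on GNU sed accepting `;` after a label. This is only a sketch: the function name merge_fruit and the variable sample are made up for illustration, and the sample data is the question's input.

```shell
# merge_fruit is a hypothetical helper: reads lines on stdin, writes merged lines to stdout.
merge_fruit() {
  # :a   label for the loop
  # $!N  append the next input line (except on the last line)
  # s    merge the second line into the first when both share the "... fruit" prefix
  # ta   loop again after a successful merge
  # P;D  otherwise print and drop the first line, keeping the second for the next cycle
  sed -e ':a' \
      -e '$!N' \
      -e 's/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/' \
      -e 'ta' \
      -e 'P' \
      -e 'D'
}

# Sample data copied from the question.
sample='milk skim fruit apple banana
milk skim fruit orange
milk skim fruit mango
milk skim fruit pomegranate
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple'

printf '%s\n' "$sample" | merge_fruit
```

Only adjacent lines with the same prefix are merged, which is why "milk whole fruit pineapple" appears twice in the result, as the question asks.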

OTHER TIPS

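This is an awk solution. It splits each line on the word fruit and compares the text before it with that of the previous line.

awk -F fruit '
# Same prefix as the previous line: append only the part after "fruit".
NR > 1 && $1 == x {
    printf ",%s", $2
    next
}
# New prefix: end the previous output line (if any) and start a new one.
{
    x = $1
    printf "%s%s", (NR > 1 ? "\n" : ""), $0
}
# Terminate the last output line.
END {
    if (NR > 0) print ""
}' input.txt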

Output

milk skim fruit apple banana, orange, mango, pomegranate
milk 2 percent fruit cherry tomato, peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry, strawberry rhubarb
milk whole fruit pineapple

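A shell loop that remembers the previous prefix also works:

opref=""
nline=""
while IFS= read -r line; do
  # Prefix: everything up to and including "fruit"; item: everything after it.
  pref=$(printf '%s\n' "$line" | sed 's/\(.*fruit\).*/\1/')
  item=$(printf '%s\n' "$line" | sed 's/.*fruit[[:space:]]*\(.*\)/\1/')
  if [ "$opref" = "$pref" ]; then
    nline="$nline, $item"
  else
    # Prefix changed: print the finished group and start a new one.
    [ -n "$nline" ] && printf '%s\n' "$nline"
    nline=$line
  fi
  opref=$pref
done < input_file
# Print the last group, which the loop itself never prints.
[ -n "$nline" ] && printf '%s\n' "$nline"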
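The loop above starts two sed processes per input line, which can get slow on large logs. A possible variant, sketched here with POSIX parameter expansion instead of sed, is shown below. The function name merge_fruit_sh and its variable names are made up for illustration, and it assumes every line contains the word fruit.

```shell
# merge_fruit_sh is a hypothetical helper: reads lines on stdin, writes merged lines to stdout.
# Assumption: every input line contains the word "fruit".
merge_fruit_sh() {
  prev_prefix=''
  merged=''
  have_line=0
  while IFS= read -r line || [ -n "$line" ]; do
    prefix=${line%fruit*}fruit   # text up to and including the last "fruit"
    item=${line##*fruit}         # text after the last "fruit"
    item=${item# }               # drop one leading space, if present
    if [ "$have_line" -eq 1 ] && [ "$prefix" = "$prev_prefix" ]; then
      merged="$merged, $item"
    else
      # Prefix changed: print the finished group, then start a new one.
      if [ "$have_line" -eq 1 ]; then
        printf '%s\n' "$merged"
      fi
      merged=$line
      have_line=1
    fi
    prev_prefix=$prefix
  done
  # Flush the final group, which the loop itself never prints.
  if [ "$have_line" -eq 1 ]; then
    printf '%s\n' "$merged"
  fi
}

printf '%s\n' \
  'milk skim fruit apple banana' \
  'milk skim fruit orange' \
  'milk whole fruit pineapple' | merge_fruit_sh
```

The variables are not declared local because plain POSIX sh has no such keyword, so they remain set in the calling shell after the function returns.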

Licensed under: CC-BY-SA with attribution
