How to perform calculation over a log file
Question
I have a log file that looks like this:
I, [2009-03-04T15:03:25.502546 #17925] INFO -- : [8541, 931, 0, 0]
I, [2009-03-04T15:03:26.094855 #17925] INFO -- : [8545, 6678, 0, 0]
I, [2009-03-04T15:03:26.353079 #17925] INFO -- : [5448, 1598, 185, 0]
I, [2009-03-04T15:03:26.360148 #17925] INFO -- : [8555, 1747, 0, 0]
I, [2009-03-04T15:03:26.367523 #17925] INFO -- : [7630, 278, 0, 0]
I, [2009-03-04T15:03:26.375845 #17925] INFO -- : [7640, 286, 0, 0]
I, [2009-03-04T15:03:26.562425 #17925] INFO -- : [5721, 896, 0, 0]
I, [2009-03-04T15:03:30.951336 #17925] INFO -- : [8551, 4752, 1587, 1]
I, [2009-03-04T15:03:30.960007 #17925] INFO -- : [5709, 5295, 0, 0]
I, [2009-03-04T15:03:30.966612 #17925] INFO -- : [7252, 4928, 0, 0]
I, [2009-03-04T15:03:30.974251 #17925] INFO -- : [8561, 4883, 1, 0]
I, [2009-03-04T15:03:31.230426 #17925] INFO -- : [8563, 3866, 250, 0]
I, [2009-03-04T15:03:31.236830 #17925] INFO -- : [8567, 4122, 0, 0]
I, [2009-03-04T15:03:32.056901 #17925] INFO -- : [5696, 5902, 526, 1]
I, [2009-03-04T15:03:32.086004 #17925] INFO -- : [5805, 793, 0, 0]
I, [2009-03-04T15:03:32.110039 #17925] INFO -- : [5786, 818, 0, 0]
I, [2009-03-04T15:03:32.131433 #17925] INFO -- : [5777, 840, 0, 0]
I'd like to create a shell script that calculates the average of the 2nd and 3rd fields in brackets (840 and 0 in the last example). An even tougher question: is it possible to get the average of the 3rd field only when the last one is not 0?
I know I could use Ruby or another language to create a script, but I'd like to do it in Bash. Any good suggestions on resources or hints on how to create such a script would help.
Solution
Posting the reply I pasted to you over IM here too, just because it makes me try StackOverflow out :)
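# replace $2 with the column you want to avg
# (in the sample log above, the 2nd bracketed value is whitespace-separated field $8);
# the log is given to awk, so that perl reads awk's output from the pipe
awk '{ print $2 }' log | perl -ne 'END{ printf "%.2f\n", $total/$n }; chomp; $total+= $_; $n++'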
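To see which column number applies to this particular log, it can help to split one sample line on whitespace the way awk does by default. The sketch below uses Python rather than awk, and the helper name awk_fields is made up for illustration. It suggests that the second bracketed value lands in field 8 (with a trailing comma), while field 2 holds the start of the timestamp.

```python
# Illustrative only: mimic awk's default whitespace field splitting.
line = "I, [2009-03-04T15:03:25.502546 #17925] INFO -- : [8541, 931, 0, 0]"


def awk_fields(text):
    """Return a dict mapping awk-style 1-based field numbers to field text."""
    return {i: field for i, field in enumerate(text.split(), start=1)}


fields = awk_fields(line)
for number, field in fields.items():
    print(f"${number} = {field}")
```

The trailing comma on a field such as "931," should not matter for the sum, since both awk and Perl use the leading numeric part of a string in arithmetic.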
OTHER TIPS
Use bash and awk:
cat file | sed -ne 's:^.*INFO.*\[\([0-9, ]*\)\][ \r]*$:\1:p' | awk -F ' *, *' '{ sum2 += $2 ; sum3 += $3 } END { if (NR>0) printf "avg2=%.2f, avg3=%.2f\n", sum2/NR, sum3/NR }'
Sample output (for your original data):
avg2=2859.59, avg3=149.94
Of course, you do not need to use cat; it is included there for legibility and to illustrate the fact that input data can come from any pipe. If you have to operate on an existing file, run sed -ne '...' file | ... directly.
EDIT
If you have access to gawk (GNU awk), you can eliminate the need for sed as follows:
cat file | gawk '{ if(match($0, /.*INFO.*\[([0-9, ]*)\][ \r]*$/, a)) { cnt++; split(a[1], b, / *, */); sum2+=b[2]; sum3+=b[3] } } END { if (cnt>0) printf "avg2=%.2f, avg3=%.2f\n", sum2/cnt, sum3/cnt }'
Same remarks re. cat apply.
A bit of explanation:
- sed only prints out lines (-n ... :p combination) that match the regular expression (lines containing INFO followed by any combination of digits, spaces and commas between square brackets at the end of the line, allowing for trailing spaces and CR); if any such line matches, only keep what's between the square brackets (\1, corresponding to what's between \(...\) in the regular expression) before printing (:p)
  - sed will output lines that look like: 8541, 931, 0, 0
- awk uses a comma surrounded by 0 or more spaces (-F ' *, *') as field delimiters; $1 corresponds to the first column (e.g. 8541), $2 to the second etc. Missing columns count as value 0
  - at the end, awk divides the accumulators sum2 etc by the number of records processed, NR
- gawk does everything in one shot; it will first test whether each line matches the same regular expression passed in the previous example to sed (except that unlike sed, awk does not require a \ in front of the round brackets delimiting areas of interest). If the line matches, what's between the round brackets ends up in a[1], which we then split using the same separator (a comma surrounded by any number of spaces) and use that to accumulate. I introduced cnt instead of continuing to use NR because the number of records processed, NR, may be larger than the actual number of relevant records (cnt) if not all lines are of the form INFO ... [...comma-separated-numbers...], which was not the case with sed|awk since sed guaranteed that all lines passed on to awk were relevant.
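For readers who find the one-liner hard to follow, here is a rough Python sketch of the same match-then-accumulate idea, including the cnt versus NR distinction. It is not a translation the original answer provides: the function names averages and col are made up, and the "W, some unrelated line" entry is an invented example of a line that should be skipped.

```python
import re

# Same pattern idea as the sed/gawk examples: INFO, then a bracketed list of
# digits, commas and spaces at the end of the line (trailing spaces/CR allowed).
PATTERN = re.compile(r".*INFO.*\[([0-9, ]*)\][ \r]*$")


def col(cols, i):
    """Return column i as an int; like awk, treat a missing column as 0."""
    return int(cols[i]) if i < len(cols) and cols[i].strip() else 0


def averages(lines):
    """Return (avg2, avg3, cnt, total) or None if no line matched."""
    cnt = total = sum2 = sum3 = 0
    for line in lines:
        total += 1                          # plays the role of awk's NR
        m = PATTERN.match(line.rstrip("\n"))
        if not m:
            continue                        # irrelevant line: not counted in cnt
        cols = re.split(r" *, *", m.group(1))
        sum2 += col(cols, 1)                # gawk's b[2]
        sum3 += col(cols, 2)                # gawk's b[3]
        cnt += 1
    if cnt == 0:
        return None
    return sum2 / cnt, sum3 / cnt, cnt, total


sample = [
    "I, [2009-03-04T15:03:25.502546 #17925] INFO -- : [8541, 931, 0, 0]\n",
    "W, some unrelated line\n",             # hypothetical non-matching line
    "I, [2009-03-04T15:03:26.353079 #17925] INFO -- : [5448, 1598, 185, 0]\r\n",
]
print(averages(sample))  # (1264.5, 92.5, 2, 3)
```

Dividing by the line total (3) instead of cnt (2) would give the wrong averages here, which is the reason for the separate counter.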
Use nawk or /usr/xpg4/bin/awk on Solaris.
awk -F'[],]' 'END {
  print s/NR, t/ct
}
{
  s += $(NF-3)
  if ($(NF-1)) {
    t += $(NF-2)
    ct++
  }
}' infile
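The script above comes without explanation, so here is a hedged Python sketch of what it appears to do: split each line on ] and , and then index fields from the end of the line. The function name conditional_averages is made up, and the None results for empty input are my own addition (the awk version would fail on a division by zero instead).

```python
import re


def conditional_averages(lines):
    """Average the 2nd bracketed value over all lines, and the 3rd only
    where the last value is non-zero (mirrors the awk script's s, t, ct)."""
    s = t = ct = n = 0
    for line in lines:
        # Same separators as awk -F'[],]': a closing bracket or a comma.
        fields = re.split(r"[\],]", line.rstrip("\r\n"))
        # The line ends with ']', so the final field is empty, as in awk.
        s += int(fields[-4])            # awk: $(NF-3), the 2nd bracketed value
        if int(fields[-2]) != 0:        # awk: $(NF-1), the last bracketed value
            t += int(fields[-3])        # awk: $(NF-2), the 3rd bracketed value
            ct += 1
        n += 1                          # awk: NR
    return (s / n if n else None), (t / ct if ct else None)


sample = [
    "I, [2009-03-04T15:03:26.562425 #17925] INFO -- : [5721, 896, 0, 0]",
    "I, [2009-03-04T15:03:30.951336 #17925] INFO -- : [8551, 4752, 1587, 1]",
    "I, [2009-03-04T15:03:32.056901 #17925] INFO -- : [5696, 5902, 526, 1]",
]
print(conditional_averages(sample))  # (3850.0, 1056.5)
```

Indexing from the end of the line is what lets the script ignore the comma after the leading "I" and the bracketed timestamp.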
Use Python
# Python 3
sum2, count2 = 0, 0
sum3, count3 = 0, 0
with open("somelogfile.log", "r") as logfile:
    for line in logfile:
        # find right-most brackets
        _, bracket, fieldtext = line.rpartition('[')
        datatext, bracket, _ = fieldtext.partition(']')
        # split fields and convert to integers
        data = list(map(int, datatext.split(',')))
        # compute sums and counts
        sum2 += data[1]
        count2 += 1
        # the 3rd field only counts when the last field is not 0
        if data[3] != 0:
            sum3 += data[2]
            count3 += 1
print(sum2, count2, sum2 / count2)
print(sum3, count3, sum3 / count3)