How to perform calculation over a log file

https://stackoverflow.com/questions/612906

03-07-2019

Question

I have a log file that looks like this:

I, [2009-03-04T15:03:25.502546 #17925]  INFO -- : [8541, 931, 0, 0]
I, [2009-03-04T15:03:26.094855 #17925]  INFO -- : [8545, 6678, 0, 0]
I, [2009-03-04T15:03:26.353079 #17925]  INFO -- : [5448, 1598, 185, 0]
I, [2009-03-04T15:03:26.360148 #17925]  INFO -- : [8555, 1747, 0, 0]
I, [2009-03-04T15:03:26.367523 #17925]  INFO -- : [7630, 278, 0, 0]
I, [2009-03-04T15:03:26.375845 #17925]  INFO -- : [7640, 286, 0, 0]
I, [2009-03-04T15:03:26.562425 #17925]  INFO -- : [5721, 896, 0, 0]
I, [2009-03-04T15:03:30.951336 #17925]  INFO -- : [8551, 4752, 1587, 1]
I, [2009-03-04T15:03:30.960007 #17925]  INFO -- : [5709, 5295, 0, 0]
I, [2009-03-04T15:03:30.966612 #17925]  INFO -- : [7252, 4928, 0, 0]
I, [2009-03-04T15:03:30.974251 #17925]  INFO -- : [8561, 4883, 1, 0]
I, [2009-03-04T15:03:31.230426 #17925]  INFO -- : [8563, 3866, 250, 0]
I, [2009-03-04T15:03:31.236830 #17925]  INFO -- : [8567, 4122, 0, 0]
I, [2009-03-04T15:03:32.056901 #17925]  INFO -- : [5696, 5902, 526, 1]
I, [2009-03-04T15:03:32.086004 #17925]  INFO -- : [5805, 793, 0, 0]
I, [2009-03-04T15:03:32.110039 #17925]  INFO -- : [5786, 818, 0, 0]
I, [2009-03-04T15:03:32.131433 #17925]  INFO -- : [5777, 840, 0, 0]

I'd like to create a shell script that calculates the average of the 2nd and 3rd fields in brackets (840 and 0 in the last example). An even tougher question: is it possible to get the average of the 3rd field only when the last one is not 0?

I know I could use Ruby or another language to write a script, but I'd like to do it in Bash. Any good suggestions for resources, or hints on how to create such a script, would help.

Solution

Posting the reply I pasted to you over IM here too, just because it makes me try StackOverflow out :)

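# replace $2 with the column you want to avg (awk splits on whitespace, so in the sample log the 2nd bracketed value is $8);
# the file name goes to awk: a trailing "< log" would redirect perl's stdin instead and bypass the pipe
awk '{ print $2 }' log | perl -ne 'END{ printf "%.2f\n", $total/$n }; chomp; $total+= $_; $n++'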

OTHER TIPS

Use bash and awk:

cat file | sed -ne 's:^.*INFO.*\[\([0-9, ]*\)\][ \r]*$:\1:p' | awk -F ' *, *' '{ sum2 += $2 ; sum3 += $3 } END { if (NR>0) printf "avg2=%.2f, avg3=%.2f\n", sum2/NR, sum3/NR }'

Sample output (for your original data):

avg2=2859.59, avg3=149.94

Of course, you do not need to use cat; it is included there for legibility and to illustrate that the input data can come from any pipe. If you have to operate on an existing file, run sed -ne '...' file | ... directly.
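As a sanity check on the sample output above, the arithmetic can be reproduced by hand. This is only a sketch: the two lists are the 2nd and 3rd bracketed values copied from the 17 log lines in the question, and the variable names are mine.

```python
# 2nd and 3rd bracketed values, copied from the 17 sample log lines
# in the question (in order of appearance).
second = [931, 6678, 1598, 1747, 278, 286, 896, 4752, 5295,
          4928, 4883, 3866, 4122, 5902, 793, 818, 840]
third = [0, 0, 185, 0, 0, 0, 0, 1587, 0,
         0, 1, 250, 0, 526, 0, 0, 0]

# Plain averages over all records, like sum2/NR and sum3/NR in awk.
avg2 = sum(second) / len(second)
avg3 = sum(third) / len(third)

summary = "avg2=%.2f, avg3=%.2f" % (avg2, avg3)
print(summary)
```

The sums are 48613 and 2549 over 17 records, which is where the two averages come from.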

EDIT

If you have access to gawk (GNU awk), you can eliminate the need for sed as follows:

cat file | gawk '{ if(match($0, /.*INFO.*\[([0-9, ]*)\][ \r]*$/, a)) { cnt++; split(a[1], b, / *, */); sum2+=b[2]; sum3+=b[3] } } END { if (cnt>0) printf "avg2=%.2f, avg3=%.2f\n", sum2/cnt, sum3/cnt }'

Same remarks re. cat apply.

A bit of explanation:

sed prints only the lines (the -n ... :p combination) that match the regular expression: lines containing INFO followed by any combination of digits, spaces and commas between square brackets at the end of the line, allowing for trailing spaces and CR. For each matching line, it keeps only what is between the square brackets (\1, corresponding to what is between \(...\) in the regular expression) before printing (:p).
- sed will output lines that look like: 8541, 931, 0, 0
awk uses a comma surrounded by 0 or more spaces (-F ' *, *') as the field delimiter; $1 corresponds to the first column (e.g. 8541), $2 to the second, etc. Missing columns count as value 0.
- at the end, awk divides the accumulators sum2 etc. by the number of records processed, NR
gawk does everything in one shot. It first tests whether each line matches the same regular expression passed to sed in the previous example (except that, unlike sed, awk does not require a \ in front of the round brackets delimiting areas of interest). If the line matches, what is between the round brackets ends up in a[1], which we then split using the same separator (a comma surrounded by any number of spaces) and use to accumulate. I introduced cnt instead of continuing to use NR because the number of records processed, NR, may be larger than the actual number of relevant records (cnt) if not all lines are of the form INFO ... [...comma-separated-numbers...]. That was not an issue with sed|awk, since sed guaranteed that all lines passed on to awk were relevant.
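For readers who find the one-liners dense, the same steps (match, extract the bracketed list, split, accumulate, divide by cnt) can be sketched in Python. This is an illustration, not a replacement for the commands above: the function and helper names are mine, and the "unrelated" line in the demo is made up to show that non-matching lines are skipped.

```python
import re

# Same pattern as in the sed/gawk commands: INFO, then a bracketed list of
# digits, commas and spaces at the end of the line (trailing spaces/CR allowed).
PATTERN = re.compile(r".*INFO.*\[([0-9, ]*)\][ \r]*$")


def field(fields, index):
    """Return fields[index] as an int; missing or empty columns count as 0 (as in awk)."""
    if index < len(fields) and fields[index].strip():
        return int(fields[index])
    return 0


def averages(lines):
    """Return (avg2, avg3) over the matching lines, or None if no line matched."""
    cnt = sum2 = sum3 = 0
    for line in lines:
        m = PATTERN.match(line.rstrip("\n"))
        if not m:
            continue  # irrelevant line: not counted, which is why cnt replaces NR
        # Split on a comma surrounded by any number of spaces.
        fields = re.split(r" *, *", m.group(1))
        cnt += 1
        sum2 += field(fields, 1)  # 2nd column
        sum3 += field(fields, 2)  # 3rd column
    if cnt == 0:
        return None
    return sum2 / cnt, sum3 / cnt


sample = [
    "I, [2009-03-04T15:03:32.110039 #17925]  INFO -- : [5786, 818, 0, 0]",
    "I, [2009-03-04T15:03:32.131433 #17925]  INFO -- : [5777, 840, 0, 0]",
    "some unrelated line",  # hypothetical line, not from the original log
]
result = averages(sample)
print(result)
```

Only the two INFO lines are counted here, so the averages are taken over 2 records rather than 3.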

Use nawk or /usr/xpg4/bin/awk on Solaris.

awk -F'[],]' 'END { 
  print s/NR, t/ct 
  }  
{ 
  s += $(NF-3) 
  if ($(NF-1)) {
    t += $(NF-2)
    ct++
    }
  }' infile
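The awk script above splits each line on ] and , and then counts fields from the end of the line, so the timestamp bracket at the start does not matter. It averages the 2nd value over all lines, and the 3rd value only over lines whose last value is non-zero. A rough Python sketch of that idea follows; the function name is mine, and unlike the awk version it skips lines with too few fields and returns None instead of dividing by zero.

```python
import re


def conditional_averages(lines):
    """Return (avg of 2nd value, avg of 3rd value where the last value is non-zero)."""
    s = n = 0    # sum and count for the 2nd bracketed value
    t = ct = 0   # sum and count for the 3rd value, last value != 0 only
    for line in lines:
        # Split on ']' or ',', like awk -F'[],]'. The final field is the
        # empty string after the closing bracket.
        fields = re.split(r"[\],]", line.rstrip("\r\n"))
        if len(fields) < 5:
            continue  # assumption: ignore lines that do not have enough fields
        s += int(fields[-4])        # awk: $(NF-3)
        n += 1
        if int(fields[-2]) != 0:    # awk: $(NF-1)
            t += int(fields[-3])    # awk: $(NF-2)
            ct += 1
    return (s / n if n else None, t / ct if ct else None)


# Three lines taken from the sample log in the question.
sample = [
    "I, [2009-03-04T15:03:30.951336 #17925]  INFO -- : [8551, 4752, 1587, 1]",
    "I, [2009-03-04T15:03:30.960007 #17925]  INFO -- : [5709, 5295, 0, 0]",
    "I, [2009-03-04T15:03:32.056901 #17925]  INFO -- : [5696, 5902, 526, 1]",
]
avg2, avg3_nonzero = conditional_averages(sample)
print(avg2, avg3_nonzero)
```

In the demo the second average uses only the two lines ending in 1, i.e. (1587 + 526) / 2.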

Use Python

sum2, count2 = 0, 0
sum3, count3 = 0, 0
with open("somelogfile.log", "r") as logfile:
    for line in logfile:
        # find right-most brackets
        _, bracket, fieldtext = line.rpartition('[')
        datatext, bracket, _ = fieldtext.partition(']')
        # split fields and convert to integers
        data = [int(field) for field in datatext.split(',')]
        # 2nd field: sum and count every line
        sum2 += data[1]
        count2 += 1
        # 3rd field: sum and count only when the last field is not 0
        if data[3] != 0:
            sum3 += data[2]
            count3 += 1

# print sum, count and average (Python 3 syntax); None if nothing was counted
print(sum2, count2, sum2 / count2 if count2 else None)
print(sum3, count3, sum3 / count3 if count3 else None)

Licensed under: CC-BY-SA with attribution
