awk search and calculate standard deviation different results

https://stackoverflow.com/questions/12310075

30-06-2021
|

Domanda

I am working to take the output of sar and calculate the standard deviation of a column. I can perform this successfully with a single column in a file. However when I calculate this same column in a file where I am stripping out the 'bad' lines like the title lines and avg lines, it is giving me a different value.

Here are the files I am performing this on:

/tmp/saru.tmp

# cat /tmp/saru.tmp
Linux 2.6.32-279.el6.x86_64 (progserver)        09/06/2012      _x86_64_        (4 CPU)

11:09:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:10:01 PM     all      0.01      0.00      0.05      0.01      0.00     99.93
11:11:01 PM     all      0.01      0.00      0.06      0.00      0.00     99.92
11:12:01 PM     all      0.01      0.00      0.05      0.01      0.00     99.93
11:13:01 PM     all      0.01      0.00      0.05      0.00      0.00     99.93
11:14:01 PM     all      0.01      0.00      0.04      0.00      0.00     99.95
11:15:01 PM     all      0.01      0.00      0.06      0.00      0.00     99.92
11:16:01 PM     all      0.01      0.00      2.64      0.01      0.01     97.33
11:17:01 PM     all      0.02      0.00     21.96      0.00      0.08     77.94
11:18:01 PM     all      0.02      0.00     21.99      0.00      0.08     77.91
11:19:01 PM     all      0.02      0.00     22.10      0.00      0.09     77.78
11:20:01 PM     all      0.02      0.00     22.06      0.00      0.09     77.83
11:21:01 PM     all      0.02      0.00     22.10      0.03      0.11     77.75
11:22:01 PM     all      0.01      0.00     21.94      0.00      0.09     77.95
11:23:01 PM     all      0.02      0.00     22.15      0.00      0.10     77.73
11:24:01 PM     all      0.02      0.00     22.02      0.00      0.09     77.87
11:25:01 PM     all      0.02      0.00     22.03      0.00      0.13     77.82
11:26:01 PM     all      0.02      0.00     21.96      0.01      0.14     77.86
11:27:01 PM     all      0.02      0.00     22.00      0.00      0.09     77.89
11:28:01 PM     all      0.02      0.00     21.91      0.00      0.09     77.98
11:29:01 PM     all      0.03      0.00     22.02      0.02      0.08     77.85
11:30:01 PM     all      0.14      0.00     22.23      0.01      0.13     77.48
11:31:01 PM     all      0.02      0.00     22.26      0.00      0.16     77.56
11:32:01 PM     all      0.03      0.00     22.04      0.01      0.10     77.83
Average:        all      0.02      0.00     15.29      0.01      0.07     84.61

/tmp/sarustriped.tmp

# cat /tmp/sarustriped.tmp                              
0.05
0.06
0.05
0.05
0.04
0.06
2.64
21.96
21.99
22.10
22.06
22.10
21.94
22.15
22.02
22.03
21.96
22.00
21.91
22.02
22.23
22.26
22.04

The Calculation based on /tmp/saru.tmp:

# awk  '$1~/^[01]/ && $6~/^[0-9]/ {sum+=$6; array[NR]=$6} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' /tmp/saru.tmp
10.7126

The Calculation based on /tmp/sarustriped.tmp ( the correct one )

# awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' /tmp/sarustriped.tmp
9.96397

Could someone assist and tell me why these results are different and is there a way to get the corrected results with a single awk command. I am trying to do this for performance so not using a separate command like grep or another awk command is preferable.

Thanks!

UPDATE

so I tried this ...

awk  '
  $1~/^[01]/ && $6~/^[0-9]/ {
    numrec += 1
    sum    += $6
    array[numrec] = $6
  } 
  END {
    for(x=1; x<=numrec; x++)
      sumsq += ((array[x]-(sum/numrec))^2)
    print sqrt(sumsq/numrec)
  }
' saru.tmp

and it works correctly for the sar -u output I was working with. I do not see why it would not work with other 'lists'. To make it short, trying to work with sar -r column 5. it is giving a wrong answer again... Output is giving 1.68891 but actual deviation is .107374... this is the same command that worked with sar -u..... if you need files I can provide. Just not sure how to make a new 'full' comment... so i just edited the old one...thanks!

Soluzione

I think the bug is that your first awk line (the one that operates on saru.tmp) does not ignore the invalid lines, so when you do math using NR your result depends on the number of skipped lines. When you remove all of the invalid/skipped lines the result is the same from both programs. So in the first command, you should use the number of valid lines rather than NR in your math.

How about this?

awk '
  $1 ~ /^[01]/ && $6~/^[0-9]/ {
    numrec       += 1
    sum          += $6
    array[numrec] = $6
  } 
  END {
    for(x=1; x<=numrec; x++)
      sumsq += (array[x]-(sum/numrec))^2
    print sqrt(sumsq/numrec)
  }
' saru.tmp

Altri suggerimenti

For debugging problems like this, the simplest technique is to print out some basic data. You might print the number of items, and the sum of the values, and the sum of the squares of the values (or sum of the squares of the deviations from the mean). This will likely tell you what's different between the two runs. Sometimes, it might help to print out the values you're accumulating as you're accumulating the data. If I had to guess, I'd suspect you are counting inappropriate lines (blanks, or the decoration lines), so the counts are different (and maybe the sums too).

I have a couple of (non-standard) programs to do the calculations. Given the 23 relevant lines from the multi-column output in a file data, I ran:

$ colnum -c 6 data | pstats
# Count    = 23
# Sum(x1)  =  3.557200e+02
# Sum(x2)  =  7.785051e+03
# Mean     =  1.546609e+01
# Std Dev  =  1.018790e+01
# Variance =  1.037934e+02
# Min      =  4.000000e-02
# Max      =  2.226000e+01
$

The standard deviation here is the sample standard deviation rather than the population standard deviation; the difference is dividing by (N-1) for the sample and N for the population.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow