How can I calculate the median and standard deviation of a stream of numbers in Perl?
18-09-2019
Question
In our logfiles we store response times for the requests. What's the most efficient way to calculate the median response time, the "75/90/95% of requests were served in less than N time" numbers, etc.? (I guess a variation of my question is: what's the best way to calculate the median and standard deviation of a stream of numbers?)
The best I came up with was just reading all the numbers, ordering them and then picking out the numbers, but that seems really goofy. Isn't there a smarter way?
We use Perl, but solutions for any language might be helpful.
Solution
See the article Calculating Percentiles in Memory-bound Applications. It explains how to calculate median and other percentiles efficiently.
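One common memory-bound approach is to count values in fixed-width buckets and read percentiles off the cumulative counts. The sketch below is only an illustration of that idea and is not necessarily the method the article describes. It is written in Python (the question allows any language) and ports directly to a Perl hash of counters. The class name `BucketedPercentiles` and the 10-unit bucket width are assumptions made for the example.

```python
import math
from collections import Counter


class BucketedPercentiles:
    """Approximate percentiles from fixed-width buckets (hypothetical helper)."""

    def __init__(self, bucket_width=10):
        self.bucket_width = bucket_width  # assumed resolution, e.g. 10 ms
        self.counts = Counter()           # bucket index -> number of values
        self.total = 0

    def add(self, value):
        # Only the bucket index is stored, never the raw value,
        # so memory grows with the number of buckets, not the number of requests.
        self.counts[int(value // self.bucket_width)] += 1
        self.total += 1

    def percentile(self, p):
        """Upper edge of the bucket containing the p-th percentile (nearest rank)."""
        if self.total == 0:
            raise ValueError("no data")
        rank = max(1, math.ceil(p * self.total / 100))
        seen = 0
        for bucket in sorted(self.counts):
            seen += self.counts[bucket]
            if seen >= rank:
                return (bucket + 1) * self.bucket_width
        raise AssertionError("unreachable: rank never exceeds total")
```

Calling `add()` once per log line and then `percentile(50)`, `percentile(90)`, and so on gives answers of the form "p% of requests were served in less than N", accurate to one bucket width.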
Also, here's an article on calculating standard deviation (variance) as you go: Accurately computing running variance.
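For the running variance, a minimal sketch of the single-pass update usually known as Welford's algorithm looks like this. The class and method names are hypothetical, and the code is Python rather than Perl, but it is only a handful of arithmetic operations per value and translates line by line.

```python
import math


class RunningStats:
    """Single-pass mean and variance (Welford-style update); names are illustrative."""

    def __init__(self):
        self.n = 0       # number of values seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old mean (in delta) and the updated mean, which keeps the sum stable.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance (divides by n - 1); defined as 0.0 for fewer than two values.
        if self.n < 2:
            return 0.0
        return self.m2 / (self.n - 1)

    def stddev(self):
        return math.sqrt(self.variance())
```

Nothing is stored except three numbers, so this works on a log of any size, and it avoids the cancellation problems of the naive "sum of squares minus square of sum" formula.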
OTHER TIPS
You can have a look at quickselect:
http://en.wikipedia.org/wiki/Selection_algorithm
Or at the Wirth algorithm: http://www.mail-archive.com/numpy-discussion@scipy.org/msg20059.html
Benchmarks for several median algorithms can be found here: http://ndevilla.free.fr/median/median/index.html
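To give an idea of what quickselect does, here is a minimal sketch in Python (the question allows any language). It uses a random pivot and a three-way partition. The function name and the 0-based `k` convention are choices made for this example, not taken from the links above.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)  # work on a copy so the caller's list is untouched
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi] around the pivot value.
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        # Now items[lo..lt-1] < pivot, items[lt..gt] == pivot, items[gt+1..hi] > pivot.
        if k < lt:
            hi = lt - 1
        elif k > gt:
            lo = gt + 1
        else:
            return pivot
```

For the median, call it with `k = (len(values) - 1) // 2` (the lower median for even lengths). It runs in linear time on average instead of the O(n log n) of a full sort, but it still needs all the numbers in memory.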
Have a look at PDL, the Perl Data Language.
Also see these previous SO questions about mean/std dev:
- How to efficiently calculate a running standard deviation?
- How can I get the average and standard deviations grouped by key?
- Is there a Perl statistics package that doesn’t make me load the entire dataset at once?
/I3az/
There are code examples here: http://rosettacode.org/wiki/Standard_Deviation