Question

My program listens for incoming data, and roughly 5 data points arrive every second. All of the data is stored in a data structure. Once the structure holds 360 000 items, I need to find the 25th, 50th, and 75th percentiles of the stored data.

Which of the following would be more efficient? Or, if you know a better method, please help me out.

Should I use an order statistic tree? Insertion and deletion are O(log n).

Or should I wait until all 360 000 data points have been collected, then sort them and read the 25th, 50th, and 75th percentiles from the sorted list?
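For reference, a minimal sketch of the sort-and-index option (Python, with random numbers standing in for the real stream):

```python
import random

# Stand-in for the 360 000 collected samples.
data = [random.random() for _ in range(360_000)]

data.sort()  # one O(n log n) sort after collection

n = len(data)
p25 = data[n // 4 - 1]       # 90 000th smallest (1-based)
p50 = data[n // 2 - 1]       # 180 000th smallest
p75 = data[3 * n // 4 - 1]   # 270 000th smallest
print(p25, p50, p75)
```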

Solution

You could use a selection algorithm (quickselect) to find the different percentiles.

In your problem you know you need to find the elements at positions 90k, 180k, and 270k of the sorted order. Once all 360k elements have been collected, pick a random pivot element and partition the elements into three sublists: those smaller than, equal to, and greater than the pivot. The sizes of those sublists tell you which position the pivot landed on, so you can then recurse into either the smaller or the bigger sublist, depending on which percentile you are looking for.
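A minimal sketch of that idea in Python (the function name and the driver at the bottom are illustrative, not a fixed API):

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element of items (k is 1-based)."""
    # Partition around a random pivot, then recurse only into the
    # sublist that still contains position k.
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    bigger = [x for x in items if x > pivot]

    if k <= len(smaller):
        return quickselect(smaller, k)
    if k <= len(smaller) + len(equal):
        return pivot  # position k falls on the pivot itself
    return quickselect(bigger, k - len(smaller) - len(equal))

# The three percentiles for n = 360 000 (random data as a stand-in):
data = [random.random() for _ in range(360_000)]
p25 = quickselect(data, 90_000)
p50 = quickselect(data, 180_000)
p75 = quickselect(data, 270_000)
```

Each call does expected O(n) work; you could also reuse the partitions from one call to narrow the search for the others, but three independent calls are already linear in expectation.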

In the best case this runs in O(n): a single partition pass could land the pivot exactly on the percentile you want, but that is very unlikely. In the worst case you could always pick the smallest element as the pivot, requiring O(n) passes and O(n^2) total work, but that is very unlikely too. Luckily, the expected running time turns out to be T(n) <= 8n, which is linear.

As a tip, you could track the min and max while the data streams in, and then use (max + min) / 2 as the first pivot instead of a random element. Of course, this assumes the values are roughly symmetrically distributed (e.g., close to normal) and not heavily skewed.
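A small sketch of that tip, in the same Python setting as above (StreamStats and its wiring are illustrative names, not a real API):

```python
class StreamStats:
    """Collect samples and track the running min/max as they arrive."""

    def __init__(self):
        self.data = []
        self.lo = float("inf")
        self.hi = float("-inf")

    def add(self, x):
        self.data.append(x)
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

stream = StreamStats()
for x in (0.3, 0.7, 0.1):  # stand-in for the ~5 samples/second stream
    stream.add(x)

# Seed the first partition with the midpoint instead of a random pivot;
# only a good guess when the data are not heavily skewed.
first_pivot = (stream.lo + stream.hi) / 2
```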

If you need more details on the algorithm, have a look here: http://cseweb.ucsd.edu/~dasgupta/103/4a.pdf
