Question

Task

I want to approximate the median of a given distribution $D$ that I can sample from.

A simple algorithm for this, using $n$ samples, is:

def sample_median(D, n):
    samples = sorted(D.sample() for _ in range(n))  # generate n samples from D
    return samples[n // 2]  # the middle sample (integer division)

However, I am looking for an algorithm that requires less than $O(n)$ space.

Ideas

I have looked into these algorithms:

  • Median of medians: Needs $O(n)$ space, so it does not work for me.
  • Randomized median: It seems like this could be easily generalized to an algorithm that uses $O(n^{3/4})$ space.

Are there any other algorithms that use less than $O(n)$ space that could solve my problem? In particular, I was thinking there may be an algorithm that uses $O(m)$ space by generating batches of samples from $D$ of size $m$...

Details

  • Ideally, I am looking for a reference to an algorithm that also includes analysis (success probability, expected runtime, etc.).
  • Actually, I need an algorithm to estimate $D$'s $p$-th percentile for a given $p$, but I am hoping most median-finding algorithms can be generalized to that (the simple algorithm above certainly does; see the sketch after this list).
  • I would like to achieve the same accuracy as the simple algorithm shown above. One way to achieve this is to use an algorithm whose output distribution is the same as that of the simple algorithm (though the new algorithm may be allowed to fail in rare cases).
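
For instance, the simple algorithm above generalizes directly to the $p$-th percentile: with $p$ given as a fraction in $[0, 1]$, return the sample at index $\lfloor pn \rfloor$ instead of the middle one. A minimal sketch, reusing the D.sample() interface from above:

def sample_percentile(D, n, p):
    # Same O(n)-space approach as the median version; p is a fraction in [0, 1].
    samples = sorted(D.sample() for _ in range(n))
    return samples[min(int(p * n), n - 1)]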

Solution

Sure, you can definitely achieve this using a bit more running time. Here is a conceptually simple approach, which might not be optimal, but will get you started and is probably pretty good:

Use binary search to find an approximate median $m$. How do you know whether a candidate $m$ is too large or too small? Sample $n'$ times from the distribution, count how many of the samples are $\ge m$, and compare that count to $n'/2$. This can be done with $O(1)$ space.
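
A minimal sketch of this test, assuming the same D.sample() interface as in the question; the samples are streamed one at a time, so only a single counter is held in memory:

def median_is_above(D, m, n_prime):
    # Count how many of n' fresh samples are >= m, using O(1) space.
    count = sum(1 for _ in range(n_prime) if D.sample() >= m)
    # If more than half the samples are >= m, the median likely exceeds m.
    return count > n_prime / 2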

Then the key question becomes: how do we choose $n'$ to control the probability of error? A simple approach is to choose $n'$ sufficiently larger than $n$ that the probability of error in each iteration of binary search is a factor of $t$ smaller than the probability of error when using $n$ samples, where $t$ is the number of iterations of binary search needed to achieve the desired accuracy. A union bound over the $t$ iterations then ensures that this meets your accuracy conditions.
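
Putting the pieces together, here is a hedged sketch of the whole search. The initial bracket $[lo, hi]$ is an assumption (you need some interval believed to contain the median), and $t$ and $n'$ would be chosen as described above so that the union bound gives the target error probability:

def approximate_median(D, lo, hi, t, n_prime):
    # t steps of binary search, each using the O(1)-space test above.
    # Total space stays O(1); total samples drawn are t * n_prime.
    for _ in range(t):
        m = (lo + hi) / 2
        if median_is_above(D, m, n_prime):
            lo = m  # the median appears to lie above m
        else:
            hi = m  # the median appears to lie at or below m
    return (lo + hi) / 2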

Unfortunately, your accuracy condition is a bit hard to work with when we know nothing about the distribution, as the accuracy of the sample median can be arbitrarily bad. For instance, consider a distribution that outputs $0$ with probability $(1-\epsilon)/2$ and $100$ with probability $(1+\epsilon)/2$. Then the sample median is about equally likely to be $0$ or $100$, whereas the distribution median is $100$, so the average error of the sample median is about $50$ (unless you're drawing $\gg 1/\epsilon^2$ samples). That's a particularly nasty distribution, and it's going to be hard to work with. But if you assume the distribution is approximately Gaussian (say) with standard deviation $\sigma$, then the error of the sample median with $n$ samples is roughly $1.25\sigma/\sqrt{n}$. Thus, the above algorithm can be used with $t \approx \lg(\sqrt{n}/1.25)$ and $n' \approx n t^2$.
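
For concreteness, here is the arithmetic for those parameter choices, with $n = 10000$ as an arbitrary example budget (a sketch of the rough constants from the text, nothing more):

import math

n = 10000                                   # sample budget of the simple algorithm
t = round(math.log2(math.sqrt(n) / 1.25))   # t ~ lg(sqrt(n) / 1.25) binary-search steps
n_prime = n * t * t                         # n' ~ n * t^2 samples per step
print(t, n_prime)                           # -> 6 and 360000 for this n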

That's one simple approach. You can probably do better. You might like to look up streaming algorithms for computing the median, as they tackle exactly the problem you're working with: given an unlimited number of samples from the distribution but only a limited amount of space, what's the best estimate we can get for the median? For instance, here is one simple algorithm: the first layer repeatedly takes three samples and outputs the median of those three; the second layer repeatedly takes three numbers from the first layer and outputs the median of those three; and so on. After logarithmically many layers, you get a reasonable approximation to the median. There's an entire literature on this subject, and you should be able to find lots more.
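
To make that last scheme concrete, here is a hedged sketch in recursive form; with $d$ layers it consumes $3^d$ samples but stores only $O(d)$ values at any moment, and its output is an approximation of the median rather than an exact sample quantile:

def layered_median(D, depth):
    # Depth 0: a single raw sample from D.
    if depth == 0:
        return D.sample()
    # Each layer takes the median of three estimates from the layer below.
    a = layered_median(D, depth - 1)
    b = layered_median(D, depth - 1)
    c = layered_median(D, depth - 1)
    return sorted((a, b, c))[1]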

Licensed under: CC-BY-SA with attribution