Wondering why would one calculate the median this way?

https://stackoverflow.com/questions/9447267

12-11-2019
|

Question

I was wondering what may be the reason to use this median function, instead of just calculating the min + (max - min) / 2:

// used by the random number generator
private static final double  M_E12 = 162754.79141900392083592475;

/**
 * Return an estimate of median of n values distributed in [min,max)
 * @param min the minimum value 
 * @param max the maximum value
 * @param n 
 * @return an estimate of median of n values distributed in [min,max)
 **/
private static double median(double min, double max, int n) 
{
    // get random value in [0.0, 1.0)
    double t = (new Random()).nextDouble();

    double retval;
    if (t > 0.5) {
        retval = java.lang.Math.log(1.0-(2.0*(M_E12-1)*(t-0.5)/M_E12))/12.0;
    } else {
        retval = -java.lang.Math.log(1.0-(2.0*(M_E12-1)*t/M_E12))/12.0;
    }
    // We now have something distributed on (-1.0,1.0)
    retval = (retval+1.0) * (max-min)/2.0;
    retval = retval + min;
    return retval;
}

The only downside of my approach would maybe be its deterministic nature, I'd say?

The whole code can be found here, http://www.koders.com/java/fid42BB059926626852A0D146D54F7D66D7D2D5A28D.aspx?s=cdef%3atree#L8, btw.

Thanks

Solution

[trying to cover a range here because it's not clear to me what you're not understanding]

first, the median is the middle value. the median of [0,0,1,99,99] is 1.

and so we can see that the code given is not calculating the median (it's not finding a middle value). instead, it's estimating it from some theoretical distribution. as the comment says.

the forumla you give is for the mid-point. if many values are uniformly distributed between min and max then yes, that is a good estimation of the median. in this case (presumably) the values are not distributed in that way and so some other method is necessary.

you can see why this could be necessary by calculating the mid point of the numbers above - your formula would give 49.5.

the reason for using an estimate is probably that it is much faster than finding the median. the reason for making that estimate random is likely to avoid a bad worst case on multiple calls.

and finally, sorry but i don't know what the distribution is in this case. you probably need to search for the data structure and/or author name to see if you can find a paper or book reference (i thought it might be assuming a power law, but see edit below - it seems to be adding a very small correction) (i'm not sure if that is what you are asking, or if you are more generally confused).

[edit] looking some more, i think the log(...) is giving a central bias to the uniformly random t. so it's basically doing what you suggest, but with some spread around the 0.5. here's a plot of one case which shows that retval is actually a pretty small adjustment.

OTHER TIPS

I can't tell you what this code is attempting to achieve; for a start it doesn't even use n!

But from the looks of it, it's simply generating some sort of exponentially-distributed random value in the range [min,max]. See http://en.wikipedia.org/wiki/Exponential_distribution#Generating_exponential_variates.

Interestingly, Googling for that magic number brings up lots of relevant hits, none of which are illuminating: http://www.google.co.uk/search?q=162754.79141900392083592475.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow