Question

I am attempting to find the most accurate function to give me the quantile of a given value within a data set. The data set will (most likely) always follow an exponential distribution.

The methodology I am using is as follows (and I apologize if the coding is poor, as I'm really an infrastructure guy, not a stats dude, nor a daily dev):

import numpy
from scipy.stats.mstats import mquantiles

def FindQuantile(data, findme):
    print('entered FindQuantile')
    # probabilities for each permille step: 0.000, 0.001, ..., 1.000
    # (linspace(0, 1, 1001) gives 1000 steps of 0.001)
    probset = numpy.linspace(0, 1, 1001)

    # http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mquantiles.html
    quantiles = mquantiles(data, prob=probset)
    for i, value in enumerate(quantiles):
        print(str(i) + ' permille ' + str(value))

    # goal is to figure out which quantile range findme falls in:
    for i, quantile in enumerate(quantiles):
        if findme > quantile:
            print(str(quantile) + ' is too small for ' + str(findme))
        else:
            print(str(quantile) + ' is the quantile value for the '
                  + str(i) + '-' + str(i + 1) + ' permille quantile range. '
                  + str(findme) + ' falls within this range.')
            break
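
Incidentally, the final lookup loop above amounts to computing an empirical percentile rank, and scipy.stats.percentileofscore appears to do the same thing in one call. A minimal sketch, with data and findme as placeholder values:

import numpy
from scipy import stats

data = numpy.random.exponential(scale=2.0, size=5000)  # placeholder data
findme = 3.7  # placeholder value

# percentile rank of findme within data, on a 0-100 scale
rank = stats.percentileofscore(data, findme)
print('empirical quantile of findme:', rank / 100.0)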

During my research, I noticed there are several more advanced functions to use, such as scipy.stats.[distribution type].ppf().

What is the advantage of using these over mquantiles()?

Is there a method available to efficiently determine the distribution of the data in a data set (this is my concern with scipy.stats.[distribution type]())?

Thanks,

Matt

[update]

After discussing with a "stats dude," I believe that this method (what he referred to as the "empirical method") is just as valid if you do not know the distribution. To identify the distribution, you could run the Kolmogorov–Smirnov test (exposed in SciPy as scipy.stats.kstest; scipy.stats.ksone and scipy.stats.kstwobign are the distributions of the test statistic itself) against a candidate distribution, then utilize the corresponding scipy.stats.[distribution type].ppf() function. He also said that it hardly matters: the empirical method above is just about as good as doing all that work, for little reward. He did caution that the strength of the above method rises with the amount of data available in data (and falls as it shrinks), and that no one has solved the problem of applying distributional laws to small data sets.
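
A minimal sketch of that workflow, as I understand it (the data array is a placeholder; note that fitting and testing against the same data makes the p-value optimistic):

import numpy
from scipy import stats

data = numpy.random.exponential(scale=2.0, size=1000)  # placeholder data

# estimate the exponential parameters, then KS-test the data against the fit
loc, scale = stats.expon.fit(data)
d_stat, p_value = stats.kstest(data, 'expon', args=(loc, scale))
print('KS D =', d_stat, 'p =', p_value)
# a small p-value would argue against the exponential hypothesis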

What I will do is weight my result by the strength of the data set, treating it as much fuzzier (carrying less weight) when the data set is "small." What counts as "small"? I'm not sure yet.

I would still like other people's input on the effective use of ppf() versus mquantiles().

Solution

ppf gives you the quantiles for a particular distribution, given the parameters of that distribution. For example, you could fit your data to an exponential distribution, and then use ppf with the estimated parameters to get the quantiles.

When you use mquantiles, you do not assume a specific distribution.

Estimating the parameters of a given distribution and using ppf will give you better results, with lower variance, than mquantiles, if your data really comes from that distribution or the distribution is at least a very good approximation.
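
A minimal sketch of the two approaches side by side, assuming exponential data (the data array here is a placeholder):

import numpy
from scipy import stats
from scipy.stats.mstats import mquantiles

data = numpy.random.exponential(scale=2.0, size=500)  # placeholder data
probs = [0.5, 0.9, 0.99]

# parametric: estimate the exponential parameters, then invert the fitted CDF
loc, scale = stats.expon.fit(data)
parametric = stats.expon.ppf(probs, loc=loc, scale=scale)

# non-parametric: empirical quantiles, no distributional assumption
empirical = mquantiles(data, prob=probs)

print('ppf (fitted expon):', parametric)
print('mquantiles:        ', empirical)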

Licensed under: CC-BY-SA with attribution