문제

Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?

I am looking for something similar to Excel's percentile function.

I looked in NumPy's statistics reference, and couldn't find this. All I could find is the median (50th percentile), but not something more specific.

도움이 되었습니까?

해결책

You might be interested in the SciPy Stats package. It has the percentile function you're after and many other statistical goodies.

percentile() is available in numpy too.

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50) # return 50th percentile, e.g median.
print p
3.0

This ticket leads me to believe they won't be integrating percentile() into numpy anytime soon.

다른 팁

By the way, there is a pure-Python implementation of percentile function, in case one doesn't want to depend on scipy. The function is copied below:

## {{{ http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1

# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
## end of http://code.activestate.com/recipes/511478/ }}}
import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print np.percentile(a,95) # gives the 95th percentile

Here's how to do it without numpy, using only python to calculate the percentile.

import math

def percentile(data, percentile):
    size = len(data)
    return sorted(data)[int(math.ceil((size * percentile) / 100)) - 1]

p5 = percentile(mylist, 5)
p25 = percentile(mylist, 25)
p50 = percentile(mylist, 50)
p75 = percentile(mylist, 75)
p95 = percentile(mylist, 95)

The definition of percentile I usually see expects as a result the value from the supplied list below which P percent of values are found... which means the result must be from the set, not an interpolation between set elements. To get that, you can use a simpler function.

def percentile(N, P):
    """
    Find the percentile of a list of values

    @parameter N - A list of values.  N must be sorted.
    @parameter P - A float value from 0.0 to 1.0

    @return - The percentile of the values.
    """
    n = int(round(P * len(N) + 0.5))
    return N[n-1]

# A = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# B = (15, 20, 35, 40, 50)
#
# print percentile(A, P=0.3)
# 4
# print percentile(A, P=0.8)
# 9
# print percentile(B, P=0.3)
# 20
# print percentile(B, P=0.8)
# 50

If you would rather get the value from the supplied list at or below which P percent of values are found, then use this simple modification:

def percentile(N, P):
    n = int(round(P * len(N) + 0.5))
    if n > 1:
        return N[n-2]
    else:
        return N[0]

Or with the simplification suggested by @ijustlovemath:

def percentile(N, P):
    n = max(int(round(P * len(N) + 0.5)), 2)
    return N[n-2]

check for scipy.stats module:

 scipy.stats.scoreatpercentile

To calculate the percentile of a series, run:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    if isinstance(a, list):
        a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

For example:

a = range(20)
print {val: round(percentile, 3) for val, percentile in zip(a, calc_percentile(a))}
>>> {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}

Starting Python 3.8, the standard library comes with the quantiles function as part of the statistics module:

from statistics import quantiles

quantiles([1, 2, 3, 4, 5], n=100)
# [0.06, 0.12, 0.18, 0.24, 0.3, 0.36, 0.42, 0.48, 0.54, 0.6, 0.66, 0.72, 0.78, 0.84, 0.9, 0.96, 1.02, 1.08, 1.14, 1.2, 1.26, 1.32, 1.38, 1.44, 1.5, 1.56, 1.62, 1.68, 1.74, 1.8, 1.86, 1.92, 1.98, 2.04, 2.1, 2.16, 2.22, 2.28, 2.34, 2.4, 2.46, 2.52, 2.58, 2.64, 2.7, 2.76, 2.82, 2.88, 2.94, 3.0, 3.06, 3.12, 3.18, 3.24, 3.3, 3.36, 3.42, 3.48, 3.54, 3.6, 3.66, 3.72, 3.78, 3.84, 3.9, 3.96, 4.02, 4.08, 4.14, 4.2, 4.26, 4.32, 4.38, 4.44, 4.5, 4.56, 4.62, 4.68, 4.74, 4.8, 4.86, 4.92, 4.98, 5.04, 5.1, 5.16, 5.22, 5.28, 5.34, 5.4, 5.46, 5.52, 5.58, 5.64, 5.7, 5.76, 5.82, 5.88, 5.94]
quantiles([1, 2, 3, 4, 5], n=100)[49] # 50th percentile (e.g median)
# 3.0

quantiles returns for a given distribution dist a list of n - 1 cut points separating the n quantile intervals (division of dist into n continuous intervals with equal probability):

statistics.quantiles(dist, *, n=4, method='exclusive')

where n, in our case (percentiles) is 100.

In case you need the answer to be a member of the input numpy array:

Just to add that the percentile function in numpy by default calculates the output as a linear weighted average of the two neighboring entries in the input vector. In some cases people may want the returned percentile to be an actual element of the vector, in this case, from v1.9.0 onwards you can use the "interpolation" option, with either "lower", "higher" or "nearest".

import numpy as np
x=np.random.uniform(10,size=(1000))-5.0

np.percentile(x,70) # 70th percentile

2.075966046220879

np.percentile(x,70,interpolation="nearest")

2.0729677997904314

The latter is an actual entry in the vector, while the former is a linear interpolation of two vector entries that border the percentile

for a series: used describe functions

suppose you have df with following columns sales and id. you want to calculate percentiles for sales then it works like this,

df['sales'].describe(percentiles = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

0.0: .0: minimum
1: maximum 
0.1 : 10th percentile and so on
라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top