Python SciPy Stats percentileofscore

https://stackoverflow.com/questions/8138552

01-03-2021

Question

Consider the following Python code:

In [1]: import numpy as np
In [2]: import scipy.stats as stats
In [3]: ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153, 0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
In [4]: x = ar[-1]
In [5]: stats.percentileofscore(ar, x, kind='strict')
Out[5]: 30.0
In [6]: stats.percentileofscore(ar, x, kind='rank')
Out[6]: 40.0
In [7]: stats.percentileofscore(ar, x, kind='weak')
Out[7]: 40.0
In [8]: stats.percentileofscore(ar, x, kind='mean')
Out[8]: 35.0

The kind argument represents the interpretation of the resulting score.

Now when I use Excel's PERCENTRANK function with the same data, I get 0.3333. This appears to be correct as there are 3 values less than x=0.3963.

Can someone explain why I'm getting inconsistent results?

Solution

When I rewrote this function in scipy.stats, I found many different definitions, several of which are included.

The basic example is when I want to rank students on a score. In this case the data include every student's score, and percentileofscore gives the rank among all students. The main distinction then is just how to handle ties.
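To make the tie handling concrete, here is a minimal sketch that derives the four kind values from two counts: how many entries are strictly below the score and how many are at or below it. The function name percentile_of_score_sketch is hypothetical, and the 'rank' formula (average rank of the tied entries) is an assumption about what scipy.stats.percentileofscore does, not a copy of its source.

```python
import numpy as np

def percentile_of_score_sketch(a, score):
    """Illustrative re-derivation of the four `kind` values from two counts."""
    a = np.asarray(a, dtype=float)
    n = a.size
    below = np.count_nonzero(a < score)         # entries strictly less than score
    at_or_below = np.count_nonzero(a <= score)  # also counts entries equal to score
    has_tie = at_or_below > below               # True when score occurs in a
    return {
        "strict": 100.0 * below / n,
        "weak": 100.0 * at_or_below / n,
        "mean": 50.0 * (below + at_or_below) / n,
        # Average rank of the entries equal to score (assumed to match kind='rank').
        "rank": 50.0 * (below + at_or_below + int(has_tie)) / n,
    }

ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153,
               0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
kinds = percentile_of_score_sketch(ar, ar[-1])
print(kinds)
```

For the data in the question, 3 entries are below x and 4 are at or below it, so the sketch gives 30, 40, 35 and 40 for strict, weak, mean and rank, which agrees with the session in the question.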

Excel seems to rank a score relative to an existing scale, for example the rank of a score on the historical GRE scale. I have no idea whether Excel drops one entry if the score is not in the existing list.

A similar problem in statistics is "plotting positions" for quantiles. I couldn't find a good reference on the internet. Here is one general formula: http://amsglossary.allenpress.com/glossary/search?id=plotting-position1 Wikipedia only has a short paragraph: http://en.wikipedia.org/wiki/Q-Q_plot#Plotting_positions

The literature has a large number of different choices of b (or even of a second parameter a), which correspond to different approximations for different distributions. Several are implemented in scipy.stats.mstats.
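As a rough illustration of how much the choice of parameters matters, the sketch below evaluates the two-parameter family (rank - alpha) / (n + 1 - alpha - beta), which is the form used by scipy.stats.mstats.plotting_positions. The helper plotting_positions_sketch is hypothetical, it assumes the data contain no ties, and how alpha and beta map onto the a and b of the linked formula is not checked here.

```python
import numpy as np

def plotting_positions_sketch(data, alpha, beta):
    """Return (rank - alpha) / (n + 1 - alpha - beta) for each entry, in data order."""
    data = np.asarray(data, dtype=float)
    n = data.size
    ranks = data.argsort().argsort() + 1  # 1-based ranks; assumes no ties
    return (ranks - alpha) / (n + 1 - alpha - beta)

ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153,
               0.6036, 0.2497, 0.5188, 0.4723, 0.3963])

# Position of x = ar[-1] (rank 4 of 10) under a few parameter choices.
for alpha, beta in [(0, 0), (1, 0), (0, 1), (1, 1), (0.4, 0.4)]:
    p = plotting_positions_sketch(ar, alpha, beta)[-1]
    print(f"alpha={alpha}, beta={beta}: {p:.4f}")
```

With no ties, alpha=1, beta=0 reduces to (rank - 1) / n, the 'strict' fraction 0.3; alpha=0, beta=1 reduces to rank / n, the 'weak' fraction 0.4; and alpha=beta=1 gives (rank - 1) / (n - 1) = 3/9, the 0.3333 that Excel reports in the question.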

I don't think it's a question of which is right. The question is what you want to use it for, and what the common definition is for your problem or your field.

Other suggestions

This is a weird one. As near as I can tell, they are doing different calculations. SciPy will reproduce the Excel result if called this way:

In [1]: import numpy as np
In [2]: import scipy.stats as stats
In [3]: ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153, 0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
In [4]: x = ar[-1]
In [5]: stats.percentileofscore(ar[:-1], x, kind='mean')
Out[5]: 33.333333333333336

Using any of the kind keywords, I get the same answer. This call leaves out the value in the data that is exactly equal to the query. Have a look at this PercentRank algorithm in VBA, as it might offer some insight.
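The reason every kind agrees here can be checked by counting directly. This is a small sketch with plain NumPy, and the variable names are only illustrative: once the entry equal to x is dropped, nothing in the remaining data ties with x, so the "strictly below" and "at or below" counts coincide.

```python
import numpy as np

ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153,
               0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
x = ar[-1]
others = ar[:-1]  # the data with the query value itself left out

below = np.count_nonzero(others < x)         # strictly smaller values
at_or_below = np.count_nonzero(others <= x)  # same count, since no ties remain

strict = 100.0 * below / others.size
weak = 100.0 * at_or_below / others.size
print(below, at_or_below, strict, weak)
```

Both counts are 3 out of 9 remaining values, so strict and weak are both 33.33 percent, and any average of the two is the same number.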

So which is right? Excel or Scipy?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow