Calculating percentile rank in MySQL

https://stackoverflow.com/questions/1057027

20-08-2019
|

Question

I have a very big table of measurement data in MySQL and I need to compute the percentile rank for each and every one of these values. Oracle appears to have a function called percent_rank but I can't find anything similar for MySQL. Sure I could just brute-force it in Python which I use anyways to populate the table but I suspect that would be quite inefficient because one sample might have 200.000 observations.

Solution

This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.

One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can calculate either greater or less than and take the inverse as necessary.

Create an index on your number. total = select count(); less_equal = select count() where value > indexed_number;

The percentage would be something like: less_equal / total or (total - less_equal)/total

Make sure that both of them are using the index that you created. If they are not, tweak them until they are. The explain query should have "using index" in the right hand column. In the case of the select count(*) it should be using index for InnoDB and something like const for MyISAM. MyISAM will know this value at any time without having to calculate it.

If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.

Does this help?

Jacob

OTHER TIPS

Here's a different approach that doesn't require a join. In my case (a table with 15,000+) rows, it runs in about 3 seconds. (The JOIN method takes an order of magnitude longer).

In the sample, assume that measure is the column on which you're calculating the percent rank, and id is just a row identifier (not required):

SELECT
    id,
    @prev := @curr as prev,
    @curr := measure as curr,
    @rank := IF(@prev > @curr, @rank+@ties, @rank) AS rank,
    @ties := IF(@prev = @curr, @ties+1, 1) AS ties,
    (1-@rank/@total) as percentrank
FROM
    mytable,
    (SELECT
        @curr := null,
        @prev := null,
        @rank := 0,
        @ties := 1,
        @total := count(*) from mytable where measure is not null
    ) b
WHERE
    measure is not null
ORDER BY
    measure DESC

Credit for this method goes to Shlomi Noach. He writes about it in detail here:

http://code.openark.org/blog/mysql/sql-ranking-without-self-join

I've tested this in MySQL and it works great; no idea about Oracle, SQLServer, etc.

there is no easy way to do this. see http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html

SELECT 
    c.id, c.score, ROUND(((@rank - rank) / @rank) * 100, 2) AS percentile_rank
FROM
    (SELECT 
    *,
        @prev:=@curr,
        @curr:=a.score,
        @rank:=IF(@prev = @curr, @rank, @rank + 1) AS rank
    FROM
        (SELECT id, score FROM mytable) AS a,
        (SELECT @curr:= null, @prev:= null, @rank:= 0) AS b
ORDER BY score DESC) AS c;

If you're combining your SQL with a procedural language like PHP, you can do the following. This example breaks down excess flight block times into an airport, into their percentiles. Uses the LIMIT x,y clause in MySQL in combination with ORDER BY. Not very pretty, but does the job (sorry struggled with the formatting):

$startDt = "2011-01-01";
$endDt = "2011-02-28";
$arrPort= 'JFK';

$strSQL = "SELECT COUNT(*) as TotFlights FROM FIDS where depdt >= '$startDt' And depdt <= '$endDt' and ArrPort='$arrPort'";
if (!($queryResult = mysql_query($strSQL, $con)) ) {
    echo $strSQL . " FAILED\n"; echo mysql_error();
    exit(0);
}
$totFlights=0;
while($fltRow=mysql_fetch_array($queryResult)) {
    echo "Total Flights into " . $arrPort . " = " . $fltRow['TotFlights'];
    $totFlights = $fltRow['TotFlights'];

    /* 1906 flights. Percentile 90 = int(0.9 * 1906). */
    for ($x = 1; $x<=10; $x++) {
        $pctlPosn = $totFlights - intval( ($x/10) * $totFlights);
        echo "PCTL POSN for " . $x * 10 . " IS " . $pctlPosn . "\t";
        $pctlSQL = "SELECT  (ablk-sblk) as ExcessBlk from FIDS where ArrPort='" . $arrPort . "' order by ExcessBlk DESC limit " . $pctlPosn . ",1;";
        if (!($query2Result = mysql_query($pctlSQL, $con)) ) {
            echo $pctlSQL  . " FAILED\n";
            echo mysql_error();
            exit(0);
        }
        while ($pctlRow = mysql_fetch_array($query2Result)) {
            echo "Excess Block is :" . $pctlRow['ExcessBlk'] . "\n";
        }
    }
}

MySQL 8 finally introduced window functions, and among them, the PERCENT_RANK() function you were looking for. So, just write:

SELECT col, percent_rank() OVER (ORDER BY col)
FROM t
ORDER BY col

Your question mentions "percentiles", which are a slightly different thing. For completeness' sake, there are PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions in the SQL standard and in some RBDMS (Oracle, PostgreSQL, SQL Server, Teradata), but not in MySQL. With MySQL 8 and window functions, you can emulate PERCENTILE_DISC, however, again using the PERCENT_RANK and FIRST_VALUE window functions.

To get the rank, I'd say you need to (left) outer join the table on itself something like :

select t1.name, t1.value, count(distinct isnull(t2.value,0))  
from table t1  
left join table t2  
on t1.value>t2.value  
group by t1.name, t1.value

For each row, you will count how many (if any) rows of the same table have an inferior value.

Note that I'm more familiar with sqlserver so the syntax might not be right. Also the distinct may not have the right behaviour for what you want to achieve. But that's the general idea.
Then to get the real percentile rank you will need to first get the number of values in a variable (or distinct values depending on the convention you want to take) and compute the percentile rank using the real rank given above.

Suppose we have a sales table like :

user_id,units

then following query will give percentile of each user :

select a.user_id,a.units,
(sum(case when a.units >= b.units then 1 else 0 end )*100)/count(1) percentile
from sales a join sales b ;

Note that this will go for cross join so result in O(n2) complexity so can be considered as unoptimized solution but seems simple given we do not have any function in mysql version.

Not sure what the op meant by 'percentile rank', but to get a given percentile for a set of values see http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html The sql calculation could easily be changed to produce another or multiple percentiles.

One note: I had to change the calculation slightly, for example the 90th percentile - "90/100 * COUNT(*) + 0.5" instead of "90/100 * COUNT(*) + 1". Sometimes it was skipping two values past the percentile point in the ordered list, instead of picking the next higher value for the percentile. Maybe the way integer rounding works in mysql.

ie:

.... SUBSTRING_INDEX(SUBSTRING_INDEX( GROUP_CONCAT(fieldValue ORDER BY fieldValue SEPARATOR ','), ',', 90/100 * COUNT(*) + 0.5), ',', -1) as 90thPercentile ....

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow