Postgres top-k with ties

https://dba.stackexchange.com/questions/80311

10-12-2020
|

Question

Assume I have the following DB table with 3 integer fields.

A | B | C
1 | 2 | 3
1 | 2 | 4
1 | 3 | 1
2 | 4 | 2
2 | 4 | 3

When I do:

SELECT * FROM dbTable ORDER BY A,B LIMIT 1;

I get

1 | 2 | 3

which is expected. But the 2nd record:

 1 | 2 | 4

also has the same values for dbFields A and B. Is there any efficient way to actually retrieve all records that have the same value as the top-k records? For example, when I search for the first 100 records to get instead 102 records if the last two have the same values as the 100th record? Is there any index to accelerate such queries? I do not mind if it has to be done with pl/pgsql (and not plain SQL) if the implementation is efficient.

Solution 2

This is what I came up with:

/* Get the records of original table that correspond to each value of A 
and for values of B better or the same for the top-k records */ 
SELECT n3.A,n3.B,n3.C
FROM dbTable n3,
(
/* n2 = Get the worst value of B per A for the top-k records */ 
SELECT A, MAX(B) AS B
FROM 
/* n1 = Count records per A, ordered by B */ 
(SELECT A, B, C,
        row_number() over (partition BY A ORDER BY B,C)  AS counter
  FROM dbTable) n1

WHERE n1.counter<=100 /* k=100 */
GROUP BY A) n2

WHERE n3.A=n2.A AND n3.B<=n2.B
ORDER BY n3.A,n3.B,n3.C;

Seems correct, but please spot any possible oversight.

OTHER TIPS

You can use a window function for this:

select a,b,c
from (
  select a,b,c,
         dense_rank() over (order by a,b) as rnk
  from dbTable
) t
where rnk = 1;

For the "first" rows, it doesn't matter if you use rank() or dense_rank(). When you e.g. want the "second" ones, the rank() and dense_rank() would return different results in case of ties. Because rank() will have "gaps" in the numbers, but dense_rank() will not.

A possible speedup might be achieved by doing this in two steps and of course having an index on (a,b)

with ranked as (
  select *
  from (
    select a,b,
           dense_rank() over (order by a,b) as rnk
    from dbTable
  ) t
  where r.rnk = 1  -- (or <= for "top-k")
)
select t.a, t.b, t.c
from dbTable t
   join ranked r on r.a = t.a and r.b = t.b;

The idea is to give Postgres the chance to do an index-only scan for the ranking part and then join only the result of that to the base table to get the remaining column(s). The filtering on the rank is done inside the CTE as Postgres doesn't push down conditions from the outer query into the CTE itself (that's why I have the derived table inside the CTE)

I'm not sure if this really improves the performance, but I guess it would be worth trying and have a look a the execution plan with the real tables (and data).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange