Improve performance for order by with columns from many tables

https://dba.stackexchange.com/questions/112679

29-09-2020
|

Question

Using PostgreSQL 8.4, I'm trying to consult two tables with 1 million records using order by with indexed columns of the two tables, and I'm losing performance (with 1 column takes 30 ms and with two columns takes 5 minutes). E.g.:

select r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
from requests r
inner join samples s on (s.code = r.sample_code)
order by s.barcode  asc , r.code asc
limit 21;

Table Info:

CREATE TABLE public.samples (
  code BIGINT NOT NULL,
  barcode VARCHAR(40) NOT NULL,
  registry_date TIMESTAMP WITH TIME ZONE NOT NULL,
  CONSTRAINT samples_pkey PRIMARY KEY(code)
);

CREATE INDEX idx_samp_barcode ON public.samples (barcode);
CREATE INDEX idx_samp_barcode_code ON public.samples (barcode, code);
CREATE INDEX idx_samp_barcode_code_desc ON public.samples (barcode DESC, code DESC);
CREATE INDEX idx_test_identifier_desc ON public.samples (barcode DESC);


CREATE TABLE public.requests (
  code BIGINT NOT NULL,
  test_code INTEGER NOT NULL,
  sample_code BIGINT NOT NULL,
  CONSTRAINT requests_pkey PRIMARY KEY(code),
  CONSTRAINT "Requests_S_fk" FOREIGN KEY (sample_code)
    REFERENCES public.samples(code)
    ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE,
  CONSTRAINT "Requests_T_fk" FOREIGN KEY (test_code)
    REFERENCES public.tests(code)
    ON DELETE NO ACTION ON UPDATE NO ACTION NOT DEFERRABLE
);

CREATE INDEX idx_req_test_code ON public.requests (test_code);
CREATE INDEX idx_req_test_code_desc ON public.requests (test_code DESC);
CREATE INDEX request_sample_code_index ON public.requests (sample_code);
CREATE INDEX requests_sample_code_desc_idx ON public.requests (sample_code DESC, code DESC);
CREATE INDEX requests_sample_code_idx ON public.requests (sample_code, code);

Each table has 1 million rows.
All rows in both tables are distinct.
If I order by either column alone, the execution time is ~ 30 ms.
There is no sample without request.
A sample can have many s.code for the smaes.barcode.

How can I improve the performance?

Solution

Problem

This is a more complex problem than is obvious on a quick glance. You are sorting by two columns, each from a different table, while you join on two other columns. This makes it impossible for Postgres to use the provided indexes and it has to default to (very) expensive sequential scans. Here is a related case on dba.SE:

Can spatial index help a “range - order by - limit” query

Indexes

You need two indexes for best performance, both of which are already present:

CREATE INDEX idx_samp_barcode_code ON public.samples (barcode, code);
CREATE INDEX requests_sample_code_idx ON public.requests (sample_code, code);

But the plain query you have can not make use of them, not even in pg 9.4.

Query

Postgres 8.4 is quite a hurdle to take. But I think I found a way with a recursive CTE:

WITH RECURSIVE cte AS (
   ( -- all parentheses are required
   SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
   FROM  (
      SELECT s.barcode
      FROM   samples s
      -- WHERE  EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
      ORDER  BY s.barcode
      LIMIT  1  -- get smallest barcode to start with
      ) s0
   JOIN   samples  s USING (barcode)  -- join all samples with same barcode
   JOIN   requests r ON r.sample_code = s.code
   ORDER  BY r.code  -- start with ordered list
   )

   UNION ALL
   (
   SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
   FROM  (
      SELECT s.barcode
      FROM   cte c
      JOIN   samples s ON s.barcode > c.barcode 
      -- WHERE  EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
      ORDER  BY s.barcode
      LIMIT  1  -- get next higher barcode
      ) s0
   JOIN   samples  s USING (barcode)  -- join all samples with same barcode
   JOIN   requests r ON r.sample_code = s.code
   ORDER  BY r.code  -- again, ordered list
   )
   )
TABLE cte
LIMIT 21;

Tested and works for me in pg 9.4. It uses the indexes (partly in index-only scans in pg 9.2+).

pg 8.4 already has recursive CTEs. The rest is basic SQL (even TABLE is available in pg 8.4 (and can be replaced with SELECT * FROM). I hope there are no restrictions to spoil the party, I don't have an installation of pg 8.4 any more.

Explanation

The commented parts:

-- WHERE  EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)

would be needed if there could be samples with no requests at all, which you ruled out in a comment.

The query relies on this documented behavior of recursive CTEs:

Tip: The recursive query evaluation algorithm produces its output in breadth-first search order.

Bold emphasis mine. And on this, too:

This works because PostgreSQL's implementation evaluates only as many rows of a WITH query as are actually fetched by the parent query. Using this trick in production is not recommended, because other systems might work differently. Also, it usually won't work if you make the outer query sort the recursive query's results or join them to some other table.

Bold emphasis mine. So, it's only guaranteed to work in PostgreSQL and only if you do not add another ORDER BY in the outer query.

This is fast for a (relatively) small LIMIT in the outer query. For the case at hand, it should be very fast. The query would be no good for a large LIMIT.

PL/pgSQL function

The other option I can think of is a procedural solution with plpgsql. Should be a bit faster yet, especially for repeated calls:

CREATE OR REPLACE FUNCTION f_demo(_limit int)
  RETURNS TABLE (code int8, test_code int, sample_code int8
               , barcode varchar, registry_date timestamptz) AS
$func$
DECLARE
   _barcode text;
   _n       int;
   _rest    int := _limit;  -- init with _limit parameter
BEGIN
   FOR _barcode IN
      SELECT DISTINCT s.barcode
      FROM   samples s
      -- WHERE  EXISTS (SELECT 1 FROM requests WHERE r.sample_code = s.code)
      ORDER  BY s.barcode

   LOOP
      RETURN QUERY
      SELECT r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
      FROM   samples  s
      JOIN   requests r ON r.sample_code = s.code
      WHERE  s.barcode = _barcode
      ORDER  BY r.code
      LIMIT  _limit;

      GET DIAGNOSTICS _n = ROW_COUNT;
      _rest := _rest - _n;
      EXIT WHEN _rest < 1;
   END LOOP;
END
$func$ LANGUAGE plpgsql;

Call:

SELECT * FROM f_demo(21);

`MATERIALIZED VIEW`

A large OFFSET has a similar effect as a large LIMIT. While all the skipped rows don't add to I/O, they still have to be computed to determine the first rows after the offset. If you need that, use a MATERIALIZED VIEW with an added row number and an index on that.

Postgres 9.3+ has a built-in feature, with your outdated software you have to knit the solution by hand. It's not that complicated, though. Basically a snapshot- table thats populated from a pre-defined SELECT statement or VIEW, like. Basically, create the table as:

CREATE TABLE req_sample AS
SELECT row_number() OVER (ORDER BY s.barcode, r.code) AS rn
     , r.code, r.test_code, r.sample_code, s.barcode, s.registry_date
FROM   requests r
JOIN   samples s on (s.code = r.sample_code)
ORDER  by s.barcode, r.code;

ALTER TABLE req_sample ADD CONSTRAINT req_sample_rk PRIMARY KEY (code);
CREATE INDEX foo ON  req_sample (rn);

Save that SELECT as function, view or the plain query text. Refresh (expensive) in one transaction:

BEGIN;
-- drop PK & index
TRUNCATE req_sample;
INSERT INTO req_sample SELECT ... ;
-- add PK & index
COMMIT;

Redundant `barcode` column

The MV gets more expensive if underlying tables change a lot and you need "current" results. There's one more halfway cheap option: store the barcode in the requests table redundantly. You could include it as 2nd column in the FK constraint to samples ("Requests_S_fk") and make that ON UPDATE CASCADE to avoid outdated / conflicting data. Then you can work with an index on (barcode, code) in requests.

For the last two options you can then simplify pagination with directly applicable indexes with a WHERE clause on values from the previous page replacing the OFFSET. Consider this presentation about "Pagination done the PostgreSQL way" on use-the-index-luke.com.

OTHER TIPS

try to do

select * from 
(
select
r.code, r.test_code, r.sample_code, s.barcode, s.registry_date,
 from requests r
inner join samples s on (s.code = r.sample_code)
order by s.barcode  asc   limit 21
) as tbl 
 order by tbl.code asc

    select * from 
(
select
r.code, r.test_code, r.sample_code, s.barcode, s.registry_date,
 from requests r
inner join samples s on (s.code = r.sample_code)    
) as tbl 
 order by tbl.barcode  asc , tbl.code asc limit 21

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange