RECURSIVE CTE does not use an INDEX (disabling seqscan forces it to use an index, however, and it is faster)

dba.stackexchange https://dba.stackexchange.com/questions/284332

14-03-2021

Question

Suppose the following relations:

  • match(match_id)
  • event(match_id, seq, gt, ...)

There are the following indexes:

  • match(match_id)
  • event(match_id, seq)

Further notes:

  • gt is monotonically increasing
  • For a given match, I have a collection of events, each of which happens at a specific gt time
  • Both match and event are materialized views

I am using PostgreSQL 13.1.

My goal is to write a RECURSIVE CTE query that calculates the delta between one event and the next. However, I find it very slow. While this could be solved practically with a self-join, I am not interested in that; I want to find out why my CTE is slow, because I believe it should not be.

Further numbers:

  • number of matches is 400
  • each match has an average of 541 events

My RECURSIVE CTE query is the following:


WITH RECURSIVE
    delta_gts AS (
        -- Anchor: one synthetic starting row (seq 1, gt 0) per match
        SELECT m.match_id, 1 AS seq, 0 AS gt, 0 AS delta
        FROM matches m

        UNION

        -- Recursive step: fetch the next event of the same match via (match_id, seq)
        SELECT dgt.match_id, ev.seq AS seq, ev.gt AS gt, (ev.gt - dgt.gt) AS delta
        FROM delta_gts dgt
        JOIN events ev ON ev.match_id = dgt.match_id AND ev.seq = (dgt.seq + 1)

    )

SELECT * FROM delta_gts g

Further note: I also tried adding the following (for one match only):

WHERE g.match_id = 'ita_1672780'

and I find in the plan that there is no predicate pushdown, although I think this was implemented in PostgreSQL 13.
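A common workaround when pushdown does not happen is to filter inside the non-recursive anchor term instead, so the recursion only ever walks the match of interest. A sketch using the tables above and the example match_id:

WITH RECURSIVE
    delta_gts AS (
        -- Anchor: start the recursion from a single match instead of all ~400
        SELECT m.match_id, 1 AS seq, 0 AS gt, 0 AS delta
        FROM matches m
        WHERE m.match_id = 'ita_1672780'

        UNION

        SELECT dgt.match_id, ev.seq AS seq, ev.gt AS gt, (ev.gt - dgt.gt) AS delta
        FROM delta_gts dgt
        JOIN events ev ON ev.match_id = dgt.match_id AND ev.seq = (dgt.seq + 1)
    )
SELECT * FROM delta_gts g;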

This is the plan:

QUERY PLAN
CTE Scan on delta_gts g  (cost=160601.44..161032.40 rows=21548 width=76) (actual time=173.940..354185.831 rows=220268 loops=1)
  Buffers: shared hit=5453034 read=596370, temp read=1340253 written=1581611
  CTE delta_gts
    ->  Recursive Union  (cost=0.00..160601.44 rows=21548 width=76) (actual time=173.931..353944.926 rows=220268 loops=1)
          Buffers: shared hit=5453034 read=596370, temp read=1340253 written=1580590
          ->  Seq Scan on netcastingdocument_matches m  (cost=0.00..10.08 rows=408 width=28) (actual time=173.917..174.265 rows=408 loops=1)
                Buffers: shared hit=6
          ->  Hash Join  (cost=14121.22..16016.04 rows=2114 width=76) (actual time=259.550..305.356 rows=190 loops=1158)
                Hash Cond: ((dgt.match_id = ev.match_id) AND ((dgt.seq + 1) = ev.seq))
                Buffers: shared hit=5453028 read=596370, temp read=1340253 written=1580590
                ->  WorkTable Scan on delta_gts dgt  (cost=0.00..81.60 rows=4080 width=72) (actual time=0.005..0.067 rows=190 loops=1158)
                ->  Hash  (cost=8106.89..8106.89 rows=288289 width=24) (actual time=257.949..257.949 rows=288323 loops=1158)
                      Buckets: 65536  Batches: 8  Memory Usage: 2484kB
                      Buffers: shared hit=5453022 read=596370, temp written=1565616
                      ->  Seq Scan on netcastingdocument_events ev  (cost=0.00..8106.89 rows=288289 width=24) (actual time=0.016..92.171 rows=288323 loops=1158)
                            Buffers: shared hit=5453022 read=596370
Planning:
  Buffers: shared hit=107
Planning Time: 50.290 ms
JIT:
  Functions: 13
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 4.108 ms, Inlining 0.000 ms, Optimization 19.158 ms, Emission 154.531 ms, Total 177.796 ms
Execution Time: 355489.930 ms

Considerations:

  • It is not using the (match_id, seq) index on the events table at all when the recursive part of the CTE is executed.
  • Disabling sequential scans (enable_seqscan = off) does the trick, as the index on events is then used.

After some investigation, it looks like the issue is that a sequential scan is performed to look up the next event, which is not right in my situation.
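For reference, this experiment can be reproduced per session with the standard planner setting enable_seqscan; a sketch using the query from above:

SET enable_seqscan = off;  -- make sequential scans prohibitively expensive for this session

EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE
    delta_gts AS (
        SELECT m.match_id, 1 AS seq, 0 AS gt, 0 AS delta
        FROM matches m
        UNION
        SELECT dgt.match_id, ev.seq AS seq, ev.gt AS gt, (ev.gt - dgt.gt) AS delta
        FROM delta_gts dgt
        JOIN events ev ON ev.match_id = dgt.match_id AND ev.seq = (dgt.seq + 1)
    )
SELECT * FROM delta_gts g;

RESET enable_seqscan;  -- restore the planner default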


Solution

There may be several causes; I cannot be certain, because you didn't post the EXPLAIN (ANALYZE, BUFFERS) output for both executions.

  • PostgreSQL might misestimate the row counts. Running ANALYZE as you did is a good approach here, but in a recursive CTE the row counts are often hard to predict, and such estimates are hard to fix.

    If you don't mind a nasty trick, you could try adding another superfluous join condition to make PostgreSQL think that the result will have fewer rows:

    JOIN events ev
       ON ev.match_id = dgt.match_id
          AND ev.seq = dgt.seq + 1
          AND ev.seq - 1 = dgt.seq
    
  • PostgreSQL might price an index scan too high, which induces it to choose a sequential scan and a hash join instead of a nested loop join.

    • If your storage is an SSD, you should lower random_page_cost to 1 or 1.1 to give the PostgreSQL optimizer an idea that index scans are not four times as expensive as sequential scans.

    • If you have enough RAM, you should set effective_cache_size high enough that PostgreSQL knows the data are likely cached. That also lowers the estimated cost of an index scan. (See the sketch after this list for both settings.)
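A sketch of both settings at session level; the values below are illustrative and should be tuned to the actual hardware:

SET random_page_cost = 1.1;        -- SSD: random reads cost about the same as sequential reads
SET effective_cache_size = '8GB';  -- rough size of OS cache + shared_buffers; allocates no memory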

OTHER TIPS

calculate the delta between one event and the next one

Why not use the window functions LEAD() or LAG()? They do what you want, and if PostgreSQL can get the order of the rows from an index, it won't need to do any sort.

BEGIN;
CREATE TABLE events( match_id INTEGER NOT NULL, seq INTEGER NOT NULL, value FLOAT );
INSERT INTO events SELECT n/500, n%500, random() FROM generate_series(1,500*500) n;
ALTER TABLE events ADD PRIMARY KEY (match_id, seq);
CREATE TABLE matches( match_id INTEGER PRIMARY KEY );
INSERT INTO matches SELECT DISTINCT match_id FROM events;
COMMIT;
VACUUM ANALYZE matches, events;

EXPLAIN ANALYZE SELECT match_id, seq, value, 
    lag(value,1,0::FLOAT) OVER (PARTITION BY match_id ORDER BY seq) 
    FROM events;

 WindowAgg  (cost=0.42..14009.61 rows=250000 width=24) (actual time=0.037..371.799 rows=250000 loops=1)
   ->  Index Scan using events_pkey on events  (cost=0.42..9634.61 rows=250000 width=16) (actual time=0.024..98.620 rows=250000 loops=1)
 Planning Time: 0.090 ms
 Execution Time: 390.870 ms

If you want the difference between one row and the previous one, use value - lag(value, 1). Also, lag() takes a default parameter, so if you want the first value to be 0 instead of NULL, use lag(value, 1, 0::FLOAT). It doesn't seem to work unless the type is explicitly cast.
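Applied to the original schema, the whole recursive CTE then collapses into a single windowed scan. A sketch, assuming gt is an integer column (otherwise cast the default, e.g. 0::FLOAT, to match its type):

SELECT match_id,
       seq,
       gt,
       -- Delta to the previous event of the same match; 0 for the first event
       gt - lag(gt, 1, 0) OVER (PARTITION BY match_id ORDER BY seq) AS delta
FROM events;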

Now, the original question...

WITH RECURSIVE
    delta_gts AS (
        SELECT m.match_id, 1 AS seq, 0::FLOAT AS value, 0::FLOAT AS delta FROM matches m
        UNION ALL
        SELECT dgt.match_id, ev.seq AS seq, ev.value, (ev.value - dgt.value) AS delta
        FROM delta_gts dgt
        JOIN events ev ON ev.match_id = dgt.match_id AND ev.seq = (dgt.seq + 1)
    )
SELECT * FROM delta_gts g;

UNION removes duplicate rows. Since there will be no duplicate rows here (the rows added by each recursive step are different from the previous ones), this is a waste of CPU, so I replaced it with UNION ALL, which skips that extra work. This makes the query about 2x faster.

 CTE Scan on delta_gts g  (cost=78436.57..79448.59 rows=50601 width=24) (actual time=0.019..715.390 rows=249501 loops=1)
   CTE delta_gts
     ->  Recursive Union  (cost=0.00..78436.57 rows=50601 width=24) (actual time=0.016..437.205 rows=249501 loops=1)
           ->  Seq Scan on matches m  (cost=0.00..8.01 rows=501 width=24) (actual time=0.014..0.133 rows=501 loops=1)
           ->  Hash Join  (cost=7602.00..7741.65 rows=5010 width=24) (actual time=0.294..0.733 rows=499 loops=499)
                 Hash Cond: ((dgt.match_id = ev.match_id) AND ((dgt.seq + 1) = ev.seq))
                 ->  WorkTable Scan on delta_gts dgt  (cost=0.00..100.20 rows=5010 width=16) (actual time=0.000..0.061 rows=500 loops=499)
                 ->  Hash  (cost=3852.00..3852.00 rows=250000 width=16) (actual time=145.065..145.066 rows=250000 loops=1)
                       Buckets: 262144  Batches: 1  Memory Usage: 14744kB
                       ->  Seq Scan on events ev  (cost=0.00..3852.00 rows=250000 width=16) (actual time=0.012..59.861 rows=250000 loops=1)
 Planning Time: 0.278 ms
 Execution Time: 745.422 ms

With both UNION and UNION ALL, I got pretty much the same plan as yours, except that it is much faster. Same query, same plan, different speed: this is weird.

The big difference is that your hash uses Batches: 8 and Memory Usage: 2484kB, while mine uses only one batch. So I set work_mem to 2MB (it was at 64MB) and also got a multi-batch hash, which was just as slow as your query.

It appears that when the hash fits in one batch, Postgres builds it only once for the whole query, but when it has to be split into several batches, it rebuilds it for each iteration of the recursive query, which here means about 500 times. That should explain why it's slow. Apparently the planner isn't aware of this, so it picks the hash join anyway.
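If the recursive form must be kept, one consequence of this observation is that giving the session enough work_mem for the hash to fit in a single batch avoids the rebuilds; 64MB is what my fast run used, but the right value depends on the size of the hashed table:

SET work_mem = '64MB';  -- large enough that the hash of events stays in one batch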

Issuing SET enable_hashjoin = off; before the query makes it use the index. This is much slower than a single-batch hash built once, but much faster than a hash rebuilt for every iteration.
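Spelled out, the toggle is session-local and easy to undo:

SET enable_hashjoin = off;  -- the planner falls back to a nested loop with an index scan
-- run the recursive CTE query here
RESET enable_hashjoin;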

But really, the proper solution is to use a window function. It will also properly handle a condition like WHERE match_id = ..., as sketched below.
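For example, a sketch with the question's example match (the predicate is applied before the window, so the (match_id, seq) index can drive both the filter and the ordering; as above, cast lag()'s default to gt's actual type if needed):

SELECT match_id,
       seq,
       gt,
       gt - lag(gt, 1, 0) OVER (PARTITION BY match_id ORDER BY seq) AS delta
FROM events
WHERE match_id = 'ita_1672780';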

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange