Unpredictable query performance in PostgreSQL

https://stackoverflow.com/questions/19691575

02-07-2022

Question

I have tables like this in a Postgres 9.3 database:

A <1---n B n---1> C

Table A contains ~10^7 rows, table B is rather big with ~10^9 rows and C contains ~100 rows.

I use the following query to find all As (distinct) that match some criteria in B and C (the real query is more complex, joins more tables and checks more attributes within the subquery):

Query 1:

explain analyze
select A.SNr from A
where exists (select 1 from B, C
              where B.AId = A.Id and
                    B.CId = C.Id and
                    B.Timestamp >= '2013-01-01' and
                    B.Timestamp <= '2013-01-12' and
                    C.Name = '00000015')
limit 200;

That query takes about 500ms (note that C.Name = '00000015' exists in the table):

Limit  (cost=119656.37..120234.06 rows=200 width=9) (actual time=427.799..465.485 rows=200 loops=1)
  ->  Hash Semi Join  (cost=119656.37..483518.78 rows=125971 width=9) (actual time=427.797..465.460 rows=200 loops=1)
        Hash Cond: (a.id = b.aid)
        ->  Seq Scan on a  (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..15.058 rows=133470 loops=1)
        ->  Hash  (cost=117588.73..117588.73 rows=125971 width=4) (actual time=427.233..427.233 rows=190920 loops=1)
              Buckets: 4096  Batches: 8  Memory Usage: 838kB
              ->  Nested Loop  (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.176..400.326 rows=190920 loops=1)
                    ->  Seq Scan on c  (cost=0.00..2.88 rows=1 width=4) (actual time=0.015..0.030 rows=1 loops=1)
                          Filter: (name = '00000015'::text)
                          Rows Removed by Filter: 149
                    ->  Index Only Scan using cid_aid on b  (cost=0.57..116291.64 rows=129422 width=8) (actual time=0.157..382.896 rows=190920 loops=1)
                          Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
                          Heap Fetches: 0
Total runtime: 476.173 ms

Query 2: Changing C.Name to something that doesn't exist (C.Name = 'foo') takes 0.1ms:

explain analyze
select A.SNr from A
where exists (select 1 from B, C
              where B.AId = A.Id and
                    B.CId = C.Id and
                    B.Timestamp >= '2013-01-01' and
                    B.Timestamp <= '2013-01-12' and
                    C.Name = 'foo')
limit 200;

Limit  (cost=119656.37..120234.06 rows=200 width=9) (actual time=0.063..0.063 rows=0 loops=1)
  ->  Hash Semi Join  (cost=119656.37..483518.78 rows=125971 width=9) (actual time=0.062..0.062 rows=0 loops=1)
        Hash Cond: (a.id = b.aid)
        ->  Seq Scan on a  (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..0.010 rows=1 loops=1)
        ->  Hash  (cost=117588.73..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
              Buckets: 4096  Batches: 8  Memory Usage: 0kB
              ->  Nested Loop  (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
                    ->  Seq Scan on c  (cost=0.00..2.88 rows=1 width=4) (actual time=0.037..0.037 rows=0 loops=1)
                          Filter: (name = 'foo'::text)
                          Rows Removed by Filter: 150
                    ->  Index Only Scan using cid_aid on b  (cost=0.57..116291.64 rows=129422 width=8) (never executed)
                          Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
                          Heap Fetches: 0
Total runtime: 0.120 ms

Query 3: Resetting C.Name to something that exists (as in the first query) and extending the upper timestamp bound by 3 days produces a different query plan than before, but it is still fast (200ms):

explain analyze
select A.SNr from A
where exists (select 1 from B, C
              where B.AId = A.Id and
                    B.CId = C.Id and
                    B.Timestamp >= '2013-01-01' and
                    B.Timestamp <= '2013-01-15' and
                    C.Name = '00000015')
limit 200;

Limit  (cost=0.57..112656.93 rows=200 width=9) (actual time=4.404..227.569 rows=200 loops=1)
  ->  Nested Loop Semi Join  (cost=0.57..90347016.34 rows=160394 width=9) (actual time=4.403..227.544 rows=200 loops=1)
        ->  Seq Scan on a  (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..1.046 rows=12250 loops=1)
        ->  Nested Loop  (cost=0.57..7.49 rows=1 width=4) (actual time=0.017..0.017 rows=0 loops=12250)
              ->  Seq Scan on c  (cost=0.00..2.88 rows=1 width=4) (actual time=0.005..0.015 rows=1 loops=12250)
                    Filter: (name = '00000015'::text)
                    Rows Removed by Filter: 147
              ->  Index Only Scan using cid_aid on b  (cost=0.57..4.60 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12250)
                    Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
                    Heap Fetches: 0
Total runtime: 227.632 ms

Query 4: But that new query plan utterly fails when searching for a C.Name that doesn't exist:

explain analyze
select A.SNr from A
where exists (select 1 from B, C
              where B.AId = A.Id and
                    B.CId = C.Id and
                    B.Timestamp >= '2013-01-01' and
                    B.Timestamp <= '2013-01-15' and
                    C.Name = 'foo')
limit 200;

Now it takes 170 seconds (vs. 0.1ms before!) to return the same 0 rows:

Limit  (cost=0.57..112656.93 rows=200 width=9) (actual time=170184.979..170184.979 rows=0 loops=1)
  ->  Nested Loop Semi Join  (cost=0.57..90347016.34 rows=160394 width=9) (actual time=170184.977..170184.977 rows=0 loops=1)
        ->  Seq Scan on a  (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..794.626 rows=12020034 loops=1)
        ->  Nested Loop  (cost=0.57..7.49 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
              ->  Seq Scan on c  (cost=0.00..2.88 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
                    Filter: (name = 'foo'::text)
                    Rows Removed by Filter: 150
              ->  Index Only Scan using cid_aid on b  (cost=0.57..4.60 rows=1 width=8) (never executed)
                    Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
                    Heap Fetches: 0
Total runtime: 170185.033 ms

All queries were run after "alter table set statistics" with a value of 10000 on all columns and after running analyze on the whole db.

Right now it looks like the slightest change of a parameter (not even of the SQL) can make Postgres choose a bad plan (0.1ms vs. 170s in this case!). I always try to check query plans when changing things, but it's hard to ever be sure that something will work when such small changes on parameters can make such huge differences. I have similar problems with other queries too.

What can I do to get more predictable results?

(I have tried modifying certain query planning parameters (set enable_... = on/off) and some different SQL statements - joining+distinct/group by instead of "exists" - but nothing seems to make postgres choose "stable" query plans while still providing acceptable performance).

Edit #1: Table + index definitions

test=# \d a
                          Table "public.a"
 Column |  Type   |                     Modifiers
--------+---------+----------------------------------------------------
 id     | integer | not null default nextval('a_id_seq'::regclass)
 anr    | integer |
 snr    | text    |
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "anr_snr_index" UNIQUE, btree (anr, snr)
    "anr_index" btree (anr)
Foreign-key constraints:
    "anr_fkey" FOREIGN KEY (anr) REFERENCES pt(id)
Referenced by:
    TABLE "b" CONSTRAINT "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)


test=# \d b
                 Table "public.b"
  Column   |            Type             | Modifiers
-----------+-----------------------------+-----------
 id        | uuid                        | not null
 timestamp | timestamp without time zone |
 cid       | integer                     |
 aid       | integer                     |
 prop1     | text                        |
 propn     | integer                     |
Indexes:
    "b_pkey" PRIMARY KEY, btree (id)
    "aid_cid" btree (aid, cid)
    "cid_aid" btree (cid, aid, "timestamp")
    "timestamp_index" btree ("timestamp")
Foreign-key constraints:
    "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)
    "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)


test=# \d c
                          Table "public.c"
 Column |  Type   |                     Modifiers
--------+---------+----------------------------------------------------
 id     | integer | not null default nextval('c_id_seq'::regclass)
 name   | text    |
Indexes:
    "c_pkey" PRIMARY KEY, btree (id)
    "c_name_index" UNIQUE, btree (name)
Referenced by:
    TABLE "b" CONSTRAINT "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)

Solution

Your problem is that the query needs to evaluate the correlated subquery for the entire table a. When Postgres quickly finds 200 rows that fit (which seems to occasionally be the case when c.name exists), it yields them accordingly, and reasonably fast if there are plenty to choose from. But when no such rows exist, it evaluates the entire hogwash in the exists() statement as many times as table a has rows (12020034 loops in Query 4), hence the performance issue you're seeing.

Adding an uncorrelated where clause will most certainly fix a number of edge cases:

and exists(select 1 from c where name = ?)
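As a minimal sketch, this is how the guard might be attached to Query 4 from the question. The literal values are copied from that query. The guard does not reference a, so Postgres will typically evaluate it once before scanning a instead of once per row, but that is worth confirming with explain analyze on your own data.

```sql
-- Sketch only: Query 4 from the question plus an uncorrelated guard.
-- The first exists() does not depend on A, so it can be checked once;
-- if no C row is named 'foo', the scan over A can be skipped entirely.
select A.SNr from A
where exists (select 1 from C where C.Name = 'foo')
  and exists (select 1 from B, C
              where B.AId = A.Id and
                    B.CId = C.Id and
                    B.Timestamp >= '2013-01-01' and
                    B.Timestamp <= '2013-01-15' and
                    C.Name = 'foo')
limit 200;
```

Note that the guard only covers the case where the name is missing from c. A name that exists in c but has no matching rows in b for the date range would still hit the slow path.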

It might also work to join c with b and write the result as a CTE:

with bc as (
    select aid
    from b join c on b.cid = c.id
    and b.timestamp between ? and ?
    and c.name = ?
)
select a.id
from a
where exists (select 1 from bc)
and exists (select 1 from bc where a.id = bc.aid)
limit 200

If not, just toss in the bc query verbatim instead of using the cte. The point here is to force Postgres to consider the bc lookup as independent, and bail early if the resulting set yields no rows at all.
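Written out without the CTE, that variant could look roughly like the following. This is a sketch using the literal values from Query 4 in place of the ? placeholders; whether the planner treats the first exists() as a one-time check should be verified with explain analyze.

```sql
-- Sketch only: the bc subquery repeated inline instead of as a CTE.
select a.id
from a
where exists (   -- uncorrelated: is there any matching b/c row at all?
        select 1
        from b join c on b.cid = c.id
        where b.timestamp between '2013-01-01' and '2013-01-15'
          and c.name = 'foo')
  and exists (   -- correlated: does this particular a row have a match?
        select 1
        from b join c on b.cid = c.id
        where b.aid = a.id
          and b.timestamp between '2013-01-01' and '2013-01-15'
          and c.name = 'foo')
limit 200;
```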

I assume your query is more complex in the end, but note that the above could be rewritten as:

with bc as (...)
select aid
from bc
limit 200

Or:

with bc as (...)
select a.id
from a
where a.id in (select aid from bc)
limit 200

Both should yield better plans in edge cases.

(Side note: it's usually inadvisable to use limit without an order by, because which rows come back is then unspecified.)
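For illustration, a deterministic version of Query 1 might look like this. Ordering by A.Id is an assumption made for the sketch; any unique sort key would serve. Adding an order by can itself change the plan Postgres picks, so the timing needs to be re-checked.

```sql
-- Sketch only: Query 1 with an explicit sort so that the same 200 rows
-- come back on every run. The choice of A.Id as sort key is hypothetical.
select A.SNr from A
where exists (select 1 from B, C
              where B.AId = A.Id and
                    B.CId = C.Id and
                    B.Timestamp >= '2013-01-01' and
                    B.Timestamp <= '2013-01-12' and
                    C.Name = '00000015')
order by A.Id
limit 200;
```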

Other suggestions

Maybe try rewriting the query with a CTE?

with BC as (
    select distinct B.AId from B where
    B.Timestamp >= '2013-01-01' and
    B.Timestamp <= '2013-01-12' and
    B.CId in (select C.Id from C where C.Name = '00000015')
    limit 200
)

select A.SNr from A where A.Id in (select AId from BC)

If I understand correctly, the limit can simply be put inside the BC query to avoid a scan on table A.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow