Question

Sorry if this seems to be a duplicate question. I'm using Postgres 11.6 on AWS RDS. I have 2 tables:

CREATE TABLE public.e
(
    id character varying(32) COLLATE pg_catalog."default" NOT NULL,
    p_id character varying(32) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT e_pkey PRIMARY KEY (id)
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

CREATE TABLE public.ed
(
    e_id character varying(32) COLLATE pg_catalog."default" NOT NULL,
    <other columns + primary key>
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

I have an index on ed.e_id:

CREATE INDEX ix_ed_e_id
    ON public.ed USING btree
    (e_id COLLATE pg_catalog."default" ASC NULLS LAST)
    TABLESPACE pg_default;

When I run this query:

select *
from ed, e
where e.id = ed.e_id
and e.p_id = '5c7cae8df6d10f1064b2eaf5';

(The problem persists when using from ed inner join e on e.id = ed.e_id.)
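
In full, that explicit-join variant is essentially:

select *
from ed
inner join e on e.id = ed.e_id
where e.p_id = '5c7cae8df6d10f1064b2eaf5';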

The EXPLAIN ANALYZE plan is:

Gather  (cost=1136.68..141235.01 rows=28320 width=311) (actual time=0.456..871.155 rows=102709 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Hash Join  (cost=136.68..137403.01 rows=11800 width=311) (actual time=0.241..688.095 rows=34236 loops=3)
        Hash Cond: (ed.e_id = e.id)
        ->  Parallel Seq Scan on ed ed  (cost=0.00..133210.10 rows=1544610 width=218) (actual time=0.005..314.524 rows=1235269 loops=3)
        ->  Hash  (cost=135.67..135.67 rows=81 width=93) (actual time=0.125..0.126 rows=81 loops=3)
              Buckets: 1024  Batches: 1  Memory Usage: 19kB
              ->  Bitmap Heap Scan on e e  (cost=4.91..135.67 rows=81 width=93) (actual time=0.045..0.097 rows=81 loops=3)
                    Recheck Cond: ((p_id)::text = '5c7cae8df6d10f1064b2eaf5'::text)
                    Heap Blocks: exact=31
                    ->  Bitmap Index Scan on ix_e_p_id  (cost=0.00..4.89 rows=81 width=0) (actual time=0.035..0.035 rows=81 loops=3)
                          Index Cond: ((p_id)::text = '5c7cae8df6d10f1064b2eaf5'::text)
Planning Time: 0.329 ms
Execution Time: 877.804 ms

Note the Parallel Seq Scan on ed instead of an index scan for the ed.e_id match.

When I SET SESSION enable_seqscan = OFF, the EXPLAIN ANALYZE plan is:

Nested Loop  (cost=0.72..395895.14 rows=28320 width=311) (actual time=0.037..60.068 rows=102709 loops=1)
  ->  Index Scan using e_pkey on e e  (cost=0.29..917.61 rows=81 width=93) (actual time=0.019..4.995 rows=81 loops=1)
        Filter: ((p_id)::text = '5c7cae8df6d10f1064b2eaf5'::text)
        Rows Removed by Filter: 10522
  ->  Index Scan using ix_ed_e_id on ed ed  (cost=0.43..4757.83 rows=11844 width=218) (actual time=0.013..0.334 rows=1268 loops=81)
        Index Cond: (e_id = e.id)
Planning Time: 0.273 ms
Execution Time: 64.675 ms

A whole order of magnitude faster (877 ms vs. 64 ms)! I tried VACUUM ANALYZE ed, but that didn't help. I even tried changing e.id and ed.e_id to the uuid type, but that didn't help either.

How can I convince Postgres to use the ix_ed_e_id index without setting enable_seqscan to off?


Solution

It seems that PostgreSQL is overestimating the cost of an index scan, which leads it to prefer a hash join over a nested loop join.

There are two parameters that tell PostgreSQL about the hardware and influence its estimate of the cost of an index scan:

  • random_page_cost: the higher this is relative to seq_page_cost, the more expensive PostgreSQL estimates the random I/O of an index scan to be compared to sequential I/O. Lowering it encourages index scans.

  • effective_cache_size: this tells the optimizer how much memory is available for caching data. If the value is high, the planner assumes that indexes are likely already cached and estimates index scans as cheaper.

Perhaps adjusting these parameters will change PostgreSQL's mind here, although the cost estimates are suspiciously far apart; a sketch of how to experiment with them follows.
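
As a rough sketch (the concrete values, the database name mydb, and the EXPLAIN ANALYZE re-check are illustrative assumptions, not recommendations; on RDS, persistent changes normally go through the DB parameter group rather than ALTER SYSTEM), you could experiment along these lines:

-- Placeholder values; tune to the instance's actual storage and memory.
SET random_page_cost = 1.1;          -- closer to seq_page_cost, plausible for SSD-backed storage
SET effective_cache_size = '12GB';   -- roughly the memory available for caching data

-- Re-check the plan with the new settings in this session:
EXPLAIN ANALYZE
select *
from ed, e
where e.id = ed.e_id
and e.p_id = '5c7cae8df6d10f1064b2eaf5';

-- If the plan improves, persist the settings, e.g. per database:
ALTER DATABASE mydb SET random_page_cost = 1.1;
ALTER DATABASE mydb SET effective_cache_size = '12GB';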

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange