Question

Following the migration of an application and its database from a standard PostgreSQL database to an Amazon Aurora RDS PostgreSQL database (both running version 9.6), we have found that a specific query runs much slower -- around 10 times slower -- on Aurora than on PostgreSQL.

Both databases have the same configuration, both for the hardware and for the postgresql.conf settings.

The query itself is fairly simple. It is generated from our backend written in Java and using jOOQ for writing the queries:

with "all_acp_ids"("acp_id") as (
    select acp_id from temp_table_de3398bacb6c4e8ca8b37be227eac089
) 
select distinct "public"."f1_folio_milestones"."acp_id", 
    coalesce("public"."sa_milestone_overrides"."team", 
    "public"."f1_folio_milestones"."team_responsible") 
from "public"."f1_folio_milestones" 
left outer join 
    "public"."sa_milestone_overrides" on (
        "public"."f1_folio_milestones"."milestone" = "public"."sa_milestone_overrides"."milestone" 
        and "public"."f1_folio_milestones"."view" = "public"."sa_milestone_overrides"."view" 
        and "public"."f1_folio_milestones"."acp_id" = "public"."sa_milestone_overrides"."acp_id"
)
where "public"."f1_folio_milestones"."acp_id" in (
    select "all_acp_ids"."acp_id" from "all_acp_ids"
)

temp_table_de3398bacb6c4e8ca8b37be227eac089 is a single-column table. f1_folio_milestones (17 million rows) and sa_milestone_overrides (around 1 million rows) are similarly designed tables with indexes on all the columns used in the LEFT OUTER JOIN.

temp_table_de3398bacb6c4e8ca8b37be227eac089 can contain up to 5000 entries, all of them distinct.
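
For reference, the relevant parts of the tables look roughly like this (a simplified sketch, not our actual DDL; column types are inferred from the casts visible in the query plans below):

CREATE TABLE temp_table_de3398bacb6c4e8ca8b37be227eac089 (
    acp_id text                       -- up to 5000 distinct values
);

CREATE TABLE f1_folio_milestones (    -- ~17 million rows
    acp_id           text,
    milestone        text,
    "view"           text,
    team_responsible text
);
CREATE INDEX f1_folio_milestones_acp_id_idx ON f1_folio_milestones (acp_id);

CREATE TABLE sa_milestone_overrides ( -- ~1 million rows
    acp_id    varchar,
    milestone text,
    "view"    varchar,
    team      text
);
CREATE INDEX sa_milestone_overrides_acp_id_index ON sa_milestone_overrides (acp_id);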

When we run it on the normal PostgreSQL database, it generates the following query plan:

Unique  (cost=4802622.20..4868822.51 rows=8826708 width=43) (actual time=483.928..483.930 rows=1 loops=1)
  CTE all_acp_ids
    ->  Seq Scan on temp_table_de3398bacb6c4e8ca8b37be227eac089  (cost=0.00..23.60 rows=1360 width=32) (actual time=0.004..0.005 rows=1 loops=1)
  ->  Sort  (cost=4802598.60..4824665.37 rows=8826708 width=43) (actual time=483.927..483.927 rows=4 loops=1)
        Sort Key: f1_folio_milestones.acp_id, (COALESCE(sa_milestone_overrides.team, f1_folio_milestones.team_responsible))
        Sort Method: quicksort  Memory: 25kB
        ->  Hash Left Join  (cost=46051.06..3590338.34 rows=8826708 width=43) (actual time=483.905..483.917 rows=4 loops=1)
              Hash Cond: ((f1_folio_milestones.milestone = sa_milestone_overrides.milestone) AND (f1_folio_milestones.view = (sa_milestone_overrides.view)::text) AND (f1_folio_milestones.acp_id = (sa_milestone_overrides.acp_id)::text))
              ->  Nested Loop  (cost=31.16..2572.60 rows=8826708 width=37) (actual time=0.029..0.038 rows=4 loops=1)
                    ->  HashAggregate  (cost=30.60..32.60 rows=200 width=32) (actual time=0.009..0.010 rows=1 loops=1)
                          Group Key: all_acp_ids.acp_id
                          ->  CTE Scan on all_acp_ids  (cost=0.00..27.20 rows=1360 width=32) (actual time=0.006..0.007 rows=1 loops=1)
                    ->  Index Scan using f1_folio_milestones_acp_id_idx on f1_folio_milestones  (cost=0.56..12.65 rows=5 width=37) (actual time=0.018..0.025 rows=4 loops=1)
                          Index Cond: (acp_id = all_acp_ids.acp_id)
              ->  Hash  (cost=28726.78..28726.78 rows=988178 width=34) (actual time=480.423..480.423 rows=987355 loops=1)
                    Buckets: 1048576  Batches: 1  Memory Usage: 72580kB
                    ->  Seq Scan on sa_milestone_overrides  (cost=0.00..28726.78 rows=988178 width=34) (actual time=0.004..189.641 rows=987355 loops=1)
Planning time: 3.561 ms
Execution time: 489.223 ms

As one can see, it goes pretty smoothly -- less than a second for the query. But on the Aurora instance, this happens:

Unique  (cost=2632927.29..2699194.83 rows=8835672 width=43) (actual time=4577.348..4577.350 rows=1 loops=1)
  CTE all_acp_ids
    ->  Seq Scan on temp_table_de3398bacb6c4e8ca8b37be227eac089  (cost=0.00..23.60 rows=1360 width=32) (actual time=0.001..0.001 rows=1 loops=1)
  ->  Sort  (cost=2632903.69..2654992.87 rows=8835672 width=43) (actual time=4577.348..4577.348 rows=4 loops=1)
        Sort Key: f1_folio_milestones.acp_id, (COALESCE(sa_milestone_overrides.team, f1_folio_milestones.team_responsible))
        Sort Method: quicksort  Memory: 25kB
        ->  Merge Left Join  (cost=1321097.58..1419347.08 rows=8835672 width=43) (actual time=4488.369..4577.330 rows=4 loops=1)
              Merge Cond: ((f1_folio_milestones.view = (sa_milestone_overrides.view)::text) AND (f1_folio_milestones.milestone = sa_milestone_overrides.milestone) AND (f1_folio_milestones.acp_id = (sa_milestone_overrides.acp_id)::text))
              ->  Sort  (cost=1194151.06..1216240.24 rows=8835672 width=37) (actual time=0.039..0.040 rows=4 loops=1)
                    Sort Key: f1_folio_milestones.view, f1_folio_milestones.milestone, f1_folio_milestones.acp_id
                    Sort Method: quicksort  Memory: 25kB
                    ->  Nested Loop  (cost=31.16..2166.95 rows=8835672 width=37) (actual time=0.022..0.028 rows=4 loops=1)
                          ->  HashAggregate  (cost=30.60..32.60 rows=200 width=32) (actual time=0.006..0.006 rows=1 loops=1)
                                Group Key: all_acp_ids.acp_id
                                ->  CTE Scan on all_acp_ids  (cost=0.00..27.20 rows=1360 width=32) (actual time=0.003..0.004 rows=1 loops=1)
                          ->  Index Scan using f1_folio_milestones_acp_id_idx on f1_folio_milestones  (cost=0.56..10.63 rows=4 width=37) (actual time=0.011..0.015 rows=4 loops=1)
                                Index Cond: (acp_id = all_acp_ids.acp_id)
              ->  Sort  (cost=126946.52..129413.75 rows=986892 width=34) (actual time=4462.727..4526.822 rows=448136 loops=1)
                    Sort Key: sa_milestone_overrides.view, sa_milestone_overrides.milestone, sa_milestone_overrides.acp_id
                    Sort Method: quicksort  Memory: 106092kB
                    ->  Seq Scan on sa_milestone_overrides  (cost=0.00..28688.92 rows=986892 width=34) (actual time=0.003..164.348 rows=986867 loops=1)
Planning time: 1.394 ms
Execution time: 4583.295 ms

It effectively has a lower total cost, but takes almost 10 times as long as before!

Disabling merge joins makes Aurora revert to a hash join, which gives the expected execution time -- but permanently disabling it is not an option. Curiously though, disabling nested loops gives an even better result while still using a merge join...

Unique  (cost=3610230.74..3676431.05 rows=8826708 width=43) (actual time=2.465..2.466 rows=1 loops=1)
  CTE all_acp_ids
    ->  Seq Scan on temp_table_de3398bacb6c4e8ca8b37be227eac089  (cost=0.00..23.60 rows=1360 width=32) (actual time=0.004..0.004 rows=1 loops=1)
  ->  Sort  (cost=3610207.14..3632273.91 rows=8826708 width=43) (actual time=2.464..2.464 rows=4 loops=1)
        Sort Key: f1_folio_milestones.acp_id, (COALESCE(sa_milestone_overrides.team, f1_folio_milestones.team_responsible))
        Sort Method: quicksort  Memory: 25kB
        ->  Merge Left Join  (cost=59.48..2397946.87 rows=8826708 width=43) (actual time=2.450..2.455 rows=4 loops=1)
              Merge Cond: (f1_folio_milestones.acp_id = (sa_milestone_overrides.acp_id)::text)
              Join Filter: ((f1_folio_milestones.milestone = sa_milestone_overrides.milestone) AND (f1_folio_milestones.view = (sa_milestone_overrides.view)::text))
              ->  Merge Join  (cost=40.81..2267461.88 rows=8826708 width=37) (actual time=2.312..2.317 rows=4 loops=1)
                    Merge Cond: (f1_folio_milestones.acp_id = all_acp_ids.acp_id)
                    ->  Index Scan using f1_folio_milestones_acp_id_idx on f1_folio_milestones  (cost=0.56..2223273.29 rows=17653416 width=37) (actual time=0.020..2.020 rows=1952 loops=1)
                    ->  Sort  (cost=40.24..40.74 rows=200 width=32) (actual time=0.011..0.012 rows=1 loops=1)
                          Sort Key: all_acp_ids.acp_id
                          Sort Method: quicksort  Memory: 25kB
                          ->  HashAggregate  (cost=30.60..32.60 rows=200 width=32) (actual time=0.008..0.008 rows=1 loops=1)
                                Group Key: all_acp_ids.acp_id
                                ->  CTE Scan on all_acp_ids  (cost=0.00..27.20 rows=1360 width=32) (actual time=0.005..0.005 rows=1 loops=1)
              ->  Materialize  (cost=0.42..62167.38 rows=987968 width=34) (actual time=0.021..0.101 rows=199 loops=1)
                    ->  Index Scan using sa_milestone_overrides_acp_id_index on sa_milestone_overrides  (cost=0.42..59697.46 rows=987968 width=34) (actual time=0.019..0.078 rows=199 loops=1)
Planning time: 5.500 ms
Execution time: 2.516 ms
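
For reference, plans like the two above can be reproduced without any permanent configuration change, using PostgreSQL's standard session-local planner settings, along these lines:

-- Applies to the current session only; RESET (or ending the session) undoes it.
SET enable_mergejoin = off;    -- planner falls back to the hash join
EXPLAIN (ANALYZE) SELECT ...;  -- the query from above
RESET enable_mergejoin;

SET enable_nestloop = off;     -- yields the fast merge join plan shown above
EXPLAIN (ANALYZE) SELECT ...;
RESET enable_nestloop;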

We have asked the AWS support team and they are still looking into the issue, but in the meantime we are wondering: what could explain such a behaviour difference?

While looking at some of the documentation for the database, I read that Aurora favors cost over time -- and hence it uses the query plan that has the lowest cost.

But as we can see, it's far from being optimal given its response time... Is there a threshold or a setting that could make the database use a more expensive -- but faster -- query plan?


Solution

One problem stands out (in all query plans) and is easy to fix:

Seq Scan on temp_table_de3398bacb6c4e8ca8b37be227eac089 (cost=0.00..23.60 rows=1360 width=32) (actual time=0.004..0.005 rows=1 loops=1)

It says: Postgres expects 1360 rows in that table, but finds only 1.

You commented:

it's a normal table which gets dropped after everything is done. [...] the query plan was done with a single value in the table, but it can have up to 5000 entries in total, all distinct.

... which explains the totally misleading expected row count. Statistics are kept up to date by autovacuum, but it needs some time to kick in. If you run a complex query right after populating (or largely modifying) such a table, it is wise to run at least a manual ANALYZE in between:

ANALYZE temp_table_de3398bacb6c4e8ca8b37be227eac089;

Maybe even VACUUM ANALYZE, but that cannot be run inside a transaction.

An actual temporary table (CREATE TEMP TABLE ...) would seem like the better choice for your use case -- as long as other sessions don't need to see the same state of the table. Better performance overall. But it's worth mentioning that temporary tables are exempt from autovacuum entirely, so the manual ANALYZE is all the more important there.
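
A minimal sketch of that workflow (tmp_acp_ids is a placeholder name; ON COMMIT DROP is optional):

BEGIN;

CREATE TEMP TABLE tmp_acp_ids (acp_id text) ON COMMIT DROP;

INSERT INTO tmp_acp_ids (acp_id)
SELECT ...;           -- populate with the up-to-5000 distinct ids

ANALYZE tmp_acp_ids;  -- mandatory here: autovacuum never touches temp tables
                      -- (ANALYZE works inside a transaction, unlike VACUUM)

-- run the big query against tmp_acp_ids here

COMMIT;               -- ON COMMIT DROP removes the table automatically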

Failing that, Postgres tries all sorts of inapt query plans based on a completely misleading row estimate for the core table. There are many possible reasons why Postgres picks a different query plan -- anything that changes cost estimates. But fix that, and the problem at hand may very well go away.

"cost" in Postgres query plans is the estimated time (in an arbitrary unit).
"actual time" is time measured after the fact.

Postgres always favors lower cost. That's how it decides which plan to pick. Not related to Aurora in any way. Your main problem is misleading statistics --> misleading row estimates --> misleading cost estimates. Details in the manual, starting with the chapters on statistics used by the planner and planner cost constants.
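
To see how far off the stored statistics are, compare the planner's bookkeeping with reality (a quick diagnostic, not a fix):

SELECT reltuples, relpages   -- the planner's stored estimates for the table
FROM   pg_class
WHERE  relname = 'temp_table_de3398bacb6c4e8ca8b37be227eac089';

SELECT count(*)              -- the actual row count
FROM   temp_table_de3398bacb6c4e8ca8b37be227eac089;

For a table that has never been analyzed, the planner falls back to a size-based default estimate -- that is where the rows=1360 guess in your plans comes from.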

Alternative query

That said, I would simplify the query. The CTE adds nothing useful to the query -- materialized or not. Remove it.

SELECT DISTINCT m.acp_id, 
       COALESCE(o.team, m.team_responsible) 
FROM   temp_table_de3398bacb6c4e8ca8b37be227eac089 t
JOIN   public.f1_folio_milestones                  m USING (acp_id)
LEFT   JOIN public.sa_milestone_overrides          o USING (milestone, view, acp_id);

USING is just convenience to save typing. No performance impact. Only use where applicable.

Given your cardinalities (max 5000 rows in t, but 17 million in m) I would try this alternative query with a lateral subquery:

SELECT t.acp_id, om.team
FROM   temp_table_de3398bacb6c4e8ca8b37be227eac089 t
CROSS  JOIN LATERAL (
   SELECT COALESCE(o.team, m.team_responsible) AS team
   FROM   public.f1_folio_milestones         m
   LEFT   JOIN public.sa_milestone_overrides o USING (milestone, view, acp_id)
   WHERE  m.acp_id = t.acp_id
   ) om;

With favorable data distribution and fitting indexes, it might be much faster.
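
"Fitting indexes" for the lateral query would be multicolumn indexes leading with the lookup column, followed by the remaining join columns -- for example (assuming no such indexes exist yet):

-- supports the per-acp_id lookup in the lateral subquery:
CREATE INDEX ON f1_folio_milestones (acp_id, milestone, "view", team_responsible);

-- same leading column for the join to the overrides:
CREATE INDEX ON sa_milestone_overrides (acp_id, milestone, "view", team);

The trailing team / team_responsible columns just widen the key so index-only scans become possible (Postgres 9.6 has no INCLUDE clause yet).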
