Question

We are having an issue with a query joining 2 tables (showing only the relevant columns):

CREATE TABLE IF NOT EXISTS payments (
    id TEXT PRIMARY KEY,
    consumer_id TEXT,
    created_at TIMESTAMP WITHOUT TIME ZONE,
    status TEXT,
    tier TEXT,
    number_of_installments INTEGER,
    authorization_result TEXT,
    authorization_details JSONB,
    last_updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL
);

The assessmentreports table is partitioned using declarative partitioning with the following structure and indexes on all partitions:

CREATE TABLE IF NOT EXISTS assessmentreports (
    id TEXT PRIMARY KEY,
    payment_id TEXT,
    kind TEXT,
    created_at TIMESTAMP WITHOUT TIME ZONE,
    result TEXT,
    metadata JSONB,
    last_updated_at TIMESTAMP WITHOUT TIME ZONE NOT NULL
) PARTITION BY RANGE (created_at);

CREATE INDEX assessmentreports_kind_result ON public.assessmentreports USING btree (kind, result);
CREATE INDEX assessmentreports_payment_id ON public.assessmentreports USING btree (payment_id);
CREATE UNIQUE INDEX assessmentreports_pkey ON public.assessmentreports USING btree (created_at, id);
CREATE INDEX assessmentreports_last_updated_at_idx ON public.assessmentreports USING brin (last_updated_at);

There can be multiple entries in assessmentreports for one payment_id, each with a different kind. Unfortunately, there can also be multiple entries of the same kind for a given payment_id.

Here is a representative example:

SELECT
  payments.id,
  status,
  consumer_id,
  tier,
  authorization_result,
  authorization_details,
  number_of_installments,
  max(result) FILTER (WHERE (kind = 'credit_check')) AS credit_result,
  max(result) FILTER (WHERE (kind = 'new_credit_check')) AS new_check_result,
  max(result) FILTER (WHERE (kind = 'address_assessment')) AS reject_inference_result,
  max(metadata ->> 'credit_record_id') FILTER (WHERE (kind = 'new_credit_check')) AS new_credit_record_id,
  max(metadata ->> 'credit_record_id') FILTER (WHERE (kind = 'credit_check')) AS credit_record_id
FROM payments
LEFT JOIN assessmentreports ON assessmentreports.payment_id = payments.id
  AND kind IN ('credit_check', 'new_credit_check', 'address_assessment')
  AND assessmentreports.created_at < now() -- To remove future partitions from plan
WHERE payments.last_updated_at >= now() - '2 hours'::INTERVAL
GROUP BY 1;

If we keep the time window at 2 hours, the planner uses an index scan on payment_id = payments.id and the query is very fast (2 seconds). Note that I've removed the nodes for most of the partitions since they all look the same.

GroupAggregate  (cost=6231334.74..6231477.18 rows=2187 width=305)
  Group Key: payments.id
  ->  Sort  (cost=6231334.74..6231343.35 rows=3445 width=467)
        Sort Key: payments.id
        ->  Nested Loop Left Join  (cost=0.99..6231132.34 rows=3445 width=467)
              ->  Index Scan using payments_last_updated on payments (cost=0.56..1801.97 rows=2187 width=145)
                    Index Cond: (last_updated >= '2020-12-25 13:57:26.927564'::timestamp without time zone)
              ->  Append  (cost=0.43..2846.99 rows=135 width=344)
                    ->  Index Scan using assessmentreports_2016_payment_id on assessmentreports_2016  (cost=0.43..14.47 rows=1 width=181)
                          Index Cond: (payment_id = payments.id)
                          Filter: ((created_at < '2020-12-25 15:57:26.927564'::timestamp without time zone) AND (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[])))
                    ->  Index Scan using assessmentreports_201701_payment_id on assessmentreports_201701  (cost=0.42..13.39 rows=1 width=161)
                          Index Cond: (payment_id = payments.id)
                          Filter: ((created_at < '2020-12-25 15:57:26.927564'::timestamp without time zone) AND (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[])))
                    ->  Index Scan using assessmentreports_201702_payment_id on assessmentreports_201702  (cost=0.43..17.74 rows=1 width=192)
                          Index Cond: (payment_id = payments.id)
                          Filter: ((created_at < '2020-12-25 15:57:26.927564'::timestamp without time zone) AND (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[])))

When we increase the window even a little beyond 2 hours, the plan changes entirely and performance is orders of magnitude worse (over 40 minutes), even though the number of rows in payments matching the last_updated_at filter hasn't increased that much. Again, I've removed the partition nodes since they all look the same:

GroupAggregate  (cost=35839796.76..35841181.68 rows=21288 width=304)
  Group Key: payments.id
  ->  Sort  (cost=35839796.76..35839880.48 rows=33487 width=463)
        Sort Key: payments.id
        ->  Hash Join  (cost=25622.81..35830297.49 rows=33487 width=463)
              Hash Cond: (assessmentreports_2016.payment_id = payments.id)
              ->  Append  (cost=7457.32..35712704.68 rows=37877067 width=341)
                    ->  Bitmap Heap Scan on assessmentreports_2016  (cost=7457.32..181237.37 rows=228085 width=181)
                          Recheck Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
                          Filter: (created_at <= '2020-12-25 17:02:52.008321'::timestamp without time zone)
                          ->  Bitmap Index Scan on assessmentreports_2016_kind_result  (cost=0.00..7400.30 rows=228085 width=0)
                                Index Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
                    ->  Bitmap Heap Scan on assessmentreports_201701  (cost=4018.67..32291.56 rows=130762 width=161)
                          Recheck Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
                          Filter: (created_at <= '2020-12-25 17:02:52.008321'::timestamp without time zone)
                          ->  Bitmap Index Scan on assessmentreports_201701_kind_result  (cost=0.00..3985.98 rows=130762 width=0)
                                Index Cond: (kind = ANY ('{credit_check,new_credit_check,address_assessment}'::text[]))
              ->  Hash  (cost=17899.39..17899.39 rows=21288 width=144)
                    ->  Index Scan using payments_last_updated on payments  (cost=0.56..17899.39 rows=21288 width=144)
                          Index Cond: (last_updated >= '2020-12-25 07:02:52.008321'::timestamp without time zone)

My understanding of the first plan is that it filters payments by last_updated_at first and then probes assessmentreports on payment_id = payments.id. This naively makes sense to me, since that picks out far fewer rows than a filter on kind alone.

What we've tried so far

  • random_page_cost is set to 4 but should be 1.1 or 1 since we are on SSDs. We tried setting it just for the transaction (roughly as sketched below), and that extends the use of the index-scan plan up to a window of 10 hours. The 10-hour window finishes in 4 seconds, compared to not being done after 10 minutes with the "bitmap" plan.
  • Created an extended statistics object of kind dependencies on (payment_id, kind), but that doesn't seem to do anything.
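
Roughly, the statements were along these lines (the statistics object name is made up here; values illustrative):

BEGIN;
SET LOCAL random_page_cost = 1.1;  -- only for this transaction
-- ... run the reporting query ...
COMMIT;

-- Functional-dependency statistics on the correlated columns, then refresh stats:
CREATE STATISTICS assessmentreports_payment_kind_dep (dependencies)
    ON payment_id, kind FROM assessmentreports;
ANALYZE assessmentreports;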

Any ideas? Are we on the right track in thinking that the index-scan plan is better for this amount of data and it's just PostgreSQL getting confused? Looking at the statistics for the payment_id column, I don't see anything that would help the planner estimate how much filtering on payment_id reduces the number of rows, yet it seems to understand quite well what filtering on kind would do.

Approximate Statistics

I have to be intentionally vague to avoid sharing too much information here.

  1. The total number of rows in assessmentreports (all partitions) is well over half a billion, with a total size of 202 GB.
  2. The number of kinds has increased over time and there are over 20 now.
  3. The number of assessmentreports per payment varies, but for the vast majority of payments all 20+ kinds are present.
  4. payments is 33 GB in size and is NOT partitioned.

Hardware

We are running PostgreSQL 12 (no Redshift) on RDS with 16 vCPUs, 48 GB of RAM, and SSD storage.


Solution

Index

The query plan reveals that you have an index payments_last_updated. That's all we need for payments.

As for assessmentreports:

There can be multiple entries in assessmentreports for one payment_id each with a different kind.

So there should be a multicolumn index on (payment_id, kind). A UNIQUE index would be the natural fit for "one row per kind", but on the partitioned parent a unique index must include the partition key (created_at), and you mention that duplicates of the same kind do occur, so make it a plain index:

CREATE INDEX assessmentreports_payment_id_kind ON assessmentreports (payment_id, kind);

That should help to get index scans.

Query

I would try to move the aggregation down into a LATERAL subquery. That should favor index scans - in addition to the new (?) tailored index above - and it also saves the outer aggregation.

SELECT p.* --  or your list of payments columns
     , a.*
FROM   payments p
LEFT   JOIN LATERAL (
   SELECT max(result)                          FILTER (WHERE kind = 'credit_check')       AS credit_result
        , max(result)                          FILTER (WHERE kind = 'new_credit_check')   AS new_check_result
        , max(result)                          FILTER (WHERE kind = 'address_assessment') AS reject_inference_result
        , max(metadata ->> 'credit_record_id') FILTER (WHERE kind = 'new_credit_check')   AS new_credit_record_id
        , max(metadata ->> 'credit_record_id') FILTER (WHERE kind = 'credit_check')       AS credit_record_id
   FROM   assessmentreports
   WHERE  payment_id = p.id
   AND    kind IN ('credit_check', 'new_credit_check', 'address_assessment')
   AND    created_at < now()
   ) a ON true
WHERE  p.last_updated_at >= now() - interval '2 hours';

Setup

Over 500,000,000 rows in assessmentreports. But you operate with text PK and FK? That's typically more expensive than integer (or bigint if you must) for storage and processing.
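
Just a sketch of what narrower keys could look like - not your actual schema, columns trimmed and names illustrative:

CREATE TABLE payments (
    id              bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    -- ... other columns as before ...
    last_updated_at timestamp NOT NULL
);

CREATE TABLE assessmentreports (
    id         bigint NOT NULL,  -- filled from a sequence
    payment_id bigint NOT NULL REFERENCES payments (id),
    created_at timestamp NOT NULL,
    -- ... other columns as before ...
    PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);

A bigint key is 8 bytes; text ids are typically several times that, repeated in the table, in every index, and in the foreign key column of the big table.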

You might gain some more from reordering table columns favorably, so that columns with the widest alignment come first and alignment padding is minimized.

metadata JSONB is particularly suspicious for a table of that cardinality. Use dedicated columns instead if at all possible; that is much more space-efficient, and NULL storage in plain columns is very cheap.
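
For example, if credit_record_id is the value you actually query (hypothetical column; the backfill would have to be batched at your row counts):

ALTER TABLE assessmentreports ADD COLUMN credit_record_id text;

UPDATE assessmentreports
SET    credit_record_id = metadata ->> 'credit_record_id'
WHERE  metadata ? 'credit_record_id'
AND    credit_record_id IS NULL;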

48 GB of RAM is probably not enough to keep most of your data cached. (You can cache more of it if you waste less storage.) So random_page_cost should be at least 1.1, even with SSD storage.

This index looks odd:

CREATE UNIQUE INDEX assessmentreports_pkey ON assessmentreports USING btree (created_at, id);

For one, don't call a unique index "pkey". Only the PK index should be named like that. And why would you have a unique index on (created_at, id) to begin with?

All the usual performance advice applies. In particular: effective_cache_size. Set it high. Like:

effective_cache_size = 36GB

The manual:

a higher value makes it more likely index scans will be used
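
On a self-managed server you could set both like this (on RDS, change them in the DB parameter group instead; values are only a starting point):

ALTER SYSTEM SET effective_cache_size = '36GB';
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();  -- both take effect on reload, no restart needed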

Finally, I don't see parallelism in your query plans. Might help with a table of your caliber. Did you configure it properly? The manual:

Whenever PostgreSQL needs to combine rows from multiple sources into a single result set, it uses an Append or MergeAppend plan node. This commonly happens when implementing UNION ALL or when scanning a partitioned table. Such nodes can be used in parallel plans just as they can in any other plan. However, in a parallel plan, the planner may instead use a Parallel Append node.
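
Worth checking the relevant settings and experimenting per session, for example (values illustrative):

SHOW max_parallel_workers_per_gather;  -- 0 disables parallel query entirely
SHOW max_parallel_workers;
SHOW max_worker_processes;

SET max_parallel_workers_per_gather = 4;  -- session-level experiment
-- then re-run EXPLAIN (ANALYZE, BUFFERS) on the query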
