How does one interpret the following PostgreSQL query plan

https://stackoverflow.com/questions/20424053

29-08-2022
|

Question

Please, observe:

(Forgot to add order, the plan is updated)

The query:

EXPLAIN ANALYZE
SELECT DISTINCT(id), special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code, 
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via, 
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
JOIN invoiceItems items ON items.invoice_id = o.id 
WHERE items.qty < 5
ORDER BY modified
LIMIT 100

The result:

Limit  (cost=2931740.10..2931747.85 rows=100 width=635) (actual time=414307.004..414387.899 rows=100 loops=1)
  ->  Unique  (cost=2931740.10..3076319.37 rows=1865539 width=635) (actual time=414307.001..414387.690 rows=100 loops=1)
        ->  Sort  (cost=2931740.10..2936403.95 rows=1865539 width=635) (actual time=414307.000..414325.058 rows=2956 loops=1)
              Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text)), o.id, o.special, o.customer, o.business_no, o.bill_to_name, o.bill_to_address1, o.bill_to_address2, o.bill_to_postal_code, o.ship_to_name, o.ship_to_address1, o.ship_to_address2, (...)
              Sort Method: external merge  Disk: 537240kB
              ->  Hash Join  (cost=11579.63..620479.38 rows=1865539 width=635) (actual time=1535.805..131378.864 rows=1872673 loops=1)
                    Hash Cond: (items.invoice_id = o.id)
                    ->  Seq Scan on invoiceitems items  (cost=0.00..78363.45 rows=1865539 width=4) (actual time=0.110..4591.117 rows=1872673 loops=1)
                          Filter: (qty < 5)
                          Rows Removed by Filter: 1405763
                    ->  Hash  (cost=5498.18..5498.18 rows=64996 width=635) (actual time=1530.786..1530.786 rows=64996 loops=1)
                          Buckets: 1024  Batches: 64  Memory Usage: 598kB
                          ->  Hash Left Join  (cost=113.02..5498.18 rows=64996 width=635) (actual time=0.214..1043.207 rows=64996 loops=1)
                                Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
                                ->  Hash Left Join  (cost=75.35..4566.81 rows=64996 width=607) (actual time=0.154..754.957 rows=64996 loops=1)
                                      Hash Cond: (o.terms_id = terms.terms_id)
                                      ->  Hash Left Join  (cost=37.67..3800.33 rows=64996 width=579) (actual time=0.071..506.145 rows=64996 loops=1)
                                            Hash Cond: (o.rep_id = reps.rep_id)
                                            ->  Seq Scan on invoices o  (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.010..235.977 rows=64996 loops=1)
                                            ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.044..0.044 rows=4 loops=1)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                                  ->  Seq Scan on reps  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
                                      ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.067..0.067 rows=3 loops=1)
                                            Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                            ->  Seq Scan on terms  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.001..0.007 rows=3 loops=1)
                                ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.043..0.043 rows=4 loops=1)
                                      Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                      ->  Seq Scan on shipvia  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.027..0.032 rows=4 loops=1)
Total runtime: 414488.582 ms

This is, obviously, awful. I am pretty new to interpreting query plans and would like to know how to extract the useful performance improvement hints from such a plan.

EDIT 1

Two kinds of entities are involved in this query - invoices and invoice items having the 1-many relationship.
An invoice item specifies the quantity of it within the parent invoice.
The given query returns 100 invoices which have at least one item with the quantity of less than 5.

That should explain why I need DISTINCT - an invoice may have several items satisfying the filter, but I do not want that same invoice returned multiple times. Hence the usage of DISTINCT. However, I am perfectly aware that there may be better means to accomplish the same semantics than using DISTINCT - I am more than willing to learn about them.

EDIT 2

Please, find below the indexes on the invoiceItems table at the time of the query:

CREATE INDEX invoiceitems_invoice_id_idx ON invoiceitems (invoice_id);
CREATE INDEX invoiceitems_invoice_id_name_index ON invoiceitems (invoice_id, name varchar_pattern_ops);
CREATE INDEX invoiceitems_name_index ON invoiceitems (name varchar_pattern_ops);
CREATE INDEX invoiceitems_qty_index ON invoiceitems (qty);

EDIT 3

The advice given by https://stackoverflow.com/users/808806/yieldsfalsehood as to how eliminate DISTINCT (and why) turns out to be a really good one. Here is the new query:

EXPLAIN ANALYZE
SELECT id, special, customer, business_no, bill_to_name, bill_to_address1, bill_to_address2, bill_to_postal_code, ship_to_name, ship_to_address1, ship_to_address2, ship_to_postal_code, 
purchase_order_no, ship_date::text, calc_discount_text(o) AS discount, discount_absolute, delivery, hst_percents, sub_total, total_before_hst, hst, total, total_discount, terms, rep, ship_via, 
item_count, version, to_char(modified, 'YYYY-MM-DD HH24:MI:SS') AS "modified", to_char(created, 'YYYY-MM-DD HH24:MI:SS') AS "created"
FROM invoices o
LEFT JOIN reps ON reps.rep_id = o.rep_id
LEFT JOIN terms ON terms.terms_id = o.terms_id
LEFT JOIN shipVia ON shipVia.ship_via_id = o.ship_via_id
WHERE EXISTS (SELECT 1 FROM invoiceItems items WHERE items.invoice_id = id AND items.qty < 5)
ORDER BY modified DESC
LIMIT 100

Here is the new plan:

Limit  (cost=64717.14..64717.39 rows=100 width=635) (actual time=7830.347..7830.869 rows=100 loops=1)
  ->  Sort  (cost=64717.14..64827.01 rows=43949 width=635) (actual time=7830.334..7830.568 rows=100 loops=1)
        Sort Key: (to_char(o.modified, 'YYYY-MM-DD HH24:MI:SS'::text))
        Sort Method: top-N heapsort  Memory: 76kB
        ->  Hash Left Join  (cost=113.46..63037.44 rows=43949 width=635) (actual time=2.322..6972.679 rows=64467 loops=1)
              Hash Cond: (o.ship_via_id = shipvia.ship_via_id)
              ->  Hash Left Join  (cost=75.78..50968.72 rows=43949 width=607) (actual time=0.650..3809.276 rows=64467 loops=1)
                    Hash Cond: (o.terms_id = terms.terms_id)
                    ->  Hash Left Join  (cost=38.11..50438.25 rows=43949 width=579) (actual time=0.550..3527.558 rows=64467 loops=1)
                          Hash Cond: (o.rep_id = reps.rep_id)
                          ->  Nested Loop Semi Join  (cost=0.43..49796.28 rows=43949 width=551) (actual time=0.015..3200.735 rows=64467 loops=1)
                                ->  Seq Scan on invoices o  (cost=0.00..2868.96 rows=64996 width=551) (actual time=0.002..317.954 rows=64996 loops=1)
                                ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems items  (cost=0.43..7.61 rows=42 width=4) (actual time=0.030..0.030 rows=1 loops=64996)
                                      Index Cond: (invoice_id = o.id)
                                      Filter: (qty < 5)
                                      Rows Removed by Filter: 1
                          ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.213..0.213 rows=4 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                ->  Seq Scan on reps  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.183..0.192 rows=4 loops=1)
                    ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.063..0.063 rows=3 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 1kB
                          ->  Seq Scan on terms  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.044..0.050 rows=3 loops=1)
              ->  Hash  (cost=22.30..22.30 rows=1230 width=36) (actual time=0.096..0.096 rows=4 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    ->  Seq Scan on shipvia  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.071..0.079 rows=4 loops=1)
Total runtime: 7832.750 ms

Is it the best I can count on? I have restarted the server (to clean the database caches) and rerun the query without EXPLAIN ANALYZE. It takes almost 5 seconds. Can it be improved even further? I have 65,000 invoices and 3,278,436 invoice items.

EDIT 4

Found it. I was ordering by a computation result, modified = to_char(modified, 'YYYY-MM-DD HH24:MI:SS'). Adding an index on the modified invoice field and ordering by the field itself brings the result to under 100 ms !

The final plan is:

Limit  (cost=1.18..1741.92 rows=100 width=635) (actual time=3.002..27.065 rows=100 loops=1)
  ->  Nested Loop Left Join  (cost=1.18..765042.09 rows=43949 width=635) (actual time=2.989..25.989 rows=100 loops=1)
        ->  Nested Loop Left Join  (cost=1.02..569900.41 rows=43949 width=607) (actual time=0.413..16.863 rows=100 loops=1)
              ->  Nested Loop Left Join  (cost=0.87..386185.48 rows=43949 width=579) (actual time=0.333..15.694 rows=100 loops=1)
                    ->  Nested Loop Semi Join  (cost=0.72..202470.54 rows=43949 width=551) (actual time=0.017..13.965 rows=100 loops=1)
                          ->  Index Scan Backward using invoices_modified_index on invoices o  (cost=0.29..155543.23 rows=64996 width=551) (actual time=0.003..4.543 rows=100 loops=1)
                          ->  Index Scan using invoiceitems_invoice_id_idx on invoiceitems items  (cost=0.43..7.61 rows=42 width=4) (actual time=0.079..0.079 rows=1 loops=100)
                                Index Cond: (invoice_id = o.id)
                                Filter: (qty < 5)
                                Rows Removed by Filter: 1
                    ->  Index Scan using reps_pkey on reps  (cost=0.15..4.17 rows=1 width=36) (actual time=0.007..0.008 rows=1 loops=100)
                          Index Cond: (rep_id = o.rep_id)
              ->  Index Scan using terms_pkey on terms  (cost=0.15..4.17 rows=1 width=36) (actual time=0.003..0.004 rows=1 loops=100)
                    Index Cond: (terms_id = o.terms_id)
        ->  Index Scan using shipvia_pkey on shipvia  (cost=0.15..4.17 rows=1 width=36) (actual time=0.006..0.008 rows=1 loops=100)
              Index Cond: (ship_via_id = o.ship_via_id)
Total runtime: 27.572 ms

It is amazing! Thank you all for the help.

Solution

For starters, it's pretty standard to post explain plans to http://explain.depesz.com - that'll add some pretty formatting to it, give you a nice way to distribute the plan, and let you anonymize plans that might contain sensitive data. Even if you're not distributing the plan it makes it a lot easier to understand what's happening and can sometimes illustrate exactly where a bottleneck is.

There are countless resources that cover interpreting the details of postgres explain plans (see https://wiki.postgresql.org/wiki/Using_EXPLAIN). There are a lot of little details that get taken in to account when the database chooses a plan, but there are some general concepts that can make it easier. First, get a grasp of the page-based layout of data and indexes (you don't need to know the details of the page format, just how data and indexes get split in to pages). From there, get a feel for the two basic data access methods - full table scans and index scans - and with a little thought it should start to become clear the different situations where one would be preferred to the other (also keep in mind that an index scan isn't even always possible). At that point you can start looking in to some of the different configuration items that affect plan selection in the context of how they might tip the scale in favor of a table scan or an index scan.

Once you've got that down, move on up the plan and read in to the details of the different nodes you find - in this plan you've got a lot of hash joins, so read up on that to start with. Then, to compare apples to apples, disable hash joins entirely ("set enable_hashjoin = false;") and run your explain analyze again. Now what join method do you see? Read up on that. Compare the estimated cost of that method with the estimated cost of the hash join. Why might they be different? The estimated cost of the second plan will be higher than this first plan (otherwise it would have been preferred in the first place) but what about the real time that it takes to run the second plan? Is it lower or higher?

Finally, to address this plan specifically. With regards to that sort that's taking a long time: distinct is not a function. "DISTINCT(id)" does not say "give me all the rows that are distinct on only the column id", instead it is sorting the rows and taking the unique values based on all columns in the output (i.e. it is equivalent to writing "distinct id ..."). You should probably re-consider if you actually need that distinct in there. Normalization will tend to design away the need for distincts, and while they will occasionally be needed, whether they really are super truly needed is not always true.

OTHER TIPS

You begin by chasing down the node that takes the longest, and start optimizing there. In your case, that appears to be

Seq Scan on invoiceitems items

You should add an index there, and problem also to the other tables.

You could also try increasing work_mem to get rid of the external sort.

When you have done that, the new plan will probably look completely differently, so then start over.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow