Postgres prefers slow Seq Scan over fast Index Scan
Question
Tables
I have the following two tables: books, which contains ~5 million rows:
Table "public.books"
Column | Type | Modifiers
---------+--------------------------+----------------------------------------------------
id | integer | not null default nextval('books_id_seq'::regclass)
run__id | integer |
time | timestamp with time zone | not null
venue | character varying | not null
base | character varying | not null
quote | character varying | not null
Indexes:
"books_pkey" PRIMARY KEY, btree (id)
"run_books_index" UNIQUE, btree (run__id, id)
Foreign-key constraints:
"books_run__id_fkey" FOREIGN KEY (run__id) REFERENCES runs(id)
Referenced by:
TABLE "orders" CONSTRAINT "orders_book__id_fkey" FOREIGN KEY (book__id) REFERENCES books(id)
and orders, which contains ~3 billion rows:
Table "public.orders"
Column | Type | Modifiers
----------+------------------+-----------
book__id | integer | not null
is_buy | boolean | not null
qty | double precision | not null
price | double precision | not null
Indexes:
"orders_pkey" PRIMARY KEY, btree (book__id, is_buy, price)
Foreign-key constraints:
"orders_book__id_fkey" FOREIGN KEY (book__id) REFERENCES books(id)
Query
I want to run the following query:
SELECT * FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1
Postgres suggests the following execution plan:
orderbooks=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=2465.94..58122020.57 rows=762939 width=53) (actual time=1394110.723..2561879.775 rows=45216 loops=1)
Hash Cond: (o.book__id = b.id)
Buffers: shared hit=4080 read=18437586
-> Seq Scan on orders o (cost=0.00..47292761.72 rows=2885110272 width=21) (actual time=0.018..2265529.184 rows=2883798728 loops=1)
Buffers: shared hit=4073 read=18437586
-> Hash (cost=2448.52..2448.52 rows=1393 width=32) (actual time=0.024..0.024 rows=15 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
Buffers: shared hit=4
-> Index Scan using run_books_index on books b (cost=0.43..2448.52 rows=1393 width=32) (actual time=0.011..0.012 rows=15 loops=1)
Index Cond: (run__id = 1)
Buffers: shared hit=4
Planning time: 2.228 ms
Execution time: 2561882.272 ms
(13 rows)
I.e. it sequentially scans the ~3 billion rows in orders, which takes ~40 minutes.
If I disable sequential scans, Postgres suggests the following, much faster, plan:
orderbooks=> SET enable_seqscan = OFF;
SET
orderbooks=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.14..219271454.14 rows=762939 width=53) (actual time=0.024..15.234 rows=45216 loops=1)
Buffers: shared hit=707
-> Index Scan using run_books_index on books b (cost=0.43..2448.52 rows=1393 width=32) (actual time=0.011..0.014 rows=15 loops=1)
Index Cond: (run__id = 1)
Buffers: shared hit=4
-> Index Scan using orders_pkey on orders o (cost=0.70..156529.57 rows=87819 width=21) (actual time=0.007..0.551 rows=3014 loops=15)
Index Cond: (book__id = b.id)
Buffers: shared hit=703
Planning time: 0.302 ms
Execution time: 18.054 ms
(10 rows)
Question
Why does Postgres prefer the slow execution plan? I see that its estimated cost is lower than that of the fast plan, but how can that be, given the actual run times?
Notes
I ran VACUUM (VERBOSE, ANALYZE) shortly before trying to execute/analyze the queries.

If I don't request the column orders.qty in the output, then Postgres chooses the fast plan (without needing to disable sequential scans):
orderbooks=> EXPLAIN SELECT b.id, b.run__id, b.time, b.venue, b.base, b.quote, o.book__id, o.is_buy, o.price FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=1.14..6473136.93 rows=762939 width=45)
-> Index Scan using run_books_index on books b (cost=0.43..2448.52 rows=1393 width=32)
Index Cond: (run__id = 1)
-> Index Only Scan using orders_pkey on orders o (cost=0.70..3766.96 rows=87819 width=13)
Index Cond: (book__id = b.id)
(5 rows)
Domain
The data structure I'm trying to model is a limit order book. A row in the books table contains information about a single order book, while a row in the orders table describes a single order present in the given book.
I have chosen not to allow multiple orders with the same price for the same order book -- hence the (book__id, is_buy, price) primary key on orders. The reason is that I don't need to distinguish between multiple different orders (e.g. from different people) at the same price. So if my input data, for a given order book, contains two same-priced orders, I simply convert them into a single order whose quantity is the sum of the two order quantities.
Versions
PostgreSQL 9.6.19 on x86_64-pc-linux-gnu, compiled by Debian clang version 10.0.1, 64-bit
Solution
There seem to be several things off:
- The estimates are wrong. Since you already ANALYZEd both tables, retry with a higher default_statistics_target.
- Since index scans seem to be estimated as too expensive, perhaps your setting of random_page_cost is too high. Also, perhaps effective_cache_size is too low.
- If you didn't select everything from books, a covering index might be an idea.
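These settings can be tried at the session level before changing anything globally. A sketch with illustrative values (they are assumptions, not recommendations -- tune them to your hardware):

```sql
-- Session-level experiments; persist only what proves itself.
SET random_page_cost = 1.1;          -- closer to seq_page_cost on SSD storage
SET effective_cache_size = '16GB';   -- roughly the RAM available for caching
SET default_statistics_target = 500; -- used by ANALYZE in this session

ANALYZE books;                       -- re-collect stats at the new target

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1;
```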
OTHER TIPS
To elaborate on your problem of bad estimates:
Postgres expects to get 1393 rows from the index scan on run_books_index, but actually finds only 15. That's off by a factor of ~100.
... (cost=0.43..2448.52 rows=1393 width=32) (actual time=0.011..0.012 rows=15 loops=1)
This is after you ran ANALYZE, so we cannot blame outdated statistics. Consequently, Postgres is misled into expecting far more rows from orders than it will eventually retrieve, which heavily favors a sequential scan.
Your alternative query only fetches book__id, is_buy, and price from table orders. Those can be read directly from the PK index orders_pkey with an Index Only Scan, which is as cheap as it gets, even when fetching a large part of the table.
A PK on (book__id, is_buy, price) makes me wonder about your relational model, though. Each book can only have one order (or two, considering is_buy) at the same price? Yet your query still finds ~3000 orders per book. I have a hard time making sense of that.
SELECT * is typically unnecessary and a drag on performance in its own right. Your alternative query seems smarter to begin with.
But since you fetch more columns than your multicolumn index covers, an index on just (book__id) would be more efficient: it is ~1/3 smaller in its minimum form and typically less bloated, too.
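A sketch of that narrower index (the index name is my invention; CONCURRENTLY avoids blocking writes on a table of this size, at the cost of a slower build):

```sql
-- A single-column btree is ~1/3 the width of the three-column PK
-- and can serve the join on book__id alone.
CREATE INDEX CONCURRENTLY orders_book_id_idx ON orders (book__id);
```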
Raising the statistics target gathers more data on "most common values", which also leads to better estimates for the rest. But gathering and using more statistics has a cost of its own. For just a few columns in a few big tables that are critical for query planning, consider targeting those selectively and leaving the global default_statistics_target at its default. Like:
ALTER TABLE books ALTER run__id SET STATISTICS 2000;
Bear in mind that more statistics data can only help with irregular data distribution.
You mentioned ~3 billion rows in orders, but not the cardinality of books, so 2000 is a shot in the dark, assuming that books is not much smaller (given the odd PK in orders). It's well above the default 100, but still below the maximum of 10000.
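To see what the planner currently believes about run__id, you can inspect the pg_stats view. A sketch:

```sql
-- Current planner statistics for books.run__id.
-- n_distinct > 0 is an absolute count of distinct values,
-- n_distinct < 0 is a fraction of the total row count.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'books' AND attname = 'run__id';
```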
Related:
- Inconsistent statistics on jsonb column with btree index
- Optimizing queries on a range of timestamps (two columns)
Consider the rest of Laurenz's advice, too. Those settings are typically important.
If none of this helps, especially if your table is very big and with an unfortunate data distribution, you might try to force your luck with a manual setting for the ratio of distinct values:
ALTER TABLE books ALTER run__id SET (n_distinct = -0.07);
A negative value between -1 and 0 implies a fixed ratio of distinct values to total rows. Your query plan shows 15 rows for a single run__id, and 1/15 ≈ 0.07, so this setting forces Postgres to always expect that frequency. See the details for n_distinct in the manual.
But forced estimates may come around to bite you later.
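Note that the override only takes effect at the next ANALYZE; a sketch, including how to undo it later:

```sql
ANALYZE books;  -- apply the forced n_distinct ratio to the statistics

-- To revert to automatically computed statistics later:
ALTER TABLE books ALTER run__id RESET (n_distinct);
ANALYZE books;
```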
Another alternative, especially if you cannot (or will not) make any of the suggested changes to the DB configuration: use a different query with a LATERAL subquery:
SELECT *
FROM books b
CROSS JOIN LATERAL (
SELECT *
FROM orders
WHERE book__id = b.id
) o
WHERE b.run__id = 1;
That's another workaround. It is typically more expensive for the purpose, but it heavily favors index scans on orders, which seems to be what matters most here.
Finally, Postgres has become gradually smarter about estimates and query plans, so upgrading to a current version might help, too. Version 9.6 is getting old.