Question

CREATE TABLE product (
  product_id     SERIAL,
  factory_key    VARCHAR(60),
  relevant       BOOLEAN
)
Indexes:
"product_factory_key_key" btree (factory_key);
"product_factory_key_relevant_key" btree (factory_key, relevant) WHERE relevant = false;
"product_relevant_key" btree (relevant);

Facts:

  1. We have about 100 million records in the product table
  2. There are a small number of factories. For example, 1 factory could have 5 million products.
  3. There are millions of factory keys
  4. Only a small number of rows are NOT relevant for each factory. For example, there is a factory with 5 million products, has about 100 products that are NOT relevant.
  5. However, There are million of rows of NOT relevant rows. As, the most common case is one factory key, 5 rows of products, and with maybe 2 rows of NOT relevant.

This is the problem query:

SELECT * FROM product WHERE factory_key='some_product_key' AND relevant=false LIMIT 10;

Explain analyze:

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..23.06 rows=10 width=188) (actual time=2709.654..32252.961 rows=10 loops=1)
   ->  Seq Scan on product  (cost=0.00..7366785.34 rows=3194759 width=188) (actual time=2709.634..32252.904 rows=10 loops=1)
         Filter: ((NOT relevant) AND ((product_key)::text = 'some_product_key'::text))
         Rows Removed by Filter: 449486
 Total runtime: 32253.150 ms
(5 rows)

Problem:

This is problematic because:

  1. I believe the planner chose to use seq scan because there are so many rows that match this factory. (about 3.2 million rows match this factory or approx 3%)

  2. However, because only a EXTREMELY small number of rows are NOT relevant. And I am looking for NOT relevant. A seq scan end up being extremely expensive.

I have already created an composite index product_factory_key_relevant_key, however it is not taking advantage of the index.

EDIT:

I am trying to force postgres to use the composite key: product_factory_key_relevant_key

SET enable_seqscan=off

Although, it is now using the index scan. It is actually still slower than a seqscan. (so I guess the planner was correct in doing a seq scan)

                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..34.03 rows=10 width=188) (actual time=8.088..469974.692 rows=10 loops=1)
   ->  Index Scan using product_factory_key_relevant_key on product  (cost=0.57..10689307.49 rows=3194776 width=188) (actual time=8.083..469974.655 rows=10 loops=1)
         Index Cond: (relevant = false)
         Filter: ((NOT relevant) AND ((product_key)::text = 'some_product_key'::text))
         Rows Removed by Filter: 2205295
 Total runtime: 469974.820 ms
(6 rows)
Was it helpful?

Solution

Overriding cost parameters

You can't ever force PostgreSQL to use a particular index, or totally prevent it from doing a seqscan.

However, you can tell it to avoid doing certain scan types if it possibly can, by setting the relevant enable_ parameters to off. It's really a feature intended only for debugging.

For testing, try:

SET enable_seqscan = off;

if Pg can possibly use an index scan (or something else) it will.

You may also want to consider:

SET random_page_cost = 1.1

i.e. tell PostgreSQL that random I/O is only slightly more expensive than sequential I/O. This is usually true on systems with SSDs, or where most of the DB is cached in RAM. It will be more likely to choose an index in this case.

Of course, if your system's random I/O is actually more expensive, then using an index is likely to be slower.

Selectivity, partial indexes

What you should really do is follow the advice you've already been given. Create the index in order of selectivity - if relevant is less common, use that. You can even go a step further and create a partial index:

CREATE INDEX idx_name_blah ON tbl_name_blah (factory_key) WHERE (NOT relevant);

This index only contains values for relevant = 'f'. It can only be used for queries where the planner knows relevant will be false. On the other hand, it will be a much smaller, faster index.

Statistics

You might also have inaccurate statistics, causing PostgreSQL to think value frequencies are different than they really are for your table. explain analyze will help show this.

You can also just ANALYZE my_table in case the stats are just out of date; if so, increase the frequency with which autovacuum runs because it's not keeping up.

If the stats are up to date but the planner is still making stats-based mis-estimations, increasing the statistics target for the table (see manual) and re-analyzing can help if it is actually a statistics mis-estimation problem.

Versions

Older PostgreSQL versions tend to be less smart about cost estimation, query optimization, statistics, query execution methods, and pretty much everything else.

If you're not on the latest version, upgrade.

For example, 9.2's index-only scans would allow you to create a partial index

(product_id, factory_key) WHERE (NOT relevant)

and then run a query:

SELECT product_id, factory_key FROM my_table WHERE NOT relevant;

that should only read the index, with no heap access at all.

OTHER TIPS

Imho, the problem in your query is the $1. I imagine this means you're using a prepared statement because you've read somewhere that it's best practice.

This is actually bad, because PG cannot know in advance if the cardinality of the criteria you're using is high or low. As such, it needs to pick a plan that'll fit most cases. If some values are all over the place in such a way that the left-most part of your btree index is useless, it'll have very good reasons to seq scan.

In contrast, if you run the query without preparing it first, it'll plan base on the values you've passed and pick the best plan for that specific value. Therefor, consider running the same query without preparing it, if it begins to use your existing index, you're in this pathological use-case.

If not, what Craig said... In particular the partial index.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top