Question

I have a simple query such as:

SELECT 
    * 
FROM 
    example 
WHERE 
    filter_1 = ? 
    AND filter_2 = ? 
LIMIT 
    10

The table is quite large (about 100 million rows) and it has an index similar to the following (the actual index has one more column on the right side, but that shouldn't make any difference):

CREATE INDEX example_idx
ON public.example 
USING btree (filter_1, filter_2, (...));

So now let's describe the issue: when I run my query in a prepared statement, the first 5 executions use a custom plan. Then the generic plan is seen as less costly and is used for the remainder of the prepared statement's lifetime.
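
For illustration, the prepared statement looks roughly like this (the parameter types are guesses based on the casts visible in the plans below), and the switch can be observed by running EXPLAIN ANALYZE EXECUTE repeatedly:

PREPARE stmt (text, custom_enum) AS
    SELECT *
    FROM example
    WHERE filter_1 = $1
      AND filter_2 = $2
    LIMIT 10;

-- Executions 1 to 5 are planned with the actual parameter values (custom plans).
EXPLAIN ANALYZE EXECUTE stmt('rare_value_1', 'frequent_value_2');

-- From the sixth execution on, the cached generic plan is used as long as its
-- estimated cost is not higher than the average cost of the custom plans so far.
EXPLAIN ANALYZE EXECUTE stmt('rare_value_1', 'frequent_value_2');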

Here is an EXPLAIN ANALYZE when Postgres uses the custom plan:

Limit  (cost=0.57..39.35 rows=10 width=78) (actual time=0.024..0.034 rows=8 loops=1)
  ->  Index Scan using example_idx on example c0  (cost=0.57..12345.29 rows=3183 width=78) (actual time=0.024..0.032 rows=8 loops=1)
        Index Cond: (((filter_1)::text = 'rare_value_1'::text) AND (filter_2 = 'frequent_value_2'::custom_enum))
Planning Time: 0.098 ms
Execution Time: 0.045 ms

Here is an EXPLAIN when Postgres uses the generic plan:

Limit  (cost=0.00..11.31 rows=10 width=78)
  ->  Seq Scan on example_idx c0  (cost=0.00..3469262.28 rows=3067235 width=78)
        Filter: (((filter_1)::text = $1) AND (filter_2 = $2))

Here, we can clearly see that the cost of the generic plan is lower.

My problem is understanding how the row count estimates in the Index Scan and in the Seq Scan are computed.

The documentation explains how, and if I follow that calculation, I arrive at 3183, which is the estimated row count for the custom plan:

rare_value_1 and frequent_value_2 are both in the MCV lists, with frequencies of 0.00002667 and 0.99783 respectively. The estimated table row count is 119622152.

0.00002667 * 0.99783 * 119622152 = 3183
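
These numbers can be checked directly in the statistics views; for example (assuming the table and column names above):

SELECT attname, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'example'
  AND attname IN ('filter_1', 'filter_2');

-- estimated total row count used by the planner
SELECT reltuples FROM pg_class WHERE relname = 'example';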

The remaining question is, how is it done for the generic plan?

I found that, for some unknown reason, the MCV frequencies are ignored and PostgreSQL just looks at the n_distinct values for columns filter_1 and filter_2 (13 and 3, respectively):

estimated row count = estimated total number of rows in table / ( n_distinct("filter_1") * n_distinct("filter_2") )
                    = 119622152 / (13 * 3)
                    = 3067235
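
The n_distinct values can be checked the same way, and redoing the arithmetic reproduces the generic plan estimate (the selectivity estimator for "column = <parameter>" appears to fall back to 1/n_distinct per column when the comparison value is unknown):

SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'example'
  AND attname IN ('filter_1', 'filter_2');

-- 1/13 * 1/3 * 119622152 ≈ 3067235
SELECT round(119622152 / (13.0 * 3.0)) AS generic_plan_row_estimate;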

My question is: why? Why does PostgreSQL use such a primitive way to estimate the row count when it has access to better statistics in the form of MCV frequencies?

PostgreSQL version: 11 (so setting plan_cache_mode = force_custom_plan is not possible for us at the moment).
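
For reference, on PostgreSQL 12 and later this is the setting we would use:

-- PostgreSQL 12+ only: never switch prepared statements to the generic plan
SET plan_cache_mode = force_custom_plan;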

Solution

There is no better way to estimate the row count for the generic plan.

If you chose to go by the frequencies of the most common values (MCVs), there would be several problems:

  • Which MCV do you choose?

  • The estimate would be better under the assumption that both parameters are MCVs, but that is an unfounded assumption. If both were rare values, the estimate would be even farther off the mark than the current one.

Choosing the average is the best that can be done: it is something of a middle ground. Your case is so far off the mark because one of the values is extremely rare, which makes the custom plan estimate very low, while the columns have few distinct values and one of them is very frequent, so the average selectivity is high and the generic plan estimate ends up far too high.

You have two choices:

  • Use dynamic SQL so that you always get a custom plan (see the sketch after this list).

  • Change the query to read

    WHERE filter_1 || '' = ?
    

    so that it cannot use the index.
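
For the first option, here is a minimal PL/pgSQL sketch (the function name and the SETOF example return type are placeholders for illustration). EXECUTE ... USING plans the statement anew on every call with the actual parameter values, so the MCV-based estimates are always used:

CREATE FUNCTION lookup_example(p_filter_1 text, p_filter_2 custom_enum)
RETURNS SETOF example
LANGUAGE plpgsql AS
$$
BEGIN
    -- dynamic SQL: no plan caching, a custom plan is built on every call
    RETURN QUERY EXECUTE
        'SELECT * FROM example WHERE filter_1 = $1 AND filter_2 = $2 LIMIT 10'
        USING p_filter_1, p_filter_2;
END;
$$;

-- called like a regular query
SELECT * FROM lookup_example('rare_value_1', 'frequent_value_2');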

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange