Suppose I have an append-only table with columns customer_id (randomly generated string) and x, and lookups are always done on customer_id.

Say the data look like the example below, as if we get a batch of rows when a customer first signs up for something, and then never again for that customer.

customer_id=XCVFY0001, x=...
customer_id=XCVFY0001, x=...
(continues for ~1 page with same customer_id)
customer_id=HUMBN0001, x=...
customer_id=HUMBN0001, x=...
(continues for ~1 page with same customer_id)
(and so on...)
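
For concreteness, here is a minimal sketch of that layout; the table name signup_items, the column types, and the generated IDs are all made up for illustration and only approximate the real data:

    -- Append-only table: each signup inserts one batch of rows for a single,
    -- randomly generated customer_id, and that ID never appears again later.
    CREATE TABLE signup_items (
        customer_id text NOT NULL,   -- random string, e.g. 'XCVFY0001'
        x           numeric
    );

    -- ~200 customers with ~50 rows each, delivered one customer at a time
    -- (ORDER BY c), so each customer's rows land on a few contiguous heap pages.
    INSERT INTO signup_items (customer_id, x)
    SELECT upper(substr(md5(c::text), 1, 9)), random()
    FROM   generate_series(1, 200) AS c,
           generate_series(1, 50)  AS i
    ORDER  BY c;

    CREATE INDEX ON signup_items (customer_id);
    ANALYZE signup_items;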

So while customer_id's alphabetical ordering isn't correlated with physical row order, we could make statements like:

  • there are few distinct customer IDs per page (see the ctid sketch after this list)
  • there are few pages per ID
  • there are long "runs" of IDs, or, if you need one customer_id, you're likely to find it on a few contiguous pages
  • in terms of information theory, I think they'd say there's no correlation, but there is high "mutual information"
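
For what it's worth, the first two claims can be sanity-checked by grouping on the heap page number encoded in ctid. This is just a rough probe, run against the hypothetical signup_items table from the sketch above:

    -- Distinct customer IDs per heap page
    SELECT (ctid::text::point)[0]::bigint AS page,
           count(DISTINCT customer_id)    AS ids_on_page
    FROM   signup_items
    GROUP  BY 1
    ORDER  BY 1;

    -- Heap pages touched per customer ID
    SELECT customer_id,
           count(DISTINCT (ctid::text::point)[0]::bigint) AS pages_for_id
    FROM   signup_items
    GROUP  BY 1
    ORDER  BY 2 DESC;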

Can the query planner use information like this in its estimates, if one doesn't explicitly run CLUSTER? I'd assume that if pg_stats reports low correlation, the planner would guess that rows are uniformly distributed across pages, and might be pessimistic when costing various plans.
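
For reference, the statistics the planner does see for the column after ANALYZE can be read from pg_stats (table and column names are the hypothetical ones from the sketch above):

    SELECT attname, n_distinct, correlation
    FROM   pg_stats
    WHERE  tablename = 'signup_items'
    AND    attname   = 'customer_id';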

(In my real-world analogue, a plain non-clustered index made things nice and fast anyway, but I just got curious when I noticed the pattern in the data.)


Solution

The planner is unaware of this type of clustering and so can't make decisions based on it.

The two-step sampling method used by ANALYZE can generate skewed samples in this situation, possibly leading to a drastic underestimate of n_distinct. It is hard to predict what the consequences of that might be without digging into the details of individual queries.
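
If digging in does show that an underestimated n_distinct is hurting particular plans, one possible mitigation (sketched against the hypothetical table above, not something this answer prescribes) is to pin the statistic manually:

    -- A negative n_distinct means "this fraction of the row count"; with ~50 rows
    -- per customer in the sketch above, the true fraction is about 0.02.
    ALTER TABLE signup_items
        ALTER COLUMN customer_id SET (n_distinct = -0.02);

    ANALYZE signup_items;   -- the override takes effect at the next ANALYZE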
