Why does PostgreSQL scan a lot of rows?

https://dba.stackexchange.com/questions/139992

02-10-2020
|

Question

I have a small table of 135,000 rows.

Here is the schema:

id  integer NOT NULL    nextval('history_sum_id_seq'::regclass)
zone_id integer NOT NULL    
spot_id character varying(24)       NULL::character varying
customer_id integer     
broker_id   integer     
date    date    NOT NULL    
created_date    timestamp(0) without time zone  NOT NULL    
statistics_hits integer     0
statistics_real_hits    integer     0
statistics_paid_hits    integer     0
statistics_clicks   integer     0
earnings_eur    double precision        '0'::double precision
earnings_usd    double precision        '0'::double precision
earnings_rub    double precision        '0'::double precision
broker_name character varying(100)      NULL::character varying
customer_email  character varying(128)      NULL::character varying
customer_commission integer

This is my SQL:

SELECT
    "hs"."customer_id",
    "hs"."customer_email" AS "account",
    CASE WHEN max("monthlyAvg"."monthlyAvgValue") is not null THEN
        max("monthlyAvg"."monthlyAvgValue") ELSE 0::int END AS "monthlyAvg",
    CASE WHEN max("weeklyAvg"."weeklyAvgValue") is not null THEN
        max("weeklyAvg"."weeklyAvgValue") ELSE 0::int END AS "weeklyAvg"
FROM
    "history_sum" AS "hs"
        LEFT JOIN (
            SELECT
                "hs"."customer_id" AS "customer_id",
                round((SUM("hs"."statistics_real_hits")::numeric / 30::numeric), 2) AS "monthlyAvgValue"
            FROM
                "history_sum" AS "hs"
            WHERE
                "hs"."date" BETWEEN '2016-04-29' AND '2016-05-29'
            GROUP BY
                "hs"."customer_id",
                "hs"."date"
            ORDER BY
                "hs"."date" DESC
        ) AS "monthlyAvg" ON "monthlyAvg"."customer_id" = "hs"."customer_id"
        LEFT JOIN (
            SELECT
                "hs"."customer_id" AS "customer_id",
                round((SUM("hs"."statistics_real_hits")::numeric / 7::numeric), 2) AS "weeklyAvgValue"
            FROM
                "history_sum" AS "hs"
            WHERE
                "hs"."date" BETWEEN '2016-05-22' AND '2016-05-29'
            GROUP BY
                "hs"."customer_id",
                "hs"."date"
            ORDER BY
                "hs"."date" DESC
        ) AS "weeklyAvg" ON "weeklyAvg"."customer_id" = "hs"."customer_id"
WHERE
    "hs"."customer_email" is not null
GROUP BY
    "hs"."customer_id",
    "hs"."customer_email",
    "hs"."customer_commission"
ORDER BY
    account ASC
LIMIT 50

And this is the QUERY PLAN, which I couldn't understand:

QUERY PLAN
Limit  (cost=35325.56..35325.68 rows=50 width=92) (actual time=6548.611..6548.615 rows=47 loops=1)
  ->  Sort  (cost=35325.56..35341.42 rows=6347 width=92) (actual time=6548.609..6548.610 rows=47 loops=1)
        Sort Key: hs.customer_email
        Sort Method: quicksort  Memory: 28kB
        ->  HashAggregate  (cost=35051.24..35114.71 rows=6347 width=92) (actual time=6548.525..6548.546 rows=47 loops=1)
              Group Key: hs.customer_email, hs.customer_id, hs.customer_commission
              ->  Hash Left Join  (cost=7636.01..28512.35 rows=373651 width=92) (actual time=3.648..1710.014 rows=14053634 loops=1)
                    Hash Cond: (hs.customer_id = "monthlyAvg".customer_id)
                    ->  Hash Left Join  (cost=2833.16..10289.12 rows=135914 width=60) (actual time=1.035..104.126 rows=530354 loops=1)
                          Hash Cond: (hs.customer_id = "weeklyAvg".customer_id)
                          ->  Seq Scan on history_sum hs  (cost=0.00..5587.55 rows=135914 width=28) (actual time=0.003..35.919 rows=135925 loops=1)
                                Filter: (customer_email IS NOT NULL)
                                Rows Removed by Filter: 30
                          ->  Hash  (cost=2831.54..2831.54 rows=130 width=36) (actual time=1.024..1.024 rows=8 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                ->  Subquery Scan on "weeklyAvg"  (cost=2829.91..2831.54 rows=130 width=36) (actual time=1.020..1.021 rows=9 loops=1)
                                      ->  Sort  (cost=2829.91..2830.24 rows=130 width=12) (actual time=1.019..1.019 rows=9 loops=1)
                                            Sort Key: hs_1.date DESC
                                            Sort Method: quicksort  Memory: 25kB
                                            ->  HashAggregate  (cost=2823.07..2825.35 rows=130 width=12) (actual time=0.995..0.999 rows=9 loops=1)
                                                  Group Key: hs_1.date, hs_1.customer_id
                                                  ->  Bitmap Heap Scan on history_sum hs_1  (cost=41.66..2813.38 rows=1292 width=12) (actual time=0.186..0.678 rows=1484 loops=1)
                                                        Recheck Cond: ((date >= '2016-05-22'::date) AND (date <= '2016-05-29'::date))
                                                        Heap Blocks: exact=119
                                                        ->  Bitmap Index Scan on sum_date_customer_email  (cost=0.00..41.34 rows=1292 width=0) (actual time=0.169..0.169 rows=1484 loops=1)
                                                              Index Cond: ((date >= '2016-05-22'::date) AND (date <= '2016-05-29'::date))
                    ->  Hash  (cost=4795.97..4795.97 rows=550 width=36) (actual time=2.603..2.603 rows=36 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 10kB
                          ->  Subquery Scan on "monthlyAvg"  (cost=4789.10..4795.97 rows=550 width=36) (actual time=2.582..2.599 rows=39 loops=1)
                                ->  Sort  (cost=4789.10..4790.47 rows=550 width=12) (actual time=2.582..2.583 rows=39 loops=1)
                                      Sort Key: hs_2.date DESC
                                      Sort Method: quicksort  Memory: 26kB
                                      ->  HashAggregate  (cost=4754.44..4764.06 rows=550 width=12) (actual time=2.552..2.569 rows=39 loops=1)
                                            Group Key: hs_2.date, hs_2.customer_id
                                            ->  Bitmap Heap Scan on history_sum hs_2  (cost=176.71..4713.25 rows=5492 width=12) (actual time=0.404..1.467 rows=5783 loops=1)
                                                  Recheck Cond: ((date >= '2016-04-29'::date) AND (date <= '2016-05-29'::date))
                                                  Heap Blocks: exact=215
                                                  ->  Bitmap Index Scan on sum_date_customer_email  (cost=0.00..175.34 rows=5492 width=0) (actual time=0.378..0.378 rows=5783 loops=1)
                                                        Index Cond: ((date >= '2016-04-29'::date) AND (date <= '2016-05-29'::date))
Planning time: 0.643 ms
Execution time: 6548.802 ms
41 row(s)

Total runtime: 6,551.098 ms

As you can see, at some point Postgres increases the number of rows to the crazy rows=530354 and I don't know why it happens.

Subqueries, if I run them separately, are very fast, but when I combine them into one query, that rows explosion happens.

I need to add another 3 simple subqueries here, and after I do that, the number of rows for scan will expand to

Sort  (cost=4793.22..4794.60 rows=550 width=24) (actual time=2.881..127147.085 rows=3353783690 loops=1)

and as a result I will get

Planning time: 1.490 ms
Execution time: 1593005.255 ms

This is the FROM clause with the additional subqueries:

"history_sum" AS "hs"
        LEFT JOIN (
            SELECT
                "hs"."customer_id" AS "customer_id",
                round((SUM("hs"."statistics_real_hits")::numeric / 30::numeric), 2) AS "monthlyAvgValue"
            FROM
                "history_sum" AS "hs"
            WHERE
                "hs"."date" BETWEEN :minus30days AND :yesterday
            GROUP BY
                "hs"."customer_id",
                "hs"."date"
            ORDER BY
                "hs"."date" DESC
        ) AS "monthlyAvg" ON "monthlyAvg"."customer_id" = "hs"."customer_id"
        LEFT JOIN (
            SELECT
                "hs"."customer_id" AS "customer_id",
                round((SUM("hs"."statistics_real_hits")::numeric / 30::numeric), 2) AS "monthlyAvgValue"
            FROM
                "history_sum" AS "hs"
            WHERE
                "hs"."date" BETWEEN :minus60days AND :minus31days
            GROUP BY
                "hs"."customer_id",
                "hs"."date"
            ORDER BY
                "hs"."date" DESC
        ) AS "prevMonthlyAvg" ON "prevMonthlyAvg"."customer_id" = "hs"."customer_id"
        LEFT JOIN (
            SELECT
                "hs"."customer_id" AS "customer_id",
                round((SUM("hs"."statistics_real_hits")::numeric / 7::numeric), 2) AS "weeklyAvgValue"
            FROM
                "history_sum" AS "hs"
            WHERE
                "hs"."date" BETWEEN :minus7days AND :yesterday
            GROUP BY
                "hs"."customer_id",
                "hs"."date"
            ORDER BY
                "hs"."date" DESC
        ) AS "weeklyAvg" ON "weeklyAvg"."customer_id" = "hs"."customer_id"
        LEFT JOIN (
            SELECT
                "hs"."customer_id" AS "customer_id",
                round((SUM("hs"."statistics_real_hits")::numeric / 7::numeric), 2) AS "weeklyAvgValue"
            FROM
                "history_sum" AS "hs"
            WHERE
                "hs"."date" BETWEEN :minus15days AND :minus7days
            GROUP BY
                "hs"."customer_id",
                "hs"."date"
            ORDER BY
                "hs"."date" DESC
        ) AS "prevWeeklyAvg" ON "prevWeeklyAvg"."customer_id" = "hs"."customer_id"
        LEFT JOIN (
            SELECT
                "hs"."customer_id" AS "customer_id",
                SUM("hs"."statistics_real_hits")::int AS "yesterdayHitsValue"
            FROM
                "history_sum" AS "hs"
            WHERE
                "hs"."date" = :yesterday
            GROUP BY
                "hs"."customer_id"
            ORDER BY
                "hs"."customer_id" ASC
        ) AS "yesterdayHits" ON "yesterdayHits"."customer_id" = "hs"."customer_id"

Why does it happen? What am I doing wrong?

The PostgreSQL version is 9.5.

Solution

As far as I can tell, you only need a single derived table together with a conditional aggregation:

SELECT
    hs.customer_id,
    hs.customer_email AS account,
    coalesce(max(x.monthlyAvgValue),0) AS monthlyAvg,
    coalesce(max(x.prevMonthlyAvg),0) AS prevMonthlyAvg,
    coalesce(max(x.weeklyAvgValue),0) AS weeklyAvg
FROM history_sum AS hs
  LEFT JOIN (
    SELECT
        hs2.customer_id AS customer_id,
        round((SUM(hs2.statistics_real_hits)::numeric / 30::numeric), 2) filter (where hs2.date BETWEEN '2016-04-29' AND '2016-05-29') AS monthlyAvgValue,
        round((SUM(hs2.statistics_real_hits)::numeric / 30::numeric), 2) filter (where hs2.date BETWEEN :minus60days AND :minus31days) AS prevMonthlyAvg,
        round((SUM(hs2.statistics_real_hits)::numeric /  7::numeric), 2) filter (where hs2.date BETWEEN '2016-05-22' AND '2016-05-29') AS weeklyAvgValue, 
    FROM history_sum AS hs2
    GROUP BY hs2.customer_id, hs2.date
  ) x ON x.customer_id = hs.customer_id
WHERE
    hs.customer_email is not null
GROUP BY
    hs.customer_id,
    hs.customer_email,
    hs.customer_commission
ORDER BY
    account ASC;

You can add more aggregations without the need to add new derived tables.

Using an order by in a sub-select or derived table is useless and can be removed (not sure if Postgres optimizes that away).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange