Postgresql expanding view multiple times instead of executing it once

https://dba.stackexchange.com/questions/254392

19-02-2021
|

Question

I have a rahter long subquery which I store as a view

CREATE VIEW scans_minmax AS
SELECT tmp.source,
       value ->> 'Ok'                                                    AS "Ok",
       value ->> 'NotOk'                                                 AS "NotOk",
       value ->> 'TotalCount'                                            AS "TotalCount",
       (CASE WHEN scans.timestamp = tmp.min_t THEN 'MIN' ELSE 'MAX' END) AS "Aggregation"
FROM (SELECT source, MIN(timestamp) AS min_t, MAX(timestamp) AS max_t
      FROM scans
      WHERE timestamp > NOW() - INTERVAL '30 days'
        AND value ->> 'Ok' IS NOT NULL
        AND value ->> 'NotOk' IS NOT NULL
        AND value ->> 'TotalCount' IS NOT NULL
      GROUP BY source) AS tmp,
     scans
WHERE scans.source = tmp.source
  AND (scans.timestamp = tmp.min_t OR scans.timestamp = tmp.max_t)

This gives me a nice table as result

| source | Ok | NotOk | TotalCount | Aggregation |
--------------------------------------------------
| A      | a  | b     | c          | MIN         |
| A      | x  | y     | z          | MAX         |

But as I wanted the table also "transposed" (MIN / MAX as columns next to each other for each value) I created the query

SELECT min_tbl.source,
       min_tbl."Ok" AS "min_ok", max_tbl."Ok" AS "max_ok",
       min_tbl."NotOk" AS "min_not_ok", max_tbl."NotOk" AS "max_not_ok",
       min_tbl."TotalCount" AS "min_total_count", max_tbl."TotalCount" AS "max_total_count"
FROM scans_minmax AS min_tbl JOIN
     scans_minmax AS max_tbl ON min_tbl.source = max_tbl.source
WHERE min_tbl."Aggregation" = 'MIN' AND max_tbl."Aggregation" = 'MAX'

which does exactly the right thing.

Now, my issue is the following: The View above has 16 entries (there are 8 different entries for source in my DB). So the second query should take roughly the same time as a simple SELECT * FROM scans_minmax. But it turns out that it takes more time, as the optimizer does not perform the expaneded view query once and simply manipulate this temporary table but expands the view two times and thus does the expensive aggregation two times.

EXPLAIN ANALYZE on the Query above gives

Nested Loop  (cost=173527.09..173543.19 rows=1 width=231) (actual time=2594.697..2594.852 rows=8 loops=1)
  ->  Merge Join  (cost=173518.21..173518.28 rows=1 width=617) (actual time=2594.580..2594.610 rows=8 loops=1)
        Merge Cond: ((scans_1.source)::text = (scans_3.source)::text)
        ->  Sort  (cost=86883.50..86883.50 rows=1 width=562) (actual time=1292.558..1292.561 rows=8 loops=1)
              Sort Key: scans_2.source
              Sort Method: quicksort  Memory: 31kB
              ->  Nested Loop  (cost=86643.23..86883.49 rows=1 width=562) (actual time=1292.406..1292.519 rows=8 loops=1)
                    ->  HashAggregate  (cost=86634.35..86634.45 rows=10 width=55) (actual time=1292.281..1292.285 rows=8 loops=1)
                          Group Key: scans_2.source
                          ->  Index Scan using scans_timestamp_index on scans scans_2  (cost=0.43..83024.66 rows=481292 width=47) (actual time=0.174..1036.297 rows=545244 loops=1)
                                Index Cond: ("timestamp" > (now() - '30 days'::interval))
                                Filter: (((value ->> 'Ok'::text) IS NOT NULL) AND ((value ->> 'NotOk'::text) IS NOT NULL) AND ((value ->> 'TotalCount'::text) IS NOT NULL))
                                Rows Removed by Filter: 98
                    ->  Bitmap Heap Scan on scans scans_1  (cost=8.88..24.88 rows=1 width=531) (actual time=0.025..0.026 rows=1 loops=8)
                          Recheck Cond: (("timestamp" = (min(scans_2."timestamp"))) OR ("timestamp" = (max(scans_2."timestamp"))))
                          Filter: (((scans_2.source)::text = (source)::text) AND (CASE WHEN ("timestamp" = (min(scans_2."timestamp"))) THEN 'MIN'::text ELSE 'MAX'::text END = 'MAX'::text))
                          Rows Removed by Filter: 1
                          Heap Blocks: exact=16
                          ->  BitmapOr  (cost=8.88..8.88 rows=4 width=0) (actual time=0.018..0.018 rows=0 loops=8)
                                ->  Bitmap Index Scan on asdf  (cost=0.00..4.44 rows=2 width=0) (actual time=0.010..0.010 rows=1 loops=8)
                                      Index Cond: ("timestamp" = (min(scans_2."timestamp")))
                                ->  Bitmap Index Scan on asdf  (cost=0.00..4.44 rows=2 width=0) (actual time=0.006..0.006 rows=1 loops=8)
                                      Index Cond: ("timestamp" = (max(scans_2."timestamp")))
        ->  Sort  (cost=86634.71..86634.74 rows=10 width=55) (actual time=1302.016..1302.016 rows=8 loops=1)
              Sort Key: scans_3.source
              Sort Method: quicksort  Memory: 26kB
              ->  HashAggregate  (cost=86634.35..86634.45 rows=10 width=55) (actual time=1301.967..1301.969 rows=8 loops=1)
                    Group Key: scans_3.source
                    ->  Index Scan using scans_timestamp_index on scans scans_3  (cost=0.43..83024.66 rows=481292 width=47) (actual time=0.080..1042.757 rows=545244 loops=1)
                          Index Cond: ("timestamp" > (now() - '30 days'::interval))
                          Filter: (((value ->> 'Ok'::text) IS NOT NULL) AND ((value ->> 'NotOk'::text) IS NOT NULL) AND ((value ->> 'TotalCount'::text) IS NOT NULL))
                          Rows Removed by Filter: 98
  ->  Bitmap Heap Scan on scans  (cost=8.88..24.88 rows=1 width=531) (actual time=0.023..0.024 rows=1 loops=8)
        Recheck Cond: (("timestamp" = (min(scans_3."timestamp"))) OR ("timestamp" = (max(scans_3."timestamp"))))
        Filter: (((scans_3.source)::text = (source)::text) AND (CASE WHEN ("timestamp" = (min(scans_3."timestamp"))) THEN 'MIN'::text ELSE 'MAX'::text END = 'MIN'::text))
        Rows Removed by Filter: 1
        Heap Blocks: exact=16
        ->  BitmapOr  (cost=8.88..8.88 rows=4 width=0) (actual time=0.016..0.016 rows=0 loops=8)
              ->  Bitmap Index Scan on asdf  (cost=0.00..4.44 rows=2 width=0) (actual time=0.009..0.009 rows=1 loops=8)
                    Index Cond: ("timestamp" = (min(scans_3."timestamp")))
              ->  Bitmap Index Scan on asdf  (cost=0.00..4.44 rows=2 width=0) (actual time=0.006..0.006 rows=1 loops=8)
                    Index Cond: ("timestamp" = (max(scans_3."timestamp")))
Planning time: 2.253 ms
Execution time: 2594.994 ms

Why is that so and are there ways to avoid this behavior? All I want the optimizer to do is to evaluate the VIEW as temporary table and do the join in my second query on this temporary table.

I am using PostgreSQL 9.6.15.

Thanks already!

Solution

This is expected behaviour: the view source is simply substituted for its name by the query rewrite, then analysed and executed as usual.

To tell the query compiler that you want to process the view only once you'd use a common table expression (CTE):

WITH scans_minmax (source, "Ok", "NotOk", "TotalCount", "Aggregation") AS (
SELECT tmp.source,
       value ->> 'Ok',
       value ->> 'NotOk',
       value ->> 'TotalCount',
       (CASE WHEN scans.timestamp = tmp.min_t THEN 'MIN' ELSE 'MAX' END)
FROM (SELECT source, MIN(timestamp) AS min_t, MAX(timestamp) AS max_t
      FROM scans
      WHERE timestamp > NOW() - INTERVAL '30 days'
        AND value ->> 'Ok' IS NOT NULL
        AND value ->> 'NotOk' IS NOT NULL
        AND value ->> 'TotalCount' IS NOT NULL
      GROUP BY source) AS tmp,
     scans
WHERE scans.source = tmp.source
  AND (scans.timestamp = tmp.min_t OR scans.timestamp = tmp.max_t)
)
SELECT min_tbl.source,
       min_tbl."Ok" AS "min_ok", max_tbl."Ok" AS "max_ok",
       min_tbl."NotOk" AS "min_not_ok", max_tbl."NotOk" AS "max_not_ok",
       min_tbl."TotalCount" AS "min_total_count", max_tbl."TotalCount" AS "max_total_count"
FROM scans_minmax AS min_tbl JOIN
     scans_minmax AS max_tbl ON min_tbl.source = max_tbl.source
WHERE min_tbl."Aggregation" = 'MIN' AND max_tbl."Aggregation" = 'MAX'

Postgres will materialise results of the CTE (prior to Postgres 12 anyway), causing your aggregation to execute only once.

Note that in general you need to exercise caution when using CTEs:

This will not necessarily make your overall query run faster, as CTE materialisation might require additional server resources -- probably not in your case though.
CTE is an optimisation fence, meaning that predicates from the outer select will not be pushed down into the CTE, which in some cases will cause suboptimal execution plans -- again, not in your case.

OTHER TIPS

Not sure whether it answers your question, but you don't need to reference scans_minmax twice to get the result you want:

SELECT source
     , case when "Aggregation" = 'MIN' then ok end as min_ok
     , ...
     , case when "Aggregation" = 'MAX' then ok end as max_ok
     , ...
FROM scans_minmax

Now you have a table with 8 columns, 3 in each row is null. You can now apply - say - MAX to get rid of the nulls:

SELECT source, MAX(min_ok), ..., MAX(max_ok), ...
FROM (
     -- query above
) AS T
GROUP BY source

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange