Identify proper index for group and order by date and hour (2 columns)
25-01-2021
Question
PostgreSQL 10.4
I have a table that already has an index on the date
column. The current query plan does a Bitmap Heap Scan using that index. I would like to add a new index tailored to this query (no parameters are injected into it). I started with a partial index on the status
column, but I don't know how to handle the GROUP BY and ORDER BY part.
select date, hour, sum(installs) as installs, sum(clicks) as clicks
from ho_aggregated_stats
where date > (current_date - interval '2 day')
and (status='approved' or status is null)
group by date, hour
order by date, hour;
explain https://explain.depesz.com/s/rnCW
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=992433.95..992442.13 rows=1488 width=24) (actual time=3903.296..3903.337 rows=43 loops=1)
Group Key: date, hour
Buffers: shared hit=85314 read=11496
I/O Timings: read=2896.216
-> Sort (cost=992433.95..992434.69 rows=1488 width=24) (actual time=3903.290..3903.298 rows=86 loops=1)
Sort Key: date, hour
Sort Method: quicksort Memory: 31kB
Buffers: shared hit=85314 read=11496
I/O Timings: read=2896.216
-> Gather (cost=992265.00..992418.27 rows=1488 width=24) (actual time=3903.167..3903.233 rows=86 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=85314 read=11496
I/O Timings: read=2896.216
-> Partial HashAggregate (cost=991265.00..991269.47 rows=1488 width=24) (actual time=3899.779..3899.808 rows=43 loops=2)
Group Key: date, hour
Buffers: shared hit=149987 read=16557
I/O Timings: read=4694.060
-> Parallel Bitmap Heap Scan on ho_aggregated_stats (cost=21995.80..990158.35 rows=553327 width=16) (actual time=1232.325..3623.710 rows=592709 loops=2)
Recheck Cond: (date > (CURRENT_DATE - '2 days'::interval))
Filter: (((status)::text = 'approved'::text) OR (status IS NULL))
Rows Removed by Filter: 3946
Heap Blocks: exact=91807
Buffers: shared hit=149987 read=16557
I/O Timings: read=4694.060
-> Bitmap Index Scan on index_ho_aggregated_stats_on_date (cost=0.00..21948.76 rows=1160433 width=0) (actual time=1194.685..1194.685 rows=1339010 loops=1)
Index Cond: (date > (CURRENT_DATE - '2 days'::interval))
Buffers: shared read=5003
I/O Timings: read=1082.452
Planning time: 0.611 ms
Execution time: 3948.178 ms
Table schema:
CREATE TABLE public.stats (
id bigserial NOT NULL,
"date" date NOT NULL,
"hour" int4 NOT NULL,
status varchar NULL,
installs int4 NULL DEFAULT 0,
clicks int4 NULL DEFAULT 0,
CONSTRAINT stats_pkey PRIMARY KEY (id)
)
CREATE INDEX index_stats_on_date ON public.stats USING btree (date);
Estimated row count: ~40M
Update: I checked the distribution of the status column: 75% is NULL, 20% approved, 5% rejected, so I'm thinking an index on status is not necessary.
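(For reference, that distribution can be checked with a simple aggregate. A sketch; it scans the whole ~40M-row table, so consider TABLESAMPLE if that is too slow:)

```sql
-- Share of each status value; NULLs form their own group.
SELECT status
     , count(*) AS rows
     , round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM   stats
GROUP  BY status
ORDER  BY rows DESC;
```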
Solution
You provided crucial information in a later comment:
It is a very simple process that will alert when there was no data for one hour of the day or that the clicks or installs were 0
This allows for radically different, probably much faster queries:
SELECT ts AS no_installs
FROM generate_series(date_trunc('day', localtimestamp - interval '2 day')
, localtimestamp
, interval '1 hour') ts
WHERE NOT EXISTS (
SELECT
FROM stats
WHERE date = ts::date
AND hour = extract(hour FROM ts)::int -- 0 to 23
AND (status = 'approved' or status is null)
AND installs > 0
);
And:
SELECT ts AS no_clicks
FROM ... -- like above
...
AND clicks > 0
);
As opposed to your original, this will also detect hours where there are no rows at all, which should be the most worrying case to begin with (I guess).
About generate_series():
Aside: there is a corner-case bug lurking in your query: the current date depends on the time zone setting of your current session. So the query might give different (possibly misleading) results depending on where you run it. It's generally best to operate with timestamptz
to avoid any such complications. You might replace (date, hour)
with a single timestamptz
column. Same size. It's cheap to derive date / hour etc. from it.
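A sketch of that replacement (the column name ts is illustrative; your upserts would have to populate it instead). Note the implicit cast from timestamp to timestamptz uses the session's time zone setting:

```sql
-- One timestamptz replaces (date, hour): same 8 bytes, no time-zone ambiguity.
ALTER TABLE stats ADD COLUMN ts timestamptz;
UPDATE stats SET ts = date + make_interval(hours => hour);  -- backfill

-- Deriving the old columns back is cheap:
SELECT ts::date                   AS date
     , extract(hour FROM ts)::int AS hour
FROM   stats;
```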
Index
This should be massively faster even with a plain index on (date, hour)
, thus keeping the cost for the index itself low. The anti-join resulting from NOT EXISTS
can scan the index, discard each hour as soon as the first matching row is found, and look no further. There is no need to aggregate all qualifying rows like your original query did.
I suggest replacing your existing index index_ho_aggregated_stats_on_date
on just (date)
with one on (date, hour)
, or (date DESC, hour DESC)
; the direction hardly matters for this case. It's exactly the same size as your old index, since date
+ integer
occupy 8 bytes together. And it practically does everything your old index did, plus more.
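The swap could look like this (the new index name is made up; CONCURRENTLY avoids blocking writes, assuming the statements are not run inside a transaction block):

```sql
-- Build the replacement first, then drop the old one.
CREATE INDEX CONCURRENTLY index_ho_aggregated_stats_on_date_hour
    ON ho_aggregated_stats (date, hour);

DROP INDEX CONCURRENTLY index_ho_aggregated_stats_on_date;
```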
If and only if a majority of rows failed the additional condition WHERE (status = 'approved' or status is null)
, it might make sense to add a partial index with that condition. Otherwise it's cheaper not to create another index and let Postgres add a Filter
step to the index scan.
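If that minority case did apply, the partial index would be a sketch like (index name illustrative):

```sql
-- Only pays off if most rows FAIL the condition; per your update,
-- ~95 % pass it, so the plain (date, hour) index is the better choice here.
CREATE INDEX ho_aggregated_stats_date_hour_part
    ON ho_aggregated_stats (date, hour)
    WHERE status = 'approved' OR status IS NULL;
```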
If your table is huge (around 0.5M rows per day?), it might make sense to cut off the bulk of old data from the index.
Or consider a BRIN index if table rows are mostly sorted by (date, hour)
physically. (Probably not worth it if you have a B-tree index on (date, hour)
anyway.)
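A BRIN sketch, in case physical row order does correlate with (date, hour):

```sql
-- BRIN stores one summary per block range: a tiny index, but it only
-- helps when rows with similar dates sit in adjacent heap blocks.
CREATE INDEX ho_aggregated_stats_date_brin
    ON ho_aggregated_stats USING brin (date, hour);
```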
Finally, since "Data is being written to the table every 5mins via multiple UPSERT queries", consider a manual VACUUM ANALYZE
(or just VACUUM
) right after that to allow index-only scans in combination with a partial, multicolumn index covering all involved columns. You'll have to weigh cost and gain of this.
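A sketch of that combination (index name illustrative): the partial multicolumn index covers every column the query touches, and the VACUUM after each upsert batch keeps the visibility map current so index-only scans stay possible:

```sql
-- Covers date, hour, installs, clicks for the qualifying rows only.
CREATE INDEX ho_aggregated_stats_covering
    ON ho_aggregated_stats (date, hour, installs, clicks)
    WHERE status = 'approved' OR status IS NULL;

-- After each 5-minute upsert batch:
VACUUM ANALYZE ho_aggregated_stats;
```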
OTHER TIPS
This is likely to be your best index for that query:
(date, hour, installs) WHERE status::text = 'approved'::text OR status IS NULL
Starting with "date" and "hour" allows it to get the ordering from the index and so avoid the sort. Including "installs" enables an index-only scan, which is important to avoid jumping all around the table to fetch that value (especially if the physical order of the rows in the table is not correlated with the "date" value). The table will need to be well vacuumed for index-only scans to work well.
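Spelled out as DDL (a sketch; the index name is made up):

```sql
-- Partial covering index: ordering from (date, hour), installs included
-- so the aggregate can be answered from the index alone.
CREATE INDEX stats_date_hour_installs_approved
    ON stats (date, hour, installs)
    WHERE status = 'approved' OR status IS NULL;
```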
There is no way to check without your actual data, but the first thing I would try is an index on (status, date, hour, installs)
.
This should allow an equality filter on status combined with a range on date, followed by an ordered scan on date and hour. The installs column is there to cover the query.
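As DDL (a sketch; the name is illustrative):

```sql
-- Equality on status first, then the date range; hour and installs
-- complete the index so the query is covered.
CREATE INDEX stats_status_date_hour_installs
    ON stats (status, date, hour, installs);
```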
If the optimizer isn't smart enough to double-dip the status filter for the NULL, try removing it from the original query and using a UNION ALL, like:
(Corrected as per @jjanes's comment.)
WITH UnionedResults
AS
(
select date, hour, installs
from stats
where date > (current_date - interval '2 day')
and (status='approved')
UNION ALL
select date, hour, installs
from stats
where date > (current_date - interval '2 day')
and (status is null)
)
SELECT date, hour, SUM(installs) AS installs
FROM UnionedResults
GROUP BY date, hour
ORDER BY date, hour;
HTH