Identify proper index for group and order by date and hour (2 columns)
25-01-2021
Question
PostgreSQL 10.4
I have a table that already has an index on the date
column. The current query plan does a Bitmap Heap Scan using that index. I would like to add a new index tailored to this query (no parameters are injected into it). I started with a partial index on the status
column, but I don't know how to handle the GROUP BY and ORDER BY part.
select date, hour, sum(installs) as installs, sum(clicks) as clicks
from ho_aggregated_stats
where date > (current_date - interval '2 day')
and (status='approved' or status is null)
group by date, hour
order by date, hour;
explain https://explain.depesz.com/s/rnCW
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=992433.95..992442.13 rows=1488 width=24) (actual time=3903.296..3903.337 rows=43 loops=1)
Group Key: date, hour
Buffers: shared hit=85314 read=11496
I/O Timings: read=2896.216
-> Sort (cost=992433.95..992434.69 rows=1488 width=24) (actual time=3903.290..3903.298 rows=86 loops=1)
Sort Key: date, hour
Sort Method: quicksort Memory: 31kB
Buffers: shared hit=85314 read=11496
I/O Timings: read=2896.216
-> Gather (cost=992265.00..992418.27 rows=1488 width=24) (actual time=3903.167..3903.233 rows=86 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=85314 read=11496
I/O Timings: read=2896.216
-> Partial HashAggregate (cost=991265.00..991269.47 rows=1488 width=24) (actual time=3899.779..3899.808 rows=43 loops=2)
Group Key: date, hour
Buffers: shared hit=149987 read=16557
I/O Timings: read=4694.060
-> Parallel Bitmap Heap Scan on ho_aggregated_stats (cost=21995.80..990158.35 rows=553327 width=16) (actual time=1232.325..3623.710 rows=592709 loops=2)
Recheck Cond: (date > (CURRENT_DATE - '2 days'::interval))
Filter: (((status)::text = 'approved'::text) OR (status IS NULL))
Rows Removed by Filter: 3946
Heap Blocks: exact=91807
Buffers: shared hit=149987 read=16557
I/O Timings: read=4694.060
-> Bitmap Index Scan on index_ho_aggregated_stats_on_date (cost=0.00..21948.76 rows=1160433 width=0) (actual time=1194.685..1194.685 rows=1339010 loops=1)
Index Cond: (date > (CURRENT_DATE - '2 days'::interval))
Buffers: shared read=5003
I/O Timings: read=1082.452
Planning time: 0.611 ms
Execution time: 3948.178 ms
Table schema:
CREATE TABLE public.stats (
id bigserial NOT NULL,
"date" date NOT NULL,
"hour" int4 NOT NULL,
status varchar NULL,
installs int4 NULL DEFAULT 0,
clicks int4 NULL DEFAULT 0,
CONSTRAINT stats_pkey PRIMARY KEY (id)
)
CREATE INDEX index_stats_on_date ON public.stats USING btree (date);
Estimated row count: ~40M
Update: I checked the distribution of the status column: 75% is NULL, 20% approved, 5% rejected, so I'm thinking an index on status is not necessary.
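(For reference, that distribution can be checked with a simple aggregate. A sketch; it scans the whole ~40M-row table, so consider TABLESAMPLE if that is too slow:)

```sql
-- Share of each status value; NULLs form their own group.
SELECT status
     , count(*) AS rows
     , round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM   stats
GROUP  BY status
ORDER  BY rows DESC;
```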
Solution
You provided crucial information in a later comment:
It is a very simple process that will alert when there was no data for one hour of the day or that the clicks or installs were 0
This allows for radically different, probably much faster queries:
SELECT ts AS no_installs
FROM generate_series(date_trunc('day', localtimestamp - interval '2 day')
, localtimestamp
, interval '1 hour') ts
WHERE NOT EXISTS (
SELECT
FROM stats
WHERE date = ts::date
AND hour = extract(hour FROM ts)::int -- 0 to 23
AND (status = 'approved' or status is null)
AND installs > 0
);
And:
SELECT ts AS no_clicks
FROM ... -- like above
...
AND clicks > 0
);
As opposed to your original, this will also detect hours where there are no rows at all, which should be the most worrying case to begin with (I guess).
About generate_series():
Aside: there is a corner-case bug lurking in your query: the current date depends on the time zone setting of your current session. So the query might give different (possibly misleading) results depending on where you run it. It's generally best to operate with timestamptz
to avoid any such complications. You might replace (date, hour)
with a single timestamptz
column. Same size. It's cheap to derive date / hour etc. from it.
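A sketch of that replacement (the column name ts is illustrative; your upserts would have to populate it instead). Note the implicit cast from timestamp to timestamptz uses the session's time zone setting:

```sql
-- One timestamptz replaces (date, hour): same 8 bytes, no time-zone ambiguity.
ALTER TABLE stats ADD COLUMN ts timestamptz;
UPDATE stats SET ts = date + make_interval(hours => hour);  -- backfill

-- Deriving the old columns back is cheap:
SELECT ts::date                   AS date
     , extract(hour FROM ts)::int AS hour
FROM   stats;
```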
Index
This should be massively faster even with a plain index on (date, hour)
, thus keeping the cost for the index itself low. The anti-join resulting from NOT EXISTS
can scan the index, discard each hour as soon as the first matching row is found, and look no further. There is no need to aggregate all qualifying rows like your original query did.
I suggest replacing your existing index index_ho_aggregated_stats_on_date
on just (date)
with one on (date, hour)
, or (date DESC, hour DESC)
; the direction hardly matters for this case. It's exactly the same size as your old index, since date
+ integer
occupy 8 bytes together. And it practically does everything your old index did, plus more.
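The swap could look like this (the new index name is made up; CONCURRENTLY avoids blocking writes, assuming the statements are not run inside a transaction block):

```sql
-- Build the replacement first, then drop the old one.
CREATE INDEX CONCURRENTLY index_ho_aggregated_stats_on_date_hour
    ON ho_aggregated_stats (date, hour);

DROP INDEX CONCURRENTLY index_ho_aggregated_stats_on_date;
```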
If and only if a majority of rows failed the additional condition WHERE (status = 'approved' or status is null)
, it might make sense to add a partial index with that condition. Otherwise it's cheaper not to create another index and let Postgres add a Filter
step to the index scan.
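If that minority case did apply, the partial index would be a sketch like (index name illustrative):

```sql
-- Only pays off if most rows FAIL the condition; per your update,
-- ~95 % pass it, so the plain (date, hour) index is the better choice here.
CREATE INDEX ho_aggregated_stats_date_hour_part
    ON ho_aggregated_stats (date, hour)
    WHERE status = 'approved' OR status IS NULL;
```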
If your table is huge (around 0.5M rows per day?), it might make sense to cut off the bulk of old data from the index.
Or consider a BRIN index if table rows are mostly sorted by (date, hour)
physically. (Probably not worth it if you have a B-tree index on (date, hour)
anyway.)
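A BRIN sketch, in case physical row order does correlate with (date, hour):

```sql
-- BRIN stores one summary per block range: a tiny index, but it only
-- helps when rows with similar dates sit in adjacent heap blocks.
CREATE INDEX ho_aggregated_stats_date_brin
    ON ho_aggregated_stats USING brin (date, hour);
```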
Finally, since "Data is being written to the table every 5mins via multiple UPSERT queries", consider a manual VACUUM ANALYZE
(or just VACUUM
) right after that to allow index-only scans in combination with a partial, multicolumn index covering all involved columns. You'll have to weigh cost and gain of this.
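A sketch of that combination (index name illustrative): the partial multicolumn index covers every column the query touches, and the VACUUM after each upsert batch keeps the visibility map current so index-only scans stay possible:

```sql
-- Covers date, hour, installs, clicks for the qualifying rows only.
CREATE INDEX ho_aggregated_stats_covering
    ON ho_aggregated_stats (date, hour, installs, clicks)
    WHERE status = 'approved' OR status IS NULL;

-- After each 5-minute upsert batch:
VACUUM ANALYZE ho_aggregated_stats;
```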
OTHER TIPS
This is likely to be your best index for that query:
(date, hour, installs) WHERE status::text = 'approved'::text OR status IS NULL
Starting with "date" and "hour" allows it to get the ordering from the index and so avoid the sort. Including "installs" enables an index-only scan, which is important to avoid jumping all around the table to fetch that value (especially if the physical order of the rows in the table is not correlated with the "date" value). The table will need to be well vacuumed for index-only scans to work well.
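Spelled out as DDL (a sketch; the index name is made up):

```sql
-- Partial covering index: ordering from (date, hour), installs included
-- so the aggregate can be answered from the index alone.
CREATE INDEX stats_date_hour_installs_approved
    ON stats (date, hour, installs)
    WHERE status = 'approved' OR status IS NULL;
```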
There is no way to check without your actual data, but the first thing I would try is an index on (status, date, hour, installs)
.
This should allow an equality filter on status combined with a range on date, followed by an ordered scan on date and hour. The installs column is there to cover the query.
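As DDL (a sketch; the name is illustrative):

```sql
-- Equality on status first, then the date range; hour and installs
-- complete the index so the query is covered.
CREATE INDEX stats_status_date_hour_installs
    ON stats (status, date, hour, installs);
```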
If the optimizer isn't smart enough to double-dip the status filter for the NULL, try removing it from the original query and using a UNION ALL, like:
(Corrected as per @jjanes's comment.)
WITH UnionedResults
AS
(
select date, hour, installs
from stats
where date > (current_date - interval '2 day')
and (status='approved')
UNION ALL
select date, hour, installs
from stats
where date > (current_date - interval '2 day')
and (status is null)
)
SELECT date, hour, SUM(installs) AS installs
FROM UnionedResults
GROUP BY date, hour
ORDER BY date, hour;
HTH