performances of distinct/group by
-
10-10-2020 - |
Question
I'm trying to select the most recent authors of messages in a room, using this (simplified) table:
Table "public.message"
Column | Type | Nullable | Default
---------+---------+----------+-------------------------------------
id | bigint | not null | nextval('message_id_seq'::regclass)
room | integer | not null |
author | integer | not null |
created | integer | not null |
Indexes:
"message_pkey" PRIMARY KEY, btree (id)
"message_author_created_room" btree (author, created, room)
"message_room_author_created" btree (room, author, created)
"message_room_created" btree (room, created)
"message_room_id" btree (room, id)
The problem is a query like this is slow:
select message.author as id, max(message.created) as mc from message
where room=12 group by message.author order by mc desc limit 50;
And here's the explain(analyze, verbose, buffers)
:
miaou=> explain (analyze, verbose, buffers) select message.author as id, max(message.created) as mc from message
where room=12 group by message.author order by mc desc limit 50;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=10627.14..10627.26 rows=50 width=8) (actual time=54.887..54.901 rows=50 loops=1)
Output: author, (max(created))
Buffers: shared hit=490
-> Sort (cost=10627.14..10629.19 rows=820 width=8) (actual time=54.885..54.891 rows=50 loops=1)
Output: author, (max(created))
Sort Key: (max(message.created)) DESC
Sort Method: top-N heapsort Memory: 29kB
Buffers: shared hit=490
-> Finalize GroupAggregate (cost=1000.46..10599.90 rows=820 width=8) (actual time=14.019..54.788 rows=160 loops=1)
Output: author, max(created)
Group Key: message.author
Buffers: shared hit=490
-> Gather Merge (cost=1000.46..10583.50 rows=1640 width=8) (actual time=14.007..54.636 rows=248 loops=1)
Output: author, (PARTIAL max(created))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=490
-> Partial GroupAggregate (cost=0.43..9394.18 rows=820 width=8) (actual time=3.439..34.733 rows=83 loops=3)
Output: author, PARTIAL max(created)
Group Key: message.author
Buffers: shared hit=2989
Worker 0: actual time=0.297..49.593 rows=116 loops=1
Buffers: shared hit=1550
Worker 1: actual time=6.624..40.612 rows=60 loops=1
Buffers: shared hit=949
-> Parallel Index Only Scan using message_room_author_created on public.message (cost=0.43..8904.09 rows=96377 width=8) (actual time=0.030..20.067 rows=73907 loops=3)
Output: author, created
Index Cond: (message.room = 12)
Heap Fetches: 139
Buffers: shared hit=2989
Worker 0: actual time=0.035..28.355 rows=109834 loops=1
Buffers: shared hit=1550
Worker 1: actual time=0.030..23.723 rows=79112 loops=1
Buffers: shared hit=949
Planning time: 0.211 ms
Execution time: 57.071 ms
I'd like to know how to make that faster, the real important goal being to get the N most recent authors. Is there a faster way to query that information?
Solution
select count(author) from message where room=12;
Because author can not be null just do SELECT count(*)
. Doing so may result in this being an index-only scan on either message_room_created
or message_room_id
. If you need the unique authors, which you don't have you'll have to use count( DISTINCT author )
.
Also this being planned as a parallel query. That's relatively new functionality. You may want to try SET max_parallel_workers_per_gather = 0
to disable parallel query functionality and reposting the results.
Either way
this one is slow (same duration, about 45ms)
If you've got millions rows an index scan and heap fetches, it's not really fair to call 45ms "slow."