increase speed of group by query on table in postgres
04-12-2019
Question
I have a join table with the following structure:
CREATE TABLE adjectives_friends
(
    adjective_id integer,
    friend_id integer
)
WITH (
    OIDS = FALSE
);

ALTER TABLE adjectives_friends
    OWNER TO rails;

CREATE UNIQUE INDEX index_adjectives_friends_on_adjective_id_and_friend_id
    ON adjectives_friends
    USING btree
    (adjective_id, friend_id);

CREATE UNIQUE INDEX index_adjectives_friends_on_friend_id_and_adjective_id
    ON adjectives_friends
    USING btree
    (friend_id, adjective_id);
ALTER TABLE adjectives_friends CLUSTER ON index_adjectives_friends_on_friend_id_and_adjective_id;
This table contains around 50 million rows.
The adjectives table is a lookup table of ~150 entries. What I would like to do is find the friend that most closely matches a list of adjectives. Assume that the maximum number of adjectives a friend has is 10. So, I tried this query:
SELECT count(friend_id) count, friend_id
FROM adjectives_friends
where adjective_id in (1,2,3,4,5,6,7,8,9,10)
group by friend_id
order by count desc
limit 100
This takes around 10 seconds on my dev machine, with this query plan:
Limit  (cost=831652.00..831652.25 rows=100 width=4)
  ->  Sort  (cost=831652.00..831888.59 rows=94634 width=4)
        Sort Key: (count(friend_id))
        ->  GroupAggregate  (cost=804185.31..828035.16 rows=94634 width=4)
              ->  Sort  (cost=804185.31..811819.81 rows=3053801 width=4)
                    Sort Key: friend_id
                    ->  Bitmap Heap Scan on adjectives_friends  (cost=85958.72..350003.24 rows=3053801 width=4)
                          Recheck Cond: (adjective_id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
                          ->  Bitmap Index Scan on index_adjectives_friends_on_adjective_id_and_friend_id  (cost=0.00..85195.26 rows=3053801 width=0)
                                Index Cond: (adjective_id = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
The order by is what is killing me, but I don't know of a good way to avoid it. The count can't be precomputed because the adjectives to be selected are completely arbitrary, and there are more than 150 choose 10 possible combinations. Right now, I think the best option is to grab the 100 best results on friend creation, save them, and then refresh them every n time intervals. This would be acceptable because the adjectives aren't expected to change that often, and I don't need the exact 100 best results. But if I could get the query down to around 1-2 seconds, that wouldn't be necessary. Any suggestions?
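To put a number on why precomputing every combination is out of the question, here is a quick back-of-the-envelope check (standalone Python, not part of the schema above):

```python
import math

# Number of ways to pick 10 adjectives out of 150 -- far too many
# combinations to precompute a top-100 friend list for each one.
combos = math.comb(150, 10)
print(combos)  # over 10**15 distinct adjective sets
```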
Answer
I don't think you'll do much better with that query plan. I'll take your word that the count can't be precomputed.
I think your best bets are
- Table tuning
- Server tuning
- Faster hardware
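On the server-tuning side, the plan above sorts ~3 million rows before aggregating, so giving the session more sort memory is worth trying first. A sketch (the 256MB value is an assumption — tune it against the RAM your server actually has):

```sql
-- Let the sort and aggregate steps run in memory instead of
-- spilling to disk; 256MB is a guess, adjust for your machine.
SET work_mem = '256MB';

EXPLAIN ANALYZE
SELECT count(friend_id) AS count, friend_id
FROM adjectives_friends
WHERE adjective_id IN (1,2,3,4,5,6,7,8,9,10)
GROUP BY friend_id
ORDER BY count DESC
LIMIT 100;
```

If the `EXPLAIN ANALYZE` output stops showing an external-merge sort, the extra memory is being used.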
If you can use smallint instead of integer, your table and indexes will be narrower, more rows will fit into each page, and your queries should run faster. But smallint is a 2-byte integer, ranging from -32768 to +32767. If you need more id numbers than that, smallint won't work.
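For the table-tuning option, the conversion could look like this (a sketch: adjective_id clearly fits with ~150 adjectives, but friend_id only fits if you have fewer than 32768 friends, and the rewrite locks and rewrites the whole 50-million-row table, so run it in a maintenance window):

```sql
-- Narrow both id columns from 4 bytes to 2 bytes each;
-- the unique indexes shrink along with the table.
ALTER TABLE adjectives_friends
    ALTER COLUMN adjective_id TYPE smallint,
    ALTER COLUMN friend_id TYPE smallint;
```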