Optimizing a “categorized” search table using trigrams
-
31-12-2020 - |
문제
I've been reading through this excellent answer to understand how pg_trgm works a bit, but I'm still unclear on the most efficient way of solving this query (efficient in terms of speed of search):
I have a table search
that I run trigram searches on that looks like this:
Column | Type | Modifiers
------------+---------+-----------
id | bpchar | collate C
user_id | integer |
type | text |
search_on | text | collate C
data | json |
Indexes:
"index_search_id" UNIQUE, btree (id)
"index_search_search_on" gist (search_on gist_trgm_ops)
"index_search_type" btree (type)
"index_search_user_id" btree (user_id)
In this scenario, user_id
is NULLable
and type
is also NULLable
. The queries I'd run amount to these possibilities:
- Search for rows
(WHERE user_id = 123 OR user_id IS NULL) AND search_on % 'mystring'
- Search for rows
(WHERE user_id = 123 OR user_id IS NULL) AND type='my-type' AND search_on % 'mystring'
In plain words, I want all rows that have my user_id or NULL user_id, optionally are categorized by type
, and match the term being passed in.
Right now I just have individual indexes on the 3 columns (as shown above) that can change based on the query. I understand however that a single index is generally more efficient.
Is it possible to use a single index that does trigram searches, but also exact match on user_id
and type
where they can optionally be NULL
.
해결책
Is it possible to use a single index that does trigram searches, but also exact match on
user_id
andtype
where they can optionally be NULL.
Yes, NULL is included in indexes. And you can search for it like for any other value.
Yes, you can have a multicolumn trigram GiST index. But GiST indexes typically don't make sense for the data type integer
. Btree indexes are better in every respect - except for your case of a multicolumn index. So Postgres does not install the required operator class by default. You need to install the additional module btree_gist
first, once per database:
CREATE EXTENSION IF NOT EXISTS btree_gist; -- only if not installed, yet
Then you can create your multicolumn index:
CREATE INDEX foo ON search USING gist (user_id, type, search_on gist_trgm_ops);
Related (with detailed instructions):
- PostgreSQL EXCLUDE USING error: Data type integer has no default operator class
- Optimizing queries on a range of timestamps (two columns)
And get operator precedence in your WHERE
clause right:
WHERE (user_id = 123 OR user_id IS NULL) -- parentheses!
AND search_on % 'mystring'
Or:
WHERE (user_id = 123 OR user_id IS NULL)
AND (type = 'my-type' OR type IS NULL)
AND search_on % 'mystring'
Depending on data distribution, cardinalities, selectivity of predicates, cost settings etc. Postgres may still prefer an index on one (or two) column(s) (occasionally).
다른 팁
Sorry for the delay on this one. I'm not sure if protocol dictates that I answer my own question to post these details, but comments (to Erwin's answer) don't provide enough space.
So I was noticing pretty poor performance using the above individual indexes when I'd run queries. I have 2 main uses cases:
- Query for all things in the 'public sphere' or my own:
(user_id = 123 OR user_id IS NULL) AND search_on % 'my_string'
- Query for all things by type, in the 'public sphere' or my own
(user_id = 123 OR user_id IS NULL) AND type='my-type' AND search_on % 'mystring'
I forgot to mention that type IS NOT NULL
in all of this, so there's never a case where I want to fetch NULL typed rows, only NULL user_id rows.
Using the separate indexes from the original post, I'd see queries taking anywhere from 1s to 10+s. It seemed as if "warming up" and running a few queries would bring the time down, but even 1 second is insufficient for a typeahead without negatively affecting the UX.
Playing with Erwin's suggestions, I added first:
CREATE INDEX idx_typed_search ON search USING gist (user_id, type, search_on gist_trgm_ops);
And ran both typed and non-typed queries. The typed queries were insanely fast (~ 50 ms) and the non-typed were still pretty slow, but definitely faster than the original posting with separate indexes (~ 400 ms)
Then I added:
CREATE INDEX idx_search ON search USING gist (user_id, search_on gist_trgm_ops);
and ran both typed and non-typed queries. As expected, typed queries weren't affected, they used the other index, but non-typed queries actually got worse (~ 1000+ ms). Dropping that index brought them back down to (~ 400ms)
So I wasn't sure if it was the presence of two potentially competing indexes to choose from that caused performance issues. Since I'm absolutely satisfied with the typed indexing (~ 50ms is roughly the speed I was hoping for) I decided to focus just on the untyped queries.
With no indexes, the baseline query using a sequence scan runs in about 8s (roughly 700,000 total rows)
With the individual btree index on user_id
plus the gist_trgm on search_on
, the performance is in the 400ms range. Interestingly it never uses the user_id index, because by the time it index scans on the search term, there's probably only a few rows to filter out from other user_ids (given my current data). This could likely change over time with more user data though, so I think the user_id index still makes sense even though its unused in this case.
With the combined gist (user_id, search_on) index, I get performance roughly similar to the 2 separate indexes (~ 400ms)
I should mention that the shape of the data is such that there would be 100's of records owned by each user_id, but 100's of 1000's (a majority) of records in the 'public sphere' (user_id IS NULL).
Given that data shape, it seems best to simply optimize for the search term and type themselves, as the user_id filtering at the end of the query is cheap.
Since a specific non-typed index doesn't actually yield any performance benefit, it seems the best option is to stick with a single gist index on the type
and search_on
columns only:
CREATE INDEX idx_typed_global_search ON searchables USING gist (type, search_on gist_trgm_ops);
I don't have enough users currently to test the user_id filtering, but my guess is that as we grow our user base, an index on user_id would yield fruit.