Optimizing a “categorized” search table using trigrams

https://dba.stackexchange.com/questions/206617

31-12-2020
|

문제

I've been reading through this excellent answer to understand how pg_trgm works a bit, but I'm still unclear on the most efficient way of solving this query (efficient in terms of speed of search):

I have a table search that I run trigram searches on that looks like this:

Column      |  Type   | Modifiers
------------+---------+-----------
id          | bpchar  | collate C
user_id     | integer |
type        | text    |
search_on   | text    | collate C
data        | json    |
Indexes:
 "index_search_id" UNIQUE, btree (id)
 "index_search_search_on" gist (search_on gist_trgm_ops)
 "index_search_type" btree (type)
 "index_search_user_id" btree (user_id)

In this scenario, user_id is NULLable and type is also NULLable. The queries I'd run amount to these possibilities:

Search for rows (WHERE user_id = 123 OR user_id IS NULL) AND search_on % 'mystring'
Search for rows (WHERE user_id = 123 OR user_id IS NULL) AND type='my-type' AND search_on % 'mystring'

In plain words, I want all rows that have my user_id or NULL user_id, optionally are categorized by type, and match the term being passed in.

Right now I just have individual indexes on the 3 columns (as shown above) that can change based on the query. I understand however that a single index is generally more efficient.

Is it possible to use a single index that does trigram searches, but also exact match on user_id and type where they can optionally be NULL.

해결책

Is it possible to use a single index that does trigram searches, but also exact match on user_id and type where they can optionally be NULL.

Yes, NULL is included in indexes. And you can search for it like for any other value.

Yes, you can have a multicolumn trigram GiST index. But GiST indexes typically don't make sense for the data type integer. Btree indexes are better in every respect - except for your case of a multicolumn index. So Postgres does not install the required operator class by default. You need to install the additional module btree_gist first, once per database:

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- only if not installed, yet

Then you can create your multicolumn index:

CREATE INDEX foo ON search USING gist (user_id, type, search_on gist_trgm_ops);

Related (with detailed instructions):

And get operator precedence in your WHERE clause right:

WHERE (user_id = 123 OR user_id IS NULL)  -- parentheses!
AND    search_on % 'mystring'

Or:

WHERE (user_id = 123 OR user_id IS NULL)
AND   (type = 'my-type' OR type IS NULL)
AND    search_on % 'mystring'

Depending on data distribution, cardinalities, selectivity of predicates, cost settings etc. Postgres may still prefer an index on one (or two) column(s) (occasionally).

다른 팁

Sorry for the delay on this one. I'm not sure if protocol dictates that I answer my own question to post these details, but comments (to Erwin's answer) don't provide enough space.

So I was noticing pretty poor performance using the above individual indexes when I'd run queries. I have 2 main uses cases:

Query for all things in the 'public sphere' or my own: (user_id = 123 OR user_id IS NULL) AND search_on % 'my_string'
Query for all things by type, in the 'public sphere' or my own (user_id = 123 OR user_id IS NULL) AND type='my-type' AND search_on % 'mystring'

I forgot to mention that type IS NOT NULL in all of this, so there's never a case where I want to fetch NULL typed rows, only NULL user_id rows.

Using the separate indexes from the original post, I'd see queries taking anywhere from 1s to 10+s. It seemed as if "warming up" and running a few queries would bring the time down, but even 1 second is insufficient for a typeahead without negatively affecting the UX.

Playing with Erwin's suggestions, I added first:

CREATE INDEX idx_typed_search ON search USING gist (user_id, type, search_on gist_trgm_ops);

And ran both typed and non-typed queries. The typed queries were insanely fast (~ 50 ms) and the non-typed were still pretty slow, but definitely faster than the original posting with separate indexes (~ 400 ms)

Then I added:

CREATE INDEX idx_search ON search USING gist (user_id, search_on gist_trgm_ops);

and ran both typed and non-typed queries. As expected, typed queries weren't affected, they used the other index, but non-typed queries actually got worse (~ 1000+ ms). Dropping that index brought them back down to (~ 400ms)

So I wasn't sure if it was the presence of two potentially competing indexes to choose from that caused performance issues. Since I'm absolutely satisfied with the typed indexing (~ 50ms is roughly the speed I was hoping for) I decided to focus just on the untyped queries.

With no indexes, the baseline query using a sequence scan runs in about 8s (roughly 700,000 total rows)

With the individual btree index on user_id plus the gist_trgm on search_on, the performance is in the 400ms range. Interestingly it never uses the user_id index, because by the time it index scans on the search term, there's probably only a few rows to filter out from other user_ids (given my current data). This could likely change over time with more user data though, so I think the user_id index still makes sense even though its unused in this case.

With the combined gist (user_id, search_on) index, I get performance roughly similar to the 2 separate indexes (~ 400ms)

I should mention that the shape of the data is such that there would be 100's of records owned by each user_id, but 100's of 1000's (a majority) of records in the 'public sphere' (user_id IS NULL).

Given that data shape, it seems best to simply optimize for the search term and type themselves, as the user_id filtering at the end of the query is cheap.

Since a specific non-typed index doesn't actually yield any performance benefit, it seems the best option is to stick with a single gist index on the type and search_on columns only:

CREATE INDEX idx_typed_global_search ON searchables USING gist (type, search_on gist_trgm_ops);

I don't have enough users currently to test the user_id filtering, but my guess is that as we grow our user base, an index on user_id would yield fruit.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 dba.stackexchange