Optimizing a “categorized” search table using trigrams

https://dba.stackexchange.com/questions/206617

31-12-2020
|

Domanda

I've been reading through this excellent answer to understand how pg_trgm works a bit, but I'm still unclear on the most efficient way of solving this query (efficient in terms of speed of search):

I have a table search that I run trigram searches on that looks like this:

Column      |  Type   | Modifiers
------------+---------+-----------
id          | bpchar  | collate C
user_id     | integer |
type        | text    |
search_on   | text    | collate C
data        | json    |
Indexes:
 "index_search_id" UNIQUE, btree (id)
 "index_search_search_on" gist (search_on gist_trgm_ops)
 "index_search_type" btree (type)
 "index_search_user_id" btree (user_id)

In this scenario, user_id is NULLable and type is also NULLable. The queries I'd run amount to these possibilities:

Search for rows (WHERE user_id = 123 OR user_id IS NULL) AND search_on % 'mystring'
Search for rows (WHERE user_id = 123 OR user_id IS NULL) AND type='my-type' AND search_on % 'mystring'

In plain words, I want all rows that have my user_id or NULL user_id, optionally are categorized by type, and match the term being passed in.

Right now I just have individual indexes on the 3 columns (as shown above) that can change based on the query. I understand however that a single index is generally more efficient.

Is it possible to use a single index that does trigram searches, but also exact match on user_id and type where they can optionally be NULL.

Soluzione

Is it possible to use a single index that does trigram searches, but also exact match on user_id and type where they can optionally be NULL.

Yes, NULL is included in indexes. And you can search for it like for any other value.

Yes, you can have a multicolumn trigram GiST index. But GiST indexes typically don't make sense for the data type integer. Btree indexes are better in every respect - except for your case of a multicolumn index. So Postgres does not install the required operator class by default. You need to install the additional module btree_gist first, once per database:

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- only if not installed, yet

Then you can create your multicolumn index:

CREATE INDEX foo ON search USING gist (user_id, type, search_on gist_trgm_ops);

Related (with detailed instructions):

And get operator precedence in your WHERE clause right:

WHERE (user_id = 123 OR user_id IS NULL)  -- parentheses!
AND    search_on % 'mystring'

Or:

WHERE (user_id = 123 OR user_id IS NULL)
AND   (type = 'my-type' OR type IS NULL)
AND    search_on % 'mystring'

Depending on data distribution, cardinalities, selectivity of predicates, cost settings etc. Postgres may still prefer an index on one (or two) column(s) (occasionally).

Altri suggerimenti

Sorry for the delay on this one. I'm not sure if protocol dictates that I answer my own question to post these details, but comments (to Erwin's answer) don't provide enough space.

So I was noticing pretty poor performance using the above individual indexes when I'd run queries. I have 2 main uses cases:

Query for all things in the 'public sphere' or my own: (user_id = 123 OR user_id IS NULL) AND search_on % 'my_string'
Query for all things by type, in the 'public sphere' or my own (user_id = 123 OR user_id IS NULL) AND type='my-type' AND search_on % 'mystring'

I forgot to mention that type IS NOT NULL in all of this, so there's never a case where I want to fetch NULL typed rows, only NULL user_id rows.

Using the separate indexes from the original post, I'd see queries taking anywhere from 1s to 10+s. It seemed as if "warming up" and running a few queries would bring the time down, but even 1 second is insufficient for a typeahead without negatively affecting the UX.

Playing with Erwin's suggestions, I added first:

CREATE INDEX idx_typed_search ON search USING gist (user_id, type, search_on gist_trgm_ops);

And ran both typed and non-typed queries. The typed queries were insanely fast (~ 50 ms) and the non-typed were still pretty slow, but definitely faster than the original posting with separate indexes (~ 400 ms)

Then I added:

CREATE INDEX idx_search ON search USING gist (user_id, search_on gist_trgm_ops);

and ran both typed and non-typed queries. As expected, typed queries weren't affected, they used the other index, but non-typed queries actually got worse (~ 1000+ ms). Dropping that index brought them back down to (~ 400ms)

So I wasn't sure if it was the presence of two potentially competing indexes to choose from that caused performance issues. Since I'm absolutely satisfied with the typed indexing (~ 50ms is roughly the speed I was hoping for) I decided to focus just on the untyped queries.

With no indexes, the baseline query using a sequence scan runs in about 8s (roughly 700,000 total rows)

With the individual btree index on user_id plus the gist_trgm on search_on, the performance is in the 400ms range. Interestingly it never uses the user_id index, because by the time it index scans on the search term, there's probably only a few rows to filter out from other user_ids (given my current data). This could likely change over time with more user data though, so I think the user_id index still makes sense even though its unused in this case.

With the combined gist (user_id, search_on) index, I get performance roughly similar to the 2 separate indexes (~ 400ms)

I should mention that the shape of the data is such that there would be 100's of records owned by each user_id, but 100's of 1000's (a majority) of records in the 'public sphere' (user_id IS NULL).

Given that data shape, it seems best to simply optimize for the search term and type themselves, as the user_id filtering at the end of the query is cheap.

Since a specific non-typed index doesn't actually yield any performance benefit, it seems the best option is to stick with a single gist index on the type and search_on columns only:

CREATE INDEX idx_typed_global_search ON searchables USING gist (type, search_on gist_trgm_ops);

I don't have enough users currently to test the user_id filtering, but my guess is that as we grow our user base, an index on user_id would yield fruit.

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a dba.stackexchange