Arbitrary queries on n:m relationship, including “all” and “any”
Question
I'm using postgres >= 9.6. I have tasks, tags and task_tags tables, for a typical n:m relationship between tasks and tags.
I'd like to be able to programmatically build queries against the tasks table that supports queries against the actual fields of tasks, but also on the tags (tag names) of a task.
Queries on the task fields themselves are straight-forward. Queries on the tags for a "does it have tag A?" are also straight-forward. What I am struggling with is coming up with a select/query structure that would allow me to also match things like "does it have tag A and tag B"?
The best I've come up with is a lateral join on a subquery with an array aggregation, and then using array matching functions, e.g.:
SELECT DISTINCT ON (tasks.id) tasks.*
FROM tasks, LATERAL (SELECT array_agg(tags.name) AS tags
FROM task_tags
INNER JOIN tags
ON task_tags.tag_id = tags.id
WHERE task_tags.task_id = tasks.id
GROUP BY task_tags.task_id) tt
WHERE tt.tags @> array['tag1'::varchar, 'tag3'::varchar];
That way, it should be possible to programmatically build a WHERE clause (using tasks.* and tt.tags) satisfying all of the conditions of the user-provided "query".
However, I'm not sure if this is the best way of doing it - thoughts? Is this query any efficient? Is there any index I could create that would improve it?
Similarly, is there any way at all of making it work with wildcards against the tag names? Normal array matching wouldn't allow that, and solutions I've seen suggest using unnest (or, well, not using arrays in the first place), but then I'd lose the ability of saying "it needs to have both tagA and tagB".
Is there any other way of building a query on these relationships that would allow that kind of "both tagA and tagB" matching?
Solution
Is this query any efficient?
No, it is extremely inefficient for multiple reasons.
0. Test schema
Basing my answer on this simplified schema:
CREATE TABLE tag (
tag_id serial PRIMARY KEY
, tag text NOT NULL
);
CREATE TABLE task (
task_id serial PRIMARY KEY
, task text NOT NULL
);
CREATE TABLE task_tag (
task_id int
, tag_id int
, PRIMARY KEY (tag_id, task_id) -- columns in this order
, FOREIGN KEY (task_id) REFERENCES task
, FOREIGN KEY (tag_id) REFERENCES tag;
);
1. The query is very inefficient
Your original query, adapted to test schema:
SELECT DISTINCT ON (task.task_id) task.* -- DISTINCT is dead freight
FROM task, LATERAL (
SELECT array_agg(tag.tag) AS tags -- array_agg more expensive than array constructor
FROM task_tag
JOIN tag ON task_tag.tag_id = tag.tag_id
WHERE task_tag.task_id = task.task_id
GROUP BY task_tag.task_id -- redundant noise
) tt
WHERE tt.tags @> array['tag1'::text, 'tag3'::text]; -- or array literal.
What's wrong? For starters:
- No reason to add
DISTINCT ON
. The lateral join only joins to a single row after the aggregation. - An array constructor is cheaper for this simple array aggregation.
- No need to add
GROUP BY
, after theWHERE
clause already filters onetask_id
. - Depending on how you call the query, it may be more convenient to pass a single array literal instead of an array constructor.
Equivalent query 1:
SELECT ts.*
FROM task ts
JOIN LATERAL (
SELECT ARRAY (
SELECT tg.tag
FROM task_tag tt
JOIN tag tg USING (tag_id)
WHERE tt.task_id = ts.task_id
) AS tags
) tt ON tt.tags @> '{tag1, tag3}'::text[];
But that's still a waste. The LATERAL
does more harm than good while all rows have to be aggregated anyway.
Equivalent query 2:
SELECT ts.*
FROM task ts
JOIN (
SELECT task_id
FROM task_tag
GROUP BY 1
HAVING array_agg(tag_id) @> '{1, 3}'::int[];
) AS tags USING (task_id);
As you can see, I am not building array of tags, just tag_id
. Much more efficient. Resolve your input tags to IDs before you pass them to the query.
Much faster than before but still extremely inefficient for tables of non-trivial size.
2. The whole approach is very inefficient
The above approach cannot use an index. The dynamically generated array does work with indexes. The predicate comes after the aggregation.
2.1 MATERIALIZED VIEW
for read-only (or mostly) tables
To make it work with an index you would have to add a MATERIALIZED VIEW
with readily aggregated tag arrays (or IDs, preferably) per task_id
. And a GIN index on the array column. Only good for read-only (or mostly) tables, since the MV is just a snapshot, quickly outdated with writes to the underlying tables.
Related:
The MATERIALIZED VIEW
:
CREATE MATERIALIZED VIEW task_tags_mv AS
SELECT task_id, array_agg(tag_id) AS tag_ids
FROM task_tag
GROUP BY 1;
Of course, we work with tag_id
again, not raw tags. And for integer
arrays there are much faster, specialized operator classes and indexes available from the additional module intarray
. Related:
So we use this specialized index:
CREATE INDEX task_tags_mv_arr_idx ON task_tags_mv USING GIN(tag_ids gin__int_ops);
And this query:
SELECT ts.*
FROM task ts
JOIN task_tags_mv mv USING (task_id)
WHERE mv.tag_ids @> '{1, 3}'::int[]; -- uses intarray operator
This is fast, now. And the same setup works for matching ANY tag, using the overlap operator &&
instead of the contains operator:
...
WHERE mv.tag_ids && '{1, 3}'::int[]; -- uses intarray operator
2.2 Generic queries
Else you need completely different query techniques that can work with indexes. Yper* pointed to a related question on SO in his comment:
You specific requirement is to make is simple to provide a list of tags to a generic query.
A recursive CTE is one way without dynamic SQL. (There are many others.) Nested in an SQL function with VARIADIC
parameter for convenience:
CREATE OR REPLACE FUNCTION f_tasks_with_tags(VARIADIC _tags text[])
RETURNS SETOF task AS
$func$
WITH RECURSIVE cte AS (
SELECT task_id, 1 AS idx
FROM task_tag
WHERE tag_id = (SELECT tag_id FROM tag WHERE tag = _tags[1])
UNION
SELECT task_id, c.idx + 1
FROM cte c
JOIN task_tag tt USING (task_id)
WHERE tag_id = (SELECT tag_id FROM tag WHERE tag = _tags[c.idx + 1])
)
SELECT t.*
FROM cte c
JOIN task t USING (task_id)
WHERE c.idx = array_length(_tags, 1)
$func$ LANGUAGE sql STABLE;
Call:
SELECT * FROM f_tasks_with_tags('tag1', 'tag3'); -- add any number of tags
About VARIADIC
(follow links for more):
Depending on value frequencies, it helps performance (a lot) to pass sparse tags first. It is cheaper to eliminate non-qualifying tasks early.
Another potentially even faster approach: dynamic SQL. Build and execute a query like:
SELECT t.*
FROM (
SELECT task_id
FROM task_tag t1
JOIN task_tag t2 USING (task_id)
-- JOIN task_tag t3 USING (task_id)
-- more ...
WHERE t1.tag_id = 1
AND t2.tag_id = 3
-- AND t3.tag_id = 789
-- more ...
) tt
JOIN task t USING (task_id);
In this query, Postgres can optimize the sequence of joins automatically to evaluate sparse tags first - up to a maximum of join_collapse_limit
tables. See:
There are many code examples for building queries dynamically in a plpgsql function on this site. Try a search.
SQL Fiddle (also has intarray module).
dbfiddle here