Arbitrary queries on n:m relationship, including “all” and “any”

https://dba.stackexchange.com/questions/194269

10-10-2020
|

Question

I'm using postgres >= 9.6. I have tasks, tags and task_tags tables, for a typical n:m relationship between tasks and tags.

I'd like to be able to programmatically build queries against the tasks table that supports queries against the actual fields of tasks, but also on the tags (tag names) of a task.

Queries on the task fields themselves are straight-forward. Queries on the tags for a "does it have tag A?" are also straight-forward. What I am struggling with is coming up with a select/query structure that would allow me to also match things like "does it have tag A and tag B"?

The best I've come up with is a lateral join on a subquery with an array aggregation, and then using array matching functions, e.g.:

SELECT DISTINCT ON (tasks.id) tasks.* 
FROM   tasks, LATERAL (SELECT array_agg(tags.name) AS tags 
                       FROM   task_tags 
                       INNER JOIN tags 
                       ON task_tags.tag_id = tags.id 
                       WHERE task_tags.task_id = tasks.id 
                       GROUP BY task_tags.task_id) tt
WHERE  tt.tags @> array['tag1'::varchar, 'tag3'::varchar];

That way, it should be possible to programmatically build a WHERE clause (using tasks.* and tt.tags) satisfying all of the conditions of the user-provided "query".

However, I'm not sure if this is the best way of doing it - thoughts? Is this query any efficient? Is there any index I could create that would improve it?

Similarly, is there any way at all of making it work with wildcards against the tag names? Normal array matching wouldn't allow that, and solutions I've seen suggest using unnest (or, well, not using arrays in the first place), but then I'd lose the ability of saying "it needs to have both tagA and tagB".

Is there any other way of building a query on these relationships that would allow that kind of "both tagA and tagB" matching?

Solution

Is this query any efficient?

No, it is extremely inefficient for multiple reasons.

0. Test schema

Basing my answer on this simplified schema:

CREATE TABLE tag (
  tag_id serial PRIMARY KEY
, tag    text NOT NULL
);

CREATE TABLE task (
  task_id  serial PRIMARY KEY
, task     text NOT NULL
);

CREATE TABLE task_tag (
  task_id    int
, tag_id     int
, PRIMARY KEY (tag_id, task_id)  -- columns in this order
, FOREIGN KEY (task_id) REFERENCES task
, FOREIGN KEY (tag_id)  REFERENCES tag;
);

1. The query is very inefficient

Your original query, adapted to test schema:

SELECT DISTINCT ON (task.task_id) task.*  -- DISTINCT is dead freight
FROM   task, LATERAL (
   SELECT array_agg(tag.tag) AS tags  -- array_agg more expensive than array constructor
   FROM   task_tag 
   JOIN   tag ON task_tag.tag_id = tag.tag_id 
   WHERE  task_tag.task_id = task.task_id 
   GROUP  BY task_tag.task_id  -- redundant noise
   ) tt
WHERE  tt.tags @> array['tag1'::text, 'tag3'::text];  -- or array literal.

What's wrong? For starters:

No reason to add DISTINCT ON. The lateral join only joins to a single row after the aggregation.
An array constructor is cheaper for this simple array aggregation.
- Why is array_agg() slower than the non-aggregate ARRAY() constructor?
No need to add GROUP BY, after the WHERE clause already filters one task_id.
Depending on how you call the query, it may be more convenient to pass a single array literal instead of an array constructor.

Equivalent query 1:

SELECT ts.* 
FROM   task ts
JOIN   LATERAL (
   SELECT ARRAY (
      SELECT tg.tag
      FROM   task_tag tt
      JOIN   tag      tg USING (tag_id)
      WHERE  tt.task_id = ts.task_id
      ) AS tags 
   ) tt ON tt.tags @> '{tag1, tag3}'::text[];

But that's still a waste. The LATERAL does more harm than good while all rows have to be aggregated anyway.

Equivalent query 2:

SELECT ts.* 
FROM   task ts
JOIN  (
   SELECT task_id
   FROM   task_tag
   GROUP  BY 1
   HAVING array_agg(tag_id) @> '{1, 3}'::int[];
   ) AS tags USING (task_id);

As you can see, I am not building array of tags, just tag_id. Much more efficient. Resolve your input tags to IDs before you pass them to the query.

Much faster than before but still extremely inefficient for tables of non-trivial size.

2. The whole approach is very inefficient

The above approach cannot use an index. The dynamically generated array does work with indexes. The predicate comes after the aggregation.

2.1 `MATERIALIZED VIEW` for read-only (or mostly) tables

To make it work with an index you would have to add a MATERIALIZED VIEW with readily aggregated tag arrays (or IDs, preferably) per task_id. And a GIN index on the array column. Only good for read-only (or mostly) tables, since the MV is just a snapshot, quickly outdated with writes to the underlying tables.

Can PostgreSQL index array columns?

The MATERIALIZED VIEW:

CREATE MATERIALIZED VIEW task_tags_mv AS
SELECT task_id, array_agg(tag_id) AS tag_ids
FROM   task_tag
GROUP  BY 1;

Of course, we work with tag_id again, not raw tags. And for integer arrays there are much faster, specialized operator classes and indexes available from the additional module intarray. Related:

GIN index on smallint[] column not used or error “operator is not unique”

So we use this specialized index:

CREATE INDEX task_tags_mv_arr_idx ON task_tags_mv USING GIN(tag_ids gin__int_ops);

And this query:

SELECT ts.* 
FROM   task ts
JOIN   task_tags_mv mv USING (task_id)
WHERE  mv.tag_ids @> '{1, 3}'::int[];  -- uses intarray operator

This is fast, now. And the same setup works for matching ANY tag, using the overlap operator && instead of the contains operator:

...
WHERE  mv.tag_ids && '{1, 3}'::int[];  -- uses intarray operator

2.2 Generic queries

Else you need completely different query techniques that can work with indexes. Yper* pointed to a related question on SO in his comment:

How to filter SQL results in a has-many-through relation

You specific requirement is to make is simple to provide a list of tags to a generic query.

A recursive CTE is one way without dynamic SQL. (There are many others.) Nested in an SQL function with VARIADIC parameter for convenience:

CREATE OR REPLACE FUNCTION f_tasks_with_tags(VARIADIC _tags text[])
  RETURNS SETOF task AS
$func$
   WITH RECURSIVE cte AS (
      SELECT task_id, 1 AS idx
      FROM   task_tag
      WHERE  tag_id = (SELECT tag_id FROM tag WHERE tag = _tags[1])

      UNION
      SELECT task_id, c.idx + 1
      FROM   cte c
      JOIN   task_tag tt USING (task_id)
      WHERE  tag_id = (SELECT tag_id FROM tag WHERE tag = _tags[c.idx + 1])
      )
   SELECT t.*
   FROM   cte  c
   JOIN   task t USING (task_id)
   WHERE  c.idx = array_length(_tags, 1)
$func$  LANGUAGE sql STABLE;

Call:

SELECT * FROM f_tasks_with_tags('tag1', 'tag3'); -- add any number of tags

About VARIADIC (follow links for more):

Passing multiple values in single parameter

Depending on value frequencies, it helps performance (a lot) to pass sparse tags first. It is cheaper to eliminate non-qualifying tasks early.

Another potentially even faster approach: dynamic SQL. Build and execute a query like:

SELECT t.*
FROM  (
   SELECT task_id
   FROM   task_tag t1
   JOIN   task_tag t2 USING (task_id)
-- JOIN   task_tag t3 USING (task_id)
-- more ...
   WHERE  t1.tag_id  = 1
   AND    t2.tag_id  = 3
-- AND    t3.tag_id  = 789
-- more ...
   ) tt
JOIN  task t USING (task_id);

In this query, Postgres can optimize the sequence of joins automatically to evaluate sparse tags first - up to a maximum of join_collapse_limit tables. See:

Postgres JOIN conditions vs WHERE conditions

There are many code examples for building queries dynamically in a plpgsql function on this site. Try a search.

SQL Fiddle (also has intarray module).

dbfiddle here

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange