Postgres join where foreign table has ALL records

https://dba.stackexchange.com/questions/168460

06-10-2020

Question

I have a people table and a tags table, like this:

CREATE TABLE people
AS
  SELECT *
  FROM ( VALUES
    (1,'Joe'),
    (2,'Jane')
  ) AS t(id,name);

CREATE TABLE tags
AS
  SELECT * FROM ( VALUES
    (1, 1, 'np'),
    (2, 1, 'yw'),
    (3, 2, 'np')
  ) AS t(id, people_id, tag);

If I want to find all people that contain both the np and yw tags in the tags table using a join, how would I do this efficiently in Postgres 9.6?

In this scenario, I should only get Joe's record from the people table.

Solution

A small variant on mendosi's answer, which avoids WITH:

SELECT *
FROM people 
WHERE id IN 
(    
      SELECT people_id
      FROM tags
      WHERE tag IN ('np', 'yw')
      GROUP BY people_id
      HAVING COUNT(DISTINCT tag) = 2
);

 id | name
 -: | :---
  1 | Joe

This approach differs only slightly from mendosi's. It may be preferable when:

You use a database that doesn't handle WITH queries (PostgreSQL has supported them for a long time).
You don't feel comfortable with WITH.
You want to avoid the fact that, in PostgreSQL, WITH queries are optimization fences and (as of today) can prevent the database from performing some optimizations.

This should be very close to fully standard SQL, and it works in all databases available at DBFiddle (as of today).

dbfiddle here

If you're looking for the fastest possible solution, I'd check different approaches under practical conditions, and decide based on the timings you actually get. My proposed query is very standard, and shouldn't be slower than the one with a WITH, but whether it is slower or faster than other approaches, I don't really know in advance.
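
One way to gather such timings, as a minimal sketch: prefix each candidate query with EXPLAIN (ANALYZE, BUFFERS), which executes the query and reports the plan that was chosen together with actual run times and buffer usage. The query is the one proposed above. On the two-row sample tables the numbers are meaningless, so this assumes it is run against realistically sized data.

```sql
-- Executes the query and prints the plan with actual timings.
-- Run it several times, on production-sized tables, before comparing approaches.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM people
WHERE id IN (
    SELECT people_id
    FROM tags
    WHERE tag IN ('np', 'yw')
    GROUP BY people_id
    HAVING COUNT(DISTINCT tag) = 2
);
```

Repeating the same command for each alternative query gives plans and timings that can be compared side by side.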

Other tips

Here are a couple of alternate approaches which don't involve using array_agg.

Use the INTERSECT operator against the sets of people_id returned for each tag:

WITH both_tags AS (
    SELECT people_id FROM tags WHERE tag = 'np'
    INTERSECT 
    SELECT people_id FROM tags WHERE tag = 'yw')
SELECT *
  FROM people 
  WHERE id IN (SELECT people_id FROM both_tags);

Or you could use a COUNT(DISTINCT tag) = 2 to find people with both tags. (Note that the DISTINCT was added to handle the case that a person may have the same tag twice. If that is impossible, it's safe to remove.)

WITH both_tags AS (
    SELECT people_id
      FROM tags
      WHERE tag IN ('np', 'yw')
      GROUP BY people_id
      HAVING COUNT(DISTINCT tag) = 2)
SELECT *
  FROM people 
  WHERE id IN (SELECT people_id FROM both_tags);

This second approach would be easier to extend to accept an arbitrary number of tags, though the first approach would not be impossible.
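
As a sketch of that extension, the wanted tags can be listed once and the required count derived from the list, so that adding a tag means editing a single place. The CTE name wanted is made up for this illustration and is not part of the original answer.

```sql
-- "wanted" is a hypothetical name: the list of tags every matching person must have.
WITH wanted(tag) AS (
    VALUES ('np'), ('yw')            -- add more rows here for more tags
)
SELECT p.*
FROM people AS p
WHERE p.id IN (
    SELECT t.people_id
    FROM tags AS t
    JOIN wanted AS w ON w.tag = t.tag
    GROUP BY t.people_id
    -- the person must have as many distinct matching tags as there are wanted tags
    HAVING COUNT(DISTINCT t.tag) = (SELECT COUNT(DISTINCT tag) FROM wanted)
);
```

With the sample data from the question this should again return only Joe.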

Two more ways - that use joins or correlated subqueries - and no GROUP BY:

The first uses EXISTS subqueries:

select p.id, p.name
from people as p 
where exists (select from tags as t where t.people_id = p.id and t.tag = 'np')
  and exists (select from tags as t where t.people_id = p.id and t.tag = 'yw')
;

The second assumes a UNIQUE constraint on (tag, people_id):

select p.id, p.name
from people as p 
  join tags as t1 on t1.people_id = p.id and t1.tag = 'np'
  join tags as t2 on t2.people_id = p.id and t2.tag = 'yw'
;
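
The UNIQUE constraint assumed by the join version is not part of the sample schema in the question. A sketch of how it could be declared follows; the constraint name is made up for illustration.

```sql
-- Hypothetical constraint name; the column order follows the (tag, people_id) assumption above.
ALTER TABLE tags
    ADD CONSTRAINT tags_tag_people_id_key UNIQUE (tag, people_id);
```

Without such a constraint, a person who has the same tag twice would appear more than once in the join result. Declaring it also creates a unique index on (tag, people_id), which the planner may be able to use for the per-tag lookups.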

Tests at dbfiddle.uk.

Check also this question at SO, with more than 10 ways to solve this type of problem - and performance analysis: How to filter SQL results in a has-many-through relation.

There is even a tag for them: relational-division

It may be surprising, but most often the multiple-join method, the multiple-EXISTS method, and similar ones (like the one that uses INTERSECT) are more efficient than the GROUP BY / COUNT methods. But of course there are many details that matter for performance. Query parameters, table sizes, indexes, data distributions and many more can affect the performance of the various methods.

Here we select all people and aggregate their tags with array_agg, in a single pass. Then we filter the groups in the HAVING clause, keeping only those whose tag array contains both np and yw.

SELECT people_id, name, array_agg(tag) AS tags
FROM people
JOIN tags ON (people_id = people.id)
GROUP BY people_id, name
HAVING array_agg(tag) @> ARRAY['np', 'yw'];

 people_id | name |  tags   
-----------+------+---------
         1 | Joe  | {np,yw}
(1 row)

You can sometimes make this faster by pushing the tag condition down into the WHERE clause:

SELECT people_id, name, array_agg(tag) AS tags
FROM people
JOIN tags ON (people_id = people.id)

-- push down
WHERE tag IN ('np', 'yw')

GROUP BY people_id, name
HAVING array_agg(tag) @> ARRAY['np', 'yw'];

You could also just put the tag array on the people directly. Then querying it becomes dirt simple.
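
A minimal sketch of that denormalized layout, assuming a new text[] column on people that is filled from the existing tags table. The column name and the index name are made up for illustration, and keeping the array in sync with later changes is not shown.

```sql
-- Hypothetical column: all of a person's tags stored as an array on people.
ALTER TABLE people ADD COLUMN tags text[];

-- One-time backfill from the existing tags table.
UPDATE people AS p
SET tags = (SELECT array_agg(t.tag) FROM tags AS t WHERE t.people_id = p.id);

-- Optional GIN index (hypothetical name) so that @> lookups can use an index.
CREATE INDEX people_tags_gin ON people USING gin (tags);

-- The query then becomes a single containment test.
SELECT id, name
FROM people
WHERE tags @> ARRAY['np', 'yw'];
```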

Another way with a simple equi-join:

select p.id, name
from people p join tags on tags.people_id=p.id
where tag in ('np','yw')
group by p.id, name
having count(distinct tag)=2;

id | name
-: | :---
 1 | Joe

dbfiddle here

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange