Question

I have the following data structure and data:

CREATE TABLE `parent` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(10) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO `parent` VALUES(1, 'parent 1');
INSERT INTO `parent` VALUES(2, 'parent 2');

CREATE TABLE `other` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(10) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO `other` VALUES(1, 'other 1');
INSERT INTO `other` VALUES(2, 'other 2');

CREATE TABLE `relationship` (
  `id` int(11) NOT NULL auto_increment,
  `parent_id` int(11) NOT NULL,
  `other_id` int(11) NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO `relationship` VALUES(1, 1, 1);
INSERT INTO `relationship` VALUES(2, 1, 2);
INSERT INTO `relationship` VALUES(3, 2, 1);

I want to find the the parent records with both other's 1 & 2.

This is what I've figured out, but I'm wondering if there is a better way:

SELECT p.id, p.name
FROM parent AS p
    LEFT JOIN relationship AS r1 ON (r1.parent_id = p.id)
    LEFT JOIN relationship AS r2 ON (r2.parent_id = p.id)
WHERE r1.other_id = 1 AND r2.other_id = 2;

The result is 1, "parent 1" which is correct. The problem is that once you get a list of 5+ joins, it gets messy and as the relationship table grows, it gets slow.

Is there a better way?

I'm using MySQL and PHP, but this is probably pretty generic.

Was it helpful?

Solution

Ok, I tested this. The queries from best to worst were:

Query 1: Joins (0.016s; basically instant)

SELECT p.id, name
FROM parent p
JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 100
JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 101
JOIN relationship r3 ON p.id = r3.parent_id AND r3.other_id = 102
JOIN relationship r4 ON p.id = r4.parent_id AND r4.other_id = 103

Query 2: EXISTS (0.625s)

SELECT id, name
FROM parent p
WHERE EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 100)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 101)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 102)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND oth

Query 3: Aggregate (1.016s)

SELECT p.id, p.name FROM parent p WHERE (SELECT COUNT(*) FROM relationship WHERE parent_id = p.id AND other_id IN (100,101,102,103))

Query 4: UNION Aggregate (2.39s)

SELECT id, name FROM (
  SELECT p1.id, p1.name
  FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
  WHERE r1.other_id = 100
  UNION ALL
  SELECT p2.id, p2.name
  FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
  WHERE r2.other_id = 101
  UNION ALL
  SELECT p3.id, p3.name
  FROM parent AS p3 LEFT JOIN relationship as r3 ON(r3.parent_id=p3.id)
  WHERE r3.other_id = 102
  UNION ALL
  SELECT p4.id, p4.name
  FROM parent AS p4 LEFT JOIN relationship as r4 ON(r4.parent_id=p4.id)
  WHERE r4.other_id = 103
) a
GROUP BY id, name
HAVING count(*) = 4

Actually the above was producing the wrong data so it's either wrong or I did something wrong with it. Whatever the case, the above is just a bad idea.

If that's not fast then you need to look at the explain plan for the query. You're probably just lacking appropriate indices. Try it with:

CREATE INDEX ON relationship (parent_id, other_id)

Before you go down the route of aggregation (SELECT COUNT(*) FROM ...) you should read SQL Statement - “Join” Vs “Group By and Having”.

Note: The above timings are based on:

CREATE TABLE parent (
  id INT PRIMARY KEY,
  name VARCHAR(50)
);

CREATE TABLE other (
  id INT PRIMARY KEY,
  name VARCHAR(50)
);

CREATE TABLE relationship (
  id INT PRIMARY KEY,
  parent_id INT,
  other_id INT
);

CREATE INDEX idx1 ON relationship (parent_id, other_id);
CREATE INDEX idx2 ON relationship (other_id, parent_id);

and nearly 800,000 records created with:

<?php
ini_set('max_execution_time', 600);

$start = microtime(true);

echo "<pre>\n";
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo "Connect error: " . mysql_error() . "\n";
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo "Selct DB error: " . mysql_error() . "\n";
}

define('PARENTS', 100000);
define('CHILDREN', 100000);
define('MAX_CHILDREN', 10);
define('SCATTER', 10);
$rel = 0;
for ($i=1; $i<=PARENTS; $i++) {
    query("INSERT INTO parent VALUES ($i, 'Parent $i')");
    $potential = range(max(1, $i - SCATTER), min(CHILDREN, $i + SCATTER));
    $elements = sizeof($potential);
    $other = rand(1, min(MAX_CHILDREN, $elements - 4));
    $j = 0;
    while ($j < $other) {
        $index = rand(0, $elements - 1);
        if (isset($potential[$index])) {
            $c = $potential[$index];
            $rel++;
            query("INSERT INTO relationship VALUES ($rel, $i, $c)");
            unset($potential[$index]);
            $j++;
        }
    }
}
for ($i=1; $i<=CHILDREN; $i++) {
    query("INSERT INTO other VALUES ($i, 'Other $i')");
}

$count = PARENTS + CHILDREN + $rel;
$stop = microtime(true);
$duration = $stop - $start;
$insert = $duration / $count;

echo "$count records added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $insert seconds.\n";
echo "</pre>\n";

function query($str) {
    mysql_query($str);
    if (mysql_error()) {
        echo "$str: " . mysql_error() . "\n";
    }
}
?>

So once again joins carry the day.

OTHER TIPS

Given that parent table contains unique key on (parent_id, other_id) you can do this:

select p.id, p.name 
  from parent as p 
 where (select count(*) 
        from relationship as r 
       where r.parent_id = p.id 
         and r.other_id in (1,2)
        ) >= 2

Simplifying a bit, this should work, and efficiently.

SELECT DISTINCT p.id, p.name
FROM parent p
INNER JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 1
INNER JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 2

will require at least one joined record for each "other" value. And the optimizer should know it only has to find one match each, and it only needs to read the index, not either of the subsidiary tables, one of which isn't even referenced at all.

I haven't actually tested it, but something along the lines of:

SELECT id, name FROM (
  SELECT p1.id, p1.name
  FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
  WHERE r1.other_id = 1
  UNION ALL
  SELECT p2.id, p2.name
  FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
  WHERE r2.other_id = 2
   -- etc
) GROUP BY id, name
HAVING count(*) = 2

The idea is you don't have to do multi-way joins; just concatenate the results of regular joins, group by your ids, and pick the rows that showed up in every segment.

This is a common problem when searching multiple associates via a many to many join. This is often encountered in services using the 'tag' concept e.g. Stackoverflow

See my other post on a better architecture for tag (in your case 'other') storage

Searching is a two step process:

  1. Find all possible candiates of TagCollections that have any/all the tags you require (may be easier using a cursor of loop construct)
  2. Select data based that matches TagCollection

Performance is always faster due to there being significantly less TagCollections than data items to search

You can do it with a nested select , I tested it in MSSQL 2005 but as you said it should be pretty generic

SELECT * FROM parent p
WHERE p.id in(
    SELECT r.parent_Id 
    FROM relationship r 
    WHERE r.parent_id in(1,2) 
    GROUP BY r.parent_id
    HAVING COUNT(r.parent_Id)=2
)

and the number 2 in COUNT(r.parent_Id)=2 is according to the number of joins you need)

If you can put your list of other_id values into a table that would be ideal. The code below looks for parents with AT LEAST the ids given. If you want it to have EXACTLY the same ids (i.e. no extras) you would have to change the query slightly.

SELECT
     p.id,
     p.name
FROM
     My_Other_IDs MOI
INNER JOIN Relationships R ON
     R.other_id = MOI.other_id
INNER JOIN Parents P ON
     P.parent_id = R.parent_id
GROUP BY
     p.parent_id,
     p.name
HAVING
     COUNT(*) = (SELECT COUNT(*) FROM My_Other_IDs)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top