Question

I have the following tables: users, tags, tags_data.
tags_data contains tag_id and user_id columns to link the users with tags in a 1 user to many tags relationship.

What is the best way of listing all users that have either tag_id 1001 AND 1003, OR tag_id 1004?
EDIT: By this I mean there could be other related tags as well, or not, just so long as there is definitely either 1004 OR (1001 AND 1003).

At the moment I've got two methods of doing this, both using a UNION in a derived table, either in the FROM clause or in an INNER JOIN clause...

SELECT subsel.user_id, users.name 
FROM   ( SELECT user_id 
         FROM   tags_data
         WHERE  tag_id IN (1001, 1003) 
         GROUP  BY user_id 
         HAVING COUNT(tag_id)=2
        UNION 
         SELECT user_id 
         FROM   tags_data 
         WHERE  tag_id=1004
       ) AS subsel 
LEFT JOIN users ON subsel.user_id=users.user_id

Or

SELECT users.user_id, users.name
FROM   users
INNER JOIN ( SELECT user_id
             FROM   tags_data
             WHERE  tag_id  IN (1001, 1003) 
             GROUP  BY user_id
             HAVING COUNT(tag_id)=2
            UNION 
             SELECT user_id
             FROM   tags_data
             WHERE  tag_id=1004
           ) AS subsel ON users.user_id=subsel.user_id

There are other tables which I'll be LEFT JOINing on to this. 50k+ rows in the users table and 150k+ rows in the tags_data table.

This is a batch job to export data to another system so not a real-time query run by an end user, so performance isn't massively critical. However I'd like to try and get the best result I can. The query for the derived table should actually be pretty fast and it makes sense to narrow the scope of the result set down before I then add further joins, functions and calculated fields to the results returned to the client. I will be running these on a larger dataset later to see if there is any performance difference but running EXPLAIN shows an almost identical execution plan.

Generally I try and avoid UNIONs unless absolutely necessary. But I think in this case I almost have to have a UNION somewhere by definition, because of the two effectively unrelated criteria.

Is there another method that I could be using here?
And is there some sort of specific database terminology for this sort of problem?

Full example schema:

CREATE TABLE IF NOT EXISTS `tags` (
  `tag_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `tag_name` varchar(255) NOT NULL,
  PRIMARY KEY (`tag_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1006 ;

INSERT INTO `tags` (`tag_id`, `tag_name`) VALUES
(1001, 'tag1001'),
(1002, 'tag1002'),
(1003, 'tag1003'),
(1004, 'tag1004'),
(1005, 'tag1005');

CREATE TABLE IF NOT EXISTS `tags_data` (
  `tags_data_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `tag_id` int(11) NOT NULL,
  PRIMARY KEY (`tags_data_id`),
  KEY `user_id` (`user_id`,`tag_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=11 ;

INSERT INTO `tags_data` (`tags_data_id`, `user_id`, `tag_id`) VALUES
(1, 1, 1001),
(2, 1, 1002),
(3, 1, 1003),
(4, 5, 1001),
(5, 5, 1003),
(6, 5, 1005),
(7, 8, 1004),
(8, 9, 1001),
(9, 9, 1002),
(10, 9, 1004);

CREATE TABLE IF NOT EXISTS `users` (
  `user_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  PRIMARY KEY (`user_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=11 ;

INSERT INTO `users` (`user_id`, `name`) VALUES
(1, 'user1'),
(2, 'user2'),
(3, 'user3'),
(4, 'user4'),
(5, 'user5'),
(6, 'user6'),
(7, 'user7'),
(8, 'user8'),
(9, 'user9'),
(10, 'user10');

No correct solution

OTHER TIPS

If you are looking for performance on MySQL you should definitely avoid using nested queries and unions — most of them result in a temporary table creation and scanning without indexes. There are rare examples that the derived temporary table still uses indexes and that only work on some specific circumstances and MySQL distributions.

My suggestion would be to rewrite the query to inner/outer joins only, like this:

select distinct u.* from users as u 
  left outer join tags_data as t on 
    t.user_id=u.user_id and t.tag_id=1003 
  inner join tags_data as t2 on 
    t2.user_id=u.user_id 
    and (t2.tag_id=1004 or (t2.tag_id=1001 and t.tag_id=1003));

If you can be sure that no user can have both 1004 and (1001 and 1003) tags, you may also remove the "distinct" from this query, which would avoid a temporary table creation.

You should also definitely use indexes, like these:

create index tags_data__user_id__idx on tags_data(user_id);
create index tags_data__tag_id__idx on tags_data(tag_id);

This would make a 150k+ result set very easy to query.

Use an inner query that groups up all tags for each user into one value, then use a simple filter in the where clause:

select u.*
from users u
join (
  select user_id, group_concat(tag_id order by tag_id) tags
  from tags_data
  group by user_id
) t on t.user_id = u.user_id
where tags rlike '1001.*1003|1004'

See SQLFiddle of this query running against your sample data.

If there where many tags, you could add where tag_id in (1001, 1003, 1004) to the inner query to reduce the size of the tags list as a small optimization. Testing will show whether this makes much difference.

This should perform pretty well, because each table is scanned only once.

Efficient, but inelegant, and not flexible at all:

SELECT users.*
FROM users
LEFT JOIN tags_data AS tag1001
    ON (tag1001.user_id = users.user_id AND tag1001.tag_id = 1001)
LEFT JOIN tags_data AS tag1003
    ON (tag1003.user_id = users.user_id AND tag1003.tag_id = 1003)
LEFT JOIN tags_data AS tag1004
    ON (tag1004.user_id = users.user_id AND tag1004.tag_id = 1004)
WHERE (tag1001.tag_id AND tag1003.tag_id) OR (tag1004.tag_id);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top