Question

I'm currently using MySQL 5.6.10.

My actual query is more complicated, but here is a simple way to reproduce the problem. I know the query below is useless (select id from x where id in (select id from x...)), but it proves my point.

I created this table:

CREATE  TABLE test (
  id INT NOT NULL AUTO_INCREMENT ,
  PRIMARY KEY (id));

Then I ran this statement 5 times, which created 50 rows in the table:

INSERT INTO test (id) VALUES(null),(null),(null),(null),(null),(null),(null),(null),(null),(null);

And then ran this explain:

EXPLAIN SELECT id FROM test WHERE
       id IN (SELECT id FROM test WHERE id < 5);

And got a plan in which MySQL examines only 4 rows.

Which makes perfect sense to me. But then if I add an OR to the WHERE clause with another IN, like this:

EXPLAIN SELECT id FROM test WHERE 
       id IN (SELECT id FROM test WHERE id < 5)
    OR id IN (SELECT id FROM test WHERE id > 45);

suddenly MySQL is looking at all 50 rows.

I know that the query could be rewritten as SELECT id FROM test WHERE id < 5 OR id > 45, or as a UNION, etc. Again, that's not the point. The point is that MySQL is examining far too many rows.
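
For reference, the two rewrites mentioned above might look like this. It is only a sketch against the same test table; assuming the AUTO_INCREMENT ids run from 1 to 50, both forms should return ids 1 to 4 and 46 to 50.

```sql
-- Direct predicate: no subquery at all.
SELECT id FROM test WHERE id < 5 OR id > 45;

-- UNION form: each branch keeps its own IN subquery,
-- and UNION (without ALL) removes any duplicate ids.
SELECT id FROM test WHERE id IN (SELECT id FROM test WHERE id < 5)
UNION
SELECT id FROM test WHERE id IN (SELECT id FROM test WHERE id > 45);
```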

If I run a FLUSH STATUS / SHOW STATUS LIKE "Handler%" on the first query, this is what I get:

Handler_read_key 5
Handler_external_lock 4
Handler_read_next 4
Handler_read_first 1

But if I do that to the second query, I get:

Handler_read_key 99
Handler_write 9
Handler_external_lock 6
Handler_read_next 59
Handler_read_first 2
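
The measurement described above can be reproduced roughly as follows. This is a sketch that assumes a single session with no other statements run in between; FLUSH STATUS needs the RELOAD privilege, and the exact counter values may vary with server version and configuration.

```sql
-- Reset the session status counters.
FLUSH STATUS;

-- Run the query under test (here, the two-subquery version).
SELECT id FROM test WHERE
       id IN (SELECT id FROM test WHERE id < 5)
    OR id IN (SELECT id FROM test WHERE id > 45);

-- Show the handler counters accumulated by this session since the reset.
SHOW SESSION STATUS LIKE 'Handler%';
```

Repeating the same three steps with the single-subquery version gives the first set of numbers for comparison.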

Why the big difference? I wonder if it is the optimizer, and if so, is there some option I can include in the query that will prevent this "optimization"? This has real practical implications for a query I'm developing. Instead of examining only a few hundred rows, MySQL is examining 120,000.

Solution

Generally speaking, an RDBMS cannot optimise subqueries as well as it can optimise proper table joins. As documented in the MySQL manual under Rewriting Subqueries as Joins:

Sometimes there are other ways to test membership in a set of values than by using a subquery. Also, on some occasions, it is not only possible to rewrite a query without a subquery, but it can be more efficient to make use of some of these techniques rather than to use subqueries. One of these is the IN() construct:

For example, this query:

SELECT * FROM t1 WHERE id IN (SELECT id FROM t2);

Can be rewritten as:

SELECT DISTINCT t1.* FROM t1, t2 WHERE t1.id=t2.id;

In your abstract case (i.e. ignoring other obvious improvements one would make to this query in reality):

SELECT DISTINCT t1.*
FROM   test t1
  JOIN test t2 USING (id)
  JOIN test t3 USING (id)
WHERE  t2.id < 5
    OR t3.id > 45;

For which the execution plan is:

+----+-------------+-------+--------+---------------+---------+---------+------------------+------+-------------------------------------------+
| ID | SELECT_TYPE | TABLE |  TYPE  | POSSIBLE_KEYS |   KEY   | KEY_LEN |       REF        | ROWS |                   EXTRA                   |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+-------------------------------------------+
|  1 | SIMPLE      | t1    | range  | PRIMARY       | PRIMARY |       4 | (null)           |    9 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | t2    | eq_ref | PRIMARY       | PRIMARY |       4 | db_2_129b4.t1.id |    1 | Using index; Distinct                     |
|  1 | SIMPLE      | t3    | eq_ref | PRIMARY       | PRIMARY |       4 | db_2_129b4.t1.id |    1 | Using index; Distinct                     |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+-------------------------------------------+

See it on sqlfiddle.
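
As a sanity check, the join rewrite can be compared with the original subquery form before relying on it. The sketch below assumes the 50-row test table from the question, with ids 1 to 50; under that assumption both SELECTs should list the same nine ids. The final EXPLAIN is how a plan like the one above would be obtained, though the plan may differ between MySQL versions.

```sql
-- Original form: two IN subqueries combined with OR.
SELECT id
FROM   test
WHERE  id IN (SELECT id FROM test WHERE id < 5)
    OR id IN (SELECT id FROM test WHERE id > 45)
ORDER  BY id;

-- Join rewrite: should return the same ids.
SELECT DISTINCT t1.id
FROM   test t1
  JOIN test t2 USING (id)
  JOIN test t3 USING (id)
WHERE  t2.id < 5
    OR t3.id > 45
ORDER  BY t1.id;

-- Inspect the plan chosen for the join rewrite.
EXPLAIN SELECT DISTINCT t1.*
FROM   test t1
  JOIN test t2 USING (id)
  JOIN test t3 USING (id)
WHERE  t2.id < 5
    OR t3.id > 45;
```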

Licensed under: CC-BY-SA with attribution