MySQL subquery slows down drastically, but they work fine independently
-
16-10-2019 - |
Question
Query 1:
select distinct email from mybigtable where account_id=345
takes 0.1s
Query 2:
Select count(*) as total from mybigtable where account_id=123 and email IN (<include all from above result>)
takes 0.2s
Query 3:
Select count(*) as total from mybigtable where account_id=123 and email IN (select distinct email from mybigtable where account_id=345)
takes 22 minutes and 90% its in the "preparing" state. Why does this take so much time.
Table is innodb with 3.2mil rows on MySQL 5.0
Solution
In Query 3, you are basically executing a subquery for every row of mybigtable against itself.
To avoid this, you need to make two major changes:
MAJOR CHANGE #1 : Refactor the Query
Here is your original query
Select count(*) as total from mybigtable
where account_id=123 and email IN
(select distinct email from mybigtable where account_id=345)
You could try
select count(*) EmailCount from
(
select tbl123.email from
(select email from mybigtable where account_id=123) tbl123
INNER JOIN
(select distinct email from mybigtable where account_id=345) tbl345
using (email)
) A;
or maybe the count per email
select email,count(*) EmailCount from
(
select tbl123.email from
(select email from mybigtable where account_id=123) tbl123
INNER JOIN
(select distinct email from mybigtable where account_id=345) tbl345
using (email)
) A group by email;
MAJOR CHANGE #2 : Proper Indexing
I think you have this already since Query 1 and Query 2 run fast. Make sure you have a compound index on (account_id,email). Do SHOW CREATE TABLE mybigtable\G
and make sure you have one. If you don't have it or if you are not sure, then create the index anyway:
ALTER TABLE mybigtable ADD INDEX account_id_email_ndx (account_id,email);
UPDATE 2012-03-07 13:26 EST
If you want to do a NOT IN(), change the INNER JOIN
to a LEFT JOIN
and check for the right side being NULL, like this:
select count(*) EmailCount from
(
select tbl123.email from
(select email from mybigtable where account_id=123) tbl123
LEFT JOIN
(select distinct email from mybigtable where account_id=345) tbl345
using (email)
WHERE tbl345.email IS NULL
) A;
UPDATE 2012-03-07 14:13 EST
Please read these two links on doing JOINs
Here is a great YouTube Video where I learned to refactor queries and the book it was based on
OTHER TIPS
In MySQL, subselects within the IN clause are re-executed for every row in the outer query, thus creating O(n^2). The short story is, don't use IN (SELECT).
Do you have an index on account_id?
The second problem may be with the nested sub-queries which have terrible performance in 5.0.
GROUP BY with a having clause is faster than DISTINCT.
What are you trying to do which may be better done through joins in addition to Item #3?
There is a lot of processing involved when handling an IN() subquery such as yours. You can read more about it here.
My first suggestion would be to attempt to re-write the subquery into a JOIN instead. Something like (not tested):
SELECT COUNT(*) AS total FROM mybigtable AS t1
INNER JOIN
(SELECT DISTINCT email FROM mybigtable WHERE account_id=345) AS t2
ON t2.email=t1.email
WHERE account_id=123