Question

Query 1:

select distinct email from mybigtable where account_id=345

takes 0.1s

Query 2:

Select count(*) as total from mybigtable where account_id=123 and email IN (<include all from above result>)

takes 0.2s

Query 3:

Select count(*) as total from mybigtable where account_id=123 and email IN (select distinct email from mybigtable where account_id=345)

takes 22 minutes and 90% its in the "preparing" state. Why does this take so much time.

Table is innodb with 3.2mil rows on MySQL 5.0

Was it helpful?

Solution

In Query 3, you are basically executing a subquery for every row of mybigtable against itself.

To avoid this, you need to make two major changes:

MAJOR CHANGE #1 : Refactor the Query

Here is your original query

Select count(*) as total from mybigtable
where account_id=123 and email IN
(select distinct email from mybigtable where account_id=345)

You could try

select count(*) EmailCount from
(
    select tbl123.email from
    (select email from mybigtable where account_id=123) tbl123
    INNER JOIN
    (select distinct email from mybigtable where account_id=345) tbl345
    using (email)
) A;

or maybe the count per email

select email,count(*) EmailCount from
(
    select tbl123.email from
    (select email from mybigtable where account_id=123) tbl123
    INNER JOIN
    (select distinct email from mybigtable where account_id=345) tbl345
    using (email)
) A group by email;

MAJOR CHANGE #2 : Proper Indexing

I think you have this already since Query 1 and Query 2 run fast. Make sure you have a compound index on (account_id,email). Do SHOW CREATE TABLE mybigtable\G and make sure you have one. If you don't have it or if you are not sure, then create the index anyway:

ALTER TABLE mybigtable ADD INDEX account_id_email_ndx (account_id,email);

UPDATE 2012-03-07 13:26 EST

If you want to do a NOT IN(), change the INNER JOIN to a LEFT JOIN and check for the right side being NULL, like this:

select count(*) EmailCount from
(
    select tbl123.email from
    (select email from mybigtable where account_id=123) tbl123
    LEFT JOIN
    (select distinct email from mybigtable where account_id=345) tbl345
    using (email)
    WHERE tbl345.email IS NULL
) A;

UPDATE 2012-03-07 14:13 EST

Please read these two links on doing JOINs

Here is a great YouTube Video where I learned to refactor queries and the book it was based on

OTHER TIPS

In MySQL, subselects within the IN clause are re-executed for every row in the outer query, thus creating O(n^2). The short story is, don't use IN (SELECT).

  1. Do you have an index on account_id?

  2. The second problem may be with the nested sub-queries which have terrible performance in 5.0.

  3. GROUP BY with a having clause is faster than DISTINCT.

  4. What are you trying to do which may be better done through joins in addition to Item #3?

There is a lot of processing involved when handling an IN() subquery such as yours. You can read more about it here.

My first suggestion would be to attempt to re-write the subquery into a JOIN instead. Something like (not tested):

SELECT COUNT(*) AS total FROM mybigtable AS t1
 INNER JOIN 
   (SELECT DISTINCT email FROM mybigtable WHERE account_id=345) AS t2 
   ON t2.email=t1.email
WHERE account_id=123
Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top