Question

I have a somewhat complex query that involves roughly 100K rows.

The query runs in 13 seconds in SQL Server Express (run on my dev box).

The same query, with the same tables and indexing, takes 15+ minutes to run on MySQL 5.1 (run on my production box, which is much more powerful and was tested with 100% of its resources available). Sometimes the query even crashes the machine with an out-of-memory error.

What am I doing wrong in MySQL? Why does it take so long?

select e8.*
from table_a e8
inner join (
    select max(e6.id) as id, e6.category, e6.entity, e6.service_date
    from (
        select e4.* 
        from table_a e4
        inner join (
            select max(e2.id) as id, e3.rank, e2.entity, e2.provider_id, e2.service_date
            from table_a e2
            inner join (
                select min(e1.rank) as rank, e1.entity, e1.provider_id, e1.service_date
                from table_a e1
                where e1.site_id is not null
                group by e1.entity, e1.provider_id, e1.service_date 
            ) as e3
            on e2.rank= e3.rank
            and e2.entity = e3.entity
            and e2.provider_id = e3.provider_id
            and e2.service_date = e3.service_date
            and e2.rank= e3.rank
            group by e2.entity, e2.provider_id, e2.service_date, e3.rank
        ) e5
        on e4.id = e5.id
        and e4.rank= e5.rank                            
    ) e6
    group by e6.category, e6.entity, e6.service_date 
) e7
on e8.id = e7.id and e7.category = e8.category

Solution

I originally attempted to post this answer to your (since deleted) question, which did not indicate that the problem was with MySQL. I would still go ahead and use SQL Server to refactor the query using CTEs, and then convert back to nested queries (if any remain). Sorry about the formatting; Jeff Atwood sent me the original posted text and I had to reformat it again.

It's hard to do this without data, expected results and good names, but I would convert all the nested queries into CTEs, stack them up, name them meaningfully and refactor, starting by excluding the columns you aren't using. Removing the columns is not going to produce the improvement by itself, because the optimizer is pretty smart, but it WILL give you the ability to improve your query, probably by factoring out some or all of the CTEs. I'm not sure what your code is doing, but you may find the newer RANK()-type window functions useful, because it appears you are using a seek-back type of pattern with all these self-joins. A sketch of that idea follows.
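
For example, here is a rough sketch of that window-function approach, assuming the intent of the e3/e5 pair is "lowest rank per (entity, provider_id, service_date), highest id among ties" and assuming SQL Server 2005 or later. It is not an exact drop-in for your nested query (your e5 joins back to table_a without the site_id filter), so verify it against your expected results:

WITH ranked AS (
    -- number the rows within each (entity, provider_id, service_date) group:
    -- lowest rank first, highest id breaking ties
    select e.*,
           row_number() over (
               partition by e.entity, e.provider_id, e.service_date
               order by e.rank asc, e.id desc
           ) as rn
    from table_a e
    where e.site_id is not null
)
select *
from ranked
where rn = 1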

So start from here instead. I've looked at e7 improvements for you. The unused columns in e7 may indicate either a defect or incomplete thinking about the grouping possibilities, but if those columns are truly unnecessary, this may trickle all the way back through your logic in e6, e5 and e3. If the grouping in e7 is correct, then you can eliminate everything except MAX(id) from the results and the join. I cannot see why you would have multiple MAX(id) values per category, because that would multiply your results when you join, so MAX(id) must be unique within the category, in which case the category is redundant in the join.

WITH e3 AS (
    select min(e1.rank) as rank,
           e1.entity,
           e1.provider_id,
           e1.service_date
    from table_a e1
    where e1.site_id is not null
    group by e1.entity, e1.provider_id, e1.service_date
)

,e5 AS (
    select max(e2.id) as id,
           e3.rank,
           e2.entity,
           e2.provider_id,
           e2.service_date
    from table_a e2
    inner join e3
        on e2.rank = e3.rank
        and e2.entity = e3.entity
        and e2.provider_id = e3.provider_id
        and e2.service_date = e3.service_date
    group by e2.entity, e2.provider_id, e2.service_date, e3.rank
)

,e6 AS (
    select e4.* -- switch from * to only the columns you are actually using
    from table_a e4
    inner join e5
        on e4.id = e5.id
        and e4.rank = e5.rank
)

,e7 AS (
    select max(e6.id) as id, e6.category -- unused: e6.entity, e6.service_date
    from e6
    group by e6.category, e6.entity, e6.service_date
    -- This instead:
    -- select max(e6.id) as id
    -- from e6
    -- group by e6.category, e6.entity, e6.service_date
)

select e8.*
from table_a e8
inner join e7
    on e8.id = e7.id
    and e7.category = e8.category
-- THIS INSTEAD: on e8.id = e7.id
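
If you end up taking the refactored version back to MySQL 5.1 (which has no CTE support), each remaining CTE simply becomes a derived table again. For example, e5 above converts straight back to the shape you already have in the middle of your original query:

select max(e2.id) as id, e3.rank, e2.entity, e2.provider_id, e2.service_date
from table_a e2
inner join (
    select min(e1.rank) as rank, e1.entity, e1.provider_id, e1.service_date
    from table_a e1
    where e1.site_id is not null
    group by e1.entity, e1.provider_id, e1.service_date
) as e3
    on e2.rank = e3.rank
    and e2.entity = e3.entity
    and e2.provider_id = e3.provider_id
    and e2.service_date = e3.service_date
group by e2.entity, e2.provider_id, e2.service_date, e3.rank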

OTHER TIPS

100,000 rows shouldn't take 13 seconds if efficient indexes were available. I suspect the difference is due to the fact that SQL Server has a much more robust query optimizer than MySQL; what MySQL has is more on the order of an SQL parser than an optimizer.
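
Without seeing the schema this is only a guess, but a composite index covering the grouping and join columns would at least let MySQL resolve the inner GROUP BYs from the index instead of filesorting the whole table:

-- hypothetical index; adjust the column order to match your real schema and queries
create index ix_table_a_grouping
    on table_a (entity, provider_id, service_date, rank, id);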

You'll need to provide a lot more information - full schemas of all participating tables, and full list of indexes on each, for starters.
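
On the MySQL side, that information comes straight out of:

show create table table_a;
show index from table_a;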

Then some idea of what the data is about, and what the query is intended to produce. Something on the order of a Use Case.

It'd be interesting to look at the execution plans on both (EXPLAIN on MySQL, the showplan on SQL Server) to see what the differences are. I'm not sure if it's an apples-and-oranges comparison, but I'd be curious.
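
For instance, on MySQL 5.1 you just prefix the statement with EXPLAIN to see the join order, key usage and row estimates; on SQL Server, SET SHOWPLAN_ALL ON (or the graphical plan in Management Studio) gives you the equivalent. Trying it on the innermost grouping alone is a quick sanity check:

-- MySQL 5.1: show key usage and row estimates for the innermost grouping
explain
select min(e1.rank) as rank, e1.entity, e1.provider_id, e1.service_date
from table_a e1
where e1.site_id is not null
group by e1.entity, e1.provider_id, e1.service_date;

-- SQL Server: return the estimated plan as text for subsequent statements
set showplan_all on;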

I don't know if this can help, but this was the first hit on a search for "mysql query optimizer".

Here's another one that might be worthwhile.

The only open source database that I know of that has CTEs is Firebird (http://www.firebirdsql.org/rlsnotesh/rlsnotes210.html#rnfb210-cte).

PostgreSQL will have them in 8.4, I think.

Licensed under: CC-BY-SA with attribution