MySQL; LEFT OUTER joining on DATEDIFF with indexes taking 300+ seconds (but only the first time)

https://stackoverflow.com/questions/16018131

04-04-2022
|

Domanda

MySQL's built in cache really makes this question moot for most of the day, but the very first time the following query is run, the performance is terrible: Taking over 300 seconds the first time whereas subsequent querying can complete in milliseconds. Running this with SQL_NO_CACHE then takes 2-4 seconds (!) which is very acceptable in this instance -- but the initial run-time is not.

SELECT DATEDIFF( bt.`datetime`, st.`datetime`) AS 'day_separation'
FROM `smallerTable` AS st
LEFT OUTER JOIN `bigTable` AS bt ON bt.item_id = st.item_id
  AND bt.code = 'X'
  AND bt.`datetime` > st.`datetime`
  AND DATEDIFF ( bt.datetime, st.datetime) < 11
  AND st.`datetime` > '2012-07-01' AND st.`datetime` < 'yesterdays-date 23:59:59'

I have introduced multi-column indexes (thanks to this question) but it still could not address this particular problem. This solution looks inspired but I don't think it is applicable since I'm not sure how I could union these results.

The smaller table has ~8000 records and I want to count / include all of them right now. It will eventually grow bigger and contain items prior to 2012-07-01.

bigTable has 10 million records and I only want to match the "pairing" of those records to the smaller table. Part of the trouble is that they cannot share a direct key or reference linking them together so I am left with a LEFT OUTER JOIN and guessing that if the timestamp of the two events are < 11 days apart (and share the other conditions) that they they must be related.

Excluding the test DATEDIFF ( bt.datetime, st.datetime) < 11 created 14k 'results' illustrating that the number of DATEDIFF calculations that "need to occur" is 14k-8k (a.k.a. 6k).

INDEXES: the datetime fields of each table, the code, and the item_ids.

I have compound indexes on both tables in the order of (item_id, datetime). From my understanding, that is the necessary order because we use the datetime fields in the select statement in the form of DATEDIFF( bt.datetime, st.datetime).

Would a combined index on (code, item_id, datetime) revolutionize this query? (Yes it did!)

The explain reveals little to my untrained eye other than that it is using a temporary table which I understand can be time-consuming.

id * select_type * table * type  * possible_keys * key                * key_len * ref           * rows * extra
1  * SIMPLE      * st    * index * NULL          * items_for_datetime * 59      * NULL          * 8295 * using index; using temporary; using filesort
1  * SIMPLE      * BT    * ref   * [many]        * items_for_datetime * 51      * master.st.item_id * 3    *

Depending on MySQL's whims, bigTable sometimes shows that it prefers the item_id key over items_for_datetime. Should I encourage the use of my joint index believing that I know better?

Some extra info:

The inserts into these tables occur once each day (1~5k records into BT)
No updates or deletions ever occur
I could probably run two queries -- Change this one to INNER JOIN and then run a second one to subtract the number of results from the total records to find the number that didn't have a corresponding result in BT
We have already executed phpmyadmin's Check Table, Defragmentation, and Optimize Table on BT

[aside] Could this be a good scenario for using a NoSQL database such as Mongo?

Why is there be such a disparity on the first run and the second? More importantly: What can be done to improve the timing of the first run?

Update: New attempts require a new day to find out their efficacy. Tomorrow I will attempt Barmar's suggestion using BETWEEN and DATE_ADD. I have also created a combined index on (code, item_id, datetime). I will report back tomorrow the result but welcome any other ideas.

Update: Success! The first run of the query now only spent 6 seconds which is amazing considering where it came from. Subsequent querying took only .035 seconds! What a dream. The combined index on (code, item_id, datetime) no doubt had a hand in this success. Here is the new query: Thanks everyone!

SELECT DATEDIFF( bt.`datetime`, st.`datetime` ) AS  'day_separation'
FROM  `smallerTable` AS st
LEFT OUTER JOIN bigTable AS bt USE INDEX (  `cmd_item_time` ) 
ON bt.item_id = st.item_id
  AND bt.code =  'X'
  AND bt.`datetime` BETWEEN st.`datetime` AND DATE_ADD( st.`datetime`, INTERVAL 10 DAY ) 
  AND st.datetime BETWEEN '2012-07-01' AND  'yesterdays-date 23:59:59'

Soluzione

Try changing:

AND bt.`datetime` > st.`datetime`
AND DATEDIFF ( bt.datetime, st.datetime) < 11

to:

AND bt.`datetime` BETWEEN st.`datetime` AND date_add(st.`datetime`, interval 11 day)

This may allow the index on bt.datetime to be used.

If code = 'X' filters out a large portion of bigTable, a compound index on (code, item_id) should help.

Altri suggerimenti

The problem of you query is most probably that line : AND st.datetime > '2012-07-01' AND st.datetime < 'yesterdays-date 23:59:59'

By casting datetime to string (in order to compare) , you're loosing the advantages of indexes...

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow