How can we optimise this OR and ORDER BY query?
Question
We have a table with the following structure:
+----+----------+-------------+---------------------+-----------+---------+----------+
| id | Batch_id | Merchant_id | ExpiryDate | Allocated | MaxUses | IsBarred |
+----+----------+-------------+---------------------+-----------+---------+----------+
| 1 | 1 | 5793 | 2019-11-03 01:23:40 | 1 | 1 | 0 |
| 2 | 1 | 1238 | 2019-11-07 01:23:40 | 0 | 1 | 0 |
| 3 | 2 | 9427 | 2019-11-09 04:49:40 | 1 | 1 | 0 |
+----+----------+-------------+---------------------+-----------+---------+----------+
In DDL terms this is:
CREATE TABLE `merchant_coupon` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ExpiryDate` datetime DEFAULT NULL,
`MaxUses` int(11) NOT NULL,
`Batch_id` int(11) DEFAULT NULL,
`Merchant_id` int(11) DEFAULT NULL,
`Allocated` int(11) NOT NULL,
`IsBarred` tinyint(1) NOT NULL
PRIMARY KEY (`id`),
KEY `IDX_8C9078EBAF5CDCE0` (`Batch_id`),
KEY `IDX_8C9078EBE10A5A83` (`Merchant_id`),
CONSTRAINT `__FK_8C9078EBAF5CDCE0` FOREIGN KEY (`Batch_id`) REFERENCES `batch` (`id`),
CONSTRAINT `__FK_8C9078EBE10A5A83` FOREIGN KEY (`Merchant_id`) REFERENCES `merchant` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10388795 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
We would like to find the row which:
- Is not Barred
- Has
Allocated < Max Uses
- Has an Expiry Date greater than a certain date
- Is a member of one or more batches
Is the soonest expiring (i.e. ExpiryDate is closest to now)
We only need 1 of the rows.
Our query at the moment looks like this:
SELECT mc.*
FROM merchant_coupon mc
WHERE 1 = 1
AND mc.IsBarred = 0
AND mc.AllocatedCount < mc.MaxUses
AND mc.ExpiryDate > '2018-11-13 18:44:17'
AND (mc.Batch_id = 288
OR mc.Batch_id = 446
OR mc.Batch_id = 561
...
OR mc.Batch_id = 4167
OR mc.Batch_id = 5731
)
ORDER BY mc.ExpiryDate ASC
LIMIT 1
This works, but we suspect there are opportunities to improve it.
The explain looks like this:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,mc,range,"IDX_8C9078EBAF5CDCE0,IDX_8C9078EBE10A5A83",IDX_8C9078EBAF5CDCE0,5,,184481,Using where; Using filesort
Some other things to note:
- The database is MySQL
- The
IDX_8C9078EBAF5CDCE0
is on theBatch_id
. We also have an index on theMerchant_id
- The Batch_id normally groups the rows into groups of between 2-10,000 items
- The Batch that the Batch_id refers to also refers to the Merchant_id, so we could eliminate the
AND Merchant_id = <x>
, but we keep it as a safety check, just in case - The table contains about ~10 million rows, and is growing in the order of ~20,000-60,000 a week atm.
- Once a row is selected, the
Allocated
column will be incremented - we do aSELECT ... FOR UPDATE
in fact. IsBarred
is mostly 0.- In most cases
MaxUses
is 1, but in some it can be a number in the range of 1-1000.
What indexes would most likely optimise the query? Or, what changes could we make to the query to optimise it?
Solution 3
RickJames gave me the clue to how to answer this, along with remembering that MySQL uses a BTree for indexes:
You are missing one thing -- A "composite" key. Alas, there is no 'perfect' one for this query. – Rick James
We can make one though, and so we have.
We modified our application and our database schema to create a new column.
We also exploited a fact about the ExpiryDate that I didn't realise was important in the question: the ExpiryDate is normally set at a point 7 days to 10 years in the future. It is never changed. Due to the (generally) forward moving nature of time, it will eventually be in the past and can therefore be excluded from all queries.
We therefore:
- Created a composite key based on
IsBarred
,ExpiryDate
and(MaxUses - AllocatedCount)
- Modified the value of the composite key whenever any of the above columns changed (you could use a trigger, but we've done it in application code)
- Implemented a cron job which will mark the
Composite
key as 0 for rows which are currently haveComposite
valued at 1 and which have anExpiryDate
in the past - Modified the query to use
Composite = 1
instead of queryingIsBarred
,ExpiryDate
and(MaxUses - AllocatedCount)
. - Adding a new index which uses the
Batch_id
,Composite
andExpiryDate
Thinking about the index as a BTree rather than phone book (which is how I've always thought of them, and as how Jerb referenced), helped me to realise that by making an index containing the Composite
I could help MySQL eliminate 1/2 the tree quickly.
We've just deployed to production, and initial performance suggests a nice improvement of ~100ms in the worst cases.
OTHER TIPS
I've been reading dba stackexchange for a while now, and these kind of questions come up often, and they always kind of frustrate me.
People often seem to go straight to explain plans, and or looking for special indexes like they're magic bullets - but they're really not.
What you should get into doing (everybody) is thinking these problems through from a logical standpoint - if you, as a human, wanted to search through ~10 million lines of data to achieve this, how would you best organise the data to help you do this? That's all an index is, basically, is an ordering of the data. The computer does the work a bit faster than you, that's all.
By all means use those other tools and experimentation to help you with your understanding and refinement of yours skills, but don't use it as the primary go-to. That's a crutch.
Maybe I'm a bit of a idealist, and anyone who is more than a couple of years younger than me just isn't used to manually dealing with these kind of tasks, you just type them in to google - but think of how you might use a phonebook*
- you don't open it at page 1, find Andrew Aardvark, ask youself "is this Walter Williams?" . no... next
- line 2 - Arthur Abraham - is that Walter Williams? ... no.
- line 3 - Alex Acer - Walter Williams? no
- And so on.
No, you use an algorithm.
- these were big fat books, made of paper, with everyone's phone number in. We had them in the 20th century.
So anyway, here's how I always approach your problem.
[1] Continue thinking of this problem like a phonebook.
[2] What columns could I sort the book by to help me search the data? - you're right in that it's those columns affecting the selection - columns in the WHERE clause, and columns in the ORDER BY
[3] So systematically... column by column
[4] mc.IsBarred - I didn't get as far reading your notes (IsBarred is mostly 0) before I disounted this. The name gave me the clue - this is presumably effectively a binary flag - Low cardinality data is generally not a good sorting column
Think of the phonebook. If we sorted the phonebook not by name, but by hair colour - even if we knew Walter Williams had red hair, and could home straight in on the 10% of the book containing just those guys (me included) it'd still take a while to search them one by one.
[5] mc.AllocatedCount or mc.MaxUses - no. The problem here comes from your comparing one column against the other. There's no ordering you could apply to the list by either column, or one then the other, which could zero you in quickly on to a small subset of data.
[6] mc.BatchID - this doesn' appear to be an unreasonable candidate. Although your code doeesn't actually show exactly how many values you're selecting from, or the cardinality of the data.
If you could search through just those sections of the phonebook relating to batch 446, and then those for batch 561, etc. That would certainly narrow the amount of searching you did - but you would still have to search through and compare every single row in that set, and then find the one which matches the rest of the clause, AND has the earliest date.
OK, we'll pencil it that column as a possible...
[7] mc.ExpiryDate - appears in both your where clause AND your order by ...
interesting...
If you sorted your phonebook by expiry date, you could turn to the page corresponding to 18:44 on 2018-11-13, and find the first record. If it matched all of your where criteria - you would have to look no further. Since it is by definition the earliest expiring date after that date - you know this is the record you're looking for.
If it's not, you have to look at the second row, then rinse and repeat. AS soon as you find a matching row, you know it's got the earliest expiry date and can stop.
This is the only potential indexing which can find you your single row in one attempt.
mc.ExpiryDate IS YOUR CANDIDATE FOR INDEXING
I have a few caveats.
[A] I'm not overly familiar with MySQL. Exactly how the query engine will interpret your code telling it to ORDER BY and then LIMIT, I don't know for sure. As a human I can think this through as detailed above - but whether MySQL will attempt to fetch all matching rows before finding earliest I can't say. I sesiously doubt it'd be that silly, but have no firm evidence.
[B] If you're adding 60,000 rows of data to the table in a week - that's 60,000 operations which will run marginally slower each time, as they have to maintain an extra index. Think through the implications of that.
There are always thade-offs with these things...
[C] If your where clause still filters out many many rows, the query may still have to search through many records. It depends a lot on the actual data. But that's where your own knowledge is key in this kind of situation. Use that!
Add these; the Optimizer will pick the one that seems best at the time:
INDEX(IsBarred, ExpiryDate),
INDEX(IsBarred, Batch_id, ExpiryDate),
The order of the columns is important -- =
, then IN
, then range. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql .