Question

We have a table with the following structure:

+----+----------+-------------+---------------------+-----------+---------+----------+
| id | Batch_id | Merchant_id |     ExpiryDate      | Allocated | MaxUses | IsBarred |
+----+----------+-------------+---------------------+-----------+---------+----------+
|  1 |        1 |        5793 | 2019-11-03 01:23:40 |         1 |       1 |        0 |
|  2 |        1 |        1238 | 2019-11-07 01:23:40 |         0 |       1 |        0 |
|  3 |        2 |        9427 | 2019-11-09 04:49:40 |         1 |       1 |        0 |
+----+----------+-------------+---------------------+-----------+---------+----------+

In DDL terms this is:

CREATE TABLE `merchant_coupon` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ExpiryDate` datetime DEFAULT NULL,
  `MaxUses` int(11) NOT NULL,
  `Batch_id` int(11) DEFAULT NULL,
  `Merchant_id` int(11) DEFAULT NULL,
  `Allocated` int(11) NOT NULL,
  `IsBarred` tinyint(1) NOT NULL
  PRIMARY KEY (`id`),
  KEY `IDX_8C9078EBAF5CDCE0` (`Batch_id`),
  KEY `IDX_8C9078EBE10A5A83` (`Merchant_id`),
  CONSTRAINT `__FK_8C9078EBAF5CDCE0` FOREIGN KEY (`Batch_id`) REFERENCES `batch` (`id`),
  CONSTRAINT `__FK_8C9078EBE10A5A83` FOREIGN KEY (`Merchant_id`) REFERENCES `merchant` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10388795 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

We would like to find the row which:

  • Is not Barred
  • Has Allocated < Max Uses
  • Has an Expiry Date greater than a certain date
  • Is a member of one or more batches
  • Is the soonest expiring (i.e. ExpiryDate is closest to now)

  • We only need 1 of the rows.

Our query at the moment looks like this:

SELECT mc.* FROM merchant_coupon mc WHERE 1 = 1 AND mc.IsBarred = 0 AND mc.AllocatedCount < mc.MaxUses AND mc.ExpiryDate > '2018-11-13 18:44:17' AND (mc.Batch_id = 288 OR mc.Batch_id = 446 OR mc.Batch_id = 561 ... OR mc.Batch_id = 4167 OR mc.Batch_id = 5731 ) ORDER BY mc.ExpiryDate ASC LIMIT 1

This works, but we suspect there are opportunities to improve it.

The explain looks like this:

id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,mc,range,"IDX_8C9078EBAF5CDCE0,IDX_8C9078EBE10A5A83",IDX_8C9078EBAF5CDCE0,5,,184481,Using where; Using filesort

Some other things to note:

  • The database is MySQL
  • The IDX_8C9078EBAF5CDCE0 is on the Batch_id. We also have an index on the Merchant_id
  • The Batch_id normally groups the rows into groups of between 2-10,000 items
  • The Batch that the Batch_id refers to also refers to the Merchant_id, so we could eliminate the AND Merchant_id = <x>, but we keep it as a safety check, just in case
  • The table contains about ~10 million rows, and is growing in the order of ~20,000-60,000 a week atm.
  • Once a row is selected, the Allocated column will be incremented - we do a SELECT ... FOR UPDATE in fact.
  • IsBarred is mostly 0.
  • In most cases MaxUses is 1, but in some it can be a number in the range of 1-1000.

What indexes would most likely optimise the query? Or, what changes could we make to the query to optimise it?

Was it helpful?

Solution 3

RickJames gave me the clue to how to answer this, along with remembering that MySQL uses a BTree for indexes:

You are missing one thing -- A "composite" key. Alas, there is no 'perfect' one for this query. – Rick James

We can make one though, and so we have.

We modified our application and our database schema to create a new column.

We also exploited a fact about the ExpiryDate that I didn't realise was important in the question: the ExpiryDate is normally set at a point 7 days to 10 years in the future. It is never changed. Due to the (generally) forward moving nature of time, it will eventually be in the past and can therefore be excluded from all queries.

We therefore:

  • Created a composite key based on IsBarred, ExpiryDate and (MaxUses - AllocatedCount)
  • Modified the value of the composite key whenever any of the above columns changed (you could use a trigger, but we've done it in application code)
  • Implemented a cron job which will mark the Composite key as 0 for rows which are currently have Composite valued at 1 and which have an ExpiryDate in the past
  • Modified the query to use Composite = 1 instead of querying IsBarred, ExpiryDate and (MaxUses - AllocatedCount).
  • Adding a new index which uses the Batch_id, Composite and ExpiryDate

Thinking about the index as a BTree rather than phone book (which is how I've always thought of them, and as how Jerb referenced), helped me to realise that by making an index containing the Composite I could help MySQL eliminate 1/2 the tree quickly.

We've just deployed to production, and initial performance suggests a nice improvement of ~100ms in the worst cases.

OTHER TIPS

I've been reading dba stackexchange for a while now, and these kind of questions come up often, and they always kind of frustrate me.

People often seem to go straight to explain plans, and or looking for special indexes like they're magic bullets - but they're really not.

What you should get into doing (everybody) is thinking these problems through from a logical standpoint - if you, as a human, wanted to search through ~10 million lines of data to achieve this, how would you best organise the data to help you do this? That's all an index is, basically, is an ordering of the data. The computer does the work a bit faster than you, that's all.

By all means use those other tools and experimentation to help you with your understanding and refinement of yours skills, but don't use it as the primary go-to. That's a crutch.

Maybe I'm a bit of a idealist, and anyone who is more than a couple of years younger than me just isn't used to manually dealing with these kind of tasks, you just type them in to google - but think of how you might use a phonebook*

  • you don't open it at page 1, find Andrew Aardvark, ask youself "is this Walter Williams?" . no... next
  • line 2 - Arthur Abraham - is that Walter Williams? ... no.
  • line 3 - Alex Acer - Walter Williams? no
  • And so on.

No, you use an algorithm.

  • these were big fat books, made of paper, with everyone's phone number in. We had them in the 20th century.

So anyway, here's how I always approach your problem.

[1] Continue thinking of this problem like a phonebook.

[2] What columns could I sort the book by to help me search the data? - you're right in that it's those columns affecting the selection - columns in the WHERE clause, and columns in the ORDER BY

[3] So systematically... column by column

[4] mc.IsBarred - I didn't get as far reading your notes (IsBarred is mostly 0) before I disounted this. The name gave me the clue - this is presumably effectively a binary flag - Low cardinality data is generally not a good sorting column

Think of the phonebook. If we sorted the phonebook not by name, but by hair colour - even if we knew Walter Williams had red hair, and could home straight in on the 10% of the book containing just those guys (me included) it'd still take a while to search them one by one.

[5] mc.AllocatedCount or mc.MaxUses - no. The problem here comes from your comparing one column against the other. There's no ordering you could apply to the list by either column, or one then the other, which could zero you in quickly on to a small subset of data.

[6] mc.BatchID - this doesn' appear to be an unreasonable candidate. Although your code doeesn't actually show exactly how many values you're selecting from, or the cardinality of the data.

If you could search through just those sections of the phonebook relating to batch 446, and then those for batch 561, etc. That would certainly narrow the amount of searching you did - but you would still have to search through and compare every single row in that set, and then find the one which matches the rest of the clause, AND has the earliest date.

OK, we'll pencil it that column as a possible...

[7] mc.ExpiryDate - appears in both your where clause AND your order by ...

interesting...

If you sorted your phonebook by expiry date, you could turn to the page corresponding to 18:44 on 2018-11-13, and find the first record. If it matched all of your where criteria - you would have to look no further. Since it is by definition the earliest expiring date after that date - you know this is the record you're looking for.

If it's not, you have to look at the second row, then rinse and repeat. AS soon as you find a matching row, you know it's got the earliest expiry date and can stop.

This is the only potential indexing which can find you your single row in one attempt.

mc.ExpiryDate IS YOUR CANDIDATE FOR INDEXING

I have a few caveats.

[A] I'm not overly familiar with MySQL. Exactly how the query engine will interpret your code telling it to ORDER BY and then LIMIT, I don't know for sure. As a human I can think this through as detailed above - but whether MySQL will attempt to fetch all matching rows before finding earliest I can't say. I sesiously doubt it'd be that silly, but have no firm evidence.

[B] If you're adding 60,000 rows of data to the table in a week - that's 60,000 operations which will run marginally slower each time, as they have to maintain an extra index. Think through the implications of that.

There are always thade-offs with these things...

[C] If your where clause still filters out many many rows, the query may still have to search through many records. It depends a lot on the actual data. But that's where your own knowledge is key in this kind of situation. Use that!

Add these; the Optimizer will pick the one that seems best at the time:

INDEX(IsBarred, ExpiryDate),
INDEX(IsBarred, Batch_id, ExpiryDate),

The order of the columns is important -- =, then IN, then range. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql .

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top