Question

I have a MySQL database with InnoDB as the storage engine, and I have a number of queries that take the basic form:

SELECT bd.billing,
  SUM(CASE WHEN tc.transaction_class = 'c'  THEN bd.amount ELSE 0 END) AS charges,
  SUM(CASE WHEN tc.transaction_class = 'a' THEN bd.amount ELSE 0 END) AS adjustments,
  SUM(CASE WHEN tc.transaction_class = 'p' THEN bd.amount ELSE 0 END) AS payments,
  SUM(bd.amount) AS balance_this_month
FROM billing_details bd
JOIN transaction_classes tc ON tc.transaction_code = bd.transaction_code
WHERE bd.entry_date BETWEEN '2013-06-04' AND '2013-07-01'
GROUP BY billing;

I am trying to work out the best strategy for indexing the columns for queries that take this form. Before I started, there were only indexes on single columns, and an explain revealed that 1.5M rows were being read (for, as you can see here, what is only a month's worth of data).

My first attempt got this number down to ~300,000, which was achieved by indexing (entry_date, billing, transaction_code). After doing some more reading (in particular High Performance MySQL) I decided that having entry_date (typically a range expression) as my left-most column was not optimal, so I tried (billing, transaction_code, entry_date), and EXPLAIN revealed something more like 400,000–500,000 rows. Still an improvement over the first number, but as I dig deeper, I've come to wonder:

What could I reasonably expect from an optimal index for a query of this kind? I am guessing that since I am performing an aggregate function, it is always going to build a temp table and do a filesort ... or is it? The more I read, the more confused I get. My instinct was to use entry_date as the leftmost column, since it is the only stipulation in my where clause. More research led me to believe I should put it right-most, since I am querying a range of dates. But then what I've read only really talks about the where clause - which only has entry_date: what about a sum/case query such as this? And could I add amount to this index in a way that is beneficial, or am I going to be stuck with what I have unless I redesign the schema/query?


The solution

From your query, it's not clear which table the unqualified columns (e.g. entry_date) refer to. (Best practice is to qualify ALL column references in a query, for the benefit of readers, and to future-proof your query against an "ambiguous column" error when columns of the same name are added to other tables in the query.)

I'm going to assume that the unqualified columns are from the billing_details table.

The most likely candidates for covering indexes are:

... ON billing_details (entry_date, billing, transaction_code, amount)

... ON transaction_classes (transaction_code, transaction_class)
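As a concrete sketch of those two definitions (the index names are arbitrary placeholders, not anything from your schema):

```sql
-- Hypothetical index names; the column lists match the covering indexes above.
CREATE INDEX bd_entry_billing_code_amount
  ON billing_details (entry_date, billing, transaction_code, amount);

CREATE INDEX tc_code_class
  ON transaction_classes (transaction_code, transaction_class);
```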

An EXPLAIN should show "Using index" in the Extra column for both table accesses. (If the transaction_classes table is small enough, an index on it may not matter at all.)

A "covering index" means that the query can be satisfied entirely from the index, without a need to reference the pages of the underlying table.

Optimizing Queries with EXPLAIN http://dev.mysql.com/doc/refman/5.5/en/using-explain.html
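To check this, prefix the original query with EXPLAIN and inspect the Extra column of each row in the output:

```sql
EXPLAIN
SELECT bd.billing,
  SUM(CASE WHEN tc.transaction_class = 'c' THEN bd.amount ELSE 0 END) AS charges,
  SUM(CASE WHEN tc.transaction_class = 'a' THEN bd.amount ELSE 0 END) AS adjustments,
  SUM(CASE WHEN tc.transaction_class = 'p' THEN bd.amount ELSE 0 END) AS payments,
  SUM(bd.amount) AS balance_this_month
FROM billing_details bd
JOIN transaction_classes tc ON tc.transaction_code = bd.transaction_code
WHERE bd.entry_date BETWEEN '2013-06-04' AND '2013-07-01'
GROUP BY bd.billing;
```

With the covering indexes in place, you would hope to see "Using index" reported for both table accesses.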

The strategy here is to get the columns in the predicate first in the index, so an index range scan operation can be performed. I think the order of the other columns is less critical. Having the billing column next may help MySQL with the GROUP BY, but I think testing will reveal that it doesn't matter.

The JOIN operation may benefit from an index on the columns in the join predicate, in this case on the smaller transaction_classes "lookup" table. If, however, the inner join is actually filtering out rows from the billing_details table (rows that don't have a matching value in the transaction_classes table), then we might consider it a filtering predicate and index accordingly. I suspect, however, that there is a foreign key relationship, and that this column is NOT NULL in the billing_details table, such that every row in billing_details has a matching row in transaction_classes.

If a majority of the rows in the billing_details table are being accessed, it may be beneficial to have the columns referenced in the GROUP BY first, rather than the columns in the predicate, for example:

... ON billing_details (billing, entry_date, transaction_code, amount)

In this case, MySQL may be able to avoid a "Using filesort" operation to get the rows grouped together. Again, I don't think the order of the other columns after that one matters. In this case, it's going to be a full index scan rather than a range scan: every row from the index will need to have its entry_date checked, to determine whether it is included or not.

If the predicate on entry_date returns a small percentage (for example, less than 10%) of the rows, an access plan using an index with that column first is likely going to perform better.


Summary

In terms of performance for this query, getting an index on the predicate can significantly reduce the amount of work required to identify rows to be included, without visiting every row.

The next "big rock" is the GROUP BY. If you were accessing every row in the table (with no predicate at all), then the best index is on the columns in the GROUP BY clause. Because the values are ordered by this column, MySQL can avoid having to perform a sort operation, which can be expensive on large sets.

Aside from an appropriate index on billing_details table, the next best thing you could do is eliminate the join to the transaction_classes table, and use just the value in transaction_code column.
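As a sketch of that rewrite, assuming you know which transaction_code values map to each class (the code values below are hypothetical placeholders, not your actual data):

```sql
-- Hypothetical: replace each IN (...) list with the actual codes that
-- map to the corresponding transaction_class in transaction_classes.
SELECT bd.billing,
  SUM(CASE WHEN bd.transaction_code IN ('C1','C2') THEN bd.amount ELSE 0 END) AS charges,
  SUM(CASE WHEN bd.transaction_code IN ('A1')      THEN bd.amount ELSE 0 END) AS adjustments,
  SUM(CASE WHEN bd.transaction_code IN ('P1','P2') THEN bd.amount ELSE 0 END) AS payments,
  SUM(bd.amount) AS balance_this_month
FROM billing_details bd
WHERE bd.entry_date BETWEEN '2013-06-04' AND '2013-07-01'
GROUP BY bd.billing;
```

This only works if the code-to-class mapping is stable; hard-coding the lists trades maintainability for the cost of the join.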

The processing of the conditionals in the CASE expressions isn't contributing significantly to the query time. What takes the time is accessing the values that need to be processed, and getting the rows sorted so that they can be "grouped".


Followup

The 'Using temporary; Using filesort' in the plan is due to the GROUP BY operation. MySQL used an index for the WHERE clause to whittle down the number of rows; now it has to take those rows and sort them. This is expected.

At least the 'Using index' shows that MySQL is getting the rows entirely from the index, with no access to the underlying table (and that is usually a performance boost.)

The only way to avoid the "Using filesort" for the GROUP BY (AFAIK) is an index with the column(s) referenced in the GROUP BY as leading columns.

To see if MySQL will use an index like that, you can try disabling MySQL's ability to use the index for the WHERE clause. The easiest way to do this (for testing) is to wrap the bd.entry_date column reference in the WHERE clause in a function.

Change that predicate, and try EXPLAIN with some of these variations:

WHERE DATE(bd.entry_date) BETWEEN 
WHERE DATE(bd.entry_date) + INTERVAL 0 DAY BETWEEN
WHERE DATE_FORMAT(bd.entry_date,'%Y-%m-%d') BETWEEN

Some (or all) of those should be sufficient to prevent MySQL from using the index with entry_date leading to satisfy the WHERE clause.
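For example, the first variation applied to the original query (only the WHERE clause changes; the rest of the query stays as you wrote it):

```sql
-- Wrapping the column in a function makes the predicate non-sargable,
-- so MySQL cannot use an index range scan on entry_date (for testing only).
SELECT bd.billing,
  SUM(bd.amount) AS balance_this_month
FROM billing_details bd
WHERE DATE(bd.entry_date) BETWEEN '2013-06-04' AND '2013-07-01'
GROUP BY bd.billing;
```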

With that index effectively disabled as an option, MySQL may decide to make use of the index with the billing column as the leading column, to avoid the "Using filesort" operation. (In this case, it's almost imperative that the index also include the entry_date column, because that column is going to need to be checked on every row in the table, effectively a "full scan" of all the rows.)

Again, this query plan is likely going to be more expensive for a small subset of rows. It will likely run slower, but it really needs to be tested. (If the query didn't have a WHERE clause at all and were pulling all the rows, then this type of plan would very likely be MUCH faster than performing a sort operation.)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow