Optimizing a SQL query to avoid full table scan
-
30-09-2019 - |
Pergunta
Consider the following query:
SELECT * FROM Transactions
WHERE day(Stamp - interval 3 hour) = 1;
The Stamp column in the Transactions table is a TIMESTAMP and there is an index on it. How could I change this query so it avoids full table scans? (that is, using Stamp outside of the day() function)
Thanks!
Solução
This is how I would do it:
add some extra fields: YEAR, MONTH, DAY or even HOUR, MINUTE depending on the traffic you expect. Then build a trigger to populate the extra fields, maybe subtracting the 3 hour interval in advance. Finally build some index on the extra fields.
Outras dicas
If the goal is just to avoid full table scans and you have a PRIMARY KEY (say named PK) for Transactions, consider adding covering index
ALTER TABLE Transactions ADD INDEX cover_1 (PK, Stamp)
Then
SELECT * FROM Transactions WHERE PK IN (SELECT PK FROM Transactions
WHERE day(Stamp - interval 3 hour) = 1
)
This query should not use full table scans (however optimizer may decide to use full scan, if number of rows in table is small or for whatever other statistical reason :) )
Better way may be is to use temporary table instead of subquery.
You can often rewrite the function so you have something that looks like WHERE Stamp=XXXX
and XXXX is some expression. You could create a series of BETWEEN statements for each month, WHERE Stamp BETWEEN timestamp('2010-01-01 00:00:00') AND timestamp ('2010-01-01 23:59:59') OR Stamp BETWEEN ...
, but I'm not certain this would use the index in this case. I'd build a column that was the day of the month as @petr suggests.
Calculate your desired Stamp value separately before you run your main query, i.e.
Step 1 - calculate the desired Stamp value
Step 2 - run a query where Stamp > (calculated value)
Because there's no calculation in step 2, you should be able to use your index.
If I understand it correctly, you basically want to return all rows where the stamp falls on the first in each month (having subtracted the 3 hours)? If (and this is a big if), you have a fixed window of, say the latest 6 months, you could just enumerate 6 range tests. But still, I'm not sure indexed access will be faster anyways.
select *
from transactions
where stamp between timestamp '2010-06-01 03:00:00' and timestamp '2010-06-02 02:59:59'
or stamp between timestamp '2010-07-01 03:00:00' and timestamp '2010-07-02 02:59:59'
or stamp between timestamp '2010-08-01 03:00:00' and timestamp '2010-08-02 02:59:59'
or stamp between timestamp '2010-09-01 03:00:00' and timestamp '2010-09-02 02:59:59'
or stamp between timestamp '2010-10-01 03:00:00' and timestamp '2010-10-02 02:59:59'
or stamp between timestamp '2010-11-01 03:00:00' and timestamp '2010-11-02 02:59:59'
or stamp between timestamp '2010-12-01 03:00:00' and timestamp '2010-12-02 02:59:59';
NB! I'm not sure how the millisecond part of the timestamp works. You may need to pad it accordingly.
Reworking petr's answer a bit to avoid the IN clause, and to make it for MyISAM or InnoDB.
For MyISAM
ALTER TABLE Transactions ADD INDEX cover_1 (PK, Stamp)
Or, for InnoDB, where the PK is implicitly included in every index,
ALTER TABLE Transactions ADD INDEX Stamp (Stamp)
Then
SELECT *
FROM Transactions LEFT JOIN
(
SELECT PK
FROM Transactions
WHERE DAYOFMONTH(Stamp - interval 3 hour) = 1
) a ON Transactions.PK=a.PK
The subquery will have an index only execution, and the outer query will only pull the rows from the table where a.PK came through.