Improving speed of SQL query with MAX, WHERE, and GROUP BY on three different columns

https://stackoverflow.com/questions/21439342

04-10-2022
|

Question

I am attempting to speed up a query that takes around 60 seconds to complete on a table of ~20 million rows.

For this example, the table has three columns (id, dateAdded, name). id is the primary key. The indexes I have added to the table are:

(dateAdded)
(name)
(id, name)
(id, name, dateAdded)

The query I am trying to run is:

SELECT MAX(id) as id, name 
FROM exampletable 
WHERE dateAdded <= '2014-01-20 12:00:00' 
GROUP BY name 
ORDER BY NULL;

The date is variable from query to query.

The objective of this is to get the most recent entry for each name at or before the date added.

When I use explain on the query it tells me that it is using the (id, name, dateAdded) index.

+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys    | key                                          | key_len | ref  | rows     | Extra                                                     |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | exampletable     | index | date_added_index | id_element_name_date_added_index             | 162     | NULL | 22016957 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+

Edit: Added two new indexes from comments:

(dateAdded, name, id)
(name, id)

+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
| id | select_type | table            | type  | possible_keys                                                 | key                                          | key_len | ref  | rows     | Extra                                     |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | exampletable     | index | date_added_index,date_added_name_id_index                     | id__name_date_added_index                    | 162     | NULL | 22040469 | Using where; Using index; Using temporary |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+

Edit: Added create table script.

CREATE TABLE `exampletable` (
  `id` int(10) NOT NULL auto_increment,
  `dateAdded` timestamp NULL default CURRENT_TIMESTAMP,
  `name` varchar(50) character set utf8 default '',
  PRIMARY KEY  (`id`),
  KEY `date_added_index` (`dateAdded`),
  KEY `name_index` USING BTREE (`name`),
  KEY `id_name_index` USING BTREE (`id`,`name`),
  KEY `id_name_date_added_index` USING BTREE (`id`,`dateAdded`,`name`),
  KEY `date_added_name_id_index` USING BTREE (`dateAdded`,`name`,`id`),
  KEY `name_id_index` USING BTREE (`name`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=22046064 DEFAULT CHARSET=latin1

Edit: Here is the Explain from the answer provided by HeavyE.

+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
| id | select_type | table        | type  | possible_k                                                                               | key                      | key_len | ref                                              | rows | Extra                                 |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
|  1 | PRIMARY     | <derived2>   | ALL   | NULL                                                                                     | NULL                     | NULL    | NULL                                             | 1732 | Using temporary; Using filesort       |
|  1 | PRIMARY     | example1     | ref   | date_added_index,name_index,date_added_name_id_index,name_id_index,name_date_added_index | date_added_name_id_index | 158     | maxDateByElement.dateAdded,maxDateByElement.name |    1 | Using where; Using index              |
|  2 | DERIVED     | exampletable | range | date_added_index,date_added_name_id_index                                                | name_date_added_index    | 158     | NULL                                             | 1743 | Using where; Using index for group-by |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+

Solution

There is a great Stack Overflow post on optimization of Selecting rows with the max value in a column: https://stackoverflow.com/a/7745635/633063

This seems a little messy but works very well:

SELECT example1.name, MAX(example1.id)
FROM exampletable example1
INNER JOIN (
select name, max(dateAdded) dateAdded
from exampletable
where dateAdded  <= '2014-01-20 12:00:00' 
group by name
) maxDateByElement on example1.name = maxDateByElement.name AND example1.dateAdded = maxDateByElement.dateAdded
GROUP BY name;

OTHER TIPS

why are you using index on many keys?? if your where clause contains only one column, then use that index only, put index on dateAdded and on name separately and then use in sql statement like this:

SELECT MAX(id) as id, name 
FROM exampletable 
USE INDEX (dateAdded_index) USE INDEX FOR GROUP BY (name_index) 
WHERE dateAdded <= '2014-01-20 12:00:00' 
GROUP BY name
ORDER BY NULL;

here is the link if you want to know more. Please let me know, whether it is giving some positive results or not.

IF the where command makes no difference, then its either the max(id) or the name. I would test the indexes by eliminating Max(id) completely, and see if the group by name is fast. Then I would add Min(id) to see if its any faster than Max(id). (I have seen this make a difference).

Also, you should test the order by NULL. Try Order by name desc, or Order by name asc. Clark Vera

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow