Question

I have a huge table like

CREATE TABLE IF NOT EXISTS `object_search` (
  `keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
  `object_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`keyword`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;

with around 39 million rows (using over 1 GB space) containing the indexed data for 1 million records in the object table (where object_id points at).

Now searching through this with a query like

SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2

is already significantly faster than doing a LIKE search on the composed keywords field in the object table but still takes up to 1 minute.

It's explain looks like:

+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | type | possible_keys | key     | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | search | ref  | PRIMARY       | PRIMARY | 42      | const | 345180 |   100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+

The full explain with joined object and object_color and object_locale table, while the above query is run in a subquery to avoid overhead, looks like:

+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table             | type   | possible_keys | key       | key_len | ref              | rows   | filtered | Extra                           |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
|  1 | PRIMARY     | <derived2>        | ALL    | NULL          | NULL      | NULL    | NULL             | 182544 |   100.00 | Using temporary; Using filesort |
|  1 | PRIMARY     | object_color      | eq_ref | object_id     | object_id | 4       | search.object_id |      1 |   100.00 |                                 |
|  1 | PRIMARY     | locale            | eq_ref | object_id     | object_id | 4       | search.object_id |      1 |   100.00 |                                 |
|  1 | PRIMARY     | object            | eq_ref | PRIMARY       | PRIMARY   | 4       | search.object_id |      1 |   100.00 |                                 |
|  2 | DERIVED     | search            | ref    | PRIMARY       | PRIMARY   | 42      |                  | 345180 |   100.00 | Using where; Using index        |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+

My top goal would be to be able to scan through this within 1 or 2 seconds.

So, are there further techniques to improve search speed for keywords?


Update 2013-08-06:

Applying most of Neville K's suggestion I now have the following setup:

CREATE TABLE `object_search_keyword` (
  `keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
  PRIMARY KEY (`keyword_id`),
  FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;

CREATE TABLE `object_search` (
  `keyword_id` int(10) unsigned NOT NULL,
  `object_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`keyword_id`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The new query's explain looks like this:

+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table          | type     | possible_keys      | key        | key_len | ref                       | rows    | filtered | Extra                                        |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>     | ALL      | NULL               | NULL       | NULL    | NULL                      |   24381 |   100.00 | Using temporary; Using filesort              |
|  1 | PRIMARY     | object_color   | eq_ref   | object_id          | object_id  | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  1 | PRIMARY     | object         | eq_ref   | PRIMARY            | PRIMARY    | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  1 | PRIMARY     | locale         | eq_ref   | object_id          | object_id  | 4       | object_search.object_id   |       1 |   100.00 |                                              |
|  2 | DERIVED     | <derived4>     | system   | NULL               | NULL       | NULL    | NULL                      |       1 |   100.00 |                                              |
|  2 | DERIVED     | <derived3>     | ALL      | NULL               | NULL       | NULL    | NULL                      |   24381 |   100.00 |                                              |
|  4 | DERIVED     | NULL           | NULL     | NULL               | NULL       | NULL    | NULL                      |    NULL |     NULL | No tables used                               |
|  3 | DERIVED     | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0       |                           |       1 |   100.00 | Using where; Using temporary; Using filesort |
|  3 | DERIVED     | object_search  | ref      | PRIMARY            | PRIMARY    | 4       | object_keyword.keyword_id | 2190225 |   100.00 | Using index                                  |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+

The many derives are coming from the keyword comparing subquery being nested into another subquery which does nothing but count the amount of rows returned:

SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
FROM (
    SELECT *, @rn := @rn + 1
    FROM (
        SELECT SQL_NO_CACHE search.object_id, COUNT(turbo.object_id) AS hits
        FROM object_keyword AS kwd
        INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
        WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
        GROUP BY search.object_id HAVING hits = 2
    ) AS numrowswrapper
    CROSS JOIN (SELECT @rn := 0) CONST
) AS turbo
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (search.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (search.object_id = locale.object_id)
ORDER BY timestamp_upload DESC

The above query will actually run within ~6 seconds, since it searches for two keywords. The more keywords I search for, the faster the search goes down.

Any way to further optimize this?


Update 2013-08-07

The blocking thing seems almost certainly to be the appended ORDER BY statement. Without it, the query executes in less than a second.

So, is there any way to sort the result faster? Any suggestions welcome, even hackish ones that would require post processing somewhere else.


Update 2013-08-07 later that day

Alright ladies and gentlemen, nesting the WHERE and ORDER BY statements in another layer of subquery to not let it bother with tables it doesn't need roughly doubled it's performance again:

SELECT wowrapper.*, locale.title
FROM (
    SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
    FROM (
        SELECT *, @rn := @rn + 1
        FROM (
            SELECT SQL_NO_CACHE search.media_id, COUNT(search.media_id) AS hits
            FROM object_keyword AS kwd
            INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
            WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
            GROUP BY search.media_id HAVING hits = 1
        ) AS numrowswrapper
        CROSS JOIN (SELECT @rn := 0) CONST
    ) AS search 
    INNER JOIN object AS object ON (search.object_id = object.object_id) 
    LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
    WHERE 1
    ORDER BY object.object_id DESC
) AS wowrapper 
LEFT JOIN object_locale AS locale ON (jfwrapper.object_id = locale.object_id) 
LIMIT 0,48

Searches that took 12 seconds (single keyword, ~200K results) now take 6, and a search for two keywords that took 6 seconds (60K results) now takes around 3.5 secs.

Now this is already a massive improvement, but is there any chance to push this further?


Update 2013-08-08 early that day

Undid that last nested variation of the query, since it actually slowed down other variations of it... I'm now trying some other things with different table layouts and FULLTEXT indexes using MyISAM for a dedicated search table with a combined keyword field (comma separated in a TEXT field).


Update 2013-08-08

Alright, a plain fulltext index doesnt really help.

Back to the previous setup, the only thing blocking is the ORDER BY (which resorts to using a temporary table and filesort). Without it a search is complete within less than a second!

So basically what's left of all this is:
How do I optimize the ORDER BY statement to run faster, likely by eliminating the use of the temporary table?

Was it helpful?

Solution

Full text search will be much faster than using the standard SQL string comparison features.

Secondly, if you have a high degree of redundancy in the keywords, you could consider a "many to many" implementation:

Keywords
--------
keyword_id
keyword

keyword_object
-------------
keyword_id
object_id

objects
-------
object_id
......

If this reduces the string comparison from 39 million rows to 100K rows (roughly the size of the English dictionary), you may also see a distinct improvement, as the query would only have to perform 100K string comparisons, and joining on an integer keyword_id and object_id field should be much, much faster than doing 39M string comparisons.

OTHER TIPS

The best solution for this will be a FULLTEXT search, but you will probably need a MyISAM table for that. You can setup a mirror table and update it with some events and triggers or if you have a slave replicating from your server you can change its table to MyISAM and use it for searches.

For this query the only thing I can come up with is to rewrite it as:

SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.key_word = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.key_word = 'word3'
....
WHERE s1.key_word = 'word1'

and I'm not sure it will be faster this way.

Also you will need to have an index on object_id (assuming your PK is (key_word, object_id)).

If you have seldom INSERTs and often SELECTs you could optimize your data for the reads i.e. recalculate the number of object_ids per keyword and directly store it in the database. The SELECTs would then be very fast, the INSERTs would take some seconds though,.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top