Querying and MySQL caching for a table with 18M+ rows

https://stackoverflow.com/questions/4265544

27-09-2019

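Question

As this is my first post it seems I can only post one hyperlink, so I have listed the sites I refer to at the bottom. In short, my goal is to make the database return results faster. I have tried to include as much relevant information as I could think of to help frame the questions at the bottom of the post.

Machine info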

8 processors
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
cache size      : 6144 KB
cpu cores       : 4 

top - 17:11:48 up 35 days, 22:22, 10 users,  load average: 1.35, 4.89, 7.80
Tasks: 329 total,   1 running, 328 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 87.4%id, 12.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8173980k total,  5374348k used,  2799632k free,    30148k buffers
Swap: 16777208k total,  6385312k used, 10391896k free,  2615836k cached

However, we are moving the MySQL installation to a different machine in the cluster that has 256 GB of RAM.

Table info

My MySQL table looks like:

CREATE TABLE ClusterMatches 
(
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cluster_index INT, 
    matches LONGTEXT,
    tfidf FLOAT,
    INDEX(cluster_index)   
);

It has approximately 18M rows, with 1M unique cluster_index values and 6K unique matches. The SQL query I generate in PHP looks like this.

SQL query

$sql_query="SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters.")) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";

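where $clusters contains a string of approximately 3,000 comma-separated cluster_index values. This query touches approximately 50,000 rows and takes about 15 seconds to run; when the same query is run again it takes about 1 second.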
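The query above splices $clusters directly into the SQL text. Below is a minimal sketch of the same query with one bound placeholder per id. It is written in Python against an in-memory SQLite database only so that it runs without a MySQL server. The helper name top_matches, the index name idx_cluster and the sample rows are hypothetical.

```python
import sqlite3

def top_matches(conn, cluster_ids, limit=10):
    """Return the `limit` matches with the highest summed tfidf for the given clusters."""
    if not cluster_ids:
        return []
    # One "?" per id, so the values are bound by the driver instead of
    # being concatenated into the SQL text.
    placeholders = ",".join("?" * len(cluster_ids))
    sql = (
        "SELECT matches, SUM(tfidf) AS total FROM "
        "(SELECT * FROM ClusterMatches WHERE cluster_index IN (" + placeholders + ")) "
        "AS result GROUP BY matches ORDER BY total DESC LIMIT ?"
    )
    return conn.execute(sql, [*cluster_ids, limit]).fetchall()

# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ClusterMatches ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, cluster_index INT, matches TEXT, tfidf FLOAT)"
)
conn.execute("CREATE INDEX idx_cluster ON ClusterMatches (cluster_index)")
conn.executemany(
    "INSERT INTO ClusterMatches (cluster_index, matches, tfidf) VALUES (?, ?, ?)",
    [(1, "a.jpg", 0.5), (2, "a.jpg", 0.25), (2, "b.jpg", 0.5), (3, "c.jpg", 4.0)],
)
rows = top_matches(conn, [1, 2])
```

With around 3,000 ids the placeholder list gets long. Some SQLite builds cap the number of bound variables, which is a property of this stand-in and not of MySQL.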

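Usage

The content of the table can be assumed to be static.
Low number of users
The query above is currently the only query that will be run on the table

Subquery

Based on this post, [stackoverflow: Cache/Re-Use a Subquery in MySQL][1], and the improvement in query time, I believe my subquery can be indexed.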

mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000)) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;

+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table                | type  | possible_keys | key           | key_len | ref  | rows  | Extra                           |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | ALL   | NULL          | NULL          | NULL    | NULL | 48528 | Using temporary; Using filesort | 
|  2 | DERIVED     | ClusterMatches       | range | cluster_index | cluster_index | 5       | NULL | 53689 | Using where                     | 
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+

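According to this older article, [Optimizing MySQL: Queries and Indexes][2], the bad values to see in the Extra column are "Using temporary" and "Using filesort".

MySQL configuration info

The query cache is available but effectively turned off, as its size is currently set to zero.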


mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name                   | Value                |
+---------------------------------+----------------------+
| bdb_cache_size                  | 8384512              | 
| binlog_cache_size               | 32768                | 
| expire_logs_days                | 0                    |
| have_query_cache                | YES                  | 
| flush                           | OFF                  |
| flush_time                      | 0                    |
| innodb_additional_mem_pool_size | 1048576              |
| innodb_autoextend_increment     | 8                    |
| innodb_buffer_pool_awe_mem_mb   | 0                    |
| innodb_buffer_pool_size         | 8388608              |
| join_buffer_size                | 131072               |
| key_buffer_size                 | 8384512              |
| key_cache_age_threshold         | 300                  |
| key_cache_block_size            | 1024                 |
| key_cache_division_limit        | 100                  |
| max_binlog_cache_size           | 18446744073709547520 | 
| sort_buffer_size                | 2097144              |
| table_cache                     | 64                   | 
| thread_cache_size               | 0                    | 
| query_cache_limit               | 1048576              |
| query_cache_min_res_unit        | 4096                 |
| query_cache_size                | 0                    |
| query_cache_type                | ON                   |
| query_cache_wlock_invalidate    | OFF                  |
| read_rnd_buffer_size            | 262144               |
+---------------------------------+----------------------+

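According to this article on [database performance tuning][3], I think the values I need to adjust are:

table_cache
key_buffer
sort_buffer
read_buffer_size
record_rnd_buffer (for GROUP BY and ORDER BY clauses)

Areas identified for improvement: MySQL query tweaking

Changing the data type of matches to an index, an int that points to another table. [MySQL will indeed use a dynamic row format if it contains variable-length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these data types, but rather to split off such fields into an associated table.][4]
Indexing the new match_index field so that the GROUP BY matches runs faster, based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]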
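The two changes listed above can be sketched as follows, again in Python with an in-memory SQLite database as a stand-in for MySQL. The MatchLookup name and its match_index and image_match columns follow the table layout shown later in this post. ClusterMatchesNarrow, the index name and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ClusterMatches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cluster_index INT,
    matches TEXT,
    tfidf FLOAT
);
INSERT INTO ClusterMatches (cluster_index, matches, tfidf) VALUES
    (1, 'a.jpg', 0.5), (2, 'a.jpg', 0.25), (2, 'b.jpg', 0.5);

-- Lookup table: one row per distinct matches string.
CREATE TABLE MatchLookup (
    match_index INTEGER PRIMARY KEY,
    image_match TEXT UNIQUE
);
INSERT INTO MatchLookup (image_match)
    SELECT DISTINCT matches FROM ClusterMatches ORDER BY matches;

-- Hypothetical narrow table that carries only fixed-width columns.
CREATE TABLE ClusterMatchesNarrow (
    id INTEGER PRIMARY KEY,
    cluster_index INT,
    match_index INT,
    tfidf FLOAT
);
INSERT INTO ClusterMatchesNarrow (id, cluster_index, match_index, tfidf)
    SELECT c.id, c.cluster_index, l.match_index, c.tfidf
    FROM ClusterMatches AS c JOIN MatchLookup AS l ON l.image_match = c.matches;
CREATE INDEX idx_match ON ClusterMatchesNarrow (match_index);
""")

# Group on the integer key, then resolve the text only for the rows returned.
rows = conn.execute("""
    SELECT l.image_match, t.total
    FROM (SELECT match_index, SUM(tfidf) AS total
          FROM ClusterMatchesNarrow
          WHERE cluster_index IN (1, 2)
          GROUP BY match_index
          ORDER BY total DESC LIMIT 10) AS t
    JOIN MatchLookup AS l ON l.match_index = t.match_index
    ORDER BY t.total DESC
""").fetchall()
```

Grouping and sorting then work on 4-byte integers, and the text is joined back only for the ten rows that survive the LIMIT.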

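Tools

To tune performance I plan to use:

[EXPLAIN][6], with reference to the [output format][7]
[ab, the Apache HTTP server benchmarking tool][8]
[Profiling][9] with [log data][10]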

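Future database size

The goal is to build a system that can hold 1M unique cluster_index values, 1M unique match values and approximately 3,000,000,000 table rows, with a query response time of around 0.5 s (we can add more RAM as necessary and distribute the database across the cluster).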
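As a rough check on whether the target table could stay in RAM, here is a back-of-envelope estimate in Python. It assumes a fixed-width row of three 4-byte INT columns (id, cluster_index and an integer match key) plus one 4-byte FLOAT. It ignores storage engine overhead and indexes, so the real footprint would be larger.

```python
# Back-of-envelope size of the target table, counting raw column bytes only.
ROWS = 3_000_000_000
# Assumed fixed-width layout: three 4-byte INT columns plus one 4-byte FLOAT.
BYTES_PER_ROW = 4 + 4 + 4 + 4

raw_bytes = ROWS * BYTES_PER_ROW
raw_gib = raw_bytes / 2**30

print(f"raw column data: {raw_bytes:,} bytes (~{raw_gib:.1f} GiB)")
```

Under these assumptions the raw column data is 48,000,000,000 bytes. That is well below the 256 GB of the new machine, but it leaves out everything the server adds on top.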

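Questions

I think we want to keep the entire recordset in RAM so that the query does not touch the disk. If we keep the entire database in the MySQL cache, does that eliminate the need for memcachedb?
Is trying to keep the entire database in the MySQL cache a bad strategy, as it is not persistent? Would something like memcachedb or Redis be a better approach, and if so, why?
Is the temporary table "result" that is created by the query automatically destroyed when the query finishes?
Should we switch from InnoDB to MyISAM, [as it is good for read-heavy data whereas InnoDB is good for write-heavy][11]?
My query cache does not appear to be on, as it is set to zero in my [query cache configuration][12], so why does the query currently run faster the second time I run it?
Can I restructure my query to eliminate "Using temporary" and "Using filesort", and should I be using a join instead of a subquery?
How do you view the size of the MySQL [data cache][13]?
What values for table_cache, key_buffer, sort_buffer, read_buffer_size and record_rnd_buffer would you suggest as a starting point?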

Links

1:stackoverflow.com/questions/658937/cache-re-use-a-subquery-in-mysql
2:databasejournal.com/features/mysql/article.php/10897_1382791_4/Optimizing-MySQL-Queries-and-Indexes.htm
3:debianhelp.co.uk/mysqlperformance.htm
4:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
5:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
6:dev.mysql.com/doc/refman/5.0/en/explain.html
7:dev.mysql.com/doc/refman/5.0/en/explain-output.html
8:httpd.apache.org/docs/2.2/programs/ab.html
9:mtop.sourceforge.net/
10:dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
11:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
12:dev.mysql.com/doc/refman/5.0/en/query-cache-configuration.html
13:dev.mysql.com/tech-resources/articles/mysql-query-cache.html

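Solution

Changing the table

Based on the advice in this post, How to pick indexes for ORDER BY and GROUP BY queries, the table now looks like: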

CREATE TABLE ClusterMatches 
(
    cluster_index INT UNSIGNED, 
    match_index INT UNSIGNED,
    id INT NOT NULL AUTO_INCREMENT,
    tfidf FLOAT,
    PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup 
(
    match_index INT UNSIGNED NOT NULL PRIMARY KEY,
    image_match TINYTEXT
);

Eliminating the subquery

The query without sorting the results by SUM(tfidf) looks like:

SELECT match_index, SUM(tfidf) FROM ClusterMatches 
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;

This eliminates "Using temporary" and "Using filesort":

explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches 
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 14938 | Using where; Using index | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+

Sorting problem

However, if I add the ORDER BY SUM(tfidf) back in:

SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index 
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total              |
+-------------+--------------------+
|         868 |   0.11126546561718 | 
|        4182 | 0.0238558370620012 | 
|        2162 | 0.0216601379215717 | 
|        1406 | 0.0191618576645851 | 
|        4239 | 0.0168981291353703 | 
|        1437 | 0.0160425212234259 | 
|        2599 | 0.0156466849148273 | 
|         394 | 0.0155945559963584 | 
|        3116 | 0.0151005545631051 | 
|        4028 | 0.0149106932803988 | 
+-------------+--------------------+
10 rows in set (0.03 sec)

The result is suitably fast at this scale, but having the ORDER BY SUM(tfidf) means the query uses temporary and filesort:

explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index 
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 65369 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+

Possible solutions?

I'm looking for a solution that does not use temporary or filesort, along the lines of:

SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index 
HAVING total>0.01 ORDER BY cluster_index;

but where I don't need to hard-code a threshold for total. Any ideas?
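One direction that needs no threshold is to keep only the best N totals while reading the grouped rows, instead of sorting all of them. The sketch below does this on the client in Python. It is not a MySQL feature and the server still has to aggregate every group. The function name top_n and the sample totals are hypothetical.

```python
import heapq

def top_n(grouped_rows, n=10):
    """Pick the n rows with the largest total from an iterable of (match_index, total)."""
    # heapq.nlargest keeps a bounded heap of n items, so the full
    # grouped result is never sorted.
    return heapq.nlargest(n, grouped_rows, key=lambda row: row[1])

# Hypothetical grouped output, shaped like the GROUP BY match_index result above.
grouped = [(394, 0.0156), (868, 0.1113), (4182, 0.0239), (2162, 0.0217)]
best = top_n(grouped, n=2)
```

Whether this beats the server-side ORDER BY ... LIMIT would have to be measured, since every grouped row now crosses the connection.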

Licensed under: CC-BY-SA with attribution
