Question

We have the following table

CREATE TABLE variant
(
 id                         VARCHAR(105),
 chrom                      VARCHAR(12),
 condel_pred                VARCHAR(11),
 consequence                VARCHAR(97),
 dbsnp_id                   VARCHAR(23),
 most_del_score             INTEGER,
 pos                        INTEGER,
 polyphen_pred              VARCHAR(17),
 protein_change             VARCHAR(39),
 sift_pred                  VARCHAR(11),
 _13k_t2d_aa_mac            INTEGER,
 _13k_t2d_aa_maf            FLOAT,
 _13k_t2d_aa_mina           INTEGER,
 _13k_t2d_aa_minu           INTEGER,
 _13k_t2d_ea_mac            INTEGER,
 _13k_t2d_ea_maf            FLOAT,
 _13k_t2d_ea_mina           INTEGER,
 _13k_t2d_ea_minu           INTEGER,
 _13k_t2d_eu_mac            INTEGER,
 _13k_t2d_eu_maf            FLOAT,
 _13k_t2d_eu_mina           INTEGER,
 _13k_t2d_eu_minu           INTEGER,
 _13k_t2d_het_carriers      VARCHAR(4),
 _13k_t2d_het_ethnicities   VARCHAR(32),
 _13k_t2d_hom_carriers      VARCHAR(5),
 _13k_t2d_hom_ethnicities   VARCHAR(32),
 _13k_t2d_hs_mac            INTEGER,
 _13k_t2d_hs_maf            FLOAT,
 _13k_t2d_hs_mina           INTEGER,
 _13k_t2d_hs_minu           INTEGER,
 _13k_t2d_sa_mac            INTEGER,
 _13k_t2d_sa_maf            FLOAT,
 _13k_t2d_sa_mina           INTEGER,
 _13k_t2d_sa_minu           INTEGER,
 closest_gene               VARCHAR(16),
 exchp_t2d_beta             FLOAT,
 exchp_t2d_direction        VARCHAR(13),
 exchp_t2d_maf              FLOAT,
 exchp_t2d_neff             FLOAT,
 exchp_t2d_p_value          FLOAT,
 gene                       VARCHAR(20),
 in_exchp                   VARCHAR(1),
 in_exseq                   VARCHAR(1),
 in_gene                    VARCHAR(17),
 qcfail                     INTEGER,
 _13k_t2d_heta              INTEGER,
 _13k_t2d_hetu              INTEGER,
 _13k_t2d_homa              INTEGER,
 _13k_t2d_homu              INTEGER,
 _13k_t2d_mac               INTEGER,
 _13k_t2d_mina              INTEGER,
 _13k_t2d_minu              INTEGER,
 _13k_t2d_or_wald_dos_fe_iv FLOAT,
 _13k_t2d_p_emmax_fe_iv     FLOAT,
 _13k_t2d_transcript_annot  VARCHAR(10745),
 gwas_t2d_effect_allele     VARCHAR(1),
 gwas_t2d_or                FLOAT,
 gwas_t2d_pvalue            FLOAT,
 gws_traits                 VARCHAR(43),
 in_gwas                    VARCHAR(1),
 _13k_t2d_aa_eaf            FLOAT,
 _13k_t2d_ea_eaf            FLOAT,
 _13k_t2d_sa_eaf            FLOAT
 )                                                                    

It has several indexes, including

 GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX   

which is on (GWAS_T2D_PVALUE, MOST_DEL_SCORE, _13k_T2D_EA_MAF)

There are about 6m rows, with a lot of NULL data, with GWAS_T2D_PVALUE and MOST_DEL_SCORE being non-null together for a relatively small number of rows (~40k rows).

We are observing performance we don't understand when running the following query

SELECT * 
FROM VARIANT USE INDEX (GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX)
WHERE GWAS_T2D_PVALUE < .05 AND MOST_DEL_SCORE = 1;                      

which has the following EXPLAIN:

+----+-------------+---------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-------------+
| id | select_type | table   | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra       |
+----+-------------+---------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-------------+
|  1 | SIMPLE      | VARIANT | range | GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX | GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX | 10      | NULL | 280242 | Using where |
+----+-------------+---------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-------------+

What happens is that the query takes a very long time to execute (~3 minutes) if it has not been run in a while (e.g. 8 hours), but afterwards takes <1s and returns 8 rows. We have two questions:

  1. Why does the first query execution take so long? We assume this is because of some OS caching or paging issue, since the query cache is turned off, and closely related queries (e.g. replacing 0.05 with 0.1) also run fast the second time.

  2. Why does this query ever take ~3 minutes, even with no caching and every page fetch going to disk? It only returns 8 rows, and shouldn't the index be able to seek directly to these 8 rows, since its first two columns are the two columns in the WHERE clause? Why does EXPLAIN estimate 280K rows scanned rather than 8? We ran OPTIMIZE on the table and the estimate was still the same. What is also confusing is that an EXPLAIN when forcing an index on GWAS_T2D_PVALUE alone produces an estimate of 44K rows scanned, and an index on (GWAS_T2D_PVALUE, MOST_DEL_SCORE) produces an estimate of 32K rows scanned. Based on our understanding of multi-column indexes, why would query performance be any different for the 2- and 3-column indexes, and shouldn't both be far superior to the 1-column index?

Solution

The columns in your index are in the wrong order for this query, and that's why you see "Using where" in the query plan.

To invoke a well-worn illustration, let's consider the telephone directory.

Your query is the equivalent of WHERE last_name < 'smith' AND first_name = 'john'.

The fact that the first names are sorted within each sorted group of last names is of no real value, because we still have to consider all of the people in a large portion of the directory (everyone before Smith) and evaluate their first names individually within each distinct last name. That's why your row estimate is so large.

If both expressions were equality comparisons, the server could indeed go directly to the 8 rows. If the leftmost column in the index were subject to the equality comparison and the second column were the "less than" comparison, the server could again go directly to the rows in question, because they would all be adjacent in the index.

An index with the two columns in the opposite order will most likely give very different performance.
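As a rough sketch of what that could look like, the statements below add a two-column index with the equality column first and then check the plan again. The index name MOST_DEL_GWAS_T2D_PVAL_IDX is made up for illustration, and the row estimate you get back will depend on your data.

```sql
-- Hypothetical index name; the equality column comes first,
-- the range column second.
CREATE INDEX MOST_DEL_GWAS_T2D_PVAL_IDX
    ON VARIANT (MOST_DEL_SCORE, GWAS_T2D_PVALUE);

-- Re-check the plan, forcing the new index the same way the
-- original query forced the old one.
EXPLAIN
SELECT *
FROM VARIANT USE INDEX (MOST_DEL_GWAS_T2D_PVAL_IDX)
WHERE MOST_DEL_SCORE = 1
  AND GWAS_T2D_PVALUE < .05;
```

If the reasoning above holds, the rows estimate in this plan should be much closer to the number of rows actually returned than the 280242 shown for the original index.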

Generally, "Using where" in the Extra column, together with a key chosen from among possible_keys, means that the index is helping somewhat, but the server still has to evaluate the rows the index finds and eliminate additional rows using expressions in the WHERE clause.

The faster response on identical queries would normally suggest the query cache in action, although you report that it is turned off. The faster response on similar queries possibly means your innodb_buffer_pool_size is too small for your workload: without an optimal index the query needs many random reads, so many pages have to be loaded into the pool from disk on first execution.
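One way to check these caching assumptions is to look at the server's own settings and counters. This is only a sketch: it assumes the table uses InnoDB (the question does not say), and the query_cache variables only exist on MySQL versions that still ship the query cache.

```sql
-- Is the query cache really off? (Variables absent on MySQL 8.0+.)
SHOW VARIABLES LIKE 'query_cache%';

-- How large is the InnoDB buffer pool, in bytes?
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Innodb_buffer_pool_read_requests counts logical page reads;
-- Innodb_buffer_pool_reads counts the ones that had to go to disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```

Comparing Innodb_buffer_pool_reads before and after a slow first run and again after a fast repeat run would show whether the difference really is disk reads into the buffer pool.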

Other tips

For your existing index on (GWAS_T2D_PVALUE, MOST_DEL_SCORE, _13k_T2D_EA_MAF), I would consider reversing the order of the first two columns to (MOST_DEL_SCORE, GWAS_T2D_PVALUE, _13k_T2D_EA_MAF), and here is why.

Think of the index this way. The first column of the index is GWAS_T2D_PVALUE, so you have a file cabinet with all of these values sorted by value. Within each group of equal PVALUE entries, the MOST_DEL_SCORE values are stored in order, and finally the _13k_T2D_EA_MAF values are sorted within that. So, to process your query, you need to pull out all the files with PVALUE < .05 (or whatever threshold you use). Then you have to go through each of those files by hand, looking for the ones that have your specific MOST_DEL_SCORE value of 1, and pull those out.

Now, try the alternate index. You still have a file cabinet, but each file is for a specific MOST_DEL_SCORE. So, if you have 20 scores, you have 20 files to look at. Since you are always looking for the single value MOST_DEL_SCORE = 1, you pick one file and you are almost done. Your next criterion is GWAS_T2D_PVALUE < .05. Since this column is the secondary sort key of the index, these values are already sorted and ready to go. The engine can start at the first record, read up to .05, and stop. It doesn't have to go through all the other combinations that the first index forces it to consider.
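The reordered three-column index described here could be added next to the existing one, so the two can be compared before anything is dropped. The index name below is hypothetical.

```sql
-- Hypothetical name; same three columns as the existing index,
-- but with MOST_DEL_SCORE (equality) ahead of GWAS_T2D_PVALUE (range).
ALTER TABLE VARIANT
    ADD INDEX MOST_DEL_GWAS_T2D_PVAL_13k_T2D_EA_MAF_IDX
        (MOST_DEL_SCORE, GWAS_T2D_PVALUE, _13k_T2D_EA_MAF);

-- The old index is left in place on purpose: other queries may
-- still depend on it, so drop it only after checking them.
```

Keeping the old index costs extra disk space and write time on a table of about 6m rows, so this is best treated as a temporary state while you compare EXPLAIN output for both.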

Just a suggestion, but I have seen queries improve in the past when the index matches the criteria, with the most specific (equality) columns first and the more general (range) columns after them.

Licensed under: CC-BY-SA with attribution