We have the following table:
CREATE TABLE variant
(
id VARCHAR(105),
chrom VARCHAR(12),
condel_pred VARCHAR(11),
consequence VARCHAR(97),
dbsnp_id VARCHAR(23),
most_del_score INTEGER,
pos INTEGER,
polyphen_pred VARCHAR(17),
protein_change VARCHAR(39),
sift_pred VARCHAR(11),
_13k_t2d_aa_mac INTEGER,
_13k_t2d_aa_maf FLOAT,
_13k_t2d_aa_mina INTEGER,
_13k_t2d_aa_minu INTEGER,
_13k_t2d_ea_mac INTEGER,
_13k_t2d_ea_maf FLOAT,
_13k_t2d_ea_mina INTEGER,
_13k_t2d_ea_minu INTEGER,
_13k_t2d_eu_mac INTEGER,
_13k_t2d_eu_maf FLOAT,
_13k_t2d_eu_mina INTEGER,
_13k_t2d_eu_minu INTEGER,
_13k_t2d_het_carriers VARCHAR(4),
_13k_t2d_het_ethnicities VARCHAR(32),
_13k_t2d_hom_carriers VARCHAR(5),
_13k_t2d_hom_ethnicities VARCHAR(32),
_13k_t2d_hs_mac INTEGER,
_13k_t2d_hs_maf FLOAT,
_13k_t2d_hs_mina INTEGER,
_13k_t2d_hs_minu INTEGER,
_13k_t2d_sa_mac INTEGER,
_13k_t2d_sa_maf FLOAT,
_13k_t2d_sa_mina INTEGER,
_13k_t2d_sa_minu INTEGER,
closest_gene VARCHAR(16),
exchp_t2d_beta FLOAT,
exchp_t2d_direction VARCHAR(13),
exchp_t2d_maf FLOAT,
exchp_t2d_neff FLOAT,
exchp_t2d_p_value FLOAT,
gene VARCHAR(20),
in_exchp VARCHAR(1),
in_exseq VARCHAR(1),
in_gene VARCHAR(17),
qcfail INTEGER,
_13k_t2d_heta INTEGER,
_13k_t2d_hetu INTEGER,
_13k_t2d_homa INTEGER,
_13k_t2d_homu INTEGER,
_13k_t2d_mac INTEGER,
_13k_t2d_mina INTEGER,
_13k_t2d_minu INTEGER,
_13k_t2d_or_wald_dos_fe_iv FLOAT,
_13k_t2d_p_emmax_fe_iv FLOAT,
_13k_t2d_transcript_annot VARCHAR(10745),
gwas_t2d_effect_allele VARCHAR(1),
gwas_t2d_or FLOAT,
gwas_t2d_pvalue FLOAT,
gws_traits VARCHAR(43),
in_gwas VARCHAR(1),
_13k_t2d_aa_eaf FLOAT,
_13k_t2d_ea_eaf FLOAT,
_13k_t2d_sa_eaf FLOAT
)
with several indexes, including
GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX
which is on (GWAS_T2D_PVALUE, MOST_DEL_SCORE, _13k_T2D_EA_MAF).
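In DDL terms, this index corresponds to a statement along the following lines. This is a sketch reconstructed from the column list above, not copied from our schema scripts, so the exact form (e.g. ALTER TABLE ... ADD INDEX instead) may differ:

```sql
-- Composite index: column order matters, GWAS_T2D_PVALUE is the leading column.
CREATE INDEX GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX
    ON VARIANT (GWAS_T2D_PVALUE, MOST_DEL_SCORE, _13k_T2D_EA_MAF);
```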
There are about 6M rows with a lot of NULL data; GWAS_T2D_PVALUE and MOST_DEL_SCORE are both non-null in only a relatively small number of rows (~40K).
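For reference, the count of rows where both columns are populated can be checked with a query of this form (a sketch; the exact statement behind the ~40K figure may have differed):

```sql
-- Rows where both filter columns are non-null.
SELECT COUNT(*)
FROM VARIANT
WHERE GWAS_T2D_PVALUE IS NOT NULL
  AND MOST_DEL_SCORE IS NOT NULL;
```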
We are observing performance we don't understand when running the following query:
SELECT *
FROM VARIANT USE INDEX (GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX)
WHERE GWAS_T2D_PVALUE < .05 AND MOST_DEL_SCORE = 1;
which has the following EXPLAIN:
+----+-------------+---------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-------------+
| 1 | SIMPLE | VARIANT | range | GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX | GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX | 10 | NULL | 280242 | Using where |
+----+-------------+---------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-------------+
What happens is that the query takes a very long time to execute (~3 minutes) if it has not been run in a while (e.g. 8 hours), but afterwards takes <1s and returns 8 rows. We have two questions:
1. Why does the first query execution take so long? We assume this is because of some OS caching or paging issue, since the query cache is turned off, and closely related queries (e.g. replacing 0.05 with 0.1) also run fast the second time.
2. Why does this query ever take ~3 minutes, even with no caching and every page fetch going to disk? It only returns 8 rows, and shouldn't the index be able to seek directly to these 8 rows, since its first two columns are the two columns in the WHERE clause? Why does EXPLAIN estimate 280K rows scanned rather than 8? We ran OPTIMIZE on the table and the estimate was still the same.
What is also confusing is that EXPLAIN, when forced to use an index on GWAS_T2D_PVALUE alone, produces an estimate of 44K rows scanned, and with an index on (GWAS_T2D_PVALUE, MOST_DEL_SCORE) it produces an estimate of 32K rows scanned. Based on our understanding of multi-column indexes, why would query performance be any different for the two- and three-column indexes, and shouldn't both be far superior to the one-column index?
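The comparison above can be reproduced with statements of the following shape. The two index names here are placeholders invented for illustration (the real names are not shown in this post); only the three-column index name is the actual one:

```sql
-- Placeholder name for the single-column index on (GWAS_T2D_PVALUE).
EXPLAIN SELECT *
FROM VARIANT FORCE INDEX (GWAS_T2D_PVALUE_IDX)
WHERE GWAS_T2D_PVALUE < .05 AND MOST_DEL_SCORE = 1;

-- Placeholder name for the two-column index on (GWAS_T2D_PVALUE, MOST_DEL_SCORE).
EXPLAIN SELECT *
FROM VARIANT FORCE INDEX (GWAS_T2D_PVAL_MOST_DEL_IDX)
WHERE GWAS_T2D_PVALUE < .05 AND MOST_DEL_SCORE = 1;

-- The three-column index from the original query.
EXPLAIN SELECT *
FROM VARIANT FORCE INDEX (GWAS_T2D_PVAL_MOST_DEL_13k_T2D_EA_MAF_IDX)
WHERE GWAS_T2D_PVALUE < .05 AND MOST_DEL_SCORE = 1;
```

Comparing the rows column across the three outputs gives the 44K, 32K, and 280K estimates quoted in this post.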