Performance for MySQL InnoDB B+Tree index with many duplicate values

https://dba.stackexchange.com/questions/129544

30-09-2020
|

Question

I'm trying to diagnose seemingly random performance issues with our database server. Below is a simplified scenario, hopefully generic enough to serve as a useful future reference for anyone looking for the same answer.

Suppose I have a (MySQL 5.6 w/ InnoDB) table like

CREATE TABLE Example (
    id INT NOT NULL AUTO_INCREMENT,
    secondary_id INT DEFAULT NULL,
    some_data TEXT NOT NULL,
    PRIMARY KEY (id),
    KEY (secondary_id)
) ENGINE=InnoDB;

with about 15 million rows. However, the secondary_id column is NULL for almost all rows, so the index on secondary_id has very, very low cardinality (in our case about ~30k). In our case, when we experience the performance issue I'm investigating, the process list for the server shows many (100+) queries of the form:

UPDATE Example SET secondary_id = NULL, some_data = '...' WHERE id = 123;

that are taking ~90+ seconds to complete, during which they are in the "updating" state. (These queries would be running in separate transactions.)

I'm specifically wondering if the transition from a not-null secondary_id to a null secondary_id is causing performance slowdowns in the above UPDATE. That is, is it possible that updating the index in this case takes a significant amount of time since there are so many rows (~15mil) that have the same value for that column (NULL)?

I guess this question stems from me not understanding how the B+Tree index will store row pointers for rows having duplicate index values. I'd guess that node would have a linked list (or something similar) with a pretty fast insertion time, so I'd guess the answer to my question is "No". But I'd like to confirm that with the experts, i.e. you all.

I've tried to do a fair amount of research here, but I've come up pretty empty handed. Probably the most comprehensive post is this one, which explains some different techniques for handling duplicate keys, but I'm specifically looking for InnoDB/MySQL's approach.

Solution

90 seconds for a single UPDATE sounds too, too much. There is probably some blocking involved and should be investigated.

Apart from that, having a column that has 98% the same (NULL) value does not sound good either. You should consider putting that column in a separate table (which would have only 30K rows). It would complicate your INSERT/DELETE/UPDATE procedures a bit but you would probably gain from the smaller indexes. Suggested design:

CREATE TABLE Example (
    id INT NOT NULL AUTO_INCREMENT,
    some_data TEXT NOT NULL,
    PRIMARY KEY (id)
) ENGINE = InnoDB ;

CREATE TABLE Example_secondary (
    id INT NOT NULL,
    secondary_id INT NOT NULL,
    PRIMARY KEY (id),
    INDEX (secondary_id),
    FOREIGN KEY (id)
      REFERENCES Example (id)
) ENGINE = InnoDB ;

Then your UPDATE:

UPDATE Example 
SET secondary_id = NULL, 
    some_data = '...' 
WHERE id = 123 ;

would become:

BEGIN ;
    UPDATE Example 
    SET some_data = '...' 
    WHERE id = 123 ;

    DELETE FROM Example_secondary 
    WHERE id = 123 ;
COMMIT ;

OTHER TIPS

What does this give you:

EXPLAIN UPDATE Example 
    SET secondary_id = NULL, 
        some_data = '...' 
    WHERE id = 123 ;

Perhaps it gives some more clues.

Another idea: Change INDEX(secondary_id) to INDEX(secondary_id, id). Even though that is what is stored in the BTree, I wonder if being explicit would trick it into being more efficient. Possibly your index has the ids stored in random order, but mine would have them in an order that would be efficient for inserting/finding/etc.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange