Is "CREATE INDEX" in MySQL a Linear Operation?

https://dba.stackexchange.com/questions/581

mysql
index

16-10-2019

Question

What I mean is the following:

If creating an index on a table with n rows takes time t, will creating an index on the same table with 1000*n rows take approximately 1000*t?

What I'm trying to achieve is to estimate the time it takes to create the index on the production database by creating the same index on the much smaller test database.

Solution

Index creation is essentially a sort operation, so at best it has a growth complexity of the order n log n on average (you might find it does better in some cases, and it is not likely to do much worse).
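To put numbers on the gap between the linear guess and the n log n case, here is a small Python sketch. The function name and the figures (60 seconds measured on a million test rows) are made up for illustration, and the model ignores I/O effects and constant factors entirely.

```python
import math


def extrapolate_index_time(t_small, n_small, n_large):
    """Scale a measured build time from n_small rows to n_large rows.

    Returns (linear_estimate, nlogn_estimate) in the same unit as t_small.
    """
    if n_small < 2 or n_large < 2 or t_small < 0:
        raise ValueError("need at least 2 rows and a non-negative time")
    ratio = n_large / n_small
    linear = t_small * ratio
    # n log n scaling: multiply the linear estimate by the ratio of the logs.
    nlogn = linear * (math.log(n_large) / math.log(n_small))
    return linear, nlogn


# Hypothetical measurement: 60 s on 1,000,000 test rows; production has 1000x the rows.
linear, nlogn = extrapolate_index_time(60.0, 1_000_000, 1_000_000_000)
print(f"linear: {linear / 3600:.1f} h, n log n: {nlogn / 3600:.1f} h")
```

With these made-up inputs the linear estimate is about 16.7 hours and the n log n estimate is 25 hours, so the log factor alone adds 50% when going from a million to a billion rows.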

If all your relevant data pages fit into RAM and are already in RAM, the index will fit also, and your DBMS does not force index pages to be written before the creation is complete (so index blocks are not updated on disk multiple times during the operation), then the speed of writing the resulting index to disk will be more significant than the time taken to perform the sort. In that case you might find you get closer to a linear relationship between the number of rows and the time the index creation takes. But if you assume the worst case, you are less likely to be unpleasantly surprised!

Remember that unless you are going to stop access to the production database during the operation, any index creation will be competing for I/O bandwidth and/or locks with other activity. Try to account for this if you are doing your timing tests on another system, even if it is identically configured.
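If you can build the index on a few test tables of different sizes, you can check which regime you are in rather than guessing. This is my own suggestion, not something MySQL provides: a minimal sketch that fits t = c * n^k to the measurements on a log-log scale. An exponent k near 1 suggests the linear case; noticeably above 1 suggests you should budget for worse. The measurements below are invented, and a fit made while everything fits in RAM says nothing about what happens once it no longer does.

```python
import math


def fit_power_law(measurements):
    """Least-squares fit of t = c * n**k on a log-log scale.

    measurements: list of (rows, seconds) pairs from test builds.
    Returns (c, k).
    """
    if len(measurements) < 2:
        raise ValueError("need at least two measurements")
    xs = [math.log(n) for n, _ in measurements]
    ys = [math.log(t) for _, t in measurements]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("measurements must use different row counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k


def predict(c, k, rows):
    """Predicted build time in seconds for a table with the given row count."""
    return c * rows ** k


# Hypothetical timings: (rows, seconds) from three test builds.
timings = [(100_000, 5.0), (1_000_000, 60.0), (10_000_000, 720.0)]
c, k = fit_power_law(timings)
print(f"exponent k = {k:.2f}")
print(f"predicted for 1e9 rows: {predict(c, k, 1_000_000_000) / 3600:.1f} h")
```

For the invented timings above, each tenfold increase in rows costs twelve times as long, which gives an exponent of about 1.08.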

OTHER TIPS

Also worth noting: if you can split the spindles for the indexes from the spindles for the table, you'll be able to work from two disks at one time. You'll still be limited by the speed of the disk controller in the middle (if it's a RAID or the like), but it'll still be faster than one disk.

I realize that creating an index isn't completely a simultaneous read-write operation, but splitting the spindles does speed things up considerably.

CAVEATS: I'm a MSSQL guy myself, so I'm not sure about MySQL, but I have to imagine that the concept of splitting spindles is not specific to SQL Server and Oracle (where I've heard it talked about too, IIRC). I just wouldn't know how to go about setting it up in MySQL. In SQL Server terms it would mean having a separate filegroup besides PRIMARY and putting the indexes on that other filegroup, with that filegroup assigned to a set of spindles not involving PRIMARY (granted, spindle placement vs. filegroups is another story altogether).

If this question had been asked about 6 years ago, I would have emphatically said NO, as it would have pertained to MySQL 4.x. However, MySQL 5.x does perform index creation linearly today. I just had a nostalgic experience explaining this in my answer to that previous question.

It depends.

Variable #1: Whether MySQL chooses to build the index(es) on the fly, or to wait until all the data is in and then do a sort, etc., to build the index. Note: UNIQUE indexes (I think) have to be built on the fly so that uniqueness can be verified. The PRIMARY KEY for InnoDB is stored with the data (or you could state it vice versa), so that one MUST be built randomly.

Variable #2: Whether the index tracks the data (e.g. AUTO_INCREMENT or timestamp), is random (GUID, MD5), or falls somewhere in between (part number, name, friend_id).

Variable #3 (if the index is built on the fly): Whether the index fits in cache (key_buffer or innodb_buffer_pool) or spills to disk.

Indexes that track the data are efficient, and virtually linear, regardless of the answer to #1.
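To see why variables #2 and #3 matter so much together, here is a toy Python simulation. It is a deliberate simplification, not how InnoDB or MyISAM actually manage pages: it assumes key k lands on page k // keys_per_page and that pages live in a plain LRU cache. All names and sizes are hypothetical. Every miss stands in for a disk read.

```python
import random
from collections import OrderedDict


def count_page_misses(keys, keys_per_page, cache_pages):
    """Count LRU cache misses when keys are inserted into index pages in the given order."""
    cache = OrderedDict()
    misses = 0
    for key in keys:
        page = key // keys_per_page  # simplification: fixed page per key
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
        else:
            misses += 1
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict least recently used page
    return misses


# 100,000 keys on 1,000 pages; the cache holds only 50 pages (5% of the index).
n_keys, keys_per_page, cache_pages = 100_000, 100, 50
sequential = list(range(n_keys))      # index "tracks the data"
shuffled = sequential[:]
random.Random(42).shuffle(shuffled)   # stands in for GUID/MD5 order

seq_misses = count_page_misses(sequential, keys_per_page, cache_pages)
rand_misses = count_page_misses(shuffled, keys_per_page, cache_pages)
print(f"sequential misses: {seq_misses}, random misses: {rand_misses}")
```

In sequential order each page is touched once and then filled, so the miss count equals the number of pages. In random order, with a cache this small, nearly every insert misses.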

Random ids are a pain. If the index won't fit in cache, the time to build it will be much worse than linear, regardless of the other variables. (I disagree with Rolando in this case.) A huge InnoDB table with a GUID for the PK is painfully slow to INSERT into -- plan on 100 rows/sec for ordinary disks; maybe 1000 if you have SSDs. LOAD DATA and batched INSERTs won't get you past the slowness of the random storage.
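A quick back-of-the-envelope calculation in Python shows what those insert rates mean for a big table. The 100 and 1000 rows/sec rates are the ones quoted above; the 100 million row count is a hypothetical example.

```python
def load_time_days(rows, rows_per_second):
    """Days needed to insert `rows` rows at a sustained rate of rows_per_second."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return rows / rows_per_second / 86_400  # 86,400 seconds per day


# Hypothetical table size: 100 million rows.
for label, rate in (("ordinary disks", 100), ("SSDs", 1000)):
    print(f"{label}: {load_time_days(100_000_000, rate):.1f} days")
```

That works out to roughly 11.6 days on ordinary disks and about 1.2 days on SSDs for this example size.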

3.53 thru 5.6 -- not much has changed.

Multiple spindles? RAID striping is better in almost any situation than manually assigning this to here and that to there. Manual splitting leads to unbalanced situations -- a table scan is stuck on the data disk; an index-only operation is stuck on the index disk; a lone query first hits the index disk, then the data disk (no overlap); etc.

Licensed under: CC-BY-SA with attribution
