Index on a table you must fully scan anyway? (MySQL)

https://stackoverflow.com/questions/4630881

08-10-2019
|

Question

I'm pretty stumped here.

I have 2 tables and I'm left joining the first (around 500k records) with the second (around 2.2 million records) in order to find out which records are in the first and not the second. (typical "b.attribute is null" nonsense)

Why (how) is it that an index is utilized on the first table? It's going to have to go through EVERY record in the first table anyway, and yet when I try to do this join without any index (or primary key.. none needed because this is all just ETL) on the first table, it crawls.

using innodb, by the way.

Help?

EDIT : the 2nd table is indexed. The first wasn't.

Solution

I have no idea if this is what is happening, but it would, in theory, be possible (depending on the actual query) for the database engine to be scanning the index for the left table rather than the table itself. It could construct the necessary key data for that. If scanning the index were faster than scanning the table, that could account for the speed difference.

OTHER TIPS

This should shed some light on it: http://dev.mysql.com/doc/refman/5.5/en/innodb-index-types.html

In short: All InnoDB tables have so called 'clustered index' (even if no explicit index is defined on the table, InnoDB creates it automatically), in which actual rows are stored.

The purpose of the primary index is put things in order by sorting and creating a big tree (at least in SQL Server). B-tree if to be more specific. This means each record's key belongs to some place (or bucket) in the tree.

alt text

So why adding a key to the FIRST table helps to speed up the query? The reason is that when the query is executed, the FIRST table is being sorted since the SECOND table is already sorted due to presence of a primary key. This is due to the simple fact that comparing two sorted lists is much faster than doing binary search for each element. In this case, since there is no index, sorting takes time.

By the way, don't be confused by what I say. It's not really comparing lists, but more pruning the index tree on the above picture, e.g. if the T1 has K1, K2, K3 and K1 can be found in second bucket on the picture then there is no need to check first bucket for the rest of the keys.

MySQL doesn't have hash joins.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow