Too many columns to index - use mySQL Partitions?

https://stackoverflow.com/questions/4429180

09-10-2019

Question

We have an application with a table with 20+ columns that are all searchable. Building indexes for all these columns would make write queries very slow, and any really useful index would often have to span multiple columns, increasing the number of indexes needed.

However, for 95% of these searches, only a small subset of those rows need to be searched upon, and quite a small number - say 50,000 rows.

So, we have considered using mySQL Partition tables - having a column that is basically isActive which is what we divide the two partitions by. Most search queries would be run with isActive=1. Most queries would then be run against the small 50,000 row partition and be quick without other indexes.

The only issue is that the set of rows where isActive=1 is not fixed; i.e. it's not based on the date of the row or anything fixed like that; we will need to update isActive based on use of the data in that row. As I understand it, that is no problem though; the row would just be moved from one partition to another during the UPDATE query.

We do have a PK on id for the row though; and I am not sure if this is a problem; the manual seemed to suggest the partition had to be based on any primary keys. This would be a huge problem for us because the primary key ID has no basis on whether the row isActive.

Solution

I am not a MySQL expert. My focus is Oracle, but I've been working with Partitioning for years and I've come to find that your suggested use is very appropriate but not inside the mainstream understanding of partitions.

Index on low cardinality columns

Putting aside Index Merging for now, let's say that your active rows are somewhat scattered and are at a 1:20 ratio to the inactive rows. Say your page size is 8 KB and you get about 20 rows per block. If you get a very even distribution of isActive records, you'll have almost 1 per block. Reading EVERY block/page in the table with a full table scan will then be much, much faster than using an index to find those same rows.

So let's say they are concentrated instead of evenly scattered. Even if they are concentrated in 20% of the pages, or even 10% of the pages, a full table scan can outperform an index even in those cases.
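The arithmetic behind the scattered case can be sketched as follows. The 5% active share and 20 rows per block are taken from the figures above; the random-placement model and the function name are assumptions made purely for illustration. Under that model roughly two thirds of all blocks contain at least one active row, so an index lookup would still end up visiting most of the table.

```python
# Rough arithmetic only: 5% active rows and 20 rows per block come from the
# discussion above; independent random placement of rows is an assumption.

def fraction_of_blocks_with_active_row(active_fraction: float, rows_per_block: int) -> float:
    """Chance that a block holds at least one active row, assuming each row
    is active independently with probability `active_fraction`."""
    return 1.0 - (1.0 - active_fraction) ** rows_per_block


if __name__ == "__main__":
    scattered = fraction_of_blocks_with_active_row(0.05, 20)
    print(f"share of blocks holding an active row: {scattered:.1%}")
```

If the active rows were spread perfectly evenly (one per block), the share would be closer to 100%, which is the case described in the answer.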

So now include index merging. If, after you scan the index on isActive, you DO NOT visit the table but join those results to the results of ANOTHER index, and that final result set means reading, say, less than 5% of your blocks, then yes, an index on isActive plus index merging could be a solution.

The caveat here is that there are a lot of limitations on the implementation of index joins in MySQL. Make sure that this works in your situation. But you said you have another 20 fields that may be searched. So unless you index all of them, so that there's always a second index available to join the isActive index to, you won't be using index merging/joins.

Partitioning a low cardinality column

Now if you partition on that column, you'll have 5% of the blocks with isActive = True in them and they will be densely packed. A full partition scan will quickly yield the list of active records, and allow every other predicate to be applied as a filter instead of an index seek.

But that flag changes, right?

In Oracle we have a command that allows us to enable Row Migration (ALTER TABLE ... ENABLE ROW MOVEMENT). That means that when Is_Active changes from True to False, the row is moved to the partition it now falls in. This is pretty expensive, but only a bit more so than the index maintenance that would occur if you indexed that column instead of partitioning by it. In the partitioned case, Oracle first changes the row with an update, then does a delete and then an insert. If you indexed that column, you'd do an update of the row, and then the index entry for True would be deleted and an index entry for False would be created.

If MySQL doesn't have row migration then you'll have to program your CRUD package to do that. A procedure along the lines of UPDATE_ROW_ISACTIVE(pk IN number) would do the delete and insert for you.
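A minimal sketch of such a delete-and-insert procedure is shown below. It is only an illustration: two plain tables stand in for the two partitions, SQLite (via Python's standard sqlite3 module) is used so the example runs anywhere, and the table names, columns, and the update_row_isactive function are all hypothetical. It says nothing about what MySQL itself does on an UPDATE of the partitioning column.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create two tables that stand in for the active/inactive partitions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emails_active (id INTEGER PRIMARY KEY, subject TEXT)")
    conn.execute("CREATE TABLE emails_inactive (id INTEGER PRIMARY KEY, subject TEXT)")
    return conn


def update_row_isactive(conn: sqlite3.Connection, pk: int, is_active: bool) -> bool:
    """Move row `pk` to the table matching its new flag; return False if the
    row was not found in the table it is expected to leave."""
    if is_active:
        source, target = "emails_inactive", "emails_active"
    else:
        source, target = "emails_active", "emails_inactive"
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            f"SELECT id, subject FROM {source} WHERE id = ?", (pk,)
        ).fetchone()
        if row is None:
            return False
        conn.execute(f"INSERT INTO {target} (id, subject) VALUES (?, ?)", row)
        conn.execute(f"DELETE FROM {source} WHERE id = ?", (pk,))
    return True
```

Doing the insert and the delete inside one transaction means a failure part-way through cannot leave the row in both places or in neither.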

Regarding Konerak's Answer

While I agree that parallel access is ONE use of partitioning, it's not the exclusive one. But if you follow the link he provides, the user comment at the very bottom of the page is:

Beware of having low selectivity indexes on your table. A complex AND/OR WHERE clause will surely make your query very very slow if Index_Merge optimization is being used with an intersect() algorithm.

That seems to speak to your situation, so you can take that comment FWIW.

OTHER TIPS

If you are going to index that many columns you may want to rethink your data structure. For example, make each column a row/record instead. Then have a "group ID" to link the individual records together, and a "name" field to indicate what piece of data it is. Then you only need 1 index for all your pieces of data.

This name/value pair setup is actually fairly common now and is what some NoSQL databases are based on, which is something else you may want to look into. Something like MongoDB is excellent for indexing "all" pieces of data.
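The name/value layout described above can be sketched as follows. This is an illustration under assumptions, not a recommended schema: the table name attributes, its columns, and the helper functions are hypothetical, and SQLite (via Python's standard sqlite3 module) is used only so the example is runnable.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """One narrow table: each former column becomes a (group_id, name, value) row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attributes ("
        "group_id INTEGER NOT NULL, name TEXT NOT NULL, value TEXT)"
    )
    # A single index serves searches on any attribute name.
    conn.execute("CREATE INDEX idx_name_value ON attributes (name, value)")
    return conn


def add_record(conn: sqlite3.Connection, group_id: int, fields: dict) -> None:
    """Store one logical record as several name/value rows sharing a group ID."""
    conn.executemany(
        "INSERT INTO attributes (group_id, name, value) VALUES (?, ?, ?)",
        [(group_id, name, value) for name, value in fields.items()],
    )


def find_groups(conn: sqlite3.Connection, name: str, value: str) -> list:
    """Return the group IDs whose attribute `name` equals `value`."""
    rows = conn.execute(
        "SELECT group_id FROM attributes WHERE name = ? AND value = ? "
        "ORDER BY group_id",
        (name, value),
    )
    return [row[0] for row in rows]
```

A search on several attributes at once would need one self-join (or an intersection of group IDs) per condition, which is the usual trade-off of this layout.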

You don't need partitions for this - just an Index on your isActive column would be enough. Note that MySQL can use the Index Merge operation to use both indexes.

Partitions would be useful when they allow the searches to be executed in parallel: e.g. if you partition per date, you can search 5 partitions simultaneously to find results spanning 5 years.

Your descriptions of the "table" and the "database" are classic symptoms of a lack of Normalisation. A "table" with 20 searchable columns is not 3NF and probably not even 1NF. The best advice is to go back to first principles and normalise the data. That will result in much narrower tables, and also fewer rows per table, but sure, more tables. However, the result also has fewer indices, per table and overall.

And a much faster database. Fat-wide "tables" are a disaster for performance, at every level.

Partitions do not apply here, they will not ease your problem.

An id PK is an additional column and index, a surrogate, a substitute (but not a replacement) for the real Primary Key. If you used Relational modelling techniques, that can be eliminated, at least getting down to 19 searchable indices. Any serious work on the "table" will be centred around the real PK, not the surrogate, for example, as you have seen from the restrictions re Partitions.

If you wish to discuss it, please post your DDL for the "table", plus every connected "table".

Response to Comments

The table is best thought of as "emails" but with a lot of extra fields (category/department/priority/workflow/owner) which are all properly normalised. There are a range of other variables as well including quite a lot of timestamps.

That's the very definition of a flat file, at 0NF. Unless you are using some unwritten definition of "Normalisation", it is, by your own description, not Normalised at all. It is the article one starts with before any Normalisation is commenced.

No doubt the indices will be fat-wide as well, in order to be useful for queries. And, as you may not have realised yet, there is massive data duplication in that file, along with Update Anomalies (when you update a column in one row, you have to update the duplicated value in the other rows), which make your application unnecessarily complex.

You need to understand that all the Relational DBMS vendors write Relational database engines that are optimised to handle Relational databases. That means they are optimised for Normalised, not Unnormalised or Denormalised, structures.

I will not be drawn into academic arguments, and SO is a question-and-answer site, not a debating site. As requested, post your DDL for the file, and all connected files, and we can definitely (a) give it some speed and (b) avoid 20+ indices (which is another common symptom of the condition). That will deal with a specific real world issue and solve it, and avoid debate.

Second, you seem to have the roles mixed up. It is you, with the problem, posting the question on SO, and it is me who has fixed hundreds of performance problems, answering. By definition the solution is outside your domain, otherwise you would have solved it, and thus you would not be posting a question; so it does not work when you tell me how to fix your problem. That would be tying me up in the same limitations that you have, and thus ensuring that I do not fix the problem.

Also from our tests, having lots of tables to JOIN against that we need to include in the WHERE clause only makes the query slower.

Actually I tune databases for a living, and I have hundreds of tests that demonstrate joining many, smaller, tables is faster. It would be interesting to look into the test and the coding capability of the coder, but that would start a debate, so let's not do that; let's stick to the question. If you want examples of (a) serious testing which (b) proves what I have stated before being challenged, here's just one example fully documented and under scrutiny of, and corresponding test with, stalwarts in the Oracle world.

You may also be interested in this question/answer, which killed the same debate you are approaching.

Joins cost nothing. The files you join to, the number of records joined on either side, and the usefulness of the indices: that is where the cost lies. If it is another Unnormalised file (fat, wide, many optional columns), sure it will be slow.

Anyway, if you are genuinely interested in fixing your posted problem, post all your DDL and we can make it faster for you. If all you want is a yes/no answer re partitions (and to not address the causative problem), that's fine too; you already have that.

Licensed under: CC-BY-SA with attribution
