I have some rather large CSV files that I am loading into my MySQL 5.7 database. The files are several gigabytes in size and several million lines long, and they contain wide columns (sometimes ~500 characters) that must be used in joins.

The data is all standard English characters, and most of the columns would fit in a single-byte character set like latin1. However, several of the columns require Unicode for things like trademark/registered/copyright symbols and measurement symbols (inches, feet, radius, etc.), so I've been using utf8mb4 on all tables.

The problem with doing this is twofold. It blows up our index sizes, so in some cases we can't create an index on a column (or columns) because the key width exceeds InnoDB's 3072-byte limit. Additionally, it seems to have a significant performance impact, presumably because the data size is 4x.

What I'd like to do is use latin1 on all columns in the table, and utf8mb4 only on columns that need it. This leads to my questions:

What's the best way to reliably identify which columns are actually storing multibyte characters? Can I detect that somehow, either within my CSV prior to loading (using Python/pandas, maybe?) or from within the database? The files are stored as UTF-8 and are currently loaded into a utf8mb4 table. If I could easily scan the table and say "this column contains no multibyte data", I could change it to latin1.

Second, will I run into problems if I try to create composite indexes on columns using different encodings? Say column A is utf8mb4 and column B is latin1. Is there anything wrong with creating an index on these two columns, i.e. CREATE INDEX my_index ON my_table(A, B);? I'm assuming there's no issue doing that.


Solution

The data size is not 4x. English text, even in utf8mb4, takes only one byte per character. Trademark (etc.) symbols are multi-byte, but the ones you mentioned take only 2 or 3 bytes each. Emoji and some Chinese characters are where 4 bytes become necessary.
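A quick way to see those byte counts for yourself, assuming a utf8mb4 session (LENGTH() counts bytes, CHAR_LENGTH() counts characters):

    -- Plain ASCII: 1 byte per character, even in utf8mb4
    SELECT LENGTH(_utf8mb4 'abc'), CHAR_LENGTH(_utf8mb4 'abc');  -- 3, 3
    -- © (U+00A9) and ® (U+00AE): 2 bytes each
    SELECT LENGTH(_utf8mb4 '©®'), CHAR_LENGTH(_utf8mb4 '©®');    -- 4, 2
    -- ™ (U+2122): 3 bytes
    SELECT LENGTH(_utf8mb4 '™'), CHAR_LENGTH(_utf8mb4 '™');      -- 3, 1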

Don't create indexes on large columns. Don't create indexes until you have the queries -- derive the optimal indexes from the queries.
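That said, if a query really does need an index involving one of those ~500-character join columns, a prefix index is one way to stay under the 3072-byte key limit. A sketch, with made-up table and column names:

    -- Index only the first 100 characters; in utf8mb4 the key is budgeted
    -- at 4 bytes per character, so 100 chars = 400 bytes, well under 3072.
    ALTER TABLE my_table ADD INDEX idx_part_desc (part_description(100));

A prefix index can't serve as a covering index, and rows that match only on the prefix must be rechecked against the full value, so whether it helps depends on the actual query.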

It is perfectly fine (in MySQL, at least) to have one column be latin1 and another be utf8mb4 (etc). And both of those can be in the same index.
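A minimal illustration (hypothetical table):

    CREATE TABLE charset_demo (
      a VARCHAR(200) CHARACTER SET utf8mb4,  -- needs the occasional ™/®
      b VARCHAR(200) CHARACTER SET latin1,   -- plain English only
      INDEX ab (a, b)                        -- mixed-charset composite index: fine
    ) ENGINE=InnoDB;

One caveat for your joins: comparing a latin1 column to a utf8mb4 column forces a character-set conversion on one side, which can prevent MySQL from using the index on the converted side. Keep join-key pairs in the same character set where you can.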

I suggest making a couple of passes over the data. First bring everything in as utf8mb4, with no indexes and wide columns (e.g. TEXT). Then analyze what you got -- SELECT MAX(CHAR_LENGTH(col2)), ...; test for non-latin1 content, etc. (a sketch follows below). For the second pass, redo the schema to match the actual maximum lengths, character sets, and so on.
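A sketch of that analysis pass, using placeholder names (my_table, col2); run it once per candidate column:

    -- How wide is the column really, and does it hold multi-byte characters?
    SELECT MAX(CHAR_LENGTH(col2))                 AS max_chars,
           MAX(LENGTH(col2))                      AS max_bytes,
           SUM(LENGTH(col2) <> CHAR_LENGTH(col2)) AS rows_with_multibyte
    FROM my_table;

    -- Multi-byte does not automatically mean "not latin1" (é is 2 bytes in
    -- utf8mb4 but fits in latin1), so also test whether every value survives
    -- a round trip; characters latin1 can't hold come back as '?'.
    SELECT COUNT(*) AS rows_not_latin1_safe
    FROM my_table
    WHERE col2 <> CONVERT(CONVERT(col2 USING latin1) USING utf8mb4);

    -- If that count is 0, the column can be switched, e.g.:
    -- ALTER TABLE my_table MODIFY col2 VARCHAR(500) CHARACTER SET latin1;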

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange