Question

In a fairly unique situation, my team has ended up with UTF-8 bytes in a database that thinks the data is encoded as latin1.

At least, I'm 85% certain that this is the situation at hand.

For example, a right single quotation mark was handed to the database by a programming language that had no concept of encodings (Ruby 1.8) and just treated the data as raw bytes (0xE2 0x80 0x99). This data, as far as I can tell (how to verify?), was stored as those actual bytes. So now when the data is read out by a more intelligent programming language (Ruby 1.9), the database helpfully says "Oh! 0xE2 is 'â', 0x80 is '€', 0x99 is '™'", and so instead of "Mike’s", we end up with "Mike’s". This is also what I get in the mysql prompt when SELECTing that value.

So, essentially, we have a bunch of utf-8 encoded data stored in a database that thinks the data is encoded as latin1.

This makes me to somehow tell the database "No, no matter what you think, this stuff is actually utf-8". CONVERT TO doesn't seem like the right tool, because then I'll end up with permanent "Mike’s".


Failed/moronic attempt #1

I noticed this:

> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | utf8              |
| character_set_connection | utf8              |
| character_set_database   | utf8              |
| character_set_filesystem | binary            |
| character_set_results    | utf8              |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | utf8_general_ci   |
| collation_database       | utf8_unicode_ci   |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+

And thought that maybe changing character_set_results to latin1 would trick it into not doing any conversion of the bytes, resulting in the proper display of data on my utf8 OS.

Sure enough, SET character_set_results=latin1; results in instead of ’. Cool!

So I added this to my ~/.my.cnf (which is the only my.cnf, I checked):

[mysqld]
...
character-set-results=latin1

and when I go back to the MySQL prompt & check the character_set_% variables, it's still utf8.

Yes, it just occurred to me that mysqld is a deamon, which means I probably need to restart the whole mysql process for this to take effect. But whoever installed MySQL on this machine used the dmg instead of the brew (wasn't me!) and the MySQL pref pane is currently telling me that MySQL isn't running even though it clearly is, and anyhow before I go down that rabbit hole, I want to check with an actual DBA and see how ridiculous this is, or if there's just a better, cleaner way to do it.

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top