Question

This is probably something simple. I swear I've been looking online for the answer and haven't found it. Since my particular case is a little atypical I finally decided to ask here.

I have a few tables in MySQL that I'm using for a Chinese language program. It needs to be able to support every possible Chinese character, including rare ones that don't have great font support. A sample cell in the table might be this:

東菄鶇䍶𠍀倲𩜍𢘐涷蝀凍鯟𢔅崠埬𧓕䰤

In order to get that to work right in the database, I've had to set the encoding/collation to utf8mb4. So far so good. Unfortunately when I pull the same string into PHP, it gets printed as this:

東菄鶇䍶?倲??涷蝀凍鯟?崠埬?䰤

How can I finally kill off the remaining question marks and get them to show as the unicode glyphs they should be? I've got the php page itself using UTF8 encoding in the tag and as a meta tag.

Why can't they communicate with each other? What am I doing wrong?

Was it helpful?

Solution

I'd simply guess that you are setting the table to utf8mb4, but your connection encoding is set to utf8. You have to set it to utf8mb4 as well, otherwise MySQL will convert the stored utf8mb4 data to utf8, the latter of which cannot encode "high" Unicode characters. (Yes, that's a MySQL idiosyncrasy.)

On a raw MySQL connection, it will have to look like this:

SET NAMES 'utf8mb4';
SELECT * FROM `my_table`;

You'll have to adapt that to the best way of the client, depending on how you connect to MySQL from PHP (mysql, mysqli or PDO).


To really clarify (yes, using the mysql_ extension for simplicity, don't do that at home):

mysql_connect(...);
mysql_select_db(...);
mysql_set_charset('utf8mb4');     // adapt to your mysql connector of choice

$r = mysql_query('SELECT * FROM `my_table`');

var_dump(mysql_fetch_assoc($r));  // data will be UTF8 encoded

OTHER TIPS

Just to add to @deceze's answer, I recommend a well-configured MySQL server (for me, in /etc/mysql/mysql.conf.d/mysqld.cnf). Here are the configuration options to make sure you're using utfmb4, although I do recommend going through every MySQL configuration option though, daunting as it is, there are a lot of defaults that are are very non-optimal.

[client]

default-character-set           = utf8mb4

[mysql]

default_character_set           = utf8mb4

[mysqld]

init-connect                    = "SET NAMES utf8mb4"
character-set-client-handshake  = FALSE
character-set-server            = "utf8mb4"
collation-server                = "utf8mb4_unicode_ci"
autocommit                      = 1
block_encryption_mode           = "aes-256-cbc"

That last one is just one that should be default. Also, init-connect deals with not having to execute that every time. Keeps code clean. Now run:

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

You should return something like the following:

+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | utf8mb4            |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_unicode_ci |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | utf8mb4_unicode_ci |
+--------------------------+--------------------+

And looks like you're doing this already, but doesn't hurt to explicitly define on table creation:

CREATE TABLE `mysql_table` (
  `mysql_column` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`mysql_column`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4;

Hope this helps someone.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top