Question

While converting a database to UTF-8 I noticed strange behavior regarding the characters in the range 0x80-0x9F. For example, 0x92 (right apostrophe) would not get converted to UTF-8, and the rest of the column's content would be truncated, using this method:

CREATE TABLE `bar` (
 `content` text
) ENGINE=MyISAM DEFAULT CHARSET=latin1

INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 1 row affected (0.06 sec)

SELECT content FROM bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ                                                 |
+---------------------------------------------------------------------------------+
1 row in set (0.06 sec)

ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected, 1 warning (0.06 sec)
Records: 1  Duplicates: 0  Warnings: 1

SHOW WARNINGS;
+---------+------+-------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                             |
+---------+------+-------------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x80\x81\x82\x83\x84\x85...' for column 'content' at row 1 |
+---------+------+-------------------------------------------------------------------------------------+
1 row in set (0.06 sec)

SELECT * FROM bar;
+---------+
| content |
+---------+
|         |
+---------+
1 row in set (0.06 sec)

Normally, 0x80-0x9F wouldn't be allowed in Latin1, but MySQL seems to handle that range differently:

MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. [src]
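To make the quoted difference concrete, here is a minimal Python sketch (Python is used only for illustration and is not part of the original post) showing how byte 0x92 decodes under ISO 8859-1 versus cp1252, and which UTF-8 bytes a correct conversion would be expected to produce. The helper `mysql_latin1_decode` is a hypothetical name, and its fallback for the five bytes that cp1252 leaves unassigned is an assumption about how MySQL treats them, not something stated in the question.

```python
# Bytes that cp1252 leaves unassigned; Python's cp1252 codec rejects them.
CP1252_UNASSIGNED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def mysql_latin1_decode(data: bytes) -> str:
    """Decode bytes as cp1252; unassigned bytes map to the code point
    with the same number (an assumption, see the text above)."""
    return "".join(
        chr(b) if b in CP1252_UNASSIGNED else bytes([b]).decode("cp1252")
        for b in data
    )


apostrophe = b"\x92"
as_iso = apostrophe.decode("latin-1")    # U+0092, a C1 control code
as_cp1252 = apostrophe.decode("cp1252")  # U+2019, right single quotation mark
utf8_bytes = as_cp1252.encode("utf-8")   # the bytes a utf8 column should hold

# The sample value inserted in the question (note that 0x88 is not in it).
sample = bytes.fromhex(
    "8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F"
)
converted = mysql_latin1_decode(sample).encode("utf-8")

print(repr(as_iso), repr(as_cp1252), utf8_bytes.hex())
print(len(sample), "bytes in,", len(converted), "bytes out")
```

Comparing `HEX(content)` in MySQL after the `ALTER TABLE` against `converted.hex()` would be one way to check whether a conversion did what was intended.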

But MySQL can't seem to convert the above range of values from its latin1 character set to its UTF-8 character set.

These characters are getting into my database through copy/pasting from a Word document (cp1252), and while I might have found a way to have the application force the right UTF-8 values for new entries, I need to make sure the old ones get converted properly.

Is there a way within MySQL that I'm missing to convert these to the UTF-8 equivalent without going through each row of each text column and replacing them with an ASCII-friendly version?


Solution

I'm not certain. I tried to start out by reproducing your problem, but the ALTER worked fine for me.

test > CREATE TABLE `bar` (  `content` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;  INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 0 rows affected (0.02 sec)

Query OK, 1 row affected (0.00 sec)

test > ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

test > select * from bar;
+---------------------------------+
| content                         |
+---------------------------------+
| ����������������������������� |
+---------------------------------+
1 row in set (0.00 sec)

test > set names utf8;
Query OK, 0 rows affected (0.00 sec)

test > select * from bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)
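The two different displays above are consistent with a connection charset mismatch rather than a difference in the stored data. As a rough Python sketch, assuming the terminal interprets incoming bytes as UTF-8 (my reading of the output, not something stated in the answer): when results are sent as single-byte latin1/cp1252, a UTF-8 terminal cannot decode them and shows replacement characters; when results are sent as UTF-8, the same text displays correctly.

```python
text = "\u20ac\u2019\u201c\u201d"  # € ’ “ ”

# Roughly what arrives when character_set_results is latin1: one byte per char.
sent_as_latin1 = text.encode("cp1252")
shown_wrong = sent_as_latin1.decode("utf-8", errors="replace")

# Roughly what arrives after SET NAMES utf8: multi-byte UTF-8 sequences.
sent_as_utf8 = text.encode("utf-8")
shown_right = sent_as_utf8.decode("utf-8")

print(sent_as_latin1.hex(), "->", shown_wrong)
print(sent_as_utf8.hex(), "->", shown_right)
```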

Here are my related character set settings:

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

Edit

My char settings before running set names utf8

test > show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Version

test > select version();
+-------------------------+
| version()               |
+-------------------------+
| 5.1.41-3ubuntu12.10-log |
+-------------------------+
1 row in set (0.00 sec)

OTHER TIPS

You may have to convert the character set to cp1250 before loading the data.

I ran this first

mysql> show character set like 'cp%';
+---------+---------------------------+-------------------+--------+
| Charset | Description               | Default collation | Maxlen |
+---------+---------------------------+-------------------+--------+
| cp850   | DOS West European         | cp850_general_ci  |      1 |
| cp1250  | Windows Central European  | cp1250_general_ci |      1 |
| cp866   | DOS Russian               | cp866_general_ci  |      1 |
| cp852   | DOS Central European      | cp852_general_ci  |      1 |
| cp1251  | Windows Cyrillic          | cp1251_general_ci |      1 |
| cp1256  | Windows Arabic            | cp1256_general_ci |      1 |
| cp1257  | Windows Baltic            | cp1257_general_ci |      1 |
| cp932   | SJIS for Windows Japanese | cp932_japanese_ci |      2 |
+---------+---------------------------+-------------------+--------+
8 rows in set (0.00 sec)

cp1252 does not exist here. The closest is cp1250.
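Before treating cp1250 as a stand-in, it may be worth checking where the two code pages disagree in the 0x80-0x9F range; per the documentation quoted in the question, MySQL's own latin1 already corresponds to cp1252. The following Python sketch (illustrative only; `decode_or_none` is a hypothetical helper) lists the bytes that decode differently under Python's `cp1252` and `cp1250` codecs, which I am assuming match the Windows code pages of the same names.

```python
def decode_or_none(byte_value: int, codec: str):
    """Return the character for a single byte, or None if the codec
    has no mapping for it."""
    try:
        return bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Collect every byte in 0x80-0x9F where the two code pages disagree.
differences = [
    (b, decode_or_none(b, "cp1252"), decode_or_none(b, "cp1250"))
    for b in range(0x80, 0xA0)
    if decode_or_none(b, "cp1252") != decode_or_none(b, "cp1250")
]

for b, western, central in differences:
    print(f"0x{b:02X}: cp1252={western!r} cp1250={central!r}")
```

Bytes such as 0x92 (the right apostrophe from the question) map to the same character in both code pages, while others, such as 0x8C, do not.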

Try this sequence:

drop database if exists dtest;
create database dtest;
use dtest
set names cp1250;
CREATE TABLE `bar` ( 
 `content` text 
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F); 
SELECT content FROM bar; 
SHOW VARIABLES LIKE '%char%';
set names utf8;
SHOW VARIABLES LIKE '%char%';
ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8; 
SELECT content FROM bar; 

and see what happens.

I got this in MySQL 5.5.19 on Linux

mysql> drop database if exists dtest;
Query OK, 0 rows affected (0.00 sec)

mysql> create database dtest;
Query OK, 1 row affected (0.00 sec)

mysql> use dtest
Database changed
mysql> set names cp1250;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `bar` (
    ->  `content` text
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 1 row affected (0.00 sec)

mysql> SELECT content FROM bar;
+---------------------------------+
| content                         |
+---------------------------------+
| ??

??????                      |
+---------------------------------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | cp1250                     |
| character_set_connection | cp1250                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | cp1250                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> SELECT content FROM bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------ŽÂÂâââââ---------------------------------------------------+
| â¬ÂâÆââ¦â â¡â°Å â¹Å         ¢Å¡âºÅÂ
                                      +---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql>

and I got this in MySQL 5.5.12 for Windows on my Windows 7 machine

mysql> drop database if exists dtest;
Query OK, 1 row affected (0.00 sec)

mysql> create database dtest;
Query OK, 1 row affected (0.02 sec)

mysql> use dtest
Database changed
mysql> set names cp1250;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `bar` (
    ->  `content` text
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO bar VALUES (0x8081828384858687898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F);
Query OK, 1 row affected (0.00 sec)

mysql> SELECT content FROM bar;
+---------------------------------+
| content                         |
+---------------------------------+
| Ç?é?äàåçëèï??Ä??æÆôöòûù?ÖÜ¢??₧? |
+---------------------------------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+---------------------------------+
| Variable_name            | Value                           |
+--------------------------+---------------------------------+
| character_set_client     | cp1250                          |
| character_set_connection | cp1250                          |
| character_set_database   | latin1                          |
| character_set_filesystem | binary                          |
| character_set_results    | cp1250                          |
| character_set_server     | latin1                          |
| character_set_system     | utf8                            |
| character_sets_dir       | C:\MySQL_5.5.12\share\charsets\ |
+--------------------------+---------------------------------+
8 rows in set (0.00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW VARIABLES LIKE '%char%';
+--------------------------+---------------------------------+
| Variable_name            | Value                           |
+--------------------------+---------------------------------+
| character_set_client     | utf8                            |
| character_set_connection | utf8                            |
| character_set_database   | latin1                          |
| character_set_filesystem | binary                          |
| character_set_results    | utf8                            |
| character_set_server     | latin1                          |
| character_set_system     | utf8                            |
| character_sets_dir       | C:\MySQL_5.5.12\share\charsets\ |
+--------------------------+---------------------------------+
8 rows in set (0.00 sec)

mysql> ALTER TABLE bar CHANGE content content TEXT CHARACTER SET UTF8;
Query OK, 1 row affected (0.06 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> SELECT content FROM bar;
+---------------------------------------------------------------------------------+
| content                                                                         |
+---------------------------------------------------------------------------------+
| €‚ƒ„…†‡‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ |
+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql>

Give it a try!

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange