MySQL unicode literals

https://stackoverflow.com/questions/3632410

26-09-2019
|

Question

I want to insert a record into MySQL that has a non-ASCII Unicode character, but I'm on a terminal that doesn't let me easily type non-ASCII characters. How do I escape a Unicode literal in MySQL's SQL syntax?

Solution

See: http://bugs.mysql.com/bug.php?id=10199 (Bug #10199: "Allow Unicode escape sequence for string literals.") This request has been "Open" since 2005. More details in Worklog Task #3529: Unicode Escape Sequences.

From https://web.archive.org/web/20091117221116/http://eng.kaching.com/2009/10/mysql-unicode-escape-sequences.html though, you can see the following example, which does actually seem to work, but requires you to know the actual byte-by-byte UTF8 encoding:

You can also use the variable-length UTF-8 representation (convenient when, for example, copying from a utf-8 URL-encoded value like %E2%80%98).
mysql> select _utf8 x'E28098';  
+---+  
| ‘ |  
+---+  

OTHER TIPS

This stored function provides the functionality MySQL is (apparently) missing, with a way to turn a literal code point into a character without having to already know the UTF-8 encoding.

If VARCHAR(1) seems strange, since utf8 characters in MySQL can be up to 3 bytes long, remember the size of VARCHAR is characters, not bytes. The function returns a single UTF-8-encoded character from the input value.

For hexadecimal literals, prepend 0x.

DELIMITER $$

DROP FUNCTION IF EXISTS `utf8_char` $$
CREATE FUNCTION `utf8_char`(v smallint unsigned) RETURNS VARCHAR(1) CHARSET utf8
NO SQL
DETERMINISTIC
BEGIN

-- http://stackoverflow.com/questions/3632410/mysql-unicode-literals/30675371#30675371

RETURN CHAR(CASE
            WHEN v <= 0x7F THEN v
            WHEN v <= 0x7FF THEN 0xC080 | ((v >> 6) << 8) | (v & 0x3F)
            WHEN v <= 0xFFFF THEN 0xE08080 | (((v >> 12) & 0x0F ) << 16)  | (((v >> 6) & 0x3F ) << 8) | (v & 0x3F)
            ELSE NULL END);

END $$

DELIMITER ;

Example output:

mysql> select utf8_char(8592) AS 'leftwards_arrow';
+-----------------+
| leftwards_arrow |
+-----------------+
| ←               |
+-----------------+
1 row in set (0.00 sec)

mysql> select utf8_char(0x2192) AS 'rightwards_arrow_hex';
+----------------------+
| rightwards_arrow_hex |
+----------------------+
| →                    |
+----------------------+
1 row in set (0.00 sec)

If the goal is to specify the code point instead of the encoded byte sequence (i.e. 0x0F02 instead of the UTF-8 0xE0BC82 for "༂"), then you need to use an encoding in which the code point value just happens to be the encoded byte sequence. For example, "0xE28098" is the UTF-8 encoded byte sequence for the " ‘ " character (as shown in dkamins's answer), which is code point U+2018. However, 0x2018 is both the code point value for ‘ and the encoded byte sequence for ucs2 / utf16 (they are effectively the same encoding for BMP characters, but I prefer to use "utf16" as it is consistent with "utf8" and "utf32", consistent in the "utf" theme). Hence:

_utf16 0x2018

returns the same ‘ character as:

_utf8 0xE0BC82

But, utf16 only works for BMP characters (code points U+0000 - U+FFFF) in terms of specifying the code point value. If you want a Supplementary Character (by specifying the code point instead of a specific encoding's sequence of bytes), then you will need to use the utf32 encoding. Not only does _utf32 0x2018 return ‘, but:

_utf32 0x1F47E

returns: 👾

To use either UTF-8 or UTF-16 encodings for that same Supplementary Character would require the following:

_utf8mb4 0xF09F91BE

_utf16 0xD83DDC7E

HOWEVER, if you are having trouble adding this to a string that is already utf8, then you will need to convert this into utf8 (or into utf8mb4 when creating Supplementary Characters as the utf8 encoding / charset can only handle BMP characters):

CONVERT(_utf32 0x1F47E USING utf8mb4)

Or, using the example character from Michael - sqlbot's answer:

CONVERT(_utf32 0x2192 USING utf8)

returns a →. Hence, a custom function is not needed in order to create a UTF-8 encoded character from its code point (at least not as of MySQL 8.0). Here is a test query

SELECT _utf32 0x1F47E AS "Supplementary Character in utf32",
       CONVERT(_utf32 0x1F47E USING utf8mb4) AS "Supplementary Character in utf8mb4",
       CHARSET(CONVERT(_utf32 0x1F47E USING utf8mb4)) AS "Proof",

       "---" AS "---",

       _utf32 0x2192 AS "BMP character in utf32",
       CONVERT(_utf32 0x2192 USING utf8) AS "BMP character in utf8",
       CHARSET(CONVERT(_utf32 0x2192 USING utf8)) AS "Proof";

And you can see it working on db<>fiddle (might not work in pre-8.0 MySQL).

For more details on these options, plus Unicode escape sequences for other languages and platforms, please see my post:

Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow