Converting Unicode reference to UTF-8 character in PHP with mbstring
02-03-2021
Question
I have a set of data in a database that was entered with Unicode characters, but the escape sequences were stored as literal strings. That is, where there should be an apostrophe ’ I've actually got \u2019. So I now need to convert this into its character representation, which is ’. Firstly, it is quite easy to change the string into its entity version, &#x2019;, then I need to turn it into the correct UTF-8 multibyte string.
I have attempted to do this in a number of ways; on my local server I can extract the characters with a preg_match function and then pass each to the following function:
mb_convert_encoding($string, "UTF-8", "HTML-ENTITIES");
Sounds quite sensible, and works without issue. Turning off the UTF-8 charset in the browser shows that this has actually been converted into â€™ (the three UTF-8 bytes of ’) when read by the browser default encoding.
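As a point of comparison, the escapes can also be decoded without going through HTML entities at all. The following is only a sketch under a few assumptions: the helper name decode_unicode_escapes is invented for illustration, the stored text uses the JSON-style \uXXXX form (four hex digits, UTF-16 code units), and mbstring is loaded. It treats each run of escapes as UTF-16BE and converts it to UTF-8 in one step, which also keeps surrogate pairs together.

```php
<?php
// Hypothetical helper (not from the original post): turns literal \uXXXX
// escapes into UTF-8 characters, bypassing the HTML-ENTITIES conversion.
function decode_unicode_escapes(string $text): string
{
    $result = preg_replace_callback(
        // One or more consecutive escapes, so surrogate pairs stay in one match.
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // Drop the "\u" markers; what remains is a string of hex digits.
            $hex = str_replace('\\u', '', $m[0]);
            // hex2bin() yields big-endian 16-bit units, i.e. UTF-16BE bytes.
            return mb_convert_encoding(hex2bin($hex), 'UTF-8', 'UTF-16BE');
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

echo bin2hex(decode_unicode_escapes('It\u2019s')), "\n"; // 4974e2809973
```

The hex dump shows the apostrophe as the three bytes e2 80 99, which is what a correct UTF-8 encoding of U+2019 looks like.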
However, the exact same code when run in my production environment produces the dreaded "missing symbol" box when rendered as UTF-8. Turning off UTF-8 shows that it has produced whatever byte stream renders as ò°‘£. It appears to be outputting 4 bytes rather than 3; I don't know if that is relevant, as I'm not well read on character encoding.
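For reference, the byte count is a useful clue: U+2019 lies below U+10000, so a correct UTF-8 encoding of it is exactly three bytes, and a four-byte sequence would normally belong to a code point above U+FFFF. The sketch below encodes a code point by hand to make the byte layout visible. The function name codepoint_to_utf8 is made up for this illustration, and it does no validation of surrogates or out-of-range values.

```php
<?php
// Hypothetical helper for illustration: encode one Unicode code point as
// UTF-8 manually, to show how many bytes a given character should occupy.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$bytes = codepoint_to_utf8(0x2019);
echo strlen($bytes), ' bytes: ', bin2hex($bytes), "\n"; // 3 bytes: e28099
```

Comparing this against a hex dump of the production output may help show whether the wrong code point was encoded or the right one was encoded incorrectly.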
I assume that the issue is with my mbstring settings. Here are the mbstring settings from my local server:
Multibyte Support enabled
Multibyte string engine libmbfl
HTTP input encoding translation disabled
Multibyte (japanese) regex support enabled
Multibyte regex (oniguruma) version 4.7.1
mbstring.detect_order no value no value
mbstring.encoding_translation Off Off
mbstring.func_overload 0 0
mbstring.http_input auto auto
mbstring.http_output UTF-8 UTF-8
mbstring.http_output_conv_mimetypes ^(text/|application/xhtml\+xml) ^(text/|application/xhtml\+xml)
mbstring.internal_encoding UTF-8 UTF-8
mbstring.language neutral neutral
mbstring.strict_detection Off Off
mbstring.substitute_character no value no value
There are a few differences on my production environment:
Multibyte Support enabled
Multibyte string engine libmbfl
Multibyte (japanese) regex support enabled
Multibyte regex (oniguruma) version 3.7.1
mbstring.detect_order no value no value
mbstring.encoding_translation Off Off
mbstring.func_overload 0 0
mbstring.http_input auto auto
mbstring.http_output UTF-8 UTF-8
mbstring.internal_encoding UTF-8 UTF-8
mbstring.language neutral neutral
mbstring.strict_detection Off Off
mbstring.substitute_character no value no value
Anyone see what I'm doing wrong?
Solution
See if this can help you: hex2ascii and ascii2hex
ADDED on 09-19-2012:
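// Returns the bytes of $ascii as upper-case hex pairs, each followed by a space.
function ascii2hex($ascii)
{
    $hex = '';
    for ($i = 0; $i < strlen($ascii); $i++)
    {
        // Hex value of the current byte (square-bracket offset works on all PHP versions).
        $byte = strtoupper(dechex(ord($ascii[$i])));
        // Left-pad to two hex digits.
        $byte = str_repeat('0', 2 - strlen($byte)).$byte;
        $hex .= $byte." ";
    }
    return $hex;
}
// Reverses ascii2hex: strips the spaces and turns each hex pair back into a byte.
function hex2ascii($hex)
{
    $ascii = '';
    $hex = str_replace(" ", "", $hex);
    for ($i = 0; $i < strlen($hex); $i = $i + 2)
    {
        $ascii .= chr(hexdec(substr($hex, $i, 2)));
    }
    return $ascii;
}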
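The same byte-level inspection can probably be done with PHP built-ins. This is a minimal sketch, assuming the goal is just to see which bytes a conversion produced; the variable names are illustrative only.

```php
<?php
// UTF-8 bytes of U+2019, written as explicit byte escapes.
$utf8 = "\xE2\x80\x99";

// bin2hex() gives lower-case hex; chunk_split() adds a space after every
// pair (including the last one), matching the ascii2hex output format.
$hex = chunk_split(strtoupper(bin2hex($utf8)), 2, ' ');
echo $hex, "\n"; // E2 80 99 (with a trailing space)

// Reverse direction, as in hex2ascii: drop the spaces, then decode.
$back = hex2bin(str_replace(' ', '', $hex));
var_dump($back === $utf8); // bool(true)
```

Running this kind of dump on both servers' output would show exactly which bytes differ.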
OTHER TIPS
I guess what you're looking for are multibyte versions of ord and chr. I wrote the following polyfill for that:
if (!function_exists('mb_internal_encoding')) {
    // iconv-based fallback: gets the internal encoding, or sets it when $encoding is given.
    function mb_internal_encoding($encoding = NULL) {
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}
if (!function_exists('mb_convert_encoding')) {
    // iconv-based fallback; note that iconv() takes the source encoding first.
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}
if (!function_exists('mb_chr')) {
    // Code point -> character: pack as 32-bit big-endian (UCS-4BE), then convert.
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}
if (!function_exists('mb_ord')) {
    // Character -> code point: convert to UCS-4BE, then unpack the integer.
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            // 4 bytes: read as 32-bit big-endian; otherwise read as 16-bit big-endian.
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}
Demo
echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));
Output:
Get string from numeric DEC value
string(3) "我"
string(3) "好"
Get string from numeric HEX value
string(3) "我"
string(3) "好"
Get numeric value of character as DEC int
int(25105)
int(22909)
Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"
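On PHP 7.2 and later, mbstring provides mb_chr and mb_ord natively, so the function_exists guards above leave the built-in versions in place. A short sketch applying them to the code point from the question, assuming such a PHP version with mbstring loaded:

```php
<?php
// Assumes PHP 7.2+ with mbstring, where mb_chr() and mb_ord() are built in.
$char = mb_chr(0x2019, 'UTF-8');            // code point taken from \u2019

echo bin2hex($char), "\n";                  // e28099
echo dechex(mb_ord($char, 'UTF-8')), "\n";  // 2019
```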