Converting Unicode reference to UTF-8 character in PHP with mbstring

https://stackoverflow.com/questions/8155401

02-03-2021
|

Question

I have a set of data inside a database which has been input with unicode characters, but they were interpreted as a string. That is, where there should be an apostrophe ’ I've actually got \u2019

So I now need to convert this into its character representation, which is ’. Firstly it is quite easy to change the string into its entity version: ’, then I need to turn it into the correct UTF-8 multibyte string.

I have attempted to do this in a number of ways; on my local server I can exctract the characters with a preg_match function and then pass each to the following function:

mb_convert_encoding($string, "UTF-8", "HTML-ENTITIES");

Sounds quite sensible, and works without issue. Turning off the UTF-8 charset in the browser shows that this has actually converted into â€™ when read by the browser default encoding.

However, the exact same code when run in my production environment produces the dreaded "missing symbol" box when rendered as UTF-8. Turning off UTF-8 and it has produced whatever byte stream renders as ò°‘£. It appears to be outputting 4 bytes rather than 3, I don't know if that is relevant as I'm not well read on character encoding.

I assume that the issue is with my mbstring settings. Here are the mbstring settings from my local server:

Multibyte Support   enabled
Multibyte string engine libmbfl
HTTP input encoding translation disabled
Multibyte (japanese) regex support  enabled
Multibyte regex (oniguruma) version 4.7.1
mbstring.detect_order   no value    no value
mbstring.encoding_translation   Off Off
mbstring.func_overload  0   0
mbstring.http_input auto    auto
mbstring.http_output    UTF-8   UTF-8
mbstring.http_output_conv_mimetypes ^(text/|application/xhtml\+xml)^(text/|application/xhtml\+xml)
mbstring.internal_encoding  UTF-8   UTF-8
mbstring.language   neutral neutral
mbstring.strict_detection   Off Off
mbstring.substitute_character   no value    no value

There are a few differences on my production environment:

Multibyte Support   enabled
Multibyte string engine libmbfl
Multibyte (japanese) regex support  enabled
Multibyte regex (oniguruma) version 3.7.1
mbstring.detect_order   no value    no value
mbstring.encoding_translation   Off Off
mbstring.func_overload  0   0
mbstring.http_input auto    auto
mbstring.http_output    UTF-8   UTF-8
mbstring.internal_encoding  UTF-8   UTF-8
mbstring.language   neutral neutral
mbstring.strict_detection   Off Off
mbstring.substitute_character   no value    no value

Anyone see what I'm doing wrong?

Solution

See if this can help you: hex2ascii and ascii2hex

ADDED on 09-19-2012:

function ascii2hex($ascii)
{
    $hex = '';
    for ($i = 0; $i < strlen($ascii); $i++)
    {
        $byte = strtoupper(dechex(ord($ascii{$i})));
        $byte = str_repeat('0', 2 - strlen($byte)).$byte;
        $hex .= $byte." ";
    }
    return $hex;
}

function hex2ascii($hex)
{
    $ascii = '';
    $hex = str_replace(" ", "", $hex);
    for($i = 0; $i < strlen($hex); $i = $i+2)
        $ascii .= chr(hexdec(substr($hex, $i, 2)));

    return($ascii);
}

OTHER TIPS

I guess what you're looking for, are multibyte versions of ord and chr.

I wrote the following polyfill for that :

if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        return ($from_encoding === NULL) ? iconv_get_encoding() : iconv_set_encoding($encoding);
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

Demo

echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));

Output:

Get string from numeric DEC value
string(3) "我"
string(3) "好"

Get string from numeric HEX value
string(3) "我"
string(3) "好"

Get numeric value of character as DEC string
int(25105)
int(22909)

Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow