I have a database which stores video game names with Unicode characters but I can't figure out how to properly escape these Unicode characters when printing them to an HTML response.

For instance, when I print all games with a name like Uncharted, I get this:

Uncharted: Drake's Fortuneâ„¢
Uncharted 2: Among Thievesâ„¢
Uncharted 3: Drake's Deceptionâ„¢

but it should display this:

Uncharted: Drake's Fortune™
Uncharted 2: Among Thieves™
Uncharted 3: Drake's Deception™

I ran a quick JavaScript escape function to see which Unicode character the ™ is and found that it's \u2122.

I don't have a problem fully escaping every character in the string if I can get the character to display correctly. My guess is to somehow find the hex representation of each character in the string and have PHP render the Unicode characters like this:

print "&#x2122;";

Please guide me through the best approach for Unicode-escaping a string so that it is HTML friendly. I've done something similar for JavaScript a while back, but JavaScript has built-in functions for escape and unescape.

I'm not aware of any PHP functions with similar functionality, however. I have read about the ord function, but it just returns the ASCII character code for a given character, hence the improper display of the ™. I would like this function to be versatile enough to apply to any string containing valid Unicode characters.


Solution

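It looks like you have UTF-8 encoded strings internally and PHP outputs them properly, but your browser fails to auto-detect the encoding (it falls back to ISO 8859-1 or some other single-byte encoding). The ™ character is three bytes in UTF-8 (E2 84 A2), and when those bytes are read as Windows-1252 they display as â„¢.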
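To see where the three stray characters come from, it can help to look at the bytes directly. The following is only a sketch: it assumes the mbstring extension is loaded and that the browser guessed Windows-1252, which is an inference from the output shown in the question. The variable names are illustrative.

```php
<?php
// U+2122 (™), written as an escape so the file's own encoding doesn't matter.
$tm = "\u{2122}";

// In UTF-8 the character occupies three bytes.
echo bin2hex($tm), "\n"; // e284a2

// Re-read those three bytes as if each one were a Windows-1252 character,
// which is roughly what a browser does when it guesses the wrong charset.
$misread = mb_convert_encoding($tm, 'UTF-8', 'Windows-1252');
echo $misread, "\n";
```

The second line printed should match the garbled suffix from the question, which suggests the stored data is fine and only the declared encoding is wrong.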

The best way is to tell the browser that UTF-8 is being used by sending the corresponding HTTP header:

header("content-type: text/html; charset=UTF-8");  

Then you can leave the rest of your code as-is; there is no need to convert characters to HTML entities or add other workarounds.

If you want, you can additionally declare the encoding in the generated HTML by using the <meta> tag:

  • <meta http-equiv=Content-Type content="text/html; charset=UTF-8"> for HTML <=4.01
  • <meta charset="UTF-8"> for HTML5

The HTTP header takes priority over the <meta> tag, but the latter is useful if the HTML is saved to disk and then opened locally.
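Putting the header and the <meta> tag together, a page might look roughly like the sketch below. The function name render_game_list and the hard-coded names are hypothetical stand-ins for the real database query; htmlspecialchars is used only to escape characters that are special in HTML, while ™ is sent as plain UTF-8.

```php
<?php
// Hypothetical helper: builds a UTF-8 HTML page listing the given game names.
function render_game_list(array $names): string
{
    $items = '';
    foreach ($names as $name) {
        // Escapes &, <, >, " and ' only; ™ passes through as raw UTF-8 bytes.
        $items .= '<li>' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . "</li>\n";
    }
    return "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"UTF-8\">\n"
         . "<title>Games</title>\n</head>\n<body>\n<ul>\n"
         . $items . "</ul>\n</body>\n</html>\n";
}

// The header must be sent before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo render_game_list(["Uncharted: Drake's Fortune\u{2122}", "Uncharted 2: Among Thieves\u{2122}"]);
```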

Other answers

I spent a lot of time looking for a better way to just print the character for a given Unicode code point, and the methods I found either didn't work or were very complicated.

That said, JSON can represent Unicode characters using the syntax "\uXXXX" (four hex digits), so:

echo json_decode('"\u00e1"'); 

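This prints the corresponding Unicode character, in this case: á.

P.S. Note the single and the double quotes. The outer single quotes keep the backslash sequence literal in PHP, and the inner double quotes make the value a valid JSON string. It won't work without both.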
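On PHP 7.0 and later the same character can also be written without going through JSON. The sketch below compares the JSON trick with the "\u{...}" string escape and with html_entity_decode. It assumes PHP 7 or newer, and the variable names are illustrative.

```php
<?php
// Unicode code point escape in a double-quoted string (PHP 7.0+).
$a = "\u{00e1}";

// The JSON approach, for comparison.
$b = json_decode('"\u00e1"');

// Decoding a numeric HTML character reference into UTF-8 bytes.
$c = html_entity_decode('&#xE1;', ENT_QUOTES, 'UTF-8');

var_dump($a === $b, $b === $c); // both comparisons hold
echo $a, "\n";
```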

Try this:

echo htmlentities("Uncharted: Drake's Fortune™ \n", ENT_QUOTES, "UTF-8");

From: http://php.net/htmlentities
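If the goal really is numeric character references such as &#8482; rather than named entities, one possible route is mb_encode_numericentity. This is a sketch, assuming the mbstring extension is available and the input is valid UTF-8; to_numeric_entities is a hypothetical helper name.

```php
<?php
// Hypothetical helper: escape HTML-special characters, then turn every
// non-ASCII character into a decimal numeric character reference.
function to_numeric_entities(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo to_numeric_entities("Uncharted 2: Among Thieves\u{2122}"), "\n";
// Uncharted 2: Among Thieves&#8482;
```

The result is pure ASCII, so it displays correctly whatever charset the browser assumes, at the cost of a larger and less readable page source.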

// PHP 7.0+ (IntlChar requires the intl extension)
var_dump(
    IntlChar::chr(0x2122),
    IntlChar::chr(0x1F638)
);

var_dump(
    utf8_chr(0x2122),
    utf8_chr(0x1F638)
);

function utf8_chr($cp) {

    if (!is_int($cp)) {
        throw new InvalidArgumentException("$cp is not an integer");
    }

    // UTF-8 prohibits the surrogate code points U+D800 through U+DFFF
    // https://tools.ietf.org/html/rfc3629#section-3
    //
    // Q: Are there any 16-bit values that are invalid?
    // http://unicode.org/faq/utf_bom.html#utf16-7

    if ($cp < 0 || (0xD7FF < $cp && $cp < 0xE000) || 0x10FFFF < $cp) {
        throw new OutOfRangeException("$cp is out of range");
    }

    if ($cp < 0x10000) {
        return json_decode('"\u'.bin2hex(pack('n', $cp)).'"');
    }

    // Q: Isn’t there a simpler way to do this?
    // http://unicode.org/faq/utf_bom.html#utf16-4
    $lead = 0xD800 - (0x10000 >> 10) + ($cp >> 10);
    $trail = 0xDC00 + ($cp & 0x3FF);

    return json_decode('"\u'.bin2hex(pack('n', $lead)).'\u'.bin2hex(pack('n', $trail)).'"');
}
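
On PHP 7.2 or later with mbstring, mb_chr and mb_ord cover the same ground as the hand-written function above, and mb_ord also shows why ord was not enough in the question. A brief sketch under those assumptions:

```php
<?php
// Code point -> UTF-8 string.
$tm  = mb_chr(0x2122, 'UTF-8');
$cat = mb_chr(0x1F638, 'UTF-8');

// UTF-8 string -> code point. ord() only looks at the first byte.
printf("U+%04X U+%04X\n", mb_ord($tm, 'UTF-8'), mb_ord($cat, 'UTF-8'));
printf("first byte via ord(): 0x%02X\n", ord($tm));
```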
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow