Question

According to this list: http://mcdlr.com/8/, the special character ▶ has the HTML entity &#9654;. I therefore expected the PHP function htmlentities() to convert ▶ to &#9654;. However, this is what I get when I run a string containing that character through the function and store it in a MySQL database:

â–¶

I have set <meta charset="utf-8"> on the page that sends the string, and I even tried adding header('Content-Type: text/html; charset=utf-8'); to the PHP file that processes the string, but it doesn't help. What am I doing wrong?

Thanks in advance.

Solution

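When dealing with UTF-8 characters, the key is that every layer that touches the string (browser, PHP, database connection, table) must use UTF-8. If any one layer assumes a single-byte encoding instead, typically ISO-8859-1 or Windows-1252, each byte of the multi-byte sequence is read as a separate character, which is how ▶ turns into â–¶.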
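The garbled output in the question can be reproduced with a minimal sketch like the one below. It assumes the mbstring extension is loaded and that the misreading layer uses Windows-1252 (an assumption, chosen because the middle character is an en dash). The variable names are only illustrative.

```php
<?php
// U+25B6 (▶) written with an escape so the example does not depend
// on how this file itself is saved.
$arrow = "\u{25B6}";

// In UTF-8 the character is three bytes long.
echo bin2hex($arrow), "\n"; // e296b6

// Pretend those three bytes are Windows-1252 text and convert them
// "again" to UTF-8. Each byte becomes its own character.
$mojibake = mb_convert_encoding($arrow, 'UTF-8', 'Windows-1252');

echo $mojibake, "\n";            // â–¶
echo mb_strlen($mojibake, 'UTF-8'), "\n"; // 3
```

If the stored value looks like this, the data was most likely valid UTF-8 when it left the browser and was misinterpreted somewhere later in the chain.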

Make sure you check:

  • The character set (and matching collation) of the table column in the database
  • If the value is hard-coded into the PHP file, make sure the file is saved in UTF-8 format
  • If the data comes from the browser, make sure the PHP Content-Type header is for UTF-8 encoding. Typically you can leave out the <meta charset> in the HTML since browsers will use the HTTP header if it is received.
  • The connection to the database must specify the encoding, like this:

$dbc = new PDO('mysql:host=localhost;dbname=****;charset=utf8;', '******', '*****');
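As a slightly fuller sketch of the connection step, the helpers below build the DSN separately so the charset is easy to see and to check. The function names, the database name and the choice of utf8mb4 are assumptions made for illustration. MySQL's utf8 covers characters of up to three bytes, which is enough for ▶, while utf8mb4 also covers four-byte characters such as emoji.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that pins the connection charset.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

// Hypothetical helper: opens the connection with exceptions enabled,
// so a failure is reported instead of passing silently.
function connectUtf8(string $host, string $dbName, string $user, string $pass): PDO
{
    return new PDO(buildMysqlDsn($host, $dbName), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Only the DSN is printed here; connectUtf8() needs a real server.
echo buildMysqlDsn('localhost', 'example_db'), "\n";
```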

Edit:

I think the htmlentities manual page might be a bit misleading:

htmlentities — Convert all applicable characters to HTML entities

I think it should say, "Convert all applicable characters available in the translation table to HTML entities". Not every character is in the translation table, and any character that is missing from it will not be converted into an HTML entity. To view which characters are in your translation table, see get_html_translation_table().

For example, doing:

print_r( get_html_translation_table(HTML_ENTITIES));

will output:

Array
(
    ["] => &quot;
    [&] => &amp;
    [<] => &lt;
    [>] => &gt;
    [ ] => &nbsp;
    [¡] => &iexcl;
    [¢] => &cent;
    [£] => &pound;
    [¤] => &curren;
    [¥] => &yen;
    [¦] => &brvbar;
    [§] => &sect;
    [¨] => &uml;
    [©] => &copy;
    [ª] => &ordf;
    [«] => &laquo;
    [¬] => &not;
    [­] => &shy;
    [®] => &reg;
    [¯] => &macr;
    [°] => &deg;
    [±] => &plusmn;
    [²] => &sup2;
    [³] => &sup3;
    [´] => &acute;
    [µ] => &micro;
    [¶] => &para;
    [·] => &middot;
    [¸] => &cedil;
    [¹] => &sup1;
    [º] => &ordm;
    [»] => &raquo;
    [¼] => &frac14;
    [½] => &frac12;
    [¾] => &frac34;
    [¿] => &iquest;
    [À] => &Agrave;
    [Á] => &Aacute;
    [Â] => &Acirc;
    [Ã] => &Atilde;
    [Ä] => &Auml;
    [Å] => &Aring;
    [Æ] => &AElig;
    [Ç] => &Ccedil;
    [È] => &Egrave;
    [É] => &Eacute;
    [Ê] => &Ecirc;
    [Ë] => &Euml;
    [Ì] => &Igrave;
    [Í] => &Iacute;
    [Î] => &Icirc;
    [Ï] => &Iuml;
    [Ð] => &ETH;
    [Ñ] => &Ntilde;
    [Ò] => &Ograve;
    [Ó] => &Oacute;
    [Ô] => &Ocirc;
    [Õ] => &Otilde;
    [Ö] => &Ouml;
    [×] => &times;
    [Ø] => &Oslash;
    [Ù] => &Ugrave;
    [Ú] => &Uacute;
    [Û] => &Ucirc;
    [Ü] => &Uuml;
    [Ý] => &Yacute;
    [Þ] => &THORN;
    [ß] => &szlig;
    [à] => &agrave;
    [á] => &aacute;
    [â] => &acirc;
    [ã] => &atilde;
    [ä] => &auml;
    [å] => &aring;
    [æ] => &aelig;
    [ç] => &ccedil;
    [è] => &egrave;
    [é] => &eacute;
    [ê] => &ecirc;
    [ë] => &euml;
    [ì] => &igrave;
    [í] => &iacute;
    [î] => &icirc;
    [ï] => &iuml;
    [ð] => &eth;
    [ñ] => &ntilde;
    [ò] => &ograve;
    [ó] => &oacute;
    [ô] => &ocirc;
    [õ] => &otilde;
    [ö] => &ouml;
    [÷] => &divide;
    [ø] => &oslash;
    [ù] => &ugrave;
    [ú] => &uacute;
    [û] => &ucirc;
    [ü] => &uuml;
    [ý] => &yacute;
    [þ] => &thorn;
    [ÿ] => &yuml;
    [Œ] => &OElig;
    [œ] => &oelig;
    [Š] => &Scaron;
    [š] => &scaron;
    [Ÿ] => &Yuml;
    [ƒ] => &fnof;
    [ˆ] => &circ;
    [˜] => &tilde;
    [Α] => &Alpha;
    [Β] => &Beta;
    [Γ] => &Gamma;
    [Δ] => &Delta;
    [Ε] => &Epsilon;
    [Ζ] => &Zeta;
    [Η] => &Eta;
    [Θ] => &Theta;
    [Ι] => &Iota;
    [Κ] => &Kappa;
    [Λ] => &Lambda;
    [Μ] => &Mu;
    [Ν] => &Nu;
    [Ξ] => &Xi;
    [Ο] => &Omicron;
    [Π] => &Pi;
    [Ρ] => &Rho;
    [Σ] => &Sigma;
    [Τ] => &Tau;
    [Υ] => &Upsilon;
    [Φ] => &Phi;
    [Χ] => &Chi;
    [Ψ] => &Psi;
    [Ω] => &Omega;
    [α] => &alpha;
    [β] => &beta;
    [γ] => &gamma;
    [δ] => &delta;
    [ε] => &epsilon;
    [ζ] => &zeta;
    [η] => &eta;
    [θ] => &theta;
    [ι] => &iota;
    [κ] => &kappa;
    [λ] => &lambda;
    [μ] => &mu;
    [ν] => &nu;
    [ξ] => &xi;
    [ο] => &omicron;
    [π] => &pi;
    [ρ] => &rho;
    [ς] => &sigmaf;
    [σ] => &sigma;
    [τ] => &tau;
    [υ] => &upsilon;
    [φ] => &phi;
    [χ] => &chi;
    [ψ] => &psi;
    [ω] => &omega;
    [ϑ] => &thetasym;
    [ϒ] => &upsih;
    [ϖ] => &piv;
    [ ] => &ensp;
    [ ] => &emsp;
    [ ] => &thinsp;
    [‌] => &zwnj;
    [‍] => &zwj;
    [‎] => &lrm;
    [‏] => &rlm;
    [–] => &ndash;
    [—] => &mdash;
    [‘] => &lsquo;
    [’] => &rsquo;
    [‚] => &sbquo;
    [“] => &ldquo;
    [”] => &rdquo;
    [„] => &bdquo;
    [†] => &dagger;
    [‡] => &Dagger;
    [•] => &bull;
    […] => &hellip;
    [‰] => &permil;
    [′] => &prime;
    [″] => &Prime;
    [‹] => &lsaquo;
    [›] => &rsaquo;
    [‾] => &oline;
    [⁄] => &frasl;
    [€] => &euro;
    [ℑ] => &image;
    [℘] => &weierp;
    [ℜ] => &real;
    [™] => &trade;
    [ℵ] => &alefsym;
    [←] => &larr;
    [↑] => &uarr;
    [→] => &rarr;
    [↓] => &darr;
    [↔] => &harr;
    [↵] => &crarr;
    [⇐] => &lArr;
    [⇑] => &uArr;
    [⇒] => &rArr;
    [⇓] => &dArr;
    [⇔] => &hArr;
    [∀] => &forall;
    [∂] => &part;
    [∃] => &exist;
    [∅] => &empty;
    [∇] => &nabla;
    [∈] => &isin;
    [∉] => &notin;
    [∋] => &ni;
    [∏] => &prod;
    [∑] => &sum;
    [−] => &minus;
    [∗] => &lowast;
    [√] => &radic;
    [∝] => &prop;
    [∞] => &infin;
    [∠] => &ang;
    [∧] => &and;
    [∨] => &or;
    [∩] => &cap;
    [∪] => &cup;
    [∫] => &int;
    [∴] => &there4;
    [∼] => &sim;
    [≅] => &cong;
    [≈] => &asymp;
    [≠] => &ne;
    [≡] => &equiv;
    [≤] => &le;
    [≥] => &ge;
    [⊂] => &sub;
    [⊃] => &sup;
    [⊄] => &nsub;
    [⊆] => &sube;
    [⊇] => &supe;
    [⊕] => &oplus;
    [⊗] => &otimes;
    [⊥] => &perp;
    [⋅] => &sdot;
    [⌈] => &lceil;
    [⌉] => &rceil;
    [⌊] => &lfloor;
    [⌋] => &rfloor;
    [〈] => &lang;
    [〉] => &rang;
    [◊] => &loz;
    [♠] => &spades;
    [♣] => &clubs;
    [♥] => &hearts;
    [♦] => &diams;
)

So any character not listed above will not be converted to an entity. Note that if you set the ENT_HTML5 flag, the translation table will be about 10 times larger; however, it still does not contain (at least on my server) an entity for ▶. It only has named entities.
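Rather than scanning the printed table by eye, you can look a character up directly. This is a small sketch that assumes PHP 7 or later (for the "\u{...}" escapes) and passes the flags and encoding explicitly so the result does not depend on the server defaults.

```php
<?php
$arrow = "\u{25B6}"; // ▶, not in the HTML 4.01 table
$rarr  = "\u{2192}"; // →, listed above as &rarr;

$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

var_dump(isset($table[$arrow])); // bool(false)
var_dump($table[$rarr]);         // string(6) "&rarr;"

// htmlentities() therefore leaves ▶ alone and only converts →.
$out = htmlentities($arrow . ' ' . $rarr, ENT_QUOTES | ENT_HTML401, 'UTF-8');
echo $out, "\n"; // ▶ &rarr;
```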

If you need to convert all characters to their respective entities, you can use the following function (Disclaimer, I did not write it. Here is the original source: http://php.net/htmlentities#107985):

// Unicode-proof htmlentities.
// Returns 'normal' chars as chars and weirdos as numeric HTML entities.
// Requires the mbstring extension.
function superentities( $str ){
    // get rid of existing entities, else they would be double-escaped
    $str = html_entity_decode(stripslashes($str),ENT_QUOTES,'UTF-8');
    // split into an array of single (possibly multi-byte) characters
    $ar = preg_split('/(?<!^)(?!$)/u', $str );
    $str2 = '';  // initialise the result so the first .= does not raise a notice
    foreach ($ar as $c){
        $o = ord($c);
        if ( (strlen($c) > 1) || /* multi-byte [unicode] */
            ($o <32 || $o > 126) || /* <- control / latin weirdos -> */
            ($o >33 && $o < 40) ||/* quotes + ampersand */
            ($o >59 && $o < 63) /* html */
        ) {
            // convert to numeric entity
            $c = mb_encode_numericentity($c,array (0x0, 0xffff, 0, 0xffff), 'UTF-8');
        }
        $str2 .= $c;
    }
    return $str2;
}

So using the example ▶, you can do:

var_dump(superentities('▶')); // outputs string(7) "&#9654;"
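If the per-character loop is more than you need, a shorter variant is sketched below. It is a different technique from the function above, not a drop-in replacement: it escapes the HTML special characters with htmlspecialchars() and then turns every non-ASCII code point into a numeric entity in one call, and it does not decode existing entities first. The function name is hypothetical and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: ASCII stays readable, everything above U+007F
// becomes a decimal numeric entity.
function numericEntities(string $text): string
{
    // Escape &, <, > and both quote styles first.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo numericEntities("\u{25B6} <b>"), "\n"; // &#9654; &lt;b&gt;
```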

However, with all that said, I would recommend that you store everything in your database without encoding it. Typically it is preferred to encode appropriately before outputting to the browser. That way if you ever need to change the way you encode it, you won't have to decode it and re-encode it in some other way. To do that, you will have to make sure all of the encodings are correctly set to UTF-8 as mentioned in my original answer.
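A minimal sketch of that approach: the value is kept as raw UTF-8 and only escaped at the moment it is written into HTML. The helper name and the sample string are made up for illustration, and the page itself is assumed to be served as UTF-8.

```php
<?php
// Hypothetical output helper: escapes only the characters that are
// special in HTML and leaves ▶ and other UTF-8 text untouched.
function escapeHtml(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Stands in for a value read back from the database, stored unencoded.
$stored = "\u{25B6} Play <now>";

echo '<p>', escapeHtml($stored), '</p>', "\n"; // <p>▶ Play &lt;now&gt;</p>
```

Because the stored value is never entity-encoded, the same row can later be sent as JSON, plain text or HTML without a decoding step.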

Licensed under: CC-BY-SA with attribution