Question

I have done a lot of PHP programming over the last few years, and one thing that keeps annoying me is the weak support for Unicode and multibyte strings (to be sure, natively there is none). For example, "htmlentities" seems to be a much-used function in the PHP world, and I find it absolutely annoying when you've put effort into keeping every string localizable, storing only UTF-8 in your database, delivering only UTF-8 webpages, etc. Suddenly, somewhere between your database and the browser, there's this hopelessly naive function that pretends every byte is a character and messes everything up.

I would love to just dump this kind of function; it seems totally superfluous. Is it still necessary these days to write '&auml;' instead of 'ä'? At least my Firefox seems perfectly happy to display even the strangest Asian glyphs as long as they're served in a proper encoding.

Update: To be more precise: Are named entities necessary for anything other than displaying HTML tags (as in "&lt;" for "<")?

Update 2:

@Konrad: Are you saying that, no, named entities are not needed?

@Ross: But wouldn't it be better to sanitize user input when it's entered, to keep my output logic free from such issues? (assuming of course, that reliable sanitizing on input is possible - but then, if it isn't, can it be on output?)

Solution

Named entities in "real" XHTML (i.e. served as application/xhtml+xml, rather than in the more frequently used text/html compatibility mode) are discouraged. Aside from the five defined in XML itself (&lt;, &gt;, &amp;, &quot;, &apos;), they'd all have to be defined in the DTD of the particular DocType you're using. That means your browser has to explicitly support that DocType, which is far from a given. Numeric entities, on the other hand, need no DTD at all: the number is the Unicode code point of the character.
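If you have existing markup full of named entities and want to serve it as real XHTML, one option is to swap them for the raw UTF-8 characters while leaving the five XML-defined entities alone. The following is only a rough sketch: the function name named_entities_to_utf8 is made up for illustration, and it assumes the input is already UTF-8.

```php
<?php
// Hypothetical helper (the name is an assumption, not a PHP built-in):
// replace HTML named entities with the raw UTF-8 character, but keep the
// five entities that XML itself defines.
function named_entities_to_utf8(string $markup): string
{
    $xmlPredefined = ['lt', 'gt', 'amp', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0]; // valid in any XML document, leave as-is
            }
            // html_entity_decode() returns unknown names unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $markup
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $markup;
}

echo named_entities_to_utf8('K&auml;se &amp; Brot &lt;3'), "\n";
// prints: Käse &amp; Brot &lt;3
```

The five XML entities are skipped on purpose: decoding &lt; or &amp; would change the meaning of the markup rather than just its spelling.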

As for whether you need entities at all these days: you can pretty much expect any modern browser to support UTF-8. Therefore, as long as you can guarantee that the database, the markup and the web server all agree to serve that, ditch the entities.
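In PHP terms, ditching the entities usually means using htmlspecialchars instead of htmlentities, so that only the characters that are significant to markup get escaped and everything else goes out as raw UTF-8. A minimal sketch, assuming your strings are already UTF-8 (the wrapper name escape_html is made up for illustration):

```php
<?php
// Hypothetical wrapper (the name is an assumption): escape only the
// markup-significant characters and leave all other text as raw UTF-8.
function escape_html(string $text): string
{
    // ENT_QUOTES also escapes single quotes; ENT_SUBSTITUTE replaces invalid
    // UTF-8 byte sequences with U+FFFD instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Tell the browser which encoding is being served, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo escape_html('<b>Käse & Brot</b>'), "\n";
// prints: &lt;b&gt;Käse &amp; Brot&lt;/b&gt;
```

Passing the charset explicitly avoids depending on the default, which differs between PHP versions and configurations.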

OTHER TIPS

If using XHTML, it's actually recommended not to use named entities ([citation needed]). Some browsers (Firefox …), when parsing this as XML (which they normally don't), don't read the DTD files and thus are unable to handle the entities.

As it's best practice anyway to use UTF-8 as the encoding unless there are compelling reasons to do otherwise, this only means that the creator of the documents needs a decent editor that can not only handle the documents but also provides a good way of entering the diverse glyphs. OS X doesn't really have this problem, because most of the needed glyphs can be reached via “alt” keys, but Windows doesn't have this feature.


@Konrad: Are you saying that, no, named entities are not needed?

Precisely. Unless, of course, there are silly restrictions, e.g. legacy database drivers that choke on UTF-8 etc.

Safari seems to have issues with some glyphs but not others. Entities may not be strictly needed, but it's probably best to use them anyway. Of course, this is just my opinion, backed up by nothing but my own observations.

Licensed under: CC-BY-SA with attribution