Normalizing Unicode according to the W3C in PHP

https://stackoverflow.com/questions/8832468

15-04-2021

Question

While validating my website's HTML code in the W3C validator I got the following warning:

Line 157, Column 220: Text run is not in Unicode Normalization Form C.

…i͈̭̋ͥ̂̿̄̋̆ͣv̜̺̋̽͛̉͐̀͌̚e͖̼̱ͣ̓ͫ͆̍̄̍͘-̩̬̰̮̯͇̯͆̌ͨ́͌ṁ̸͖̹͎̱̙̱͟͡i̷̡͌͂͏̘̭̥̯̟n̏͐͌̑̄̃͘͞…

I'm developing it in PHP 5.3.x, so I can use the Normalizer class.

So, in order to fix this, should I use Normalizer::normalize($output) when displaying any input made by a user (e.g. a comment) or should I use Normalizer::normalize($input) for any user input before storing it in the database?

tl;dr: should I use Unicode normalization before storing user input in the database or just when it's displayed?

Solution

It is up to you to decide, on the basis of the purpose and nature of your application, whether to apply normalization upon reading user input, when storing it in a database, when writing it out, or not at all. To summarize the long thread mentioned in the comments to the question (also available in the official list archive at http://validator.w3.org/feedback.html):

The warning message comes from the experimental “HTML5 validation” (which is really a linter, applying subjective rules in addition to some formal tests).
The message is not based on any requirement in HTML5 drafts but on opinions on what might cause problems in some software.
The opinions originally made “HTML5 validation” issue an error message, now a warning.

It is certainly possible, though uncommon, to get unnormalized data as user input. This does not depend on normalization carried out by browsers (they don’t do such things, though they conceivably might in the future) but on input methods and habits. For example, methods of typing the letter ü (u umlaut, or u with diaeresis) tend to produce the character in precomposed form, as normalized. People can produce it as unnormalized, in decomposed form, as letter u followed by combining diaeresis, but they usually have no reason to do so, and most people wouldn’t even know how to do that.
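To check whether a given piece of input is already normalized, something like the following sketch may help. It assumes the intl extension is loaded (it provides the Normalizer class mentioned in the question); the variable names are illustrative only, and the characters are written as UTF-8 byte escapes so that the snippet also works on older PHP versions.

```php
<?php
// Requires the intl extension (provides the Normalizer class).

// "ü" as usually typed: one precomposed code point, U+00FC (UTF-8: C3 BC).
$precomposed = "\xC3\xBC";

// "ü" as "u" followed by U+0308 COMBINING DIAERESIS (UTF-8: 75 CC 88).
$decomposed = "u\xCC\x88";

// isNormalized() checks against Normalization Form C unless told otherwise.
$precomposedIsNfc = Normalizer::isNormalized($precomposed, Normalizer::FORM_C);
$decomposedIsNfc  = Normalizer::isNormalized($decomposed, Normalizer::FORM_C);

var_dump($precomposedIsNfc); // bool(true)
var_dump($decomposedIsNfc);  // bool(false)
```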

If you do string comparisons in your software, they may or may not (depending on the comparison routines used) treat e.g. a precomposed ü as equal to its decomposed form. Simple implementations treat them as different, as they are definitely distinct at the simple character level (Unicode code points).
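The difference is easy to see with PHP's byte-wise === operator. The sketch below assumes the intl extension is available; equalsCanonically() is a hypothetical helper name, not part of any library.

```php
<?php
// Requires the intl extension (provides the Normalizer class).

// Hypothetical helper: compare two UTF-8 strings after converting both to NFC.
function equalsCanonically($a, $b)
{
    return Normalizer::normalize($a, Normalizer::FORM_C)
        === Normalizer::normalize($b, Normalizer::FORM_C);
}

$precomposed = "\xC3\xBC";   // U+00FC
$decomposed  = "u\xCC\x88";  // "u" + U+0308

// A plain comparison sees different byte sequences (2 bytes vs. 3 bytes).
$rawEqual = ($precomposed === $decomposed);

// After normalization the two spellings of "ü" compare equal.
$nfcEqual = equalsCanonically($precomposed, $decomposed);

var_dump($rawEqual); // bool(false)
var_dump($nfcEqual); // bool(true)
```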

One reason to normalize at some point, in the writing phase at the latest, is that precomposed characters generally get displayed more reliably. To present a normalized ü, a program just has to pick up a glyph from a font. To present a decomposed ü, a program must either recognize it as canonically equivalent to the normalized ü or write the letter u with a diaeresis symbol properly placed above it, with due attention to the graphic properties of the glyph for u, and many programs fail in this.

On the other hand, in the rare cases where unnormalized data is received as user input, the user may well have a reason to have produced it. They may have the idea that normalized ü and unnormalized ü are distinct and need to be treated as such.
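One way to respect both concerns is to store the input exactly as received and normalize only when writing it out. The following is a minimal sketch of that choice, assuming the intl extension; renderComment() is a hypothetical function name, and falling back to the stored text when normalization fails is just one possible policy.

```php
<?php
// Requires the intl extension (provides the Normalizer class).

// Hypothetical output helper: the stored text is left untouched in the
// database; only the copy sent to the browser is converted to NFC.
function renderComment($stored)
{
    $nfc = Normalizer::normalize($stored, Normalizer::FORM_C);
    if ($nfc === false) {
        // Assumption: if normalization fails, fall back to the stored text.
        $nfc = $stored;
    }
    // Escape after normalizing so the markup-significant characters are safe.
    return htmlspecialchars($nfc, ENT_QUOTES, 'UTF-8');
}

$stored = "u\xCC\x88ber & co";   // typed with a decomposed "ü"
$html   = renderComment($stored);
```

The user's original code points remain available in storage, while the page that reaches the validator is in NFC.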

OTHER TIPS

Strictly speaking, the rules of the web character model are not just that one should normalise to NFC, but that text should be in NFC both before and after any technology that includes text from another mechanism is run. Examples are XML includes, character references and entity references. For example, a&#x0308; would not fit the character model: the markup itself is in NFC, but expanding the character reference turns it into a followed by a combining diaeresis, which is not NFC. Avoiding this is mostly easy in practice, but it's worth noting.
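The character reference case can be reproduced with a short sketch. It assumes the intl extension and uses html_entity_decode() as a stand-in for the expansion a parser would perform; the sample string is the letter a followed by a hexadecimal reference to U+0308.

```php
<?php
// Requires the intl extension (provides the Normalizer class).

// Markup as written: plain ASCII, so it is trivially in NFC.
$markup = 'a&#x0308;';
$markupIsNfc = Normalizer::isNormalized($markup, Normalizer::FORM_C);

// Expanding the reference yields "a" + U+0308, which is not NFC.
$expanded = html_entity_decode($markup, ENT_QUOTES, 'UTF-8');
$expandedIsNfc = Normalizer::isNormalized($expanded, Normalizer::FORM_C);

// Normalizing the expanded text gives the precomposed U+00E4 (UTF-8: C3 A4).
$fixed = Normalizer::normalize($expanded, Normalizer::FORM_C);

var_dump($markupIsNfc);   // bool(true)
var_dump($expandedIsNfc); // bool(false)
```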

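There is an interesting case with U+0338 (combining long solidus overlay). > followed by U+0338 normalises to ≯, and < followed by U+0338 normalises to ≮. The reason it should not be allowed at the start of an element name or as the first character within an element should be clear: normalisation would fuse it with the preceding < or > and destroy the markup delimiter.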

As a rule, it makes no sense to have a piece of text start with a combining character in any case, but this particular example allows for the entire document to be mangled (even if you don't normalise, since something else may).
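The mangling can be demonstrated with a few lines of PHP. This is a sketch only, assuming the intl extension and PCRE with Unicode support; startsWithCombiningMark() is a hypothetical helper, and what to do with text that fails the check is left to the application.

```php
<?php
// Requires the intl extension (Normalizer) and PCRE with Unicode support.

// Hypothetical helper: true if the text begins with a combining mark (\p{M}).
function startsWithCombiningMark($text)
{
    return preg_match('/^\p{M}/u', $text) === 1;
}

$userText = "\xCC\xB8text";             // begins with U+0338 (UTF-8: CC B8)
$fragment = '<b>' . $userText . '</b>';

// Normalizing the whole fragment composes ">" + U+0338 into U+226F
// (UTF-8: E2 89 AF), so the start tag loses its closing ">".
$normalized  = Normalizer::normalize($fragment, Normalizer::FORM_C);
$tagSurvived = strpos($normalized, '<b>') !== false;

// Checking user text before embedding it in markup catches the problem early.
$risky = startsWithCombiningMark($userText);

var_dump($tagSurvived); // bool(false)
var_dump($risky);       // bool(true)
```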

If you are concerned only with the text qua text (digital signatures are of no interest, for example), then normalising on input simplifies the rest of what you do, including your internal use of the text (e.g. searching), so is probably the way to go.
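A sketch of normalising on input might look like this. It assumes the intl extension; normalizeInput() is a hypothetical helper, and throwing an exception when normalization fails is an assumption about how the application wants to treat bad input.

```php
<?php
// Requires the intl extension (provides the Normalizer class).

// Hypothetical helper: convert incoming text to NFC once, at the boundary.
function normalizeInput($input)
{
    $nfc = Normalizer::normalize($input, Normalizer::FORM_C);
    if ($nfc === false) {
        // Assumption: reject input that cannot be normalized.
        throw new InvalidArgumentException('Input could not be normalized');
    }
    return $nfc;
}

// Text typed with a decomposed "ü" and a query typed with a precomposed one.
$rawText  = "Mu\xCC\x88nchen";
$rawQuery = "\xC3\xBC";

// Without normalization, a byte-wise search misses the match.
$rawHit = strpos($rawText, $rawQuery);

// With both sides normalized on input, the same search succeeds.
$stored = normalizeInput($rawText);
$query  = normalizeInput($rawQuery);
$nfcHit = strpos($stored, $query);

var_dump($rawHit); // bool(false)
var_dump($nfcHit); // int(1)
```

With everything in one form from the start, later code such as searching or uniqueness checks can use ordinary string functions.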

See http://www.w3.org/TR/charmod-norm/ for more.

Licensed under: CC-BY-SA with attribution
