Question

I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:

<a href="somepage.html?x=1&y=2">...</a>

One should write:

<a href="somepage.html?x=1&amp;y=2">...</a>

Apparently, the former example shouldn't work, but browser error recovery means it does.

Is this still the case in HTML5?

We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?

Solution

It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:

The ampersand (&) may be left unescaped in more cases compared to HTML4.

In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.

In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:

  • a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
  • a number sign ===> then the HTML5 tokenizer will go through the many steps to determine if it has a numeric character entity reference or not, but note in this case one is subject to parse errors (do read the spec)
  • any other character ===> the parser will try to find a named character reference, e.g., something like &notin;.
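As a rough illustration of that three-way split, here is a minimal Python sketch. It only looks at the single character after the ampersand, as the list above describes; the function name `classify_ampersand` and its `additional_allowed` parameter are made up for this example and are not part of any spec or library.

```python
def classify_ampersand(text: str, amp_index: int, additional_allowed: str = '"') -> str:
    """Classify the '&' at text[amp_index] by the character that follows it.

    Returns "literal", "numeric" or "named", mirroring the three bullets above.
    This is a simplified sketch, not a full tokenizer.
    """
    if text[amp_index] != "&":
        raise ValueError("no ampersand at the given index")

    # An empty string stands in for EOF when the ampersand is the last character.
    nxt = text[amp_index + 1 : amp_index + 2]

    # Case 1: tab, LF, FF, space, '<', '&', EOF, or the additional allowed character.
    if nxt == "" or nxt in "\t\n\f <&" or nxt == additional_allowed:
        return "literal"

    # Case 2: a number sign starts a (possible) numeric character reference.
    if nxt == "#":
        return "numeric"

    # Case 3: anything else makes the parser look for a named character reference.
    return "named"


print(classify_ampersand("somepage.html?x=1&y=2", 17))  # named
```

Note that "named" here only means the parser will attempt a lookup; whether a reference is actually found is a separate step.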

The last case is the one of interest to you since your example has:

<a href="somepage.html?x=1&y=2">...</a>

You have the character sequence

  • AMPERSAND
  • LATIN SMALL LETTER Y
  • EQUAL SIGN

Now here is the part of the HTML5 spec that is relevant in your case, because nothing starting at y= matches a named character reference (there is no reference named y):

If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.

You don't have a semicolon there, so you don't have a parse error.

Now suppose you had, instead,

<a href="somepage.html?x=1&eacute=2">...</a>

which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:

If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.

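So there the = makes it a parse error, because legacy browsers might get confused. The ampersand itself is still treated as a plain ampersand, since the matched characters are unconsumed.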
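To make the two quoted rules concrete, here is a small Python sketch of the named-reference step for an ampersand inside an attribute value. It borrows the name table from the standard library's `html.entities.html5` dictionary (whose keys include legacy names without a trailing semicolon, such as `eacute`); the function `consume_named_in_attribute` and its return shape are my own invention, and this is an approximation of the spec text, not a conforming tokenizer.

```python
from html.entities import html5

# Longest name in the table, so the longest-match search knows where to start.
_MAX_NAME = max(len(name) for name in html5)


def _is_ascii_alnum(ch: str) -> bool:
    return ch != "" and ch.isascii() and ch.isalnum()


def consume_named_in_attribute(text: str, amp_index: int):
    """Return (replacement_or_None, parse_error) for the '&' at text[amp_index]."""
    rest = text[amp_index + 1 :]

    # Try the longest possible match against the named character reference table.
    match = None
    for length in range(min(len(rest), _MAX_NAME), 0, -1):
        if rest[:length] in html5:
            match = rest[:length]
            break

    if match is None:
        # No match: a parse error only if we see alphanumerics followed by ';'.
        i = 0
        while _is_ascii_alnum(rest[i : i + 1]):
            i += 1
        return None, (i > 0 and rest[i : i + 1] == ";")

    nxt = rest[len(match) : len(match) + 1]
    if not match.endswith(";") and (nxt == "=" or _is_ascii_alnum(nxt)):
        # Historical-reasons rule: unconsume everything; '=' is also a parse error.
        return None, nxt == "="

    # A real reference; the spec additionally flags a missing semicolon.
    return html5[match], not match.endswith(";")


print(consume_named_in_attribute("somepage.html?x=1&y=2", 17))       # (None, False)
print(consume_named_in_attribute("somepage.html?x=1&eacute=2", 17))  # (None, True)
```

In both calls the ampersand ends up as a literal ampersand; the only difference is whether a parse error is reported.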

The HTML5 spec goes to great lengths to say, in effect, "this ampersand is not beginning a character reference, so there's no reference here." Even so, you might run into URLs whose parameter names collide with named references that are recognized without a semicolon (e.g., copy, reg, not, sect), and those do result in parse errors. So IMHO you're better off escaping ampersands as &amp; anyway. But of course, you only asked whether restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
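If you generate your markup from code, escaping is cheap to do consistently. A minimal sketch in Python, using the standard library's `urllib.parse.urlencode` and `html.escape`; the helper name `build_link` is hypothetical and only here for illustration:

```python
from html import escape
from urllib.parse import urlencode


def build_link(page: str, params: dict, label: str) -> str:
    """Build an <a> element whose href has its ampersands escaped as &amp;."""
    # urlencode joins the query parameters with raw '&' characters.
    url = f"{page}?{urlencode(params)}"
    # escape() turns '&' into '&amp;' (and quotes into entities with quote=True).
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'


print(build_link("somepage.html", {"x": 1, "y": 2}, "..."))
# <a href="somepage.html?x=1&amp;y=2">...</a>
```

This way it does not matter whether a parameter name happens to look like a character reference.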

It would be interesting to see what validators can do.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow