Proper entity types for XHTML, XML and inside inline JavaScript

https://stackoverflow.com/questions/1734692

20-09-2019

Question

First, the way I understand it, it's more appropriate to use numeric entities in an XHTML document, such as &#34; instead of &quot;, is that right?

Second, for my RSS XML feed, which entity type is correct? Named or numeric? I believe it's numeric, but see examples of both in my searches.

Third, which of the following is correct for entities inside inline JavaScript?

<span onmouseover="tooltip_on( '<strong>Tooltip inside a span</strong>
<br />Lorem ipsum dolor sit amet.<span>Lorem ipsum <code>dolor sit</code>
amet, consectetur adipisicing elit.</span>' );"
onmouseout="tooltip_off();">tooltip inside a span</span>

OR... (the tags inside the JS function are converted to named entities):

<span onmouseover="tooltip_on( '&lt;strong&gt;Tooltip inside a
span&lt;/strong&gt;&lt;br /&gt;Lorem ipsum dolor sit amet.
&lt;span&gt;Lorem ipsum &lt;code&gt;dolor sit&lt;/code&gt;
amet, consectetur adipisicing elit.&lt;/span&gt;' );"
onmouseout="tooltip_off();">tooltip inside a span</span>

EDIT 1:

Great answers below, but maybe I should have worded my question differently.

Disregarding the JavaScript question, which would YOU use for YOUR website and RSS feed:

(1) All numeric entities, (2) all named entities, (3) a mixture of both: &amp; &quot; &lt; &gt;, with the rest being numeric.

I am leaning towards 3 because my site already has &amp; &quot; &lt; &gt; &#039; deeply embedded, plus htmlspecialchars() used in quite a few places.

EDIT 2:

All good answers below, folks. Had to pick just one, unfortunately.

Solution

First, the way I understand it, it's more appropriate to use numeric entities in an XHTML document, such as &#34; instead of &quot;, is that right?

&quot; is also defined for XHTML, so you can use either.

Second, for my RSS XML feed, which entity type is correct? Named or numeric? I believe it's numeric, but see examples of both in my searches.

Again, &quot; is also defined for XML, so you can use either.

Third, which of the following is correct for entities inside inline JavaScript?

The second one is correct since a plain < is not allowed inside an attribute value declaration (but > is).

Edit: Now that you have refined your question:

I would use a charset that contains all characters I need. So if you want to be able to use almost any character, use Unicode and encode the characters with UTF-8.

That way you can write any character directly in UTF-8 and need character references only for the special characters of XML (at least &, <, >, " and ').

And here you have the free choice between the named or numeric character references. Use what you like better or what your programming language uses/prefers.
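As a rough illustration of that approach, here is a minimal JavaScript sketch that escapes only the XML special characters and passes everything else through unchanged, on the assumption that the output is served as UTF-8. The names `escapeXml` and `XML_SPECIALS` are made up for this example and are not from any library.

```javascript
// Escape only the five characters that are special in XML. Everything else,
// including non-ASCII text, is left alone and relies on the UTF-8 encoding.
const XML_SPECIALS = new Map([
  ["&", "&amp;"],
  ["<", "&lt;"],
  [">", "&gt;"],
  ['"', "&quot;"],
  ["'", "&#39;"], // numeric form, because &apos; is not defined in HTML 4
]);

function escapeXml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => XML_SPECIALS.get(ch));
}

console.log(escapeXml('Café "Zoë" <b>&</b>'));
// → Café &quot;Zoë&quot; &lt;b&gt;&amp;&lt;/b&gt;
```

Swapping the named references in the table for numeric ones (or the other way round) changes nothing for the parser, which is the "free choice" described above.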

OTHER TIPS

First, the way I understand it, it's more appropriate to use numeric entities in an XHTML document, such as &#34; instead of &quot;, is that right?

Not exactly.

There are two issues to worry about.

Is this going to be plain old XHTML or is it going to be HTML compatible XHTML?

There is no &apos; in HTML, so you can't use it in HTML-compatible XHTML (but you only need it in attribute values delimited with ', so just use " as the delimiter instead).

Is this going to be processed with an XML parser that is not DTD aware?

If so, only the generic XML entities will be recognized (quot, apos, gt, lt, amp).

On the other hand, named entities are much more readable. Real characters (e.g. via UTF-8) are most readable.
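To check whether a document would survive a parser that is not DTD aware, one option is to scan it for named references outside the five generic XML ones. The following is only a sketch: `findUndeclaredEntities` is a hypothetical helper, the pattern is a simplification of XML name syntax, and it does not skip comments or CDATA sections.

```javascript
// The only named entities every XML parser recognizes without reading a DTD.
const PREDEFINED = new Set(["quot", "apos", "gt", "lt", "amp"]);

function findUndeclaredEntities(markup) {
  const found = new Set();
  // Matches named references such as &nbsp;. Numeric references
  // (&#160; or &#xA0;) start with '#' and therefore never match.
  for (const match of markup.matchAll(/&([A-Za-z][A-Za-z0-9]*);/g)) {
    if (!PREDEFINED.has(match[1])) found.add(match[1]);
  }
  return [...found];
}

console.log(findUndeclaredEntities("Fish &amp; chips&nbsp;&mdash; &#163;5 &lt;cheap&gt;"));
// → [ 'nbsp', 'mdash' ]
```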

Second, for my RSS XML feed, which entity type is correct?

Use quot, gt, lt, amp where needed and real characters elsewhere.

Third, which of the following is correct for entities inside inline JavaScript?

Better to use unobtrusive JS instead of intrinsic event attributes.

That said, the rules are the same as for any other HTML attribute — only & and whatever character you used to delimit the attribute value need to be represented with an entity.

<, & and " in attribute values where " is the delimiter: use &lt;, &amp; and &quot;, respectively.

These are predefined entities in XML so will work with any parser regardless of whether it reads the document type. They are also normal defined entities in HTML.

Numeric character references are just as valid, but slightly harder to read.

> in text content: use &gt; or leave as-is.

> doesn't normally need escaping, it's perfectly legal in an attribute value at all times, and it's legal in text content as long as it doesn't form part of a ]]> sequence. (This is an obscure, pointless and sometimes-ignored part of the XML spec.) You might prefer to always escape it in text content anyway, just to be safe and not have to remember this rule. (That's what Canonical XML does.)

Numeric character references are just as valid, but slightly harder to read.

' in attribute values where ' is the delimiter: use &#39;.

The numeric character reference is most correct here, because the XML predefined entity &apos; isn't technically defined by the HTML4 standard (even though it will work in all current browsers). The lateness of adding this entity reflects the common practice of always using " as the attribute value delimiter.

non-ASCII characters: include as-is

As long as you're using and declaring UTF-8 you can just spit the characters straight out. Smaller, more readable results.

non-ASCII characters (without Unicode): use numeric character reference

If for some reason you can't use UTF-8 (boooo!!!), use a character reference like &#xE9; in preference to the HTML entities. The HTML entities only cover a very small portion of the Unicode character set anyway; might as well use them for all IMO. I personally prefer to use the &#x... hex-escapes for the non-ASCII characters as it is traditional to refer to Unicode characters by their ‘U+xxxx’ hex code.
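For the no-UTF-8 case, a small JavaScript sketch of converting non-ASCII characters to hex character references might look like this. The function name `toNumericReferences` is invented for the example, and it deliberately leaves &, < and friends alone, so those still need the usual escaping beforehand.

```javascript
// Replace every non-ASCII character with a hex numeric character reference.
function toNumericReferences(text) {
  let out = "";
  // for...of walks the string by code point, so characters outside the
  // Basic Multilingual Plane become one reference, not two surrogates.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    out += codePoint > 0x7f
      ? `&#x${codePoint.toString(16).toUpperCase()};`
      : ch;
  }
  return out;
}

console.log(toNumericReferences("café €5 😀"));
// → caf&#xE9; &#x20AC;5 &#x1F600;
```

The digits after &#x are the same ones used in the U+xxxx notation, which makes the output easy to look up.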

Though using the HTML entities is quite valid in an XHTML document, it means the parser has to fetch external entities such as the DTD to work out what the entities are. If you stick to the predefined entities and character references you can use a lightweight non-external-entity-including XML parser without losing your ability to find text-including-entity-references in the document.

The situation with RSS is murky, as usual with all the different RSS versions lurking about. RSS 0.91 had a DTD that included the older HTML 3.2 standard's entities, but the previous official SYSTEM URL for the DTD has gone walkies. (In an annoying and needless piece of internet vandalism, Netscape's owners, AOL, broke the link in a reorg a few years ago. Not only that but they also 302 you to their home page if you try to access it or any other address on the old site, thus serving a badly-written HTML page to clients expecting a DTD. Bad AOL, 302-404s are so bogus.)

RSS 2.0 doesn't have an official DTD at all. So either way, avoid the HTML entities, using the predefined entities and the numeric character references in preference.
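If existing feed content already contains HTML entities, one way to follow this advice is to rewrite them as numeric references before publishing. The sketch below assumes a hand-picked lookup table. `HTML_ENTITY_CODE_POINTS` holds only a few sample names (the full HTML 4 list has roughly 250) and `htmlEntitiesToNumeric` is a hypothetical helper.

```javascript
// Tiny illustrative subset of the HTML 4 entity table (name → code point).
const HTML_ENTITY_CODE_POINTS = new Map([
  ["nbsp", 0xa0],
  ["copy", 0xa9],
  ["eacute", 0xe9],
  ["mdash", 0x2014],
  ["hellip", 0x2026],
]);

// These are safe in any XML document, so they are left as they are.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

function htmlEntitiesToNumeric(markup) {
  return markup.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (whole, name) => {
    if (XML_PREDEFINED.has(name)) return whole;
    const codePoint = HTML_ENTITY_CODE_POINTS.get(name);
    // Unknown names are left untouched so the problem stays visible.
    return codePoint === undefined
      ? whole
      : `&#x${codePoint.toString(16).toUpperCase()};`;
  });
}

console.log(htmlEntitiesToNumeric("Caf&eacute; &amp; bar&nbsp;&copy; 2009"));
// → Caf&#xE9; &amp; bar&#xA0;&#xA9; 2009
```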

onmouseover="tooltip_on( '<strong>Tool...

Not allowable in any document type. < is invalid in an attribute value.

onmouseover="tooltip_on( '&lt;strong&gt;Tooltip...

Valid but unreadable. I second David's Unobtrusive JavaScript suggestion.
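If the inline attribute has to stay, the escaped form is easier to generate than to type by hand. The sketch below applies two layers in order: first the tooltip markup becomes a JavaScript string literal, then the whole call is escaped for a double-quoted attribute. `escapeAttribute` and `inlineHandlerAttribute` are hypothetical names; `tooltip_on` is the function from the question.

```javascript
// Escape text for use inside a double-quoted HTML/XHTML attribute value.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;") // must run first so later output is not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the attribute value for a call like tooltip_on("...").
function inlineHandlerAttribute(functionName, argument) {
  // JSON.stringify produces a double-quoted JavaScript string literal with
  // backslashes, quotes and line breaks already escaped for JavaScript.
  const jsCall = `${functionName}(${JSON.stringify(argument)});`;
  return escapeAttribute(jsCall);
}

const tooltipHtml = "<strong>Tooltip inside a span</strong><br />Lorem ipsum dolor sit amet.";
const attr = inlineHandlerAttribute("tooltip_on", tooltipHtml);
console.log(`<span onmouseover="${attr}" onmouseout="tooltip_off();">tooltip inside a span</span>`);
```

Because JSON.stringify emits double quotes, the string delimiters end up as &quot; in the attribute, which the browser decodes back to " before the script runs.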

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow