Question

Am I reading the HTML 4.01 standard wrong, or is Google? In HTML 4.01, if I write:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html> <head> <body>plain <em>+em <strong>+strong </em>-em

The rendering in Google Chrome is:

plain +em (italic) +strong (bold italic) -em (bold)

This seems to contradict the HTML 4.01 standard, which summarizes the underlying SGML rules as: “an end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags”.¹

That is, the </em> end tag should close not only the <em> start tag but also the unclosed intervening <strong> start tag, and the rendering should be:

plain +em (italic) +strong (bold italic) -em (plain)

A commenter pointed out that it is bad practice to leave tags open, but this is only an academic example. An equally good example would be: <em> +em <strong> +strong </em> -em </strong>. It was my understanding from the HTML 4.01 standard that this code fragment would not work as intended because of the overlapping elements: the </em> end tag should implicitly close the <strong>. The fact that it did work as intended was surprising, and this is what led to my question.

And it turned out I proposed a false dichotomy in the question: neither Google nor I were reading the HTML 4.01 standard wrong. A private correspondent at w3.org pointed me to Web SGML and HTML 4.0 Explained by Martin Bryan, which explains that “[t]he parsing program will automatically close any currently open embedded element which has been declared as having omissible end-tags when it encounters an end-tag for a higher level element. (If an embedded element whose end-tag cannot be omitted is still open, however, the program will report an error in the coding.)”² (Emphasis added.) Bryan’s summarization of the SGML standard is right, and HTML 4.01’s summarization is wrong.


Solution

The statement quoted from the HTML 4.01 specification is very obscure, or just plain wrong on all counts. HTML 4.01 has specific rules for end tag omission, and these rules depend on the element. For example, the end tag of a p element may be omitted, whereas the end tag of an em element may never be omitted. The statement in the specification probably tries to say that an end tag implicitly closes any inner elements that have not yet been closed, to the extent that end tag omission is allowed for those elements.

No browser has ever implemented HTML 4.01 (or any earlier HTML specification) as defined, with the SGML features that are formally part of it. Anything that the HTML specifications say about SGML should be taken as just theoretical until proven otherwise.

HTML5 doesn’t change the rules of the game in this respect, except that it writes down the error handling rules. In simple cases like this one, the rules just turn the traditional browser behavior into a norm. They are tag-soup-oriented, treating tags more or less as formatting commands: <em> means “italicize,” </em> means “stop italicizing,” etc. But HTML5 also defines error handling more formally, so that despite such tag soup usage, it is well-defined what document tree will be constructed in the DOM.
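The "formatting command" reading can be illustrated with a small sketch. This is not the HTML5 parsing algorithm, only a toy model that treats each tag as an on/off switch; the function name `format_runs` and the tokenizer (which ignores attributes) are made up for the illustration.

```python
import re

# Tokenize into start tags, end tags, and text runs (no attributes; illustration only).
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")

def format_runs(markup):
    """Return (text, active_formats) pairs, treating tags as on/off switches."""
    active = []  # formats currently switched on, in the order they were opened
    runs = []
    for closing, name, text in TOKEN.findall(markup):
        if text:
            runs.append((text, tuple(active)))
        elif closing:
            # "stop italicizing": switch off this one format, leave the others on
            if name.lower() in active:
                active.remove(name.lower())
        else:
            active.append(name.lower())
    return runs

runs = format_runs("plain <em>+em <strong>+strong </em>-em")
for text, formats in runs:
    print(repr(text), formats)
# 'plain ' ()
# '+em ' ('em',)
# '+strong ' ('em', 'strong')
# '-em' ('strong',)
```

Under this model the trailing "-em" stays bold because only the em switch was turned off, which matches what Chrome renders for the markup in the question.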

OTHER TIPS

Some tags are allowed to be omitted (such as the end tag for <p> or the start and end tags for <body>), and some are not (such as the end tag for <strong>). It is the former that the section of the spec you quote is referring to. You can identify them in the DTD, where a dash marks a required tag and an O marks a tag that may be omitted:

<!ELEMENT P - O (%inline;)*            -- paragraph -->
          ^ A p element
            ^ requires a start tag
              ^ has optional end tag
                 ^ contains zero or more inline things
                                       ^ Comment: Is a paragraph
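As a rough illustration of reading those two flags mechanically, here is a minimal sketch. The helper name `tag_omission` is made up, the regular expression only handles simple single-name declarations like the one above, and the EM declaration is a simplified stand-in (the actual DTD declares EM through a parameter-entity group).

```python
import re

# Matches declarations of the form: <!ELEMENT NAME start end content ...>
# where start/end are "-" (tag required) or "O" (tag may be omitted).
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+([-O])\s+([-O])\s+")

def tag_omission(declaration):
    """Return (name, start_tag_omissible, end_tag_omissible)."""
    match = ELEMENT_DECL.match(declaration.strip())
    if match is None:
        raise ValueError("not a recognised element declaration")
    name, start, end = match.groups()
    return name, start == "O", end == "O"

p_decl = "<!ELEMENT P - O (%inline;)*            -- paragraph -->"
em_decl = "<!ELEMENT EM - - (%inline;)*>"  # simplified stand-in, not copied from the DTD

print(tag_omission(p_decl))   # ('P', False, True)
print(tag_omission(em_decl))  # ('EM', False, False)
```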

What you have is not an HTML document with an omitted tag, but an invalid pseudo-HTML document that browsers will try to perform error recovery on.

The specification (for HTML 4) does not describe how to perform error recovery; that is left up to browsers.

The specification says that:

Some HTML element types allow authors to omit end tags (e.g., the P and LI element types).

This:

Please consult the SGML standard for information about rules governing elements (e.g., they must be properly nested, an end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags (section 7.5.1), etc.).

applies only to elements whose end tags may be omitted.

If you look at the P element spec you will see:

Start tag: required, End tag: optional

So, when you use this:

<DIV>
<P>This is the paragraph.
</DIV>

The P element will be automatically closed.

But, if you look at the EM spec, you will see:

Start tag: required, End tag: required

So the rule of automatic closing does not apply here: the EM and STRONG end tags are required, and the HTML is simply invalid.
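The difference between the two cases can be sketched as a stack check. This is a minimal illustration of the rule as described here, not a real SGML parser; `close_element` and the `END_TAG_OPTIONAL` table are hypothetical names, and the table lists only a handful of elements with their HTML 4.01 end-tag flags.

```python
# End-tag omission flags for a few HTML 4.01 elements (True = end tag may be omitted).
END_TAG_OPTIONAL = {"div": False, "p": True, "li": True, "em": False, "strong": False}

def close_element(open_elements, name):
    """Process the end tag </name> against the stack of open elements.

    Returns the elements that were closed implicitly. Raises ValueError if an
    intervening element requires its own end tag, i.e. the markup is invalid.
    """
    if name not in open_elements:
        raise ValueError(f"</{name}> has no matching start tag")
    implied = []
    while open_elements[-1] != name:
        inner = open_elements[-1]
        if not END_TAG_OPTIONAL[inner]:
            raise ValueError(f"</{name}> found while <{inner}> is still open")
        implied.append(open_elements.pop())
    open_elements.pop()  # the element the end tag actually names
    return implied

print(close_element(["div", "p"], "div"))  # ['p']

try:
    close_element(["em", "strong"], "em")
except ValueError as error:
    print("invalid:", error)  # invalid: </em> found while <strong> is still open
```

The first call corresponds to the DIV/P example, where the P is closed for free; the second corresponds to the markup in the question, which is reported as an error rather than silently repaired.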

Curiously, all the browsers showed the same behavior with that kind of invalid HTML.

All modern browsers use an HTML5 parser (even for HTML 4.01 content), so the parsing rules of HTML5 apply. You can find more information at the Parsing HTML Documents section in the HTML5 spec.

HTML Outline

  • HTML
    • HEAD
      • #text " " ()
    • BODY
      • #text "plain " ()
      • EM
        • #text "+em " (italic)
        • STRONG
          • #text "+strong " (bold/italic)
      • STRONG
        • #text "-em" (bold)
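A heavily simplified sketch of how such a tree can arise: when an end tag closes an element with other formatting elements still open inside it, those inner elements are closed too and then reopened before the next piece of content. This only approximates the simplest case of the HTML5 rules for mis-nested formatting elements; the function name `build_outline` and the attribute-free tokenizer are assumptions made for the illustration, and the HEAD part of the outline is left out.

```python
import re

TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")

def build_outline(markup):
    """Return an indented outline of the tree for inline-formatting markup inside BODY."""
    lines = ["BODY"]
    open_elements = []  # stack of currently open formatting elements
    to_reopen = []      # elements cut short by a mis-nested end tag

    def open_element(name):
        lines.append("  " * (len(open_elements) + 1) + name.upper())
        open_elements.append(name)

    def reopen_pending():
        while to_reopen:
            open_element(to_reopen.pop(0))

    for closing, name, text in TOKEN.findall(markup):
        name = name.lower()
        if text:
            reopen_pending()  # restore formatting that was closed early
            lines.append("  " * (len(open_elements) + 1) + f'#text "{text}"')
        elif not closing:
            reopen_pending()
            open_element(name)
        elif name in open_elements:
            # Pop back to the matching start tag, remembering what was in between.
            index = len(open_elements) - 1 - open_elements[::-1].index(name)
            to_reopen.extend(open_elements[index + 1:])
            del open_elements[index:]
        elif name in to_reopen:
            to_reopen.remove(name)  # closed before it was ever reopened
        # any other unmatched end tag is ignored
    return lines

print("\n".join(build_outline("plain <em>+em <strong>+strong </em>-em")))
# BODY
#   #text "plain "
#   EM
#     #text "+em "
#     STRONG
#       #text "+strong "
#   STRONG
#     #text "-em"
```

The second STRONG in the output is the reopened copy, which is why "-em" ends up bold but not italic.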

If you run your HTML through http://validator.w3.org/check, it will flag this HTML as invalid.

If your HTML is invalid, all bets are off, and different browsers may render your HTML differently.

If you look at the DOM in Chrome by right-clicking and choosing "Inspect element", you'll be able to deduce that, since your tags do not match up, it applied an algorithm to decide where you messed up. Technically, it does close the strong tag at the correct place. However, it decides that you were probably trying to make both pieces of text bold, so it puts the last -em in an entirely new, extra "strong" element while keeping the '+strong' in its own "strong" element. It looks to me like the Chrome team decided it is statistically likely that you want both things to be bold.

Licensed under: CC-BY-SA with attribution