Doctype Woes (back to HTML4)

At the moment I’m working with the rest of the ubuntuusers web team on the new portal of ubuntuusers.de, which is based on Django. One of the things we will do is consolidate all templates. While doing so we have to decide on an HTML or XHTML standard and use it consistently, including the correct MIME type and doctype.

And selecting that is the hardest part, because once you’ve decided on something you have to live with the consequences and cannot really change it. For example, HTML and XHTML have slightly different DOMs and different rules for CSS (CSS, for example, has an exception that lets a background color on the body tag affect the whole page; this exception does not exist for XHTML). Without a doubt many people use XHTML in the wrong way. Just look at how many people serve their web pages as text/html and rely only on HTML semantics. Those pages break or render incorrectly if you serve them as application/xhtml+xml.
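To make the MIME type question concrete, here is a minimal sketch of how a server could decide between text/html and application/xhtml+xml from the browser's Accept header. The function name `pick_content_type` is made up for this illustration, it does not use any Django API, and it deliberately ignores the relative ordering of q-values; it only checks whether the client lists application/xhtml+xml with a non-zero quality.

```python
def pick_content_type(accept_header: str) -> str:
    """Return the MIME type to serve for a raw HTTP Accept header.

    Hypothetical helper: serves XHTML only to clients that explicitly
    accept application/xhtml+xml, and falls back to text/html otherwise.
    """
    for part in accept_header.split(","):
        media_range, _, params = part.strip().partition(";")
        if media_range.strip().lower() != "application/xhtml+xml":
            continue
        # Default quality is 1.0; an explicit q=0 means "not acceptable".
        quality = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            return "application/xhtml+xml"
    return "text/html"


if __name__ == "__main__":
    print(pick_content_type("text/html,application/xhtml+xml,application/xml;q=0.9"))
    print(pick_content_type("text/html, */*"))
```

Even with something like this in place, the page itself still has to be well-formed XML, otherwise the clients that get application/xhtml+xml are exactly the ones that break.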

But why do XML and SGML have different semantics? SGML itself was created long ago (I assume IBM had something to do with it; at least its predecessor was created there) and is an insane specification. At least that’s what the web told me. I cannot tell you if that’s true or not because the standard itself is not available without paying for it :-/

From what my sources tell me, XML is a subset of SGML. I wonder how that’s possible though, because there are syntactic elements that in my opinion are not compatible. For example, XML’s self-closing tags clash with null end tags in SGML:

XML: <br />
SGML: <p/This is some text in a paragraph/

Because the slash has a special meaning in tags in SGML, it clashes with the closing slash of XML tags. Also, SGML is apparently case-insensitive where XML is not. Maybe I’m wrong there and that part is up to the DTD, but quite frankly, I don’t care. I don’t even care about clashing slashes in tags because no browser implements the correct SGML behavior. And if they did, we would all see invalid output because the web is not valid. It’s not and it will never be.
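The XML side of this is easy to check with Python's standard library. The following is a small sketch (the helper name `is_well_formed_xml` is my own) that feeds the two syntaxes above, plus a case-mismatched pair of tags, to a real XML parser. It says nothing about how an SGML parser would treat them, only that XML rejects the null end tag form and treats tag names as case-sensitive.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


samples = [
    "<p>line<br /></p>",                       # XML self-closing tag
    "<p/This is some text in a paragraph/",    # SGML null end tag
    "<P>upper and lower case tags</p>",        # case mismatch
]

if __name__ == "__main__":
    for sample in samples:
        print(is_well_formed_xml(sample), sample)
```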

But what’s really ridiculous is that it’s incredibly hard to write pages that are semantically and syntactically correct as both HTML4 and XHTML. Yet you have to make your documents compatible with both if you want your page to be valid XHTML and still render correctly. The reason is that no browser today selects the render mode by doctype, and even if browsers did, they would break on the huge number of pages that use XHTML incorrectly.

XML is strict, very strict. Syntax errors appear as big red error messages. I myself have to work on the wiki markup for the new portal, and one of the things I have to deal with is balancing elements. That is possible and simple; what’s harder is adding paragraphs without breaking things. That’s not easy because not every element is allowed inside a paragraph and a paragraph cannot be mixed with every element, thanks to the distinction between inline and block elements.
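Balancing really is the simple part. A rough sketch of the idea, not the actual portal code: keep a stack of open elements, drop closing tags that match nothing, and close whatever is still open at the end. The token format (`("open", name)`, `("close", name)`, `("text", value)`) is an assumption made up for this example, and text is assumed to be escaped already.

```python
def balance(tokens):
    """Turn a token stream into markup with properly nested tags.

    tokens: iterable of (kind, value) pairs where kind is "open",
    "close" or "text" (a hypothetical format for this sketch).
    """
    stack = []  # names of currently open elements, innermost last
    out = []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
            out.append(f"<{value}>")
        elif kind == "close":
            if value not in stack:
                continue  # stray closing tag: drop it
            # Close everything opened after `value`, then `value` itself.
            while stack:
                name = stack.pop()
                out.append(f"</{name}>")
                if name == value:
                    break
        else:
            out.append(value)  # text, assumed to be escaped already
    # Close whatever the author forgot to close.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


if __name__ == "__main__":
    tokens = [("open", "b"), ("text", "x"), ("open", "i"),
              ("text", "y"), ("close", "b"), ("text", "z")]
    print(balance(tokens))
```

Note that this only guarantees nesting. It knows nothing about which elements may contain which, and that is exactly the paragraph problem.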

Even HTML5 disallows that mixing of different element types, but at least it doesn’t complain. Sure, I could send the output through a validator and tell the user that his markup is bullshit and he should correct it. But I won’t do that. Users don’t give a fuck about their markup. And I cannot bloat the parser more than it is now. Server resources are limited, and additional validation for such a high-traffic site is nearly impossible.

Fortunately browsers will never show you those errors because they parse XHTML with their tagsoup parser they use for HTML too. Even so, if we cannot ensure that all of our pages are valid XML and XHTML, we are not allowed to use the doctype because it would break browsers that support XHTML.
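The difference between the two parsing styles can be sketched with Python's standard library. `html.parser` is of course not a browser's tag soup parser; it merely stands in for "a lenient parser" here, and the class and function names are invented for the example. The same misnested markup is reported tag by tag without complaint by the lenient parser and rejected outright by the XML parser.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record start and end tags in the order the lenient parser sees them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def lenient_events(markup: str):
    """Parse with the forgiving HTML parser; never raises on bad nesting."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


def strict_parse_ok(markup: str) -> bool:
    """Parse with an XML parser; bad nesting is a fatal error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    broken = "<p><b>bold<i>both</b>italic</i>"
    print(lenient_events(broken))
    print(strict_parse_ok(broken))
```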

While this is hard for web designers, and especially for programmers who want to write parsers that generate XHTML, it’s an even harder job for browser developers. In the end they have to maintain two independent parsers for HTML and XHTML. That is hard enough for the big browser vendors Microsoft, Mozilla, Opera and Apple, but even harder if you are new to the market and want to ship your own browser, because you have to be compatible not only with the new XHTML standard but also with the old HTML one. Nobody will translate all the old documents to XHTML, I’m sure ;-)

Details about the issues are summarized here:

Without a doubt we will have fun with XHTML in the future. Probably the web will stay as it is today: we will still use the tag soup parsers, people will write XHTML that is in fact HTML, and browsers will interpret it like that.

For me the decision is HTML4 at the moment, restricted to the subset that is valid in both HTML4 and HTML5. That could make the transition easier once the standard is ready (and I hope that’s earlier than 2022), and it’s a good idea now too. Who needs a u tag anyway?

4 Responses to “Doctype Woes (back to HTML4)”

  1. You said: “Fortunately browsers will never show you those errors because they parse XHTML with their tagsoup parser they use for HTML too.”

    But that will be true for HTML too. What is your incentive to write conformant HTML 4.01?

    Comment by karl dubost, w3c — Thursday, October 4th, 2007 @ 12:13 am
  2. I’m afraid it is true, XML is a (true) subset of SGML. SGML has an incredible amount of customisation available through the declaration of their DTDs, tag omissibility, short referencing, case-insensitivity, etc. etc. If you play around long enough, you can create SGML that is XML (through declarations and things, although there are always minor exceptions such as quoting of attribute values).
    There’s a really good reason that XML came along, SGML was just too flexible and incredibly difficult to write tools around that catered entirely to the specification and the freedom the declaration allowed. There are tools capable of displaying SGML rather well, but they aren’t common and they usually aren’t free.

    Comment by Alan Hynes — Thursday, October 4th, 2007 @ 12:40 am
  3. @Karl: Of course that’s true for HTML. But at least as far as I know, HTML was designed with the fail-silently approach in mind. Which, by the way, was a very good idea for the early web because it allowed browser vendors to extend it. And I don’t have a problem with the fact that XHTML has to be syntactically correct; it’s just that it doesn’t make things that much easier. The number of pages that incorrectly use XHTML is insane. If browsers now implemented XHTML the way the W3C specified it, half of the Web 2.0 would break.

    XHTML would probably have a brighter future if there hadn’t been such a hype around it some years ago. People certainly misunderstood XHTML and started using it when browsers didn’t support it at all. And now browsers have the same problem IE had ten years ago: it had to implement all the bugs of Netscape too, to stay compatible.

    Well. Let’s see what the future brings, but so far XHTML just makes things a lot harder for the average developer.

    Comment by Armin Ronacher — Friday, October 5th, 2007 @ 8:17 pm
  4. […] small resumption to my previous post about XHTML/HTML here a small list of websites using XHTML that break when rendered on a browser in XHTML […]

    Comment by Lucumr Cogitations » Blog Archive » Abusing XHTML — Friday, October 5th, 2007 @ 8:35 pm
