The Web and XML



This document gives a short overview of current standards relating to content and documents on the WWW. It briefly mentions SGML, but the main focus is on XML. Normative references are listed towards the end.

SGML: DTDs and validation

SGML is a very powerful standard for creating markup languages. HTML is an application of SGML; its latest version, HTML 4.01, is a powerful language for web documents.

An SGML document consists of an SGML prolog and the document instance. The prolog in turn consists of an SGML Declaration and a Document Type Definition (DTD). The SGML Declaration defines various technicalities, e.g. the lexical rules (which characters are valid, which characters delimit the markup), the SGML features to be used (tag minimization, case folding), capacities, etc.; for most purposes, knowledge of the declaration is not crucial. The DTD, on the other hand, is an integral part of every SGML document: it lays out the syntactic structure of the document, i.e. it defines which element types are available, which attribute types are available for which element types, and which type of data is allowed inside elements (the content model) and attributes.

Parsing and validating

To make use of an SGML document is to parse it, that is to read the document in its text form and somehow expose its structure for further use. To do so, the SGML parser needs to know about the permissible structure (element and attribute types) and constructs (tag minimization, special characters, entities) — that is to say, the parser has to read the SGML Declaration and the DTD, and according to them scan the document instance.

At the same time the parser also validates the document, i.e. it checks whether the document instance conforms to the rules laid out in the DTD. It usually reports errors in some detail and only succeeds in parsing if the document validates. Validation is a strong mechanism for ensuring data integrity, spotting errors and helping to structure a document during creation, for the first step in writing an SGML document is to construct an appropriate DTD. This step usually takes a lot of thought, time and highly paid SGML consultants, but it forces you to really understand the structure of your data before you start writing it down.


While SGML would appear to provide a formidable basis for HTML (and indeed much effort has gone into this, resulting in a W3C Recommendation and an ISO standard), the main trouble with HTML lies with the user agents, the web browsers:

Web browsers, when presented with HTML, normally do not do any SGML parsing. Instead, they merely recognize a number of element type names (“tags”) and render them somehow, be it graphically, as text or acoustically. It is therefore well-nigh impossible to use the power of SGML in HTML documents, such as custom entities, short tags or new element types. The most powerful solution would probably be for browsers to recognize, parse and process HTML as an ISO SGML Architecture.

XML — Validity vs. Well-formedness

Thus, for a variety of reasons, XML was created. XML is in fact a reduced version of SGML. The reduction was made chiefly by prescribing a fixed SGML Declaration (which disallows omission of tags, disables case folding and settles a few Unicode-related issues), by disallowing inclusion/exclusion exceptions and inline comments, and by adding a short form for empty elements. The result is that an XML document that is merely well-formed, i.e. one that contains only elements with both open and close tags and no improperly nested tags, can already be read by an XML parser without any further information. This kind of parsing at least yields the logical structure (the “document tree”) of the document. It does not allow for any kind of validation, since such a document carries no notion of what it means for it to be valid. This situation is sometimes summarized by saying that XML does not require a DTD.
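To illustrate the point about well-formedness, here is a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree, which is a non-validating parser; the element names and the helper is_well_formed are invented for this example. The first document is read without any DTD, while the second is rejected because its tags are improperly nested.

```python
import xml.etree.ElementTree as ET

# A merely well-formed document: no DTD, only properly nested elements.
# <br/> is the XML short form for an empty element.
WELL_FORMED = "<note><to>Reader</to><br/><body>Hello</body></note>"

# Improper nesting: <to> is still open when </note> appears.
NOT_WELL_FORMED = "<note><to>Reader</note></to>"


def is_well_formed(text):
    """Return True if the parser can build a document tree from text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The document tree is available without any further information.
root = ET.fromstring(WELL_FORMED)
child_names = [child.tag for child in root]

print(is_well_formed(WELL_FORMED), is_well_formed(NOT_WELL_FORMED))
print(root.tag, child_names)
```

Note that nothing here checks whether a note may actually contain a br element; that would be the job of validation.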

Nonetheless XML can, just like SGML, use a DTD. It is in fact highly advisable to create one for one’s documents, if only for the advantages mentioned above. The exception is the case where the language of DTDs is not powerful enough to express the desired constraints; an alternative language such as XML Schema or RELAX NG may then be considered. However, DTDs remain the only way of defining entities.
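The role of the DTD in defining entities can be sketched as follows, again assuming Python's xml.etree.ElementTree. That parser reads the internal DTD subset and expands entities declared there, but it does not validate. The entity name author and its value are hypothetical.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares the element type and a custom entity.
WITH_DTD = """<!DOCTYPE doc [
  <!ELEMENT doc (#PCDATA)>
  <!ENTITY author "Jane Doe">
]>
<doc>Written by &author;.</doc>"""

# The same instance without a DTD: the entity is now undeclared.
WITHOUT_DTD = "<doc>Written by &author;.</doc>"

# With the declaration in place the reference is expanded by the parser.
expanded = ET.fromstring(WITH_DTD).text

# Without it, the parser reports an error instead of guessing.
try:
    ET.fromstring(WITHOUT_DTD)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False

print(expanded)
print(undeclared_accepted)
```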

Web standards

After the creation of XML, the Hypertext Markup Language was soon reformulated in XML and became XHTML 1.0. It is backwards compatible with HTML 4.01, and non-XML browsers can still find the exact same tags inside an XHTML document.

XML browsers, however, behave much better: whenever confronted with an XML document, they parse it and its DTD (at least the internal subset; some browsers ignore external DTDs). XHTML thereby becomes truly extensible, and many generalized markup features become usable at last.

A number of new technologies follow: XHTML 1.1 is a modularization of XHTML, XHTML 2.0 is (finally) an actual improvement, throwing out all presentational element types and adding more structure (sections and headings), but it is still being drafted. MathML is a markup language for mathematical expressions, SVG is one for Scalable Vector Graphics, XSLT is a transformation language that transforms an XML document into something else, including XSL-FO (Formatting Objects), and lastly XML Schema is an application to describe document constraints, much like DTDs do, only far more powerful.

SGML Declaration

(This section is superseded by a separate document on the SGML declaration.)

As said above, a complete SGML Document contains an SGML Declaration, a DTD and the document instance. These can all be stored in a single file as follows:

<!SGML -- SGML Declaration here -- >
<!DOCTYPE doctype [ -- DTD here -- ]>
<doctype>Document instance here</doctype>

Alternatively, the prolog may be stored in external entities:

<!SGML name PUBLIC public-id system-id >
<!DOCTYPE doctype PUBLIC public-id system-id >
<doctype>Document instance here</doctype>

(Note that the syntax of the first line is given by the standard, whereas the following lines assume the Reference Concrete Syntax. The DTD uses Reference Concrete Syntax if the declaration specifies SCOPE DOCUMENT, otherwise it is subject to the syntax specified by the declaration.)

Lastly, and most commonly, the SGML Declaration can be linked to a DTD Public Identifier via a DTDDECL entry in the SGML catalog.

An XML document must not contain an SGML Declaration, since the declaration for XML is fixed. The declaration in use can be found in James Clark’s Comparison of SGML and XML.

Web browsers and XML

Modern web browsers can operate in two ways. If the MIME type of a file is text/html, it is read as old-fashioned HTML, that is, simply scanned for familiar tags. If on the other hand the type is text/xml, application/xml or application/xhtml+xml, the browser parses the file properly as XML. This means that the browser is allowed to abort if the document is not well-formed or if undeclared entities are used, but it also means that the DOCTYPE declaration has been read and, hopefully, all DTD data has been parsed.

The MIME type depends on the configuration of the web server and on the file extension: usually .html is read in HTML mode and .xml in XML mode, while .xhtml is read one way by some browsers, a different way by others and not at all by the rest.

However, the standards prescribe which MIME types a given language may be served as. The rules are:

MIME type                  HTML 4.01  XHTML 1.0                         XHTML 1.1 and above
text/html                  must       may, for backwards compatibility  must not
text/xml, application/xml  must not   may                               may
application/xhtml+xml      must not   encouraged                        encouraged
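The two parsing modes and the table above can be written down as a small lookup. This is only a sketch in Python: the names parsing_mode, RULES and allowed_types are invented here, the rule strings are copied from the table, and real browsers apply further heuristics that are not modelled.

```python
XML_TYPES = {"text/xml", "application/xml", "application/xhtml+xml"}


def parsing_mode(mime_type):
    """Return the mode a browser would use for the given MIME type."""
    if mime_type == "text/html":
        return "html"  # scanned for familiar tags
    if mime_type in XML_TYPES:
        return "xml"  # parsed properly as XML
    return "unknown"


# One row per MIME type, one entry per language, as in the table.
RULES = {
    "text/html": {
        "HTML 4.01": "must", "XHTML 1.0": "may", "XHTML 1.1": "must not"},
    "text/xml": {
        "HTML 4.01": "must not", "XHTML 1.0": "may", "XHTML 1.1": "may"},
    "application/xml": {
        "HTML 4.01": "must not", "XHTML 1.0": "may", "XHTML 1.1": "may"},
    "application/xhtml+xml": {
        "HTML 4.01": "must not", "XHTML 1.0": "encouraged",
        "XHTML 1.1": "encouraged"},
}


def allowed_types(language):
    """List the MIME types under which a language may be presented."""
    return sorted(mime for mime, row in RULES.items()
                  if row[language] != "must not")


print(parsing_mode("application/xhtml+xml"))
print(allowed_types("HTML 4.01"))
print(allowed_types("XHTML 1.1"))
```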

Style and XML

As markup languages finally get rid of presentational elements, it is vital to express presentational details in a separate language. The classic, CSS, works fine for styling arbitrary XML documents (which is a lot of work since each element type must be fully styled, e.g. display: block; etc.), and in particular for XHTML (where browsers come with default stylesheets as per W3C recommendation).

The general way of associating style sheets with XML documents is via an XML processing instruction (PI):

<?xml-stylesheet href="URI" title="title" alternate="yes|no" type="type" media="media" ?>

For reference, the (pseudo-)content model of xml-stylesheet is

alternate (yes|no) "no"

If extra stylesheets are supplied via alternate="yes", modern browsers allow the user to switch between stylesheets (which are distinguished by the title attribute).
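As an illustration of how such processing instructions look to a program, the following Python sketch reads them with the standard library's xml.dom.minidom. The file names plain.css and fancy.css, the titles and the helper stylesheets are made up for this example, and the regular expression handles only double-quoted pseudo-attributes, which is a simplification.

```python
import re
from xml.dom import minidom, Node

DOCUMENT = (
    '<?xml-stylesheet href="plain.css" type="text/css" title="Plain"?>\n'
    '<?xml-stylesheet href="fancy.css" type="text/css" title="Fancy"'
    ' alternate="yes"?>\n'
    '<doc>Some text</doc>'
)


def stylesheets(xml_text):
    """Collect the pseudo-attributes of every xml-stylesheet PI."""
    dom = minidom.parseString(xml_text)
    sheets = []
    for node in dom.childNodes:
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The PI data is plain text that merely looks like attributes.
            attrs = dict(re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data))
            # alternate defaults to "no" when it is not given.
            attrs.setdefault("alternate", "no")
            sheets.append(attrs)
    return sheets


for sheet in stylesheets(DOCUMENT):
    print(sheet["title"], sheet["href"], sheet["alternate"])
```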

style and script in HTML and XHTML

In HTML 4.01, the style element has declared content CDATA, whereas in XHTML it has the (#PCDATA) content model. Therefore internal style sheets should be enclosed in a CDATA marked section.

The same holds for the script element.
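The effect of the CDATA marked section can be checked with a short Python sketch, assuming the standard library's xml.etree.ElementTree as the XML parser. The style sheet text is a made-up example chosen to contain the characters < and &, which are markup in #PCDATA content.

```python
import xml.etree.ElementTree as ET

# A style sheet containing characters that are special in XML.
CSS = "/* for width < 600px & height > 400px */ p > em { color: red; }"

# Written directly into the element, the < and & are read as markup.
UNPROTECTED = "<style>" + CSS + "</style>"

# Inside a CDATA marked section the text is taken literally.
PROTECTED = "<style><![CDATA[" + CSS + "]]></style>"


def parses(text):
    """Return True if text is accepted by the XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


style_text = ET.fromstring(PROTECTED).text

print(parses(UNPROTECTED), parses(PROTECTED))
print(style_text == CSS)
```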

XML Namespaces

XML-based languages recommended by the W3C all live in their own namespaces. If documents are parsed as XML (see above), browsers will only recognize XHTML documents as such if they do indeed live in the XHTML namespace. Moreover, the recommendation prescribes that XHTML 1.0 and 1.1 use html as their root element. This means that the default namespace has to be the XHTML namespace, and a prefix may not be used instead. Here is a more comprehensive overview:

Language    Root element
XHTML 1.0   element must be html
XHTML 1.1   element must be html
XHTML 2.0   local part of element type name must be html
MathML 2.0  local part of element type name must be math

Now at last the namespaces for W3C languages:

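Language                 Suggested prefix  Namespace
XHTML                    h, html, none     http://www.w3.org/1999/xhtml
MathML                   math, m, none     http://www.w3.org/1998/Math/MathML
MathML (preferences)     pref              http://www.w3.org/2002/Math/preference
XML Schema (Definition)  xs, xsd           http://www.w3.org/2001/XMLSchema
XML Schema (Instance)    xsi               http://www.w3.org/2001/XMLSchema-instance
XML *                    xml *             http://www.w3.org/XML/1998/namespace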

*) Note: the namespace xml (as found in xml:lang etc.) should never be declared and must always be used as xml.
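How a namespace-aware parser sees these declarations can be sketched in Python with xml.etree.ElementTree, which reports element names in the form {namespace}local-part. The namespace names used are the ones published by the W3C for XHTML and MathML; the document itself and the prefix m are only an example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# XHTML is the default namespace, so the root element is plain "html".
# MathML is bound to the prefix "m".
DOCUMENT = f"""<html xmlns="{XHTML_NS}" xmlns:m="{MATHML_NS}">
  <head><title>Namespaces</title></head>
  <body>
    <p>Some math: <m:math><m:mi>x</m:mi></m:math></p>
  </body>
</html>"""

root = ET.fromstring(DOCUMENT)

# The prefix in the query is local to this lookup; only the namespace
# name it is bound to has to match the document.
math = root.find(".//mml:math", {"mml": MATHML_NS})

print(root.tag)
print(math.tag)
```

The lookup succeeds even though the query uses a different prefix from the document, which shows that prefixes are only suggestions whereas the namespace names are fixed.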

Various DTDs etc.

Document type                          Public identifier                                       System identifier
W3C HTML 3.2 SGML Declaration          +//IDN W3C.ORG//SD HTML Version 3.2//EN
                                       IDN//W3C.ORG//SD HTML Version 3.2//EN
W3C XML 1.0 SGML Declaration           IDN//W3C.ORG//SD XML Version 1.0//EN
HTML 3.2                               -//W3C//DTD HTML 3.2 Final//EN
                                       -//W3C//DTD HTML 3.2//EN
                                       ISO/IEC 15445:2000//DTD HyperText Markup Language//EN
HTML 4.01                              -//W3C//DTD HTML 4.01//EN
XHTML 1.0                              -//W3C//DTD XHTML 1.0 Strict//EN
XHTML Basic                            -//W3C//DTD XHTML Basic 1.0//EN
XHTML 1.1                              -//W3C//DTD XHTML 1.1//EN                               (modular driver) (flattened)
XHTML 2.0                              -//W3C//DTD XHTML 2.0//EN                               (no DTD written yet)
XHTML 1.1 with MathML 2.0 and SVG 1.1  -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
MathML 2.0                             -//W3C//DTD MathML 2.0//EN
SVG 1.1                                -//W3C//DTD SVG 1.1//EN
Schema 1.0 (non-normative)             -//W3C//DTD XMLSCHEMA 200102//EN

Parsers and validators




Example skeletons

XHTML 1.1 with MathML 2.0 and SVG 1.1






Note: In order to display MathML correctly, the document should include the mathml.xsl transformation sheet. This in turn refers to three more files (pmathml.xsl, pmathmlcss.xsl and ctop.xsl), all of which need to be present locally.

XSLT Transformation into XHTML with CSS



    href="[CSS FILE NAME]" type="text/css" title="[STYLE TITLE]"
        <xsl:value-of select="[SOME TITLE INFORMATION]"/>



XML Schema




Last updated 2008-01-16 by Thomas Köppe