Browser and server support for XML

XML parsing and MIME types

The standard requires, as I outlined in my article on XML and the web, that XHTML be served as application/xhtml+xml, or at least generically as application/xml.

Current user agents (browsers) usually handle documents that are written as XHTML reasonably well. However, current browsers also contain XML parsers, and their behaviour when confronted with an XML or XHTML document depends drastically on the MIME type under which the document is served:

If the MIME type is text/html, browsers will simply try to interpret the document as ordinary HTML ‘tag soup’, which may appear to work rather well on well-formed XHTML, as long as it contains only HTML ‘tags’. However, no XML parsing is done in this case, so no entities declared in internal or external DTD subsets will be available. Moreover, marked sections (most likely the CDATA sections required for the script and style elements) will not be parsed, and the literal string <![CDATA[ etc. will be considered part of the element content. This will most likely break scripts and stylesheets.
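The difference can be reproduced outside a browser. The following is a minimal sketch in Python, under the assumption that the standard library's html.parser is an acceptable stand-in for a browser's tag soup parser and xml.etree.ElementTree for its XML parser. The sample page and the entity name "site" are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample page: one entity declared in the internal subset,
# one CDATA section inside a script element.
PAGE = """<!DOCTYPE html [
  <!ENTITY site "Example Site">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>&site;</title>
    <script type="text/javascript"><![CDATA[
      var ok = 1 < 2;
    ]]></script>
  </head>
  <body><p>Hello</p></body>
</html>"""

# XML parsing: the entity is expanded and the CDATA markers are consumed.
root = ET.fromstring(PAGE)
xml_title = root.find(f".//{{{XHTML_NS}}}title").text
xml_script = root.find(f".//{{{XHTML_NS}}}script").text


class ScriptCollector(HTMLParser):
    """Collects the raw text found inside script elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


# Tag soup parsing: the CDATA markers stay in the script text.
soup = ScriptCollector()
soup.feed(PAGE)
soup.close()
soup_script = "".join(soup.chunks)

print("XML title:", repr(xml_title))
print("CDATA marker in XML script text:", "<![CDATA[" in xml_script)
print("CDATA marker in tag soup script text:", "<![CDATA[" in soup_script)
```

With the XML parser the script content is clean code. With the tag soup parser the markers end up inside the script text, which is what breaks scripts in a browser.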

DOM differences

The Document Object Model (DOM) of XML is somewhat more sophisticated and comprehensive than the (usually proprietary) quasi-DOMs with which browsers allow HTML tag soup to be manipulated. Therefore, serving the document with the wrong MIME type may also affect client-side scripting.

Silly reminder: When handling the DOM in proper XHTML+XML, remember to use namespace aware methods like createElementNS. Just remember namespaces, period. Browsers are only required to (and should only) display XML as HTML if the root element is {http://www.w3.org/1999/xhtml}html and all subordinate elements belong to http://www.w3.org/1999/xhtml.
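To illustrate the reminder without a browser, here is a small sketch using Python's xml.dom.minidom, which offers the same createElement and createElementNS methods as the browser DOM. The helper elements_outside_xhtml is a hypothetical name introduced only for this example. It lists the elements that would stop a document from qualifying under the rule just stated.

```python
from xml.dom.minidom import getDOMImplementation

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Build a document whose root element is {http://www.w3.org/1999/xhtml}html.
doc = getDOMImplementation().createDocument(XHTML_NS, "html", None)
root = doc.documentElement

# Namespace-aware creation: the element belongs to the XHTML namespace.
good = doc.createElementNS(XHTML_NS, "p")
# Namespace-unaware creation: the element has no namespace at all.
bad = doc.createElement("p")

root.appendChild(good)
root.appendChild(bad)


def elements_outside_xhtml(node):
    """Return all element nodes at or below node that are not in XHTML_NS."""
    found = []
    if node.nodeType == node.ELEMENT_NODE and node.namespaceURI != XHTML_NS:
        found.append(node)
    for child in node.childNodes:
        found.extend(elements_outside_xhtml(child))
    return found


print(good.namespaceURI)
print(bad.namespaceURI)
print(len(elements_outside_xhtml(root)))
```

Both elements are called p, but only the one created with createElementNS is an XHTML paragraph as far as an XML-aware consumer is concerned.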

Server configuration

A possible source of confusion due to the above mentioned issues is that the MIME type which a browser assumes depends on the circumstances under which the document is loaded: If the document comes from the local machine, the browser might simply go by the file name. In that case, file extensions like .html or .htm will usually be interpreted as text/html, whereas .xml may be either text/xml or application/xml, and .xhtml may in fortunate cases be read as application/xhtml+xml.

This situation is probably not ideal for any author’s sanity. Therefore it would seem best to edit, or at least check frequently, the documents as served through a web server; and that web server should be unambiguously configured.
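For quick local checks, a throwaway server with an explicit extension table may be enough. The following is only a sketch using Python's http.server; the class name XHTMLRequestHandler, the function serve and the port number are assumptions made for this example, and it is meant for testing, not for production use.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class XHTMLRequestHandler(SimpleHTTPRequestHandler):
    """Serves the current directory with an unambiguous MIME type table."""

    # Copy the default table, then pin the extensions discussed above.
    # A charset parameter could be appended to the values if desired.
    extensions_map = dict(SimpleHTTPRequestHandler.extensions_map)
    extensions_map.update({
        ".xhtml": "application/xhtml+xml",
        ".html": "text/html",
        ".htm": "text/html",
    })


def serve(port=8000):
    """Serve the current directory on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), XHTMLRequestHandler).serve_forever()
```

Calling serve() in the document directory and opening http://127.0.0.1:8000/ then shows each page as a browser handles it under the declared MIME type, rather than under whatever it guesses from the file name.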

Apache

In the case of the celebrated Apache webserver, it would be sensible for example to place in the document directory an .htaccess rule such as this:

AddType 'application/xhtml+xml; charset=iso-8859-1' xhtml

Then one should name all XHTML pages with the extension .xhtml (which seems rather sensible anyway) and can be sure that modern browsers will do the Right Thing.

Other browser specific idiosyncrasies

The Internet Explorer

A perpetual source of sorrow, despair, hatred and violence, Microsoft’s browser does not actually cope with application/xhtml+xml MIME types. This is in spite of the fact that it has a quite powerful XML parser, and it is a shame that such silly little file type recognition issues stand in the way of using an otherwise rather powerful tool.

Fix it: However, by adding MIME information about application/xhtml+xml into the Windows registry, the Internet Explorer can be made to accept these files! See below on how to do it.

The Internet Explorer supports some kind of XSLT, though not fully standards compliant. (You look surprised?) If I recall correctly, the Internet Explorer also gets confused by CSS stylesheets for general XML documents that are invoked with the <?xml-stylesheet?> processing instruction.

Most annoyingly, IE will actually fail to parse correct XHTML 1.1. This is because on the one hand it does parse the entire DTD (which is a very strong feature), but on the other hand its implementation is buggy (you still look surprised?) with respect to the treatment of relative pathnames, so it complains about not finding a module file.

As a last surprise, the Internet Explorer retains its ability to produce intentional misbehaviour (‘Quirks Mode’) even when processing XML/XHTML (just in case an author wanted to switch to the new technology but needed to rely on a buggy browser). However, unlike in the HTML case, where the buggy browser behaviour is only triggered by buggy documents (so writing proper HTML results in proper processing), in XHTML the quirks mode is caused by proper XHTML, namely by specifying the XML declaration.

Mozilla et al.

(See the Mozilla FAQ.)

Mozilla-based browsers are rather powerful when their XML parser is invoked. However, there is a weird peculiarity when it comes to parsing the DTD: Mozilla does read the internal subset, and entities which are declared therein may be used (and will be expanded) in the document. The standard ISO entities from XHTML may also be used. However, any external subset (referred to via a system ID) is not parsed by Mozilla. (I am not sure whether this can be helped by placing the external DTD file in a specific subdirectory of the Mozilla installation. Input welcome.)

Mozilla supports and automatically applies XSLT transformations that are specified with the <?xml-stylesheet?> processing instruction. It does not support XSL-FO at the moment. It does support CSS as a general styling language for XML.

Opera

In XML parsing mode Opera seems to cope well with both internal and external DTDs (note that neither browser validates, so I am referring only to the usage of entities), and it parses CDATA sections correctly. Opera does not apply XSLT transformations, nor does it support XSL-FO; it does however support CSS for general styling of XML.

Fix it: Unfortunately, Opera’s HTTP_ACCEPT does not allow server content negotiation in favour of XHTML (see below). Fortunately, this can be fixed easily on the client side.

Using Apache content negotiation

When Apache content negotiation is enabled and Apache is set to serve .xhtml files as application/xhtml+xml as described above, it is possible to place two files, e.g. website.html and website.xhtml, into the same directory and refer to them merely as website. Apache will then send the right file depending on which MIME types the browser claims to support in its Accept HTTP header field (referred to here by its CGI name, HTTP_ACCEPT). The two filenames could for instance be links to the same basic file, which could then be authored to conform to both standards.

(To make an HTML document work both in tag soup and XML parsers, it should be encoded in UTF-8, not have an XML declaration, have a meta element setting the content type to “text/html; charset=UTF-8”, and empty elements should have a space just before the “/>”. Moreover, marked sections have to be avoided, which is easily done by putting scripts and stylesheets in separate files.)

The browser situation is as follows:

The Internet Explorer does not send application/xhtml+xml in the HTTP_ACCEPT header field at all; it will receive the .html file.
Ditto Lynx.
Mozilla sends “text/xml,application/xml,application/xhtml+xml,text/html;q=0.9…” and will receive the .xhtml file.
Opera 7.54 sends “text/html, application/xml;q=0.9, application/xhtml+xml,…”, which is annoying because it does accept the right type, but Apache will still send the .html file, because text/html is of the same priority (1.0) as application/xhtml+xml. Fortunately, Opera can be reconfigured to send a more appropriate HTTP_ACCEPT.
More browser information welcome.
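The outcomes in this list follow from the q values in each header. Below is a deliberately simplified sketch of the comparison in Python. It is not Apache's actual negotiation algorithm, which weighs more dimensions. The rule that a tie falls back to text/html is an assumption chosen to match the Opera behaviour described above, and the Internet Explorer header is a hypothetical example, since the text only says that it omits application/xhtml+xml. The Mozilla header is used only as far as it is quoted above.

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"


def parse_accept(header):
    """Map each media range in an Accept header to its q value."""
    prefs = {}
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs[media] = q
    return prefs


def quality(prefs, media):
    """Look up media, falling back to type/* and then */*."""
    major = media.split("/")[0]
    for candidate in (media, major + "/*", "*/*"):
        if candidate in prefs:
            return prefs[candidate]
    return 0.0


def choose(header):
    """Pick XHTML only if it is strictly preferred (assumed tie rule)."""
    prefs = parse_accept(header)
    if quality(prefs, XHTML) > quality(prefs, HTML):
        return XHTML
    return HTML


headers = {
    "Mozilla": "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9",
    "Opera 7.54": "text/html, application/xml;q=0.9, application/xhtml+xml",
    "IE (hypothetical)": "image/gif, image/jpeg, */*",
}
for browser, header in headers.items():
    print(browser, "->", choose(header))
```

Mozilla ranks text/html at 0.9 and therefore gets the XHTML variant; Opera and the wildcard-only header rank both types equally and get the HTML variant.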

Using server-side scripting

When a server-side scripting language is available, it is conceivable to test whether the HTTP_ACCEPT field contains any mention of application/xhtml+xml at all (or even application/xml as a fallback) and then tailor the HTTP header accordingly.

For example in PHP one could start a document website.php like this:

<?php
// Serve as XHTML only if the client's Accept header mentions it.
$accept = isset($_SERVER["HTTP_ACCEPT"]) ? $_SERVER["HTTP_ACCEPT"] : "";
if (stristr($accept, "application/xhtml+xml")) {
    header("Content-Type: application/xhtml+xml; charset=iso-8859-1");
    echo "<?xml version=\"1.0\" encoding=\"iso-8859-1\" standalone=\"no\"?>\n";
} else {
    header("Content-Type: text/html; charset=iso-8859-1");
}
?>

The HTML head could then contain the element

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

to make the document valid HTML; and this could also be suppressed conditionally on whether the file will be served as XHTML, but would not need to be, since XHTML requires that content-type meta elements be ignored.
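The same substring test, extended with the application/xml fallback mentioned above, might look as follows in Python. This is a sketch written as a WSGI application; the names pick_content_type, app and BODY are hypothetical, and the page body is a placeholder. Like any plain substring test, it ignores q values, so a client that lists a type with q=0 would still be served that type.

```python
def pick_content_type(accept):
    """Return the best MIME type that the Accept header mentions at all."""
    accept = accept.lower()
    for candidate in ("application/xhtml+xml", "application/xml"):
        if candidate in accept:
            return candidate
    return "text/html"


# Placeholder document body for illustration.
BODY = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head>"
    "<body><p>Hello</p></body></html>"
)


def app(environ, start_response):
    """WSGI application that tailors the Content-Type to the client."""
    content_type = pick_content_type(environ.get("HTTP_ACCEPT", ""))
    body = BODY
    if content_type != "text/html":
        # Only XML-aware clients get the XML declaration.
        body = '<?xml version="1.0" encoding="iso-8859-1"?>\n' + body
    headers = [("Content-Type", content_type + "; charset=iso-8859-1")]
    start_response("200 OK", headers)
    return [body.encode("iso-8859-1")]
```

Keeping the decision in a separate function makes it easy to check against the Accept strings of individual browsers before deploying anything.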

Fixing the Internet Explorer

To register the MIME type application/xhtml+xml under Windows (so that the Internet Explorer will open files of that type), the following information needs to be added to the registry:

Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\MIME\Database\Content Type\application/xhtml+xml]
"CLSID"="{25336920-03F9-11cf-8FD0-00AA00686F13}"
"Encoding"=hex:08,00,00,00
"Extension"=".xhtml"

The above code is in .reg file format, and it is also available as a registry patch file.

Fixing Opera

Opera accepts text/html and application/xhtml+xml at the same (maximum) priority, which may give unpredictable results in conjunction with server-side content negotiation.

To change Opera’s HTTP_ACCEPT string, one of its .ini files must be changed. The location of the relevant file can be found by going to opera:about; it is the file listed after “Preferences” in the section called “Paths”.

The HTTP_ACCEPT string is set under the section [Adv User Prefs] as the value of HTTP Accept, and it should be set to something like this:

[Adv User Prefs]
HTTP Accept=application/xhtml+xml, application/xml;q=0.9, text/html;q=0.85, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1
...