Xml Vs Html

This page describes the role of XML as it relates to HTML. There is a whole world of other things going on with XML that have nothing to do with document formatting or the Web, but that's a subject for another page.

See ExtensibleMarkupLanguage
XML will replace HTML, eventually. XML has a sister specification called XSL which allows you to define rules for processing the data and formatting it for a user. This solves one of the key problems with HTML - little or no reuse. Web pages have data but it is embedded in the formatting and that keeps you from readily changing the look and feel. XSL and XML are the key to reusable data. Already there are databases which allow you to perform queries with XML and return XML data. I am sure that you can see how nice that will be for developers.

XML will replace HTML, eventually. Maybe. Maybe not. In our shop we think of XML as doing for data what the various IDLs do for behaviour, as discussed elsewhere on this page. We absolutely do not see XML as a client-side technology, since that would involve pushing our clients' data dictionary and schemas over the wire. Since our clients consider their data models to be very valuable corporate secrets, that idea really isn't going to go down too well. We see XML->HTML via XSL as the way forward. --KeithBraithwaite

If anything replaces HTML, then it's XHTML, not XML. -- JuergenHermann
Here's how I am thinking about it, until such time as I ever become expert in the matter. Perhaps someone can derail this thinking if it is incorrect.

HTML and SGML actually define objects in UniCode text. You write or generate,
	<TAGNAME VAL1="some value" VAL2="somevalue">
	Body of text, and possibly nested tags.
	</TAGNAME>
The TAGNAME corresponds to an object type, and the VALs to attributes or instance variables. Cool enough. We used that idea to export our sequence charts and make them trivial to generate, parse, debug and send around the net. HTML and SGML work because everyone is told in advance what the tags are. But people other than me want to invent their own object types and tags.
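
The tag-as-object reading above can be tried out with a few lines of Python. This is only a sketch using the standard library's xml.etree.ElementTree; the NESTED element and its KIND attribute are made up for illustration and are not part of the example above.

```python
import xml.etree.ElementTree as ET

# The TAGNAME sketch from above, plus one hypothetical nested tag.
source = '''<TAGNAME VAL1="some value" VAL2="somevalue">
Body of text, and possibly nested tags.
<NESTED KIND="child"/>
</TAGNAME>'''

element = ET.fromstring(source)

# The tag name plays the role of the object type.
object_type = element.tag

# The attributes play the role of instance variables.
attributes = dict(element.attrib)

# The text before the first nested tag is the body.
body = element.text.strip()

# Nested tags show up as child elements.
children = [child.tag for child in element]

print(object_type, attributes, body, children)
```

Note that the parser needs no advance knowledge of TAGNAME or NESTED to build this structure, which is the point of letting people invent their own tags.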

My understanding is that XML is the meta-level thing. They predefine some tags so your document starts off by naming the tags it will use, the TAGs and their VALues. Then the next reader can test to see what of the stuff they are going to understand, and at least parse the stuff they won't understand.

Anyone confirm or break that view? --Alistair


XML is quite tied to UniCode, which doesn't matter much to us English speakers, but is quite important for everyone else.


Pretty close. SGML is the root, a heavy duty markup language for enterprise-scale publishing. It's very configurable, but things have to be done properly to get through the parser. You can buy thick books about it written by people from IBM.

HTML is the cheap and cheerful version with just a few tags pre-defined by TimBernersLee, then all the browser vendors, then the W3C. The browsers are very forgiving, which is why it's easy to write HTML but hard to write browsers now.

XML is an attempt to rationalise all this, so the syntax has been tidied up a bit, and meta-definitions (DTD) have come back -- although your parsers don't have to validate. The push now is to provide interesting extensions, such as search and incremental update, and to move it away from its document roots to be more generic.


XML does not come with a predefined set of tags; instead, it lets you define your own.

It does come with the necessary syntax to allow you to describe which DocumentTypeDefinition you are writing to, and hence what the valid tags will be.

The DTD is not written in XML, though it's obvious now that defining a new syntax is bad when you are trying to make a standard syntax for everything. So it probably will be the case that there will be a standard DTD for DTDs in XML, but not yet AFAIK. --JohnFarrell
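
To make the last few points concrete, here is a small sketch in Python. The note, to, body and surprise elements are hypothetical. The document declares its DTD inline (note that the ELEMENT declarations are not themselves in element syntax), and then breaks it. The standard library parser used here is, as far as I know, non-validating, so it reads the declarations but does not enforce them.

```python
import xml.etree.ElementTree as ET

# A document with an internal DTD subset. The DTD says a note holds
# exactly a "to" followed by a "body"; the document ignores that rule.
document = '''<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Alistair</to>
  <surprise>not declared in the DTD</surprise>
</note>'''

# A non-validating parser only checks well-formedness, so this
# succeeds even though the document is invalid against its own DTD.
root = ET.fromstring(document)
tags = [child.tag for child in root]

print(tags)
```

A validating parser would reject the same document; checking validity is a separate, optional step on top of parsing.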


HTML, SGML, and XML. Why three different standards?

SGML is semantic markup for storage (define your own tags).

XML is semantic markup for interoperability (define your own tags).

These are WAY too simplistic. Yes, you can use SGML for data storage, but people use it for all sorts of bizarre things like multimedia presentations and adding link information to CIA spy satellite data. No, I'm not kidding. And XML is good for a lot more than interoperability; for one thing, you can use it as a human-readable (and human-editable) serialization mechanism. I use XML constantly in places where it has nothing to do with interoperability. -- JamesWilson

HTML is visual markup for presentation (predefined tags).

XML will not replace SGML or HTML. It addresses a new market - integration. XML is the currency of the information economy. (Someone else said this first)

A proposal for web sites such as CNN.com:

Save stories in SGML with full semantic information. Create an engine which reads SGML and produces XML in various formats.

Define a standard XML news story format. Let search engines read this format. This would enable searches that understand authors, dates, and so on.

Create an engine that reads XML news stories and produces HTML. Let the web browser choose what HTML format they want - IE, Pilot, 3.2, 4.0.

-- EricUlevik
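
A rough sketch of the last two steps of that proposal, in Python. The story format here (story, headline, author, date, body) is invented for the example and is not any real news standard, and the story content is placeholder text.

```python
import xml.etree.ElementTree as ET
from html import escape

# A hypothetical news story in a made-up XML format.
STORY_XML = '''<story>
  <headline>Markup &amp; the Web</headline>
  <author>A. Reporter</author>
  <date>1999-01-01</date>
  <body>Example body text.</body>
</story>'''


def author_of(xml_text: str) -> str:
    """What a search engine could do: read a field by name."""
    return ET.fromstring(xml_text).findtext("author", default="")


def story_to_html(xml_text: str) -> str:
    """What the rendering engine could do: wrap fields in HTML."""
    story = ET.fromstring(xml_text)

    def field(name: str) -> str:
        # Escape the text so it is safe to embed in HTML.
        return escape(story.findtext(name, default=""))

    return (
        '<div class="story">\n'
        f'  <h1>{field("headline")}</h1>\n'
        f'  <p class="byline">{field("author")}, {field("date")}</p>\n'
        f'  <p>{field("body")}</p>\n'
        '</div>'
    )


print(author_of(STORY_XML))
print(story_to_html(STORY_XML))
```

The search side never has to guess which part of a page is the byline, because the author is a named field rather than a piece of formatted text.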


HTML started out as a subset of SGML, intended to be a simple document description language. Tags like <title>, <h1> represented document structure attributes -- the viewer (there was no such thing as a browser then) was expected to interpret and display the tagged text in a way that was appropriate for the characteristics of the particular user interface and that user's preferences.

At some point the print-world graphic designers got involved and started deciding that it was a page description language, or should be. The semantics of the tags were overloaded to include visual display hints. This trend led to extension tags to HTML such as <font>, <center>, and the much hated <blink>. When HTML became too unwieldy to handle the semantic overload, Cascading Style Sheets were developed to attempt to separate content from presentation.

At some point the idea that a particular user might not have the exact screen real estate, color palette, or font collection that the page designer expects was completely forgotten. As a result, there's a set of web designers deridingly called pixel perfectionists, who get exercised because blue isn't the same blue everywhere and one browser offsets the document from the top left by a different number of pixels than another browser.

Also, early on someone working on a browser made a decision to allow "sloppy" HTML -- either because the standards were vague (they were) or because there was a lot of bad hand-coded HTML out there that would break if the rendering engines required what is now known as "well-formedness". One of the most troublesome results of this is that dropping or doubling the various table tags (which started out as a non-standard extension) can cause various surprising things to happen, such as a table not being rendered at all.

More recently the W3C has attempted to bring some order to this mess with a specification now known as XHTML, which is an XML DTD for HTML. The idea would be that an XML parser/renderer could handle HTML if it were written to comply with the XHTML standard -- many cannot at the moment because of sloppy (non-well-formed) HTML.
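
The difference between forgiving HTML handling and XML well-formedness can be seen with Python's standard library. This is only a sketch: html.parser is a tolerant tokenizer, not a browser, and the two markup samples are made up for the purpose.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# "Sloppy" HTML: paragraphs never closed, <br> never closed.
SLOPPY = "<p>First paragraph<p>Second paragraph<br>"

# The same content written the XHTML way: everything closed and nested.
STRICT = "<div><p>First paragraph</p><p>Second paragraph<br/></p></div>"


class TagCollector(HTMLParser):
    """Records every start tag the tolerant HTML tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def is_well_formed(text: str) -> bool:
    """True if an XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


collector = TagCollector()
collector.feed(SLOPPY)
collector.close()

print(collector.tags)          # the HTML tokenizer copes with SLOPPY
print(is_well_formed(SLOPPY))  # the XML parser does not
print(is_well_formed(STRICT))
```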

The catch is that because XML is a rather more rigid document definition language, and because the tags in any given XML document do not necessarily have pre-defined meanings like HTML tags, an XML document cannot be rendered in and of itself. A DTD only specifies what elements are legal and what sorts of nesting are allowed. Hence the addition of style sheets in the eXtensible Stylesheet Language (XSL) and XSL transforms. An XSL style sheet is a bit like an HTML Cascading Style Sheet -- it associates possibly renderable document output with input XML tags. So an XML document specifying a bibliographic entry may only have structural tags for author, title, publisher, etc. Applying a given XSL transform can generate some kind of displayable output based on those tags -- for example title could be italicized in the grand old bibliographic manner.
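
The bibliographic example can be imitated in a few lines of Python. To be clear, this is not XSL: it is a plain-Python stand-in for the idea of associating output with input tags, and the entry, the rule table and the render function are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# A hypothetical bibliographic entry with structural tags only.
ENTRY = '''<entry>
  <author>Example Author</author>
  <title>An Example Title</title>
  <publisher>Example Press</publisher>
</entry>'''

# The "style sheet": one output template per input tag.
# The title is italicized; nothing in the entry itself says so.
HTML_RULES = {
    "author": "{}. ",
    "title": "<i>{}</i>. ",
    "publisher": "{}.",
}


def render(xml_text: str, rules: dict) -> str:
    """Apply the rule for each child tag, in document order."""
    entry = ET.fromstring(xml_text)
    parts = []
    for child in entry:
        template = rules.get(child.tag)
        if template is not None:  # tags without a rule are skipped
            parts.append(template.format(escape(child.text or "")))
    return "<p>" + "".join(parts).strip() + "</p>"


print(render(ENTRY, HTML_RULES))
```

Swapping in a different rule table changes the output without touching the entry, which is the separation XSL is after.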

In a general sense, the online document world has come full circle, but with some lessons applied. Initially HTML didn't specify how it was to be displayed, but it was intended as display hints, and graphic designers took that to the limit. XML doesn't specify how it is to be displayed and isn't intended for display, but there are ways to transform the descriptive structure of an XML document into a human-readable document. The ideal behind XML/XSL would allow a single XML document to act as a sort of record set or raw data description, and various style sheets to act on that data to generate appropriate markup. This is happening to a large extent. There are transforms that can take an XML document and generate either WML -- a kind of minimal markup for cellular phones -- or HTML. There are tools to generate PDF from XML.

The downside of all this is that XML is now the silver bullet, and buzzword-compliant products must somehow 'support' XML. This has been taken to rather extreme levels, such as requirements that EJBs generate XML results directly.

XML will likely replace HTML, but not on the client, where HTML browsers remain the norm. There will be a general trend towards storing certain kinds of textual documents as XML, which can be processed, cataloged, and transformed, but which remain in textual form rather than becoming rows in a database.

-- StevenNewton

These days XHTML is basically dead: trying to get the world to move from sloppy HTML to the more strict XML syntax proved too difficult (no doubt due in part to Microsoft's reluctance to properly support it in Internet Explorer). Instead, a newer HTML5 proposal has been developed, which takes a different tack to specifying the language.

"HTML5" is used as a blanket term for a battery of different APIs and technologies including stylesheets and scripting, as well as page markup. The page markup language (the new HTML) is no longer specified directly in terms of syntax, but as an object model; elements are described in terms of which elements they may be contained in, which attributes they possess, and what an HTML document needs to expose to programs working on it. Only two of sixteen chapters describe the two specified serialisations of HTML documents: the "HTML" syntax (the loose and happy version) and the "XHTML" syntax (the XML-strict version). Generation and parsing of the former is spelled out character by character, grammar production by grammar production, algorithm by algorithm; the latter delegates all that to the XML specification.


My take: XML is "merely" a simplification and formalization of the syntax of HTML, with some concepts stolen from SGML.

No, exactly the reverse. XML is simplified SGML with many complexities and warts removed. For instance, it has exactly one character set instead of the unlimited mischief SGML allows with character sets. (Multiple encodings are allowed, however.) It disallows various insanities SGML DTDs make possible. And it vastly simplifies the task of parser writers. The SGML specification is the size of a small book, and to find an intelligible (that is, annotated) version you have to buy a large book. In contrast, the XML specification is about 36 pages and even I can understand it without annotations. -- JamesWilson
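
The "one character set, multiple encodings" point can be checked with a short Python sketch. The word element and the sample text are made up. The same characters are written out as two different byte sequences, and the parser turns both back into the same Unicode string.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # "café": the last character is outside ASCII

# The same document serialized in two different encodings.
utf8_bytes = (
    f'<?xml version="1.0" encoding="UTF-8"?><word>{TEXT}</word>'
).encode("utf-8")
latin1_bytes = (
    f'<?xml version="1.0" encoding="ISO-8859-1"?><word>{TEXT}</word>'
).encode("iso-8859-1")

# The parser reads the encoding declaration and decodes accordingly.
utf8_word = ET.fromstring(utf8_bytes).text
latin1_word = ET.fromstring(latin1_bytes).text

print(utf8_bytes != latin1_bytes)        # different bytes on disk
print(utf8_word == latin1_word == TEXT)  # same characters after parsing
```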

To a first approximation, XML is just a syntax.

More pedantically, but more accurately, XML "describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them".

XML is NoSilverBullet. This can be deduced directly from the above.

On the other hand, the elementary concepts of XML have benefited, and IMHO will continue to benefit, an ever-growing community of developers, for purely practical reasons.

There is a wide variety of tools for processing XML in "standard" ways.

If your application needs to store some data, and such data is of a kind which can be expressed in XML, you have a lot to gain by using XML: at the very least, not having to write a parser; in many cases, leveraging existing tools which help with further processing XML data into something closer to what your application internally manipulates.
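
As a small illustration of the "no parser to write" argument, here is a sketch in Python that stores a flat dictionary of settings as XML and reads it back using only the standard library. The settings/setting element names and both function names are invented for the example.

```python
import xml.etree.ElementTree as ET


def settings_to_xml(settings: dict) -> str:
    """Serialize a flat name -> value mapping as an XML string."""
    root = ET.Element("settings")
    for name, value in settings.items():
        item = ET.SubElement(root, "setting", name=name)
        item.text = value
    # Escaping of <, & and quotes is handled by the library.
    return ET.tostring(root, encoding="unicode")


def settings_from_xml(xml_text: str) -> dict:
    """Read the mapping back; an empty element becomes an empty string."""
    root = ET.fromstring(xml_text)
    return {item.get("name"): item.text or "" for item in root.iter("setting")}


original = {"title": "Xml Vs Html", "note": "a < b & c", "empty": ""}
stored = settings_to_xml(original)
restored = settings_from_xml(stored)

print(stored)
print(restored == original)
```

The awkward characters in the note value survive the round trip without any hand-written escaping code.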

[ To be continued, I'm off to dinner. ;) ]
 HTML = Content + Presentation
 Refactor it to get: 
 XML = Content
 XSL = Presentation
A clean pattern for separating the 2 when content changes often (as in ecommerce items & prices) and presentation style doesn't. --MichaelLeach

Can't you just as easily say the following?
 HTML = Content + Presentation
 Refactor it to get: 
 HTML = Content
 CSS = Presentation


CategoryProgrammingLanguageComparisons CategoryXml

(last edited May 11, 2014)