Xml Is Just Dumb Text

I'm planning on merging in bits and pieces from other XML-related pages here as I see fit, as part of an attempt at refactoring XmlSucks et al. -- DanielBrockman
XML has no built-in data types, save strings of Unicode characters. This makes it a bit unwieldy in some situations. The usual example compares the following XML (forgive me, but comparing XML to S-expressions is unavoidable)
to the following, slightly terser, S-expression (see EssExpressions for more about these brackety beasts):
 (42 "foo")
However, this example is somewhat contrived (if not a lot), since in most real-world examples you will not find this kind of abstract list of data. A more realistic version could look like the following:
which requires some extensions to the S-expression for it to keep up:
   (price 42)
   (name "foo"))
Suddenly, the S-expression's ability to represent different data types (integers and strings) is not as important. Indeed, if you were to add a long description of the gizmo, perhaps even with internal markup, you would most likely find the S-expression's syntax for strings starting to get in the way.
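
A minimal Python sketch of what "no built-in data types" means in practice. The element names gizmo, price, and name are assumptions, picked to mirror the S-expression fragment above; they are not taken from any real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are assumptions for illustration.
doc = "<gizmo><price>42</price><name>foo</name></gizmo>"
gizmo = ET.fromstring(doc)

# The parser hands back character data only: both values arrive as str.
price_text = gizmo.find("price").text
name_text = gizmo.find("name").text

# Deciding that <price> holds an integer is left to the application.
price = int(price_text)
```

Whether that extra int() call is a real burden, or just the cost of putting the type in the markup instead of the syntax, is what the rest of this page argues about.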

Um... the EssExpression still looks better, takes up less space, you're killing your own argument.
 [insert example here]
Conclusions: Another unsupported conclusion. There's nothing superior about redundant information... (html (body (span "This is great!"))) is superior to <html><body><span>This sucks!</span></body></html> because it has the exact same structure, carries the same information, can't be messed up by misspelling the end tag, and doesn't force me to type redundant information.

[Really? Now we're back to this question: In the above example, which element does the symbol ')' close? In the example following it, which element does the tag "</html>" close?]

Umm... who cares, that's the whole damn point, you shouldn't be looking at ) to determine scope... that's what indentation is for, get an editor that properly indents the code and you'll never worry about a closing tag again. The eye spots indentation far better than it does closing tags anyway.
 (html
   (body
      (span "This is great!")))

 html
   body
      span "This is great!"
See... the ( and ) should be invisible to you... don't look at them.

Simple quoting is superior when there is a lot of markup. Examples: configuration files, SVG (XML works well, but S-expressions would probably work even better from a purely syntactical point of view)

[Ahh. Now we're back to the old "How Many Spaces To A Tab" question when it comes to formatting and presenting source code and, in this case, data. The presentation of data and/or source code should NEVER have an effect on its value.

These "Indentation Equals Scope" folks should be rounded up and put into a re-education camp. The same goes for the "enough parentheses fixes everything" crowd.

To say that "simple quoting is superior when there is a lot of markup" flies in the face of SGML, etc. It is a religious assertion with nothing to back it up.]

Religious assertion -- that's pretty funny, considering you just made a "round 'em up and shoot 'em" rant without giving any reasons behind your opinion. The one reason you mentioned for any part of that was the word "SGML", but SGML is a highly flawed language, which is why it was redesigned as XML in the first place, FYI; check it.

And your point would be...? Anyway, we're kinda getting away from the thrust of the discussion. The page topic is supposed to be about "dumb text," but somehow it mutated almost immediately into Yet Another Page On XML Versus S-Expressions. Why? Actually, nobody cares. It's a phun rant, though.
What data types should a general and basic standard like XML accommodate?
[The following discussion moved here from XmlSucks.]

An application that uses XML must implement its own lexical analysis and formatting to do any kind of data typing within that text.

I think you have failed to understand the fundamental concept of a "markup language". The idea is that the dumb text is put into context and given meaning by the surrounding markup. Consider, e.g., the element <integer>42</integer>: you don't have to perform any lexical analysis to determine that the text "42" is to be interpreted as an integer. It works like this for all data types. The only difference is that XML has only one scalar type - the string, while sexps give you a couple of more for free - such as numbers. In both languages you have to attach additional markup to represent less common objects, such as regular expressions, dates, or integer ranges. -- DanielBrockman

I think you have failed to understand the fundamental concept of "meaning". "<integer>" is just a string. There is no guarantee that producers and consumers of XML will agree on its meaning. -- EricHodges

That's where schemas come in. You can attach a schema that says that the content of any <integer> element is an xsd:integer, say. No-one is going to misunderstand the meaning of xsd:integer, because it's rigorously defined by the XSD specification. On the other hand, there's no guarantee that producer and consumer will agree on the meaning of (dictionary), because "dictionary" is just a string. If you're working with XML, you can place the <dictionary> element in a namespace that you claim ownership of, and thus centralize the authority that defines the meaning of the element: you. -- DanielBrockman
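
A rough Python sketch of the idea that the surrounding markup, plus an agreed table of types, tells the consumer how to read the text. The TOY_SCHEMA dictionary and the typed_value helper are hypothetical stand-ins; this is not XSD validation, only an illustration of a name-to-type mapping.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a schema: element name -> converter. Not real XSD.
TOY_SCHEMA = {"integer": int, "string": str}

def typed_value(element):
    """Convert an element's text using the type registered for its name."""
    convert = TOY_SCHEMA[element.tag]
    return convert(element.text)

value = typed_value(ET.fromstring("<integer>42</integer>"))
```

The sketch only works because producer and consumer share the table. The element name on its own settles nothing, which is the objection raised above.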

But the schema doesn't define the meaning of an element. It can specify structure and value ranges, but that's about it.

You're right. Schemas can only define low-level meaning - such as integer values vs string values. Then again, if you don't agree with me on the meaning of the element <integer xmlns="mailto:[edited out]">42</integer>, then may I humbly suggest you stay away from using it?

Sure you can. May I humbly suggest we not confuse something as simple as schema with something as complex as meaning?

Are you trying to tell me that the schema cannot attach any meaning to the data - at all? If so, then I would be interested to see something you do consider to give "meaning". Here's how I reason: The characters ['4', '2'] represent the decimal digits 4 and 2, respectively. These two digits together represent the integer 42. Let's say the integer is interpreted as an ASCII code, and so represents the character '*'. I would say, in this case, that the original characters mean '*'. However, I don't see why any step in the interpretation should somehow be more significant than the others, so if we forget about the last step I would have to say instead that the original characters mean 42. Now, since a schema (assuming it has basic datatyping) is able to specify this apparently meaningful interpretation, but you do not consider schemas to be capable of adding meaning, where am I losing you? -- DanielBrockman

I'm trying to tell you that schema doesn't attach meaning to data. Schema defines structure. It can be used to validate data. But what the data means is beyond its scope. The characters represent an integer, and that integer is 42, but we have no idea what 42 means in this context. We don't know the intent of the producer. I've seen XML lull consumer authors into a false sense of understanding. They think they know what an element name means when they really don't. We need to use plain text to convey meaning, and more of it than a single noun element name. I think this is a central fallacy in the hype around XML.

Ah, then this is a simple matter of definition. Whether schema can attach meaning to data depends on whether you allow something that can do so to be called "schema"! Most XML schema languages currently in use - XmlSchema, RelaxNg, etc. - do, and future ones probably will. But perhaps your definition is more useful, since it keeps the distinction and avoids mixing up two concepts.

How does XmlSchema attach meaning to data? I've seen the spec and worked with it for validation, and I don't see anything in there I'd classify as "meaning".

Let's look at an example. Here's a snippet from XML Schema Part 2: Datatypes, § 3.2.9 date <http://www.w3.org/TR/xmlschema-2/#date>:

[Definition:] date represents a calendar date. The ·value space· of date is the set of Gregorian calendar dates as defined in § 5.2.1 of [ISO 8601]. Specifically, it is a set of one-day long, non-periodic instances e.g. lexical 1999-10-26 to represent the calendar date 1999-10-26, independent of how many hours this day has.

After some further discussion of the semantics of dates, the specification then goes on to discuss the "lexical representation". But first it defines the "value space", or the "semantics," or the "meaning" - terms which I consider more or less synonymous. The same applies to times, floats, integers, booleans, durations, and others. -- DanielBrockman
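
The split between lexical representation and value space can be sketched in Python. This assumes Python's datetime.date is a close enough stand-in for the value space of an xsd:date without a timezone; the two are not identical in every detail.

```python
from datetime import date

lexical = "1999-10-26"                # the lexical form quoted from the spec
value = date.fromisoformat(lexical)   # the calendar date that the text denotes

# Once in the value space, date operations are available
# without any further string handling.
weekday_index = value.weekday()       # 0 = Monday ... 6 = Sunday
```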

But knowing that an element is a date doesn't tell me what the element means. The date of what? Why is it here? What does it mean to the producer and the consumer?

There are multiple cascading levels of meaning, and schemas can only define low-level meaning. Schemas can 1) define the allowed structures and patterns, and, in most cases, 2) define the meaning of some character sequences. Your question, the date of what?, refers to a "date," something that did not exist until a schema defined it to be the meaning of the original character sequence. Let's say we had somehow defined the meaning of the date to be X. I believe I could easily come up with questions about the meaning of X, meaning that was previously not defined. Let's call this interpretation X' . Likewise, then, I believe there could be a meaningful interpretation of X' significant to both producer and consumer - and so on. In any case, the first level of meaning - "datatyping" - needs to be defined in order to allow further interpretation.

There's no infinite regress here, though. The field means one thing to the producer. The consumer must use that meaning. That meaning can be (and is) defined through natural language. Schema can define the structure and pattern of character sequences, but they don't define the meaning.

[The "hype" around XML is ignorance and a rush to market. XML is a transport mechanism, not a universal cure-all. XML only ensures that the data is represented properly to all interested parties. XML does nothing to ensure that the receiver of the data interprets the values correctly. That is application layer - well beyond the scope of XML's purview.]

[This is the "hype" surrounding the bulk of the Lisp vs XML pissing match on this Wiki. Everybody is getting stinky wet without really understanding what we're bickering about. -- AnonymousDonor]

On the other hand, it provides a low-level namespace mechanism to fight these kinds of problems. Logically, our favourite element probably isn't named "integer" so much as "{http://www.authority.example.com/datatypes/integer} integer". In that case, there's not really, in theory, room for "oh, just a standard C integer literal, huh?" assumptions; but I acknowledge that this can and probably has happened uncountable times. Not least, of course, because no matter how much you preach them, at the end of the day, far from all elements found in the wild will be in a namespace. -- DanielBrockman
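
How a namespace changes the effective element name can be seen with Python's standard parser, which reports qualified names in the same brace notation used above. The namespace URI is the example one from the comment above, used purely as an illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.authority.example.com/datatypes/integer"

# The same local name, once inside the namespace and once outside it.
qualified = ET.fromstring(f'<integer xmlns="{NS}">42</integer>')
bare = ET.fromstring("<integer>42</integer>")

# To a namespace-aware consumer these are two different element names.
same_name = qualified.tag == bare.tag
```

Here qualified.tag is the URI in braces followed by "integer", while bare.tag is just "integer", so same_name comes out False.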

If we're going to have to look at another document to determine what a field means anyway, why mix the names of the fields in with the data? A numeric ID would do just as well and mislead fewer people. -- EricHodges

Sure, but once you know what the elements mean, it would be annoying to have to constantly look through a reference every time you wanted to insert an element. NamesAreMnemonics; the meaning is defined elsewhere. Whether the low-level meaning is defined through the use of a schema is really irrelevant, as in the majority of cases additional, high-level meaning is also required. The high-level meaning may be defined in a specification, or it may be an unwritten agreement between producer and consumer. Of course, this can and probably has led to many misunderstandings, but I don't think the alternative is viable. Even if an element name is "self-explanatory," the name may be the cause of, but is not responsible for, any misunderstandings. -- DanielBrockman

In my experience no one needs to look through a reference when they want to insert an element because software generates the elements. The software can produce a numeric ID as easily as it can produce a string. Using strings confuses the issue of where meaning is defined. It tempts people to think they can understand the XML by itself. The alternative is viable. See the ISO specs for credit card transaction messages. Each field is identified by a numeric ID (which also has a mnemonic, but that isn't transmitted in the message because it's already known by both ends of the communication). Each message has a header that indicates which fields are present. Their meaning is defined in a very large book approved by ISO and used by all of the credit card acceptance banks.

And how do you suggest the software UI distinguish between <E1824> and <E9473> if not by the use of a name or mnemonic (or an icon)?

The UI can distinguish between them in whatever way is appropriate to the user. The concerns of the user are far removed from communication protocol concerns. In the case of credit card communications, the sender's UI (a POS device) may only prompt for a signature. The receiver (a transaction processing system) has no single UI. Everyone who works at the acceptance bank usually gets a customized view of the data.

A signature, huh? Isn't that what you use to identify methods in JavaLanguage? Moving the field name/mnemonic to a pretty UI doesn't make the meaning qualitatively more obvious.

It is, but not on a POS device (Point Of Sale). In that context a signature is something a human does with a stylus to authorize a transaction. This is a good example of how important context is for understanding natural language. The example I gave doesn't move the field name or mnemonic to a UI. Each UI is customized for its particular set of users. The communication protocol is hidden from the sender and receiver.

In any case, to really understand the meaning, you have to read the document in which the meaning is specified. XML was designed to be human-readable, which IMHO is a GoodThing.

It gives the illusion of being human readable, but you just said to understand the meaning you have to read the document in which the meaning is defined. XML tells a lie.

XML doesn't tell anything. All element NamesAreMnemonics. Once you have read and comprehended the specification, the XML becomes human-readable because the use of mnemonics makes it possible to remember what the elements mean. This would obviously not be possible under the obfuscation solution.

The use of element names helps only when you don't have software to translate IDs to their names. If you have the spec you usually have something that can do that translation, or you can easily write it. XML could have IDs and use the schema to provide names, definitions, algorithms, graphics, etc.

By obfuscating the raw data, you are only moving the problem, worsening it in the process. I wouldn't want to have to fire up MicrosoftFrontPage 2003 Advanced Server each time I wanted to edit an XML document. I think it's safe to say that if that was the case, XML wouldn't be anywhere near where it is today.

You have to fire up something to edit it. Why not use something that translates IDs to their mnemonics and links to their definitions as well?

Because that would be completely pointless. There would be just as much confusion about the meaning of the elements if the software simply translated the obfuscated XML to readable XML. The only difference would be the utterly ridiculous bookkeeping imposed on you to maintain the list of ID -> mnemonic mappings.

That bookkeeping isn't "utterly ridiculous". That's where the meaning is defined.

Okay, so let me recap: When the mnemonics sit in a piece of XML, they are meaningless and confusing; but when they sit in your element id database, they are suddenly made meaningful?

No, it doesn't matter where they are. Mnemonics don't transmit meaning. The bookkeeping I'm talking about is the link from ID to definition (in context). We need more than an XML element name to define an XML element. There's no good reason to send the element name in each and every message (twice).

Besides, if you're using domain-specific software to generate and interpret the data, why do you care what gets sent over the line or stored on disk?

Size is one concern. Identifiers can be much smaller if they don't have to be natural language strings, and the bulk of XML messages are usually element names (mysteriously repeated at the end of every element).

Granted. Size was explicitly excluded from the XML design goals. I thought we were discussing meaning.

You asked why I care what gets sent over the line or stored on disk. I work with FpML messages that can be compressed by an order of magnitude just by replacing the element names with IDs. Bandwidth and disk space are cheap, but an order of magnitude is still an order of magnitude.
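
A small Python sketch of replacing element names with short IDs. The message and the ID table are invented for illustration; neither comes from FpML, and the size difference on a toy message like this says nothing about the order-of-magnitude figure reported above.

```python
import xml.etree.ElementTree as ET

# Hypothetical message and ID table (assumptions, not FpML).
message = (
    "<trade><notionalAmount>1000000</notionalAmount>"
    "<currency>USD</currency></trade>"
)
ids = {"trade": "e1", "notionalAmount": "e2", "currency": "e3"}

root = ET.fromstring(message)
for element in root.iter():           # iter() visits the root as well
    element.tag = ids[element.tag]    # swap the mnemonic for its ID

compact = ET.tostring(root, encoding="unicode")
saved = len(message) - len(compact)   # characters saved by dropping the names
```

The receiver would need the inverse table to recover the mnemonics, which is the ID-to-definition bookkeeping debated earlier.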


The same question [why do you care?] can be asked of XML. If you're using domain specific software to generate and interpret the data, why do you want natural language strings surrounding every element sent over the line or stored on disk?

But I don't want to have to use domain-specific software! I'd much rather use the general text editing tools that I'm used to and which have been under refinement for decades. I could use WhatYouSeeIsWhatYouGet HyperTextMarkupLanguage editors today if I wanted to; but I don't, and there's a good reason for it.

[Why? What's the advantage to using general text editing tools?]

Because I have my text editor right here doing everything I need, and I completely fail to see what, if anything, MicrosoftFrontPage and MacromediaDreamweaver can do for me. The BurdenOfProof is on you to show the alleged advantage of using domain-specific WYSIWYG tools.

Front Page and Dreamweaver are definitely not the tools I'm talking about, nor have I said anything about WYSIWYG.

Oh, I'm sorry I interpreted it that way. I too can imagine some domain-specific tools that I'd like to use. But completely abandoning text editors and relying exclusively on one domain-specific tool for each document format I ever want to edit seems unrealistic at best. I'd expect at least 90% of those tools to be a huge pain to work with.

Most XML I've seen is never edited by a person. It is generated by software and received by software.

I edit lots of XML myself, so maybe that's why we disagree. For instance, I author web pages and XSLT stylesheets, edit configuration files, and write XML-based software - and it all includes editing XML. True, most people don't; but many do, too. Personally I see lots of hassle and not one advantage to your enforced ID-only XML. Maybe if it was a pure over-the-line encoding. If no-one else but the encoder and the decoder at the other end of the line got to see it, then it'd be all good as extra compression. But don't take away the current XML syntax when it works so well for so many people (mod mandatory end-tags).

I used to work on web portal software. It involved a lot of XML, but I rarely edited it by hand. Usually I had a tool to manipulate it. If I didn't I wrote one. Today I work on messaging for a big bank. A few years ago they bought the XML hype and replaced their CommaSeparatedValues and fixed-length formats with XML. Today their messages are 10x longer, slower to parse, and no value has been added. This is being used by thousands of systems, each sending thousands of messages each day. It dwarfs the use of XML for web sites and config files. Hardware that could handle the earlier formats with ease has had to be replaced with more expensive hardware, or duplicated multiple times.

If there's a UI it's customized for the user and the domain. In the rare instance where a human has to look at a raw message, I'd like a tool that could display any level of documentation you want (from mnemonics to complete docs) for any element.

Me too! It could pull documentation from the schema, complete with links into the specification. I would guess something equivalent already exists for XmlSpy and things like that. Do you know of any tools that do this?

If there is I haven't seen it. I don't think it would do much good. The "annotation" element isn't used very well by existing schemas. For instance, look at this snippet of the FpML schema:
 <xsd:element name="notional" type="Money">
     <xsd:documentation>The notional amount.</xsd:documentation>
This is "self documenting" only in the sense that a Java method getFloop() is self documented by the JavaDoc comment "Get floop amount."
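
For what it's worth, a tool that pulls documentation out of a schema is not hard to sketch in Python. The snippet below wraps the element in the xsd:schema and xsd:annotation elements that standard XSD requires; that wrapping is an assumption about the surrounding FpML source, which is not quoted here in full.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# Assumed minimal schema around the snippet; only the element line and
# the documentation line come from the discussion above.
schema = f"""
<xsd:schema xmlns:xsd="{XSD}">
  <xsd:element name="notional" type="Money">
    <xsd:annotation>
      <xsd:documentation>The notional amount.</xsd:documentation>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
"""

root = ET.fromstring(schema)
doc_path = f"{{{XSD}}}annotation/{{{XSD}}}documentation"

# Map each declared element name to its documentation text (None if absent).
docs = {
    element.get("name"): element.findtext(doc_path)
    for element in root.iter(f"{{{XSD}}}element")
}
```

The lookup is easy; as noted above, the result is only as useful as whatever the schema authors bothered to write.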
XML designers ignored every piece of research that came before them in the area of designing external data representations. It's an absolutely basic requirement that such a representation be able to accommodate such basic elements as integers, strings, and so forth.
I don't think I've ever seen an XML file as compact or human-readable as an INI file. And I can't think of any advantage of XML over INI. A schema can be defined up top, it supports multiple data types (sets, arrays, etc) and has all the advantages and consumes far less space/bandwidth as well as using well understood file libraries (file I/O). What I have seen is months of effort to create XML libraries, load, convert, format, read, write, transform, split, etc. XML reinvents the wheel into a square wheel. See: http://grumpyoldprogrammer.myblogsite.com/blog/Definitions/_archives/2005/5/12/834657.html

I'll bite. How do you express collections and hierarchical data in an INI file?

There are millions of ways to express them, here is one:


 [HierarchyName1]
 NodeSchema=FirstName,Surname,DOB,Gender
 Tree1=Node1(Node2 Node3(Node5 Node6) Node4)
 Node1=John,Jones,19700106,Male
 Node2=Barbara,Xmlcritic,19801230,Female
 ...
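
One possible reading of the INI layout above, sketched in Python. The parse_tree helper is hypothetical, and the interpretation of Tree1 as "name followed by a parenthesized list of children" is an assumption; the example does not define its own grammar. The trailing "..." line is left out because it is not valid INI.

```python
import configparser
import re

INI_TEXT = """
[HierarchyName1]
NodeSchema=FirstName,Surname,DOB,Gender
Tree1=Node1(Node2 Node3(Node5 Node6) Node4)
Node1=John,Jones,19700106,Male
Node2=Barbara,Xmlcritic,19801230,Female
"""

def parse_tree(text):
    """Parse 'A(B C(D))' into a list of nested (name, children) tuples."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def parse_nodes():
        nonlocal pos
        nodes = []
        while pos < len(tokens) and tokens[pos] != ")":
            name = tokens[pos]
            pos += 1
            children = []
            if pos < len(tokens) and tokens[pos] == "(":
                pos += 1                  # step over "("
                children = parse_nodes()
                pos += 1                  # step over ")"
            nodes.append((name, children))
        return nodes

    return parse_nodes()

parser = configparser.ConfigParser()
parser.optionxform = str                  # keep key case; the default lowercases keys
parser.read_string(INI_TEXT)
section = parser["HierarchyName1"]

fields = section["NodeSchema"].split(",")
tree = parse_tree(section["Tree1"])
node1 = dict(zip(fields, section["Node1"].split(",")))  # every value is still a str
```

The hierarchy needs its own small parser on top of the INI library, and the field values come back as strings here too.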

It should be noted that the INI example above looks strikingly like LDIF ... which is completely hierarchical in nature.

Although I would recommend table format for sets since you not only store the information in a human readable format, but it can easily be sorted/grouped/aggregated/imported/exported/filtered/etc by all. And it has a billion ubiquitous viewers and can be edited in place or transmitted in binary for low bandwidth etc. I am just illustrating that XML adds nothing and subtracts a lot. No one ever needed a new way of expressing information. We have expressed all possible information perfectly fine for half a century.
That seems like a biased argument. The INI example makes no sense at all to a human reader; XML would give a tree structure that is much easier to see.

It also seems that it would be limited and even more difficult to write / parse than XML.

I don't think XML is a good solution, but I am sure XML is a better solution than INI.
This page is filled with unsubstantiated assertions, PersonalChoiceElevatedToMoralImperative, etc. All of this blather about "meaning" versus format and such is really outside the bounds when it comes to discussing XML. That's because XML is a transport, and not a means by which data is interpreted to "mean" any one thing.

Therefore, this whole page should be reduced to about a half dozen fragments of ThesisAntithesisSynthesis. Any WikiGnome game?
See: XmlSucks, XmlIsaPoorCopyOfEssExpressions

CategoryXml, CategoryRant

(last edited January 19, 2014)