Myth Of Metadata

MetaData Myth: MetaData Reality: Rather than argue over the general value of metadata, perhaps we should explore where metadata works well and where it doesn't. The alternatives can get pretty ugly also.

See: JobSecurity, BarbiePrinciple
Related to this myth, there's another myth: that ExtensibleMarkupLanguage is superior to ASN.1 or CORBA CommonDataRepresentation? or any other data interchange format because it is "self describing". That's a little too much.

Let's take an example:
	<order customer_id="X">
	<item quantity="5">
	<identification type="ISBN">....</identification>
	<quotedprice value="44.95" currency="USD" />
	</item>
	</order>
To me as a programmer, the thing is indeed almost self-describing. If I also saw a well-commented DTD or Schema file, I would be thrilled. To a computer, these considerations have absolutely no value: we cannot yet build intelligence into computers to take in metadata and do something useful with it. Before we can arrive there, we still have to find better ways to understand and define metadata.

Before we do that, ASN.1, CommonDataRepresentation? and XML are absolutely equivalent to a computer; the difference is that with XML we trade bandwidth for human convenience. XML can be easier to hand edit (just needs a text-editor), but for the purpose of building communications between systems this is irrelevant. When the computer can do everything, we won't care about these formats. For now, XML is easier for humans and library writers. That is all. -- CostinCozianu and others
	<asd dfr_id="X">
	<iu lwm="5">
	<qaz bvn="ISBN">....</qaz>
	<pkdw cvv="44.95" dfgfd="USD" />
	</iu>
	</asd>
The above shows why XML is not self-describing. Using tags that look meaningful to a human and assuming that there is indeed meaning in the tags available to a machine has been called the GensymFallacy? in the AI community. Moreover, there are no machine-available semantics in the structure of the XML: what is the relationship between iu and qaz above?
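A minimal sketch of this point, using Python's standard xml.etree.ElementTree. The closing tags, the constant names and the helper `shape` are additions made for illustration, not part of the original examples. A parser recovers the same structure from both documents, and nothing in either one tells the program what a quantity or a price is.

```python
import xml.etree.ElementTree as ET

READABLE = """
<order customer_id="X">
  <item quantity="5">
    <identification type="ISBN">....</identification>
    <quotedprice value="44.95" currency="USD" />
  </item>
</order>
"""

OPAQUE = """
<asd dfr_id="X">
  <iu lwm="5">
    <qaz bvn="ISBN">....</qaz>
    <pkdw cvv="44.95" dfgfd="USD" />
  </iu>
</asd>
"""

def shape(element):
    """Return the tree with every name stripped out: attribute values
    in document order, trimmed text, and the shapes of the children."""
    return (
        list(element.attrib.values()),
        (element.text or "").strip(),
        [shape(child) for child in element],
    )

readable_shape = shape(ET.fromstring(READABLE))
opaque_shape = shape(ET.fromstring(OPAQUE))

# Same nesting, same values: the names are the only difference,
# and the parser attaches no meaning to them.
print(readable_shape == opaque_shape)
```

Whatever a program does differently with the first document comes from code that was written with the names in mind, not from the XML.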

In the above example, I can provide metadata that defines precisely what each of the above tags "means" to the machine. XML provides a syntax so that this metadata is denoted using exactly the same syntax as the tags it applies to. If you had supplied the metadata, I could not only answer your question but dynamically alter, extend, or remove that relationship. I would encourage you to compare the construction of a distributed API using XML to the same task performed using, for example, IDL. Both allow the metadata to be described - but only XML describes it in the SAME syntax. -- TomStambaugh

There is questionable value in having the metadata encoded with the same mechanism as the data itself. There are places where this works, and places where it doesn't. However, you have to be more explicit when you say that you can define what things "mean" to a machine. -- CostinCozianu

While I appreciate your effort to be even-handed, in twenty years of using Smalltalk I haven't found any "places where [this] doesn't [work]" (but see InAllMyYearsIveNever). As far as being more explicit when I say that I can define what things "mean" to a machine, I refer you to the Smalltalk metastructure. Specifically, classes #Behavior, #Class, #Metaclass, #CompiledMethod?, #Block and #Process will be a good start. XML is more than expressive enough to allow me to specify precisely what I mean by member, type, method, and so on. I can then use that specification in text such as you supplied to parse, traverse, and process pretty much anything you want. If you really care about what I mean by "mean", I direct you to, for example, the work of Brian Cantwell-Smith - specifically, his definition of a "notation", "symbol" and "meaning", and his definition of the theta/arctheta, phi/arcphi, and psi/arcpsi functions. That's as precise a definition of "mean" as I know. -- TomStambaugh

Sorry Tom, XML is capable of no such thing. XML is a tool for making markup languages. If we wanted to make a markup language to describe Smalltalk, that would be fine, but the semantics and meaning of Smalltalk expressions are not available from or held within the XML data. That would be like claiming the meaning of English sentences is held within its alphabet, or even its individual words. The best you can do with XML is lay out Smalltalk expressions for consumption by a Smalltalk evaluation. That does not make the XML meaningful of itself - a process is what determines what that XML means. The common form of this myth in the XML community is 'XML is like Lisp', but XML does not work like Smalltalk, Lisp or logic languages founded on a ModelTheoreticSemantics?. Quite the opposite: being a purely syntactic form, XML has no semantics or evaluation rules whatsoever that would allow us to accept what you're claiming. -- BillDehora

I'd like to mention that the semantics and meaning of English aren't available in sentences, paragraphs, or any other syntactic construct. The meaning is in the reader that evaluates it. Showing an English sentence to someone who doesn't know English and asking what it means demonstrates this. Metadata is most problematic when the meaning is ambiguous.

So the problem is that you've been referring to Smalltalk, an example where things work. Still, not even Smalltalk's metastructure is as omnipotently expressive as you credit it. Constructs such as "type" (which the ComputerScience community widely accepts should be essentially different from the concept of class), which other languages can express as a primitive of the language (or of the metamodel), can't be equivalently expressed in Smalltalk. Other favorite examples of mine are relational databases, where the metastructure is expressed as a set of relations (the data dictionary). But not all models have this characteristic that their metamodel can be sensibly or efficiently defined within the model.

I've worked on, with, and seen multiple Smalltalk environments that include the notion of "Type" (usually, but not always, mapped onto "Class"). For example, "Strongtalk" was externally available for a while. Within IBM, researchers in RTP built a similar environment that used specially formatted (a la Javadoc) comment headers to accomplish similar goals. At my own startup, we built an Eiffel-style "weak typing" system in Smalltalk. It is straightforward (not "trivial", but requiring no "inventions") to create an "InstanceSpecification?" class, instances of which can then be used to describe instance variable slots in the Smalltalk metastructure. The question of whether or not these are "primitive" is a RedHerring - anything can be made a primitive in Smalltalk, but few things matter enough to make it worthwhile. I agree that "not all models ..." - in fact, most commercial systems (Java and CeePlusPlus to pick on two) cannot "sensibly or efficiently" define their metastructure from within the environment. This, in my opinion, is an important reason why those environments are so hosed and why a good object-oriented developer is so much less productive in those environments. Lisp and Eiffel, on the other hand, are two environments for which the metamodel is reasonably strong. -- TomStambaugh

Now, going back to the expressivity of XML, I very much doubt that it is as expressive as you may want it to be and among things that you can't easily express are types, constraints, relations and almost every useful data modeling construct. With a little effort, you can define new XML document types to encode that metadata also, but that was not my point. Expressing such things in XML is like expressing them in ASCII: of course you can do it, but there's nothing in XML that offers you a significant support for such an endeavour.

Of COURSE you have to define "new XML document types"! That is the entire point - that you can, and that all participants can adjust themselves accordingly. And if you really feel that "there's nothing in XML that offers you a significant support for such an endeavor", I suggest that you try to accomplish the task in ASCII. I've had exactly the opposite experience. I was able to replace pages of IDL with a dirt-simple XML DTD and associated documents. It saved my team literally months, even years, of development time. -- TomStambaugh

On the "meaning" of data, in plain English, there are two aspects to this story: Encoding external meaning within the system, and verifying that that data is "meaningful", is something that we have yet to find out whether is tractable at all. This was the subject of this page: no matter what wonderful metamodels you may construct, you cannot, at the current state of affairs in AI, or as far as I and the other contributors know about, achieve self-describing data that software systems can dynamically combine and produce results meaningful to the human user.

Cantwell-Smith was doing this at least a decade ago (Lisp-2, Lisp-3). I'm not sure what "current state of affairs in AI" you refer to, but software systems routinely "dynamically combine and produce results meaningful to the human user" every day, at least at the level I'm talking about. -- TomStambaugh

The "internal meaning", the set of constraints that data must satisfy, is what a "data modeler" devises as an approximation of the ideal set of constraints that would allow only correct data (from the perspective of "external meaning") to enter and be transformed by the software system. Data that satisfies all these internal constraints is said to be consistent, but not necessarily "correct" - fully satisfying the external meaning and being a truthful representation of the reality modeled. The more significant internal constraints the software system can verify and enforce, the lower the probability of errors. From this perspective, of supporting internal constraints, XML itself and related technologies (including XML Schema, parsers, XPath, DOM, SAX, et cetera) currently offer far from satisfactory possibilities, and that's what I was referring to when I asked you to be more precise. -- CostinCozianu
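To make the consistent/correct distinction concrete, here is a small sketch in Python. The function `check_item`, the currency whitelist and the particular constraints are all hypothetical and chosen only for illustration. It checks internal constraints on the order item from the top of the page and has no way of checking the external meaning.

```python
from decimal import Decimal, InvalidOperation

KNOWN_CURRENCIES = {"USD", "EUR"}  # hypothetical whitelist

def check_item(quantity, price, currency):
    """Return the violated internal constraints (empty list = consistent)."""
    problems = []
    if not quantity.isdecimal() or int(quantity) < 1:
        problems.append("quantity must be a positive integer")
    try:
        if Decimal(price) <= 0:
            problems.append("price must be positive")
    except InvalidOperation:
        problems.append("price must be a decimal number")
    if currency not in KNOWN_CURRENCIES:
        problems.append("unknown currency")
    return problems

# Consistent: every constraint holds. Whether 5 copies at 44.95 USD is
# what the customer really ordered cannot be verified from the data.
print(check_item("5", "44.95", "USD"))

# Inconsistent: rejected before it enters the system.
print(check_item("five", "-1", "XYZ"))
```

The checks live in the consuming program here; the argument above is about how much of this the XML toolchain itself can express and enforce.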

Well, perhaps you have a different threshold of "satisfactory" than me. I mean that when I have to build a system of distributed clients communicating with distributed servers in a heterogeneous (software, platform, OS, etc) environment, I find "XML itself and related technologies" a) more than satisfactory and b) qualitatively and quantitatively superior to alternatives like IDL, RMI, COM, and so on. -- TomStambaugh

To cut a long story short, XML metadata is, at this time, expressed as DTD which is NOT XML. What would be XML metadata in XML - XML Schema - has at best beta support in current validating parsers, and when we talk about production quality parsers there's none. While XML schema is kind of promising, the constraints you can express in DTD are just not good enough.

So when you say that XML is quantitatively and qualitatively superior to IDL, you have to have some arguments behind it. There's no data out there that can be encoded as XML and can't be encoded as GIOP. In GIOP you can restrict a specific piece of data to have a predefined type, while with the "current technologies" you can't in XML. Quantitatively, maybe you want to say that XML is bloated or something: the same data encoded in XML surely takes a lot more (maybe an order or two of magnitude) in terms of space, bandwidth, and CPU cycles. But this all depends on what you use XML for; XML is different things to different people. Just stating that XML is qualitatively and quantitatively superior, well, allow me to have a different opinion. -- CostinCozianu
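A rough way to see the space trade-off, sketched in Python. The binary layout below is a hypothetical fixed record packed with the standard struct module; it is not GIOP or CDR, and one tiny record says nothing reliable about the ratio in general.

```python
import struct

# The order item from the example at the top of the page.
quantity, isbn, price_cents, currency = 5, b"....", 4495, b"USD"

xml_bytes = (
    '<item quantity="5">'
    '<identification type="ISBN">....</identification>'
    '<quotedprice value="44.95" currency="USD" />'
    '</item>'
).encode("ascii")

# Fixed layout agreed out of band: unsigned 16-bit quantity, 4-byte
# identifier, unsigned 32-bit price in cents, 3-byte currency code.
binary_bytes = struct.pack("!H4sI3s", quantity, isbn, price_cents, currency)

# The XML form repeats the field names in every record; the binary
# form carries only the values.
print(len(xml_bytes), len(binary_bytes))
```

The binary record is smaller because the names and types were moved into a separate agreement (the format string), which is the role IDL plays in the argument above.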
For real examples of MetaData, see how the MetaObjectProtocol works in CommonLisp.
 * Metadata is almost always proprietary.

So, Tags in LaTeX are proprietary? The MOP is proprietary?

Maybe what you mean is: bad examples of MetaData adopted bandwagon style by companies trying to lock customers in is proprietary?

Maybe :)

The metadata description languages are separate from the data themselves. Different fields care about different data. So it's often a thing like "well, the Dublin Core spec specifies xxx core things, but we really need to track yy and zz as well." The metadata description specs are often ambiguous enough that companies / projects end up extending in proprietary ways because it's not obvious whether the language can be used to specify the extra fields they want.

I think a big part of the reality is that people interpret metadata differently. I might call the author of a piece of content Creator, while you might use Author, and use Creator as a field to specify who actually instantiated that piece of content. So already, the pipe dream of intercommunication is gone - even if we're using the same language to describe them!

-- jps
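The Creator/Author mismatch can be sketched in a few lines of Python. The records and the function `written_by` are made up for illustration.

```python
# Two hypothetical records that use the same field name differently.
record_a = {"Creator": "Alice"}                    # Creator = who wrote it
record_b = {"Author": "Bob", "Creator": "Carol"}   # Creator = who uploaded it

def written_by(record):
    """Naive reader that assumes 'Creator' always names the writer."""
    return record.get("Creator")

print(written_by(record_a))  # Alice, as party A intended
print(written_by(record_b))  # Carol, although party B meant Bob
```

Both records are well-formed and use a shared vocabulary, yet the reader silently gets the wrong answer for the second one.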

Of course, but then I say + is useless because I interpret it differently than you.

There is metadata in XML, but we don't trust it. When combined with schemas, the metadata in XML becomes markup and cannot completely describe the data without schemas.

Metadata is also duplicated in processing applications, as we can't process XML otherwise.

Yes, metadata in XML is a myth.
You guys. <tsk, tsk> Confusing metadata's usefulness with its description? You know, if I can describe something well enough with carefully chosen names, attributes, and other metadata, even one of you losers can figure it out. Eh? Hey, I should know - I used to be a non-XML loser!


But seriously, folks - how can names and metadata not contribute to the value of information being conveyed through XML? Granted, the Bloofta content of an Ekmotz entity may mean nothing to me just looking at it, but to anything that understands Ekmotz and Bloofta they are the world. The same thing is true of any written word that is conveyed through any medium. The entities being described need to have the same meaning to both sending and receiving parties for there to be any meaningful discussion at all. How does the presence or absence of metadata have any impact on that?

Oh, and by the way - let's not confuse the use of XML with the use of straight-up binary data. Each has its application areas and places where it should never venture. For intra-application communication, between servlets on the same host, and long haul, high volume situations it should be obvious that XML is not a good choice. For inter-application comm, between servers on multiple hosts (or through multiple switches, etc.), and local database storage, XML provides a solution that allows for much analysis by human intervention. I like that. Lots. -- MartySchrader

But look at what you're saying - we require something else that understands what is inscribed. The point is that XML adds precisely no meaning or semantic import to what is inscribed in it. You could have just as easily picked CommaSeparatedValues or ASN.1. -- BillDehora
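A sketch of the same point in Python, using the standard csv and xml.etree.ElementTree modules. The function names and the column order of the CSV line are assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

AS_XML = '<quotedprice value="44.95" currency="USD" />'
AS_CSV = "44.95,USD\n"

def price_from_xml(text):
    element = ET.fromstring(text)
    return (element.get("value"), element.get("currency"))

def price_from_csv(text):
    # Column order is part of the out-of-band agreement.
    value, currency = next(csv.reader(io.StringIO(text)))
    return (value, currency)

# Either way the program ends up with two strings; treating them as an
# amount of money is a decision made in the consuming code.
print(price_from_xml(AS_XML) == price_from_csv(AS_CSV))
```

In both readers the knowledge of what to extract is written into the code: attribute names in one case, column positions in the other.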

Sure, I guess. But XML carries names with every element and attribute, so those names convey some meaning to me as an observer. If somebody chooses names that don't convey any meaning or are actually misleading, then the advantage goes out the window. I hope we're not talking about that.
I thought the purpose of XML was to have a specification for developing custom markup languages for data interchange between parties that have agreed to the meaning of the markup.

Given this purpose, there was the idea for namespaces which I thought were supposed to be a collection of predefined, universal markup tags that one could reference so that both parties were using the same tags and wouldn't have to create them.

These namespaces would be abstractions of specific vertical markets since the fundamental language of most vertical markets are identical. In a way, namespaces would be like "jargon dictionaries".

The idea is good since it is based on the use of agreed forms of communication just like a protocol.
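Mechanically, a namespace is a URI that qualifies element names so that two vocabularies can share a document without colliding; the meaning behind the names is still whatever the parties agreed on. A small Python sketch, in which both namespace URIs and the document are hypothetical:

```python
import xml.etree.ElementTree as ET

BOOKS_NS = "http://example.org/ns/books"  # hypothetical, agreed in advance

DOCUMENT = f"""
<order xmlns:bk="{BOOKS_NS}" xmlns:x="http://example.org/ns/other">
  <bk:identification type="ISBN">....</bk:identification>
  <x:identification>internal-42</x:identification>
</order>
"""

root = ET.fromstring(DOCUMENT)

# Look the element up by namespace URI plus local name; the prefix the
# sender happened to choose ("bk" here) does not matter.
matches = root.findall(f"{{{BOOKS_NS}}}identification")
print([m.get("type") for m in matches])
```

The receiver picks out the element from the agreed vocabulary and ignores the one with the same local name from another namespace.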

Apparently, XML was seen as just some form of easily customizable alternate storage format. I figured XML, through the use of namespaces, would serve the same purpose for data interchange as browser specifications do for browsers: an agreed-upon set of functionality that all browsers would implement, eliminating the need for different communications to different browsers for the same function.

This was my understanding when reading about XML when it first appeared. As usual, one can never be sure to what use humans will put new tools when they are placed in their hands.

I haven't done any programming beyond simple shell scripts in a long time since I am not a programmer. For my internal use I use CSV as my generic data transfer format. This means I am not qualified to speak about how XML is used. I'm also not sure that my understanding of the original purpose of XML is correct. I would appreciate any comment or critique of this that would enhance my understanding.


Your understanding is correct. The views that pushed XML as a general replacement for SQL and its databases -- and in some cases the DBMS too -- have been almost entirely shown to have been misunderstandings or delusions. XML can, however, be used as a format for representing documents -- which sometimes raises LaynesLaw-invoking debates over the distinction between a "database" and a "document".

Contributors: CostinCozianu, TomStambaugh, MartySchrader, miscellaneous...

Last edited September 9, 2010