Processing Markup Languages

(Moved from StructuralRegularExpressions)

I don't know if its related, but parsing XML and markup languages brings up an interesting question: what does the result look like? As a table-head, I'm tempted to draft the parsed result as something like:

  table: tag
  sequence   // may be superfluous with ID, depending on environment
  parentRef  // parent tag (0 if no parent)
  isClosed   // "1" if self-closing such as <foo x=1\>
  // Alternative closing tracking to better handle bad or non-trees:
  closeType  // B=beginning tag, E=ending tag, S=self-closing

table: attribs -------------- tagRef // f.k. to tag table attribName attribValue

A dummy tag, such as "intertext", can perhaps be created for text between tags. What would non-table-heads choose as the output format? Iterators?

{A formal Grammar (as per the ChomskyHierarchy) can be more general than any regular expression - i.e. capable of matching anything a regular expression can match, plus plus. The basic output of matching for any grammar is very simple: a boolean indicating whether the structure is a match - is 'legal' within the grammar. The same would be said for structural regular expressions - simply returning a boolean as to whether the structure matched. Of course, it's also nice to know -why- a match occurs. To do that, you essentially add variables to the regular-expression or grammar, and you also need to handle choices (where a component is X or Y or Z, you need to know which 'choice' was made). The 'return' from a match in this case is the choice and its associated variables. Very simple. Things get a tad more complicated if you wish to receive recursive or list variables, but only a tad more.}

{As far as parse-results in general, either a single hierarchical value-structure, or a stream of such structures (depending on the medium), is most natural for the result of a parse. Database tables are best used for data - literally, propositions held to be true - not for partial values that individually say nothing of their world or context. Also, you need to be careful with tables: ordering is relevant in the parsing of XML and SGML.}

I believe your response reflects the BigIron RDBMS bias. Tools like a NimbleDatabase and DynamicRelational can transcend the BigIron stiffness of the current flavor of RDBMS. And, relational is perfectly capable of storing and using ordering-related info. --top

{Relational can store ordering if you include some sort of ordering identifier - a double or integer, for example. That is not reflected in your above schema.}

{And NimbleDatabases are still databases. They ought to store meaningful data. Structural information on the tags in the file is not particularly meaningful. Admittedly, at least, you intend this 'table' to be extremely transitory and subject to further processing. But I can't see it as a great deal more useful than storing structural information on anything else... e.g. on text files:}

  table: my-text-file
  char  // unicode codepoint
  pos   // position in file (integer)
{That you CAN extract and represent structure in a set of tables by introducing artificial identifiers and references doesn't make it particularly useful to do so. You maintain the exact same information, and at least the same capability of processing it, simply by using a plain data-structure - a string for text-files, and a hierarchical structure for XML. If you were to extract semantic information (e.g. relating books to their authors) from the XML file into a table, I'd have far fewer objections. Unfortunately, even XML schema never specify semantics.}

I am not following your argument. If one wants to process information from an XML/HTML document, using it in raw form can be daunting. SeparateIoFromCalculation applies here. I don't want to mix parsing with processing if I don't have to.

{I'm not saying you should mix parsing with processing (at least for XML; streaming text would be a different matter). I'm saying that a table doesn't help. It provides no benefit whatsoever over a hierarchical datatype (since you NEED the structure in order to make sense of an XML document), but they're a bit harder to work with (since you'll be rebuilding the structure in order to make sense of the XML document). Tables can be made to work, but they aren't the right tool for the job. Everything you break down into that table will need to be built up again to make sense of it. It's a little bit like the tables are artificially injected just to look neat... like GoldPlating for the parsing problem.}

Plus, if its in a structure, one can go back and forth and up and down etc., while during-parse iterators (getNextTag(...)) generally cannot.

{I did not propose use of iterators. I'd much prefer to take a hierarchical data structure of the XML, and write a FoldFunction over it (using Haskell or OCaml-style pattern-matching) that extracts semantic information, and transform it in one step to represent the semantic information. XML is pure structure, and its semantics are largely embedded in that structure; you'll always need such a step to capture semantics. I'd be quite willing to drop into a nimble database the semantic information extracted from the XML.}

Comparison to a raw text file is not applicable because there is no pre-digestion into tokens of some sort in a character-per-character representation. With the tag tables, one is dealing with XML statements and their related attributes, not individual characters. It goes up the abstraction ladder from mere characters. It's at the token and statement level, not at the character level. Maybe I'm not understanding what you perceive as the alternatives.

{Comparison to raw text file is applicable, though perhaps a little extreme. It actually does some pre-digestion, btw - to get unicode codepoints, you'll generally need to "pre-digest" UTF-7, UTF-8, or UTF-16. But if use of 'tokens' are your primary objection, it wouldn't hurt my example to separate the text-file into tokens. Either way, having just the structure of the text is quite pointless - it isn't useful data. Neither the text format nor the XML format you propose offer real 'data' about anything except representation of 'data' in the file.}

  table: my-text-file-2
  token   text-word or other token (punctuation, formatting)
  pos     position in file (integer)

{It might be hard to see where I'm coming from without a couple examples... e.g. an XML file for configuring a GUI, and/or an XML update format for multi-library purchases, checkins, and checkouts.}

{As a note, when I'm feeling lazy about it and am uncaring about efficiency, my approach looks something like:}
  XML => Node (value-object) with sequence of sub-nodes (including text-nodes) and map of attributes
  Node with sequence of sub-nodes and map of attributes => function that extracts all semantic information
  Semantic information => manipulation of application (change settings, update DB, etc.)

{Iterators might be used over the sequences and maps for languages that lack efficient pattern-matching or fold operations (e.g. C++), but would be performed over the structure, rather than directly out of the XML. More relevantly, the function that extracts information is generally somewhat recursive and/or 'deep'... e.g. to construct recursive components. For a GUI-description, for example, it needs to design components that set up windows with certain widgets and display hooks (gauges, lists, etc.) and operational hooks (on click) and whatnot. I've done such when writing an OpenGL-only GUI for an in-video-game menu system. Relevantly, if I had put all this into table, I could have made that work too... but I'd need to do all those joins just to acquire the structure back in order to build the GUI.}

{Here's a question I have for you: (Other than the fact that you simply love tables and want to inject them into every operation you perform ;-) Why would you break down a hierarchical structure when you'll just need to do re-work to build it again? Especially for something as transitory and (by itself) meaningless as the XML parse result?}

I don't understand the question. And, XML is not necessarily hierarchical. How about you present a specific scenario. Otherwise, we'll probably keep talking past each other via generalizations.

{With XML you're legally limited to exactly one tag if you wish to entirely avoid hierarchy. More relevantly, the cases where the hierarchy is shallow are rare; there's not much point in arguing what is 'necessarily' true when actual truth is readily available. As far as talking past one another... that isn't really possible without you doing a bit more talking. I will use a scenario if you present one. The above question I have for you is from the 'then what?' aspect of breaking the XML hierarchy down so you can fit it into a table. So you broke it down... then what? My conjecture, from both experience and analysis, is that the 'then what' is (except for some possible rare cases) to rebuild the hierarchy so you can extract meaningful data out of the XML. Since the hierarchy provides context that gives the tags their meaning, doing so would often be necessary. And doing so strikes me as rather pointless rework.}

Ebay Example

Well, okay, let's say we want to mine Ebay for certain products and related info. Suppose there are a bunch of TD's in the HTML that contain description and value pairs, such as the label "price" and the value of price. Example:

  <td color="brown" align="right">Price:</td><td>$19.95</td>

tagID...seq...tagName...closeType ---------------------------------- 30......50....TD...........B 31......51....INTERTEXT....S 32......52....TD...........E 33......53....TD...........B 34......54....INTERTEXT....S etc...

tagRef..attribName..attribValue ------------------------------- 30......COLOR.......brown 30......ALIGN.......right 31......VALUE.......Price: 34......VALUE.......$19.95 etc...

(Dots to prevent TabMunging. Also note that the "intertext" convention and the "closeType" convention option is used in the example.)

If we wanted to extract that, then we can search for the text contents of "price:" between TD and end-TD tags. When we find it, then just increment the sequence number by 3 to get to the adjacent TD's contents tag, and we have price (and a sanity check on tag name). We don't have to traverse trees in this case, and even if we did, its perfectly possible to iterate through trees when in tables. (And some query languages have built-in tree traversal operations.) I find that data often naturally is either not tree-shaped to begin with, or can easily be "flattened" for the purpose at hand such that dealing with viney trees in raw form is not that common of a need. Trees are over-used and over-assumed in IT in my opinion. (I may use more intermediate tables closer to the domain/project besides just the parsed markup tables shown here. Some language/tools make ad-hoc or nimble tables easy, some don't.)

Note that I wasn't clear on how to deal with ending tags, but the process is the same regardless of which approach is taken.

{If you choose, as you've done, to relegate the problem to the pattern-matching algorithm, you've simply shifted the problem... not solved it. You'll end up relying upon a consistent pattern in the source text. This particular problem wouldn't have been any more difficult to extract all the 'Price:' components from even if you chose to use a regular expression over raw text, which indicates that extracting first to the table is still GoldPlating. And, as with regular expressions and other naive pattern-matching algorithms, your approach would ultimately fail on the exceptions. E.g. it can fail if they choose to make a price-value bold (<td><b>$19.99!</b></td>) in order to indicate that it is a 'Buy it Now!' opportunity or that the auction is about to expire, and it can fail if they have some inconsistent spacing (<td>Price:&nbsp;</td><td>19.99</td>), etc.}

{Of course, HTML in general is a horrible medium from which to extract any sort of semantic information. (Raw English text would be better!) Here is a realistic sample from the eBay table (for ONE item):

	:<div class="showcase navigation"><span class="gallery" onmouseover= "onGalleryPlus(event, '310012933651', '1', , '0', , '');"><img class="link" src= "" align="absmiddle" height="16" width="16">&nbsp;<a href="#Enlarge">Enlarge</a></span></div></td><td class="ebcTtl"><img title="New" alt="New" src= "" border="0"> <h3 class="ens fontnormal"><a href= "">Fire Starter, Pine Pitch Loaded Wood, Trade Wood,</a></h3><span class="icons"><img src="" id="sr_giftIcon_7" border="0">&nbsp;</span><div class="navigation">Low Shipping For Additional Bundles&nbsp;<br></div></td><td class="ebcPpl">&nbsp;<img src="" alt="This seller accepts PayPal" title="This seller accepts PayPal" border="0" height="16" width="16">&nbsp;</td><td class="ebcBid"><img title="Buy It Now" alt="Buy It Now" src="" class="binImg" align="middle" border="0"></td><td class="ebcPr"><span>$1.25</span><br></td><td class="ebcShpNew"><span class="shpTxt">$4.60&nbsp;</span></td><td class="ebcTim">Jan-06&nbsp;23:16</td>

Your sample perhaps can use some wiki-escaping.

And the same kind of processing can apply. We can fancy it up to ignore tags between TD and "intertext" (dummy tag) without significant re-work (using original HTML example).

{That might work until you see: "<td>Price:</td><td>&nbsp;&nbsp;<b>19.99</b></td>", at which point the spaces between the tags become 'intertext' (though perhaps not in HTML - at least in XML you can't generically dismiss spacing as insignificant). Patchwork solutions to pattern-matching problems usually give patchwork results, be they at the token level OR the raw text level. My own observation (after working with languages) is that, beyond the Chomsky Hierarchy, it is not a meaningfully 'easier' problem to match at the character level vs. the token or structure level. (I.e. any lex problem can be solved just as readily by yacc.)}

In fact, in table form that is easier to do than with regular expressions. Starting out with regular expressions creates a DiscontinuitySpike because reg-ex is either not sophisticated enough, or more work than using tables when dealing with tag-per-tag context. Reg-ex deals at the character levels, not atomized tags and attributes. That is because in tablized form, it is already a higher abstraction than what reg-ex would be looking at.

{You seem to be arguing that regular expressions are more work because they are more work in a rather circular manner. Are you ready to establish that pattern-matching problems over raw text are meaningfully more difficult than pattern-matching problems over tokenized text that just happens to be stored in a table?}

We get tag info, attribute info, and position info readily on hand. We can search, sort, and calculate on any of these with the ease of a query language (and basic app-language loops and IF's). Reg-ex makes a lack-luster query language. It is not powerful/concise enough on the token level. (Related: GrepVsDatabase)

{The conciseness of RegExp depends on how you choose to construct and represent them. They can be pretty darn concise if you use a constructive approach with sub-RegExps?, so you aren't, for example, constantly rewriting the 'extract-number' stuff.}

{That aside, what you desire to query when ProcessingMarkupLanguages is the data IN the file, not data ABOUT the file. This difference is fundamental. Unless your goal is to index or annotate the source, you really need to extract meaning from the file before you have anything worth querying. Arguing that a reg-ex is a lack-luster query language is pointless before establishing the need for one. Creating an artificial need for queries by forcing the representation of the file into a table doesn't count - doing so is a fine example of GoldPlating.}

I still don't know why you are calling it GoldPlating. I don't know what you are comparing it to in your head. If you want a code volume challenge, I'm all for it. The scenario is that we start out using tag quantity offsets, but later decide to filter out stuff such as B and SPAN tags and thus need to change such code.

{You should start with XML intended to carry semantic information in the first place. HTML is a horrible medium for semantic information - you, quite literally, are probably better off simply eliminating ALL the HTML markup and processing it as though it were plain-text.}

{But if you wish a code-volume challenge, I'll put a grammar vs. your approach after you've provided your approach. I'll never even touch tag offsets - that's just a horrible idea to start with when dealing with any sort of extensible language.}

{I've explained before that I'm comparing it to a hierarchical, language-dependent datatype that creates a single value/object that is a direct reflection of the original XML structure... e.g. in Haskell:
  type XMLTagname   = String
  type XMLAttrName  = String
  type XMLAttrValue = String
  datatype XMLData = (Tag XMLTagname [(XMLAttrName,XMLAttrValue)] [XMLData])
                   | Intertext String

{With such a structure, all I've done is represented the XML file directly within the processing language. There is nothing special about it at all. If I want to, I could write a general function collapse this to your pair of tables with a single fold function... but I'd get nothing out of doing so. More usefully, I could write a generic function that can check generic XMLData items against generic XMLSchema. I can also write a single big function to extract semantics that are quite sensitive to such things as whether 'tagx' is found under a 'tagy' vs. a 'tagz'.}

{With decent language support, or a good library, I can write a parser that operates on this structure via pattern-matches (akin to parsers that operate on strings, but starting instead with a generic XML parse tree). But if your language only supports parsers over strings, that would be the way to go. Ultimately, one extracts semantics via pattern-matching with a grammar, and the application of some function. A RegExp extraction is sufficient for the simplistic eBay example you initially solved with the offset, but higher level grammars work as one adds complexity. Either way, the parse tables are superfluous.}

  getValue(tag1, attrib1, tag2, attrib2, distance, value, compareOp, options)

This would get the N'th tag2 after tag1 (or before if negative) where attrib1 contained the given value. Tags outside of those listed are not counted in the "distance" metric. The compare-op is the comparison used (equals, contains, starts-with, etc.) Perhaps regex's could be used instead of compare-op. Or, just pass an expression to be eval'd (HOF-like in scriptland). I'd chose tabling to implement such functions, of course.

{Your example isn't any more a "mark up need" than would be scraping pages for dates or other text. It, very simply, isn't a meaningful example of 'ProcessingMarkupLanguages'. The fact that the markup is there is pretty much completely coincidental to what you are doing.}

{I imagined you were competent enough to choose a real example of ProcessingMarkupLanguages, seeing as this page is your baby. That you failed to do so is hardly my fault. I had even offered suggestions for examples prior to you providing your eBay example: library database interchange - letting others know what has been purchased, lost, checked in, checked out; and configuration file for a GUI (windows and widgets and results of pressing certain buttons encoded into XML). Both are fairly common exemplars of two distinct uses of XML (configuration & data interchange). }

. . . .

{Once the semantics are extracted, then what one does with it can vary... if it semantically is data, one might put it into a database. If it semantically is code, one might execute it. If it semantically is display information, one might display it.}

(Note that I am assuming that the markup-to-table parsing code is already written. [snip...] Plus, dumping the output of an OOP-based parser to tables may not be that difficult.)

{No problem.}

Typical Tasks for Data Conversion

RE: "I find that data often naturally is either not tree-shaped to begin with, or can easily be "flattened" for the purpose at hand" -- top

{Data often is 'flat' in a sense, but when you open an XML document, and even after you parse out the tags, you aren't working with "the data" yet. You're working with the representation and transport medium of the data. And, for XML at least, that representation IS 'tree-shaped' to begin with. CSV would be better to avoid tree-shape to begin with. Anyhow, once you get it down to {Item:X,Price:Y} tuples, THEN you'll have the "flat" data you desire. And getting there will generally require BOTH pattern-matching (syntax) and processing (to imbue semantics).}

I think you missed my point, but it is not material enough yet to rephrase.

RE: "Trees are over-used and over-assumed in IT in my opinion." -- top

{I agree that trees for data (in the 'propositions held to be true' sense) aren't a very good thing for business applications because they force navigational access to useful information, and navigational-access code creates inertia that ties one to the structure or representation of the data.}

{But trees for values are a different matter. I cannot agree with your tendency to use two or three tables (a whole micro-database worth) to represent a single value simply because you think that anything more complex than flat strings, numbers, and dates is some sort of sin against the IT profession.}

Let's compare with a scenario, per above.

{I'm not sure what we'd be comparing. My opinion and your opinion?}


CategorySyntax, CategoryXml

View edit of January 28, 2008 or FindPage with title or text search