Relational Trees And Graphs Discussion Two

Continuation (part 2) of RelationalTreesAndGraphsDiscussion, because it's getting TooBigToEdit.

Source of Utilities/Libraries

Who do you mean to call the end-user of an RDBMS? The programmer, or the person using the program? I would say the latter, but I doubt programmers will often be entering trees via a CRUD screen. I suspect in your approach, the programmer (user of the RDBMS) will usually need to provide these utility operations. Perhaps you have a different perspective because you work mostly with CBA, whereas I primarily do systems software and and so see a different user as primary.

I view the tree operations/utils as being downloadable kits or libraries that programmers/DBA's share for the DB. Some would come with the standard "build" and some could be from other's custom libraries or some shared add-on repository.

So the RDBMS will embody different set of DML/DDL standards for each kit, possibly with some common standards if the kit programmers work closely together, and the application-side languages will also need a helper API kit for each RDBMS kit. I suppose this could work, though I feel it is bordering on TheMostComplexWhichCanBeMadeToWork. I would highly favor a slightly more advanced DML/DDL that doesn't require dedicated kits to allow extension for UDTs and UDFs; I believe it would better capture EssentialComplexity.

We could add common "structure types" to the built-in DML/DDL, such as graph and trees. This is fine by me as long as we can still do our regular relational operators on the nodes. If you can find a way to add custom data types without isolating the nodes from the usual and existing operators and indexes (node X can be simultaneously "in" Stack A and Tree B and Table C, for example), then please show how it's "done right". I see either encapsulation interfering with relational on a fundamental philosophical level, or else you have to make ugly or complex compromises. Prove me wrong. -t reply at: (page_anchor: don't see encapsulation)

[What standard reference on the RelationalModel claims that "encapsulation interfer[es] with relational on a fundamental philosophical level"? Would it not be more accurate to state that encapsulation interferes with Top's TableOrientedProgramming approach?]

If we add all relational operators to a "stack", is it still a "stack"? -t

[Please explain how your question answers my question.]

Appending Trees and Node Identification

Say I have two tree tables, treeTable1 and treeTable2. treeTable2 was produced by "table-izing" a particular EssExpression via a utility function, and has its own set of node_ids numbered from 1..N for N nodes. treeTable1 is a permanent tree table, carries about ten thousand trees, and has node_ids numbered from 1..~100k. Now I wish to update treeTable1 to contain the tree described in treeTable2. That is what I mean by "adding". The difficulty in doing so involves mucking around with node numbers.

Tree nodes need *some* way to be uniquely identified in all known digital technology. Traditional RAM-pointer-based trees use RAM addresses to be unique. Thus, if your version of tree A and tree B are in the memory of one computer, then it's just a matter of linking up the two trees to "add" them. Simple, I will agree. However, this technique/assumption is limiting. What if you want to merge trees stored on different machines or one tree comes from disk or from an XML file from another company? There is no guarantee the node addresses will be unique if these cases exist. Thus, you have to solve a similar problem unless you hard-wire your system to be single-machine-only. You present your solution, and then I'll present mine.

That was dishonest of you. The issue of different formats or distribution is not analogous. The relevant comparison regarding distribution and working with different formats would be, say, a union of tables on two different machines, or where one table comes from disk and the other is in a CSV file with non-standard headers.

In the above problem, treeTable1 and treeTable2 can be presumed to be in the same format, on the same machine. One of them was a temporary table created to carry a single tree, and the other was a permanent table carrying many trees. This is a situation that exists because your your own suggested strategies for working with complex values often involve the construction of a temporary table, and because the user is responsible for maintaining the tables. The analogous situation for an RDBMS with structured DomainValues is to insert into any table a tree value described in a query, which is performed with the normal insert statement. There is no 'tree table', no 'node ids' exposed to the user; if the RDBMS must move a value from volatile memory to persistent memory then that is done under the hood with no effort from the user. That is my solution. Now present yours.

I'll give one approach now: serialize the 2nd tree, and then load it in the same way an original EssExpression-like doodad would be using an "appendTree" operation. (The appendTree and makeTree operation could be the same thing since appending to an empty table is the same as making, but may not be optimized for each task.)

So you suggest I work around the RDBMS by serializing the tree from treeTable back out then loading it again with 'appendTree'. Nice. One would have thought you would favor a relational solution to these problems, especially with all your raving about the advantages of relational, especially after all your efforts to get these trees into relational form in the first place.

Let's back up a bit. There seems to be a big miscommunication going on here. Suppose you have a "big type database" and have your two trees, A and B, in virtual memory, not actual memory, that you are planning to merge soon. (Actual memory is too small.) Suppose you need to shut down the server to fix a noisy fan. Thus, you save your "types database", with the trees, to disk somehow. When the fan is fixed, you boot up the computer and your "types DB" is somehow loaded back into virtual memory, and you now plan to do the merge/link of tree's A and B.

First question: how are two tree's nodes kept separate (unique) while stored on disk during the fan repair? (Remember, the merge has not happened yet.)

Second, when both trees are loaded back into memory after fan repair, how are the two tree's nodes kept separate (unique)?

I suspect your answer will be: "We just have one big virtual address space. God-RAM. The two trees reside in different addresses; thus, there's never overlap." But we *could* emulate this technique in a RDBMS by having a row-ID that is unique across the database. No two rows would ever have the same address even if in *different* tables. Then it would not have the ID overlap issues you describe and our tree libraries could use very similar techniques to your RAM-centric approach. (Whether this has big down-sides or not has to be checked.) -t

You said: "I suspect your answer will be: 'We just have one big virtual address space. God-RAM.'" - That is essentially correct, but there is no reason to be so disparaging about the idea. One big virtual address space is how most RDBMS's and FileSystems are implemented under the hood. There is no reason for one that supports complex DomainValues to be any different. Memory-mapped IO, a log, application-driven paging, and a few bits in each pointer for GarbageCollection and type annotation can even make this very efficient, scalable, and AtomicConsistentIsolatedDurable.

You said: "we *could* emulate this technique in a RDBMS by having a row-ID that is unique across the database" - global auto-number would be a good idea, and it would resolve many of the problem described above, and would also simplify identifying relationships between tables just by looking at the data. I'd give it a "win" overall. It does imply the user (by which I mean the programmer) must 'reserve' node IDs when constructing new trees into tables for transport, but one can escape even that using RFC 4151 instead of numbers (at some cost to verbosity). Well... I'll give you that battle if you start pushing for global IDs. (But by no means the war. ;)

Woah! We agree on something. I need a deep-scrub shower. I've proposed this in MultiParadigmDatabase. (One slight down-side is that it may take up slightly more bytes.) -t

Serializing and Transfer

Re: "with user-defined types the language knows how to serialize all of the UDTs by their construction" - That sounds odd. Their creation technique/syntax shouldn't generally affect their output layout. Sounds like arbitrary binding of input and output. Please clarify. -t

UDTs in an RDBMS would necessarily be described as part of the DataManipulation+Data Definition language, similar to how SQL has ways to describe new tables. The physical layout in the RDBMS persistent storage would be entirely up to the RDBMS and its optimizer, but would still be determined based on the UDT definition in the DML. Translation and serialization of these structures in communications between systems would also be determined automatically from the UDT description in the DML, possibly with modifications by the communications protocol (e.g. to allow lazy transmission of large values in a query). When I say "the language knows", what I really mean is "all protocols and APIs based on the DML/DDL" can automate the serialization, persistence, etc. of these values without any additional support from the user.

How does this work when a given "data structure" shares nodes with a variety of different data structures/tables?

The same as usual. Under the hood, it would be shared pointers into persistent storage, with GarbageCollection after the last pointer to a value is removed. Since values are immutable (modulo lazy evaluation) there is never an issue with sharing them... e.g. the same million-character string may be shared by a hundred thousand different 'facts' and yet have only one physical copy. Any RDBMS that supports "strings" as a type can do that sort of sharing for strings, but an RDBMS that supports UDTs can do it for any UDT. For example, a tree structure could be an element in stacks, part of various other trees, used in many different facts and tables, etc. and yet have only one physical copy. (As a side note, it is a considerable advantage for sharing to have use reference-to-child rather than reference-to-parent: ref. to parent requires copies whenever the same subtree is part of two different parent trees.)

And what keeps a RDBMS from doing the same?

I AM talking about an RDBMS, you nut. What's up with that question? It sounds like an insinuation that a system can't be an RDBMS if it supports structured DomainValues. If you meant to ask: "what keeps my favored approach of destructuring DomainValues into tables from doing the same?" then the answer is that you've pushed the complexity of node maintenance (entry, layout in tables, use in queries, garbage collection, etc.) to the user of the DataManipulation language as opposed to such things being implied from the definition of a UDT. Since this stuff is in the hands of the user, the RDBMS lacks the necessary information and control to implement useful automation. related: (page_anchor: shoving it to the user)

If a given data set fits our convention of "tree", then serialization can also be automatic.

Indeed. And the effort to establish such conventions in a manner that is reliable enough for such automation is precisely equivalent to the effort required to standardize DML/DDL to support arbitrary UDTs.

However, we might as well use the table-centric facilities already in place since our transfer file will be simpler (less formats) if we use tables instead of a different custom layout for each data structure "type". For example, if the customer in a different company does not have the "type definition" by chance, they can still read the data as tables.

You can reasonably argue that your system has fewer primitives and thus a simpler implementation, but, before you argue that "we might as well use it," you would be wise to ensure said simplification isn't introducing AccidentalComplexity elsewhere. The need to introduce new information into a layout - node_ids in particular - is not insignificant AccidentalComplexity introduced to this file transfer. When comparing file transfers and serialization, this AccidentalComplexity must be weighed against the relative complexity of supporting a few extra UDT primitives (record type, union type, etc.) that would relieve the need for the AccidentalComplexity. In this particular case, I don't believe either of us can justify a claim that one side in this debate has an advantage over the other for serialization and file transfer.

Regarding lack of "type definition", I (very strongly) favor structural typing where the serialization format automatically implies the type definition (see NominativeAndStructuralTyping); at most I need to ship with each value a version ID for the overall data format. However, with nominative typing one would ship a URI for the official type definition along with the data. Serialization and file transfer are well understood problems. One can complicate them up a bit with 'cursors' and 'lazy transport' and 'compression' and 'encryption', but I do not believe either of our favored approaches offer advantages or weaknesses on those issues.

RE: Since this stuff is in the hands of the user, the RDBMS lacks the necessary information and control to implement useful automation.'' (page_anchor: shoving it to the user)

I am not "pushing it to the user". Why do you keep saying that?

I'm not certain what you think when you see the word "user", but I consider an RDBMS to be SystemsSoftware and so (as with all systems software) "user" refers to the programmers who interface with the RDBMS and write the CRUD screens and such, most emphatically not to the people who use the CRUD screens except insofar as the categories overlap (users of the CRUD screens have a thick layer of indirection between them and the RDBMS).

With my definition of "user" clarified, you are incorrect when you claim you are not pushing these details to the "user". This is fundamental; indeed, very intentionally, you are exposing (as opposed to encapsulating) the details of representation for trees. The users - programmers - are aware of such things as 'parent_id' and 'node_id', of the need to identify a tree by both a table and a node_id. The degree to which you "push" representation details to the user is also the degree to which you take that power away from the RDBMS and its optimizer, garbage collector, etc.

Is (as I suspect) the confusion here about different understandings of the word "user" in SystemsSoftware? Or do you honestly believe you are not "pushing it to the user" even given clarification of my use of the word?

RE: If you can find a way to add custom data types without isolating the nodes from the usual and existing operators and indexes (node X can be simultaneously "in" Stack A and Tree B and Table C, for example), then please show how it's "done right". -t

Sigh. This is about the tenth time I've answered this question. For structured values, sharing is the default. The same 'string' can be (simultaneously) in Stack A and Tree B and Table C without difficulty. See CopyOnWrite, ValueObjectsShouldBeImmutable, KillMutableState. With immutable value structures, there is no writing so there also is no need to copy. Beyond persistent structure, one is also free to "internalize" value structures such that there is exactly one copy of any given value and so the same values aren't indexed multiple times. THESE ARE SOLVED PROBLEMS AND WELL KNOWN EVEN BY PRACTITIONERS/PROGRAMMERS OUTSIDE OF ACADEMIA.

And indexes over values are no more difficult than is lexical indexing for strings (which is common in SQL). One keeps an index of substructure to the full value (e.g. word to string), and from that full value to table regions, thus allowing one to ask for all entries in a table containing strings containing particular words, or all rows containing trees containing 'red' nodes. THESE ARE SOLVED PROBLEMS, ALBEIT THOSE WHO KNOW THEM TEND TO READ DBMS / INFORMATION PROCESSING BOOKS.

RE: I see encapsulation as interfering with relational on a fundamental philosophical level. You have to make ugly or complex compromises. Prove me wrong. -t (page_anchor: don't see encapsulation)

PROOF ONE: I suspect you'd agree that "it doesn't matter how the RDBMS represents strings or NULLs under the hood". Do you disagree? Are you going to argue that programmers should have exact control and knowledge of the bit-layout in the RDBMS? If you don't disagree, that is a killing blow (as sufficient and complete proof) against your assertion "I see encapsulation as interfering with relational".

PROOF TWO: RelationalModel is defined as relating DomainValues held in attributes. A 'value' is different from the 'representation' of a value... e.g. 4, four, 100 base2 can all 'represent' the same DomainValue. Because of this, an RDBMS remains compatible with the RelationalModel even if there is a layer of indirection that encapsulates the representation of DomainValues. Thus, if you don't believe that representation can't be encapsulated without interfering with relational, you should.

PROOF THREE: (Concluding) There is no encapsulation of DomainValues. (Proof) DomainValues are values. Values are immutable are defined (mathematically) in terms of operators that, when applied, return other values in a formal and definite manner. Since values are immutable, you are free to apply the same operators to the original value in any permutation you wish, and thus explore all possible properties of the value that can be reached by use of operators. Since values are defined mathematically in terms of operators returning other values, the set of properties you can reach by use of operators is precisely identical the the definition of the value. If the whole definition of something (call it X) is accessible, it is unreasonable to say that X is encapsulated. The whole definition of any DomainValue is accessible, therefore it is unreasonable to say these values are encapsulated.

From PROOF THREE we can also conclude that your complaint about encapsulation is a StrawMan. There never was any encapsulation of DomainValues to complain about, and the fact that a DomainValue is "structured" makes no difference on that fact. At most, one encapsulates representation, which is not a problem for reasons described in PROOF ONE and PROOF TWO.

What you can say, TopMind, is that DomainValues (of all sorts) are "opaque to relational operators". However, that is very different from encapsulation, has very different engineering properties. And anticipating your next complaint brings me to:

PROOF FOUR: (Concluding) "Opaque" DomainValues do not interfere with relational. (Proof) RelationalModel is defined by its creator (EfCodd) and maintainers in terms of DomainValues and 'relations' between them. DomainValues in the RelationalModel are never given greater requirement in the RelationalAlgebra and RelationalCalculus than an the ability to compare them for equality. This means that RelationalModel was defined in terms of "opaque" DomainValues. Thus, the RelationalModel would be inconsistent if opaque DomainValues interfered with relational. Assuming the RelationalModel is internally consistent, opaque DomainValues must not interfere with relational.

I used "interfere" in an informal way, and it perhaps was not the best way to state it (details were given elsewhere). Thus, this proof is not useful to settle anything outstanding. You only "proved" your personal mis-interpretation of my phrase. -t

Yes, you used 'interfere' because you believe everything should be broken into relational, and while structure DomainValues don't in any way inhibit that (i.e. an RDBMS supporting structure domain values can still be used the exact way you want to use it), they also don't require it, and you want it to be required the same way StaticTyping fans want to require you to ensure type-safety. Between your opinions on typing and your use of strings, I find your attitude on this subject to be entirely hypocritical.

That still leaves PROOF THREE as settling something outstanding. It destroys one of the premises you are using for half your arguments. You repeatedly make assertions about how DomainValues are encapsulated or how 'if' they are encapsulated it causes problems (run a search for 'ADT'), but the fact is that they aren't encapsulated, as demonstrated in PROOF THREE.

I perhaps should reword my complaint. It was written off-the-cuff. But opaqueness has down-sides, as the string example illustrates on a smaller scale. We'd get more powerful collections and reuse if strings could use relational operators and perhaps vise versa. [By which measure of "powerful"?] We cannot share operations and values across "collection types" as easily with hard types. I see no evidence that the upsides make up for this (with coded examples, not anecdotes and "trust me" HandWaving).

In short, it's better factoring and reuse if these are shared across "collection types" (where appropriate):

All else being equal, do you agree this is generally a good thing?

Well, the "where appropriate" makes it difficult to say no, but also makes any agreement non-substantive. I believe this would make a yes/no answer misleading, so I'll refrain from producing one.

More relevantly, I still have no reason to believe your assertions that, say, sharing GenericProgramming operators 'easier' with relations than with other types. There is plenty of reason to believe value sharing of data values is achieved far more thoroughly with 'hard types' since one never encounters the 'parent_id' duplication of trees (two subtrees 100% identical except different parent_ids), and one is even free to 'intern' values such that there is exactly one copy of any given structure in the entire database (e.g. via use of a hashtable). Indexing is the only possible sticking point, and on that one I'll make few comments:

Indexing - and automatically optimizing queries to take advantage of indexes - still represents a grand technical challenge. It isn't something that becomes easier with structured DomainValues.

I will agree with this much, TopMind: your approach of destructuring DomainValues like trees and graphs is better at indexing structured values other than strings today, when we don't have any RDBMSs that support structured values much less indexing of them (other than strings). But strings themselves serve as a counterpoint to your argument, and you have provided NO good reason to believe your approach is inherently better for indexing or sharing indexes. Given these points and those above, I am forced to conclude your assertions about the alleged superiority of indexing in your approach have been premature, naive, and probably involved more WishfulThinking than technical analysis.

Well, I give more weight to empirical testing than you do. You are falling into the "Greek Fallacy" of thinking that you can solve such intricate problems merely by thinking about it on paper hard enough. -t

If you honestly think you've been giving weight to empirical testing, then show me the empirical testing of one system against the other that has demonstrates the relative sharing and indexing properties of each approach that you have been claiming. If you cannot provide such empirical testing, then it is ridiculous for you to believe you are giving 'weight' to empirical testing in making those conclusions. In the absence of numbers, I at least rely on objectively justifiable statements. You don't even do that much. It seems you are arrogant enough to believe you can provide advice like "nested values shouldn't be in relational cells" without empirical observation AND without analyzing it based on known technical properties. Perhaps you should start a religion for IT practitioners... you have a book already, don't you? I often see you make a comment containing "in my book".

The revolution really started in the Renaissance when people were willing to get their hands dirty to test their thoughts. And there is also the learning curve: IT people know RDBMS. They will be easier to absorb if we add some extensions to them rather than almost start from scratch. I'll take evolution over revolution until the "type base" has been road-tested. I hope your experiments go well, but we need something in the meantime. Relational: this generation; type-base: next generation. -t

I'll repeat it again: I've not suggested being rid of RDBMS! You are pissing me off with your attitude. Seriously, your persistent misrepresentation of 'RDBMS' as though it is mutually exclusive of structured DomainValues belongs in ObjectiveEvidenceAgainstTopDiscussion. You should be ashamed of yourself. If you meant: "IT people know SQL", perhaps I'd have agreed with your statement. But I doubt you meant that. I think you're once again preaching that ridiculous idea that structured DomainValues somehow violate something in the definition of the RelationalModel.

Relational with inflexible SimplySimplistic types that force lots of workarounds: this generation; Relational with UDTs capable of supporting arbitrary domains: next generation.

Not "ridding", but demoting. You are misrepresenting me. -t

You represented yourself quite well when you used the words 'Relational' vs. 'type-base' to describe the two different approaches, and I have been quite clear in the role of Relational even with structured DomainValues. Saying I've been "demoting" relational is entirely dishonest (or at least incorrect, if you don't know the definition of "demote"). If you wish to call our approaches the "deep type" vs. "shallow type", or "structure type" RDBMS vs. "destructure type" RDBMS, do so.

You are not integrating [relational and types] well. They are kind of like neighbors who talk to each other every blue moon. Nor are you factoring CollectionOrientedVerbs (and features) well across types. -t

Whereas you look at my approach and see "neighbors who talk to each other every blue moon", I see "neighbors that each get their jobs done without need to interact or interfere with one another except at predictable places and times". Correspondingly, I'd probably describe your approach as "kind of like one person with two homes, two jobs, and a schizophrenic identity crisis".

To me, "integrated well" means PrimitivesAndMeansOfComposition and SymmetryOfLanguage where each 'primitive' does one thing and does it well, primitives don't overlap in purpose or capability, and primitives may be composed and interleaved efficiently and safely without "gotchas" or arbitrary obstructions. I consider defining relational operators for types other than relations to be a BadThing, since it means those other types overlap in purpose and capability with relations and therefore introduces LanguageIdiomClutter and makes the optimization task more difficult.

I suspect that what you consider "integrated well" would be what I'd call an amorphous, unstructured, inconsistent BigBallOfMud where where primitives overlap in purpose and capability (such as arbitrary non-relation DomainValues responding to relational operators), does them poorly (unclear as to which purpose a primitive is applied, cannot be optimized), where reuse is made a chore by arbitrary restrictions and gotchas about what can be composed where, and where simultaneously achieving safety and efficiency is a new and unique challenge for each project.

Why would I want that idea of "integrated well"?

Because your magic dream Earth-verifying type compiler database does not exist yet.

Let me repeat the question: Why would I want that idea of "integrated well"?

As far as "user", can I introduce these classifications of participant, which are more specific?:

As far as knowing node ID's, typical tools that communicate with RDBMS have to use a representation of a unique "node" identifier. Otherwise, how is an app developer going to hook say a tree GUI widget to the DB? (Think of Windows Explorer's interface) It is an EssentialComplexity of interfacing with DBs. What is your GUI widget-friendly alternative other than IDs? Paths? Possible, but not always friendly, and there's not always a way to make a unique path, depending on a given tree's rules. We can with files, because of the way file systems are defined, but this is not a guaranteed property of trees in general.

URIs and tags (RFC 4151) are quite friendly. They are verbose, I'll agree, seeing as they represent something global to the whole universe... but their being global (not tied to any particular DBMS) makes them that much more friendly, plus suitable for use in distributed systems, database integration, and transport between databases and applications without synchronization or extra message passing. I would like to take a position and assert that node IDs, especially auto-numbers, should be excised from RDBMS utilities, and replaced with use of URIs, RFC 4151, and equivalent facilities.

ObjectIdentity (including node_id) is something that should be avoided where unnecessary, since it is ultimately an artificial construct imposed by human institutions rather than nature or measurement, and is unnecessary to RelationalModel. But, where ObjectIdentity is useful, it should be done right - done just as globally as the data in the database. Auto numbering is a hack... one that is error-prone and causes much AccidentalComplexity. RFC4151 is designed for this purpose.

Also, use of paths is unnecessary complexity; tags can generally be opaque to the DBMS. At least when acting in their role as unique identifiers (as opposed to addressing and protocol), properties between objects (paths, containment, etc.) should generally be expressed as relations rather than buried inside the string.

Anyhow, I don't object to further classifying users, but I don't think it will break my habit of referring to developers as 'users' of SystemsSoftware.

(page anchor: node sharing example)

 A seven node tree:                 
               "root"                   .
               /     \                  .         
          "node"     "node"             .
          /  \        /   \             .
     "leaf"  "leaf" "leaf" "leaf"       .

in both cases the '0' id is a null.

child_id schema parent_id schema key left right value key parent seq value -------------------- -------------------- 1 0 0 "leaf" 1 0 0 "root" 2 1 1 "node" 2 1 0 "node" 3 2 2 "root" 3 1 1 "node" 4 2 0 "leaf" 5 2 1 "leaf" 6 3 0 "leaf" 7 3 1 "leaf"

Now you tell me which has better "data value sharing" and "node sharing".

Nice example. If we are only dealing with binary trees, the first approach seems simpler in general. However, if we want to scale the branch count up, it may be problematic.

RE: the first approach seems simpler in general. However, if we want to scale the branch count up, it may be problematic.

Scaling may also be performed in the child_id approach without difficulty. The underlying type requires a simple upgrade:

 type BTOS of tnode(left:BTOS right:BTOS value:String) | null
 type MTOS of tnode(children:{List MTOS} value:String) 
 //objection about change moved to (page_anchor: waitaminute) to reduce clutter

Scaling up the same tree:
          | x3
          | x4

Comparing the schema (using the ordered approach):
  child_id schema:
  TABLE node_table       
    children (key to children_table)
    value (string)
  TABLE children_table
    value (key to node_table)
    next  (key to children_table)  
  parent_id schema:
  TABLE node_table
    seq    (order)
    parent (key to node_table)
    value  (string) 

Resulting datasets. '0' id is used for null, but otherwise all IDs unique in each schema to avoid confusion.

 .child_id schema  . . . . . . . . .  . . parent_id schema
 .----------------------------------- . . ---------------------
 .node_table . . . . . children_table . . node_table  . . . . .
 .key children value . key value next . . key seq parent value
 . 1 .  0  .  "leaf" .  4 .  1  .  0  . .  1 . 0 .  0 . "root"
 . 2 .  7  .  "node" .  5 .  1  .  4  . .  2 . 0 .  1 . "node"
 . 3 . 10  .  "root" .  6 .  1  .  5  . .  3 . 1 .  1 . "node"
 . . . . . . . . . . .  7 .  1  .  6  . .  4 . 2 .  1 . "node"
 . . . . . . . . . . .  8 .  2  .  0  . .  5 . 0 .  2 . "leaf"
 . . . . . . . . . . .  9 .  2  .  8  . .  6 . 1 .  2 . "leaf"
 . . . . . . . . . . . 10 .  2  .  9  . .  7 . 2 .  2 . "leaf"
 . . . . . . . . . . .  . . . . . . . . .  8 . 3 .  2 . "leaf"
 . . . . . . . . . . .  . . . . . . . . .  9 . 0 .  3 . "leaf"
 . . . . . . . . . . .  . . . . . . . . . 10 . 1 .  3 . "leaf"
 . . . . . . . . . . .  . . . . . . . . . 11 . 2 .  3 . "leaf"
 . . . . . . . . . . .  . . . . . . . . . 12 . 3 .  3 . "leaf"
 . . . . . . . . . . .  . . . . . . . . . 13 . 0 .  4 . "leaf"
 . . . . . . . . . . .  . . . . . . . . . 14 . 1 .  4 . "leaf"
 . . . . . . . . . . .  . . . . . . . . . 15 . 2 .  4 . "leaf"
 . . . . . . . . . . .  . . . . . . . . . 16 . 3 .  4 . "leaf"

The scores (lower is better):
 . . . . . . . . . SCHEMA . . . 
 . . . . . . child_id . parent_id
 node count: . . . 10 . . . . 16
 strings:  . . . .  3 . . . . 16
 cells:  . . . . . 30 . . . . 64

The sort of sharing seen on the child_id schema is algorithmically achieved by "interning" of values. I.e. use a table that indexes values for some reason (any reason), index the new values (from bottom up) as they are added, and reuse old values wherever they exist. This allows one to share the indexing that already exists wih the old value, and (if done rigorously) also makes equality comparisons very cheap. E.g. since the child_id schema is fully interned, I can compare any two values for equality simply by comparing their reference IDs.

It's worth noting that the degree of sharing, above, would increase exponentially in favor of the child_id schema for each 'layer' of depth added to the tree. Adding another, adding 2 'flowers' per leaf, would require adding 3 nodes and 1 string to the child_id approach, and would require adding 32 nodes in the parent_id approach. The parent_id approach is, therefore, exponentially worse for sharing than the child_id approach. This is even true in common cases, not just the general case. Rather than "adding at the bottom", consider the far more common problem of "wrapping at the top". Suppose you want to create a new tree of this form:
      "new_root"     .
      /        \     .
    "root"   "root"  . (original "root" tree)
      |x3      |x3   .
     ...      ...    .

In the child_id approach one would do something like (in pseudo-SQL):
  freshKeys NewRootID L1 L2;
  INSERT INTO node_table(key children value) [(NewRootID L1 "new_root")];
  INSERT INTO children_table(key value next) [(L1 3 L2) (L2 3 0)];

That is, one adds 3 nodes and 1 string, and voila! Done! This child_id schema seems pretty efficient and 'lean'.

For the 'parent_id' schema, however, one needs to duplicate the entire "root" tree, and do so twice if the original "root" tree is to be preserved because it is still being used. One then adds one new node and tweaks the parent_id of the top two nodes to perform the update:
  freshKeys NewRootID;
  rootCpyA = helperDuplicatesTree(tree_table_in:node_table tree_table_out:node_table key:1);
  rootCpyB = helperDuplicatesTree(tree_table_in:node_table tree_table_out:node_table key:1);
  INSERT INTO node_table(key seq parent value)[(NewRootID 0 0 "new_root")];
  UPDATE node_table SET parent_id=NewRootID,seq=0 WHERE key=rootCpyA;
  UPDATE node_table SET parent_id=NewRootID,seq=1 WHERE key=rootCpyB;

Thus, to add this "new_root" tree to the node_table in parent_id schema requires adding 33 nodes, 33 strings (which I'm not showing for obvious reasons). Even if done with a destructive update, it requires adding 17 nodes and 17 strings and makes the original tree unusable. In addition, the greater verbosity increases code-bloat and the need for helper functions, and generally further reduces performance.

Scores after adding the "new_root" non-destructively (lower is better):
 . . . . . . . . . SCHEMA . . . 
 . . . . . . child_id . parent_id
 node count: . . . 13 . . . .  49
 strings:  . . . .  4 . . . .  49
 cells:  . . . . . 39 . . . . 196

Despite the fact that it's often rare to have that much sharing within a single DomainValue, the issue of one value wrapping or carrying or sharing some structure with another is very common in domain maths for both intermediate representations and final results. The parent_id schema will almost invariably involve an abundance of value copies, which is wasteful of resources and does a remarkably poor job at sharing or allowing for efficient indexing or comparisons. If you've wondered why I've been goggling and calling "ridiculous" your claims of sharing of data values and indexing, this is why. (What I've wondered is how anybody in the information processing business who has held position as a DBA doesn't already know all this and find it obvious... but perhaps this knowledge is more specialized than I had assumed.)

So... How does the above relate to DomainValues you ask? (rhetorically)

Structured DomainValues and UDTs are typically implemented based on the child_id schema, where the notion of 'child_id' is replaced with 'pointer', and access to any underlying table is encapsulated. Recall, encapsulation of representation for DomainValues is by no means 'against' the RelationalModel (see PROOF ONE, PROOF TWO above).

By keeping it under-the-hood, the RDBMS would be able to perform the algorithmic "interning" of values automatically (rather than making each application or 'kit' developer reinvent it), thus achieving maximal levels of sharing of node values and sharing of indexes. Additionally, this encapsulation of representation can (a) prevent people from accidentally mutating a shared value, (b) support automatic GarbageCollection, (c) allow the serialization of structured values to be standardized and optimized separately from its persistence layout (so each RDBMS can go its own way when deciding how to represent these structures), (d) allow one to operate on DomainValues as whole units (e.g. for an equality test) which helps composability and GenericProgramming and eliminates need for representation management helpers, (e) allows lazy evaluation where functions are only executed insofar as they are needed. None of these points are small, insignificant, trivial, or irrelevant. Don't dismiss them simply because I'm not focusing on them here. (page anchor: favoring encapsulation of representation)

People obtaining a ComputerScience education often have an option to take courses dedicated to information processing and data management techniques, and there is no shortage of them (there's a reason that databases are a multi-billion dollar per year industry), so I won't pretend to know all the indexing methods available today, but I can point out that indexing of 'strings' under MySql follows the 'structured DomainValue' approach. That is, programmers don't need to care that, under-the-hood, their variable sized strings are represented in dedicated tables and have a bunch of dedicated indexes. All they need to do is issue their "likeness" queries or whatever, and let the query optimizer take care of leveraging the indexes. Tables exist, under-the-hood, to support representation, indexing, interning, memoizations, and so on. Due to encapsulation and full awareness of the purpose of these tables, the RDBMS can maintain these tables automatically and optimally with only moderate degrees of intelligence on part of the RDBMS. One could, with some tweaking to existing RDBMS, probably achieve similar automatic maintenance of 'views' for indexing in the destructured value approach, but the improved sharing of values means the RDBMS can make more optimal shared use of its indexes.

Indexing of the sort favored by MySql, without explicit access to tables representing values, is what I favor for DomainValues.

Of course, in the absence of a "sufficiently smart" RDBMS, I'd need to tell the RDBMS what to index. A potential (and simple) way of telling the RDBMS what to index would be to hand it a function that, for any given value, returns a set of values that should be indexed to that value, from that value, or both. If I don't have a "sufficiently smart" query optimizer, I could give that function a name and simply use said name as part of any query that requires the index. For example: hasWord = (function body that returns set of words for input string), full index UDT on hasWord, later in query ... where hasWord(A,W) and hasWord(B,W). I mentioned this approach above, but I suspect at least one member of my audience of two lacked the context to understand it at the time.

And, in anticipation of a complaint: but, but, but... What about mutation? Well, the "goal" here would be to share structure across a mutation. There are a few points here:

First is that we're dealing with DomainValues and we want there to be as much sharing before the mutation as possible. Mutating a shared string directly in an exposed table would essentially mutate the string for all users sharing that value (i.e. changing all references to "red" to be "blue" in the entire database), which is probably not what you want. So, unless the approach already sucks woefully at sharing, you really don't want to update values that way without producing at least a partial copy.

Second is that most implementations of domain maths (both ObjectOriented and FunctionalProgramming) do a magnificent job of sharing structure across operations that allow for it. E.g. whether creating a new binary tree from two smaller ones, or grabbing the left side of a binary tree value. The problem is a bit more difficult for linear structures (sets and relations, arrays and tables, lists, strings, binary large objects), but even those can be shared efficiently if they are represented as ropes under the hood ('tree' representation of linear structure, see 'Rope (computer science)' on Wikipedia). Even fragments of hashtables can be shared using bucket hashing (which is the basis for efficient distributed hashtables). Due to this sharing of structure, there is a lot of node, data value, and index sharing when working with values that are produced from other values. Interning of values (described above) will take care of the rest, allowing sharing to be maximal.

The parent_id schema, by comparison, requires creating copying both smaller trees to produce a large one.

And creating a DomainValue equal to the left side of an existing binary tree DomainValue requires either destructively mutating the left-side's parent_id to 0 (thus affecting all other uses of the tree identified by said parent_id), or copying the whole left-hand tree. There is almost no structure sharing unless one performs it via destructive update. And if the goal is a destructive update that affects the entire database, even then the child_id approach is not at a loss (one simply destructively updates one of the child_ids), so there isn't even an advantage to parent_id on that issue.

One does lose the ability to explicitly destructively update DomainValues (on purpose OR on accident) if their representation is encapsulated into trees and UDTs, but even that is no great loss. With the structured DomainValue approach one may always add layers of indirection to achieve the same effect (e.g. adding a language translation table, allowing one to logically change all references from "purple" to "violet"). With the alternative approach, there is no equivalent option to strip indirection away. The expressive power and performance advantage here still favors structured DomainValues.

Anyhow, I'll conclude this by saying: The use of structured DomainValues has been theoretically favored for performance, flexibility, composability and modularity, scalability, sharing, convenience (implicit support) for entry and garbage management, indexing, safety, security for data access and secrecy preservation, ability to add versioning and bitemporal databases, and EVERY metric I have EVER tested them against except one. That ONE metric they fail on is simplicity of implementation. Admittedly, I'm only concerned about my own little corner of "micro-rigor". TopMind's approach might be better for obfuscation, job security, enhancing the industry for HDD manufacturers, improving demand for powerful CPUs, and a number of other 'macro-rigor' metrics that I simply don't bother looking at.

RE: The underlying type requires a simple upgrade ... (page_anchor: waitaminute)

Hold on, Tex. First, you had to change the underlying definition. I didn't have to change squat. -t

All the fact that you "didn't have to change squat" to move from binary trees to multi-node trees really means is that people were able to violate the DomainValue type, which was "binary tree". To call this an advantage on your side is, at the very least, extremely dubious.

Second, you didn't address what to do with any existing data (an existing binary tree that now wants more branches). -t

I would suggest converting it. This should do, and shouldn't be difficult to implement any number of ways:
 insert into node_table(key children value)[(0 0 "")]; // preserve left-zeroes
 For each (Key L R Val) in original_tree_table(key left right value):
  freshKeys NL NR;
  insert into children_table(key value next)[(NL L NR) (NR R 0)];
  insert into node_table(key children value)[(Key NL Val)]
 End Loop

You still had to do a dump/reload step.

So? If you wanted to extend a binary tree with a new branch, you'd need to copy the tree too. See the note on "but, but, but... What about mutation?" above. The parent_id approach is NOT in a better position. We're dealing with DomainValues, not data. If data requires a new DomainValue, we create a new one. Also, if we know that our domain will be dealing with binary trees and only binary trees for a long time, then a binary-tree kit for tables could be built/found for such. It's one of those optimizing for current requirements versus optimizing for future flexibility decisions. -t

Indeed, your approach allows any number of add-ons that overlap in purpose and reimplement bunches of different stuff.

Third, that's not a minor definition change. You went from two scalars to a nested structure. -t

No I didn't. type BTOS of tnode(left:BTOS right:BTOS value:String)|null is a type with two nested structures (a BTOS on the left and a BTOS on the right), and type MTOS of tnode(children:{List MTOS} value:String) is a type with one nested structure (a list of MTOS). I'll chalk the idea that those were scalars to a misreading of the type definition language, which I'll readily forgive because it's sort of a hybrid type definition language between OCaml and Haskell and Oz.

My mistake, they are not "scalars". But it's still a significant change, structure-wise. -t

Is that supposed to be a complaint? From definition, it is pointless to make insignificant changes. More importantly, it's a simple and obvious change that any competent and interested person would discover without difficulty. "Oh, I want to scale this binary tree up to an M-ary tree! Hmmm... Oh, duh! I'll just turn this binary structure into an M-ary structure!" Simple but significant is the best sort of change to make.

Unfortunately, we have no production-level data warehouse implementations of RDBMS's supporting structured data against which to compare empirical data between the use of structured DomainValues and others. That leaves us with no empirical evidence supporting either side, outside of the experiments on paper.

That's not entirely true. We have Oracle's tree and graph extensions to tables as an example of hard-wired trees and graphs "tacked on" to relational systems. It's not user-defined types, though. But it's half way to the show. -t

Oracle's tree and graph extensions would support your half, yes. But it still doesn't allow comparisons between the approaches.

Yes it does:

That's a comparison for at least one metric. -t

Whenever I think you couldn't possibly be more daft and less reasonable, you surprise me. That is not a metric supporting comparisons between the approaches.

You asked for "empirical evidence supporting either side", did you not? Implementing a (partial) solution at least proves that it is implementable (to at least some degree) and used in practice. Where's the problem? -t

I asked for "empirical data between the use of". I don't believe implementability is at issue, here. If it's only implementability, we already have approximate implementations of Relational with structure values in Haskell and Prolog, we already have implementations for serializing structure values with YamlAintMarkupLanguage and pickling of values in a wide variety of languages, etc.

But their use as a database has not been tested for large data-sets, heavy usage patterns, and multi-app sharing of data. One can use JavaScript even as a small personal "database", but that doesn't tell us much.

[Support for large data-sets, heavy usage patterns, and multi-app sharing of data are characteristics of DBMSes, mainly deriving from the architecture of their underlying storage mechanisms (large data-sets & heavy usage patterns) and/or the ubiquity of SQL and standardised access mechanisms like ODBC or JDBC (multi-app sharing of data). These characteristics have little to do with use (or not) of the RelationalModel. It is quite reasonable to create non-relational DBMSes that exhibit all of these characteristics, or create implementations of the RelationalModel that exhibit none of them. Haskell and Prolog are notable because they demonstrate production languages that support structure values. Large data-sets et al. are irrelevant. Using JavaScript as a small personal "database", for example, can be conceptually illustrative independent of its ability (or not) to handle large volumes of data or transactions.]

Implementability is covered - one can even use JavaScript as you say. But it would be a pretty damn poor scientific control of variables to compare (for performance properties, usage characteristics, and such) a JavaScript implementation of a structure-value RDBMS to an SQL system written by a multi-billion dollar company who ships it as their primary product.
PageAnchor: list3

(a) prevent people from accidentally mutating a shared value

We can use the DB security system to keep out whatever needs to be kept out.

Please justify. How will the DB security system carefully prevent mutations while still allowing the necessary mutations for the 'kits' developers use?

Encapsulation is a tree-shaped (nested) security system. A DB security offers set-based security, which is more powerful than tree-shaped.

More powerful by which measure? And what, precisely, is "set-based security"? Can you point me to a set-based SecurityModel and a justification that it is more powerful in general than encapsulation? Encapsulation is pretty powerful; it's the basis of ObjectCapabilityModel. To be honest, it is my impression that what you said here is BrochureTalk and has no technical merit whatsoever, but I'm willing to hear your answers.

(b) support automatic GarbageCollection

You keep talking about this, but never show how it's relevant to a practical problem (let alone define it in a DB context), at least in my domain. If it makes your domain easier at the expense of my domain, then fooey on it.

I'll dumb it down as far as I am able, but as I do so I seriously question your professional competence.

You have described something you do, that you stubbornly call "normalization" despite objections from peers and academics, whereby you mean factoring a database to avoid 'large' duplicate values. For example, you'll perform something like the following transform:

 . . . . O R I G I N A L . . . . . | . . . . " N O R M A L I Z E D " . . . . . . .
 a_table . . . . . . . . . . . . . | a_table . . . . str_table . . . . . . . . . .
 attr1 attr2 . . . . . . . . . . . | attr1 attr2 . . key string  . . . . . . . . . 
 . 1 . "a very long string 1"  . . | . 1 . . 10  . . 10  . "a very long string 1"
 . 2 . "a really long string 2"  . | . 2 . . 11  . . 11  . "a really long string 2"
 . 3 . "a very long string 1"  . . | . 3 . . 10  . . . . . . . . . . . . . . . . .
 . 4 . "a really long string 2"  . | . 4 . . 11  . . . . . . . . . . . . . . . . .
 . 5 . "a very long string 1"  . . | . 5 . . 10  . . . . . . . . . . . . . . . . .
 . 6 . "a really long string 2"  . | . 6 . . 11  . . . . . . . . . . . . . . . . .
 . 7 . "a very long string 1"  . . | . 7 . . 10  . . . . . . . . . . . . . . . . .
 . 8 . "a really long string 2"  . | . 8 . . 11  . . . . . . . . . . . . . . . . .
 . 9 . "a very long string 1"  . . | . 9 . . 10  . . . . . . . . . . . . . . . . .

By doing so, you have reduced the apparent duplication of the strings, albeit many RDBMSs would still have the same count of strings in the physical database both before and after the transformation. A potential "benefit" is that you could now have the option of changing ALL references to a given string just slightly (e.g. you could UPDATE str_table SET string="a short string" WHERE key=10;).

On the negative side, your maintenance requirements have increased. First, if you wish to maintain the sharing, you need to write a bunch of helper functions to query the 'str_table' and decide between using an existing key and creating a new one. Second, you need to perform GarbageCollection:

Consider the case of deleting nodes 1, 3, 5, 7, 9 from a_table because the data they represent is no longer valid, possibly from five different transactions. Since all references to str_table where key=10 have been deleted, the value "a very long string 1" is now wasted space in the RDBMS. It isn't as though 'str_table' represents 'data', or that (key:10 string:"a very long string 1") is meaningful.

Unfortunately, the RDBMS doesn't know the difference between 'str_table' and a data table, so it cannot automate this GarbageCollection for you (but see note at page anchor: annotating for GC). Unless you want an unlimited amount of this cruft building up over time, you need to occasionally take out the garbage. Of course, it wouldn't be very efficient to GC after each delete, so you'd probably leave the task to a periodic cleanup, perhaps nightly or weekly depending on how fast the cruft builds. So you add something like follows to your nightly:
  DELETE FROM str_table WHERE key NOT IN (SELECT attr2 FROM a_table);

Alright, so everything is working, so far so good. But GC rapidly grows more complicated than this simplistic example. As a simple complication, imagine a programmer decides to add 'b_table' that relates pairs of really large strings which also uses str_table in order to get decent value sharing of indexes and the strings.
  TABLE b_table { 
    strA  (key into str_table)
    strB  (key into str_table)

Well, no problem... just don't forget to change the batch process! If you forget, you'll regret it later!
 gc_batchfile: (updated)
   DELETE FROM str_table 
    WHERE key NOT IN 
      ((SELECT attr2 FROM a_table) UNION 
       (SELECT strA FROM b_table)  UNION 
       (SELECT strB FROM b_table))

That was just a simple complication, but it still demonstrates a pitfall and some growing complications.

When we start dealing with structure DomainValues, following your approach of splitting these DomainValues off into separate tables, those will also need GarbageCollection. Essentially, what your 'destructured DomainValues' approach does is the following:

I'll use 'type MTOS of tnode(String {List MTOS})' for demonstration.

 . . . . O R I G I N A L . . . . . 
 msgid value
 . 1 . tnode("X" [tnode("Y" []) tnode("Z" [])])
 . 2 . tnode("A" [tnode("B" [tnode("C" []) tnode("C" [])]) tnode("Z" [])])
 . 3 . tnode("F" [tnode("F" [tnode("F" [])])])
 . 4 . tnode("X" [tnode("Y" []) tnode("Z" [])]) 
 . 5 . tnode("A" [tnode("B" [tnode("C" []) tnode("C" [])]) tnode("Z" [])])
 . 6 . tnode("F" [tnode("F" [tnode("F" [])])])
 . . . . " N O R M A L I Z E D " . . . .
 c_table . . . . tree_table  . . . . . .
 msgid value . . key parent seq value  .
 . 1 . . 7 . . .  7  . 0 . . 0 . "X" . .
 . 2 . . 8 . . .  8  . 0 . . 0 . "A" . .
 . 3 . . 9 . . .  9  . 0 . . 0 . "F" . .
 . 4 . . 7 . . . 10  . 7 . . 0 . "Y" . .
 . 5 . . 8 . . . 11  . 7 . . 1 . "Z" . .
 . 6 . . 9 . . . 12  . 8 . . 0 . "B" . .
 . . . . . . . . 13  . 8 . . 1 . "Z" . .
 . . . . . . . . 14  . 9 . . 0 . "F" . .
 . . . . . . . . 15  .12 . . 0 . "C" . .
 . . . . . . . . 16  .12 . . 1 . "C" . .
 . . . . . . . . 17  .14 . . 0 . "F" . .
 . . . . . . . . 18  .17 . . 0 . "F" . .

(Just to note: the 'child_id' schema could have been used here, too. Sharing would be improved but GC wouldn't be any easier.)

Anyhow, what would the gc_batchfile look for in this case? Well, a simple stab at it in a procedure:
    TEMP T = SELECT value AS key FROM c_table;
      T = SELECT tree_table.key FROM T,tree_table WHERE T.value=tree_table.parent;
    DELETE FROM tree_table WHERE key NOT IN T;

This recurses down a tree. It should work so long as c_table.value always points to a 'root' tree node (where parent=0). If not, the problem gets a lot more complicated to perform efficiently. (If you don't believe me, you try it.)

Anyhow, this is the sort of GarbageCollection to which I've been referring, and so far you're only looking at relatively simple examples... for example, you're not dealing with graphs. The 'child_id' schema I described for the 'MTOS' tree would involve bouncing back and forth between two tables to decide what is conserved (but wouldn't require a search for parents).

Anyhow, the problems with these batchfiles are multifold: (page anchor: annotating for GC) Perhaps you could somehow annotate 'str_table' and 'tree_table' as being GC'd, but that harks back to what I said in RelationalTreesAndGraphsDiscussion when I asserted that the requirement for GarbageCollection serves as a practical, distinguishing property between DomainValue and WhatIsData. That is, if you do annotate tables for GC, after doing so you'll know precisely which tables could have been represented via structured DomainValues. Each GC'd table could be a structure value-type. I'd even say "should".

I'm a bit curious as to whether TopMind simply lets cruft build up in the DBMSs he manages, since he seems so unaware of the need for garbage management. Or does he push the job to the application?

Generally data removal falls into one of 4 clean-up categories:

You have a serious problem, TopMind, in that you keep thinking of those tables you "normalize" to avoid value repetition as containing 'data' rather than 'shared values'. I suspect you're simply calling it "domain data" and inappropriately making cleanup the user's job.

It's your claim, so the evidence is your burden. Give some typical biz examples of RDBMS failing to do your def of GC. Or, are you just playing word games? Is it not the domain user's job to remove products from the catalog? The programmer and the computer cannot know when marketers decide to pull a product by reading their minds. If the catalog is not the kind of scenario where you see problems, then state one in which you DO see problems.

Deleting products from a catalog is not an issue. Deleting value resources (strings, images, geometric information, etc.) associated with and shared among entries in catalog (or, more realistically, among variations and versions of a catalog that are targeted for different localities, seasons, etc.), however, would be a problem. The problem scenario is when you end up with (key,value) pairs of any sort where 'key' is a surrogate that has no meaning in the real world.

It's called "cascading deletes". And one can set up ON_DELETE triggers for more fine-tuned/custom deletes. Setting this up right requires knowledge of the domain. A compiler cannot know the needs of the domain by itself. And please flesh our your locational/seasonal example.

You know, I sort of figured you'd bring up "cascading deletes", but I let it alone in part because I wondered if you were incompetent enough to bring it up. Cascading deletes work in the opposite direction. They don't help here; their use would be more like "I deleted this image, and now all the catalog entries that used that image are gone!" - that is, they delete entries containing a foreign key in response to their primary key being deleted. Also, ON_DELETE triggers suffer every one of the problems with the gc batchfiles described above, plus greatly improved inefficiency. Anyhow, I put no small effort into explaining GC to you, and all you've done is look for excuses to ignore it, so I'm not feeling particularly charitable at the moment. When I am convinced you have made an honest effort to understand what has already been written for your benefit, perhaps I will detail the example.

You've provided no evidence that cascading deletes and on-delete triggers are not doing their job IN PRACTICE. You again appear to be inventing problems for your "solutions".

I provided reason. You appealed to cascading deletes without even that much, and with a clear lack of understanding as to their utility. When I see such incompetence from you, it does not make your claims all that convincing.

Please clarify your catalog image example. Note that whether accumpanying images will be deleted or not may be a business decision. If they paid a graphics designer thousands of $ to create many of the images, they may decide to keep them around in the off-chance it will be needed again. Re-entry of the catalog item itself may cost $10, so they don't "protect" that side. A machine cannot tell people what they want to keep and toss. It's not smart enough to make case-by-case economic decisions like that.

That is hardly at issue here. If people want to keep the image, it would be sufficient to keep a reference to it. You are directing your arguments at a StrawMan.

That's one approach, but not necessarily the superior one. For one, the "addressing" could be obscure. If Milfard Pinkerwikkle keeps personal links to it, but nobody knows about Milfard Pinkerwikkle's links, then to all others it's as good as lost. If we instead keep them in the same table, then people remember the name of the table because that's where the images are kept, deleted or not. You are using speghetti thinking.

If you want to treat the images as distinct entities, that's available no matter what the approach, and still requires keeping a reference to each image. Anyhow, your entire line of argument is irrational. You're basically arguing that disconnected pieces of data should stay around after people intentionally deleted that data, and you're inventing scenarios where it 'might' be useful but only if people didn't intend to intentiaonally delete the data. The twisted, 'spaghetti thinking' here is your own. If your goal is a versioned database where one can recover past versions of the database information, there are far more direct and efficient and correct ways to go about it.

(c) allow the serialization of structured values to be standardized and optimized separately from its persistence layout (so each RDBMS can go its own way when deciding how to represent these structures)

Databases are more than just "persistence mechanisms".

I agree. Nonetheless, any production RDBMS contains a persistence mechanism, and will need to optimize it to support rapid access, indexing, queries, AtomicConsistentIsolatedDurable transactions, and so on. Are you going to pretend this is not a relevant issue?

And nothing prevents one from re-formatting DB info as needed. "Standardized"? Please clarify.

Your approach of representing "trees" as, for example, "tables that have 'key' and 'parent_id' attributes" (as you mentioned in RelationalTreeEqualityExample) is defining a representation - "standardizing" a "layout" - for both the representation of that tree in the database and its serialization. Use of structured values allows you to separate these concerns. Only the serialization needs to be standardized.

And integrating with those "kits" you mention would often "prevent one from re-formatting the DB info as needed". So would backwards compatibility with distributed applications.

(d) allow one to operate on DomainValues? as whole units (e.g. for an equality test) which helps composability and GenericProgramming and eliminates need for representation management helpers

Nothing prevents one from making equality comparitors for table-based structures.

That reply isn't relevant. The stated benefit is the ability to operate on DomainValues as whole units, and equality-tests are but one example. Your 'tableEquals(Tbl1 ID1 Tbl2 ID2)' approach described in RelationalTreeEqualityExample is not a solution that supports composability or GenericProgramming, and is a representation management helper.

Are these common needs, or another one of your lost "solutions" looking for a problem?

[Extremely common. These issues arise any time it is necessary to store and retrieve a complex value in a database. It probably occurs very rarely when implementing common business-oriented applications like CRM, ERP, e-Commerce and so forth (the domains where SQL DBMSes hold sway), but occurs in every domain where complex values are present and it is appropriate to use a DBMS. It is quite common to essentially re-invent DBMSes on a per-application basis rather than endure the difficulty of shoe-horning complex values into existing SQL products. If SQL DBMSes possessed powerful type systems like those found in Haskell et al., this would not be an issue.]

[I'm surprised, in fact, that the major DBMS vendors haven't already latched onto this as a way to expand their product sales into relatively un-tapped markets like scientific computing, games development, weather prediction, systems programming and so on. Defining powerful and DBMS-friendly type systems is an active research area, so I suspect it's only a matter of time before they appear in industrial systems.]

Perhaps because many such shops find a way to do it without a type-heavy approach such that the intersection between the domains you mentioned and those who like a type-heavy style is too small for Oracle.

[Sorry, I don't know what you mean by "type-heavy style". Do you mean TypefulProgramming, StaticTyping, ManifestTyping or something else? Anyway, needing to deal with complex values which can be tested for equality, etc., and treated as encapsulated units is hardly a "style"; in many domains it's a requirement. This has nothing to do with StaticTyping, DynamicTyping, ManifestTyping, or any other kind of TypeSystem implementation, other than the need to create and manipulate complex values with the same ease as strings or integers.]

Well, that's not a common need in my domain. If my domain has to sacrifice flexibility and sharability to get that, then it will not fly here. If you can demonstrate that one can have such without sacrifices and without fully tossing existing RDBMS knowledge, then please do. It still appears as if "hard" encapsulation has a big penalty. You are only trying to sell the benefits without addressing the downsides for my domain. I can live with soft encapsulation where nodes can fully participate in relational operators by default for any new "structure type". You can have all the type operators you want, as long as that does not preclude relational operators on those nodes. You can only tear away relational operators from my cold dead fingers. Keep your feature-crippled types out of my domain neighborhood. -t

It has already been established at (page anchor: node sharing example) that your approach is the one that sacrifices node and index sharing, and that there is no loss in flexibility. It has also been established in PROOF ONE .. PROOF FOUR that encapsulation of representation is okay, and that there is no encapsulation of structured DomainValue. You have demonstrated none of your claims that these "sacrifices" exist, or that there are "downsides" for your domain in supporting structured DomainValues. We're not even taking relational operators away from you. Your opinions are not the product of reasoning, TopMind. Your entire line of argument has been one piece of unreasoning faith, extreme ignorance, blind denial, and fallacy after another, and that's when you're not outright making shit up like your comment about the 'power' of set-based security at point (a), above. You should be ashamed of your behavior. But, attempting to reason with you is pointless. I'll continue on integrating functional, relational, and other systems. You'll continue being a ferrous cranus with a net negative contribution to every technical forum you visit.

[Top, if complex types aren't needed in your domain, don't use them. No one is advocating that we remove canonical types or alter the RelationalModel, so everything you do now can be done exactly the same way you do it now. I can't envision any circumstance where providing support for complex types would result in a sacrifice anywhere. Adding 'tree', 'graph' or whatever types is exactly the same, in concept, as adding 'date' and 'money' types to the canonical list of 'integer', 'string', and 'float' types. Indeed, a proper type system would mean that tree values, graph values, date values, money values, integer values, string values, and <insert user-defined-type> values could be manipulated with equal ease, with no compromise to flexibility and sharability. Why should a tree value be treated any differently, in concept and from the RelationalModel's point of view, from a string, date, or integer value?]

Like we already discussed, with strings, one cannot use the "regular" relational operators on them without conversion and copying. There is no "equal ease" for them. I don't want to do the same with graphs and stacks. The author of a graph or stack may not allow/add any relational operations or ability to join with tables out of spite, stupidity, or laziness. The short-sighted will add operators for immediate needs and only immediate needs. I know how developers/designers act on a general basis. -t

So... you're a hypocrite because you use strings, eh? You stop using strings for a few years, then come back and tell us whether you still believe FirstClass support for structure DomainValues is not useful to the business domain and, by extension, to other domains. Until then, I'm calling you on sheer, utter hypocrisy. Whatever appeals you'd make in favor of strings, performance or elsewhere, apply equally to other structure DomainValues.

[By the way, if you want to be a total relational purist, the only type you should have is 'boolean' -- everything else (including operators) can be represented as boolean column values, tuples and relations. A string or even an integer is conceptually no less a "complex" type than a tree; it differs only the structure of its internal representation and in the operators that make it useful to a user. Because of established convention, we are used to trivially sharing canonical types, but this is effectively a relatively recent achievement in the history of computing. There is nothing that precludes eventually sharing trees and other complex values by convention, and there are even existing means for doing so.]

No, because relational does not define the "domain math" (cell-level).

[It doesn't have to "define the 'domain math'" for this to be true. That operators can be defined as relations on boolean values is true, even if the contents and headings thereof are not defined by the model.]

I'm not necessarily a purist. I'll accept "violations" if there's a reasonable cause. I've already described the practical issues. For example, if ADT's encapsulation "locks" tree nodes from participating in relational operations unless the ADT builder bothers to "allow" them, that is a problem because they usually won't. Experience tells me that interface authors do as little as possible and only satisfy immediate requirements. Complex structures need "soft" walls to be useful in my domain. -t

TopMind has raised practical non-issues, encapsulation of ADTs being one example (countered in PROOF FOUR, above).

[On one hand, I can see a certain limited value in providing operators to map user-defined complex values to and from relations. However, you appear to be seeking a certain relational symmetry across all elements of the language, wherein any (and every) given value is simultaneously represented as both its own type (i.e., has its own type-specific operators, presumably) and a relation. In principle, I can appreciate what you're aiming for -- and it wouldn't necessarily require that the internal representation of values be relations -- though I'd argue that ExtendedSetTheory is a better choice for experimenting with universal, er, homologous-ness than the RelationalModel. As we know, trees, graphs and other non-tabular structures don't map particularly elegantly to relations. However, extended sets map quite well to a wide variety of structures whilst easily incorporating the RelationalModel should it be deemed appropriate. D L Childs (the inventor of ExtendedSetTheory) has been advocating this approach since the late 1960s, and it's notable that in DrCodd's initial paper on the RelationalModel, the first reference is to one of D L Childs's papers on ExtendedSetTheory. Unfortunately, I suspect Childs's rather obscure writing style -- and the fact that the RelationalModel is probably more intuitive than ExtendedSetTheory -- means the former has seen wider adoption than the latter. ExtendedSetTheory has, however, been successfully used as the basis for some powerful database systems.]

[On the other hand, if some pre-written UDT doesn't provide the operators you need, you can simply create your own UDT that does. That is a strong argument in favour of UDTs in general.]

As far as the EverythingIsa relation approach, one would do better with EverythingIsa database where each 'cell' contains a full record of named relations, and each relation is a set of records of cells. Booleans are represented by TableDee and TableDum, integer N is represented by a relation containing integers 0..N-1, and only records of relations, especially including the empty record and the empty relation, are primitive.
  key: [] is relation or set, () is record, {} is application
  for now using implicit (positional) labels.
  Value . . . SetTheory . . . . . . . . . . . . Relational/DB . . . . . . . . . . . Notes
  0 . . . . . []  . . . . . . . . . . . . . . . []  . . . . . . . . . . . . . . . . Empty (primitive) 
  1 . . . . . [[]]  . . . . . . . . . . . . . . [([])]  . . . . . . . . . . . . . . {0} Union 0
  . . . . also: [0] . . . . . . . . . . . . . . . [(0)] . . . . . . . . . . . . . . 
  2 . . . . . [[] [[]]] . . . . . . . . . . . . [([]) ([([])])] . . . . . . . . . . {1} Union 1
  . . . . also: [0 1] . . . . . . . . . . . . . . [(0) (1)] . . . . . . . . . . . . 
  3 . . . . . [[] [[]] [[] [[]]]] . . . . . . . [([]) ([([])]) ([([]) ([([])])])] . {2} Union 2  
  . . . . also: [0 1 2] . . . . . . . . . . . . . [(0) (1) (2)] . . . . . . . . . . 
  false . . . []  . . . . . . . . . . . . . . . []  . . . . . . . . . . . . . . . . TableDum, equal to zero
  true  . . . [[]]  . . . . . . . . . . . . . . [()]  . . . . . . . . . . . . . . . TableDee != 1 in Relational
  (a b) . . . [[a] [a b]] . . . . . . . . . . . (a b) . . . . . . . . . . . . . . .
  (a b c) . . [[[a] [a b]] [[[a] [a b]] [c]]] . (a b c) . . . . . . . . . . . . . . (a b c) is database in relational/db
  (x:a y:b) . [['x' a] ['y' b]](*)  . . . . . . (x:a y:b) . . . . . . . . . . . . . 'x' and 'y' special in SetTheory approach.

As syntactic sugar and under-the-hood implementation, one would expect that natural numbers and such are presented in their more traditional forms even if they provide relational or SetTheory semantics in accordance with the 'representations' of the value. For simple natural numbers, the each-cell-is-a-database representation is somewhat unwieldy compared to SetTheory, but their use makes more readily for structured extensions and for typing (by defining each attribute as a whole schema, complete with inter-relation constraints). For example, one can model rational numbers using (dividend:1 divisor:3), and graphs and trees can easily be represented without confusion regarding labels vs. values common to the pure SetTheory approaches.

One is still missing a critical feature for graphs and trees is a 'scope' for names. I.e. the following two values represent the same graph up to an isomorphism on vertex names (which are typically assigned in accidental order):
  (vertices:[0 1 2 3] edges:[(1 2) (1 3) (1 0)])
  (vertices:[0 1 2 3] edges:[(2 1) (2 3) (2 0)])
  // the above could be shorthanded to 'vertices:4'
Achieving automatic isomorphism here requires some primitive support for scoping of 'identifiers' (or surrogate keys / 'points') as distinct from integers and scoped to some miniature database. This requires a simple, but significant, extension to relational, such that each database-value occurs in a scoped identifier 'space', potentially containing points from the parent space or from its own space.
  {space S: 
     (vertices:[{S 0} {S 1} {S 2} {S 3})
      edges:[({S 1} {S 2}) ({S 1} {S 3}) ({S 1} {S 0})])}
Spaces are compared isomorphically, and essentially degenerate to plain-old-values in the case the space is unused in describing values. Given scoping, 'space S' is likely to appear inside a relation or set defined inside some other 'space R' inside some other 'space Q' and so on all the way up to the 'root' database (which would need to use identifiers in a global space). Use of spaces does require some special relational or set theory support when performing unions, intersects, etc. across sets in two different spaces.

I haven't read enough literature on ExtendedSetTheory to have formed a solid opinion about it, but my impression is that it mostly focuses on the label vs. value problem (one that doesn't exist in RelationalModel) and doesn't solve the issues of accidental ordering for graphs.

While an interesting exercise in atomic simplicity, many are not ready to part with relational yet, having grown up on it (or a form of it). Thus, the more immediate goal is to extend or enhance relational, not replace it. I still believe that a large part of the value of relational is that it maps well to the concept of a physical tabular sheet. This makes it easier to get one's head around and visually inspect and conceptualize. 8-dimensional creatures in a different universe or highly adept mathematicians, on the other hand, may find it limiting; that is, the primitives are "too flat". -t

[Your goals are different from mine. My interest in ExtendedSetTheory is in providing a common representation (the extended set) for existing structures -- stacks, queues, files in a variety of formats, FrameBuffers, disks, lists, trees, database tables, etc. -- and an algebra over these. It appears to easily subsume the RelationalModel whilst incorporating other structures without the need to convert them to relations. It strikes me as having essentially the same graph representation problems as, say, XML, EssExpressions, the RelationalModel, etc., and demanding no worse solutions than these require, but I say this having given the issue only the most cursory consideration.]

I suspect you may be creating Lisp-like problems: too flexible and open-ended for its own good: insufficient conventions and culture.

[Could be, but pre-judging the assumed outcome prior to experimentation would be rather unscientific. What fun (or possible progress) is there in not trying it? If everyone had your attitude, we'd be programming (if at all) in Plankalk├╝l. Or by knocking rocks together...]

At least they'd be relational rocks :-) I'm what you might consider a "practical experimentor" in that I try to improve on existing ideas rather than seek out blue-sky solutions. Both types of research are needed. Progress comes from both evolution and revolution. -t

(e) allows lazy evaluation where functions are only executed insofar as they are needed.

Perhaps. But are you sure a RDBMS cannot also do that same?

An RDBMS can offer lazy queries - even an NQRDBMS like you favor (not-quite RDBMS where DomainValues are butchered). But offering lazy production of DomainValues is a different issue: since in your approach any access to a value requires a query of the whole 'structure value' table, laziness cannot be achieved across accesses to the value, which effectively means that laziness is simply not achieved.

(page anchor: peekaboo)

From a purely mathematical standpoint, you may be right. However, performance and other practical issues may make it unusable. -t

What evidence do you have that suggests "performance and other practical issues may make it unusable"? Or is that a baseless accusation?

It's not baseless: you haven't demonstrated how it's done, such as showing how the index pointers are hooked up to nodes in RAM/disk to provide sharing. Plus, the default is not that it's not an issue. The default is "unknown". Your "proof" only addresses theory, not practice. -t

Did you just appeal to ShiftingTheBurdenOfProof? "My accusations aren't baseless," TopMind sneers, "until you disprove them!" If you don't have evidence supporting your accusation, it's baseless.

I did not make a formal claim. If you want to play lawyer, go elsewhere. I'm only pointing out that proofs for theory don't necessarily translate to practice. Do you disagree with such a statement, in general? A time machine that requires the energy of 1,000 galaxies be harnessed may make for theoretical time travel. However, in practice, we are far away from harnessing such power, so don't have such technology. In theory one can see the bottom of a big stack by popping a million nodes from it, and then pushing them back into place to leave it the way it was. In practice, waiting for the computer to perform such may suck. -t

I could go around saying: "SMEQL might not be implementable" (it hasn't been disproven yet), or even "TopMind may be downloading child pornography again" (where's your counterproof?), etc. Would you think this is okay? The fact is that your comment insinuates a claim. You could have, with your "unknown" rule, said something else: "performance and other practical issues may make it superior to TopMind's approach by every metric". Thus, to say "performance and other practical issues may make it unusable" implies you have good reason to believe that assertion moreso than its opposite. If you wanted a neutral statement in accordance with "I'm only pointing out that proofs for theory don't necessarily translate to practice", you could have said "However, we don't know about the performance yet". Even that wouldn't be entirely true (we have plenty of OOP and FunctionalProgramming languages to which we can look for actual performance), but at least it would be as neutral as the evidence you claim to possess.

I don't see a significant difference between those two. I'm not sure why the first irratates you more than the second. But anyhow, let's move on rather than argue about arguing. Would it be accurate to say that as written, your proof does not address performance? Another issue to consider is the adaptation effort. Will independent structure types require more adaptation effort and code to serve as different "types"? -t

TopMind might be wanking off to an image of Arnold Schwarzenegger for the tenth time. Seriously, it doesn't bother you if I went around saying stuff like that simply because I "don't know" whether you're doing it or not? What if I used your real name, instead? If you think it's okay, then we'll just call it a difference of opinion, and we'll move on, but I'll be awfully tempted to say stuff like that just to see whether you get irritated at the insinuated accusations and thus prove your hypocrisy. The other option, of course, is for you to simply admit your sophistry right now and try to do better in the future. Your choice. Anyhow, performance has been addressed elsewhere (for node/value sharing vs. copies, functional operations including equality tests, indexing and memoization, etc.), and so has adaptation effort (relative difficulty of writing a function vs. a query, GenericProgramming, composition and code reuse), and so has maintenance effort (data entry, collection). Arguments on those points should not be raised again here as if they are entirely new and haven't been discussed. Raising them here makes it appear as though you're just repeating yourself and completely ignoring all the discussion on those subjects thus far, much of which offers reason to believe performance and other practical issues are not going to be a problem. Your tendency to raise issues like they're fresh even after people feel those points have been defeated in other argument is, likely, part of what earns you your reputation as a 'Ferrous Cranus' as mentioned in ObjectiveEvidenceAgainstTopDiscussion.

You are the one who did such. You re-mention PROOF X as if the case is closed. That is what prompted me to revisit this section to begin with. And you have not illustrated shared indexing with something as simple as a stack (originally in RelationalWeeniesEmbraceOo PageAnchor "milstack"). You appear afraid to illustrate it, hiding behind academic vocabulary instead. -t

The case IS closed. I re-mention PROOF X because you keep repeating the same accusations about ADTs and encapsulation that were never relevant and were demonstrated to be irrelevant in PROOF X. Your distraction tactics don't change that, and make no mistake, that's exactly what you're doing here: sharing in a stack is irrelevant to PROOF X, so raising it here is either sophistry or stupidity. But, since you bring it up, you have also not illustrated any advantage for shared indexing with something as simple as a stack (as you'd prefer to represent it) as compared to using even a completely non-indexed functional approach, and you still have not successfully completed composabile equality tests.

Just admit that you lack enough real-world experience to be equippped to deal with milstack by showing how it all works. You just handwave that FP or something else magically takes care of everything.

I've already told you how to deal with 'milstack'. The answer is memoization and pre-processing, just as is already implemented for indexing strings lexically in advance of 'likeness' queries. You lack sufficient programming competence to understand what 'memoization' is, as you demonstrate when you associate 'interning' with 'what, Monica Lewinsky?'. Thus, it appears like HandWaving to you. That is a problem with your competence, not with the solution. You repeatedly demonstrate an appalling degree of incompetence in your statements, such as your appeal to cascading deletes when it was inappropriate, your lack of comprehension about value-sharing and inability to solve a simple seven-node tree problem, and so on. I suspect even the simple stuff is all just magic to you. I am in the process of explaining the mechanism for indexing complex structures at a level that even an barely-competent layman might understand it for FileTreeMeetsRelational, which gives you a smaller chance of understanding it since I can't compensate for your HostileStudent agendas, especially not in combination with your bouts of plain idiocy. But I'll give it a go since I said I would.

I'm not your damned student. You are afraid of a concrete example because it puts you up against scrutiny; thus, you hide behind magic acronym handwaving. You exaggerated the need for dealing with repeat words, and were left looking like someone locked in the ivory tower so long that the Hunchback from Notra Dame has a deeper tan than you.

Every time you ask me for an example, you have implicitly asked to be my student at least for a little while. The fact that you are hostile to this idea is what makes you a damned student, and a moron. It takes a rude and disrespectful bastard to ask for examples, clarification, etc. then not listen carefully to the answer.

Requesting evidence == "student"? You think weird. And, I also think you are an arrogant blowhard idiot who lacks sufficient practical experience. Thus, the hate is mutual.

Requesting clarification and example == "student". Neither of those are 'evidence'; they're 'explanation'. If you don't want to be a student in a field, then gain competence in diagnosis and analysis of evidence so you don't need to ask for explanations, or at least look things like 'interning' up for yourself when you don't understand. And I'll agree that I lack 'custom biz app' experience, but I don't believe that 'practical' experience to bring into a discussion of RDBMS design.

You are a spoiled whiner who wants everything on your own terms. There's probably more than 10 practioners for every academic. Thus, it makes sense to target your communication thus. Or at least not complain about such requests like a spoiled elistist. "I don't care if the audience is English, I'm gonna use Swahilli!" People come to me on a regular and repeated basis for programming advice, clarification as to the purpose or use of a programming pattern, and occasionally for help debugging their programs. I am one of very few in my immediate workplace who came into programming from a CS degree rather than out of EE or CE; I suppose that makes me the academic among them. I don't complain about their requests, I help them to the best of my ability, and I don't consider them any less intelligent for asking, or even for admitting they blanked on a concept and requesting it be reviewed again. But there is a critical difference between them and you: they listen, they don't have agendas in attacking me when asking for clarification, they don't play HumptyDumpty games with language, they don't interject snide comments like "huh, Monica Lewinsky?", they don't pretend or delude themselves about their competencies, and they don't assume they already know the answers for which they're asking the questions. I.e. they're not self-important HostileStudents. I honestly believe most 'practioners' would be dreadfully insulted at the way you pretend to represent them, and that most of them would oust you as too belligerent, stubborn, and close-minded to be one of them. The idea that they're standing behind you is entirely your delusion. And, honestly, most academics wouldn't include me among their ranks, and I don't pretend they do. I don't do CBA, I'm not an IT person who maintains a server, but I also don't write white papers and teach classes for a living. I don't even have a Master's degree. The closest thing I have to being an academic is an academic's temperament and habit of being an EternalStudent.

I have no way to verify what other people think of you. The most treacherous people often think that everybody loves them. G. W. Bush for example. If I ask them outside of your presence, they may have a different story about you. Before I closed my email, I used to get fair volumes of what could be considered "fan mail" from people afraid enter heated debates but glad that I press people to back their GoldenHammer claims. Even from at least 2 book authors. "OOP will save the world!", "Types will save the world!". You are just yet another with pretty much the same song and dance. Anyhow, we both think each other are aholes and I doubt that will change. -t

(page anchor: What-DB?)

Re: "If you want to "sell" the idea of a Type-DB to the custom biz domain..."

I don't know what a "Type-DB" is supposed to be, but I do know about relational databases with support for user-defined types. Here's an example.

Using TutorialDee syntax, imagine a trivial database of employees and departments with no support for user-defined types:

 var employee real relation {name char, address char} key {name};
 var department real relation {name char, city char} key {name};

Then issue the following query:

 employee join department

Argh! It gives no errors! It should give an error message, because joining employee to department by matching employee names to department names is utterly meaningless. Yet, it's easy to accidently do in almost any query language, whether SQL or TutorialDee.

So, we'll improve it, using user-defined types:

 type empname possrep {name char};
 type deptname possrep {name char};

var employee real relation {name empname, address char} key {name}; var department real relation {name deptname, city char} key {name};

Then issue the following query...

 employee join department

...and get the following error:

 ERROR: An attribute named 'name' is found in both operands but differs in type.
 Line 1, column 24 near 'department'

Hooray! Imagine the significant number of careless errors that users and developers can no longer make when creating queries. {Questions regarding "significant number" raised below.}

By the way, structure types can be defined. A binary tree:

 type nametree
   possrep node {left nametree, right nametree}
   possrep leaf {name char};

A RelVar (i.e., table) with a tree-valued attribute:

 var myvar real relation {blah char, twee nametree} key {blah};

Populate it thusly:

 insert myvar relation {
   tuple {blah 'zot', twee node(leaf('zap'),leaf('zip'))},
   tuple {blah 'zoz', twee node(node(leaf('zoop'),leaf('zaz')),leaf('zip'))}

Select tuples from it:

 myvar WHERE twee = node(leaf('zap'),leaf('zip'))

And so on. Obviously, operators can be defined to provide various operations on user-defined types.

A few days ago, I used this to "sell" the idea of user-defined types to a senior DBA. Prior to seeing the above, he was quite insistent that user-defined types were useless in his domain. Now he is convinced that they are absolutely necessary.

The tree example?

Both, and the general concept.
Re: "Imagine the significant number of careless errors that users and developers can no longer make when creating queries."

Why was my complaint about the frequency claim removed? Related: AlternativesToNaturalJoins. It's the kind of error that is usually caught immediately because it is outright wrong. The problem errors are more often those that work say 99.99% of the time. Or more often, where the developer didn't understand the domain rules. -t

Because it was nonsense. The meaning was obvious. It's the kind of error that frequently makes it into late development & testing, and sometimes makes it into production, rather than being trivially caught at the point where the query is created, because although it is "outright wrong", it's easy to make and can be hard to see in a complex query. As for the "frequency claim", I changed the offending sentence slightly so there should be no need for complaint. I'll remove this bit, too, once you've read it, because all these interjections make the text difficult to read. Create another section below if you have something genuine to add.

I will agree that "number" may not necessarily be the same as frequency. But I feel this needs to be explicitly pointed out because frequency of occurrence is generally more important than the quantity of error paths in practice. Also, I offer my anecdotal observation about the actual frequency of those kinds of errors. Such is useful information to practitioners, even if it breaks your types-save-the-world HobbyHorse world-view. Further, gumming up the schema with type declarations may make the schema harder to read, creating it's own problems. Type-ness usually has a verbosity cost. Further 2, if the type declarer is not careful, they may create overly-restrictive conditions. It's another activity that has to be monitored and managed. Developers/designers are not always automatically motivated to think things through. They may create type-spaghetti as job security, for example. -t

Such errors may indeed be infrequent, overall, in simple schemas. In complex schemas (one example where I work has 40,000 tables), any mechanism to avoid any otherwise-easily-caught error represents a benefit. The notion that a schema becomes "harder to read" because it contains type declarations is nonsense, and I don't know what "overly-restrictive" conditions you're contemplating. "Type-spaghetti" as job security is pure paranoia on your part.

Your arguments make me increasingly suspect that you are not a practitioner at all, but are, in fact, a largely non-technical middle manager, whose desire to eliminate abstractions, types, etc. -- and your insistence that programming be reduced to nothing but simplistic dynamically-typed procedural constructs -- owes entirely to the fact that you're afraid the developers under you will create code that you don't understand and therefore can't alter or micro-manage.

40,000 tables smells like a design flaw. I'd like to see the justification for such, but probably will not get it from you. Anyhow, without specific scenarios to study, this will simply degenerate to a name-calling anecdote session. But would you at least agree that one could probably make "type-spaghetti" if they wanted to? In other words, heavy use of types by itself does not guarantee proper use of types.

The system with 40,000 tables is a commercial enterprise data-management system for universities that covers ERP, CRM, HR, payroll, you-name-it, etc. According to one of the DBAs, the biggest design flaws are (a) a total absense of any relationship integrity at the database level (it's all done by the applications! *gasp*), and (b) a no user-defined data types, thus causing the sort of errors noted above especially when creating ad-hoc reports and custom extensions. The tables, however, are apparently appropriately normalised and reflective of the business requirements.

I will certainly agree, however, that one can make "<x>-spaghetti" out of any language feature <x>, whether types, tables, scripts, procedures, indexes, loops, classes, methods, lambdas, variables, etc. Because a poor developer can mis-use feature <x> is no reason to avoid <x>, or we wouldn't be programming at all.

I find it notable, by the way, that you have not disagreed with my assessment of your job role.

I've said many times that I'm not a manager. My general experience is that only about a quarter of IT workers keeps a "clean" system, whether that be schema design or code design. This is largely because management does not understand such, or doesn't stay around long enough in one position to care, and thus does not value it. Heavy typing just gives an employee more tools to create job-security messes with. If your experience differs, so be it. You don't have to insult people who observe different situations. And it still smells suspicious when the DB has more tables than students. -t

Incompetent or "job-security"-seeking employees can create messes with anything -- like the incompetent developer I knew who created 100 procedures differing only in name and the initialisation of a local variable, and invoked them sequentially instead of using a 'for' loop. Because procedures can be misused, is that a justification not to use them? No matter how good the tool, an idiot can use it to break something.

But a wider group of people can keep an eye on such a thing. The more obscure or specialized the technique, the harder it is to scrutinize and monitor. Things just don't magically stay clean in most cases. There has to be an incentive and monitoring.

Have you even looked at the example above? In what way is using types an "obscure or specialized technique?" Conversely, why not consider loops to be obscure or specialized? Beginners to programming certainly consider them obscure, as do many non-technical managers. How do you determine where to divide between "commonplace and intuitive (or at least well-understood)" vs. "obscure or specialized"?

Let's just say its more things to master, leaving the merit issue aside.

Procedures mean "more things to master" than a language without them. Why don't we avoid procedures?

Actually I'd rather have warnings, or the option of warnings, for certain behaviors rather than outright errors. Sometimes when one is in a hurry, they don't have time to surf the type graph to figure out what is going on. The boss is breathing down your neck and you need to deliver ASAP. It's easy to fall into the trap of over-classifying things. -t

So... What you're saying is: You want to join VAR x REAL RELATION {Name CustomerName} with VAR y REAL RELATION {Name DepartmentName}, and it turns out that someone has inadvertently defined Name in x as type CustomerName when it should have been type DepartmentName (the only circumstance I can see that justifies your suggestion), and your boss is breathing down your neck, so you'd like to join them anyway? Fine, in DateAndDarwensTypeSystem, the components of a type are always exposed. E.g., given:

 TYPE CustomerName POSSREP {a CHAR};
 TYPE DepartmentName POSSREP {b CHAR};
 VAR x REAL RELATION {Name CustomerName} KEY {Name};
 VAR y REAL RELATION {Name DepartmentName} KEY {Name};

You can effectively bypass the type definitions by using their possrep components (i.e., a and b, which happen to be the same type) and join them as follows:

 (EXTEND x ADD (THE_a(Name) AS NameForJoin) {ALL BUT Name}) 
 (EXTEND y ADD (THE_b(Name) AS NameForJoin) {ALL BUT Name})

Okay, but it's still one extra step. I'd like to see more payoff for all that beyond just preventing unintended joins. That's not at all near the top of my problem list.

View edit of July 10, 2010 or FindPage with title or text search