As tools become more complex, they are essentially (directly or indirectly) creating databases to hold state information. Examples include:
If such tools come to grips with the fact that they are essentially creating databases, then working with them, debugging them, learning them, and extending them could be a lot simpler.
A language that exposed its internal structures could allow meta-programming without having to add new syntax features to provide them. The language could be simpler because "fancy stuff" is done by manipulating the database rather than with confusing syntax.
It is not so much being "hidden" (locked out), but that it is hard to work with because the built-in tools are meant to mostly deal with textual syntax or the like instead of the "database" itself. If the designers realize they are making a complex database (or at least a complex/compound data structure), it would possibly change their approach and encourage them to treat it more like a database than a bunch of syntax and commands that alter the database.
Then again, some people seem perfectly comfortable dealing with such structures using behavioral syntax. I personally do not understand why, but realize people think vastly different from each other. I myself would rather deal with complex constructs with a database browser and query languages or something similar instead of as syntax. It still might be the case that some people simply have never worked with good DB or data-structure browsers, and are thus too used to syntax-centric approaches to change. To me it is a guessing game. You write some Lisp code for example, and that code is turned into an internal hierarchical database that changes over time. One has to mentally translate from the syntax to the current state of the internal database. I would rather cut out the middle man and deal with the structure itself.
I also think that people would realize that relational is superior to tree-based structures if they could compare side-to-side for a while. Trees perhaps make for better textual syntax, but make for crappy databases in my opinion. -- top
[Let me just reflect on this... Anything which stores organized data (seems to be the implicit definition of a database) is a database, including things which use trees. But they shouldn't use trees, because trees make bad databases. So everything should be procedural?!? Maybe you need to take a step back and look at your assumptions. The relational model is not the be all and end all of data representation.]
Well, trees and databases are not necessarily mutually-exclusive (SyntaxFollowsSemantics
), and perhaps we should separate the concept of treating stuff as a database and which kind of database it is, be it hierarchical, relational, or something else. I find trees fine for smaller things, such as individual expressions, but on a bigger scale they become arbitrary and messy. If people would get used to more of a database view into stuff, I think they would tend to shift away from trees because they are no longer bound to just a textual view. I feel tree-centric syntax is a side-effect of being to closely bound to text as the medium.
A related issue is whether such databases should be a NavigationalDatabase
. Some suggest that navigational is faster for smaller-scale databases. However, this may be simply because nobody has tuned a relational database for such use. (Or maybe nobody has invented a SufficientlySmartDatabase
[Well, clearly NavigationalDatabase
s are the best, since the best database known to exist is the human mind, which is a node based network.]
I hope that was meant in humor.
[Of course it was!]
Whew. Thanks for confirming... it's hard to be sure these days on wiki.
If it was relational, we would have cracked it by now. -- top
Actually it is an interesting thing to think about with regards to NavigationalDatabase being "the best". Certainly one would think that it was meant in humor, however let's take it a bit more seriously and have a discussion. Why is the brain so intelligent if it has such a crappy database? Maybe it isn't crappy. Maybe it is the RightToolForTheRightJob? in that our brain works better non relationally, since it is a different beast that doesn't require so many facts (DatabaseIsaRepresenterOfFacts?)? Or does our brain use facts? If so why wouldn't it be better relational? or is our brain a combination of relational and navigational? Since our brain is a product of evolution (undesigned) it may be a poor design (since it wasn't designed!). On the other hand, we are pretty smart, even with this crappy node based system. Or is it all that crappy? Maybe it has benefits? Does our brain use pointers? Stack? heap? Not a good analogy?
Now we're getting into the definition of a "database". Top seems to suggest that any
sufficiently complex data structure is a "database", including symbol tables, call stacks. (How about a simple list or array or struct?)
Others have more stringent definitions of "database" - I tend to think that a database is something that provides some minimal persistence and/or transaction semantics (not necessarily ACID). Others go further, and exclude "toy" DBMS's from the community of databases (i.e. MicrosoftAccess
isn't a real database).
Certainly call stacks don't require persistence or transaction semantics - nor are they likely to benefit from AdHocQueries
. OTOH, we could certainly end all arguments with TopMind
by agreeing to redefine everything
as a database - hence when I program in C++, I'm doing TableOrientedProgramming
because a VeeTable
is - guess what - a database. (OK, it ain't relational
, but pfft.)
Such a conclusion would certainly make Wiki a quieter
Perhaps we need another term. "Data structure" generally implies a single part of such internal structures, such as a single map or single list. We need a name for something that is an interlinked network of multiple data structures. "Attribute base"? Attribute management engine?
- Each "interlinked network of multiple data structures" is a data structure, just at a different scale. Zoom into a map and you may find a table. Zoom into a table and you may find a list. These scales of data structures form a hierarchy. We already have a name for them: objects.
- Please clarify.
- Which part?
- I assume they are combinations of things, not necessarily nested in any kind of aggregation pattern, at least not guaranteed to be that way.
- Each whole is guaranteed to be made of parts. Classic aggregation. I can make a map that is guaranteed to contain a table of entries. I can make a table that is guaranteed to contain a list of elements.
- In practice cycles are eventually going to have to be introduced, ruining any "clean" aggregation. For example, function A may call function B and function B may call function A in a program "structure". Hierarchical aggregation would be nice if it worked in practice, but it does not beyond the trivial.
- No, cycles don't have to be introduced. We create and use these data structures every day. Even if a map has a table that has a map, it isn't the same map. Each scale has its own context.
- Well, I suppose with enough type-checking and such, one can avoid cycles, but personally I find relational more consistent. There are less possible variatioins and more standardized management techniques, such as referential integrity, etc. It is the classic HolyWar battle between NavigationalDatabase and RelationalDatabase. One may use a list of maps of lists, and another person use a map of lists of maps if doing the same thing. Plus, I feel that tables scale better with less DiscontinuitySpike's, as described in AreTablesGeneralPurposeStructures.
I think you guys are missing the point. It's not that any data structure is
a database, it's that any data structure can be represented by a (relational) database model. That shouldn't be a surprise; the theory was developed specifically to that end. -- DanMuller
But that is just a TuringComplete-like truism. We still need a name for the internal structure thingy.
Well, this isn't such an absurd idea. Most of the "hidden databases" in programs could be easily mapped to a relational database structure, modulo the "identity issue" of OO. Certainly a lot of metadata would fit. And if languages had convenient, built-in APIs for database access, this would open up some interesting introspection capabilities. The implementation
of such things wouldn't necessarily be the same as for bulk application data, but a uniform interface for data manipulation could be a very welcome thing. (I'm downplaying the quibble about the term "database"; certainly multi-user databases are more interesting than single-user, and persistent is more interesting than transient. What's of more immediate interest in this context is the queryable interface to the data.)
- The interesting question is: Do you need AdHocQueries, joins, or things that only make sense in the context of a RelationalDatabase? In some cases, the answer is yes; in other cases I'm not sure.
- Tell me you never had to urge just to just query stuff during a tricky debugging session?
- The only stuff I've had the urge to query while debugging was already in a database. The stuff in memory is already on the call stack. -- EricHodges
- You never wanted to be able to take that call stack and search, sort, filter, dump, etc. it by different criteria than the original format without lots of parsing?
- No. It's a call stack. The values on it have significance because of a frame and a name. No other search criteria matters.
- Nothing else matters, eh? One-View-Fits-All has finally arrived.
- No, I've just never needed to query RAM while debugging. Why do you want to? What can you find that isn't right where it ought to be?
- [I can't see what possible use ad-hoc queries against program state would be. When you're debugging at the level, the whole point is that you're trying to solve problems with the actual program state - everything you need is in the call stack. The structure of the call stack is part of it's nature, you can't represent it as something other than a stack and expect to get anywhere. If you're chasing down corrupt pointers, why do you think a query syntax is going to be any more helpfull than looking at memory directly? When you're working at that low of a level, the LAST thing you want is magical database abstractions getting in your way.]
- The stack references the source code all over the place. This provides the opportunity to cross-reference stuff to make it easier to sift through. Further, just because you are not used to something does not mean it is not useful. For people who have never used a telephone before, the uses and conviences it may provide may not be immediately apparent. You can still map your view to the "old way" perhaps, but may slowly start using new features over time. There is nothing universal about whatever interface your debugging tools give you such that that interface is the only and best you will ever need. There is no one right view for everything.
- And what trade off's are you making to add AdHocQueries, nothing comes for free?
- I perfectly agree that higher abstractions may cost in performance. Perhaps this concept is a decade or two before its time.
Why do you say (or imply?) that AdHocQueries
only make sense in the context of a RelationalDatabase
? Haven't you ever formed the union or intersection of data in memory? This stuff certainly comes up occasionally. More for framework or middle-layer code than for business app UI code, probably, but it does come up.
Who knows what it would cost you? In theory, it wouldn't have to impact parts of the program that don't use it. If you can find a way to map the data to your API without changing how its currently stored, that is. (All pretty hypothetical, of course - details will vary wildly.)
BTW, it's worth noting (for those who don't already know this) that relational databases follow exactly this pattern. The metadata describing tables and fields, and sometimes other things like constraints and indexes, are all made available in tables. This is way handy for writing development tools like reporters, automated database-to-program mapping utilities, code generators, etc.
Big whoop, databases are good at storing metadata, but that metadata still get's loaded into a program running in memory, which uses hashes, arrays and other such structures that can actually work at the speeds programs need to run. If you added all the overhead to make every structure available relationally, you'd just end up with another slow database. Programs aren't about data, they're about behavior, the data just parameterizes that behavior, but the data just isn't that interesting, it's just data, it's a glorified file, that's queryable.
"...but that metadata still get's [sic] loaded into a program running in memory, which uses hashes, arrays and other such structures that can actually work at the speeds programs need to run." So what? If you have an interface to these data structures that works fast but is similar to the interface you use for database data, you've got all the considerable power of a query interface or language, you don't have to remember different ways of manipulating data to achieve the same ends, it doesn't have to be slow, and you could presumably execute queries that involve both persistent shared data, memory-cached persistent shared data (e.g. slow-changing but frequently-used database meta-data), non-shared persistent data, and ephemeral local data.
The whole point is you can't have such an interface. Adding relational query capabilities adds far too much overhead. You think the language stack needs to be relationally complete, or that the symbol table does... you're living in a pipe dream, it'd bloat things to the point that nothing would work. You think all that flexibility just comes for free or what?
Relational access path... you can't have a part time relational system. If it's there, it has to be the only path, or you could violate it's constraints and fuck things up. And if you have to use it... it'll be too slow for system level operations like the language stack and everything tops pushing. I don't lack imagination, I just think you're glossing over the difficulties involved in building such a system, I think it's a pipe dream.
- Bullpuckies. For one thing, you can't say that categorically without picking a specific example, because performance is all about details. Furthermore, in many cases the data already exist in some form, and most likely wouldn't have to change form. The relational access path itself might not be fast, but there would often be no reason that you'd have to slow down the system as a whole to accommodate it. The relational interface could probably gather the data from wherever it is currently arranged. You lack imagination.
- Note that I did not write the above (IIRC), but generally agree with it. Relational is a set of rules and conventions (an "interface" you could say), not an implementation. Whether all possible implementations are necessarily slow for such use makes for an interesting question. We may have to wait for MooresLaw to catch up to TOP for many domains. --top
If you think it's so easy, then build it, I wish you the best of luck.
- Of course it's not free, you'd have to implement it. But it's by no means a given that other parts of the system that do not use the new facilities would have to pay any cost whatsoever.
- I've built similar things. It's not rocket science - multiple interfaces to the same data. Ever work with COM? It's all about interfaces. Whether it's easy in a particular situation will depend on how the data's currently arranged. This is a very hypothetical discussion, so unless you want to get into more concrete examples, it really would be appropriate to drop the "it's too hard" argument.
That could be a very "big whoop" for convenience and simplicity of programming. Convenient database query APIs are a prerequisite for this approach - for instance, a simple key-based lookup should be as easy to do as a hash table lookup.
[The "still gets loaded into memory" comment comes from somebody who apparently does not "get" databases. ]
No, you just don't get programming, and think that database are actually fast enough to do all that stuff.. well they aren't. When they are.. then great, it'd be nice to have those capabilities, but that's not in the near future, computer's are still too slow for such overhead at such a low level to be implemented efficiently. When a database can keep up with the speed of chasing pointers... then you can talk, until then, you're just wrong.
- Databases are fast, often faster than custom brewed code of your own. Databases may even use pointers internally depending on the implementation. It's not the database that is slow. Connecting, authenticating, and sending a string that needs to be interpreted (SQL) is slow (relatively, careful of premature optimization). There is tcp/ip sockets overhead too. There is also disk overhead if the database is written to a hard drive.
[RAM and Disk are immaterial. Databases can and do use cache. It is a weakness in OO that it cannot easily blur the distinction between RAM and disk. When using a database, you only create temporary, local "views" of information from the database, usually a single table (result set) or map. You don't reconstruct much of or most of the schema in RAM. That is a violation of OnceAndOnlyOnce
. Too many OO programs create a big long tangled structure that mirrors the database more or less because OO'ers just don't like working with a database for some unknown reason, so they spend (waste) a lot of code translating to and from their favorite form. -- top]
- In my mind, the issue isn't "relational" vs "OO". That gap can be bridged; and many of the impedance matches observed today (such as converting database attributes or records to objects and vice versa) can be fixed with better tools and more common infrastructure. And that's not strictly a database-vs-OO issue, anyway - migrating data from one RDBMS to another is a trick (despite having SQL as a common infrastructure), as is getting two OO languages to share objects (which is why various technologies like CORBA, DotNet, ComponentObjectModel, etc. exist). The real impedance mismatch isn't OO vs relational, it's the DistributedSharedStateConcurrency problem - it's the model of "all persistent data goes in OnePlace? (the database)" vs the model of "state distributed across the enterprise". Many procedural programmers buy into the database model - applications get data from the database and display it, or they modify it and do a commit - but data held by the application is only temporary. Many OO programmers are uncomfortable with that model - if they have an Employee object in memory on some machine somewhere, they want to manipulate it directly as if that particular object (and not an equivalent database record) is the canonical representation of the employee; the RDBMS tends to get viewed as backing-store. That seems to be the key issue in many of these discussions.
- Why do you need an employee object in the first place? Wouldn't it be better to just have an employee table from the start? Why does an employee need "methods", isn't the employee more of a data thing, without the need for object methods?
- Before we continue, is it really the physical location that is the issue? An app developer does not care where the database is or how it is partitioned as long as SQL and transactions work. However, some central "authority" has to coordinate things such as Joins and AcId compliance. OO'ers seem comfortable handling this issues in their own custom way.
- The issue isn't physical location, it's logical location. You are correct, the app developer doesn't care whether a given record or table is in RAM on the DB server, on disk, or on the machine down the hall; the DMBS abstracts all that stuff. However, in a properly designed, database-centric enterprise application, application code generally takes stuff out of the database, modifies it, and sticks stuff back in. The canonical repository of the state of the enterprise is in the database; not the application. Many OO programmers like to view the world as a big sea of interlocking mutable objects, where the objects themselves contain the canonical representation of the StateOfTheWorld?. And in many problem domains, that's fine. But introduce a database, then it becomes a problem. To the procedural programmer, explicit commits to a DBMS is a natural and normal thing to do; to a some OO programmer it might not be - one of the HolyGrails of OO is abstracting that sort of stuff away.
- Yes, location is the issue. Server based systems vs. PeerToPeerDataStorage, relational vs. OO is essentially about that. Listen to Alan Kay when he speaks about his intention for what he meant OO to be, like biological based systems, peer to peer, with no central authority, each object being it's own little completely encapsulated entity, working in cooperation with other entities to do work. Server based system can't scale like peer to peer systems, some of us think peer to peer is the future, and just hasn't been totally accepted yet. The OO mind wants a peer to peer system, objects working with objects, not a big global datastore that can't possibly scale to the ultimate goal.. a global peer to peer network of objects. That's Alan Kay's dream, and I share it, because it's based on biology, something we know works and scales to massive levels. The internet would not have worked were it server based, there's a lesson in that.
- Be careful with biology analogies. Computers are not biological. Biology comes from Evolution, which is a gigantic mess, not designed (unless you are a Creationist). Since evolution wasn't designed, why should we follow a poor (lack of) design? As for scaling - there are plenty of applications like big banks who use databases with millions of transactions daily. I would laugh if one of the big banks used Object Orientation for doing banking transactions. LaughOutLoud that would be a good joke.
- Commits are part of the interface, not an implementation. Besides, one usually only needs explicit commits if they are issuing multiple statements. They are decisions that are going to have a counter-part somewhere in an OO design also. You can't escape the fact that some transactions/changes/alterations are going to potentially conflict with other users. Transactions are simply one (good) convention for dealing with them. Databases are basically a set of time-tested conventions for dealing with state and attribute management, including multi-user contention handling. They are similar to building codes for housing. OO'ers seem to reject those standards and feel they get better results if they cowboy-up their own. I have not seen compelling reasons to reject the conventions. I don't see the up-side of rejecting them. Instead I see coded versions of shanty-towns, which are what you get without building codes. -- top
- OO'ers reject which standards? Part of the problem in this debate is that you seem to see a movement among the OO crowd to overthrow the RelationalDatabase and replace it with something else. Ignoring obvious AntiPatterns such as the XmlDatabase (which is sufficiently blatant stupidity that I doubt you'll see anyone defending it, except perhaps for very limited applications like configuration files and the like), I don't detect such a movement. Most shops have RDBMS's of some sort; it's widely considered a BestPractice in the industry to do so. And...you seem to react to this perceived problem by suggesting we get rid of OO instead. I think the two can coexist.
- I don't think very many are happy with the current state of OO and RDBMS integration. For one, solutions often violates OnceAndOnlyOnce because schema-related information often has to be mirrored class code.
- The long-term solution to that is to have a uniform way of specifying both relational schemata and class definitions. There are lots of obstacles to that (many of them political/commercial rather than technical in nature), but it's potentially doable. Agreed that the current state of affairs is ugly.
- I don't believe they can ever be resolved without limiting one or the other or both. They just have fundamentally different underlying philosophies (TablesAndObjectsAreTooDifferent).
- By "commit" I mean any query/transaction that alters the database; including single-step transactions (where one wouldn't explicitly write COMMIT in the Sql code) and complex transactions. Regardless of whether or not one issues simple or compound transactions to the database, a procedural programmer still explicitly specifies database access.
- My position on the OoRelationalImpedanceMismatch?? I think a lot of the issues can be resolved (as mentioned above) by better tools. I certainly don't think OO will go away; nor do I think OO is incompatible with databases; though some OO programmers will need to change their habits. I expect RDBMS products (and related technologies such as query languages) to continue to mature and better adopt to the OO way of doing things. I expect FunctionalProgrammingLanguages to become more important for many things - a functional (or even logic/constraint) + relational model makes sense (use the FPL for implementing queries and transactions; use the RDBMS for the persistent state). I have doubts of the ability of OO persistence frameworks (e.g. the Prevayler and such) to scale to large distributed systems - especially when complicated integrity constraints are involved (DBMS's can handle this better). OTOH, I do agree with TonyHoare that we will have to start dealing better with non-determinism in how we process data - the RDBMS (one big entity managing all the important state) approach has its own scalability limit which many organizations are already well past.
- Well, at this point in time, OO frameworks are not up to RDBMS yet. As far as your "high-end" RDBMS problem claim. I would like to see specifics on that. I agree that large organizations are swamped by complexity, but still no tool has been shown better than RDBMS under that either. How does abandoning relational rules help larger organizations? Which ones get in the way?
- As to what scales better than a RDBMS (while still providing full transactional integrity - in other words, capable of providing GlobalConsensus), we don't yet have it. 'Tis an open problem in computer science. And there are datasets out there which even the beefiest RDBMS cannot handle - the way those are dealt with currently is to often partition it into multiple logical databases and tolerate some inconsistency. RDBMS's continually improve, but the biggest issue these days limiting RDBMS scalability is I/O bandwidth and reliability. MooresLaw hasn't applied to network connections or disk I/O speeds, and it especially doesn't apply to long-haul networking. Many of these concerns are outside the purview of database theory; but the database vendor must consider them nonetheless. On the suitability of databases, I'm certainly not arguing with you - other than to suggest that continued research into object persistence engines, distributed computing models (ie TupleSpaces? or CommunicatingSequentialProcesses), and other related technologies (as well as infrastructure like CPUs, memory, and networks) is important. Regarding relational rules - again, lots of this is orthogonal to the relational model. Many datasets (like those managed by things like ActiveDirectory and the like) are handled with non-relational databases (AD is essentially a NetworkDatabase) tuned to the application (for application-specific datasets, non-relational is often appropriate). And AD, NDS, and other tools for this have the same concurrency issues as does a RDBMS. One interesting issue with the RDBMS is the ease with which global constraints can be created - if completing a transaction requires checking every table within a database, that has a major impact on scalability. Fortunately, most transactions are limited in their affect (and dependence) on a database at large.
- Cache from http://www.intersystems.com/ appears to solve the mismatch. Cache's database access allows OO and its implementation is hierarchical, but it's derived relational implementations appear to out-perform the traditional relational systems.
- Haven't seen a niche DB vendor who does not claim that. Anyhow, I would like to see how it solves the difference items listed under TablesAndObjectsAreTooDifferent.
- I include the reference to Cache only to illustrate that there are solutions to the DatabaseImpedanceMismatch OutThere. Cache's solution appears to accept the DatabaseDefinition from any of ObjectOriented, RelationalDatabase, NetworkDatabase or HierarchicalDatabase (and others). I think this is of importance to the discussion because it implies that there does exist a level of source which can allow relational implementation for ScaleAbility? (if appropriate) and object implementation for convenience (if appropriate).
Who cares which database implementation is used if the database is hidden?
I would rather aim at producing software which requires no changes when I choose the database layer.
I think that this very ancient argument about the right database is one of the continuing distractions which impede the progress of software improvement.
Given that there are at least two well-supported opposing views to this database question, is one gonna be right one day in the future?
The right database may be one of these opposing views' database, or it could be the other.
More likely it will be a new view which includes all the argued preferences.
So why argue about our particular views of databases when we could look for ways of specifying systems which make this argument redundant? -- PeterLynch
DatabaseVendorLock is often an exaggerated fear in my opinion. There is not enough standardization to easily swap DBMS, even among OODBMS, unless you use a very low common denominator, which unlike vendor switching, will impact the here-and-now. IndirectionIsNotFree?. Besides, databases are a high-level tool. Expecting to swap out a high-level tool is like expecting to be able to swap out OOP and replace it with FP.
See also: SyntaxFollowsSemantics