No More Databases

It's just one of those times I need to grump. I'll grump here on the C2/Wiki editorial page to see who notices.

I am grumpy because I am really feeling the need to say goodbye to databases. I mean forever. I don't care if they're relational or object-oriented or what. Let's say goodbye to them. File systems, files, databases, the works.

Disks make a useful third or fourth level store at the systems level, but why should my application care?

Harumph. Who needs them anymore? At least, why should we build new ones? While we're at it let's get rid of networks too.

-- PatrickLogan

A lot of work has gone into PersistentOperatingSystems? of this type, Eros and Rekursiv being two examples. However, they have all had various performance and stability problems.

One of the few attempts to commercialize this idea was with the CanonCat; unlike most persistent systems, the Cat did not try to replace explicit disk access with virtual memory, but instead would silently save documents as needed. Documents would be referenced by text searches. While not specifically about persistence, the XanaduProject had many similar goals regarding elimination of filesystems and avoidance of explicit disk operations. -- JayOsako

This seems kind of shortsighted to me. And, though maybe this is going too far, symptomatic of a lot of what is wrong with our craft. Far too many programmers I've met focus on (1) getting the product out the door, (2) making the objects right, and (3) getting the patterns down.

Where, of course, (2) and (3) are mostly lip-serviced and (1) is the dark obsession to end all dark obsessions.

An example of what's wrong with this: Nobody seems to read Knuth (see DonKnuth) anymore. I'd like to think I'm wrong, that we all worry about exponential versus n^2 in our implementations and so on but I get the feeling that even Knuth doesn't quite care any more.

The upshot (and three sentence rebuttal): We program on machines. Paying attention to the machines (and the way our code interacts with the hardware) is not a bad thing. I want to be aware of the disks and networks (well, some of the time).

-- WilliamGrosso

Hey! I read Knuth. Sometimes. Really.

I too have met many programmers that are obsessed with (1). Shipping the product is very important. Getting the product right should be more important. No, I am not talking about risking your company's inflow of cash. Programmers and engineers should always put the correctness or quality of the product first. Let the managers and salesmen worry about getting the product out the door. There should be a balance here. Fight for quality - the product will ultimately ship (even before you think it should). Don't fight for quality and the software will ship earlier. Neither the administrative staff nor the technical staff should be able to easily win the fight. Darn it, there is a pattern in here somewhere...

Closer to the topic: Don't get rid of the database. Get rid of the machines. Okay, don't get rid of them... just tame them. We, as an industry, are suffering from MachineLust.

-- ToddCoram

When I teach, I poll the students to ask how many of them know who DonaldEdmundKnuth is (well, I don't use his middle name, but the insightful will understand the joke). Typically fewer than 20% of my audience will have ever heard of him, let alone have read his series.

Databases are one of the precious few areas (type theory being another and complexity theory another) where there is any science left in ComputerScience. It is science that has served us well. ObjectOrientedDatabases lack this formalism, which I think will be/has been one of their great downfalls. Few object models constitute a FormalSchema?.

There are many problems that beg databases because of their scale, concurrency, and reliability needs. Too many things are called databases that aren't ("my .profile is a database of blah blah...") and databases are too often used as overkill (without scale or concurrency). Right tool for the right job - and often, databases are the right tool.

-- JimCoplien

When a database is the right tool, use it of course.

So often, though, I see folks worrying about which database product to choose, when they haven't figured out what they're going to store, or how it's going to be used.

I've been managing the growing list of IP addresses that visit for over a year now. A few months back, at 1/2 million names, Perl associative arrays (backed by text files) ran out of gas, so I switched to a SQL store.

It might seem perverse not to have switched sooner. But in fact, Perl turned out to be really useful for primordial data management. I built a lot of log analysis apps in a hurry. People used those apps. I learned what mattered to them and what didn't. When it finally came time to formalize some things in SQL, there was no wasted effort.
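The progression described above can be sketched roughly as follows. This is Python rather than Perl, and the file layout, the `visits` table, and the sample addresses are all assumptions made up for illustration, not the original scripts.

```python
import os
import sqlite3
import tempfile

def load_counts(path):
    """Read 'address<TAB>count' lines into a dict (the 'associative array')."""
    counts = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                addr, n = line.rstrip("\n").split("\t")
                counts[addr] = int(n)
    return counts

def save_counts(path, counts):
    """Write the whole dict back out as text; fine while the data is small."""
    with open(path, "w") as f:
        for addr, n in sorted(counts.items()):
            f.write(f"{addr}\t{n}\n")

def migrate_to_sql(counts, conn):
    """Once the shape of the data is known, formalize it as a table."""
    conn.execute("CREATE TABLE visits (addr TEXT PRIMARY KEY, hits INTEGER NOT NULL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", counts.items())
    conn.commit()

# Prototype phase: plain dict, plain text file.
path = os.path.join(tempfile.mkdtemp(), "visits.txt")
counts = load_counts(path)
for addr in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    counts[addr] = counts.get(addr, 0) + 1
save_counts(path, counts)

# Later: the same data, now in a SQL store (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
migrate_to_sql(load_counts(path), conn)
top = conn.execute(
    "SELECT addr, hits FROM visits ORDER BY hits DESC LIMIT 1"
).fetchone()
```

The text-file phase costs almost nothing to change, which is the point being made: the schema only gets written down once usage has shown what belongs in it.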

There is some kind of pattern here that I would liken to a data-intensive equivalent of RapidApplicationDevelopment - perhaps it could be called RapidDatastoreDevelopment?.

I think Lisp and Smalltalk people do this a lot: reading and writing ASCII-ized representations of in-memory objects. It's lousy for transactions, or huge data, or fast response. But it's great for prototyping what those huge, transaction-intensive, sub-second-responsive data sets should contain.
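For readers who haven't seen the print/read habit, here is a rough Python analogue. It is only a sketch: the `catalog` structure is invented, and `pprint`/`ast.literal_eval` stand in for what a Lisp would do with its printer and reader.

```python
import ast
import pprint

# Hypothetical prototype data: plain in-memory structures, no schema yet.
catalog = {
    "widget": {"price": 250, "tags": ["small", "blue"]},
    "gadget": {"price": 1200, "tags": []},
}

def dump(obj):
    """Render the object as readable text, much like printing an s-expression."""
    return pprint.pformat(obj)

def load(text):
    """Read the text back into live objects; literal_eval accepts only literals."""
    return ast.literal_eval(text)

text = dump(catalog)
restored = load(text)
```

The text form can be eyeballed and hand-edited, which is exactly what makes it handy for prototyping and useless for transactions.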

There used to be a slew of 'lite duty' relational-influenced desktop databases that allowed for less formal management, viewing, and editing of tables. Paradox and dBASE come to mind. True, they had leaky languages bundled with them, but they made table handling a snap in a RAD sense. I miss those. The "Oraclization" of relational DB's pretty much killed them. (-Tablizer-)

-- JonUdell

None of you "get it".

One can worry about disks, caches, etc. and still have a system of objects that are never explicitly "saved" and "restored".

Look at GemStone.

Can you say more? I don't "get" your objection. -- JimCoplien

Let me guess (I am not atropos..., but I think I identified with his comment): As I recall from looking at GemStone (long ago), objects were all handled as if they were main memory variables, with only some minor notation identifying those that should persist. Another example was Mumps, where only some identifier name convention (an asterisk?) differentiated persistent from transient variables. (Or did I miss the point?)

I think the point being made was similar to one made in AboutFace. The gist is that "save" is a very unnatural operation for most users.

They do some editing on the file and then try to move on. At which point, the program asks "do you want to save these changes?"

Cooper's point was that this is implicitly a "2 file" model (the file in memory and the file on disk) and that the user's natural model is "1 file" (opening a document is like taking a book down from a shelf - there's only one book). The user naturally thinks "what changes?"

I think something similar is true, even at the programming level. I am a "user" of the library or database and forcing me out of whatever I'm working on (to pay attention to things like when the database stores information) is one of those mental movements that leads to errors.

-- WilliamGrosso

Uhhh... I'm using GemStone *every day* and we *Still* have to worry very much about the "database". Yes, GemStone is fully integrated into Smalltalk, but things like transactions, concurrency, and local vs shared copies are still very much a problem.

P.S. Things are explicitly "saved" in GemStone, else there would be no need for a "commit".

-- KyleBrown

Database code is unnecessarily difficult, but it seems rather naive to think that we can get rid of a "save" or "commit" action. Persistence is not really the problem; there's no reason why "temporary" work can't be saved indefinitely. It's really a multi-user problem: when data is used by more than one person, there is always going to be some point when you want to publish your changes to the rest of the group. What would email be without a "send" button? How could development happen if all your typos became live immediately? A database is a place for public data, so the users need to know when they are publishing their work. -- BrianSlesinsky

Turn that on its head, Brian. We don't need to save our work. We need to be able to have multiple versions of our work, and be able to undo. 'Save' type actions should be limited to the 'publish' paradigm behind what you suggest - we shouldn't have to worry about it all the time.
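A minimal sketch of that split between "versions plus undo" and "publish", assuming a single user and an in-memory history. The `Document` class and its method names are hypothetical, not taken from any system mentioned on this page.

```python
class Document:
    """Every edit is kept as a version; nothing is ever explicitly 'saved'."""

    def __init__(self, text=""):
        self._versions = [text]   # full history, oldest first
        self._current = 0         # index of the version being worked on
        self.published = text     # what everyone else sees

    @property
    def text(self):
        return self._versions[self._current]

    def edit(self, new_text):
        # Editing after an undo discards the undone "future" versions.
        del self._versions[self._current + 1:]
        self._versions.append(new_text)
        self._current += 1

    def undo(self):
        if self._current > 0:
            self._current -= 1

    def publish(self):
        """The only deliberate act: make the current version visible to others."""
        self.published = self.text

doc = Document("draft")
doc.edit("draft with tpyo")
doc.edit("draft with typo fixed")
doc.undo()                 # back to the typo version
doc.undo()                 # back to "draft"
doc.edit("final draft")
doc.publish()
```

Typos never go live on their own here, which answers the email "send" objection, yet the author is never asked whether to keep the changes.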

For an example of a system which has taken these ideas to extremes and is better for it, see Eros (ErosOs). Eros is interesting in all sorts of ways. Take a look... -- BrianEwins

The current EROS system is still not a complete, runnable system. We have a group at Hopkins that hopes to cross this hurdle this summer. For now, we encourage you to download only if you have a research interest in running the code, you are a potential EROS developer, or you have some desire to read other people's kernel source code. Note in particular that while the system image that lives in eros/src/base/sysimage/ now builds, it still doesn't do anything useful at this time.

re: I am really feeling the need to say goodbye to databases. You are right on, PatrickLogan, and I think you are expressing a frustration I have had for years. Databases are valuable, if for no other reason than they force some level of data design and normalization. But they, like filesystems, are an artifact of implementation mechanics, not the problem domain. I was struck by it way back when I was doing APL programming, and forced to abandon my lovely (to me) multi-dimensional array oriented mashing of problem domain data when I ran out of my (16K, later 32K) workspace limit. I was sure that more than half of my code and effort was dealing with external (to the problem) file systems. What I really wanted was an unlimited workspace (much closer to available today, the MS Memory model is a good example). With the addition of mechanisms to support persistence and interface, share, and exchange data with the outside world, we could get back to staying focused on solving the problem at hand. -- JimRussell

How are databases "an artifact of implementation mechanics, not the problem domain"? In my ExBase days, I did things similar to what you seem to want APL to do and didn't have to worry about RAM because tables were auto-buffered. If they fitted in RAM, the disk was rarely read, and if they didn't they were buffered to disk. I didn't have to care. The only thing that relational and ExBase seem to really lack compared to APL is the easy swappability of rows and columns. However, usually dimensions are represented as column factors instead of an X/Y grid. Thus, problems are tackled differently. -- top

Related: ProgrammingWithoutRamDiskDichotomy

I like databases. They're lovely repositories for storing the things we need to have remembered for us. Particularly useful for those times when our applications aren't running and the information they're keeping track of for our users needs somewhere to live. They're also pretty useful for providing the ability to get some view of the information from outside the framework of our application.

What I really, really hate is the idea that the database is the important thing, and that we need to build our applications as subservient drones working in support of the Lord High Database. My great grand dream is that one day databases (or the mechanisms to persist our application information) will be smart enough to fade into the background, keeping track of things that need to be remembered, but making minimal demands upon my time as someone trying to build something useful to humans. Sort of like Katharine Hepburn in "Desk Set".

And one last thing... does it strike anyone that "relational" databases do, in fact, not represent information in the same manner as people, and that using them is a case of forcing people to accommodate themselves to the limitations of the machinery? -- ChrisGerrard

A number of people would suggest that "objects" don't represent information in the same manner as people. Paradigms are nasty like that. At least relations are sets, so it's a widely known & mathematically sound paradigm. -- StuCharlton

Nobody is stopping you all. Build one and make a pile of cash. -- RIH

I've lately found it fascinating how many experienced programmers, especially OO programmers, really have an almost emotional problem with RelationalDatabases. They get in your way, they expose you to fundamentally annoying problems. But it's also like we want to ignore fundamental physical and scientific problems about data integrity, associations, concurrency, recovery, etc.

I liken the distaste for databases to the distaste for unit testing. xUnit, XP, and TestInfected seem to have improved that somewhat - maybe we need a DatabaseInfected? movement. <ducking, running>.

Perhaps because it takes them away from the comfortably numb land of abstraction into the nitty gritties of algorithms, access paths, indexes, etc. That's the unfortunate thing about GemStone too - it gives you the illusion you can get away from that stuff, but in the end, you can't.

Certainly database products could learn from GemStone or any number of frameworks like the EnterpriseObjectFramework and do a better job of providing a LayeredAbstraction? to this stuff. Perhaps they'll actually implement the RelationalModel properly, and maybe provide a seamless mapping to a host language (with Oracle buying TopLink, this may be closer than you think).

The tinkerer/hacker in me loves RelationalDatabases because there is SO much to learn, monitor, and tweak. The BeautyIsOurBusiness part of me really beams when I read DrCodd's paper, explaining how simple and elegant the RelationalModel is. RDBMS give you a lot of power to get your job done - they're your friend, really. An OracleDatabase looks big, mean, and complex, but it's amazing how similar it is to GemStone in certain areas. This leads me to believe that perhaps the main problem OO people have isn't with the relational model or DBMS facilities, but with SQL. Interestingly enough, most relational theorists like ChrisDate also tend to hate SQL (SqlFlaws). I don't see a replacement on the horizon, and the SQL-1999 specification has grown to engulf one-third of the eastern seaboard. :sigh: -- StuCharlton

Seems to me that part of the problem with SQL is that it isn't a programming language, so it has a lot less flexibility. But complex data manipulation can be at the heart of what you're trying to do, so there's a bit of shoehorning going on. I worked at a company once where they were trying to do extremely complicated cross-tabbed statistical analysis reports with monstrous SQL statements, and almost every time someone wanted the most trivial change it was a fairly complicated undertaking to do it. When I was called in, the SQL was so gnarly that responsibility over it had been passed around like a hot potato.

My response was to rewrite the whole thing in Java, keeping my SQL statements as simple as possible.
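Roughly what "simple SQL, layout in the host language" can look like, sketched in Python rather than the Java mentioned above. The `sales` table and its rows are invented for the example. The query does one plain grouped aggregate and the cross-tab shape is assembled in application code.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "Q1", 100), ("east", "Q2", 150), ("west", "Q1", 80), ("east", "Q1", 20)],
)

# The SQL stays trivial: one grouped aggregate, no pivoting.
rows = conn.execute(
    "SELECT region, quarter, SUM(amount) FROM sales GROUP BY region, quarter"
).fetchall()

# The cross-tab layout is built in application code, where it is easy to change.
crosstab = defaultdict(dict)
for region, quarter, total in rows:
    crosstab[region][quarter] = total

quarters = sorted({q for _, q, _ in rows})
table = [[region] + [crosstab[region].get(q, 0) for q in quarters]
         for region in sorted(crosstab)]
```

Adding a column or changing the layout then touches only the few lines that build `table`, not the query.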

SQL is not supposed to do the entire app. You let the database do what it does best and the app do what it does best. It is about specialization. The more you do in your app instead of with the DB, the more you will reinvent things that DB's provide out-of-the-box (see list under DatabaseDefinition). But perhaps we need to simplify query languages and DB's in general: SimplifyingRdbms.

I'm one of those programmers that never understood why some folks liked databases so much, relational or otherwise. I just can't get excited about data for some reason. Even the talk about how the relational model is based on set theory etc. doesn't jazz me, and set theory usually jazzes me. Is something wrong with me? Is there a support group? How can I learn to care about databases? -- EricHodges

And I am one who does not get code-heavy approaches that appear to reinvent (poorly) the database in their app code. The more I can factor from code to data, the more I can use high-level relational manipulation to orchestrate new combinations and views rather than hand-code them. (Of course you can go too far with it also.) The same info factored to DB's and tables is generally far easier for me to navigate and view than the same info in code. You can't query code very well, and it is not as compact to the eyes as a data-fied version in my opinion. Perhaps it is subjective, I don't know. I just dig them. -- AnonymousDonor

I've never re-invented the database in my app code (unless you count writing a dbms as re-inventing), but I don't have any code that can be factored to data. In all of the domains I've worked in (check and credit transaction processing, UML modeling tool, portal life-cycle IDE, portal server, computer based training, banking stuff) the database was just a place to store the data when the code was done with it, or a place to fetch data from when the code needed it. -- EricHodges

Re "but I don't have any code that can be factored to data", often times that is a skill that is not readily "natural" to some people, I find. But, we would probably have to look at a specific example. Databases are about a lot more than just "storage". -- AnonymousDonor

OK, here are some example problems: How can these be refactored to data? -- EricHodges

I would have to look at detailed specifics. However, here are some suggestions: UML can be stored as data in a similar fashion that EntityRelationshipDiagrams can (see draft schema there, and note that I don't think too highly of UML). Regarding the portal, I have often data-tized web content and navigation (menu) info. Exam questions and scores can easily be stored in tables. If you want to pick one to look at more closely after specs are given, I would be happy to. Note that I did not suggest the *entire* application be tablized (although if we had a modern "file system", it could perhaps be). -- AnonymousDonor

But I did those things without refactoring any code to data. After the code converted Java source to a UML model, it stored it in a relational database. It loaded the UML model from the relational database before converting it into Java source. The exam questions and scores were stored in tables. The portal content and configuration was stored as EAR and XML files, but could have been persisted in a database without much trouble.

Imagine any specs you want for those problems. How can they be refactored to data?

-- EricHodges

Well, you admitted that you used a database. Whether there would be more that could be moved to the database or SQL would probably require a complete code review. -- AnonymousDonor

I use databases, but I don't get excited about them. They seem pedestrian. Mundane. Nothing to write home about. And I never "use high-level relational manipulation to orchestrate new combinations and views". The data was never central to what I do; the behavior was. Is it just me? Is it just the kind of work I've done? -- EricHodges

I guess I just have a "data head". Data-tized stuff is easier to grok and reason about for me because it is easier to represent, query, and alter my perspective of; so I tend to try to shift much more stuff to being data-driven (queries, ControlTables, etc.). There is no "behavior algebra" comparable to relational algebra that I know of, meaning that every developer is starting from scratch into the wild untamed wilderness if you make behavior central. That's the way I see it. -- AnonymousDonor
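To make "data-driven" a little more concrete, here is one possible reading of a control table, sketched in Python with SQLite. The `shipping_rules` table, its columns, and the fees are all made up for illustration. The rule lookup replaces what would otherwise be an if/elif chain, and the same rows can then be queried from other angles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipping_rules (region TEXT, min_weight REAL, max_weight REAL, fee REAL)"
)
conn.executemany(
    "INSERT INTO shipping_rules VALUES (?, ?, ?, ?)",
    [
        ("domestic", 0, 1, 5.0),
        ("domestic", 1, 10, 9.0),
        ("overseas", 0, 1, 15.0),
        ("overseas", 1, 10, 30.0),
    ],
)

def shipping_fee(region, weight):
    """Look the fee up instead of hard-coding an if/elif chain."""
    row = conn.execute(
        "SELECT fee FROM shipping_rules "
        "WHERE region = ? AND ? >= min_weight AND ? < max_weight",
        (region, weight, weight),
    ).fetchone()
    return row[0] if row else None

# Because the rules are data, they can be viewed from another perspective
# without writing new traversal code.
cheap = conn.execute(
    "SELECT region, max_weight FROM shipping_rules WHERE fee < 10 ORDER BY max_weight"
).fetchall()
```

Whether this counts as "refactoring code to data" or just "storing data" is, of course, the very thing being argued about here.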

How do you get paid to grok data? -- EricHodges

Ph.D in statistics?

OK, how does a software developer get paid to grok data? -- EricHodges

No idea. For the most part, that isn't what software developers do (or can do).

I thought this discussion (and this wiki) was set in the context of software development. -- EricHodges

And so it is. There are lots of things software developers (or most of them) can't or don't do. For the most part, non-trivial data-analysis is one of them. I was just responding to your question in an offhanded way; meaning that you (as a software developer) probably won't get paid to grok data.

And how does that make my world a more beautiful and groovy place? -- EricHodges

ItDepends. For example, if you define web forms using DataDictionary-like tables, does the developer get paid to fill those in, or somebody else? The answer probably depends on the shop or project size. -- AnonymousDonor

Filling in web forms isn't, in any remotely useful sense, characterizable as grokking data.

I am not sure what you mean. This is not filling in forms, but creating/defining them via meta data.
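A small sketch of what defining a form via metadata might mean. The field dictionary below is hypothetical and is held in a Python list for brevity; in the scenario described it would sit in DataDictionary-like tables.

```python
from html import escape

# Hypothetical data-dictionary rows; in practice these would live in a table.
FIELD_DICTIONARY = [
    {"name": "email", "label": "E-mail", "type": "text", "required": True},
    {"name": "age", "label": "Age", "type": "number", "required": False},
]

def render_form(fields, action="/submit"):
    """Generate the form markup from field metadata rather than hand-writing it."""
    lines = [f'<form method="post" action="{escape(action)}">']
    for f in fields:
        required = " required" if f["required"] else ""
        lines.append(
            f'  <label>{escape(f["label"])} '
            f'<input type="{escape(f["type"])}" name="{escape(f["name"])}"{required}></label>'
        )
    lines.append("</form>")
    return "\n".join(lines)

markup = render_form(FIELD_DICTIONARY)
```

Adding a field then means adding a row, and who fills in that row (developer or somebody else) is the shop-dependent question raised above.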

If most of what you want from a database is persistence, maybe you really want a PrevalenceLayer. They're typically simpler, cheaper, and much, much faster - see ThePrevayler for an open-source Java prevalence layer. -- GeorgePaci

"While we're at it let's get rid of networks too.". A PrevalenceLayer will also take care of transparent replication (fault-tolerance and load-balancing) for you. -- KlausWuestefeld

Prevayler rocks!
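For anyone who hasn't met the idea, the core of a prevalence layer can be sketched in a few lines. This is NOT Prevayler's API. It is a toy in Python, with an invented `PrevalentSystem` class, a single hypothetical "deposit" command, and no snapshots or replication. It only shows the basic move: keep everything in memory, journal each command before applying it, and replay the journal on restart.

```python
import json
import os
import tempfile

class PrevalentSystem:
    """All state lives in memory; every change is journaled before it is applied."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.accounts = {}                      # the in-memory "business objects"
        if os.path.exists(journal_path):        # recovery: replay the journal
            with open(journal_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, cmd):
        # Commands must be deterministic so that replay rebuilds the same state.
        account = cmd["account"]
        self.accounts[account] = self.accounts.get(account, 0) + cmd["amount"]

    def execute(self, cmd):
        if cmd.get("op") != "deposit":          # reject before journaling
            raise ValueError("unknown command")
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(cmd) + "\n")
            f.flush()
            os.fsync(f.fileno())                # make the command durable first
        self._apply(cmd)

path = os.path.join(tempfile.mkdtemp(), "journal.log")
system = PrevalentSystem(path)
system.execute({"op": "deposit", "account": "alice", "amount": 50})
system.execute({"op": "deposit", "account": "alice", "amount": 25})

# Simulate a restart: a fresh instance rebuilds its state from the journal.
recovered = PrevalentSystem(path)
```

Reads never touch the disk at all, which is where the speed claims come from. Concurrency control between users is not addressed by this sketch.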

Databases are very powerful, convenient tools if used right. I hate to reinvent their features from scratch in application code. That can be a lot of work and complexity, and is poor "reuse" IMO. I suspect that people who want to get rid of databases either don't understand them, or are control-freaks of a sort who don't want to give control of noun and state management to a database or perhaps a DBA. Perhaps I am stuck on one side of the ObjectRelationalPsychologicalMismatch divide, but you are going to have to pry good database systems/tools from my cold dead fingers.

''If all you're doing is simple CRUD operations, and you're not writing a financial/medical/etc system, why would you ever introduce all the complexity and overhead of a database? In my current app, I'm using Prevayler, I develop at probably twice the speed I did with a database, and having everything in memory frees me up to do all sorts of extra things I couldn't do when I had to deal with 50-100 milliseconds of db overhead per call. And my unit tests run almost instantaneously.

If you need data like this to ultimately reside in a database, write an extract program that runs as a nightly cron job. There's no reason a database needs to be around moment-to-moment - db constraints aren't an excuse, a well-written OO program shouldn't be relying on them for validation anyway.''

What about multi-user distributed systems?

Build them client-server, with Prevayler in the server part.

So you mean I should keep all objects in memory of the server and send them thru the wire to the clients when they need to access them? And I should implement all the transaction isolation stuff in the server myself? So to say, I should use Prevayler to implement my own OODBMS?

No. I mean you should build a client-server system, like you would if you had an OODBMS on the server, but without the OODBMS.

Why? The speed gain cannot be that compelling when the data have to go thru the network.

Many benchmarks on Prevayler show that it is orders of magnitude faster than databases on the same machine. So the server will be faster, and you'll need a network between clients and servers no matter what persistence solution you use.

Of course, it will be faster (if there is enough memory and no swapping is necessary), but will it be that much faster? When 90% of the performance is used by the network, then I can save max. 10%. The numbers may differ, but still I am not convinced to use Prevayler on the server.
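The arithmetic behind that objection is Amdahl's law. A quick check in Python, taking the poster's 90/10 split at face value and assuming (purely for illustration) that the persistence part becomes 100 times faster:

```python
def overall_speedup(fraction_improved, factor):
    """Amdahl's law: only `fraction_improved` of the total time gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)

# The hypothetical split from the comment above: 90% network, 10% persistence.
speedup_100x = overall_speedup(0.10, 100)       # persistence 100 times faster
speedup_limit = 1.0 / (1.0 - 0.10)              # persistence takes no time at all
```

With these numbers the overall speedup comes out at about 1.11 at best, however fast the persistence layer gets. That is the same thing as saving at most 10% of the response time. It bounds per-request latency only and says nothing about how many requests the server can handle, which is the point of the reply that follows.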

Are you saying that you don't need a fast server because networks are slow? Anyway, the simplicity and cost may be other things to consider.

Speed is good, simplicity is better, reliability even more. I would consider using Prevayler, but I suppose, after the consideration, I would choose a RDBMS. I don't know much about Prevayler, but from what I know, I suppose Prevayler is good on the client where no network access is needed, when all data fit into the memory and the data is used by one application only. The strengths are not so impressive when the data have to be shared among more clients or when there is a lot of data, and memory has to be swapped. Regarding simplicity: I think the normalized relational data model is pretty simple.

Simpler than no separate data model at all?

Yes. It is simpler because it is separated. It allows easier development of new applications which work with the same data. The persistent data usually lives much longer than the applications which work with the data. I often update applications, seldom databases or data formats.

What if you never need a new application that works with the same data? (see YouArentGonnaNeedIt)

If only I could predict the future ;-) But you are right: if I were sure I would always access the data through one application, then having no specified data format would indeed be simpler.

I'd phrase it from the opposite perspective. If I know that multiple applications are going to access the data I provide a mechanism for that.

Well, then you are creating a database of sorts from scratch when you do that. That is kind of anti-reuse IMO. In my domain one often ends up sharing info between apps and systems. It is the nature of the biz.

I said I'd provide a mechanism, not that I'd write it from scratch. If I know that multiple applications are going to access the data and they all understand relational databases then I'll use one of those.

I think I can understand the frustrations of the original poster. It seems we save far too much transitory information as persistent. Of course, once information is added to a database, it is far too important to delete, someone may need it some day! Instead, we add more and more complex filtering mechanisms to hide the data we don't want. Then we have to upgrade to bigger and faster machines to hold and filter through all of the data we don't want. I really wonder, for many of these systems, whether anyone would really notice if the database crashed today and we replaced it with an empty machine.

The programmers might not, but everyone else who got used to having information readily available to run the biz might. It is normal business practice to want to see the same piece of information in multiple perspectives. Non-relational techniques don't do this very well in my observation because they tend to hard-wire the earliest uses into the "shape" of the structure at the expense of later uses. They don't separate meaning from usage very well, and to work around this they just keep tacking on shanty-town pointer knots. -- top

This is really the heart of the question. How much transitory information do we save because someone, someday might want to see it? Do I really need returned checks from 7 years ago? Do I really need a credit card breakdown of every expenditure I made over the last year? Do I really need a listing of every telephone call I made in the last month?

Off Topic?

I am, thankfully, a psychologist now. I get paid simply to be online and accept input - far easier than programming ever was. But just to trigger your horror, I wrote a business management app (Point of Sale) in FileMaker for a 10 million dollar company.

MySQL fans wouldn't even consider it relational. There is an automatic save, however, of everything.

I charged about 125k for three years of part time work. There's no way in hell a mediocre (at best) programmer like me could have done it in C, Java or any of the exotic tools gurus see as an answer.

Considering technology tools without looking at the issue of how many people are smart enough to use it is completely insane and disconnected from the reality of how the social world works.

It processed about $40 million in four years, and lost only one out of several hundred thousand invoices over that time. It ran for a year after I was done, then they replaced it. In the last year it ran with ZERO downtime or maintenance, processing about 10 million in sales (tens of thousands of invoices) at zero cost.

But FileMaker is crap, right?

Not necessarily. If it's the right tool for the job, then it's the right tool for the job. There are numerous applications for which a lightweight desktop database is entirely appropriate. There are others, however, where only an industrial strength database, with its attendant infrastructure, will do. Sometimes you can drive a Volkswagen Beetle, sometimes you need a semi-trailer, and sometimes you need an airplane. Choose the right tool for the job.

Apparently, your environment deemed it tolerable to lose one invoice out of several hundred thousand. When I was developing health care records applications, an application that lost one record out of several hundred thousand could have injured or killed a patient. In that case, FileMaker would have been entirely inappropriate - not just because the need for reliability, transactions, auditability, and referential integrity made tools like FileMaker unsuitable, but because there were user interface and multi-user support requirements that simply couldn't be implemented, or would have been much more difficult, in something like FileMaker. In that context, mediocre programmers need not apply...

See Also: ObjectRelationalPsychologicalMismatch, YagniAndDatabases, AlgorithmsDealingWithMassiveData

Last edit: April 9, 2006