Auto Keys Versus Domain Keys Discussion

Some discussion in lack of arguments moved here from AutoKeysVersusDomainKeys.

What if the auto-numbers end up in a foreign key, in a table that is supposed to make references only to businesses with taxpayer_id?

{Wouldn't it be other table's responsibility to enforce such rule? Referential integrity rules would take care of the stray foreign key. Nor do we need to use primary key to enforce such rules. I agree that doing such may kill two birds with one stone, but is not change-friendly as a downside. Perhaps it is a YagNi issue?}

You can play what-if with any approach. What if you choose the wrong key? What if you use that key improperly?

Wasn't the auto-numbers supposed to be worry free, because they're dumb? Maybe they are too dumb, cause in this situation they can't enforce a straightforward rule. The what-if began with the hypothetical that IRS deals with businesses without taxpayer_id, which is a stretch already, and a proper argument has been made that in this case separate workflow, separate business rules and, voila, a separate table is worth considering.

You can't choose a wrong key unless the project sponsors tells you to, and then it would still be a good key by definition, because the customer is the ultimate custodian of business rules. How one goes about with elicitation and validation of requirements (business rules), is a separate and orthogonal discussion.

The fact the there will be errors and or changes is a given, as you indicate by playing what-if.

The likelihood of errors is greatly diminished by carefully designed information systems that choose to identify and enforce all the relevant business rules. That's why natural keys are important: they directly reflect a business rule in the domain. Surrogate keys are typically a side-effect of software implementations and they do not reflect an aspect of the business. That the business rules may change, and that information systems have to accommodate the changes is another fact of life.

Anyone who uses "elicitation and validation of requirements" can not talk about pretentious. It doesn't seem pretentious to me, nor was that the intent. Again, ironic, your assigning to me motives after you complained about the same. I'll pretend you were civil and consistent and try again.

FYI: Elicitation and validation of requirements is standard software engineering terminology.

Maybe I missed that RFC. It sounds pretentious to me. Try CiteSeer. http://citeseer.nj.nec.com/cs?q=requirements+elicitation&submit=Search+Documents&cs=1

I didn't see your particular phrase used.

The problem with domain keys is they assume a representation in both type, specific attributes, and number of attributes. Any of these can change for a number of different reasons. Few people building a system would ever consider these reasons orthogonal.

Do you mean to say that DatabaseIsRepresenterOfFacts? And facts have a representation in a simple logic such as the relational model, i.e. you can't have "abstract facts".

I agree strongly with DatabaseIsRepresenterOfFacts.

For a person we have a first name and last name that someone said was the primary key. Whoops, now we need a middle name. All X places we used these keys are now invalid. So why depend on a representation?

I guess that's a StrawMan that you use it to validate your theory, rather than a representative example. Nobody in his right mind would use (First, middle, Last) names as a candidate key.

Why not? It fits every criteria for a candidate key.

The most basic criteria for a candidate key is that it is dictated by a business rule, not by your imagination. If somebody told you that this was a business rules and you haven't clarified that it wasn't through the validation process, then you were not doing your job well.

And why isn't it possible that a person's name is the correct business rule?

If you mean "that a person's name is a candidate key is the correct business rule", than I respond to you that no business person in his right mind would assume that he/she will never have to deal with two distinct people having the same name, including, the middle and Jr, Sr, etc. This is what makes it a Strawman, it is unreasonable to conceive that such an example can happen in normal practice.

What you consider unreasonable is outside the bounds of relational theory. To depend on what you can't conceive is not grounds on which anyone to make engineering decisions. I have indeed seen the name used as the primary key. I imagine people here have seen far worse. You are making a lot of statements about the real world and are not sticking to theory.

Are we entering the self covering closed system where every choice made is something you say is something stupid and obviously incorrect? That if you were only smarter and better read and more experienced you would realize the truth of what I am revealing to you? If you say so, can't help you much more.

Yet the rule of depending on person is valid. What is risky is relying on a representation. This is why I said you have confused rules with representation.

What rule is valid?? Can you spell it out for me?

Unlikely.

People have experienced problems so they use the auto key approach.

Oh, how I love argument by the people! FYI smarter people have been detached from the folklore about auto-number, with excellent results.

Using the auto key is an extra step because using domain keys is easier because they already exist. To go through the extra effort requires sufficient motivation.

If the crowd is doing it, then the crowd must be right. Or it requires you sufficient lack of imagination to miss the more plausible explanations, like the crowd didn't know better. Or couldn't do better at the time.

Cased closed when we add in the very pragmatic arguments around size, speed, query complexity, etc.

What about size, speed, and complexity? You haven't made any case about size, speed and query complexity. Actually if you knew something about databases systems, more than the folklore you try to pass as knowledge, then you'd have known that often times using surrogate keys decreases performance. Well, in other situations the reverse is true, but your should temper your rush to generalizing instead of coming up with real and preferably concrete arguments. -- CostinCozianu

Your consistent appeals to authority without specifics is tiresome. "Often times" is a very convincing argument, especially when preceded by such puffery.

For example inserting in a table with one key is always faster than inserting in a table with two candidate keys. So is deletion. That's specific and a fact. Your case "Case closed when we add in the very pragmatic arguments around size, speed, query complexity" is hand waving.

Again with the irony. But your point?

"Often times", means often enough to be an important case to consider, while you implicitly dismissed it with your "case closed" statement, which suggests that you lack some in depth understanding of both the database theory and implementations. That's I qualified your unnerving rush to judgement as trying to pass folklore as knowledge. What's your stake in here, especially that you post anonymously? And why you get offended by your lack of knowledge especially when you post anonymously and repeatedly have tried to insult and dismiss? If somebody would post anonymously that 2+2=3, should we make special provision for their feelings? Grow up, acknowledge that CriticsAreYourBestFriends, and try to read more before you post, and before you judge. And learn to accept when you're wrong rather than hide behind the bushes. You sound like a person trying more to defend his prejudices than trying to have an exchange of ideas, and maybe learn something in the process.

Grow up. Realize that you are everything you accuse others of and worse.

Please note, I haven't pronounced that surrogates are always bad or anything like it, but I contend that a decision needs to be balanced against the advantages/drawbacks of each option, according to a specific situation. I'll complete the exposition when I'll have time, and not having to respond to obnoxious "I know it all", "Costin argues from theory only", can only help. Also not having to respond to StrawMan arguments. But you guys try to defend the idea that one should always use surrogates and that has no support in either in the scientific literature or in current state of the art practice, but you defend such non-sense as if your lives depended on it.


I'm in complete agreement with this guy, using natural domain key's simply isn't a pragmatic approach and causes far too many problems in the real world. Generated keys are the only currently practical approach that solves all the real world problems associated with natural keys. If a database says SSN needs to be unique, then I'll put a unique constraint on it, problem solved, but using it as a key is unthinkable.
There is nothing sacred about the pre-existing domain identifiers. If the software used by the business needs an identifier that didn't exist before the software then it should be free to do so. Businesses made up part numbers and supplier numbers to satisfy the same sort of requirements the software needs to satisfy, they just did it earlier. It's silly to let the business create new identifiers for paper but not for bits.

Indeed, there's nothing sacred about exiting domain identifiers, but it should be business needs that drive the creation of new identifiers, and not the fact that software engineers didn't know better than adding an extra auto-number for each table. If the business model needs an extra-identifier to be introduced in the process, than, by all means, it's preferable to be generated by a software system. But redundant identifiers are very seldom a good idea. If the business has a part_number, why in the world does the software developer need an extra "part_number_oid"?

If the business has a part number, the software should use it. I thought the argument was against the creation of identifiers not already established in the business domain.

[The whole point is that those business needs vary too much too often. Software has needs too, stability being a major one, and the oid solves the problem of the ever changing business requirements by allowing the software to work with less change when those business needs change. The software is more flexible and reliable with oid key's than with business keys, I have yet to see a real project where this wasn't true.]

If software works without change when business requirements change, it follows that software doesn't cover all business rules, but a subset of them. As it happened in the hypothetical example presented above with business being identified to IRS by an auto-number rather than the taxpayer_id. It's fine for the software, it's less than perfect with the business users, who can mistakenly do operations with businesses without taxpayer_id, while an important integrity constraint cannot be enforced in the database schema.

[Fine, amended, it allows easier change. You make the flawed assumption that no constraint exists, there would still be constraints on natural key's to enforce proper business rules. The OID's just allow those rules to change easier by not requiring massive key changes and database updates, whenever a rule changes.]

That's not even a valid comparison. First, it's not far fetched, when using business values as the key, it's a fact that when the key name changes the updates need done everywhere. Some databases allow cascading updates, but that doesn't change the fact they have to be done. I never said it was a performance issue, but it does cause more problems in the code base than OID's do. While you may not agree, many many people have found this to be true and many people use IOD's all the time for that very reason.

Again, argument by the people. Smarter and more experienced people have found the reverse to be true.

Secondly, how many database pages get updated per update is purely an implementation detail and has no effect on the codebase beyond an extra column in the update statement, so that comparison doesn't hold. Quite simply, OID's have many benefits and few if any drawbacks in a practical sense.

Like what?

Your opinion on the subject makes no difference, these are the practical facts in many peoples eyes and they use IOD's for these reasons.

What you present as facts, siree, is nothing but folklore. I have yet to see such a claim substantiated in a scientific publication. On the contrary, authors with both great practical experience and authors with more theoretical inclinations have not made such a recommendation in books on databases. If you want to continue your practice regardless what can be substantiated, and being prejudiced to using OIDs, then you may as well stay out of this discussion rather than perform repeated handwaving exercises.

When you add in that with IOD's many of these tasks are completely automated via a data mapping layer, it becomes obvious that practically speaking, IOD's can be of great benefit in reducing work and increasing productivity.

The "I have yet to see a real project ..." theory is not a good argument, especially when unsigned. FYI, at least Costin Cozianu, and Joe Celko beg to differ and claim that they've seen plenty.

[I didn't say I hadn't seen them, only that they aren't ever as flexible as OID systems I've seen. Oh, and unsigned has no bearing on anything at all, anonymous contributions are just as valuable, if not more than signed contributions. I'm not the one with the big ego that needs to see my name everywhere.]

When data quality is paramount, and loss of data integrity leads to business consequences, flexibility is not always a virtue. Of course that enforcing all business rules makes the system less flexible. Flexibility is what allows user to make mistake, and once they have that possibility, you can bet that they will.


As for signatures, they are important not for bragging rights, but because they encourage taking responsibility for what one writes. Especially when stuff on wiki can be thought as good engineering advice. It wouldn't hurt if contributors would read more before they post (to avoid AuthorsDontRead), and posting anonymously is often times very encouraging of "I'll throw my 2c regardless", attitude.

I'd be curious how many of my anonymous interlocutors have made their due diligence in reading the relevant literature on this issue, and how many are more on the "2c but with firm conviction". -- Costin

It wouldn't hurt if authors like you quit presuming superiority to those you have discussions with. Asking if we've read the literature is a childish backhanded way of saying we're ignorant of the subject. If you want to have a discussion, try not insulting others so much, it'll work better.

I can't insult people who don't even take responsibility for what they write. I'm not presuming anything, I'm just stating the obvious that some anonymous poster haven't done their due diligence, or otherwise, we wouldn't have ignorant superstitions presented as facts, with no effort made to substantiate. After the spectacular "case closed" statement, we progressed to:

" Quite simply, OID's have many benefits and few if any drawbacks in a practical sense."

You see, a typical phrase that communicates nothing but the unwillingness to reason, so it is nothing but pure ConversationalChaff. Oh, and the eternal argument by the people. Wiki is not a popularity contest where we vote what's our most sacred cow. I moved this discussion aside, for you to recover whatever you feel is worth recovering and present it as a decent argumentation.


I'd be curious how many of my anonymous interlocutors have made their due diligence in reading the relevant literature on this issue, and how many are more on the "2c but with firm conviction"

Again, argument by the people. Smarter and more experienced people have found the reverse to be true.

Oh, how I love argument by the people! FYI smarter people have been detached from the folklore about auto-number, with excellent results.

-- Costin

Notice a pattern here? Notice how you assume people haven't read. You assume people's experience is worth nothing. You assume people should accept on face value your testament of what "smarter and more experienced" think. You assume argument from authority is valid form of argument. And all the while you provide almost no facts or evidence.

Of course, the replies were an ironic response to a non-argument. I value people's experience when they can substantiate that as valid argumentation. Other than that, our industry if full of half-baked experiences.

I do have the claim that my experience or Joe Celko's experience, corroborated with a wealth of scientific literature and knowledge of database internal, is different from the so-called experience argued from anonymity and with no concrete examples or no concrete argumentation put forward.

People have given their arguments and reasons. It's clear that no form of argument that disagrees with you will ever make into the "valid argumentation" camp.

It's clear to me that no argument is to be considered automatically a valid form of argumentation just because it takes one side or the other. If you say to me "Quite simply, people's experience is X", it's empty bragging regardless of what X is. Who are you, and how have you come to speak in the name of the people?

You have a wide range of tactics you use. It would help if you provided links to this wealth of evidence. Your one link to a usenet posting was weak at best.

That one was good enough to refute the "always use OIDs" position. For more you have to pay and read that guy's books. Or for free, you can have the patience to collaborate with me on this wiki, and even better, you can make some effort to substantiate your position better, and I'll help you clarify a few things. Is "argument by the people" from anonymity the best you can do?

You have a wide range of tactics you use. It would help if you provided links to this wealth of evidence. Your one link to a usenet posting was weak at best. But I know, if we all weren't so useless we would already know these references, have read them, and agreed with you. The mere fact I asked for references shows how unqualified I am for this discussion. And you are much too busy to provide such well known and obvious references. Yet you aren't too busy to continually respond.

Have you finished: FundamentalsOfObjectOrientedDatabases and Joe Celko's post? There are plenty of others that are available in printed books but not online, like in ChrisDates IntroductionToDatabaseSystems?, in Joe Celko's "Data and Databases" that has a chapter on keys and identifiers. No book on database theory or implementation that I know of recommends or advocates the use of OIDs, and I know a few.

Not Data and Databases, but SQL for smarties was good. Also IntroductionToDatabaseSystems?, and many more. Now what?

So where in the database literature, have you been able to find an endorsement/pro-argumentation of what you seem to advocate "always add an OID (autonumber) column"?

Never did I say always, though someone else may have. And if I didn't find a pro OID argument in the literature that means I shouldn't use my own brain and make this decision?

An uninformed decision? Try to borrow "Data and Databases" from your local library. Read the chapter on keys.

I am uniformed if Ii hadn't read that book yet have read many others? What did we do before it was written? Perhaps you would care to summarize?

You are uninformed not because you haven't read a particular book, but because you are not aware a particular argument and point of view, and take a simplistic stance that OIDs are good. Many times they are counterproductive. Can you enumerate a bunch of representative situations when OIDs are not good?

If you are the same guy who decreed, "when considering size and performance issue, case closed", that was definitely uninformed because it contains a false assumption about "performance issues". And if you want me "to care to summarize", I'd very much care to summarize little by little as I find time to spare, but having to face uninformed, rushed and sometimes obnoxious interventions is not a big incentive.

Yet you are unable to even come up a with a few clarifying points. Your argument style has been to stick to procedural issues rather than specifics. The time argument is bogus. If you actually knew the points you could have wrote them down as easily as the above unhelpful text. The only logical conclusion is you don't know.

That just shows how illogical you are. Writing good contributions takes a hell of a time, unlike having to deal with obnoxious guys who try to promote their ignorance as knowledge, and instead of taking the responsibility for what they write, they prefer attacking others from anonymity. It's true that even if it takes little time to respond to guys like you, it eats from my time and mood to contribute to the real issues. But it is completely idiotic for you, who probably don't know a quarter from what you ought to have known before contributing, to attack me and make demands and/or judgements about the pace that I choose to write my contributions. You've exposed your ideas anyways: "OIDs are good", the "experience of the people says that it is so", now you can sit back and relax until I'll damn please to spoonfeed you with the knowledge you ought to have acquired by careful research (including buying books, and investing time and mental effort to study) and not by waiting for others to give it for free on wiki.

I guess that's why you like to stay at the meta level so much, so you don't have to write good contributions. That's a rather lot of text you wrote. I'll take even some bad contributions if they are specific. Personally I would like to talk about real issues. I would like to know why OIDs are ungood. But I haven't seen any specifics other than the usenet posting which wasn't very convincing. But I am just an idiot so I am probably unable to grasp the arguments, even after a lot of study. If you don't want to contribute for free then it would be best if you didn't post, the meta argumentation is a waste of time. There isn't a single issue where your useless "buy books study effort research" meta argument can't be used as a non-reply.


Re If the business has a part number, the software should use it. I thought the argument was against the creation of identifiers not already established in the business domain.

No, even if they have a part number... I wouldn't use it for the key. IOD for the key.. part number just a field with a unique constraint.

Perhaps there are good domain identifiers and poor identifiers. Consider:

Pros: all domain keys (keys already existing in the domain, whether natural or artificial) avoid redundancy, and avoid potentially costly mistake with regards to data integrity. If you already have an identifier, why create another one?

Have you analyzed the logical implications of having two identifiers, which one of the two identifier (the domain vs. the surrogate) really identifies the entity in the domain?


I haven't read this entire discussion, but I would like to suggest that some of it might be less acrimonious if a distinction were made between logical and physical schemas. In the logical schema, which takes into account only the business-dictated aspects of the database, there is definitely no need for artificial keys. Some (in fact most, perhaps all) of the arguments made for auto keys have to do with implementation issues - specifically, performance shortcomings, or inadequate tools to efficiently propagate changes. It can be perfectly legitimate to add surrogate keys for these reasons, but these should be considered changes to the physical schema, not the logical schema. Given this point of view, it's also perfectly OK, in fact recommended, to hide these details of the physical schema from the user.

Fabian Pascal has a chapter on this topic in his book "Practical Issues in Database Management", a book which I recommend. It's a slim volume with much good advice. -- DanMuller

Yes... I consider the IOD to be part of the physical schema, it's only necessary due to implementation issues, but it's almost always necessary. It is not at all necessary for the logical model.

Whether or not "OIDs" make sense for a specific project depends on the characteristics of the particular DBMS being used, the nature of the available domain keys compared to those characteristics, and your implementation priorities. To give a blanket recommendation either for or against them (like "almost always necessary") is simply wrong. However, a blanket recommendation to leave them out of logical schema designs is usually correct. The decision depends on physical implementation issues, and those vary quite a lot.

BTW, "OID" is a really lousy term for this thing. There are no objects in relational databases (at least in most current implementations), and there's already a common term for the higher-level construct that you're talking about - entities, used in the context of entity-relationship modelling. (And furthermore, if objects were integrated into databases, they should probably appear as values, not as rows - thus an OID wouldn't identify an object. But that's another whole discussion to be found elsewhere.) It is found in FundamentalsOfObjectOrientedDatabases, where it is argued that OID identification is not good enough for OODB. Also on this site is discussed on ObjectIdentity and RelationalHasNoObjectIdentity.

I just re-read Pascal's section on surrogate keys, and found that I misused the term somewhat. He contrasts surrogate keys with natural keys, and these are both, in fact, part of the logical schema. His criteria for adding surrogate keys is when they greatly enhance simplicity (compound keys can be cumbersome) and/or stability (where a natural key is expected to change frequently). Personally, I think they should be used sparingly.

Interestingly, Pascal has a side note discussing the need for a natural key to really provide uniqueness, and uses first, last and middle person names as an example of a bad choice of key for this reason.

-- DanMuller

Of course, surrogate keys are part of the logical schema, because they are exposed to the application logic, and app developer simply can't do without them. A detail in a physical level would never have to appear in application logic.

The simplicity argument goes like this: why would I have to write 3 = clauses in a join when the tabl;es are related through a 3 column keys, when I can use only one = if I was using a surrogate key? Well, you could always replace a join clause in application logic with selecting from a predefined view, and you're even better - with or without surrogate keys. -- Costin

Given that the purpose of normalization is not to minimize the number of columns, this argues more for a logical-level macro capability (to allow us to refer to a group using a simple text expansion) rather than physical-level semantic intervention. In an ideal world, we'd only use artificial keys when the natural-domain is not a canonical representation of the required identity. Characteristics of identity are that it is (a) immutable and (b) homogeneous (by which I mean that every instance (row) of the identified thing can use the same set of fields as its identifier)

That's an interesting thought, regarding a "macro" capability. DateAndDarwen, in TheThirdManifesto, recommend the ability to define a column domain as a tuple type (tuples being analogous to rows), which almost allows you to group the fields of a composite key into a convenient package. I say "almost" because tuples can be compared for equality, but not for relative order, since there is no precedence defined for the attributes (fields) within a tuple. (But then again, strictly speaking, keys don't need to be orderable, just equality-comparable.) With a flexible user-definable type system, which commercial products generally don't have, it would be easy to define an atomic data type that encapsulates the multiple parts of a natural key. (This wouldn't always be an appropriate approach.)

Generally, I find the "simplicity" argument to be weak, but not easily dismissible on practical grounds, because of the shortcomings of SQL and commercial relational database products.

The author who said, above, "the app developer simply can't do without them" is simply wrong. (An absolute statement evokes an absolute judgement.) Given the availability of stable natural keys, the question of adding a surrogate key comes down to a matter of weighing conveniences and inconveniences against one another. It's not a matter of necessity. -- DanMuller

Necessity is always hard to argue because in the end nothing is necessary. Would "often useful" be acceptable?

Yes.


Here is a real example of domain keys creating problems. I was working on an equipment inventory system. Some items were to be uniquely identified by a compound key: serial number plus "tag number". Tag-number came from an internal sticker that had an ID number on it. Some items only had a serial number and some only had a tag number and some had both (or at least we assumed some would).

I pointed out that serial and tag numbers may sometimes be mistyped. Further, serial numbers were not guaranteed to be unique across vendors. The second issue was judged to be too rare to worry about. But the solution proposed to the mistyped numbers was to delete the item and then type in a "new" one with the right numbers. Aside from being a pain, it created the possibility of dangling or "split" references for historical info. I proposed a single generated key, but was rebuffed. Apparently the shop was "used to" similar problems because relational databases were relatively new and compound keys were common in the pre-relational legacy system. They thus "expected" such problems. Odd. It was a case where BeingFamiliarMoreImportantThanBeingRight?.

Using a surrogate key instead of compound keys can be good OnceAndOnlyOnce. Here is an example that shows how a surrogate key can simplify the schema:

Compound key:

 tableQ: foo1, foo2, key1, key2
 tableR: foo3, ..., fkey1, fkey2
 tableS: foo4, ..., fkey1, fkey2
 tableT: foo5, ..., fkey1, fkey2
 tableU: foo6, ..., fkey1, fkey2

With surrogate key:

 tableQ: foo1, foo2, key1, key2, ID
 tableR: foo3, ..., fID
 tableS: foo4, ..., fID
 tableT: foo5, ..., fID
 tableU: foo6, ..., fID
The second schema has 3 fewer columns.

-- top

The DB systems I've worked with support cascaded updates, which can usually be applied so as to eliminate the dangling reference problem. And here are a couple more thoughts that haven't been mentioned yet on this page:

-- DanMuller

The schema simplification described above was mostly considering schema simplification rather than performance issues. As far as cascading updates, sure it is possible, but certainly more complicated. -- AnonymousDonor

What's complicated about cascading updates? Declare them once in the schema, and they take care of themselves. Far simpler than the alternatives, and simpler even than adding another field.

Performance issues are complex. By themselves, they are not an argument for or against using natural keys. Generally, don't optimize for performance before measuring performance.

-- DanMuller


Here is a real-life case of being burned by auto-numbers. Articles for a website were auto-numbered. If one area referenced an article, it did so by the reference number. There was a disk crash, and attempts to restore the database from backups sequentially reassigned the ID numbers. Gaps from prior deleted numbers were ignored. (The backup process does not track auto-assigned keys for some reason.) Thus, all the references were wrong upon restore. Controlling the ID assigment via explicit code may have reduced this problem. But, the alternative to auto-keys: hand-made keys, have their own problems. I blame this on the back-up and restore process designed by the DB vendor. --top

That is not an argument against automatically-generated surrogate keys. It is a possible weakness of sequential generation which does not apply to non-sequential generation (e.g. UUID, time-salted hashing). I would agree with your assessment that the failing was with the DB itself, insomuch as either or both of the backup and restore procedures were flawed in respectively not including or ignoring the generated key. I have seen methods use where a stored procedure is used to generate an ID (by any method necessary) and hooked by an ON INSERT trigger to at least render the process atomic.


CategoryDiscussion

EditText of this page (last edited January 5, 2014) or FindPage with title or text search