Is there a canonical source supporting “all-surrogates”?

https://softwareengineering.stackexchange.com/questions/204521

29-09-2020

Question

Background

The "all-PK-must-be-surrogates" approach is not present in Codd's Relational Model or any SQL Standard (ANSI, ISO or other).

The canonical books seem to avoid this restriction too.

Oracle's own data dictionary scheme uses natural keys in some tables and surrogate keys in other tables. I mention this because these people must know a thing or two about RDBMS design.

PPDM (Professional Petroleum Data Management Association) recommends the same thing the canonical books do:

Use surrogate keys as primary keys when:

- There are no natural or business keys.
- The natural or business keys are bad (they change often).
- The value of the natural or business key is not known at the time the record is inserted.
- A multicolumn natural key (usually several FKs) exceeds three columns, which makes joins too verbose.

Also, I have not found a canonical source that says natural keys need to be immutable. All I find is that they need to be very stable, i.e. they should change only on very rare occasions, if ever.

I mention PPDM because these people must know a thing or two about RDBMS design too.

The "all-surrogates" approach seems to originate in recommendations from some ORM frameworks.

It's true that the approach allows for rapid database modeling by not having to do much business analysis, but at the expense of maintainability and readability of the SQL code. Much provision is made for something that may or may not happen in the future (the natural PK changes, so we would have to use the RDBMS's cascade update functionality) at the expense of day-to-day tasks: having to join more tables in every query, and having to write code for importing data between databases, an otherwise very straightforward procedure (because PK collisions must be avoided and staging/equivalence tables created beforehand).
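To make the trade-off concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `country_*` and `city_*` tables are hypothetical, invented only for illustration. It shows the extra join a surrogate design needs to read a natural value, and the cascade update a natural-key design relies on when the key does change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
-- Natural-key design: the country code itself is the key.
CREATE TABLE country_nat (
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE city_nat (
    name         TEXT NOT NULL,
    country_code TEXT NOT NULL
        REFERENCES country_nat(code) ON UPDATE CASCADE,
    PRIMARY KEY (country_code, name)
);

-- Surrogate-key design: same data, but references are opaque integers.
CREATE TABLE country_sur (
    id   INTEGER PRIMARY KEY,
    code TEXT NOT NULL UNIQUE,
    name TEXT NOT NULL
);
CREATE TABLE city_sur (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES country_sur(id)
);

INSERT INTO country_nat VALUES ('UK', 'United Kingdom');
INSERT INTO city_nat    VALUES ('London', 'UK');
INSERT INTO country_sur VALUES (1, 'UK', 'United Kingdom');
INSERT INTO city_sur    VALUES (1, 'London', 1);
""")

# Natural key: the code is already stored in the child row, no join needed.
nat_codes = [row[0] for row in conn.execute(
    "SELECT country_code FROM city_nat WHERE name = 'London'")]

# Surrogate key: one extra join to recover the same value.
sur_codes = [row[0] for row in conn.execute(
    "SELECT c.code FROM city_sur AS ci "
    "JOIN country_sur AS c ON c.id = ci.country_id "
    "WHERE ci.name = 'London'")]

# The rare key change: ON UPDATE CASCADE rewrites the referencing rows.
conn.execute("UPDATE country_nat SET code = 'GB' WHERE code = 'UK'")
cascaded = [row[0] for row in conn.execute(
    "SELECT country_code FROM city_nat")]

print(nat_codes, sur_codes, cascaded)
```

Which cost dominates depends on how often the key really changes and how many queries need the natural value; the sketch only shows that both costs exist.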

Another argument is that indexes based on integers are faster, but that has to be supported with benchmarks. Obviously, long, variable-length varchars are not good for a PK. But indexes based on short, fixed-length varchars are almost as fast as integers.
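A benchmark of this kind could be sketched as follows. This is only an illustration on an in-memory SQLite database with hypothetical tables `t_int` and `t_code`. SQLite has no true fixed-length character type, so a zero-padded six-character string stands in for one, and the timings say nothing definitive about other engines, on-disk storage, or joins.

```python
import sqlite3
import timeit

N = 20_000  # number of rows per table (arbitrary choice for the sketch)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_int  (k INTEGER PRIMARY KEY, v INTEGER);
CREATE TABLE t_code (k TEXT PRIMARY KEY, v INTEGER) WITHOUT ROWID;
""")
conn.executemany("INSERT INTO t_int VALUES (?, ?)",
                 ((i, i) for i in range(N)))
conn.executemany("INSERT INTO t_code VALUES (?, ?)",
                 ((f"{i:06d}", i) for i in range(N)))
conn.commit()

# Look up the same subset of rows through each kind of key.
int_keys = list(range(0, N, 7))
code_keys = [f"{i:06d}" for i in int_keys]


def lookup(table, keys):
    """Fetch v for every key via the primary-key index and sum the results."""
    sql = f"SELECT v FROM {table} WHERE k = ?"
    return sum(conn.execute(sql, (k,)).fetchone()[0] for k in keys)


# Take the best of several runs to reduce noise.
t_int = min(timeit.repeat(lambda: lookup("t_int", int_keys),
                          number=1, repeat=5))
t_code = min(timeit.repeat(lambda: lookup("t_code", code_keys),
                           number=1, repeat=5))

print(f"integer key: {t_int:.4f}s   6-char key: {t_code:.4f}s")
```

To draw real conclusions, the same measurement would have to be repeated on the target RDBMS with realistic data volumes and query shapes.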

The questions

- Is there any canonical source that supports the "all-PK-must-be-surrogates" approach?

- Has Codd's relational model been superseded by a newer relational model?

Solution

"All PKs are surrogates" is not a very sound strategy at all and certainly not one that you are ever likely to find an "authoritative" source for.

Firstly, think about what is meant by "primary key" in this context. In the relational model there are no "primary" keys, meaning no one key that is fundamentally different from any other key of the same table. In principle all keys in a relational database can and do enjoy the same status and have the same features and function, except to the extent that the database designer chooses otherwise. The singling out of any one key in a table with multiple keys is therefore essentially arbitrary (that was the word used by E. F. Codd), subjective and purely psychological (the view of Chris Date, Codd's colleague and collaborator). Unless it is explained what distinction is being drawn between a "primary" key and any other key, it is therefore pretty meaningless and of no merit at all to assert that such a key "should" or "must" be anything.
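As a small illustration of keys having equal status, the sketch below (Python's `sqlite3`, with hypothetical `employee` and `laptop` tables) declares one key as `PRIMARY KEY` and another as `UNIQUE`. Both reject duplicates, and the foreign key references the key that is not labelled "primary".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE employee (
    badge_no INTEGER PRIMARY KEY,   -- the key that happens to be labelled "primary"
    email    TEXT NOT NULL UNIQUE,  -- an equally valid key of the same table
    name     TEXT NOT NULL
);
CREATE TABLE laptop (
    serial      TEXT PRIMARY KEY,
    owner_email TEXT NOT NULL REFERENCES employee(email)
);
INSERT INTO employee VALUES (1, 'ann@example.com', 'Ann');
INSERT INTO laptop   VALUES ('SN-001', 'ann@example.com');
""")


def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Duplicate value in the key declared PRIMARY KEY.
dup_primary = rejected(
    "INSERT INTO employee VALUES (1, 'bob@example.com', 'Bob')")
# Duplicate value in the key declared UNIQUE.
dup_unique = rejected(
    "INSERT INTO employee VALUES (2, 'ann@example.com', 'Bob')")
# Foreign key pointing at a non-existent value of the UNIQUE key.
dangling_fk = rejected(
    "INSERT INTO laptop VALUES ('SN-002', 'nobody@example.com')")

print(dup_primary, dup_unique, dangling_fk)
```

Logically the two keys behave the same way here; any difference lies in how a particular product stores or optimizes them.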

Secondly, the argument has very little to do with indexes, which are a physical storage feature. Keys are a logical matter, not a physical one, and there is no absolute reason to assume that the storage considerations of a "primary" key are or should be any different from those of other keys (see previous paragraph). We might reasonably assume that, whatever storage structures are used, the storage overhead will in some measure be greater with a surrogate key than with no such key, but as always the best answer here is "it depends". Storage decisions should be made on a case-by-case basis, and blanket rules are of very little help.

Thirdly, from a logical point of view the absolute requirement of a surrogate key makes very little sense. The requirement for a natural key is exactly the same with or without a surrogate. The need for information to be identifiable in the domain of discourse (i.e. with a natural key AKA "business key", "domain key") is the same. Yes, keys may need to be updated but then that's the nature of things sometimes. Adding a surrogate doesn't in itself necessarily make key updates easier to handle and sometimes it can make them harder.
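The point that a surrogate does not remove the need for a natural key can be sketched like this (Python's `sqlite3`; the `product_*` tables and the `sku` column are hypothetical). With only a surrogate key, the same business item can be stored twice; declaring the natural key as well prevents it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Surrogate key only: nothing identifies the product in business terms.
CREATE TABLE product_surrogate_only (
    id  INTEGER PRIMARY KEY,
    sku TEXT NOT NULL
);
-- Surrogate key plus the natural key it was meant to stand in for.
CREATE TABLE product_with_natural_key (
    id  INTEGER PRIMARY KEY,
    sku TEXT NOT NULL UNIQUE
);
""")

# Both inserts succeed: each row gets a fresh id, so the "key" never collides.
conn.execute("INSERT INTO product_surrogate_only (sku) VALUES ('AB-100')")
conn.execute("INSERT INTO product_surrogate_only (sku) VALUES ('AB-100')")
duplicates = conn.execute(
    "SELECT COUNT(*) FROM product_surrogate_only WHERE sku = 'AB-100'"
).fetchone()[0]

# With the natural key declared, the second insert is rejected.
conn.execute("INSERT INTO product_with_natural_key (sku) VALUES ('AB-100')")
try:
    conn.execute("INSERT INTO product_with_natural_key (sku) VALUES ('AB-100')")
    second_insert_rejected = False
except sqlite3.IntegrityError:
    second_insert_rejected = True

print(duplicates, second_insert_rejected)
```

In other words, the surrogate is an addition to the natural key, not a replacement for it.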

OTHER TIPS

Primary and Foreign Keys do not have to be readable. Their purpose is to maintain the internal relational structure of the database, not to be read by a human.

Naturally, if there is an appropriate natural key that will never change (I claim these are as rare as hen's teeth or four-leaf clovers, but...), you can use that, and some customers will make that one of their requirements.

But why add the additional complexity to a database system, for little appreciable benefit? Surrogate primary keys are system-generated, guaranteed to be unique, guaranteed to never change, and are the same data type for all tables. They will have the same reliable behavior under all circumstances.

If you're looking for a canonical resource that supports this practice, you won't find one. There are just as many designers on the other side of the aisle that will viciously defend their use of natural, composite keys with clustered indexes as primary keys, and all of the canonical resources say that it is the designer's choice.

Licensed under: CC-BY-SA with attribution