Broken referential integrity: What would Edgar Codd say?

https://stackoverflow.com/questions/651118

19-08-2019

Question

I'm trying to understand the rules of the relational model as originally defined by Edgar Codd in 1970.

Specifically, I'm interested in whether referential integrity is part of his relational model or not. I'll try to demonstrate with the following example (just to make this question pretty):

Customers

+------+------------
| Name | Address
|------+------------
| John | ....
| Mike | ....
| Kate | ....
+------+------------

Invoices

+------+------------
|  ID  | Customer
|------+------------
|   1  | John
|   2  | John
|   3  | Mary
+------+------------

Now, as you can see, we have one invoice whose customer (foreign key) is Mary. Would this violate his relational model? Would Edgar Codd look at this and say, gee, what the heck? Or would he say it's perfectly fine...

This is a theoretical question.

Solution

For a language to be considered relationally complete (a phrase coined by Codd), it must support a set of relational operators, known as a relational algebra. Note there is no one true relational algebra: Codd proposed the first one, but others have since refined and built upon it (e.g. The Third Manifesto), and I'm sure he would see this as right and proper.

Referential integrity is not a relational operator and therefore is not a requirement for relational completeness of a language. Whether referential integrity constraints are a useful or necessary feature of a DBMS is another matter.

OTHER TIPS

If there is no customer named Mary in the Customers table, then there is no referential integrity between the tables. Specifically, a foreign key refers to a non-existent primary key.
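To make the dangling reference concrete, here is a minimal sketch using Python's built-in sqlite3 module and the two tables from the question. No foreign key is declared, so the database accepts the Mary row, and a LEFT JOIN is one way to find invoices that point at no customer. The in-memory database and the placeholder addresses are assumptions for illustration only.

```python
import sqlite3

# In-memory database holding the question's two tables (no FK declared).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (Name TEXT PRIMARY KEY, Address TEXT);
    CREATE TABLE Invoices  (ID INTEGER PRIMARY KEY, Customer TEXT);
    INSERT INTO Customers VALUES ('John', '...'), ('Mike', '...'), ('Kate', '...');
    INSERT INTO Invoices  VALUES (1, 'John'), (2, 'John'), (3, 'Mary');
""")

# Invoices whose Customer value matches no Customers.Name.
orphans = conn.execute("""
    SELECT i.ID, i.Customer
    FROM Invoices AS i
    LEFT JOIN Customers AS c ON c.Name = i.Customer
    WHERE c.Name IS NULL
    ORDER BY i.ID
""").fetchall()

print(orphans)  # [(3, 'Mary')]
```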

Does this break the relational model? No. The relational model accounts for this situation (i.e. a lack of referential integrity), and it is an indication that there is a problem with the underlying data.

From "A Relational Model of Data for Large Shared Data Banks" by Edgar Codd (from Communications of the ACM, Volume 13, Number 6, June 1970):

It could be the case that the user intended to insert some other element into P - an element whose insertion would transform a consistent state into a consistent state. The point is that the system will normally have no way of resolving this question without interrogating its environment (perhaps the user who created the inconsistency).

So, it is assumed that there will be referential integrity issues and that they will need to be resolved by the user or the system via some programmatic method.

The Relational Model doesn't require referential integrity features to apply to every relational database - that would be absurd if such constraints weren't relevant or desired. Think of a club membership list consisting of name, address and membership number. There wouldn't necessarily be any use for RI constraints there, but it's still a relational database if the data is stored in the form of a relation.

Even Codd's 13 rules don't require that an RDBMS has to support the ability to create RI constraints. It's just that foreign keys are so useful that most RDBMSs are expected to have them.

I read the following as clearly stating that referential integrity is included in the relational model:

Two integrity rules apply to every relational database:

1 Entity integrity:
No mark of either type is permitted in any attribute which is a component of the primary key of a base relation

2 Referential integrity:
Let D be a domain from which one or more single-attribute primary keys draw their values. Let K be a foreign key which draws its values from domain D. Every unmarked value which occurs in K must also exist in the database as a value in the primary key of some base relation.

"Missing information (applicable and inapplicable) in relational databases," E. F. Codd, ACM SIGMOD Record, vol. 15, no. 4, pp. 53-78, 1986.

By "mark of either type" he is referring to an unknown value, for which we use NULL today. This paper suggested two different types of unknown values, one for "applicable but missing," and one for "inapplicable."

By "unmarked" he means not NULL.
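As a rough illustration of the quoted referential integrity rule, the sketch below checks that every unmarked foreign key value also occurs among the primary key values. It is a simplification under stated assumptions: Python's None stands in for both kinds of mark (the paper distinguishes two), and the function name dangling_values is made up for this example.

```python
def dangling_values(foreign_key_values, primary_key_values):
    """Return the unmarked foreign key values that match no primary key value.

    None stands in for a "mark" (missing value); marked values are skipped,
    since the rule only constrains unmarked values.
    """
    known = set(primary_key_values)
    return [v for v in foreign_key_values if v is not None and v not in known]


customer_names = ["John", "Mike", "Kate"]

# The question's Invoices.Customer column: 'Mary' breaks the rule.
print(dangling_values(["John", "John", "Mary"], customer_names))  # ['Mary']

# A marked (None) foreign key value does not break the rule.
print(dangling_values(["John", None], customer_names))  # []
```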

Re comment from @dportas: Indeed, you don't even need the referenced relation to be empty to make your argument. It can contain some rows, but since the A-mark in K cannot be said to be equal to any value that exists in that referenced relation, there's no way to say that the hypothetical missing value satisfies the constraint. Therefore allowing an A-mark must become an act of faith that once a value is supplied, it will satisfy the constraint, because otherwise the row would have been invalid from the moment it was inserted, and we'd have to support the concept of a retroactive constraint violation, which is senseless.

First you ask whether RI is part of the RM:

whether referential integrity is part of his relational model or not

Yes. From Codd's classic "Is your DBMS really relational?" Computerworld, October 14, 1985:

It is, however, vitally important to remember that the relational model includes three major parts: the structural part, the manipulative part and the integrity part -- a fact that is frequently and conveniently forgotten.

Rule 10: Integrity constraints specific to a particular relational data base must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.

But then you paraphrase it as a different and ambiguous question:

we have one invoice where customer (foreign key) is Mary. Would this violate his relational model?

If you mean: Does the RM allow a declared FK to be violated, i.e. not stopped by the DBMS?

No. That would be a DBMS that is letting you declare a FK constraint but isn't enforcing it. Such a DBMS is non-relational in that respect.
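The difference between declaring a FK and having it enforced can be seen with a small sketch using Python's built-in sqlite3 module. SQLite only enforces declared foreign keys on a connection where PRAGMA foreign_keys is ON, which makes it a convenient way to show both behaviours. The helper name insert_mary and the in-memory database are assumptions for illustration.

```python
import sqlite3


def insert_mary(enforce: bool) -> bool:
    """Try to insert an invoice for 'Mary'; return True if the DBMS accepts it."""
    conn = sqlite3.connect(":memory:")
    # SQLite checks declared foreign keys only when this pragma is ON.
    conn.execute("PRAGMA foreign_keys = ON" if enforce else "PRAGMA foreign_keys = OFF")
    conn.executescript("""
        CREATE TABLE Customers (Name TEXT PRIMARY KEY, Address TEXT);
        CREATE TABLE Invoices (
            ID INTEGER PRIMARY KEY,
            Customer TEXT REFERENCES Customers (Name)
        );
        INSERT INTO Customers VALUES ('John', '...');
    """)
    try:
        conn.execute("INSERT INTO Invoices VALUES (3, 'Mary')")
        return True
    except sqlite3.IntegrityError:
        # The declared FK was enforced and the row was rejected.
        return False
    finally:
        conn.close()


print(insert_mary(enforce=True))   # False: the dangling reference is rejected
print(insert_mary(enforce=False))  # True: declared but not enforced
```

The second call is the situation described above: a FK that is declared but not enforced.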

If you mean: Does the RM allow a business rule that says an Invoices Customer must also appear as a Customers Name (i.e. that all valid database states are like that, i.e. that there is a FK constraint from Invoices Customer to Customers Name) to go undeclared to the DBMS (e.g. via a FK declaration)?

Yes. But that's a bad design because it allows some invalid states.

I think that whether this is fine or not depends on your design.

An invoice should contain the data as it was at the moment the invoice was created or sent. As such, it would appear to need data that is related to customer data but is not directly a foreign key, especially if you are using a natural key.

For instance, suppose Mary Jones ordered something and was invoiced on May 31, 2010. On Sept 12, 2010, she changed her name to Mary Jones-Smith and moved to her husband's address. The invoice, being a picture in time, should retain the name Mary Jones and the original address it was sent to. It is best if it can retain a link to the current customer and her information as well (which is why I would have a customer ID in the customer table, as names change, and an FK of Customerid in the invoice table). But storing Mary Jones when Mary Jones no longer exists in the customer table is not only OK, it is necessary to have a trail of what actually happened.

Same thing with products and prices and invoices. You would not want the invoice to reflect the price now, but the price at the time of the invoice, even if that doesn't directly relate to what is there now. In this case the Product table might be more of a lookup table than a true parent-child relationship. If you store all the details of the product in the invoice detail table, then you don't need a foreign key to products; you only need it to look up active products at the time the order is placed. In fact, the model number on a past invoice may well no longer be in the products table if the vendor changed it or dropped the product entirely. But you wouldn't want to lose the data about which of those products were bought in the past.

On the other hand if the relationship requires the data to stay consistent with the current values, a formal foreign key is the best method.
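The Mary Jones design described above (a surrogate customer ID as the foreign key, plus a copy of the name and address as they were when the invoice was sent) could be sketched roughly as follows, again with Python's sqlite3. The column names (CustomerID, BilledName, BilledAddress, InvoiceDate) and the two addresses are hypothetical, chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the declared FK
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,   -- surrogate key: survives name changes
        Name       TEXT NOT NULL,
        Address    TEXT NOT NULL
    );
    CREATE TABLE Invoices (
        ID            INTEGER PRIMARY KEY,
        CustomerID    INTEGER NOT NULL REFERENCES Customers (CustomerID),
        BilledName    TEXT NOT NULL,      -- snapshot taken at invoice time
        BilledAddress TEXT NOT NULL,      -- snapshot taken at invoice time
        InvoiceDate   TEXT NOT NULL
    );
    INSERT INTO Customers VALUES (1, 'Mary Jones', '1 Old Street');
    INSERT INTO Invoices  VALUES (1, 1, 'Mary Jones', '1 Old Street', '2010-05-31');
""")

# Later the customer changes her name and address; the invoice row is untouched.
conn.execute(
    "UPDATE Customers SET Name = ?, Address = ? WHERE CustomerID = ?",
    ("Mary Jones-Smith", "2 New Road", 1),
)

row = conn.execute("""
    SELECT i.BilledName, c.Name
    FROM Invoices AS i
    JOIN Customers AS c ON c.CustomerID = i.CustomerID
    WHERE i.ID = 1
""").fetchone()

print(row)  # ('Mary Jones', 'Mary Jones-Smith')
```

The invoice keeps the name it was sent under, while the FK on CustomerID still leads to the current customer record.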

Licensed under: CC-BY-SA with attribution
