Question

I have about 15 years of software engineering experience, writing business software with relational databases, mainly SQL Server and Oracle. I've always been of the opinion that you should define relations in your database and let your database handle the relational integrity. And in my case, 99.9% of the time I use a unique identity column as a primary key.

Now that my opinion is out of the way, I'd like some advice on an application we have at work. It's a third party application that my application has to interface with. Both applications use SQL Server as a database. They have no interface for sending and retrieving data. Because of that, they've given me portions of their database schema and descriptions on how to save the data with SQL queries.

It was clear from the schema sent to me that they were not using relations, nor were they using identity values for primary keys in most cases. In subsequent communications, they pretty much said that identity columns are the exception, not the norm. They also said they handle all referential integrity in their code.

I'm pretty much convinced they have made a horrendous mistake. For instance, I have to save data to this database and I don't have any of their "referential integrity code" in my app. And if someone had to run some queries directly in the database, that also poses a problem. I think this is a maintenance nightmare.

Can someone make any reasonable arguments to objectively support the decision this vendor made? I'm having a hard time coming up with any good reason.


Solution

The reasons for such design decisions are often not technical ones, but organizational ones. I have seen this happen in the real world, in situations of the following kind:

  • At the time when the system is designed, there is only one application which has exclusive access to the database, so referential constraints and ID columns are not that important at the beginning.

  • The developers of the system are well-trained in application development, but less trained in database design.

  • The designers may have had some bad experiences with referential integrity constraints in the database, because such constraints usually make it hard to implement "exceptions in special cases" - such exceptions are much easier to handle on the application side.

    For example: "this column in table A must be a not-null foreign key ref to the ID column of table B" - and then someone says "oh, except when the isTemplate column in table A is true". To meet this kind of requirement on the DB side, one would need to replace the FK constraint with a more or less complex DB trigger.

  • The - normally very good - approach of "let's start with the simplest working solution first and improve later" is interpreted as "let's start without relationships in the DB first and add them later" (instead of "let's start with strict constraints and make them less strict later").
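For the isTemplate example above, a trigger is one option, but this particular exception can sometimes be expressed declaratively: make the foreign key nullable and add a CHECK constraint that only allows NULL for template rows. The following is a minimal sketch under assumptions, not the vendor's schema: SQLite (via Python's sqlite3 module) stands in for SQL Server, and the table names a and b, the columns is_template and b_id, and the helper try_insert are hypothetical. More complicated exceptions may still need a trigger or application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE b (id INTEGER PRIMARY KEY);
CREATE TABLE a (
    id          INTEGER PRIMARY KEY,
    is_template INTEGER NOT NULL DEFAULT 0,
    b_id        INTEGER REFERENCES b(id),          -- nullable foreign key
    CHECK (is_template = 1 OR b_id IS NOT NULL)    -- NULL only allowed for templates
);
INSERT INTO b (id) VALUES (1);
""")


def try_insert(is_template, b_id):
    """Hypothetical helper: report whether the database accepts the row."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO a (is_template, b_id) VALUES (?, ?)",
                (is_template, b_id),
            )
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


results = {
    "normal row, valid reference": try_insert(0, 1),
    "template row, no reference": try_insert(1, None),
    "normal row, no reference": try_insert(0, None),      # violates the CHECK
    "normal row, dangling reference": try_insert(0, 99),  # violates the FK
}
for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

The first two inserts are accepted and the last two are rejected, so the "except for templates" rule lives in the schema and applies to every writer, including third-party ones.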

Unfortunately, once a system is built without referential integrity and its first version goes into production, sooner or later it will contain a lot of data which could not have been inserted into the DB with some sensible constraints enabled.

Then it becomes hard to add those constraints to the DB afterwards. It is often easier for the devs to modify the application to make it handle this "low quality" data, so the data can stay unmodified in the database, than to initiate a "data improvement process", which would be necessary to fix the problems on the DB side but may require support from users and administrators.

Of course, when a system grows and is extended so that more than one application accesses the DB, especially more than one which writes data into it, there is surely a point where it would be better to have a more rigid DB design. But the later one tries to get there, the harder it becomes.
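A first step in such a "data improvement process" is usually to measure how much data would violate the constraint one wants to add. A rough sketch of that check is shown here, assuming a hypothetical customer/address pair of tables with no foreign key declared, and using SQLite through Python's sqlite3 module in place of SQL Server; the anti-join itself is plain SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical legacy schema: address.customer_id is meant to reference
# customer.id, but no foreign key was ever declared.
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE address  (id INTEGER PRIMARY KEY, customer_id INTEGER, city TEXT);
INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO address  VALUES (10, 1, 'Oslo'),
                            (11, 3, 'Bergen'),   -- customer 3 does not exist
                            (12, NULL, 'Paris'), -- no customer at all
                            (13, 7, 'Rome');     -- customer 7 does not exist
""")

# Anti-join: addresses whose customer_id matches no customer row.
orphans = conn.execute("""
    SELECT a.id, a.customer_id
    FROM address AS a
    LEFT JOIN customer AS c ON c.id = a.customer_id
    WHERE a.customer_id IS NOT NULL
      AND c.id IS NULL
    ORDER BY a.id
""").fetchall()

print(orphans)  # [(11, 3), (13, 7)]
```

Rows like these have to be repaired, deleted, or explicitly tolerated before a foreign key can be added, which is where the users and administrators come in.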

OTHER TIPS

This is a trend I'm seeing more and more across many industries - a flatter data structure that is far more sanguine about data redundancy.

Perhaps we shouldn't be too surprised about this considering:

  • Storage is cheap, CPUs are fast and memory is plentiful, meaning data redundancy is far less of a performance issue
  • The rise of LINQ in .NET languages meaning data slicing and dicing is far easier on the client side
  • The more relaxed attitude of modern languages towards the established dogma of the past, e.g. (from the Zen of Python) "Flat is better than nested"

I would have agreed with the points from Doc Brown's answer in full about five years ago, but that is really only a small part of it. A lot has changed since then - both in the software development and database arenas.

I recall a news story a few years back where the non-relational DB structure of Facebook was mentioned to a top guru in the RDBMS world who scoffed: "That will never work, and it will never scale". As we've seen since - it did and it does.

What's the real problem?

The real problem here is not the absence of referential integrity and other critical constraints: it's the fact that the database is used as an interface medium without ensuring proper validation of the data sent to the application.

Sooner or later, this will lead to inconsistencies:

  • Database constraints could help to prevent such a situation by ensuring a minimum of consistency when the data is added.
  • A better approach would be for the application to offer some interface (e.g. API or file based) to acquire the data and sanitize it before using it.
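The second option can be pictured as a single entry point that every external writer has to go through. The sketch below is only an illustration under assumptions: the customer and address tables and the add_address function are hypothetical, and SQLite (via Python's sqlite3 module) stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE address  (id INTEGER PRIMARY KEY,
                       customer_id INTEGER NOT NULL,
                       city TEXT NOT NULL);
INSERT INTO customer (id, name) VALUES (1, 'Alice');
""")


def add_address(conn, customer_id, city):
    """Hypothetical entry point: sanitize and validate, then write."""
    city = (city or "").strip()
    if not city:
        raise ValueError("city must not be empty")
    with conn:  # commit on success, roll back if anything below raises
        known = conn.execute(
            "SELECT 1 FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        if known is None:
            raise ValueError(f"unknown customer {customer_id}")
        cur = conn.execute(
            "INSERT INTO address (customer_id, city) VALUES (?, ?)",
            (customer_id, city),
        )
        return cur.lastrowid


new_id = add_address(conn, 1, "  Oslo ")  # accepted, whitespace trimmed
try:
    add_address(conn, 42, "Bergen")       # customer 42 does not exist
    outcome = "accepted"
except ValueError:
    outcome = "rejected"
print(new_id, outcome)
```

A check-then-insert like this is not a full substitute for a constraint when several writers run concurrently, which is one reason the first bullet (database constraints as a minimum safety net) remains useful even behind an API.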

Are there historical arguments that explain absence of referential integrity?

First of all, I must admit that I totally share your opinions. Not only will referential integrity strengthen the reliability of the data, but in addition it will provide additional information to the optimizer for doing its job.

This being said, teams may be misled by historical arguments against referential integrity:

  • I started to develop with Oracle 5 in 1986. At that time there was a table locking scheme for any update, and there was no referential integrity until Oracle 7 in 1992.
  • For a certain time, referential integrity was implemented in a couple of DBMS at the expense of performance (I remember for example one case where extra locks on foreign keys reduced concurrency).
  • Autonumbered columns were vendor-specific extensions before they became part of the SQL standard in 2003. This made the use of identity columns a delicate thing (e.g. query, update, fail, retry). Therefore natural (meaningful) keys were preferred whenever possible. For example, an order ID was almost always managed as a unique ID, but order line items rarely had their own unique ID; a combined ID (order ID + line number) was preferred because it was simpler to implement. Or you had a combined key of device ID + timestamp instead of a unique ID for every measurement recorded by the device. Why does this matter? Because in such a context, referential integrity made changing natural keys very difficult (unless your DB supported cascading changes).
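To make the last point concrete, here is a small sketch of an order-line table keyed by order number plus line number, with a foreign key that cascades key changes. It assumes hypothetical orders and order_line tables and uses SQLite through Python's sqlite3 module rather than any of the DBMS mentioned above; whether ON UPDATE CASCADE is available depends on the product and version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE orders (order_no TEXT PRIMARY KEY);   -- natural (meaningful) key
CREATE TABLE order_line (
    order_no TEXT    NOT NULL
             REFERENCES orders(order_no) ON UPDATE CASCADE,
    line_no  INTEGER NOT NULL,
    item     TEXT    NOT NULL,
    PRIMARY KEY (order_no, line_no)                -- combined key, no surrogate ID
);
INSERT INTO orders VALUES ('A-100');
INSERT INTO order_line VALUES ('A-100', 1, 'bolt'), ('A-100', 2, 'nut');
""")

# Renumber the order: the cascade rewrites the child rows as well.
with conn:
    conn.execute("UPDATE orders SET order_no = 'A-200' WHERE order_no = 'A-100'")

lines = conn.execute(
    "SELECT order_no, line_no, item FROM order_line ORDER BY line_no"
).fetchall()
print(lines)  # [('A-200', 1, 'bolt'), ('A-200', 2, 'nut')]
```

Without the ON UPDATE CASCADE clause, the same UPDATE would be rejected as long as order lines still reference the old key, which is the difficulty described above.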

So the question is not "AS400 or not", but rather what experience the team carried over from earlier times.

Are there still valid reasons for not having referential integrity?

These historical reasons should not lead you to think that the absence of referential integrity is necessarily a sign of obsolete practice!

There are several valid reasons for not using referential integrity:

  • Database independence: document-oriented databases (MongoDB etc.) do not provide referential integrity, since they favor less structured data sets. So if you develop apps that are intended to use such a DBMS, or that want to ensure database independence, you will need to take care of integrity in your application, and you will not really rely on the DB level for it.
  • DDD and value objects: value objects differ from entities in that they do not have an identity. A value object is solely identified by the value of its attributes. Translated into SQL, this means that you do not use a unique ID to identify a value object. Referential integrity is then less obvious, so you may well start without it.

I often skip referential integrity in the DB, i.e. FK constraints.

But I always have unique PKs.

So in my case it's more like a lack of strict relationships overall. Sure you can have a customer address with no customer - knock yourself out. Everything will work except the bits that don't.

Plus I hide the DB behind an API so any relationships that need enforcing can be enforced.

This works well with distributed data where the constraint can't be enforced atomically in any case and you have to deal with a potential error.

Your vendor has gone a step further by not having IDs, but let's be generous and assume there are unique composite keys. In theory that works just as well.

But then they also allow you direct access bypassing an API. Even if they had strict constraints at the DB level, this is still suspect. You can't enforce all the rules in the DB so you are bound to be able to break it if you really try.

If it's the lack of API that's the cardinal sin in this case, I think you have to put yourself under the magnifying glass for requesting this low level access to third party software.

Rather than being intrinsically bad, it sounds like this software simply lacks the features you require.

Licensed under: CC-BY-SA with attribution