Do these specific tables need surrogate keys?

https://softwareengineering.stackexchange.com/questions/204340

29-09-2020

Question

Background

I have these tables:

+-------------------------+  +------------------------+
|Airport                  |  |Country                 |
|-------------------------|  |------------------------|
|airport_code string (PK) |  |country_code string (PK)|
|address string           |  |name string             |
|name  string             |  +------------------------+
+-------------------------+

+-------------------------+
|Currency                 |
|-------------------------|
|currency_code string (PK)|
|name string              |
+-------------------------+

airport_code is the IATA (International Air Transport Association) airport code; you can see these codes on your luggage tags when you travel by plane.

country_code is the ISO 3166-1 alpha-3 standard country code; you can see these codes at the Olympics.

currency_code is the ISO 4217 standard three-character currency code; you can see these codes on international currency exchange display boards.

Questions

Are these natural PKs good enough?

Is using world-respected standards, accepted by whole industries, good enough for PKs?

Do these tables need surrogates no matter what?

Solution

No, they don't. Those keys are definitely good enough!

They're unique, rarely going to change, and meaningful, which is a step up over a surrogate key. That's pretty much the definition of a good PK.

The restrictions about PKs being immutable and numeric-integer are not part of the Relational Model (Codd's) or any SQL standard (ANSI or other).
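To make the "meaningful" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `flight` table and its columns are hypothetical (they are not part of the question) and only exist to show a table that references the natural key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE airport (
    airport_code TEXT PRIMARY KEY CHECK (length(airport_code) = 3),
    name         TEXT NOT NULL,
    address      TEXT
);
-- Hypothetical referencing table, assumed for illustration only.
CREATE TABLE flight (
    flight_id   INTEGER PRIMARY KEY,
    origin      TEXT NOT NULL REFERENCES airport (airport_code),
    destination TEXT NOT NULL REFERENCES airport (airport_code)
);
INSERT INTO airport (airport_code, name) VALUES
    ('ORD', 'Chicago O''Hare International Airport'),
    ('LAX', 'Los Angeles International Airport');
INSERT INTO flight (origin, destination) VALUES ('ORD', 'LAX');
""")

# The referencing row is readable on its own: no join is needed to see
# which airports the flight connects.
natural_rows = conn.execute("SELECT origin, destination FROM flight").fetchall()
print(natural_rows)
```

With a natural key, the foreign key columns already carry the value a person would look for, which is the main practical benefit being claimed here.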

OTHER TIPS

I think "need" is a very strong word, and in a strict sense, the tables probably do not need surrogate keys.

However, if it were my database, I would probably add surrogate keys anyway. I may not necessarily want my database design to depend on a bunch of third parties (IATA, ISO), regardless of how stable their standards are. Or, I may not want to depend on a particular standard at all (are there other currency code standards? I don't know). I would probably model my tables with surrogate keys like so:

+-------------------------+  +------------------------+
|Airport                  |  |Country                 |
|-------------------------|  |------------------------|
|airport_id       int (PK)|  |country_id     int (PK) |
|iata_airport_code string |  |iso_country_code string |
|icao_airport_code string |  +------------------------+
|faa_identifier    string |  
|address           string |  
|name              string |  
+-------------------------+

+-------------------------+
|Currency                 |
|-------------------------|
|currency_id int (PK)     |
|iso_currency_code string |
|name string              |
+-------------------------+

In other words, unless those industry standard codes are inherently important to my application, I wouldn't use them as the PK of my tables. They're just labels. Most of my other tables will probably have surrogate keys anyway, and this setup would add consistency to my data model. The cost of 'adding' the surrogate keys is minimal.
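As a rough sketch of the Airport table above (again with Python's sqlite3; the UNIQUE constraints are my assumption, since the diagram does not show them), the third-party codes can stay unique while no longer being the primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE airport (
    airport_id        INTEGER PRIMARY KEY,  -- surrogate key, no outside meaning
    iata_airport_code TEXT UNIQUE,          -- third-party codes kept as alternate keys
    icao_airport_code TEXT UNIQUE,
    faa_identifier    TEXT,
    address           TEXT,
    name              TEXT NOT NULL
);
INSERT INTO airport (iata_airport_code, icao_airport_code, name) VALUES
    ('ORD', 'KORD', 'Chicago O''Hare International Airport'),
    ('LAX', 'KLAX', 'Los Angeles International Airport');
""")

# Either code resolves to the same surrogate id.
by_iata = conn.execute(
    "SELECT airport_id FROM airport WHERE iata_airport_code = ?", ("ORD",)
).fetchone()[0]
by_icao = conn.execute(
    "SELECT airport_id FROM airport WHERE icao_airport_code = ?", ("KORD",)
).fetchone()[0]

# The UNIQUE constraint still rejects a duplicate code.
try:
    conn.execute(
        "INSERT INTO airport (iata_airport_code, name) VALUES ('ORD', 'Duplicate')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(by_iata == by_icao, duplicate_rejected)
```

The point of the design is that other tables reference `airport_id` only, so which code standard the application favours can change without touching any foreign key.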

Update based on some of the comments:

Without knowing the context of the example tables, it's impossible to know how important things like IATA Airport Codes are to the application using the database. Obviously, if IATA codes are centrally important to and used pervasively throughout the application, it might be the correct decision, after proper analysis, to use the codes as the PK of the table.

However, if the table is just a lookup table that's used in a few corners of the app, the relative importance of the IATA codes may not justify such a prominent spot in the database infrastructure. Sure, you may have to make an additional join in a few queries here and there, but that effort might be trivial in comparison to the effort it would take to do the research to ensure that you fully understand the implications of making the IATA codes the primary key field. In some cases, not only do I not care, but I don't want to have to care about the IATA codes. @James Snell's comment below is a perfect example of something I might not want to have to worry about affecting the PK of my tables.

Also, consistency in design is important. If you have a database with dozens of tables that all have consistently designed surrogate keys, and then a few lookup tables that use 3rd party codes as PKs, that introduces an inconsistency. That's not altogether bad, but it requires extra attention in documentation and such that may not be warranted. They're lookup tables, for goodness' sake; just using a surrogate key for consistency is perfectly fine.

Update based on further research:

Ok, curiosity bit me and I decided to do some research on IATA airport codes for fun, starting with the links provided in the question.

As it turns out, the IATA codes are not as universal and authoritative as the question makes them out to be. According to this page:

Most countries use four-character ICAO codes, not IATA codes, in their official aeronautical publications.

In addition, IATA codes and ICAO codes are distinct from FAA Identifier codes, which are yet another way to identify airfields.

My point in bringing these up is not to begin a debate about which codes are better or more universal or more authoritative or more comprehensive, but to show exactly why designing your database structure around an arbitrary 3rd party identifier is not something I would choose to do, unless there were a specific business reason to do so.

In this case, I feel my database would be better structured, more stable, and more flexible by forgoing the IATA codes (or any 3rd party, potentially changeable code) as a primary key candidate and using a surrogate key instead. By doing so, I can avoid any potential pitfalls that might crop up due to the primary key selection.

While having surrogate keys on the fields is fine and there is nothing wrong with that, something to consider might be the index page size itself.

Since this is a relational database, you'll be doing a lot of joins, and having a surrogate key of a numerical type might make it easier on the database to handle, i.e. the index page size will be smaller and thus faster to search through. If this is a small project, it won't matter and you'll get by without any issues; however, the bigger the application gets, the more you'll want to reduce bottlenecks.

Having a BIGINT, INT, SMALLINT, TINYINT or whatever integer-like data type might save you some trouble down the road.

Just my 2 cents

UPDATE:

Small project - used by a few, perhaps even a few dozen people. Small scale, demo project, project for personal use, something to add to a portfolio when presenting your skills with no experience, and the like.

Large project - used by thousands, tens of thousands, millions of users daily. Something you'd build for a national / international company with a huge user base.

Usually what happens is that a select few of the records get selected often, and the server caches the results for fast access, but every now and then you need to access some less used record, at which point the server has to dip into the index page. (In the above example with the airport names, people often fly domestic routes, say Chicago -> Los Angeles, but how often do people fly from Boston -> Zimbabwe?)

If VARCHAR is used, the entries are not uniformly sized, unless the data is always the same length (at which point a CHAR value is more effective). This makes searching the index slower: a server already busy handling thousands and thousands of queries per second now has to spend time going through a non-uniform index, and do the same thing again on the joins (which are slower than regular selects on an un-optimized table; take data warehouses as an example, where there are as few joins as possible to speed up data retrieval). Also, if you use a UTF encoding, that can mess with the database engine as well (I've seen some cases).

Personally, from my own experience, a properly organized index can increase the speed of a join by ~70%, and doing a join on an integer column can speed up the join by as much as ~25% (depending on the data). As the main tables start to grow and these lookup tables get joined to them, would you rather have an integer column that occupies a few bytes, or a VARCHAR / CHAR field that occupies more space? It comes down to saving disk space, increasing performance, and the overall structure of a relational database.

Also, as James Snell mentioned:

Primary keys must also be immutable, something IATA airport codes are definitely not. They can be changed at the whim of the IATA.

So, taking this into consideration, would you rather update one record that is bound to a number, or update that one record plus all the records in every table that joins to it?
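The difference can be sketched with Python's sqlite3 module. Everything here is hypothetical: the `_nk` / `_sk` table names, the `flight` tables, and the placeholder codes 'AAA' and 'BBB' are made up for illustration, and the natural-key variant assumes the foreign key was declared with ON UPDATE CASCADE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for the cascade to fire in SQLite
conn.executescript("""
-- Natural-key variant: the code is copied into every referencing row.
CREATE TABLE airport_nk (airport_code TEXT PRIMARY KEY, name TEXT);
CREATE TABLE flight_nk (
    flight_id INTEGER PRIMARY KEY,
    origin    TEXT REFERENCES airport_nk (airport_code) ON UPDATE CASCADE
);
-- Surrogate-key variant: the code is stored in exactly one row.
CREATE TABLE airport_sk (
    airport_id        INTEGER PRIMARY KEY,
    iata_airport_code TEXT UNIQUE,
    name              TEXT
);
CREATE TABLE flight_sk (
    flight_id INTEGER PRIMARY KEY,
    origin_id INTEGER REFERENCES airport_sk (airport_id)
);
INSERT INTO airport_nk VALUES ('AAA', 'Example Field');
INSERT INTO airport_sk VALUES (1, 'AAA', 'Example Field');
""")
conn.executemany("INSERT INTO flight_nk (origin) VALUES (?)", [("AAA",)] * 1000)
conn.executemany("INSERT INTO flight_sk (origin_id) VALUES (?)", [(1,)] * 1000)

# The code changes from 'AAA' to 'BBB' in both designs.
conn.execute("UPDATE airport_nk SET airport_code = 'BBB' WHERE airport_code = 'AAA'")
conn.execute("UPDATE airport_sk SET iata_airport_code = 'BBB' WHERE airport_id = 1")

# Natural key: all 1000 referencing rows were rewritten by the cascade.
cascaded = conn.execute(
    "SELECT COUNT(*) FROM flight_nk WHERE origin = 'BBB'"
).fetchone()[0]
# Surrogate key: the referencing rows still point at id 1 and were not touched.
untouched = conn.execute(
    "SELECT COUNT(*) FROM flight_sk WHERE origin_id = 1"
).fetchone()[0]
print(cascaded, untouched)
```

Both designs end up consistent; the difference is how many rows the database has to rewrite to get there.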

If you take the "I use surrogate keys all the time" approach, you get to bypass this type of concern. That may not be a good thing, because it's important to give your data some thought, but it certainly saves a lot of time, energy, and effort. If anyone were to adopt an exception to this rule, the listed examples certainly qualify, because it takes a near "act of Congress" to make the change.

Ad hoc querying of a database with these natural keys is certainly helpful. Creating views that do the same thing by including the lookup tables can work just as well. Modern databases do a much better job with this type of thing, to the point where it probably doesn't matter.
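A view along those lines might look like the following sketch (Python's sqlite3 again; the `flight` table and the `flight_readable` view name are assumptions, not something from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE airport (
    airport_id        INTEGER PRIMARY KEY,
    iata_airport_code TEXT UNIQUE,
    name              TEXT
);
CREATE TABLE flight (
    flight_id      INTEGER PRIMARY KEY,
    origin_id      INTEGER REFERENCES airport (airport_id),
    destination_id INTEGER REFERENCES airport (airport_id)
);
-- The view hides the joins, so ad hoc queries see codes instead of ids.
CREATE VIEW flight_readable AS
SELECT f.flight_id,
       o.iata_airport_code AS origin,
       d.iata_airport_code AS destination
FROM flight AS f
JOIN airport AS o ON o.airport_id = f.origin_id
JOIN airport AS d ON d.airport_id = f.destination_id;

INSERT INTO airport VALUES
    (1, 'BOS', 'Boston Logan International Airport'),
    (2, 'LAX', 'Los Angeles International Airport');
INSERT INTO flight (origin_id, destination_id) VALUES (1, 2);
""")

view_rows = conn.execute(
    "SELECT origin, destination FROM flight_readable"
).fetchall()
print(view_rows)
```

Anyone querying `flight_readable` gets the same readable rows a natural-key design would give, at the cost of two joins that the database performs behind the view.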

There are some cases specific to the US where standards were drastically changed: postal codes expanded from 5 to 9 digits, and state abbreviations were standardized to 2 letters without the period (remember when Illinois was Ill.?). And most of the world got to deal with Y2K. If you have a real-time app with data spread all over the world containing billions of records, cascading updates are not the best idea, but shouldn't we all work in places that face such challenges? With that dataset, you could test it for yourself and come up with a more definitive answer.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange