Natural vs surrogate keys on support tables [closed]

https://stackoverflow.com/questions/12748473

05-07-2021

Question

I have read many articles about the battle between natural and surrogate primary keys. I agree with using surrogate keys to identify records in tables whose contents are created by the user.

But what should I use for supporting tables?

For example, take a hypothetical table "orderStates". The values in this table are not editable (the user can't insert, modify or delete them).

If I use a natural key, I would have the following data:

TABLE ORDERSTATES
{ID: "NEW", NAME: "New"}
{ID: "MANAGEMENT", NAME: "Management"}
{ID: "SHIPPED", NAME: "Shipped"}

If I use a surrogate key, I would have the following data:

TABLE ORDERSTATES
{ID: 1, CODE: "NEW", NAME: "New"}
{ID: 2, CODE: "MANAGEMENT", NAME: "Management"}
{ID: 3, CODE: "SHIPPED", NAME: "Shipped"}

Now let's take an example: a user enters a new order.

If I use natural keys, I can write this in the code:

newOrder.StateOrderId = "NEW";

With surrogate keys, on the other hand, I need an additional step every time.

stateOrderId_NEW = .... I retrieve the id corresponding to the record with code "NEW"

newOrder.StateOrderId = stateOrderId_NEW;

The same will happen every time I have to move the order to a new status.

So, in this case, what are the reasons to choose one key type over the other?

Solution

The answer is: it depends.

In your example of changing the order state inside your code, ask yourself how likely it is that you would create constants for those states (to avoid making typos for instance). If so, both will accomplish the same.
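To illustrate the point about constants, here is a minimal Python sketch. The names `OrderState`, `OrderStateId` and the dict-based orders are hypothetical, and the integer ids simply mirror the example rows in the question; it assumes those ids never change between environments.

```python
from enum import Enum, IntEnum


class OrderState(str, Enum):
    """Natural keys: the constant's value is the key stored in the order row."""
    NEW = "NEW"
    MANAGEMENT = "MANAGEMENT"
    SHIPPED = "SHIPPED"


class OrderStateId(IntEnum):
    """Surrogate keys: assumes the ids are fixed (1, 2, 3), as in the question."""
    NEW = 1
    MANAGEMENT = 2
    SHIPPED = 3


# Hypothetical order records; either way the calling code names a constant.
order_natural = {"state_order_id": OrderState.NEW.value}
order_surrogate = {"state_order_id": int(OrderStateId.NEW)}
```

In both variants the calling code reads the same, so the extra lookup step only appears if the surrogate ids are not known up front.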

In the case that a new order state gets submitted via a form, you would build the drop down (for example) of possible values using either the natural or surrogate key, no difference there.

There's a difference when you're doing a query on the order table and wish to print the state for each order. Having a natural key would avoid the need to make another join, which helps (albeit a little).
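The difference can be sketched with Python's built-in `sqlite3` module. The table and column names below (`orders_nat`, `orders_sur` and so on) are made up for illustration, and only two of the states are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Natural-key variant: orders store the state code itself.
    CREATE TABLE order_states_nat (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders_nat (
        id INTEGER PRIMARY KEY,
        state_id TEXT NOT NULL REFERENCES order_states_nat(id)
    );
    INSERT INTO order_states_nat VALUES ('NEW', 'New'), ('SHIPPED', 'Shipped');
    INSERT INTO orders_nat VALUES (1, 'NEW'), (2, 'SHIPPED');

    -- Surrogate-key variant: orders store an integer id.
    CREATE TABLE order_states_sur (
        id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE, name TEXT NOT NULL
    );
    CREATE TABLE orders_sur (
        id INTEGER PRIMARY KEY,
        state_id INTEGER NOT NULL REFERENCES order_states_sur(id)
    );
    INSERT INTO order_states_sur VALUES (1, 'NEW', 'New'), (3, 'SHIPPED', 'Shipped');
    INSERT INTO orders_sur VALUES (1, 1), (2, 3);
""")

# Natural key: the state code is already in the orders row, no JOIN needed.
natural_rows = conn.execute(
    "SELECT id, state_id FROM orders_nat ORDER BY id"
).fetchall()

# Surrogate key: a JOIN is needed to turn the integer back into a code.
surrogate_rows = conn.execute(
    """
    SELECT o.id, s.code
    FROM orders_sur AS o
    JOIN order_states_sur AS s ON s.id = o.state_id
    ORDER BY o.id
    """
).fetchall()
```

Both queries return the same rows. Printing the display name ("New") rather than the code would still need a JOIN in either variant.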

In terms of storage and query performance, a surrogate key is in most cases smaller and faster, respectively (how much depends on the table size).

But having said all that, it just takes careful consideration. Personally I feel that surrogate keys have become something like a dogma; many developers will use them in all their tables and modeling software will automatically add them upon table creation. Therefore you might get mixed reactions to choosing natural keys, but there's no hard rule forbidding you from using them; choose wisely :)

OTHER TIPS

In a nutshell:

A natural key may lead to less JOINing¹,
but it also requires more space² (and can therefore hurt cache performance³).

There are no hard-and-fast rules here. First determine whether you need such a JOIN at all, and if you do, whether eliminating it is worth paying the price in increased storage. The only way to do that is to measure on realistic amounts of data.
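As a rough illustration of what such a measurement might look like, the sketch below builds the same orders table twice in SQLite (string key vs. integer key) and compares the resulting file sizes. The row count, the helper `db_size_bytes` and the table layout are arbitrary assumptions, and the numbers will differ on other database engines.

```python
import os
import sqlite3
import tempfile

STATES = ["NEW", "MANAGEMENT", "SHIPPED"]
ROW_COUNT = 50_000  # arbitrary; use realistic volumes for a real measurement


def db_size_bytes(path, column_type, values):
    """Create an orders table with one state column and return the file size."""
    conn = sqlite3.connect(path)
    conn.execute(
        f"CREATE TABLE orders (id INTEGER PRIMARY KEY, state_id {column_type})"
    )
    conn.executemany(
        "INSERT INTO orders (state_id) VALUES (?)", ((v,) for v in values)
    )
    conn.commit()
    conn.close()
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    # Migrated natural key: every order row stores the state code as text.
    natural_size = db_size_bytes(
        os.path.join(tmp, "natural.db"), "TEXT",
        (STATES[i % 3] for i in range(ROW_COUNT)),
    )
    # Migrated surrogate key: every order row stores a small integer.
    surrogate_size = db_size_bytes(
        os.path.join(tmp, "surrogate.db"), "INTEGER",
        (i % 3 + 1 for i in range(ROW_COUNT)),
    )

print(f"natural: {natural_size} bytes, surrogate: {surrogate_size} bytes")
```

This only covers storage. Whether the saved JOIN outweighs the larger rows would need timing the actual queries against the real schema.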

BTW, there are other considerations in the natural vs. surrogate debate, such as...

cascading updates,
clustering,
diamond-shaped dependencies etc.

...but they, for the most part, don't apply to your case.

¹ The natural key is migrated through the FK into the "main" table, so if you need to get it together with the main table rows, you can avoid the JOIN altogether. BTW, if you need a different JOIN (for getting a non-key column), you won't be able to eliminate it this way.

² Presumably, the "main" table is large, in which case storing many strings (for the migrated natural key) is less space-efficient than storing many ints (for the migrated surrogate). If the main table is small, then it pretty much doesn't matter either way.

³ Rows are "fatter", so fewer rows will fit into a single database page. Caching is typically implemented at the page level.

If I understand correctly, your first example shows that the table's primary key is a string (varchar) whereas in the second example, the primary key is an integer. The primary key presumably will be a foreign key in another table.

Storing an integer generally uses less disk space than storing a varchar; with a fixed-width string type, space has to be allocated for the longest value (in your case, 'MANAGEMENT'). I imagine that indexing by an integer is also faster than indexing by a string (and the index will take up less room).

The first example has the primary key and the 'name' field having the same value; whilst changing the name would not change the primary key (and thus would have no effect on a table using 'OrderStates' as a foreign key), there would be a logical disconnection - you could have as primary key 'NAME' but value 'Person'.

It is customary to write queries such as

select orders.ordname
from orders
inner join orderstatus on orders.status = orderstatus.id
where orderstatus.name = 'NEW'

although to be honest, I would use a flag field to show whether the status indicates the initial, 'new', status, as opposed to checking the status's name - the status will still be the initial status, even if you change its name.
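A minimal sketch of the flag-field idea, using Python's `sqlite3` module. The column name `is_initial` and the sample rows are hypothetical; the table and column names otherwise follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orderstatus (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        is_initial INTEGER NOT NULL DEFAULT 0  -- hypothetical flag column
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        ordname TEXT NOT NULL,
        status INTEGER NOT NULL REFERENCES orderstatus(id)
    );
    INSERT INTO orderstatus VALUES (1, 'NEW', 1), (2, 'SHIPPED', 0);
    INSERT INTO orders VALUES (1, 'first order', 1), (2, 'second order', 2);
""")

# Filter on the flag instead of on the status name.
QUERY = """
    SELECT orders.ordname
    FROM orders
    INNER JOIN orderstatus ON orders.status = orderstatus.id
    WHERE orderstatus.is_initial = 1
"""

before_rename = conn.execute(QUERY).fetchall()

# Renaming the status does not affect the flag-based query.
conn.execute("UPDATE orderstatus SET name = 'Received' WHERE id = 1")
after_rename = conn.execute(QUERY).fetchall()
```

The query returns the same orders before and after the rename, whereas a filter on `orderstatus.name = 'NEW'` would stop matching.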

You can use a generator to provide a key which is guaranteed to be unique, whereas you would have to check for collisions if you use a 'natural' key.

Licensed under: CC-BY-SA with attribution
