Are Surrogate Primary Keys needed on a Fact table in a Data Warehouse?

https://stackoverflow.com/questions/930246

06-09-2019

Question

When I asked our DB designers why our fact table does not have a PK, I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected. When I suggested that we add an identity column in that case, I was told that I'd "just be wasting space and that it wasn't needed."

My feeling is that every table in the source system should have a PK, even if it is an identity column. Given that the data warehouse (DW) is a recipient of data from other systems, how would I otherwise be able to ensure that the data in the DW accurately reflects what is in the source system if there is no way to tie individual records together? If you have a runaway load program that screws up data and has run for a week, how would you reconcile the differences with a live transaction source system without some sort of unique constraint to compare?

Solution

A database table without a primary key seems like a poor design choice that leaves room for many kinds of anomalies. For example, how would you delete or update a single record in such a table?

OTHER TIPS

A data warehouse is not necessarily a relational data store, although you may choose to make it one, so relational definitions don't necessarily apply.

A primary key is only required if you want to do something with the data that requires a unique identifier (like tracing it to a source, but that's not always required or necessary or even possible anyway), and data in a data warehouse can often be used in ways that don't require primary keys. Specifically, you may not need to distinguish rows from each other, most often because the data is only used for constructing aggregate values.
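As a rough illustration of that point, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the warehouse. The table and column names (fact_sales, date_key, product_key, amount_cents) are made up for the example. Two rows are completely identical, yet the aggregate is still well defined:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical fact table with no primary key: identical rows are allowed.
conn.execute(
    "CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(20240101, 1, 999), (20240101, 1, 999), (20240102, 2, 500)],
)

# Aggregation never needs to tell the two identical rows apart.
totals = conn.execute(
    "SELECT product_key, COUNT(*), SUM(amount_cents) "
    "FROM fact_sales GROUP BY product_key ORDER BY product_key"
).fetchall()
print(totals)
```

What this does not give you is any way to update or delete just one of the two identical rows, which is the concern raised in the question.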

Time is not a required dimension in constructing data warehouse tables.

It may be psychologically uncomfortable, and wasted space is a trivial issue, but your colleague is correct - PKs aren't necessary.

An identity type column is a "surrogate" key that replaces one of your "candidate" keys (simply put). Adding a surrogate key column adds nothing if you can't identify a row without it, and identifying a row requires a candidate key.

You should at least have a natural key on the fact table so you can identify rows and reconcile them against source or track changes where this is necessary.
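A minimal sketch of what such a reconciliation could look like, again with sqlite3 standing in for both systems. The tables src_order_line and fact_order_line and the natural key (order_id, line_no) are assumptions made for the example, not anything from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source table and fact table sharing a natural key.
CREATE TABLE src_order_line (order_id INTEGER, line_no INTEGER, amount_cents INTEGER);
CREATE TABLE fact_order_line (
    order_id INTEGER,
    line_no INTEGER,
    amount_cents INTEGER,
    UNIQUE (order_id, line_no)          -- the natural key
);
INSERT INTO src_order_line VALUES (1, 1, 500), (1, 2, 700), (2, 1, 900);
-- Simulate a bad load: one row has the wrong amount, one is missing.
INSERT INTO fact_order_line VALUES (1, 1, 500), (1, 2, 750);
""")

# Join source to fact on the natural key and report anything that disagrees.
discrepancies = conn.execute("""
SELECT s.order_id, s.line_no,
       CASE WHEN f.order_id IS NULL THEN 'missing' ELSE 'amount differs' END
FROM src_order_line AS s
LEFT JOIN fact_order_line AS f
       ON f.order_id = s.order_id AND f.line_no = s.line_no
WHERE f.order_id IS NULL OR f.amount_cents <> s.amount_cents
ORDER BY s.order_id, s.line_no
""").fetchall()
print(discrepancies)
```

Without the natural key on the fact table there is nothing to join on, and the comparison can only be done on totals.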

On SQL Server an identity column gives you a surrogate key for free and on other systems using sequences (e.g. Oracle) it can be added fairly easily. Surrogate fact table keys can be useful for various different reasons. Some possible applications are:

Some tools like to have numeric keys on fact tables, preferably monotonically increasing ones. An example of this is MS SQL Server Analysis Services, which really likes to have a numeric, monotonically increasing key for fact tables used to populate measure groups. This is especially required for incremental loads.
If you have any relationships between fact tables (for example a written - earned premium breakdown for those familiar with Insurance) then a synthetic key is helpful here.
If you have dimensions living in a M:M relationship with a fact table (e.g. ICD codes) then a numeric key on the fact table simplifies this.
If you have any self-join requirements for transactions (e.g. certain transactions being corrections to others) then a synthetic key will simplify working with these.
If you do contra-restate operations within your data warehouse (i.e. handle changes to transactional data by generating reversals and re-stating the row) then you can have multiple fact table rows for the same natural key, and a synthetic key is what tells them apart.

Otherwise, if you won't have anything joining to your fact table in a 1:M relationship then a synthetic key probably won't be used for anything.
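The contra-restate and self-join cases from the list above can be sketched roughly as follows. This uses sqlite3 with an auto-incrementing integer key in place of a SQL Server identity column; fact_premium, policy_no and reverses_fact_id are hypothetical names chosen for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_premium (
    fact_id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    policy_no        TEXT NOT NULL,                      -- natural key, may repeat
    amount_cents     INTEGER NOT NULL,
    reverses_fact_id INTEGER REFERENCES fact_premium (fact_id)
)""")

def insert_fact(policy_no, amount_cents, reverses=None):
    """Insert one fact row and return its surrogate key."""
    cur = conn.execute(
        "INSERT INTO fact_premium (policy_no, amount_cents, reverses_fact_id) "
        "VALUES (?, ?, ?)",
        (policy_no, amount_cents, reverses),
    )
    return cur.lastrowid

original = insert_fact("P-100", 10000)           # first version of the fact
insert_fact("P-100", -10000, reverses=original)  # contra: reverse it
insert_fact("P-100", 12000)                      # restate with the new amount

# Three rows share the natural key; the net value is the restated amount.
row_count, net = conn.execute(
    "SELECT COUNT(*), SUM(amount_cents) FROM fact_premium WHERE policy_no = ?",
    ("P-100",),
).fetchone()

# Self-join on the surrogate key: which row did each reversal cancel?
reversals = conn.execute("""
SELECT r.fact_id, o.fact_id, o.amount_cents
FROM fact_premium AS r
JOIN fact_premium AS o ON o.fact_id = r.reverses_fact_id
""").fetchall()
print(row_count, net, reversals)
```

The reversal row can only point at the row it cancels because that row has an identifier of its own; the natural key alone matches all three rows.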

I would agree with you.

"I was told that there is no set of columns in the table that would uniquely identify a record, even if all the columns were selected." - this seems to break something fundamental about relational databases as I understand them.

A fact consists of additive values plus foreign keys to dimensions. Time is an obvious dimension that is common to every dimensional model that I know. If nothing else, a composite key that contains a timestamp would certainly be unique enough.

I wonder if your DBAs have much knowledge about dimensional modeling. It's a different way of thinking from the normal relational, transactional style.

You are correct--sort of. Without a primary key, a table does not meet the minimal definition of being relational. It's fundamental to being a relation that it must not permit duplicate rows. Tables in a Data Warehouse design should be relational, even if they're not strictly in normal form.

So there must be some column (or set of columns) in the row that serves to identify rows uniquely. But it doesn't necessarily have to be an identity column for a surrogate key.

If the Fact Table has no set of columns that can serve this role of being a candidate key, then more Dimension Tables are needed in this DW, and more columns are needed in the Fact Table.

This new Dimension alone may not be the primary key; it may be combined with existing columns in the Fact Table to create a candidate key.

If the fact table is at the center of a star schema, then there is in reality a candidate key. If you take all the foreign keys in the fact table together, the ones that point to rows in the dimension tables, that's a candidate key.
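Whether the foreign keys really do form a candidate key in a given table is something you can check against the data rather than assume. A minimal sketch with sqlite3, using a made-up fact_sales table with three dimension keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical fact table: three dimension foreign keys and one measure.
CREATE TABLE fact_sales (date_key INTEGER, store_key INTEGER, product_key INTEGER, qty INTEGER);
INSERT INTO fact_sales VALUES
    (20240101, 1, 10, 2),
    (20240101, 1, 11, 1),
    (20240101, 1, 10, 3);   -- same foreign keys as the first row
""")

# Any group with more than one row means the foreign keys are not a key.
duplicates = conn.execute("""
SELECT date_key, store_key, product_key, COUNT(*) AS n
FROM fact_sales
GROUP BY date_key, store_key, product_key
HAVING COUNT(*) > 1
""").fetchall()
print(duplicates)
```

An empty result only says the current data is unique on those columns, not that future loads will be.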

It probably would not do much good to declare it as a primary key. The only thing it would do is protect you against a rogue ETL process. The folks who run the warehouse might have the ETL processing well in hand.

As far as indexing and query speed is concerned, that's a whole different issue with star schemas than it is with OLTP oriented databases. The people who run the warehouse may have that in hand as well.

When designing a database for OLTP use, it's unwise to have a table without a primary key. The same considerations don't carry over into warehouses.

I always think that a table should be ordered by its most common queries or performance hitters, therefore the clustered index of a table should be in line with the most difficult or common query.

The primary key does not have to be a clustered index so I know you might be wondering where I am going with this but my concern is more about the clustered index than the primary key (and let’s be honest, they normally follow each other).

So the initial question for me is not "should I have a surrogate primary key on the fact table?" but more like "should I have a clustered index on the fact table?" I think the answer is yes, you should have one. (There are other posts on this site covering that question, but I still think it's worth mentioning here in case it is the question people are really asking, despite wording it differently.)

There are times you want a surrogate key, but I would heartily recommend that the surrogate is NOT the table's clustered index on its own. Doing so would order the table in line with the meaningless surrogate key. (Often people add a surrogate identity column to a table and make it the primary key and also, by default, the clustered index.)

So which columns should the clustered index be on? Personally I like the date for fact tables. To this you might add some other dimensions' FKs for uniqueness, but this will increase the index size and may not provide any benefit: for the index to be useful, queries would have to reference the relevant dimensions in the order in which they appear in the key.

To get around this (and the reason I answer this here) I think you SHOULD add a surrogate and then create the clustered index on the date key followed by the surrogate (in that order). I do this because the date alone will not identify a unique row, but adding the surrogate will. This keeps the data ordered by date, which helps all other non-clustered indexes, and also keeps the clustered index size reasonable.

Additionally, as the data grows you may want to partition it, in which case you will need a partition key, which will almost invariably be the date. Building the clustered index with the date as the leading part of the key makes this easier. With partitioning you can then use the sliding window technique to archive old data or to load new data.
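On SQL Server this would be something along the lines of CREATE UNIQUE CLUSTERED INDEX ... ON FactSales (DateKey, FactId). The sketch below shows the same idea in sqlite3, where a WITHOUT ROWID table is stored in primary key order and so behaves like a clustered index. The table and column names are hypothetical, and the surrogate is assigned by the loader here because SQLite's AUTOINCREMENT is not available on WITHOUT ROWID tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rows are stored in (date_key, fact_id) order: date first, surrogate second.
conn.execute("""
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL,
    fact_id      INTEGER NOT NULL,   -- surrogate, unique on its own
    amount_cents INTEGER NOT NULL,
    PRIMARY KEY (date_key, fact_id)
) WITHOUT ROWID""")

# Load in arbitrary order; the surrogate makes each (date_key, fact_id) unique.
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(20240103, 1, 500), (20240101, 2, 700), (20240101, 3, 900), (20240102, 4, 100)],
)

# A date-range query can use the leading column of the clustered key.
jan_1_to_2 = conn.execute(
    "SELECT date_key, fact_id FROM fact_sales "
    "WHERE date_key BETWEEN ? AND ? ORDER BY date_key, fact_id",
    (20240101, 20240102),
).fetchall()
print(jan_1_to_2)
```

Two rows share the date 20240101 and are still distinct keys because of the trailing surrogate, which is the whole reason for adding it.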

Not having a unique identifier for each row is even worse than it first seems. Sure, it is precarious and it's easy to inadvertently delete some rows.

But performance is much worse too. Each time you end up asking the database to get you the rows for Employees with EmployeeType = 'Manager' you are doing a string comparison. Identifiers are just faster and better.

Besides, storage is cheap and in this case I imagine the impact on space will be less than a quarter percentage point if that--as a data warehouse you are probably designing for terabytes of data.

http://www.ralphkimball.com/html/controversies.html

Fable:

The primary key of a fact table consists of all the referenced dimension foreign keys.

Fact:

A fact table often has 10 or more foreign keys joining to the dimension tables’ primary keys. However, only a subset of the fact table’s foreign key references is typically needed for row uniqueness. Most fact tables have a primary key that consists of a concatenated/composite subset of the foreign keys.

Using the combination of dimension surrogate keys as the primary key of the fact table doesn't work in all cases. Consider the case where there are three dimensions a, b and c. Most designs have a dimension row for "unknown"; assume I always assign this row the surrogate key -1. I could easily have two rows in my fact table with keys a=n1, b=n2 and c=-1, i.e. duplicate keys, because neither row has a valid value for dimension c and so both resolve to the unknown row.
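A small sketch of that failure, using sqlite3 and a made-up fact table whose primary key is the three dimension keys. Both rows are legitimate facts, but the second is rejected because its unknown c resolves to the same -1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical fact table keyed on all three dimension surrogate keys.
conn.execute("""
CREATE TABLE fact (
    a_key INTEGER, b_key INTEGER, c_key INTEGER,
    amount_cents INTEGER,
    PRIMARY KEY (a_key, b_key, c_key)
)""")

# First fact: dimension c is unknown, so it gets the "unknown" member, -1.
conn.execute("INSERT INTO fact VALUES (1, 2, -1, 500)")

# Second, different fact with the same a and b and an unknown c.
try:
    conn.execute("INSERT INTO fact VALUES (1, 2, -1, 700)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True   # the composite key treats it as a duplicate

print(rejected)
```

The options at that point are to drop the constraint, add a degenerate dimension that does distinguish the rows, or add a surrogate key.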

You're conflating two issues here -- identifying a unique record in the fact table, and tracing records from the source system through to the fact table.

In the latter case it's quite possible for a single record in a source system to have multiple fact table records. Imagine a source system record that represents a transfer of funds from one account to another. There might be two fact table records to represent this, one for the debited account and one for the credited account. Furthermore there might be multiple fact records to represent different states of the source system record at different points in its lifecycle.
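The funds transfer case can be sketched in a few lines of Python. The Transfer class, the to_fact_rows function and the field names are all invented for the example; the point is only that one source record fans out into two fact rows that both carry the source identifier:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    """Hypothetical source-system record: one row per transfer."""
    transfer_id: int
    from_account: str
    to_account: str
    amount_cents: int

def to_fact_rows(t):
    """Turn one source transfer into two fact rows, a debit and a credit."""
    return [
        {"source_transfer_id": t.transfer_id, "account": t.from_account,
         "amount_cents": -t.amount_cents},
        {"source_transfer_id": t.transfer_id, "account": t.to_account,
         "amount_cents": t.amount_cents},
    ]

fact_rows = to_fact_rows(Transfer(42, "ACC-1", "ACC-2", 2500))
for row in fact_rows:
    print(row)
```

Here source_transfer_id traces both rows back to the source, but it cannot be the fact table's primary key because it appears twice.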

For the issue of the primary key on the fact table, there's really not a "correct" answer. There are desirable/essential characteristics that you might want (for example for the identity of a single record to be communicated easily between users of the system, or for a single record to be deleted or updated easily). However for an Oracle system a ROWID might very well do for that as long as it doesn't matter if it occasionally changes.

Really though, there's so little overhead in maintaining a single synthetic key that you might as well do it anyway. You might choose not to index it, as the index is going to be a much larger resource consumer than the column itself.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow