Question

I have heard textbook definitions of how to design a star schema regarding what goes in the fact table and what goes in the dimension tables, such as:

The fact table should contain core information about an object and dimensions should contain information about the facts

(paraphrased)

However, practically in business, I have seen a star schema designed where the fact table contains a surrogate key, a business key, and all single-valued fields of an object, and each dimension stores all the multi-valued fields of an object (hence the word dimension). For example, a person may be the object represented in a fact table. A person has one name, one age, etc., which all make viable facts in a fact table. A person may own multiple cars, each with their own attributes, which would represent a person's car dimension, stored as a dimension table with several columns to describe each car's attribute. In this example, this dimension table also includes a foreign key representing the business key of the corresponding row from the fact table.

So, if we can agree that this may be a suitable design, the problem that I am trying to overcome is how to do SCD type 2 (historical) on a multivalued dimension table. For my fact table full of single facts, it is obvious. I include two extra columns, an effective date and expiration date, and I use the business key to link common records where the most recent record has a NULL expiration date, and all other historical records for the same business key have both an effective and expiration date indicating at what point in time they were the most current record.

How do I use this same concept on a dimension which represents a multivalued list? I essentially would like the same concept where I can (1) identify the current list (in this example, the cars a person owns) and (2) identify what the list was at any given moment in history. Can I just put an effective and expiration date on each dimension value? How then do I differentiate between values added after a certain time? Or deleted values?

But if we do not agree on this design approach, please tell me what the industry standard is so I can do this correctly.

Solution

Usually, a dimension table carries a single valid-time interval (start and end date) per record that covers all of its fields, and SCD2 applies to the complete record. It is good practice to mark currently valid records with a non-null end value far in the future, rather than NULL, because this simplifies queries. An end date in the past signifies deletion, or any other semantics you define (for example, the person left the country or is no longer employed). Also add surrogate keys to your dimension tables so that each record version is uniquely identified.
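As a minimal sketch of that SCD2 pattern, the following uses Python's built-in sqlite3 module. The table and column names (dim_person, valid_from, valid_to), the sentinel date 9999-12-31, the helper scd2_upsert and the half-open interval convention [valid_from, valid_to) are all assumptions made for illustration, not part of the original answer.

```python
import sqlite3

FAR_FUTURE = "9999-12-31"  # assumed sentinel meaning "still valid"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_person (
        person_sk  INTEGER PRIMARY KEY,  -- surrogate key, one per version
        person_id  TEXT NOT NULL,        -- business key, stable across versions
        name       TEXT NOT NULL,
        city       TEXT NOT NULL,
        valid_from TEXT NOT NULL,        -- inclusive
        valid_to   TEXT NOT NULL         -- exclusive (assumed convention)
    )
""")

def scd2_upsert(conn, person_id, name, city, change_date):
    """Close the current version of the record (if any) and insert a new one."""
    conn.execute(
        "UPDATE dim_person SET valid_to = ? WHERE person_id = ? AND valid_to = ?",
        (change_date, person_id, FAR_FUTURE),
    )
    conn.execute(
        "INSERT INTO dim_person (person_id, name, city, valid_from, valid_to) "
        "VALUES (?, ?, ?, ?, ?)",
        (person_id, name, city, change_date, FAR_FUTURE),
    )

scd2_upsert(conn, "P1", "Ann Smith", "Berlin", "2020-01-01")
scd2_upsert(conn, "P1", "Ann Jones", "Berlin", "2022-06-15")  # name change

# The current version is found with a plain equality test, no IS NULL needed.
current = conn.execute(
    "SELECT name FROM dim_person WHERE person_id = ? AND valid_to = ?",
    ("P1", FAR_FUTURE),
).fetchall()

history = conn.execute(
    "SELECT name, valid_from, valid_to FROM dim_person "
    "WHERE person_id = ? ORDER BY valid_from",
    ("P1",),
).fetchall()
print(current)
print(history)
```

With a far-future end date, range predicates such as valid_from <= x AND x < valid_to work for current and historical rows alike, which is the simplification mentioned above.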

Fact tables usually contain "measures" such as sales or cost, or record events such as a placed call and its duration. Reports usually aggregate these columns.

A star-schema is a way to model a sparsely populated "cube", where each axis of the coordinate system is given by one of the dimension tables. "Slice and dice" operations and "drill up / drill down" operations in reports translate nicely into SQL using this model.

In your cars and people example, I would use two dimension tables, one for cars and one for people, each historized according to SCD2, plus a factless fact table. The fact table holds foreign keys that reference the entity identifier of each dimension (that is, the business key rather than the per-version surrogate key), along with its own valid-time columns (SCD2). In this design, a change in one of the dimension tables does not require a new SCD2 record in the fact table.

This way you can model changes in each entity, such as name changes in people or color changes in cars, separately from changes in the relationship between cars and people, such as ownership. Each table uses non-overlapping valid times (start and end values) for each business key, recording the history of these entities independently. In this model the fact table is essentially an m:n linkage table for which a separate history of valid times is kept.
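The layout described here could be sketched as follows, again with sqlite3. All table names, column names and sample rows are hypothetical; the point is only that the fact table references business keys and so keeps a single row while both dimensions gain new versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_person (
    person_sk INTEGER PRIMARY KEY,   -- surrogate key, one per version
    person_id TEXT NOT NULL,         -- business key, stable across versions
    name TEXT NOT NULL,
    valid_from TEXT NOT NULL, valid_to TEXT NOT NULL);
CREATE TABLE dim_car (
    car_sk INTEGER PRIMARY KEY,
    car_id TEXT NOT NULL,
    color TEXT NOT NULL,
    valid_from TEXT NOT NULL, valid_to TEXT NOT NULL);
-- Factless fact table: one row per ownership period, keyed by business keys.
CREATE TABLE fact_ownership (
    person_id TEXT NOT NULL,
    car_id TEXT NOT NULL,
    valid_from TEXT NOT NULL, valid_to TEXT NOT NULL);

-- The person changes name and the car changes color ...
INSERT INTO dim_person (person_id, name, valid_from, valid_to) VALUES
    ('P1', 'Ann Smith', '2020-01-01', '2022-06-15'),
    ('P1', 'Ann Jones', '2022-06-15', '9999-12-31');
INSERT INTO dim_car (car_id, color, valid_from, valid_to) VALUES
    ('C1', 'red',  '2019-01-01', '2021-01-01'),
    ('C1', 'blue', '2021-01-01', '9999-12-31');
-- ... but the ownership itself never changed, so it stays one row.
INSERT INTO fact_ownership VALUES
    ('P1', 'C1', '2020-03-01', '9999-12-31');
""")

def count(table):
    # Table names come from this script only, so string formatting is safe here.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

versions = {t: count(t) for t in ("dim_person", "dim_car", "fact_ownership")}
print(versions)
```

Had the fact table referenced surrogate keys instead, each dimension change would have forced a new fact row, which is what this design avoids.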

You would identify the current and past lists by filtering each table with x between start and end, where x is either now or a past point in time. This answers your (1) and (2), leaving aside whether the interval is right-open or left-open.
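A hedged sketch of such an as-of query, assuming half-open intervals [valid_from, valid_to), ISO date strings, and hypothetical tables dim_car and fact_ownership. The helper cars_owned is illustrative only. A car added later simply has an ownership row that starts later, and a removed car has an ownership row whose end date lies in the past.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_car (car_id TEXT, color TEXT, valid_from TEXT, valid_to TEXT);
CREATE TABLE fact_ownership (person_id TEXT, car_id TEXT, valid_from TEXT, valid_to TEXT);
INSERT INTO dim_car VALUES
    ('C1', 'red',   '2019-01-01', '2021-01-01'),
    ('C1', 'blue',  '2021-01-01', '9999-12-31'),
    ('C2', 'green', '2021-05-01', '9999-12-31');
INSERT INTO fact_ownership VALUES
    ('P1', 'C1', '2020-03-01', '2021-09-01'),  -- sold on 2021-09-01
    ('P1', 'C2', '2021-05-01', '9999-12-31');  -- still owned
""")

def cars_owned(conn, person_id, as_of):
    """Return (car_id, color) for every car the person owned on the given date."""
    return conn.execute(
        """
        SELECT f.car_id, c.color
        FROM fact_ownership f
        JOIN dim_car c ON c.car_id = f.car_id
        WHERE f.person_id = :pid
          AND f.valid_from <= :d AND :d < f.valid_to   -- ownership valid at d
          AND c.valid_from <= :d AND :d < c.valid_to   -- car version valid at d
        ORDER BY f.car_id
        """,
        {"pid": person_id, "d": as_of},
    ).fetchall()

print(cars_owned(conn, "P1", "2020-06-01"))
print(cars_owned(conn, "P1", "2021-06-01"))
print(cars_owned(conn, "P1", "2023-01-01"))
```

Passing today's date gives the current list; any earlier date gives the list as it stood then, with each car shown in the version that was valid on that date.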

Summary statistics with full history, such as how many cars we have in some city (assuming city is part of the people dimension table), can now be answered using temporal joins and "sequenced" queries, which are sometimes also called "coalesce" queries; see Snodgrass, chapter 6.3ff.
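To illustrate the idea, here is a rough sketch of a temporal join followed by a sequenced count. It is not taken from Snodgrass; the tables, sample data and the helper sequenced_count are assumptions, intervals are treated as half-open, and the count is done in Python rather than in pure SQL to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_person (person_id TEXT, city TEXT, valid_from TEXT, valid_to TEXT);
CREATE TABLE fact_ownership (person_id TEXT, car_id TEXT, valid_from TEXT, valid_to TEXT);
INSERT INTO dim_person VALUES
    ('P1', 'Berlin', '2020-01-01', '2022-01-01'),
    ('P1', 'Munich', '2022-01-01', '9999-12-31');
INSERT INTO fact_ownership VALUES
    ('P1', 'C1', '2020-03-01', '2021-09-01'),
    ('P1', 'C2', '2021-05-01', '9999-12-31');
""")

# Temporal join: pair each ownership row with each person version and keep
# only the overlap of the two validity intervals (SQLite's two-argument
# MAX/MIN are scalar functions, and ISO dates compare correctly as text).
rows = conn.execute("""
    SELECT p.city, f.car_id,
           MAX(p.valid_from, f.valid_from),
           MIN(p.valid_to,   f.valid_to)
    FROM fact_ownership f
    JOIN dim_person p ON p.person_id = f.person_id
    WHERE MAX(p.valid_from, f.valid_from) < MIN(p.valid_to, f.valid_to)
    ORDER BY 2, 3
""").fetchall()

def sequenced_count(rows):
    """Return (city, start, end, n_cars) for each period between change points.

    Adjacent periods with equal counts are not merged in this sketch.
    """
    out = []
    for city in sorted({r[0] for r in rows}):
        spans = [(r[2], r[3]) for r in rows if r[0] == city]
        points = sorted({d for span in spans for d in span})
        for start, end in zip(points, points[1:]):
            n = sum(1 for s, e in spans if s <= start and end <= e)
            if n:
                out.append((city, start, end, n))
    return out

for line in sequenced_count(rows):
    print(line)
```

The join yields one row per period in which a given car was located in a given city, and the count then reports how many cars each city had during each stretch of time in which nothing changed.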

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange