Question

I have designed a data warehouse with three dimensions and one fact table, and to do that I've read some books by Kimball, Bouman, Malinowski...

Kimball and Caserta's book, The Data Warehouse ETL Toolkit, discusses the audit dimension on page 128. As I understand it, this is a dimension linked to the fact table like the other dimensions, and it is used mainly to evaluate data quality.

My question is: is this audit dimension actually used in enterprise environments? Do big companies use it in their data warehouse projects?

I am doing my final degree project and I don't know whether I should include this dimension, because I have only seen it in books, but it seems like a good approach for data quality purposes.

Thanks in advance.


Solution

OP asked,

Is this audit dimension actually used in enterprise environments? Do big companies use it in their data warehouse projects?

Short answer: yes, sometimes.

Long answer: an audit dimension is used when it is really required. Audit dimensions are meant to store ETL metadata. Some of this metadata, such as load date, load batch number, job name, and user name, can be stored directly in the fact table itself.

However, once you decide to store this information in the fact table, you soon realize that much of it is going to be identical across a large number of fact rows.

For example, if you load 100K records into your fact table per day, the loading job name, source file name, user who executed the job, batch number, etc. are going to be the same for all 100K records. So it makes sense to move this information out of the fact table, maintain it in a separate table, and reference that table's surrogate key from the fact. This reduces data redundancy and space requirements, and may improve loading speed. It is ordinary normalization.
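As a rough illustration of that layout, here is a minimal sketch using Python's built-in sqlite3 module. The table names (audit_dim, fact_sales), the column names, and the load_batch helper are all hypothetical, chosen only for this example; a real warehouse would use its own schema and ETL tool.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE audit_dim (
    audit_key    INTEGER PRIMARY KEY,   -- surrogate key
    job_name     TEXT    NOT NULL,
    batch_number INTEGER NOT NULL,
    source_file  TEXT    NOT NULL,
    run_by       TEXT    NOT NULL
);
CREATE TABLE fact_sales (
    sale_id   INTEGER PRIMARY KEY,
    amount    REAL    NOT NULL,
    audit_key INTEGER NOT NULL REFERENCES audit_dim (audit_key)
);
"""


def load_batch(conn, job_name, batch_number, source_file, run_by, amounts):
    """Insert one audit row for the batch, then tag every fact row with its key."""
    cur = conn.execute(
        "INSERT INTO audit_dim (job_name, batch_number, source_file, run_by) "
        "VALUES (?, ?, ?, ?)",
        (job_name, batch_number, source_file, run_by),
    )
    audit_key = cur.lastrowid  # surrogate key generated for this batch
    conn.executemany(
        "INSERT INTO fact_sales (amount, audit_key) VALUES (?, ?)",
        [(amount, audit_key) for amount in amounts],
    )
    conn.commit()
    return audit_key


def demo():
    """Build an in-memory database and load one small batch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    load_batch(conn, "load_sales", 1, "sales_day1.csv", "etl_user",
               [10.0, 20.0, 30.0])
    return conn
```

However many fact rows a batch contains, the job name, source file, and user are written once, and each fact row carries only the integer audit_key.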

Of course, there is some information that you should not put in your audit dimension, for example the load date-time of each record. When it is recorded per row, this value can be unique for every record in your fact, so putting it in the audit dimension would make the audit table as big as the fact table. Keep such information in the fact table itself.

I have personally worked in some of the world's largest data warehouses in the retail and telecom sectors, and have seen some kind of audit dimension in those data warehouses.

OTHER TIPS

Yes, it is useful, because it allows you to store process metadata about every row. This may include:

  • name of the job that inserted the row,
  • identifier of the job execution,
  • date and time when it was executed,
  • name of the source system or source file,
  • user who executed the job,
  • number of processed rows.

This information is invaluable both for regular monitoring and for debugging when something goes wrong. Think about a very simple example: when somebody loads the wrong source file by mistake, how can you quickly identify the rows that should be deleted if there is no audit dimension?
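To make the wrong-file scenario concrete, here is a small sketch using Python's sqlite3 module. The tables (audit_dim, fact_sales), the file names, and the delete_rows_loaded_from helper are assumptions made up for this example, not part of any particular ETL product.

```python
import sqlite3


def delete_rows_loaded_from(conn, source_file):
    """Delete fact rows whose audit row names the given source file.

    Returns the number of fact rows removed.
    """
    cur = conn.execute(
        "DELETE FROM fact_sales WHERE audit_key IN "
        "(SELECT audit_key FROM audit_dim WHERE source_file = ?)",
        (source_file,),
    )
    conn.commit()
    return cur.rowcount


def make_example_db():
    """Hypothetical data: one good load and one load of the wrong file."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE audit_dim (
            audit_key   INTEGER PRIMARY KEY,
            job_name    TEXT,
            source_file TEXT
        );
        CREATE TABLE fact_sales (
            sale_id   INTEGER PRIMARY KEY,
            amount    REAL,
            audit_key INTEGER REFERENCES audit_dim (audit_key)
        );
        INSERT INTO audit_dim VALUES
            (1, 'load_sales', 'sales_ok.csv'),
            (2, 'load_sales', 'sales_wrong.csv');
        INSERT INTO fact_sales (amount, audit_key) VALUES
            (10.0, 1), (20.0, 1), (30.0, 1),
            (99.0, 2), (98.0, 2);
    """)
    return conn
```

Calling delete_rows_loaded_from(conn, "sales_wrong.csv") on this example database removes only the rows tagged with the bad load and leaves the rest of the fact table untouched.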

Licensed under: CC-BY-SA with attribution