Data warehouse multivalued attributes

https://stackoverflow.com/questions/23688811

23-07-2023

Question

Disclaimer: I have never created a data warehouse before. I have read several chapters of Kimball's Data Warehouse Toolkit.

Background: Plant (factory) management team needs to be able to slice and dice production information in various ways, and we want a consistent reporting format across manufacturing plants in our division. Through business analysis, we have concluded that the fact grain is 1 row per process completed. A completed process can either mean "machine" or "assemble." I am calling this the "Production fact".

The questions that the business needs to answer are the following:

Who was working when the process completed?
What was the cycle time of the process?
What is the serial number of the part that was being produced by the process?

My schema includes the following first-level dimensions. I do not have any dimensions beyond the first level, but there are some cross relations between the plant dimension and the part type, shift, and process dimensions.

Part Type (Attributes: Surrogate Key, Part Number, Model, Variant, Part Name)
Plant (Attributes: Surrogate Key, Plant Name, Plant Acronym)
Shift (Attributes: Surrogate Key, Plant Key, Start Hour24, Start Minute, End Hour24, End Minute)
Process (Attributes: Surrogate Key, Plant Key, Production line, Process Group, Process Name, Machine Type)
Date (typical date dimension attributes)
Time of Day (typical time of day dimension attributes)

The non-dimensional facts are:

Part serial Number (instances of a part type)
Cycle time
Employee ID(s) *MULTI-VALUED*

Problem

My problem is that more than one employee may have been working the process at the time. So, I am wondering if I need to change my model and how to best represent the employee in the model. We are not trying to house employee information, just their company employee ID. I've considered the following options:

1. Allow for multiple employee IDs in the employee column of the fact table (e.g. comma separated). Disadvantage: the number of employees working on the process varies. Would I need to make the field big enough to accommodate up to X employees? What should X be?
2. Create a record for each production fact per employee. This would mean more than one record for the same fact; that would be bad. :)
3. Create an employee dimension and a "Process Employees" bridge table between the employee dimension table and the fact table. Problem: the employees working on the process at the time are not represented in the fact table.
4. Create an Employee dimension, a Process Employees Group table, and a bridge table between the Process Employees Group table and the Employee dimension table. The employee group and bridge tables would need to be a) pre-populated with all possible employee combinations (not practical on any level, since we have thousands of employees) or b) populated on the fly during ETL. 4b would require a check to see whether a given group of employees already exists for each process; this might be taxing on the DBMS/ETL system if the source records are batched more frequently than a few times per day (e.g. 10 times per hour for near real-time reporting).
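
To make the cost of the on-the-fly check in option 4b more concrete, here is a rough sketch using Python's built-in sqlite3 module. The table names (employee_group, employee_group_bridge), the member_signature column, and the get_or_create_group helper are all hypothetical and not part of the schema above. It also assumes employee IDs never contain a comma. The idea is that each distinct set of employees gets a canonical signature, so the "does this group already exist?" check becomes one lookup on a unique index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_group (
    group_key        INTEGER PRIMARY KEY,
    member_signature TEXT NOT NULL UNIQUE   -- hypothetical: sorted IDs joined by ','
);
CREATE TABLE employee_group_bridge (
    group_key   INTEGER NOT NULL REFERENCES employee_group(group_key),
    employee_id TEXT NOT NULL,
    PRIMARY KEY (group_key, employee_id)
);
""")


def get_or_create_group(conn, employee_ids):
    """Return the key of the group holding exactly these employees,
    creating the group and its bridge rows if it does not exist yet."""
    members = sorted(set(employee_ids))
    signature = ",".join(members)  # assumes IDs contain no commas
    row = conn.execute(
        "SELECT group_key FROM employee_group WHERE member_signature = ?",
        (signature,),
    ).fetchone()
    if row is not None:
        return row[0]
    with conn:  # commit the group row and its bridge rows together
        cur = conn.execute(
            "INSERT INTO employee_group (member_signature) VALUES (?)",
            (signature,),
        )
        group_key = cur.lastrowid
        conn.executemany(
            "INSERT INTO employee_group_bridge (group_key, employee_id) VALUES (?, ?)",
            [(group_key, emp) for emp in members],
        )
    return group_key


# The same set of employees in a different order maps to the same group.
first = get_or_create_group(conn, ["E200", "E100"])
second = get_or_create_group(conn, ["E100", "E200"])
solo = get_or_create_group(conn, ["E100"])
```

The fact table would then carry group_key as an ordinary foreign key. Whether this lookup is cheap enough at ten batches per hour depends on the DBMS and the data volume, which this sketch does not measure.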

My Question(s)

I'm thinking that option 3 is the most viable option, but I have some reservations. Are there potential watch-outs? Are there other alternatives that I should consider? Is it okay to take the employees who worked on the process out of the fact table?

Thank you for any advice.

Solution 2

I've had time to think about my options, and none of the 4 options listed in my original post is correct. The problem discussed seems to be a classic "coverage" problem: the business needs to know which employees were working which processes at a given time. If we have that information, we will know who was working on a particular part when a given process completed. This would best be represented as a factless fact table between an employee dimension and the production process dimension.

This approach also helps me save space and improve querying power, because a single employee "coverage" fact will span multiple process production facts.
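
Here is a minimal sketch of how the coverage lookup might work, using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the table names (production_fact, employee_coverage_fact), the column names, the sample rows, and the use of plain ISO-8601 timestamp strings instead of keys into the Date and Time of Day dimensions. A coverage row says that an employee was on a process for an interval, and a production fact is matched to every coverage row whose interval contains its completion time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE production_fact (
    part_serial_number TEXT,
    process_key        INTEGER,
    completed_at       TEXT,    -- ISO-8601 timestamp (assumed representation)
    cycle_time_seconds REAL
);
-- Factless fact table: no measures, only "employee covered process from/to".
CREATE TABLE employee_coverage_fact (
    employee_id    TEXT,
    process_key    INTEGER,
    coverage_start TEXT,
    coverage_end   TEXT
);
""")

# Made-up sample data: three completed processes, two coverage intervals.
conn.executemany("INSERT INTO production_fact VALUES (?, ?, ?, ?)", [
    ("SN-001", 10, "2014-05-15T08:30:00", 42.0),
    ("SN-002", 10, "2014-05-15T09:45:00", 40.5),
    ("SN-003", 10, "2014-05-15T13:10:00", 41.2),
])
conn.executemany("INSERT INTO employee_coverage_fact VALUES (?, ?, ?, ?)", [
    ("E100", 10, "2014-05-15T06:00:00", "2014-05-15T14:00:00"),
    ("E200", 10, "2014-05-15T09:00:00", "2014-05-15T12:00:00"),
])


def employees_for_part(conn, serial):
    """Employees covering the process when the given part's process completed."""
    rows = conn.execute("""
        SELECT c.employee_id
        FROM production_fact AS p
        JOIN employee_coverage_fact AS c
          ON c.process_key = p.process_key
         AND p.completed_at >= c.coverage_start
         AND p.completed_at <  c.coverage_end
        WHERE p.part_serial_number = ?
        ORDER BY c.employee_id
    """, (serial,)).fetchall()
    return [r[0] for r in rows]
```

With this sample data, two coverage rows describe who worked all three production facts, which is where the space saving comes from. The interval join relies on ISO-8601 strings sorting chronologically; a real warehouse would more likely join through date and time-of-day keys.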

OTHER TIPS

There is a concept called slowly changing dimensions, and it applies here. Treat the part-to-employee assignment as such a dimension, stored in a table that I will call PartEmployee.

The structure of this table will be:

PartId - PK
EmployeeId - PK
EmployeeStartDate - PK
EmployeeEndDate

The End Date will be null if the employee is still working on the part. When a new employee starts working on the part, the previous employee record for the part will be closed and a new record created for the part with the new employee.

Add an employee column to the PartFact table:

EmployeeId

This column will hold the current employee. The fact record will be updated every time a new employee starts working on the part.

This will give you the historical perspective of which employees worked on the part and also the information of the employee who worked on the part last.
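
A rough sketch of the close-and-open step described above, using Python's built-in sqlite3 module. The PartEmployee and PartFact columns follow the structure given in this answer; the assign_employee helper, the text date format, and the sample IDs are assumptions made for illustration. PartFact is reduced to the two columns that matter here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PartEmployee (
    PartId            INTEGER NOT NULL,
    EmployeeId        TEXT    NOT NULL,
    EmployeeStartDate TEXT    NOT NULL,
    EmployeeEndDate   TEXT,               -- NULL while the assignment is open
    PRIMARY KEY (PartId, EmployeeId, EmployeeStartDate)
);
CREATE TABLE PartFact (
    PartId     INTEGER PRIMARY KEY,
    EmployeeId TEXT                       -- current employee only
);
""")


def assign_employee(conn, part_id, employee_id, start_date):
    """Close the open assignment for the part, open a new one,
    and point the fact row at the new employee."""
    with conn:  # all three statements commit together
        conn.execute(
            "UPDATE PartEmployee SET EmployeeEndDate = ? "
            "WHERE PartId = ? AND EmployeeEndDate IS NULL",
            (start_date, part_id),
        )
        conn.execute(
            "INSERT INTO PartEmployee (PartId, EmployeeId, EmployeeStartDate) "
            "VALUES (?, ?, ?)",
            (part_id, employee_id, start_date),
        )
        conn.execute(
            "INSERT OR REPLACE INTO PartFact (PartId, EmployeeId) VALUES (?, ?)",
            (part_id, employee_id),
        )


# Hypothetical usage: E100 starts on part 1, then E200 takes over.
assign_employee(conn, 1, "E100", "2014-05-01")
assign_employee(conn, 1, "E200", "2014-05-10")

history = conn.execute(
    "SELECT EmployeeId, EmployeeStartDate, EmployeeEndDate "
    "FROM PartEmployee WHERE PartId = 1 ORDER BY EmployeeStartDate"
).fetchall()
current = conn.execute(
    "SELECT EmployeeId FROM PartFact WHERE PartId = 1"
).fetchone()[0]
```

Note that this sketch, like the description above, tracks one open employee per part at a time; several employees working the same part simultaneously would need a different closing rule.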

Hope this helps...

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow