Question

I'm working with a star schema for a data warehouse and I am running into a problem with header and line items from different data sources.

CREATE TABLE DataSourceAHeader
(
     OrderId INT NOT NULL
    ,TotalCost MONEY NOT NULL
    -- Date, etc...
);

CREATE TABLE DataSourceALine
(
     OrderId INT NOT NULL
    ,LineNumber INT NOT NULL
    -- Dates, etc...
);

CREATE TABLE DataSourceBLine
(
     OrderId INT NOT NULL
    ,Cost MONEY NOT NULL
    ,LineNumber INT NOT NULL
);

Data sources A and B represent the same data in different ways. Data source A contains headers and line items, but it records the net outcome (TotalCost) only in the header. Data source B contains only line items, and each item has its own outcome (Cost).

I could keep two fact tables (one for the headers and one for the line items), but from what I have read this seems inadvisable. Is there a strategy for dealing with this kind of mismatched format, or should the sources be stored in separate data warehouses (one warehouse per data source)?

My current strategy:

CREATE TABLE Fact.[Order]
(
     Id BIGINT IDENTITY PRIMARY KEY
    ,OrderId INT NOT NULL
    ,Cost MONEY NOT NULL
    -- Date key, etc...
);

CREATE TABLE Fact.OrderLine
(
     Id BIGINT IDENTITY PRIMARY KEY
    ,OrderFactId BIGINT NOT NULL REFERENCES Fact.[Order] (Id)
    ,LineNumber INT NOT NULL
    -- related line stuff
);

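DataSourceAHeader and DataSourceBLine rows are inserted into Fact.Order, and their line items go into Fact.OrderLine. Each DataSourceBLine row is loaded as its own Fact.Order row with a single matching Fact.OrderLine row.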

Here is an example for DataSourceAHeader and DataSourceALine:

SELECT * FROM Fact.[Order];
|------------------------------------|
|   Id   |   OrderId   |   Cost      |
|   1    |     1100    |   12000.00  |
|   2    |     1101    |   10000.00  |
|------------------------------------|

SELECT * FROM Fact.OrderLine;
|-------------------------------------------|
|   Id   |   OrderFactId   |   LineNumber   |
|   1    |        1        |       1        |
|   2    |        1        |       2        |
|   3    |        1        |       3        |
|   4    |        2        |       1        |
|   5    |        2        |       2        |
|   6    |        2        |       3        |
|-------------------------------------------|

Here is an example for DataSourceBLine:

SELECT * FROM Fact.[Order];
|---------------------------------|
|   Id   |   OrderId   |   Cost   |
|   1    |     1000    |   12.00  |
|   2    |     1000    |   10.00  |
|---------------------------------|

SELECT * FROM Fact.OrderLine;
|-------------------------------------------|
|   Id   |   OrderFactId   |   LineNumber   |
|   1    |        1        |       1        |
|   2    |        2        |       2        |
|-------------------------------------------|

Edit:

The TotalCost in the header cannot be allocated down to the line level. I chatted with an architect acquaintance, and his advice was to implement two separate fact tables, one for the headers (summary) and one for the lines (detail), and simply store NULL values for the line information that DataSourceA lacks.
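
A minimal sketch of how I understand that advice, in the same T-SQL style as above. The table names Fact.OrderSummary and Fact.OrderDetail are hypothetical, and rolling DataSourceB costs up into the summary table is my assumption rather than part of his advice:

```sql
-- Hypothetical names; a sketch of the summary/detail split, not a vetted design.
CREATE TABLE Fact.OrderSummary
(
     Id BIGINT IDENTITY PRIMARY KEY
    ,OrderId INT NOT NULL
    ,TotalCost MONEY NOT NULL   -- header total from A; assumed SUM(Cost) per order for B
    -- Date key, etc...
);

CREATE TABLE Fact.OrderDetail
(
     Id BIGINT IDENTITY PRIMARY KEY
    ,OrderId INT NOT NULL
    ,LineNumber INT NOT NULL
    ,Cost MONEY NULL            -- NULL for DataSourceA lines, which carry no cost
    -- related line stuff
);
```

With this shape, cost questions for DataSourceA can only be answered from the summary table, while DataSourceB could be answered from either.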

Edit2:

I'm trying to keep OrderId generic, since I have several more data sources whose OrderId schemes may collide. I have implemented a mapping table to translate the source identifiers into warehouse identifiers.
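
For illustration, the mapping table looks roughly like the sketch below. The names dbo.OrderMapping, DataSource, SourceOrderId and WarehouseOrderId are placeholders chosen for this example, not the real schema:

```sql
-- Hypothetical mapping table; names and types are illustrative assumptions.
CREATE TABLE dbo.OrderMapping
(
     WarehouseOrderId BIGINT IDENTITY PRIMARY KEY   -- identifier used inside the warehouse
    ,DataSource CHAR(1) NOT NULL                    -- e.g. 'A' or 'B'
    ,SourceOrderId INT NOT NULL                     -- OrderId as it appears in the source
    ,CONSTRAINT UQ_OrderMapping UNIQUE (DataSource, SourceOrderId)
);

-- Translate a source identifier into its warehouse identifier.
SELECT WarehouseOrderId
FROM dbo.OrderMapping
WHERE DataSource = 'B'
  AND SourceOrderId = 1000;
```

The unique constraint on (DataSource, SourceOrderId) is what keeps the same OrderId from two different sources from colliding.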

Edit3:

With the intention that this question be helpful to more than just myself, I would like the answer to have the following details (mostly to compile what everyone has already reasoned about):

  • In general, what are the approaches to reconciling related but disjoint data sets that take a summary/detail form (a single fact table versus separate summary and detail fact tables)?
  • What are the drawbacks of each approach?
  • What kind of structure could the fact table take to cope with missing (or irrelevant) data?
  • (two fact table approach) In what cases would it be prudent to roll down the summary versus rolling up the details?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange