Question

I am using Talend to populate a data warehouse. My job is writing customer data to a dimension table and transaction data to the fact table. The surrogate key (p_key) on the fact table is auto-incrementing. When I insert a new customer, I need my fact table to reflect the id of the related customer.

As I mentioned, my p_key is auto-incrementing, so I can't just insert an arbitrary value for the p_key.

Any thoughts on how I can insert a row into my dimension table and still retrieve the primary key to reference in my fact record?

More info:

What if the incoming data isn't normalized? For instance, I have a CSV with the following data:

order #   date        total   customer#   first_name   last_name
111       1/2/2010    500     101         John         Smith
222       1/3/2010    600     101         John         Smith

Obviously, I want the customer info to appear in the dimension table, and the transaction data in the fact table:

dimension
101  john smith

fact
111       1/2/2010
222       1/3/2010

As you mentioned, the key of the dimension table will be auto-incrementing. The fact table needs to reference this key. How do you design the ETL job so that the surrogate key is returned after an insert?

Also, if the customer data is deduped (as above), how do you handle the keys?


Solution

I may have misunderstood your problem, however:

  1. A fact table may or may not have an auto-incrementing PK; usually the PK of a fact table is a composite of several FKs referencing dimension tables.

  2. A dimension table should have an auto-incrementing PK.

  3. A new customer should "land" into the customer dimension table before the transaction fact reaches the DW (or at least the fact table).

  4. A dimension table should have a BusinessKey which uniquely identifies a customer -- like email, full name + pin, or similar.

  5. An incoming transaction row should have the customer BusinessKey field too -- that's how we identify the customer.

  6. Use the BusinessKey to lookup the customer PrimaryKey from the customer dimension table before inserting the transaction into the fact table.
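
As a rough illustration of steps 2, 4 and 6, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for a real warehouse. The table and column names (dim_customer, customer_business_key) and the helper get_or_create_customer_key are made up for this example; in Talend you would build the same logic from the job's own lookup and output components rather than hand-written code.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key          INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_business_key TEXT NOT NULL UNIQUE,               -- business key
        first_name            TEXT,
        last_name             TEXT
    )
""")


def get_or_create_customer_key(conn, business_key, first_name, last_name):
    """Return the surrogate key for a business key, inserting the customer if new."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_business_key = ?",
        (business_key,),
    ).fetchone()
    if row is not None:
        return row[0]  # customer already in the dimension
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_business_key, first_name, last_name) "
        "VALUES (?, ?, ?)",
        (business_key, first_name, last_name),
    )
    return cur.lastrowid  # key generated by the database for the new row


first = get_or_create_customer_key(conn, "john_smith_101", "John", "Smith")
second = get_or_create_customer_key(conn, "john_smith_101", "John", "Smith")
```

Both calls return the same surrogate key, and the dimension ends up with a single row for the customer. The point is that the fact load never needs to guess the generated key; it always asks the dimension for it by business key.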

EDIT

If your new customer data is bundled with the transaction, find a way to extract customer data and route it to the DW ahead of the transaction.

UPDATE:

Load dimCustomer first, decide on BusinessKey -- so the dimension would look like:

CustomerKey = 12345 (auto-incremented)
CustomerBusinessKey = john_smith_101 (must uniquely identify this John Smith)
CustomerFirstName = John
CustomerLastName = Smith

During the dimension loading process, you have to segregate incoming rows into two streams: existing and new customers. Rows from the "existing customer" stream update the dim table (type 1 SCD), while rows from the "new customer" stream are inserted. There should be no duplicates in the stream of rows that are being inserted; you can accomplish this by inserting them into a staging table and removing duplicates there, just before the final insert into the dimension table. You can also extract the duplicates and route them back into the loading process to update customer records; they may contain newer data -- like updated phone numbers or similar.
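
The splitting described above could look roughly like this in plain Python. This is only a sketch of the routing logic: the function name split_customer_rows, the row layout, and the sample customer jane_doe_55 are all hypothetical, and a set of known business keys stands in for a lookup against the dimension table.

```python
def split_customer_rows(incoming_rows, existing_business_keys):
    """Split incoming customer rows into updates, inserts and in-batch duplicates."""
    updates = []      # business key already in the dimension -> type 1 update
    inserts = {}      # first row seen for each new business key
    duplicates = []   # later rows for a key that is already queued for insert
    for row in incoming_rows:
        key = row["business_key"]
        if key in existing_business_keys:
            updates.append(row)
        elif key in inserts:
            duplicates.append(row)
        else:
            inserts[key] = row
    return updates, list(inserts.values()), duplicates


# Hypothetical batch: john_smith_101 arrives twice, jane_doe_55 already exists.
incoming = [
    {"business_key": "john_smith_101", "first_name": "John", "last_name": "Smith"},
    {"business_key": "john_smith_101", "first_name": "John", "last_name": "Smith"},
    {"business_key": "jane_doe_55", "first_name": "Jane", "last_name": "Doe"},
]
updates, inserts, duplicates = split_customer_rows(incoming, {"jane_doe_55"})
```

Here the insert stream holds one row for john_smith_101, the second copy lands in the duplicates list (where it can be routed back as an update), and jane_doe_55 goes to the update stream.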

Once the customer is in, load facts.

The fact table should look something like:

DateKey     (PK)
CustomerKey
OrderNumber (PK)
Total

I have used a composite primary key of DateKey and OrderNumber, which allows the order number sequence to reset from time to time.

During the loading process, modify the fact record to look something like:

DateKey    CustomerBusinessKey   OrderNumber   Total
20100201   john_smith_101        111           500
20100301   john_smith_101        222           600

At this point we need to replace the CustomerBusinessKey with the CustomerKey from the dimension table using a lookup. So, after the lookup the stream would look like:

DateKey    CustomerKey   OrderNumber   Total
20100201   12345         111           500
20100301   12345         222           600

This can now be inserted into the fact table.
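
For what it's worth, the lookup step can be pictured as a simple key replacement. In the sketch below a Python dict stands in for the lookup against the customer dimension. The function name resolve_customer_keys and the row layout are assumptions of mine, and sending unmatched rows to a reject list is just one possible way of handling a missing customer.

```python
def resolve_customer_keys(fact_rows, key_by_business_key):
    """Replace CustomerBusinessKey with CustomerKey; collect rows with no match."""
    resolved, rejected = [], []
    for row in fact_rows:
        customer_key = key_by_business_key.get(row["customer_business_key"])
        if customer_key is None:
            rejected.append(row)  # customer not loaded into the dimension yet
            continue
        resolved.append({
            "date_key": row["date_key"],
            "customer_key": customer_key,
            "order_number": row["order_number"],
            "total": row["total"],
        })
    return resolved, rejected


# Lookup built from the dimension: business key -> surrogate key.
key_by_business_key = {"john_smith_101": 12345}

staged = [
    {"date_key": 20100201, "customer_business_key": "john_smith_101",
     "order_number": 111, "total": 500},
    {"date_key": 20100301, "customer_business_key": "john_smith_101",
     "order_number": 222, "total": 600},
]
resolved, rejected = resolve_customer_keys(staged, key_by_business_key)
```

Because the dimension is loaded first, the reject list should normally stay empty; a non-empty list is a sign that a customer did not make it into the dimension before its transactions.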

I have also cheated a bit -- I did not look up a date key from dimDate, and did not look for existing rows in the fact table. When loading the fact table, you can look for an existing (DateKey, OrderNumber) combination before loading, or you can leave it up to the primary key to protect against duplicates -- your choice. In any case, make sure that an attempt to re-load the same data into the fact table fails.
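
To show the "leave it up to the primary key" option, here is a small sketch, again using Python's sqlite3 module as a stand-in for the warehouse. The table name fact_order and the helper load_facts are hypothetical; the composite key on (date_key, order_number) mirrors the fact table layout above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_order (
        date_key     INTEGER NOT NULL,
        customer_key INTEGER NOT NULL,
        order_number INTEGER NOT NULL,
        total        NUMERIC,
        PRIMARY KEY (date_key, order_number)  -- composite key blocks re-loads
    )
""")


def load_facts(conn, rows):
    """Insert a batch of fact rows; the whole batch is rolled back on any error."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO fact_order (date_key, customer_key, order_number, total) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


rows = [(20100201, 12345, 111, 500), (20100301, 12345, 222, 600)]
load_facts(conn, rows)

try:
    load_facts(conn, rows)  # second attempt with the same data
    reload_succeeded = True
except sqlite3.IntegrityError:
    reload_succeeded = False  # the primary key rejected the duplicate batch
```

The second load raises an integrity error and leaves the table with the original two rows, which is the behaviour you want from a re-run of the same file.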

Licensed under: CC-BY-SA with attribution