Question

Background

I am designing a Data Warehouse with SQL Server 2012 and SSIS. The source system handles hotel reservations. The reservations are split between two tables, header and header line item. The Fact table would be at the line item level with some data from the header.

The issue

The challenge I have is that the reservation (and its line items) can change over time.

An example would be:

  • The booking is created.
  • A room is added to the booking (as a header line item).
  • The customer arrives and adds food/drinks to their reservation (more line items).
  • A payment is added to the reservation (as a line item).
  • A room could be subsequently cancelled and removed from the booking (a line item is deleted).
  • The number of people in a room can change, affecting that line item.
  • The booking status changes from "Provisional" to "Confirmed" at a point in its life cycle.

Those last three points are key: I'm not sure how I would keep that line item up to date without looking up the record and updating it. The business would like to keep track of the updates and deletions.

I'm resisting updating because:

  1. From what I've read about Fact tables, it's not good practice to revisit rows once they've been written into the table.
  2. I could do this with a Lookup component, but with upwards of 45 million rows, is that the best approach?

The questions

  1. What type of Fact table or loading solution should I go for?
  2. Should I be updating the records, if so how can I best do that?

I'm open to any suggestions!

Additional Questions (following answer from ElectricLlama):

  1. The fact does have a 1:1 relationship with the source. You talk about possible constraints on future development. Would you be able to elaborate on the type of constraints I would face?
  2. Each line item will have a modified date (and a created date). Are you saying that I should delete all records from the fact table that have been modified since the last import and add them again (sounds logical)?
  3. If the answer to 2 is "yes" then for auditing purposes would I write the current fact records to a separate table before deleting them?
  4. In point one, you mention deleting/inserting the last x days' bookings based on reservation date. I can understand inserting new bookings; I'm just trying to understand why I would delete?

Solution

If you effectively have a 1:1 relationship between the source line and the fact, and you store a source-system booking code in the fact (no dimensional modelling rule forbids that), then I suggest a two-step load process.

  1. Delete/insert the last x days' bookings based on reservation date (or whatever you consider to be the primary fact date).

  2. Delete/insert based on all source booking codes that have changed (you will, of course, need to identify these beforehand). A T-SQL sketch of both steps follows this list.
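A minimal sketch of that two-step load is below. All object names (dbo.FactReservationLine, staging.ReservationLines, staging.ChangedBookings), the column list, and the 7-day window are illustrative assumptions, not names from the question:

```sql
-- Step 1: clear and reload the trailing window, keyed on the primary fact date.
DECLARE @TrailingDays int = 7;  -- window size is an assumption; tune to your load cadence

DELETE f
FROM dbo.FactReservationLine AS f
WHERE f.ReservationDate >= DATEADD(DAY, -@TrailingDays, CAST(GETDATE() AS date));

-- Step 2: clear any rows belonging to source booking codes known to have changed.
DELETE f
FROM dbo.FactReservationLine AS f
WHERE f.SourceBookingCode IN (SELECT c.SourceBookingCode FROM staging.ChangedBookings AS c);

-- Reload both sets from staging in one pass.
INSERT INTO dbo.FactReservationLine (SourceBookingCode, ReservationDate, RoomKey, GuestCount, Amount)
SELECT s.SourceBookingCode, s.ReservationDate, s.RoomKey, s.GuestCount, s.Amount
FROM staging.ReservationLines AS s
WHERE s.ReservationDate >= DATEADD(DAY, -@TrailingDays, CAST(GETDATE() AS date))
   OR s.SourceBookingCode IN (SELECT c.SourceBookingCode FROM staging.ChangedBookings AS c);
```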

You just need to consider what constraints this puts on future development, i.e. when you add further source systems, you'll need to maintain the 1:1 fact-to-source-line relationship to keep your load process consistent.

I've never updated a fact record in a data load process; I always delete/insert a certain data domain (e.g. the trailing 20 days, or a set of source-system booking codes). This is effectively the same as an update, but it also takes care of deletes.

With regards to auditing changes in the source, I suggest you write that to a different table altogether, not the main fact, as its purpose is audit, not analysis.

The requirement to identify changed records in the source (for data loads and auditing) implies you will need to create triggers and log tables in the source, or enable native SQL Server change data capture (CDC) if possible.
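If the source is SQL Server and your edition supports it, native CDC is enabled with two system procedures. A minimal sketch, assuming a placeholder source table name of dbo.ReservationLineItem:

```sql
-- Enable change data capture at the database level (requires sysadmin).
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on the source line-item table (requires db_owner).
-- @supports_net_changes = 1 requires a primary key or unique index on the table.
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'ReservationLineItem',  -- placeholder table name
    @role_name            = NULL,                    -- no gating role
    @supports_net_changes = 1;
```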

Avoid the SSIS Lookup component at all costs, as it is ineffective at this scale and would certainly be unable to operate on 45 million rows.

Stick with the 'delete/insert a data portion' approach, as it plays to SSIS's strength at inserting and deleting (and works around its inability to update or look up efficiently).

In answer to the follow up questions:

  1. 1:1 relationship in the fact: What I'm getting at is that you have no visibility of any future systems that need to be integrated, or of what future upgrades to your existing source system might do. This 1:1 mapping introduces a design constraint (it's not really a constraint, more a framework). Thinking about it, any new system does not need to follow this particular load design, as long as its data arrives in the fact consistently. I think implementing this 1:1 design is a good idea; I'm just trying to consider any downside.

  2. If your source has a reliable modified date, then you're in luck, as you can do a differential load, i.e. only load changed records. I suggest you:

    1. Load all recently modified records (last 5 days?) into a staging table
    2. Do a DELETE/INSERT based on the record key. Run the delete inside SSIS in an Execute SQL Task; don't mess about with feeding data flows into row-by-row delete statements (see the T-SQL sketch after these answers).
  3. Audit table:

The simplest and most accurate way to do this is to implement triggers and logs in the source system and keep them totally separate from your star schema.

If you do want this captured as part of your load process, I suggest you compare your staging table with the existing audit table and only write new audit rows, e.g. reservation X's last modified date in the audit table is 2 Apr, but its last modified date in the staging table is 4 Apr, so write this change as a new record to the audit table. Note that if you do a daily load, any changes in between won't be recorded; that's why I suggest triggers and logs in the source.
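As a rough illustration of that comparison, assuming hypothetical audit.ReservationChanges and staging.ReservationLines tables keyed on LineItemKey:

```sql
-- Write an audit row only when the staged version is newer than anything already audited.
INSERT INTO audit.ReservationChanges (LineItemKey, SourceBookingCode, ModifiedDate, GuestCount, Status)
SELECT s.LineItemKey, s.SourceBookingCode, s.ModifiedDate, s.GuestCount, s.Status
FROM staging.ReservationLines AS s
WHERE NOT EXISTS (SELECT 1
                  FROM audit.ReservationChanges AS a
                  WHERE a.LineItemKey = s.LineItemKey
                    AND a.ModifiedDate >= s.ModifiedDate);
```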

  4. DELETE/INSERT records in the fact:

This is more about ensuring you have an overlapping window in your load process, so that if the process fails for a couple of days (as they always do), you have some contingency there, and it will seamlessly pick the process back up once it's working again. This is not so important in your case, as you have a modified date to identify differential changes, but normally, for example, I would pick a transaction date and delete, say, the trailing 7 days. This means my load process can be broken for 6 days, and if I fix it by the seventh day, everything will reload properly without needing extra intervention to load the intermediate days.
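For the Execute SQL Task mentioned in point 2 above, what I have in mind is a single set-based delete followed by an insert, roughly like the sketch below; every object name here is a placeholder:

```sql
-- Run as one Execute SQL Task after the staging table has been loaded.
BEGIN TRANSACTION;

-- Remove any fact rows whose key appears in the freshly staged (recently modified) data.
DELETE f
FROM dbo.FactReservationLine AS f
WHERE EXISTS (SELECT 1
              FROM staging.ReservationLines AS s
              WHERE s.LineItemKey = f.LineItemKey);

-- Re-insert the current version of those rows from staging.
INSERT INTO dbo.FactReservationLine (LineItemKey, SourceBookingCode, ReservationDate, GuestCount, Amount)
SELECT s.LineItemKey, s.SourceBookingCode, s.ReservationDate, s.GuestCount, s.Amount
FROM staging.ReservationLines AS s;

COMMIT TRANSACTION;
```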

OTHER TIPS

I would suggest having a deleted flag and updating it instead of physically deleting rows. Your performance will also be better.

This will enable you to analyse how reservations change over time. You will need to ensure that the flag is applied in every analysis so that there is no confusion.
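As a sketch of that soft-delete approach, assuming a hypothetical IsDeleted flag on the fact and a staging table listing the removed line items:

```sql
-- Flag removed line items instead of physically deleting them.
UPDATE f
SET f.IsDeleted   = 1,
    f.DeletedDate = GETDATE()
FROM dbo.FactReservationLine AS f
WHERE f.LineItemKey IN (SELECT d.LineItemKey FROM staging.DeletedLineItems AS d);

-- Every analysis query must then filter on the flag, e.g.
-- SELECT SUM(Amount) FROM dbo.FactReservationLine WHERE IsDeleted = 0;
```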

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow