Question

I need to build out a warehouse structure that is populated by periodic feeds consisting of a mix of scraped data, authority sources, and primary sources. In order to prioritize which data should be used for a data mart, I need to keep track of each data point's creation (when did we get this data?), update (when did the data change?), and latest match (when was the last time this data showed up?). From my research it looks like 5NF is the level of normalization I should use.

For example, I may get a feed which gives student names along with a list of their favorite classes and a rank order. Another feed may come in with the same data, but might have an incomplete list of classes, or lack rank order. Every feed would come in with a unique and reliable student identifier.

I was thinking there would be at least three tables for just that data:

  1. Student (s_id, name)
  2. Student_Class (s_id, class_id)
  3. Student_Class_Order (s_id, class_id, order_num)

Each of those would also have columns for isDeleted, created_time, created_feed_id, modified_time, modified_feed_id, matched_time, matched_feed_id.
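
For concreteness, here is a rough sketch of what I picture in generic SQL (the data types, lengths and constraints are just placeholders):

    -- Sketch only: types and constraints are placeholders.
    CREATE TABLE Student (
        s_id             INT          NOT NULL PRIMARY KEY,
        name             VARCHAR(200) NOT NULL,
        isDeleted        BIT          NOT NULL DEFAULT 0,
        created_time     DATETIME     NOT NULL,
        created_feed_id  INT          NOT NULL,
        modified_time    DATETIME     NULL,
        modified_feed_id INT          NULL,
        matched_time     DATETIME     NULL,
        matched_feed_id  INT          NULL
    );

    -- Student_Class and Student_Class_Order repeat the same tracking columns.
    CREATE TABLE Student_Class (
        s_id     INT NOT NULL REFERENCES Student (s_id),
        class_id INT NOT NULL,
        -- isDeleted, created_*, modified_*, matched_* as above
        PRIMARY KEY (s_id, class_id)
    );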

What kind of structure would make the most sense? It seems like any new combination of data points should be inserted, and I should not have updates (except for a soft-delete flag); the load pattern I have in mind is sketched below. Any ideas?
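
For example, the per-feed load for Student_Class might look roughly like this; feed_stage, @now and @feed_id are placeholder names for the staged feed rows and the feed metadata, not anything I already have:

    -- Sketch only: insert combinations we have not seen before.
    INSERT INTO Student_Class
        (s_id, class_id, isDeleted, created_time, created_feed_id)
    SELECT f.s_id, f.class_id, 0, @now, @feed_id
    FROM   feed_stage AS f
    WHERE  NOT EXISTS (
               SELECT 1
               FROM   Student_Class AS sc
               WHERE  sc.s_id = f.s_id
                 AND  sc.class_id = f.class_id
           );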

Solution

The short answer is: define your end result (the requirement) and work backwards from there.

Normalization goes out the window pretty quickly with a DW schema.

In my case we have 3 schemas for our ETL.

  1. Staging: Tables are extracted into our DW exactly as they appear in the source system, plus a few columns for metadata as you describe, the key one being the extract time.

  2. Transform: These tables have exactly the same schema as the end-result dimension and fact tables. At this point we are denormalising data from an OLTP schema into a DW star schema. Again, each table has fields for ETL metadata, and the original extract datetime can travel with the records.

  3. Store: This is our end result, a DW star schema which is read by our OLAP cube. To load our data we compare the transform tables with the final datastore tables. New rows get inserted. If a row has changed, we retire the old version and insert the new one. To do this we have fields for is_current, a valid_from datetime and a valid_to datetime.
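
For example, the compare-and-load for a single dimension could look roughly like this (T-SQL-style syntax; DimStudent, TransformStudent and @load_time are made-up names based on your question, not our actual objects):

    -- Sketch only: retire changed rows, then insert new rows and new versions.
    UPDATE d
    SET    d.is_current = 0,
           d.valid_to   = @load_time
    FROM   DimStudent AS d
    JOIN   TransformStudent AS t
           ON t.s_id = d.s_id
    WHERE  d.is_current = 1
      AND  t.name <> d.name;

    INSERT INTO DimStudent (s_id, name, is_current, valid_from, valid_to)
    SELECT t.s_id, t.name, 1, @load_time, NULL
    FROM   TransformStudent AS t
    WHERE  NOT EXISTS (
               SELECT 1
               FROM   DimStudent AS d
               WHERE  d.s_id = t.s_id
                 AND  d.is_current = 1
           );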

Technically we don't need three schemas. It's just a convention, and it allows us to keep things tidy.

Read about the Kimball methodology and Slowly Changing Dimensions (Type 1, Type 2, etc.). You will see examples of the fields you will need to store the required metadata.
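
As a rough contrast, again with made-up names: a Type 1 change simply overwrites the attribute and keeps no history, whereas a Type 2 change retires the old row and inserts a new version as in the sketch above, which is exactly why the extra fields exist:

    -- Sketch only: an SCD Type 1 change overwrites in place, no history fields needed.
    UPDATE DimStudent
    SET    name = @new_name
    WHERE  s_id = @s_id;

    -- An SCD Type 2 change is the retire-then-insert pattern shown above,
    -- which is what requires is_current, valid_from and valid_to.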

Talk to your business about what they want or need the DW to be capable of. Whether or not you need SCDs will have a significant impact on your schema, the metadata required, and ultimately what level of normalisation is possible.
