Question

I am new to working with SSIS packages and am confused about the best practice for finding the delta when bringing data from Landing to Staging. The requirement is to create a couple of reports from the Consolidated Data Store (CDS). The data flows from source to Landing to Staging to CDS through ETL packages, and the reports are finally built from the CDS.

I have successfully created packages to move data from source to Landing, as they are pretty straightforward. Moving the data from Landing to Staging is more confusing, because the modified date alone is not enough: two changes since the last ETL run can return the data to its previous state, meaning there is no real change since the last ETL run. For example, if a value changes from A to B and then back to A, the data has returned to its original state but the modifiedOn column has still changed.

So, should all the columns of a Landing row be compared to the corresponding columns in Staging, or only the columns that are relevant to, for example, delivering a report? Or is there another way to find the delta?

Please let me know if this is unclear or needs more details.


Solution

This is a question for your business. We expect the business to define what is considered a delta in our requirements document. For some it is only a few fields and for others it is everything. It depends on the business need. I would send the question to whoever gave you the requirement to begin with. If you understand your business well, you could include in the email a suggestion for what you think the delta should be; most of the time they are relieved not to have to figure it out themselves and will accept your suggestion. But only do that if you really understand the normal business needs associated with the data. You can also give them the pros and cons of the various possibilities to help them decide.

OTHER TIPS

Why do you want the exact delta? That is, why does it matter if a row already committed to your CDS is re-committed when no real change has happened? If you have no business reason (such as a reporting need) for this, it sounds like you're adding complexity where it isn't needed.

Anyway, if you really need this, I'd recommend calculating a CHECKSUM over the columns of interest and comparing the new row's checksum with the old row's checksum. This blog should help you understand how to use a checksum.
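To illustrate the idea outside of SSIS, here is a minimal Python sketch of hash-based delta detection. It is not the answerer's implementation: it swaps in a SHA-256 hash for SQL Server's CHECKSUM, and the column names (`id`, `status`, `amount`, `modifiedOn`), the `TRACKED` list, and the helpers `row_hash` and `find_delta` are all hypothetical names chosen for the example. Which columns belong in `TRACKED` is exactly the business decision discussed above.

```python
import hashlib

def row_hash(row, columns):
    """Hash only the tracked columns of a row (hypothetical helper)."""
    parts = []
    for col in columns:
        value = row.get(col)
        # Render NULL/None distinctly so it never collides with an empty string.
        parts.append("\x00" if value is None else str(value))
    # Use a separator so ("ab", "c") and ("a", "bc") hash differently.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def find_delta(landing_rows, staging_hashes, key, columns):
    """Return Landing rows that are new or whose tracked columns changed."""
    changed = []
    for row in landing_rows:
        if staging_hashes.get(row[key]) != row_hash(row, columns):
            changed.append(row)
    return changed

# Assumption: modifiedOn is deliberately excluded from the tracked columns.
TRACKED = ["status", "amount"]

# Hashes stored in Staging after the previous ETL run, keyed by business key.
staging_hashes = {
    1: row_hash({"id": 1, "status": "A", "amount": 10}, TRACKED),
    2: row_hash({"id": 2, "status": "A", "amount": 5}, TRACKED),
}

landing_rows = [
    # Went A -> B -> A: modifiedOn moved, but the tracked values did not.
    {"id": 1, "status": "A", "amount": 10, "modifiedOn": "2024-01-02"},
    # Genuinely changed.
    {"id": 2, "status": "B", "amount": 5, "modifiedOn": "2024-01-02"},
    # Not yet present in Staging.
    {"id": 3, "status": "A", "amount": 7, "modifiedOn": "2024-01-02"},
]

delta = find_delta(landing_rows, staging_hashes, key="id", columns=TRACKED)
delta_ids = [row["id"] for row in delta]
print(delta_ids)
```

In this sketch row 1 is skipped even though its modifiedOn value moved, because the hash covers only the tracked columns. Note that CHECKSUM in SQL Server can produce the same value for different inputs, so a changed row may occasionally go undetected; HASHBYTES is a commonly used alternative when that risk matters.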

Licensed under: CC-BY-SA with attribution