Question

I'm an ETL developer who uses several different tools for ETL tasks. The same question arises in all our projects: the importance of data profiling before the data warehouse is built and before the ETL for data movement is built. I have usually done data profiling (i.e. finding bad data, data anomalies, counts, distinct values, etc.) using pure SQL, because ETL tools do not provide a good alternative (our tools have some data quality components, but they are not very sophisticated). One option is to use the R programming language, SPSS Modeler, or similar tools for this kind of exploratory data analysis. But such tools are usually not available, or they are not suitable when there are millions of rows of data.

How can I do this kind of profiling using SQL? Are there any helper scripts available? How do you do this kind of exploratory data analysis before data cleaning and ETL?


Solution 2

I found a good tool for this purpose: DataCleaner. It seems to do most of the things I want to do with data in the EDA process.

OTHER TIPS

Load the data into a staging system and use the Data Profiling Task from SSIS. See http://gowdhamand.wordpress.com/2012/07/27/data-profiling-task-in-ssis/ for how to do the data analysis. Hope this helps.
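
For the pure-SQL route the question asks about, here is a minimal sketch of the usual per-column checks (row count, NULL count, distinct count, min/max, most frequent values). It is only an illustration, not a full profiler: the table `staging_customer`, the column `country`, and the helper names `profile_column` and `top_values` are made up for the example, and an in-memory SQLite database is used only so the snippet runs on its own. The SQL uses standard aggregates, so it should carry over to most databases with small syntax changes (for example `TOP` instead of `LIMIT` on SQL Server).

```python
import sqlite3


def profile_column(conn, table, column):
    """Return basic profile metrics for one column using plain SQL aggregates."""
    # Identifiers cannot be bound as parameters, so they are interpolated here.
    # Only pass trusted table/column names (e.g. read from the catalog).
    sql = f'''
        SELECT COUNT(*)                     AS row_count,
               COUNT(*) - COUNT("{column}") AS null_count,
               COUNT(DISTINCT "{column}")   AS distinct_count,
               MIN("{column}")              AS min_value,
               MAX("{column}")              AS max_value
        FROM "{table}"
    '''
    cur = conn.execute(sql)
    names = [d[0] for d in cur.description]
    return dict(zip(names, cur.fetchone()))


def top_values(conn, table, column, limit=5):
    """Return the most frequent values of a column with their counts."""
    sql = f'''
        SELECT "{column}", COUNT(*) AS n
        FROM "{table}"
        GROUP BY "{column}"
        ORDER BY n DESC, "{column}"
        LIMIT ?
    '''
    return conn.execute(sql, (limit,)).fetchall()


# Hypothetical staging data with typical problems:
# a NULL, an empty string, and inconsistent casing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_customer (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO staging_customer VALUES (?, ?)",
    [(1, "FI"), (2, "FI"), (3, "SE"), (4, None), (5, "fi"), (6, "")],
)

profile = profile_column(conn, "staging_customer", "country")
frequent = top_values(conn, "staging_customer", "country")
print(profile)
print(frequent)
```

Looping `profile_column` over the column names from the database catalog gives a quick first overview of a whole staging table. The frequency list is often what exposes anomalies such as `FI` versus `fi` or empty strings used in place of NULL.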

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow