Question

I'm looking for a good algorithm / method to check data quality in a data warehouse. I want an approach that "knows" the possible structure of the values, checks whether each value conforms to that structure, and then decides whether it is correct or not.

I thought about defining a regexp and then checking each value against it to see whether it fits or not.
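Roughly something like this (a minimal Python sketch; the five-digit postal code pattern is just a placeholder assumption, not my actual data):

```python
import re

# Hypothetical example: suppose a column is expected to hold five-digit postal codes.
POSTAL_CODE_PATTERN = re.compile(r"^\d{5}$")

def is_valid(value: str) -> bool:
    """Return True if the value matches the expected structure."""
    return bool(POSTAL_CODE_PATTERN.match(value))

for value in ["12345", "1234", "ABCDE", "98765"]:
    print(value, "->", "correct" if is_valid(value) else "not correct")
```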

Is this a good way to go? Are there good alternatives? (Any research papers?)


Solution

I have seen some authors suggest adding a special dimension, called a data quality dimension, to describe each fact table record further.

Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”
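As an illustration, assigning such a flag could look roughly like the sketch below (the numeric ranges are purely hypothetical; in practice the rules come from the business and the resulting flag, or rather a key to the corresponding dimension row, is stored on each fact record during ETL):

```python
# Minimal sketch: derive a data quality flag for a fact measure during ETL.
# The ranges below are purely hypothetical; real bounds come from business rules.
NORMAL_RANGE = (0.0, 1000.0)   # assumed plausible range
UNLIKELY_MAX = 10_000.0        # above NORMAL_RANGE but still conceivable

def quality_flag(measure: float) -> str:
    low, high = NORMAL_RANGE
    if low <= measure <= high:
        return "Normal value"
    if high < measure <= UNLIKELY_MAX:
        return "Unlikely value"
    return "Out-of-bounds value"

for measure in (42.0, 2500.0, -7.0):
    print(measure, "->", quality_flag(measure))
```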

OTHER TIPS

I would recommend using a dedicated data quality tool, like DataCleaner (http://datacleaner.eobjects.org), which I have been doing quite a lot of work on.

You need a tool that not only checks strict rules such as constraints, but also gives you a profile of your data and makes it easy to explore and identify inconsistencies on your own. Try, for example, the "Pattern finder", which will show you the patterns of your string values - something that often reveals outliers and erroneous values (a rough sketch of the idea follows below). You can also use the tool for actually cleansing the data, by transforming values, extracting information from them, or enriching them using third-party services. Good luck improving your data quality!
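To give a feel for what pattern finding does, here is a conceptual Python sketch (not DataCleaner's actual implementation; the sample values are made up): letters are collapsed to "a" and digits to "9", and rare patterns point at likely inconsistencies.

```python
import re
from collections import Counter

def pattern_of(value: str) -> str:
    """Reduce a string to a pattern: letters -> 'a', digits -> '9'."""
    pattern = re.sub(r"[A-Za-z]", "a", value)
    return re.sub(r"\d", "9", pattern)

values = ["DK-8000", "DK-2100", "DK-9000", "8000", "DK 8000"]
counts = Counter(pattern_of(v) for v in values)
for pattern, count in counts.most_common():
    print(pattern, count)
# Dominant pattern: "aa-9999"; the rare "9999" and "aa 9999" stand out as suspects.
```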

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow