Question

The question title may not be quite right, because part of my question is to gain a better understanding of the problem.

I am looking for the advantages of making sure that data imported into a database (simple example: an Excel table imported into an Access database) conforms to the same schema and is valid against the business requirements.

I have an Excel table containing non-normalised data and an Access database with normalised tables.

The Excel tables come from multiple third parties, none of which stick to the same format as each other or as the database.

Some of the sources also do not supply all the relevant data.

Example of what could be supplied

contact_key, date, contact_title, reject_name, reject_cost, count_of_unique_contact

count_of_unique_contact is derived from the distinct contact_title values and should not be imported. contact_key is sometimes not supplied. contact_title is sometimes unknown and is passed in as "n/a", "name = ??1342", "#N/A", etc., rather randomly. reject_name is often misspelled. Some fields are sometimes not supplied at all, e.g. date and contact_key are missing.
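To make the messiness concrete, here is a minimal sketch (in Python; the field names are taken from the example row above, and the placeholder list is an assumption) of the kind of per-row checks the data would need before import:

```python
# Sketch: flag obviously invalid or missing values before import.
# The set of placeholder titles is an assumption based on the
# examples above ("n/a", "#N/A", etc.); real rules would come
# from the business requirements.
INVALID_TITLES = {"n/a", "#n/a", "unknown", ""}

def is_valid_title(title):
    """Reject empty titles and known placeholder values."""
    if title is None:
        return False
    t = title.strip().lower()
    return t not in INVALID_TITLES and "??" not in t

def validate_row(row):
    """Return a list of problems found in one row (a dict)."""
    problems = []
    if not is_valid_title(row.get("contact_title")):
        problems.append("contact_title missing or placeholder")
    if not row.get("date"):
        problems.append("date missing: row unusable for monthly trends")
    if not row.get("contact_key"):
        problems.append("contact_key missing: cannot link to contacts")
    return problems
```

A row like `{"contact_title": "#N/A", "date": "", "contact_key": None}` would come back with all three problems, which is exactly the kind of report that shows why the data cannot support trend reporting.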

I am trying to find information to help explain the issues above: issues caused by incorrect data or missing fields that make it difficult to have useful data in the database, such as not being able to report a trend in reject costs per month when the date is not supplied. Normalising the Excel file is not an option available to me.

What I want is to require that the values and fields in the Excel files match the business requirements, and that the format be the same for every third party that sends them, but the request is falling on deaf ears.

I want to explain to the client that inputting fake data and constantly checking for invalid or already-existing rejects/contacts is wrong, and that doing so will fail, or at best be difficult without constant maintenance of a poor system.

Does anyone have any information on this problem?

Thanks


Solution

This is a common problem, referred to in data-processing circles as "garbage in, garbage out". Essentially, what you are running up against is that the data as given is of poor quality; you are correct to recognize that it will be hard (if not impossible) to extract any useful information from it.

To some extent, this is a problem that should be fixed at the source: whoever supplies your data needs to be convinced that the data quality must improve. In the short term, you can sanitize your data; the term refers to removing or cleaning the bad entries so that the remainder (the "good" data) can be imported into your database. Depending on what percentage of your data is bad, you may or may not be able to do useful things with the sanitized data once you import it.
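As a rough illustration of that sanitizing step (in Python; the `validate` rules here are stand-ins reusing the hypothetical fields from the question, not a definitive implementation), you might quarantine bad rows with the reasons rather than silently dropping them:

```python
# Sketch of a sanitizing pass: import only rows that pass validation,
# and quarantine the rest along with the reasons, so they can be
# reported back to the third party instead of vanishing silently.

def validate(row):
    """Stand-in rules; real ones come from the business requirements."""
    problems = []
    if not row.get("date"):
        problems.append("missing date")
    if not row.get("contact_key"):
        problems.append("missing contact_key")
    return problems

def sanitize(rows):
    """Split rows into importable ones and quarantined ones."""
    clean, quarantined = [], []
    for row in rows:
        problems = validate(row)
        if problems:
            quarantined.append({"row": row, "problems": problems})
        else:
            clean.append(row)
    return clean, quarantined
```

The quarantine list doubles as evidence for management: the proportion of rejected rows per supplier is a concrete measure of the data-quality problem.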

At some point, since you are not getting traction with management about the quality of the data, you will simply have to show them that the system is not working as intended because the data quality is bad. They will then need to improve their processes to improve the quality of the data you receive. Until then, though, keep pressing for better data; investigate sanitizing the data and see what you can do with what remains. Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow