Fixing bad data in a database - redo or incremental

https://softwareengineering.stackexchange.com/questions/261402

05-10-2020

Question

I have pseudo-ownership of a fairly old db (original data from 30 years ago; current design is >15 years old). In my opinion, the schema is pretty broken, and one of the implications of this is that there are a lot of inconsistencies/issues with the data. I'm planning to write a new schema and port the data across, which is a fairly simple task as new data comes in rarely.

Would you attempt to fix the inconsistencies in the old database first, or iron them out as part of the migration process? I'm tempted to go with the latter - since I'll need proper validation anyway, and some errors will pop out naturally with the different schema design - but fixing the data first would break the task into smaller chunks and allow people familiar with the old db to validate the fixes.

Thoughts?

Solution

Your latter strategy is likely the better choice. It will be difficult to find all the problems in the data while it is still resting in its current format.

I would treat this as an ETL process of sorts, combined with an iterative approach. Something like this:

1. Build a beta version of the schema.
2. Write a program to read the old data, scrub/transform it, and finally load it into the new schema. Log any new and unexpected problems with the data that your program's scrub/transform logic and/or the schema can't handle.
3. Review the problems detected in step #2 and change the schema and/or the ETL program.
4. Delete all the data that was loaded into the new schema.
5. Run the revised program against the revised schema.
6. Lather, rinse, and repeat until you and the data experts are comfortable with the schema design and the transformed data. Then move the data to production.
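
One iteration of that loop could be sketched roughly as follows. This is only an illustration: the `old_customer` and `customer` tables, their columns, and the DD/MM/YYYY date quirk are all made up for the example, and SQLite stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical schemas (assumptions for this sketch, not from the question):
# the legacy table accepts anything, the new table enforces constraints.
OLD_DDL = "CREATE TABLE old_customer (id INTEGER, name TEXT, signup TEXT)"
NEW_DDL = """
CREATE TABLE customer (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL CHECK (length(name) > 0),
    signup TEXT NOT NULL
           CHECK (signup GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
)
"""


def scrub(row):
    """Transform one legacy row; raise ValueError for data we can't handle."""
    rid, name, signup = row
    name = (name or "").strip()
    if not name:
        raise ValueError("missing name")
    # Assumed legacy quirk: some dates were stored as DD/MM/YYYY.
    if signup and "/" in signup:
        day, month, year = signup.split("/")
        signup = f"{year}-{month.zfill(2)}-{day.zfill(2)}"
    return rid, name, signup


def run_etl(old_db, new_db):
    """Rebuild the target table, load what we can, and return the problems."""
    # Throw away whatever the previous iteration loaded.
    new_db.execute("DROP TABLE IF EXISTS customer")
    new_db.execute(NEW_DDL)
    problems = []
    query = "SELECT id, name, signup FROM old_customer ORDER BY id"
    for row in old_db.execute(query):
        try:
            new_db.execute("INSERT INTO customer VALUES (?, ?, ?)", scrub(row))
        except (ValueError, sqlite3.IntegrityError) as exc:
            # Log the row and the reason instead of aborting the whole run.
            problems.append((row, str(exc)))
    new_db.commit()
    return problems
```

Because `run_etl` rebuilds the target from scratch, you can rerun it after every change to `scrub` or the schema. The returned `problems` list is what you would review with the data experts between runs: it contains both the rows the scrub logic rejected and the rows the new schema's constraints refused.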

Edit

If the data experts are concerned about what will happen during the transform process, make an extra effort to keep an open dialog going with them about what you are seeing in the data and how they want you to handle it.

This process will likely be a great way to clarify what the business rules and logic are for the data. As the load program attempts to conform the data to these rules, it may discover problems with the data that no one knew even existed. It may also discover previously unknown scenarios that require new rules. The end result is both better data and a better understanding of what it can tell you.
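
One way to keep that conversation concrete is to write each business rule down as a named check and report how many records break it. The sketch below is only an illustration; the two rules and the record fields (`name`, `start`, `end`) are invented for the example.

```python
from collections import Counter

# Hypothetical business rules as (label, predicate) pairs. Dates are assumed
# to be ISO strings (YYYY-MM-DD), so plain string comparison orders them.
RULES = [
    ("name is present", lambda r: bool((r.get("name") or "").strip())),
    ("end date is not before start date", lambda r: r["start"] <= r["end"]),
]


def violations(records, rules=RULES):
    """Count how many records break each rule."""
    tally = Counter()
    for rec in records:
        for label, check in rules:
            if not check(rec):
                tally[label] += 1
    return tally
```

A tally like this gives the data experts something specific to react to. A rule that half the records violate is probably a wrong rule rather than bad data, and a scenario nobody anticipated becomes a new entry in `RULES`.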

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange