Question

Bugs happen and sometimes data has to be fixed in production. What is the safest way to go about this from a big company standpoint? Are there tools that can help? Here are some considerations driving this requirement...

  1. We need to log who ran the query and what they ran
  2. Ideally we need to give the person access to only run queries against the tables of interest and only for a short time
  3. Whatever is running the queries needs to have some smarts about it to not allow long running and locking SQL to run without explicit permission
  4. This process needs to be DB agnostic or at least understand DB2, Oracle, and SQL Server.

We are trying to reduce the risk of ad-hoc prod fix-up queries doing the "wrong thing" and at the same time add some security/auditing to the process. Thoughts or ideas?


Solution

Never ever update production databases manually.

Write scripts.

Triple check them, and have multiple people do that, not just a single person doing it three times.

Include post-change validation queries in those scripts.

Whenever the situation allows, test the whole change within a transaction which is rolled back at the end, after the post-change validation has run. Once you are confident in the results, change the rollback to a commit.
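A minimal T-SQL sketch of that rollback-first pattern, assuming hypothetical table, column, and key values (dbo.Orders, Status, the OrderId list) that stand in for the real fix:

```sql
-- Hypothetical fix: every object name and value here is a placeholder.
BEGIN TRANSACTION;

DECLARE @Rows int;

-- The actual fix
UPDATE dbo.Orders
SET    Status = 'CANCELLED'
WHERE  OrderId IN (1001, 1002, 1003);

SET @Rows = @@ROWCOUNT;
PRINT 'Rows updated: ' + CAST(@Rows AS varchar(10));

-- Post-change validation: should return zero rows if the fix worked
SELECT OrderId, Status
FROM   dbo.Orders
WHERE  OrderId IN (1001, 1002, 1003)
  AND  Status <> 'CANCELLED';

-- Keep the ROLLBACK while testing; switch to COMMIT only once the
-- validation output above looks right.
ROLLBACK TRANSACTION;
-- COMMIT TRANSACTION;
```

The same pattern carries over to Oracle and DB2; only details such as how you print the row count differ.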

Test those scripts ad nauseam against a test database.

Make a backup prior to running the script against the production database.
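On SQL Server, for example, a copy-only backup taken immediately before the fix gives a clean restore point without disturbing the regular backup chain; the database name and path below are hypothetical, and DB2 and Oracle (e.g. via RMAN) have their own equivalents:

```sql
-- Hypothetical database name and backup path.
BACKUP DATABASE [ProdDb]
TO DISK = N'X:\Backups\ProdDb_before_fix.bak'
WITH COPY_ONLY, CHECKSUM, STATS = 10;

-- If the fix goes wrong, this is the restore point:
-- RESTORE DATABASE [ProdDb]
-- FROM DISK = N'X:\Backups\ProdDb_before_fix.bak'
-- WITH REPLACE;
```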

Run the scripts.

Check, validate and triple check the changed data using the post-change-validation scripts.

Do a visual check anyway.

If anything seems off, back off and restore the backup.

Do not proceed with the changed data as the production data until you are absolutely sure that everything is ok and you have sign off from the (business) managers involved.

OTHER TIPS

The answer by Marjan Venema is technically valid and should be followed when possible. Alas, Marjan answers from the point of view of a theorist, or of a purist database administrator who likes to do things cleanly. In practice, business constraints sometimes make it impossible to do things in a clean way.

Imagine the following case:

  1. There is a bug in the software product which causes it to stop working when it detects what it thinks is a data inconsistency in the database,

  2. All developers who could potentially fix the bug in the application are unreachable,

  3. The company is currently losing thousands of dollars per hour (let's say $6,000, which means $100 per minute),

  4. The bug is affecting several tables, one of which is huge, and concerns only the data itself, not the schema,

  5. In order to circumvent the bug, you should experiment a bit with the data, which involves both removing and changing it,

  6. The database is large and it would take three hours to take or restore the backup,

  7. The last full backup was taken three weeks ago; there are also daily incremental backups, and the last daily incremental backup was done 14 hours ago,

  8. Database backups are assumed reliable; they were severely tested, including recently,

  9. Losing 14 hours of data is not acceptable, but the loss of one to two hours of data is,

  10. The staging environment was last used six months ago; it does not appear to be up to date, and it may take hours to set it up,

  11. The database is Microsoft SQL Server 2008 Enterprise.

The clean way to do things is to:

  1. Restore the backup in staging environment,

  2. Experiment there,

  3. Check the final script twice,

  4. Run the script on the production server.

Just the first step will cost your company $18,000 (three hours of restore at $6,000 per hour). The risk is pretty low if you do the third step flawlessly, but since you are working under extreme pressure, the risk is much higher. You may end up with a script which worked perfectly well in staging and then screws up the production database.

Instead, you could do it like this:

  1. Create a snapshot (Microsoft SQL Server supports this: creating a snapshot of a database which takes an hour to back up costs next to nothing, and reverting to it takes seconds; I imagine other database products support snapshots as well; see the sketch after this list),

  2. Experiment directly on the production database, reverting to the snapshot if something goes wrong.
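A minimal sketch of that approach on SQL Server, with hypothetical database, file, and snapshot names (a snapshot needs one sparse file per data file of the source database, and on SQL Server 2008 the feature requires the Enterprise edition assumed in the scenario above):

```sql
-- Create the safety net (near-instant; only a sparse file is allocated).
CREATE DATABASE ProdDb_Snapshot
ON ( NAME = ProdDb_Data,   -- logical name of the source data file
     FILENAME = N'X:\Snapshots\ProdDb_Snapshot.ss' )
AS SNAPSHOT OF ProdDb;

-- Experiment directly on ProdDb. If something goes wrong, revert:
-- (reverting requires dropping any other snapshots of ProdDb first)
RESTORE DATABASE ProdDb
FROM DATABASE_SNAPSHOT = 'ProdDb_Snapshot';
```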

While a purist would fix the database in a clean way and, given the time pressure, still risk screwing things up while wasting more than $20,000 of his company's money, a database administrator who takes business constraints into account will fix the database in a way which minimizes the risks (thanks to snapshots) while doing it quickly.

Conclusion

I'm a purist myself, and I hate doing things in a non-clean way. As a developer, I refactor the code I modify, I comment the difficult parts which couldn't be refactored, I unit-test the codebase and I do code reviews. But I also take into consideration the circumstances where either you do things cleanly and the next day you're fired, or you minimize both the risks and the financial impact with a quick hack that works.

If some IT guy wants to do things cleanly just for the sake of cleanliness while it causes thousands of dollars of loss for the company, this IT guy has a deep misunderstanding of his job.

Safely fixing production database data. What is the safest way to go about this from a big company standpoint? Are there tools that can help?

It is a bad practice and an open invitation for more data problems and issues. There is even a phrase that describes this approach: "quick and dirty".

Making fixes/updates directly on a production server is very dangerous, and it can cost you/your company a fortune (lawsuits, bad/dirty data, lost business, etc.).

However, bugs will happen and need to be fixed. The de facto industry standard is to apply patches (deployment scripts) on a staging environment (a pre-production environment with the latest copy of the prod database) and let data analysts/QA verify the fix. The same script should be version controlled and then applied to the prod environment to avoid issues.

There are a number of good practices mentioned in this related post: Staging database good practices.


In most organisations I have worked in, updating data in the live environment was always done by a small group of people with the access rights to do so, typically with a job title such as DBA. As updates could only be done by this small number of people, there is at least a chance that they gain familiarity with the data, which reduces (but does not eliminate) the risk of problems.

The person writing the update script would do so in test (as per other answers) and get serious sign-off from non-techies (those who know the system, plus someone with senior authority) that the features appear to be 'right again', in addition to their own paranoid testing. The scripts, and the data, would be independently verified by another techie (often the DBA role I mentioned) on test before being run in production. The results would be checked against anticipated values (unique to every scenario, but often things like row counts, etc.).

In one company I worked for, taking backups wasn't a realistic option, but all rows to be updated were written out to a text file for reference BEFORE the update, and then again AFTER the update, should anyone ever need to refer to them. The scripts and this data were kept in a properly organised data change log.
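The same before/after record keeping can also be done inside the database; the following is a hypothetical variation on that text-file approach, copying the affected rows into archive tables with a capture timestamp (all names and the predicate are placeholders):

```sql
-- BEFORE image of the rows about to be touched
SELECT o.*, SYSDATETIME() AS CapturedAt
INTO   dbo.Orders_Before_Fix
FROM   dbo.Orders AS o
WHERE  o.OrderId IN (1001, 1002, 1003);

-- ... run the fix script here ...

-- AFTER image of the same rows
SELECT o.*, SYSDATETIME() AS CapturedAt
INTO   dbo.Orders_After_Fix
FROM   dbo.Orders AS o
WHERE  o.OrderId IN (1001, 1002, 1003);
```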

Every business is unique, and the risk of updating some data is clearly greater than for other data.

By having a process that makes people have to jump through hoops to do these updates, hopefully you promote a culture that makes people want to treat this as a last resort, and create a healthy "double check, triple check" attitude around this stuff.

There are times when you must fix data on Prod that doesn't exist on other servers. This isn't just from bugs but could be from an import of data from a file that a client sent that was incorrect or from a problem caused by someone hacking into your system. Or from a problem caused by bad data entry. If your database is large or time critical, you may not have the time to restore the latest backup and fix on dev.

Your first defense (and something no enterprise database can afford to be without!) is audit tables. You can use them to back out bad data changes. Further, you can write scripts to return data to the previous state and test them on other servers long before you need to revert audited data. Then the only risk is whether you identified the correct records to revert.
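A minimal sketch of that idea, assuming a hand-rolled audit table; every name, column, and the time window of the bad change are hypothetical:

```sql
-- Audit table populated by a trigger or by the application on every change.
CREATE TABLE dbo.Orders_Audit
(
    AuditId   bigint IDENTITY(1,1) PRIMARY KEY,
    OrderId   int          NOT NULL,
    OldStatus varchar(20)  NULL,
    NewStatus varchar(20)  NULL,
    ChangedBy sysname      NOT NULL DEFAULT SUSER_SNAME(),
    ChangedAt datetime2(0) NOT NULL DEFAULT SYSDATETIME()
);

-- Revert script: put Status back to its value before the bad change window.
WITH FirstBadChange AS
(
    SELECT OrderId, OldStatus,
           ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY ChangedAt) AS rn
    FROM   dbo.Orders_Audit
    WHERE  ChangedAt >= '2024-06-01T09:00:00'
      AND  ChangedAt <  '2024-06-01T10:00:00'
)
UPDATE o
SET    o.Status = f.OldStatus
FROM   dbo.Orders AS o
JOIN   FirstBadChange AS f
       ON f.OrderId = o.OrderId
WHERE  f.rn = 1;
```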

Next all scripts to change data on production should include the following:

They should use explicit transactions and have a TRY...CATCH block.

They should have a test mode that you can use to roll back the changes after you see what they would have been. You should have a select statement run before the change is made and one run after the change to ensure the change was correct. The script should make sure the number of rows processed is shown. We have some of this pre-set up in a template, which makes sure the pieces get done. Templates for changes help save time in writing the fix, too.
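A minimal sketch of what such a template could look like; the object names, predicate, and values are placeholders, and the @TestMode flag is this sketch's own convention rather than a built-in SQL Server feature:

```sql
DECLARE @TestMode bit = 1;   -- 1 = dry run (roll back), 0 = apply for real (commit)
DECLARE @Rows     int;

BEGIN TRY
    BEGIN TRANSACTION;

    -- BEFORE image
    SELECT OrderId, Status FROM dbo.Orders WHERE OrderId IN (1001, 1002, 1003);

    -- The fix itself
    UPDATE dbo.Orders
    SET    Status = 'CANCELLED'
    WHERE  OrderId IN (1001, 1002, 1003);

    SET @Rows = @@ROWCOUNT;
    PRINT 'Rows updated: ' + CAST(@Rows AS varchar(10));

    -- AFTER image
    SELECT OrderId, Status FROM dbo.Orders WHERE OrderId IN (1001, 1002, 1003);

    IF @TestMode = 1
        ROLLBACK TRANSACTION;
    ELSE
        COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;

    DECLARE @Msg nvarchar(2048) = ERROR_MESSAGE();
    RAISERROR(@Msg, 16, 1);   -- re-raise the error message so the failure is visible
END CATCH;
```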

If there is a large amount of data to change or update, then consider writing the script to run in batches with commits for each batch. You do not want to lock the whole system up while you fix a million records. If you have large amounts of data to fix, make sure a DBA or someone who is used to performance tuning reviews the script prior to running it, and run it during off hours if at all possible.
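A batching sketch along those lines (the table, predicate, and batch size are placeholders); in autocommit mode each UPDATE commits on its own, so locks stay short-lived:

```sql
DECLARE @BatchSize int = 5000;

WHILE 1 = 1
BEGIN
    -- The WHERE clause must exclude rows that have already been fixed,
    -- otherwise the loop never terminates.
    UPDATE TOP (@BatchSize) dbo.BigTable
    SET    IsCorrupt = 0
    WHERE  IsCorrupt = 1;

    IF @@ROWCOUNT = 0
        BREAK;   -- nothing left to fix

    -- Optional breathing room for the concurrent production workload.
    WAITFOR DELAY '00:00:01';
END;
```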

Next all scripts to change anything on production are code reviewed and put into source control. All of them - without exception.

Finally, devs should not run these scripts. They should be run by DBAs or a configuration management group. If you have neither of those, then only people who are tech leads or higher should have the rights to run things on prod. The fewer people running things on prod, the easier it is to track down a problem. Scripts should be written so that they are simply run as a whole, with no highlighting parts and running one step at a time. It is the highlighting that often gets people in trouble, when they forget to highlight the WHERE clause.

I have updated data many times in running production databases. I agree with the answer above, that this would never be standard operating procedure.

It would also be expensive (we would look over each other's shoulders and discuss it, maybe two or three of us).

And the golden rule: always write a select statement that shows what would be affected before running an update/delete/insert statement.

The golden rule being enforced by the other two people in the team!
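For instance (hypothetical table and predicate), the SELECT is reviewed first, and the write statement then reuses exactly the same WHERE clause:

```sql
-- Step 1: show what would be touched, and have the team review the output.
SELECT OrderId, Status
FROM   dbo.Orders
WHERE  CustomerId = 42 AND Status = 'PENDING';

-- Step 2: only after the review, run the change with the identical predicate.
UPDATE dbo.Orders
SET    Status = 'CANCELLED'
WHERE  CustomerId = 42 AND Status = 'PENDING';
```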

re: MainMa's answer...

There is a bug in the software product which causes it to stop working when it detects what it thinks is a data inconsistency in the database,

  • How do you know it's a "bug"? The data is inconsistent according to the rules the software product developer laid out.

All developers who could potentially fix the bug in the application are unreachable,

The company is currently losing thousands of dollars per hour (let's say $6,000, which means $100 per minute),

  • Apparently a loss of $100/minute is not important enough to the company management for them to locate and ensure that competent developers return to fix their mistake and help you restore the database.

The bug is affecting several tables, one of which is huge, and concerns only the data itself, not the schema,

  • All database problems "concern" the schema. How the schema is designed is what is going to determine how you solve this problem.

In order to circumvent the bug, you should experiment a bit with the data, which involves both removing and changing it,

  • That's what your staging database is for. You may need to repopulate it with "corrupted" data from the production database right after you take a full online backup of production.

The database is large and it would take three hours to take or restore the backup,

  • Then you better get that started right away so it can run while you're analyzing the problem, developing your correction scripts, testing and refining them along with the developers and other DBAs helping you.

The last full backup was taken three weeks ago; there are also daily incremental backups, and the last daily incremental backup was done 14 hours ago,

  • You don't have at least daily full online backups? You're screwed. But you're probably used to that. Good thing that full backup you started above is running. Be sure management tracks every minute of the costs that could have been avoided with daily online backups.

Database backups are assumed reliable; they were severely tested, including recently,

  • Excellent! Then you might not have to restore the database more than once.

Losing 14 hours of data is not acceptable, but the loss of one to two hours of data is,

  • Under the scenario you've described, all bets are off. This is an "information disaster management" situation. A good thing for management to be doing throughout this is documenting the costs that could be avoided in the future with proper backup and recovery procedures and resources.

The staging environment was last used six months ago; it does not appear to be up to date, and it may take hours to set it up,

  • If your backup system supports online backups (i.e. the database is fully operational during the backup), then you can do the extract to repopulate the staging database at the same time, provided you have sufficient hardware resources to avoid slowing down the backup.

The database is Microsoft SQL Server 2008 Enterprise.

  • Harder to do all this but not impossible. Good Luck!
Licensed under: CC-BY-SA with attribution