How to handle unexpected schema changes to production database

https://softwareengineering.stackexchange.com/questions/235785

03-10-2020
|

Question

Is it a common or normal practice to encounter schema changes, especially renamed or removed columns, in a production database, without having been notified of the change? If unexpected schema changes are anticipated in the production database, how are they guarded against, other than basic exception handling?

I've worked mainly with credit union software, DoD software, high volume e-commerce web applications, real estate systems, robotics, and more recently CMMS, GIS, and SCADA systems. Although the '90s and early 2000s were more of the wild west, everything I have been involved with over the last 10 years or so has required change management. Modifications to a production database without change control is not allowed by policy and by security controls.

Comments in the accepted answer to this question - Why is “Select * from table” considered bad practice make me wonder how much of this is academic and how much is reality.

Solution

If I may be so bold ;)

The problem isn't really about unexpected schema changes to production, though that's certainly a symptom.

In reality it is:

"Why aren't different members of my organization communicating about changes that affect each other"

Once you view it that way then it makes more sense to look at:

current processes of code and database changes
current process for notifying others
current process for running tests for such changes
current methods used to communicate (email, IM, Project tracker, etc.)

etc.

OTHER TIPS

The short answer: Log it, maybe notify the user, and maybe continue execution of your application.

Changes happen all the time. In production. Without warning. Often by mistake.

Even with change control policies carved in stone, mistakes happen. A DBA that is tuning a query drops a table. A developer with too much access accidentally removes a column. One of the DBAs moves your tables to a different schema. A database becomes corrupted.

These are some of the things that I have seen occur in live production databases over the years. Each one made me very glad there were backups of the production applications.

There will always be something somewhere that goes wrong. That is one of the reasons that you have error handling code in production applications: to prevent the failure of one section of code from causing the overall failure of all of it.

Errors

Log everything that is an error with the most details you can. Include the full error message (where possible). Make sure that your code can gracefully shutdown when an error is sufficiently detrimental to require it. And log that your app forcefully terminated itself. Displaying errors to the user is something that you should do in most, but not all circumstances.

Warnings

Log warnings, but only grab the relevant minimum information you need to diagnose and correct the issue. You may want to set some flag that will conditionally allow you to log warnings (and/or informational messages). This will help in keeping the log file small. Warning messages are mostly something that you will want to show to the user, especially when the user can make changes to correct the issue.

Informational Messages

Only log informational items when you need normally unimportant information. In production code, I have found it useful for informational messages to include when you enter and leave a function, method, properties (when they do more that get/put values), start up and shut down singleton instances (e.g., SerialPort). These informational messages do not need to be shown to the user.

Ideally, NO unexpected changes happen to a production system, EVER.
In reality of course, they happen all the time.

The way you handle it is by having proper procedures in place to minimise them, AND having proper procedures in place to deal with them when they happen (lists of people to call, things to check, rollback processes, etc. etc..
Risk can also be reduced by properly testing all changes that are scheduled as good as you can, in a situation as close as possible to the life system (identical hardware and software configuration for example), then having procedures in place that make it imperative that no manual changes are performed to production systems except those listed explicitly in the upgrade/installation manuals for your product.

Notice there's not a single piece of monitoring software, no coding practice, nothing technical in there. It's all procedural and documentation.

Combine that with proper access control to production systems (basically only the people who do installations and disaster recovery have write access there, and by procedure they only log in with those specific accounts when performing those specific tasks), and you've done all you can.
Only thing to remain to do in your code is to make it as robust as you can towards schema changes.
Of course you can't guard against a column you need being deleted or having its data type made narrower. But you can guard against a column changing to NOT NULL by ensuring all columns are always written with values, and you can guard against most column additions (obviously not NOT NULL columns) by writing your code so as to only retrieve and write explicit columns, only the minimum set you need for your operations.

You don't.

Database / schema changes are not things that "just happen". Changes like this are not desired or supported.

Any change that may be needed will have to go through the regular release management and change management process, which includes testing the existing functionality (nothing is broken) as well proving that the change has the desired effect. We have a dev, test, staging and production environment just for this. We do not write code in production.

It may take a bit longer than just changing a database in production, but it is the only way to make sure users don't get confronted with random errors because "someone" changed "something".

You cannot code against a changing database schema. What if the DBA renames a table? What if the DBA drops related records? What if the DBA disables validation, or adds extra validation?

The only valid option is to log errors, and to let the application fail.

I suggest making clear rules with the operations team about these kinds of things. How would they like it if you start renaming servers for them?

In my environment we make schema changes only as part of a release, and we do that with tooling that transforms the schema to the exact form as designed. other changes that may be made in an environment will be undone by this.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange