Question

Achieving Zero Downtime Deployment touched on the same issue, but I need some advice on a strategy that I am considering.

Context

A web-based application with Apache/PHP for server-side processing and MySQL DB/filesystem for persistence.

We are currently building the infrastructure. All networking hardware will have redundancy and all main network cables will be used in bonded pairs for fault-tolerance. Servers are being configured as high-availability pairs for hardware fault-tolerance and will be load-balanced for both virtual-machine fault-tolerance and general performance.

It is my intent that we are able to apply updates to the application without any downtime. I have taken great pains when designing the infrastructure to ensure that I can provide 100% uptime; it would be extremely disappointing to then have 10-15 minutes of downtime every time an update is applied. This is particularly significant as we intend to have a very rapid release cycle (sometimes it may reach one or more releases per day).

Network Topology

This is a summary of the network:

                      Load Balancer
             |----------------------------|
              /       /         \       \  
             /       /           \       \ 
 | Web Server |  DB Server | Web Server |  DB Server |
 |-------------------------|-------------------------|
 |   Host-1   |   Host-2   |   Host-1   |   Host-2   |
 |-------------------------|-------------------------|
            Node A        \ /        Node B
              |            /            |
              |           / \           |
   |---------------------|   |---------------------|
           Switch 1                  Switch 2
    
   And onward to VRRP enabled routers and the internet

Note: DB servers use master-master replication

Suggested Strategy

To achieve this, I am currently thinking of breaking the DB schema upgrade scripts into two parts. The upgrade would look like this:

  1. The web server on node A is taken offline; traffic continues to be processed by the web server on node B.
  2. Transitional Schema changes are applied to the DB servers.
  3. Web server A's code base is updated, caches are cleared, and any other upgrade actions are taken.
  4. Web server A is brought back online and web server B is taken offline.
  5. Web server B's code base is updated, caches are cleared, and any other upgrade actions are taken.
  6. Web server B is brought back online.
  7. Final Schema changes are applied to the DB servers.

'Transitional Schema' would be designed to establish a cross-version compatible DB. This would mostly make use of table views that simulate the old version schema whilst the table itself would be altered to the new schema. This allows the old version to interact with the DB as normal. The table names would include schema version numbers to ensure that there won't be any confusion about which table to write to.

'Final Schema' would remove the backwards compatibility and tidy the schema.
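By way of illustration, here is a minimal sketch of what one transitional-plus-final change might look like in MySQL, assuming a hypothetical customer table that gains an optional phone_number column in v2 (all table and column names are invented for the example):

    -- Transitional Schema: move the table to its new versioned name
    -- and apply the structural change (hypothetical names throughout).
    RENAME TABLE customer_v1 TO customer_v2;
    ALTER TABLE customer_v2
        ADD COLUMN phone_number VARCHAR(32) NULL DEFAULT NULL;

    -- Backwards-compatible view presenting the v1 shape under the old name.
    -- A simple single-table projection like this is updatable and insertable
    -- in MySQL, so the old release can keep reading and writing customer_v1;
    -- the new column falls back to its DEFAULT on inserts made through it.
    CREATE VIEW customer_v1 AS
        SELECT id, name, email
        FROM customer_v2;

    -- Final Schema (step 7), once both web servers run the new release:
    DROP VIEW customer_v1;

Note that the brief gap between the RENAME and the CREATE VIEW is exactly the concurrent-write window asked about in question 1 below.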

Question

In short, will this work?

More specifically:

  1. Will there be problems due to the potential for concurrent writes at the specific point of the transitional schema change? Is there a way to make sure that the group of queries that modify the table and create the backwards-compatible view are executed consecutively, i.e. with any other queries held in a buffer until the schema changes are complete (which should generally take only milliseconds)?

  2. Are there simpler methods that provide this degree of stability whilst also allowing updates without downtime? I would also prefer to avoid the 'evolutionary' schema strategy, as I do not wish to become locked into backwards schema compatibility.


Solution

It sounds like what you are really looking for is not so much High Availability as Continuous Availability.

Essentially, your plan will work, but as you seem to have noticed, the major flaw in your setup is that database schema changes in a release could result in either downtime or in the still-available node failing to operate correctly. The Continuous Availability approach solves this by essentially creating a number of production environments.

Production One

This environment hosts the current live version of the software being used by your users. It has its own web servers, application servers, and database servers and tablespace. It operates independently of any other environment. The load balancer that owns the domain resolution endpoint for these services currently points to these web servers.

Production Two

This is basically a release staging environment that is identical to Production One. You can perform your release upgrades here and do your sanity tests before the go-live event. It also lets you perform your database changes safely in this environment. The load balancer does not currently point to this environment.

Production DR

This is another duplicate at a separate data center located in a different region of the world. It allows you to fail over in the event of a catastrophic failure by doing a DNS switch at the load balancer.

Go Live

This event essentially means updating the DNS record to cycle from Production One to Production Two, or vice versa. The change takes a while to propagate throughout the DNS servers of the world, so you leave both environments running for a while. Some users MAY still be working in existing sessions on the old version of your software, but most users will be establishing new sessions on the upgraded version.

Data Migration

The only drawback here is that not all data written during that window is available to all users at that time. There is clearly important user data in the previous version's database that now needs to be migrated safely to the new database schema. This can be accomplished with a well-tested data export and migration script, batch job, or similar ETL process.
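Purely as an illustration, assuming both schemas happened to be reachable from one MySQL server as prod_one and prod_two and that rows carry an updated_at timestamp (neither of which is stated above; separate servers would need a dump/restore or a dedicated ETL tool instead), the delta copy could be as simple as:

    -- Copy rows created or changed in the old environment after the cut-over
    -- into the new schema (hypothetical database, table, and column names).
    INSERT INTO prod_two.customer (id, name, email)
    SELECT id, name, email
    FROM prod_one.customer
    WHERE updated_at >= '2024-01-01 00:00:00'   -- the go-live timestamp
    ON DUPLICATE KEY UPDATE
        name  = VALUES(name),
        email = VALUES(email);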

Conclusion

Once you have fully completed your release event, Production Two is your new primary, and you begin installing the next release on Production One for the next deployment cycle.

Drawbacks

This is a complex environment to set up, and it requires a large amount of system resources, often two to three times the resources of a single environment, to do successfully. Operating this way can be expensive, especially if you have very large, heavily used systems.

OTHER TIPS

Your strategy is sound. I would only suggest considering expanding the "Transitional Schema" into a complete set of "transaction tables".

With transaction tables, SELECTs (queries) are performed against the normalized tables in order to ensure correctness, but all database INSERTs, UPDATEs, and DELETEs are always written to the denormalized transaction tables.

Then a separate, concurrent process applies those changes (perhaps using Stored Procedures) to the normalized tables per the business rules and schema requirements established.

Most of the time, this would be virtually instantaneous. But separating the actions allows the system to accommodate excessive activity and schema update delays.

During schema changes on database (B), data updates on the active database (A) would go into its transaction tables and be immediately applied to its normalized tables.

On bringing database (B) back up, the transactions from (A) would be applied to it by writing them to (B)'s transaction tables. Once that part is done, (A) could be brought down and the schema changes applied there. (B) would finish applying the transactions from (A) while also handling its live transactions, which would queue just as (A)'s did; those "live ones" would be applied to (A) in the same way when it came back up.

A transaction table row might look something like...

    | ROWID | TRANSNR | DB | TABLE | SQL STATEMENT                  |
    |-------|---------|----|-------|--------------------------------|
    |   0   |    0    | A  | Name  | INSERT INTO Name ...           |
    |   1   |    0    | A  | Addr  | INSERT INTO Addr ...           |
    |   2   |    0    | A  | Phone | INSERT INTO Phone ...          |
    |   3   |    1    | A  | Stats | UPDATE Stats SET NrOfUsers=... |

The transaction "tables" could actually be rows in a separate NoSQL database or even sequential files, depending on performance requirements. A bonus is that the application (website in this case) coding gets a bit simpler since it writes only to the transaction tables.

The idea follows the same principles as double-entry bookkeeping, and for similar reasons.

Transaction tables are analogous to a bookkeeping "journal". The fully normalized tables are analogous to a bookkeeping "ledger" with each table being somewhat like a bookkeeping "account".

In bookkeeping, each transaction gets two entries in the journal. One for the "debited" ledger account, and the other for the "credited" account.

In an RDBMS, a "journal" (transaction table) gets an entry for each normalized table to be altered by that transaction.

The DB column in the table illustration above indicates on which database the transaction originated, thus allowing the queued rows from the other database to be filtered out and not reapplied when the second database is brought back up.

Licensed under: CC-BY-SA with attribution