Question

Suppose I have a server setup with one load balancer that routes traffic between two web servers, both of which connect to a database that is a RAM cloud. For whatever reason I want to upgrade my database, and this will require it to be down temporarily. During this downtime I want to put an "upgrading" notice on the front page of the site. I have a specific web app that displays that message.

Should I:

  • (a) - spin up a new EC2 instance with the "upgrading" web app on it and point the LB at it
  • (b) - ssh into each web server, pull down the main web app, and put up the "upgrading" app
  • (c) - I'm doing something wrong, since I have to put an "upgrading" sign up in the first place

Solution

If you go the route of the "upgrading" (dummy/replacement) web app, I would be inclined to run that on a different machine so you can test and verify its behavior in isolation, point the ELB to it, and point the ELB back without touching the real application.
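To illustrate the cutover, here is a minimal boto3 sketch, assuming an Application Load Balancer with one listener and two target groups (one for the real web servers, one for the "upgrading" instance); the ARNs and region are placeholders, and with a Classic ELB you would instead register/deregister instances:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Hypothetical ARNs -- substitute your own listener and target groups.
LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/my-lb/..."
PROD_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/prod-web/..."
MAINT_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/upgrading-page/..."

def point_listener_at(target_group_arn: str) -> None:
    """Repoint the load balancer's listener at the given target group."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )

# Before the maintenance window: send all traffic to the "upgrading" app.
point_listener_at(MAINT_TG_ARN)

# ... perform the database upgrade, test, verify ...

# Afterwards: send traffic back to the real application, untouched all along.
point_listener_at(PROD_TG_ARN)
```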

I would further suggest that you not "upgrade" your existing instances but instead bring new instances online, copy as much as you can from the live site, then take down the live site, finish syncing whatever remains, and cut the traffic over.

If I were doing this with a single MySQL server-backed site (which I mention only because that is my area of expertise), I would bring the new database server online from a snapshot backup of the existing database, connect it to the live replication stream generated by the existing server, beginning at the point in time where the snapshot was taken, and let it catch up to the present by replaying the transactions that occurred since the snapshot. Once the new server had caught up by playing back the replication events, I would have my live data set, in essentially real time, on new database hardware. I could then stop the application, reconfigure the application server settings to use the new database server, verify that all of the replication events had propagated, disconnect from the replication stream, and restart the app servers against the new database -- for a total downtime so short that it would be unlikely to be noticed if done during off-peak hours.
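As a rough sketch of the catch-up-and-wait step, assuming classic MySQL binlog replication and the mysql-connector-python driver; the host names, credentials, and binlog coordinates below are placeholders you would take from your own snapshot:

```python
import time
import mysql.connector

# Connect to the NEW database server, freshly restored from the snapshot.
new_db = mysql.connector.connect(host="new-db.example.com", user="admin",
                                 password="secret")
cur = new_db.cursor(dictionary=True)

# Attach the new server to the old server's replication stream, starting at
# the binlog position recorded when the snapshot was taken (placeholder values).
cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST = 'old-db.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'repl-password',
        MASTER_LOG_FILE = 'mysql-bin.000123',
        MASTER_LOG_POS = 4
""")
cur.execute("START SLAVE")

# Wait until the new server has replayed everything since the snapshot.
while True:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    if status["Seconds_Behind_Master"] == 0:
        break
    time.sleep(5)

# At this point: stop the application, confirm the final events propagated,
# disconnect the new server from replication, repoint the app servers at
# new-db.example.com, and restart.
print("Replica is caught up; ready to cut over.")
```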

Of course, with a Galera cluster these gyrations would be unnecessary: you can do a rolling upgrade, one node at a time, without the remaining nodes ever losing synchronization with each other (assuming you started with the required minimum of three running nodes), and each upgraded node would resync its data from one of the other two when it came back online.
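In the same Python sketch style, a small pre-flight check like the following can confirm a Galera node is safe to take down and has resynced afterwards; the connection details are placeholders, while the wsrep_* names are Galera's standard status variables:

```python
import time
import mysql.connector

def wsrep_status(host: str, name: str) -> str:
    """Fetch a single Galera (wsrep) status variable from the given node."""
    conn = mysql.connector.connect(host=host, user="admin", password="secret")
    cur = conn.cursor()
    # `name` comes from our own constants below, not user input.
    cur.execute("SHOW STATUS LIKE '{}'".format(name))
    _, value = cur.fetchone()
    conn.close()
    return value

def safe_to_take_down(host: str) -> bool:
    """True if the cluster is fully formed and this node is in sync."""
    return (int(wsrep_status(host, "wsrep_cluster_size")) >= 3
            and wsrep_status(host, "wsrep_local_state_comment") == "Synced")

def wait_until_synced(host: str, poll_seconds: int = 10) -> None:
    """Block until the upgraded node reports it has resynced with the cluster."""
    while wsrep_status(host, "wsrep_local_state_comment") != "Synced":
        time.sleep(poll_seconds)
```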

To whatever extent the platform you are using lacks comparable functionality to what I've described (specifically, the ability to take database snapshots and play back a transaction-log stream against a database restored from a snapshot... or quorum-based cluster survivability), I suspect that limitation is what makes it feel like you're doing it wrong.

A possible workaround to minimize the actual downtime, if your architecture doesn't support these kinds of actions, would be to enhance your application with the ability to operate in a "read-only" mode, where the site can be browsed but the data can't be modified (you can see the catalog but not place orders; you can read the blogs but not edit or post comments; you don't bother saving "last login date" for a few minutes; certain privilege levels aren't available; etc.) -- a capability Stack Overflow has. This would allow you to stop the site just long enough to snapshot it, then restart it on the existing hardware in read-only mode while you bring up the snapshots on new hardware. Then, when the site is available again on the new hardware, cut the traffic over at the load balancer and you're back to normal.
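As a minimal illustration of that read-only switch, here is a Flask-style sketch; the flag name, routes, and messages are made up, and a real application would also need to block writes at deeper layers (background jobs, APIs, and so on):

```python
from flask import Flask, request

app = Flask(__name__)
app.config["MAINTENANCE_READ_ONLY"] = False  # flip to True during the cutover

@app.before_request
def reject_writes_in_read_only_mode():
    # Reads still work; anything that would modify data gets a polite notice.
    if app.config["MAINTENANCE_READ_ONLY"] and request.method not in ("GET", "HEAD"):
        return ("The site is temporarily read-only while we upgrade the "
                "database. Please try again in a few minutes.", 503)

@app.route("/")
def index():
    return "catalog page (still browsable in read-only mode)"
```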

Licensed under: CC-BY-SA with attribution