Question

Suppose I have a server setup with one load balancer that routes traffic between two web servers, both of which connect to a database that is a RAM cloud. For whatever reason I want to upgrade my database, and this will require it to be down temporarily. During this downtime I want to put an "upgrading" notice on the front page of the site. I have a specific web app that displays that message.

Should I:

  • (a) - spin up a new EC2 instance with the "upgrading" web app on it and point the LB at it
  • (b) - ssh into each web server, pull down the main web app, and put up the "upgrading" app
  • (c) - I'm doing something wrong, since I have to put an "upgrading" sign up in the first place

Solution

If you go the route of the "upgrading" (dummy/replacement) web app, I would be inclined to run that on a different machine so you can test and verify its behavior in isolation, point the ELB to it, and point the ELB back without touching the real application.
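To illustrate the cutover, here is a minimal boto3 sketch, assuming an Application Load Balancer with one listener and two target groups (one for the real web servers, one for the "upgrading" instance); the ARNs and region are placeholders, and with a Classic ELB you would instead register/deregister instances:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Hypothetical ARNs -- substitute your own listener and target groups.
LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/my-lb/..."
PROD_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/prod-web/..."
MAINT_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/upgrading-page/..."

def point_listener_at(target_group_arn: str) -> None:
    """Repoint the load balancer's listener at the given target group."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )

# Before the maintenance window: send all traffic to the "upgrading" app.
point_listener_at(MAINT_TG_ARN)

# ... perform the database upgrade, test, verify ...

# Afterwards: send traffic back to the real application, untouched all along.
point_listener_at(PROD_TG_ARN)
```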

I would further suggest that you not "upgrade" your existing instances but instead bring new instances online, copy as much as you can from the live site, then take down the live site, finish syncing whatever remains, and cut the traffic over.

If I were doing this with a single MySQL server-backed site (which I mention only because that is my area of expertise), I would bring the new database server online from a snapshot backup of the existing database, connect it to the live replication stream generated by the existing server, beginning at the point in time where the snapshot was taken, and let it catch up to the present by replaying the transactions that occurred since the snapshot. Once the new server had caught up by playing back the replication events, I would have my live data set, in essentially real time, on new database hardware. I could then stop the application, reconfigure the application server settings to use the new database server, verify that all of the replication events had propagated, disconnect from the replication stream, and restart the app servers against the new database -- for a total downtime so short that it would be unlikely to be noticed if done during off-peak hours.
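As a rough sketch of the catch-up-and-wait step, assuming classic MySQL binlog replication and the mysql-connector-python driver; the host names, credentials, and binlog coordinates below are placeholders you would take from your own snapshot:

```python
import time
import mysql.connector

# Connect to the NEW database server, freshly restored from the snapshot.
new_db = mysql.connector.connect(host="new-db.example.com", user="admin",
                                 password="secret")
cur = new_db.cursor(dictionary=True)

# Attach the new server to the old server's replication stream, starting at
# the binlog position recorded when the snapshot was taken (placeholder values).
cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST = 'old-db.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'repl-password',
        MASTER_LOG_FILE = 'mysql-bin.000123',
        MASTER_LOG_POS = 4
""")
cur.execute("START SLAVE")

# Wait until the new server has replayed everything since the snapshot.
while True:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    if status["Seconds_Behind_Master"] == 0:
        break
    time.sleep(5)

# At this point: stop the application, confirm the final events propagated,
# disconnect the new server from replication, repoint the app servers at
# new-db.example.com, and restart.
print("Replica is caught up; ready to cut over.")
```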

Of course, with a Galera cluster these gyrations would be unnecessary: you can do a rolling upgrade, one node at a time, without the remaining nodes ever losing synchronization with each other (assuming you started with the required minimum of three running nodes), and each upgraded node would resync its data from one of the other two when it came back online.
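In the same Python sketch style, a small pre-flight check like the following can confirm a Galera node is safe to take down and has resynced afterwards; the connection details are placeholders, while the wsrep_* names are Galera's standard status variables:

```python
import time
import mysql.connector

def wsrep_status(host: str, name: str) -> str:
    """Fetch a single Galera (wsrep) status variable from the given node."""
    conn = mysql.connector.connect(host=host, user="admin", password="secret")
    cur = conn.cursor()
    # `name` comes from our own constants below, not user input.
    cur.execute("SHOW STATUS LIKE '{}'".format(name))
    _, value = cur.fetchone()
    conn.close()
    return value

def safe_to_take_down(host: str) -> bool:
    """True if the cluster is fully formed and this node is in sync."""
    return (int(wsrep_status(host, "wsrep_cluster_size")) >= 3
            and wsrep_status(host, "wsrep_local_state_comment") == "Synced")

def wait_until_synced(host: str, poll_seconds: int = 10) -> None:
    """Block until the upgraded node reports it has resynced with the cluster."""
    while wsrep_status(host, "wsrep_local_state_comment") != "Synced":
        time.sleep(poll_seconds)
```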

To whatever extent the platform you are using lacks comparable functionality to what I've described (specifically, the ability to take database snapshots and play back a transaction-log stream against a database restored from a snapshot... or quorum-based cluster survivability), I suspect that limitation is what makes it feel like you're doing it wrong.

A possible workaround to minimize the actual downtime, if your architecture doesn't support these kinds of actions, would be to enhance your application with the ability to operate in a "read-only" mode, where the site can be browsed but the data can't be modified (you can see the catalog but not place orders; you can read the blogs but not edit or post comments; you don't bother saving "last login date" for a few minutes; certain privilege levels aren't available; etc.) -- a capability Stack Overflow has. This would allow you to stop the site just long enough to snapshot it, then restart it on the existing hardware in read-only mode while you bring up the snapshots on new hardware. Then, when the site is available again on the new hardware, cut the traffic over at the load balancer and you're back to normal.
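As a minimal illustration of that read-only switch, here is a Flask-style sketch; the flag name, routes, and messages are made up, and a real application would also need to block writes at deeper layers (background jobs, APIs, and so on):

```python
from flask import Flask, request

app = Flask(__name__)
app.config["MAINTENANCE_READ_ONLY"] = False  # flip to True during the cutover

@app.before_request
def reject_writes_in_read_only_mode():
    # Reads still work; anything that would modify data gets a polite notice.
    if app.config["MAINTENANCE_READ_ONLY"] and request.method not in ("GET", "HEAD"):
        return ("The site is temporarily read-only while we upgrade the "
                "database. Please try again in a few minutes.", 503)

@app.route("/")
def index():
    return "catalog page (still browsable in read-only mode)"
```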

Licensed under: CC-BY-SA with attribution