Question

Our current deploy process goes something like this:

  1. Use grunt to create production assets.
  2. Create a datestamp and point files at our CDN (e.g. /scripts/20140324142354/app.min.js).

    Sidenote: I've heard this process called "versioning" before but I'm not sure if it's the proper term.

  3. Commit the build to GitHub.

  4. Run git pull on the web servers to retrieve the new code from GitHub (a stripped-down sketch of these steps follows this list).
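Concretely, a stripped-down version of steps 1–4 looks something like this (the grunt task, paths, and branch are simplified stand-ins for our real setup):

# 1. build production assets with grunt
grunt

# 2. stamp the build and copy it to the CDN-facing path
stamp=$(date +%Y%m%d%H%M%S)                # e.g. 20140324142354
mkdir -p public/scripts/$stamp
cp dist/app.min.js public/scripts/$stamp/app.min.js

# 3. commit the built assets
git add public/scripts/$stamp
git commit -m "build $stamp"
git push origin master

# 4. on each web server; forever -w sees the changed files and restarts
git pull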

This is a Node.js site and we are using forever -w to restart the app whenever files change.

We have a route set up in our app to serve the latest version of the app via /scripts/*/app.min.js.

The reason we version like this is that our CDN is set to cache JavaScript files indefinitely, and the new path purposely creates a cache miss so that the updated code is picked up by the CDN (and by our users' browsers).

This works fine most of the time, but it breaks down when one of the servers lags a bit in checking out the new code.

Sometimes a client hits the page while a deploy is in progress and tries to retrieve the new JavaScript from the CDN. The CDN tries to fetch it but hits a server that hasn't finished checking out the new code yet, and it caches an old or partially downloaded file, causing all sorts of problems.

This problem is exacerbated by the fact that our CDN has many edge locations and so the problem isn't always immediately visible to us from our office. Some edge locations may have pulled down old/bad code while others may have pulled down new/good code.

Is there a better way to do these deployments that will avoid this issue?


Solution 2

Step 4 in your procedure should be:

# export the new build from the repo into a fresh timestamped directory
git archive --remote $yourgithubrepo --prefix=$timestamp/ | tar -xf -
stop-server
# repoint the "current" symlink at the new build
ln -sf $timestamp current
start-server

Your server would use the current directory (well, a symlink) at all times. No matter how long the deploy takes, your application is in a consistent state.
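A variation on the same idea, if you want to avoid even the brief stop/start window: build the new symlink next to the old one and rename it into place, which is atomic on the same filesystem (mv -T is GNU coreutils; restart-server is a placeholder in the same spirit as stop-server/start-server above):

git archive --remote $yourgithubrepo --prefix=$timestamp/ | tar -xf -
ln -sfn $timestamp current.tmp    # new symlink created alongside the live one
mv -Tf current.tmp current        # atomic rename over the existing symlink
restart-server                    # pick up the new code (placeholder command)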

OTHER TIPS

As a general rule of thumb:

Don't do live upgrades (unless the language supports it, and even then think twice).

Pulling code with git pull and then waiting for the app to notice the changed files sounds a lot like the '90s: uploading PHP files to an Apache web server over FTP (or SFTP if you are cool) and waiting for Apache to notice they were updated. It can't happen atomically, so of course there is a race condition. Some users WILL get a half-built, broken site.

I recommend only upgrading your live and running application while no one is using it. Hopefully you have a pool of servers behind a load balancer of some sort, which will allow you to remove them one at a time and upgrade them.
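A minimal sketch of that one-at-a-time rotation, assuming the load balancer's health check is configured to drain a host while a maintenance flag exists (hostnames, the flag path, and deploy.sh are made-up placeholders):

for host in web1 web2 web3; do
    ssh $host touch /var/www/maintenance   # health check fails, LB drains this host
    sleep 30                               # let in-flight requests finish
    ssh $host ./deploy.sh                  # per-server upgrade (placeholder)
    ssh $host rm /var/www/maintenance      # host rejoins the pool
done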

This will mean that users will be able to use both the old and the new site at the same time, depending on how and when they access it, but that is much better than not being able to access it at all.

Ideally you would be able to spin up copies of each of the web servers you have running, with the new version of the site. Check that the new version works, and then atomically update the load balancer so that everyone gets bumped to the new site at the same time. Only once everything is verified to be working perfectly are the old machines shut down and decommissioned, or reused.
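As one concrete example, if the load balancer is nginx, the atomic cut-over can be a config swap plus a graceful reload (hostnames, port, and the config path are illustrative):

# rewrite the upstream so it points at the freshly provisioned, already-verified servers
cat > /etc/nginx/conf.d/app-upstream.conf <<'EOF'
upstream app {
    server web-new-1.internal:3000;
    server web-new-2.internal:3000;
}
EOF

nginx -t          # validate the new configuration
nginx -s reload   # graceful reload: old workers finish their in-flight requests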

I'll go ahead and post our far-from-ideal monkey-patch that we're using right now.

We deploy once, which may or may not go as planned. Once we're sure the code is deployed on all the servers, we do another build where the only thing that changes is the version number.

Then we deploy again server by server.

The race condition still exists, but because the application code is identical between the two versions, the issue is masked: no matter which server the CDN hits, it gets the "latest" code.
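In script form it's roughly this (deploy-all and bump-version are stand-ins for our actual tooling):

./deploy-all      # first pass: real code changes, may race with the CDN
# ...wait until every server is confirmed to be running the new build...
./bump-version    # second build: only the version stamp changes
./deploy-all      # second pass: servers differ only by version, so any CDN fetch is "correct"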

Licensed under: CC-BY-SA with attribution