Question

I have a cluster of three mongrels running under nginx, and I deploy the app using Capistrano 2.4.3. When I run "cap deploy" against a running system, the behavior is:

  1. The app is deployed. The code is successfully updated.
  2. In the cap deploy output, there is this:

    • executing "sudo -p 'sudo password: ' mongrel_rails cluster::restart -C /var/www/rails/myapp/current/config/mongrel_cluster.yml"
    • servers: ["myip"]
    • [myip] executing command
    • ** [out :: myip] stopping port 9096
    • ** [out :: myip] stopping port 9097
    • ** [out :: myip] stopping port 9098
    • ** [out :: myip] already started port 9096
    • ** [out :: myip] already started port 9097
    • ** [out :: myip] already started port 9098
  3. I check immediately on the server and find that Mongrel is still running, and the PID files are still present for the previous three instances.
  4. A short time later (less than one minute), I find that Mongrel is no longer running, the PID files are gone, and it has failed to restart.
  5. If I start mongrel on the server by hand, the app starts up just fine.

It seems like 'mongrel_rails cluster::restart' isn't properly waiting for a full stop before attempting a restart of the cluster. How do I diagnose and fix this issue?

EDIT: Here's the answer:

mongrel_cluster, in the "restart" task, simply does this:

 def run
   stop
   start
 end

It doesn't do any waiting or checking to see that the process exited before invoking "start". This is a known bug with an outstanding patch submitted. I applied the patch to Mongrel Cluster and the problem disappeared.
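
If you can't apply the patch, the shape of the fix is easy to sketch. This is not the actual patch, just an illustration of the missing step; read_pid_files is a hypothetical helper standing in for however the recipe collects the pids of the cluster it is about to stop:

# Not the real patch -- a rough illustration of the missing wait.
def run
  old_pids = read_pid_files   # hypothetical helper: pids of the currently running mongrels
  stop
  # Poll until every old process has actually exited (give up after ~30 seconds
  # rather than hanging forever).
  30.times do
    break if old_pids.none? { |pid| process_alive?(pid) }
    sleep 1
  end
  start
end

# Signal 0 doesn't kill anything; it just checks whether the pid still exists.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false
end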


Solution

You can explicitly tell the mongrel_cluster recipes to remove the pid files before a start by adding the following in your capistrano recipes:

# helps keep mongrel pid files clean
set :mongrel_clean, true

This causes it to pass the --clean option to mongrel_cluster_ctl.

I went back and looked at one of my deployment recipes and noticed that I had also changed the way my restart task worked. Take a look at the following message in the mongrel users group:

mongrel users discussion of restart

The following is my deploy:restart task. I admit it's a bit of a hack.

namespace :deploy do
  desc "Restart the Mongrel processes on the app server."
  task :restart, :roles => :app do
    mongrel.cluster.stop
    sleep 2.5
    mongrel.cluster.start
  end
end

OTHER TIPS

First, narrow the scope of what you're testing by calling only cap deploy:restart. You might want to pass the --debug option to prompt before remote execution, or the --dry-run option just to see what's going on as you tweak your settings.

At first glance, this sounds like a permissions issue on the pid files or the mongrel processes, but it's difficult to know for sure. A couple of things catch my eye:

  • the :runner variable is explicitly set to nil -- Was there a specific reason for this?
  • Capistrano 2.4 introduced a new behavior for the :admin_runner variable. Without seeing the entire recipe, is this possibly related to your problem?

    :runner vs. :admin_runner (from capistrano 2.4 release) Some cappers have noted that having deploy:setup and deploy:cleanup run as the :runner user messed up their carefully crafted permissions. I agreed that this was a problem. With this release, deploy:start, deploy:stop, and deploy:restart all continue to use the :runner user when sudoing, but deploy:setup and deploy:cleanup will use the :admin_runner user. The :admin_runner variable is unset, by default, meaning those tasks will sudo as root, but if you want them to run as :runner, just do “set :admin_runner, runner”.
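
In concrete terms, the release notes are suggesting a one-liner in your deploy.rb, assuming :runner is already set to the user you want:

# If deploy:setup / deploy:cleanup should sudo as the same user as start/stop/restart:
set :admin_runner, runner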

My recommendation for what to do next: manually stop the mongrels and clean up the PIDs, then start the mongrels by hand. From there, continue to run cap deploy:restart while debugging the problem. Repeat as necessary.

Either way, my mongrels are starting before the previous stop command has finished shutting 'em all down.

sleep 2.5 is not a good solution if it takes longer than 2.5 seconds to halt all of the running mongrels.

There seems to be a need for:

stop && start

vs.

stop; start

(this is how bash works: && runs the second command only if the first one finishes without error, while ";" simply runs the next command regardless of how the first one exits).

I wonder if there is a:

wait cluster_stop
then cluster_start
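
Short of such a command, the wait can be approximated in the Capistrano recipe itself by polling until the pid files are gone instead of sleeping a fixed amount. A sketch, assuming the pid files live under shared/pids (adjust the glob to whatever your mongrel_cluster.yml uses):

namespace :deploy do
  desc "Restart mongrels, waiting for the old processes to exit first."
  task :restart, :roles => :app do
    mongrel.cluster.stop
    # Assumed pid location -- match it to the pid_file setting in mongrel_cluster.yml.
    pid_glob = "#{shared_path}/pids/mongrel.*.pid"
    # Poll for up to ~30 seconds; leave the loop once no pid files remain.
    30.times do
      break if capture("ls #{pid_glob} 2>/dev/null | wc -l").to_i == 0
      sleep 1
    end
    mongrel.cluster.start
  end
end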

I hate to be so basic, but it sounds like the pid files are still hanging around when it is trying to start. Make sure that mongrel is stopped by hand. Clean up the pid files by hand. Then do a cap deploy.

Licensed under: CC-BY-SA with attribution