Question

I have a cloud of server instances running at Amazon using their load balancer to distribute the traffic. Now I am looking for a good way to gracefully scale the network down, without causing connection errors on the browser's side.

As far as I know, any open connections to an instance are terminated abruptly when it is removed from the load balancer.

I would like a way to notify my instance, say, one minute before it gets shut down, or to have the load balancer stop sending new traffic to the dying instance without terminating its existing connections.

My app is Node.js based, running on Ubuntu. I also have some special software running on it, so I prefer not to use the many PaaS offerings that host Node.js.

Thanks for any hints.


Solution

This idea uses the ELB's ability to detect an unhealthy node and remove it from the pool, BUT it relies on the ELB behaving as expected under the assumptions below. This is something I've been meaning to test for myself but haven't had the time yet. I'll update the answer when I do.

Process Overview

The following logic could be wrapped and run at the time the node needs to be shut down.

  1. Block new HTTP connections to nodeX but continue to allow existing connections
  2. Wait for existing connections to drain, either by monitoring existing connections to your application or by allowing a "safe" amount of time.
  3. Initiate a shutdown on the nodeX EC2 instance using the EC2 API directly or abstraction scripts.

"Safe" is defined by your application, and may not be possible to determine for some applications.
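The three steps above could be wrapped in a small script along these lines. This is a sketch only: the port, instance ID, and timeout are placeholder assumptions, and the iptables/aws calls require root privileges and AWS credentials respectively.

```shell
#!/bin/bash
# Sketch of the three-step graceful shutdown. PORT, INSTANCE_ID and
# TIMEOUT are placeholders, not values from the original question.

PORT=8080
INSTANCE_ID="i-XXXXXXX"
TIMEOUT=300  # the "safe" drain window, in seconds

# 1. Block new HTTP connections (TCP SYNs) but keep established sessions.
block_new_connections () {
    iptables -A INPUT -p tcp --syn --destination-port "$PORT" -j DROP
}

# Count connections currently established to the service port.
count_established () {
    ss -tn state established "( sport = :$PORT )" | tail -n +2 | wc -l
}

# 2. Wait until connections drain or the timeout expires.
drain () {
    local waited=0
    while [ "$(count_established)" -gt 0 ] && [ "$waited" -lt "$TIMEOUT" ]; do
        sleep 5
        waited=$((waited + 5))
    done
}

# 3. Shut the instance down through the EC2 API.
stop_instance () {
    aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
}

# Usage (as root, on the node being retired):
#   block_new_connections && drain && stop_instance
```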

Assumptions that need to be tested

We know that ELB removes unhealthy instances from its pool. I would expect this to be graceful, so that:

  1. A new connection to a recently closed port will be gracefully redirected to the next node in the pool
  2. When a node is marked Bad, the already established connections to that node are unaffected.

Possible test cases:

  • Fire HTTP connections at the ELB (e.g. from a curl script), logging the results, during a scripted opening and closing of one of the nodes' HTTP ports. You would need to experiment to find an acceptable amount of time that allows the ELB to reliably detect a state change.
  • Maintain a long HTTP session (e.g. a file download) while blocking new HTTP connections; the long session should hopefully continue.
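The first test case could be driven by a small curl loop like the one below; the ELB URL in the example comment is a placeholder, not from the original.

```shell
#!/bin/bash
# Fire one request per second at the ELB and log the HTTP status codes,
# while a separate script opens and closes a node's HTTP port.
probe () {
    local url="$1" count="$2"
    for _ in $(seq 1 "$count"); do
        # -w prints the status code; 000 indicates a connection failure.
        curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 "$url"
        sleep 1
    done
}

# Example: probe "http://my-elb.us-east-1.elb.amazonaws.com/" 120
```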

1. How to block HTTP Connections

Use a local firewall on nodeX to block new sessions but continue to allow established sessions.

For example, with iptables:

# Drop new TCP connections (SYN packets) to the service port; established sessions are unaffected.
iptables -A INPUT -p tcp --syn --destination-port <web service port> -j DROP

OTHER TIPS

I know this is an old question, but it should be noted that Amazon has since added support for connection draining: when an instance is removed from the load balancer, it completes the requests that were in progress before removal, and no new requests are routed to it. You can also supply a timeout, meaning any request that runs longer than the timeout window will be terminated.

To enable this behaviour, go to the Instances tab of your load balancer and change the Connection Draining setting.
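For classic ELBs, the same setting can be changed from the AWS CLI via `modify-load-balancer-attributes`; the load balancer name in the example is a placeholder.

```shell
# Enable connection draining with a 300-second timeout on a classic ELB.
# The load balancer name passed as $1 is a placeholder for your own.
enable_connection_draining () {
    aws elb modify-load-balancer-attributes \
        --load-balancer-name "$1" \
        --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":300}}'
}

# Example: enable_connection_draining my-load-balancer
```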

The recommended way for distributing traffic from your ELB is to have an equal number of instances across multiple availability zones. For example:

ELB

  • Instance 1 (us-east-a)
  • Instance 2 (us-east-a)
  • Instance 3 (us-east-b)
  • Instance 4 (us-east-b)

There are two ELB API operations of interest that allow you to detach instances programmatically (or via the control panel):

  1. Deregister an instance
  2. Disable an availability zone (which subsequently disables the instances within that zone)

The ELB Developer Guide has a section that describes the effects of disabling an availability zone. A note in that section is of particular interest:

Your load balancer always distributes traffic to all the enabled Availability Zones. If all the instances in an Availability Zone are deregistered or unhealthy before that Availability Zone is disabled for the load balancer, all requests sent to that Availability Zone will fail until DisableAvailabilityZonesForLoadBalancer is called for that Availability Zone.

What's interesting about the above note is that it could imply that if you call DisableAvailabilityZonesForLoadBalancer, the ELB could instantly start sending requests only to the remaining enabled zones - possibly resulting in a zero-downtime experience while you perform maintenance on the servers in the disabled Availability Zone.

The above 'theory' needs detailed testing or acknowledgement from an Amazon cloud engineer.
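If you want to test that theory, the operation is exposed in the AWS CLI as `disable-availability-zones-for-load-balancer`; the names in the example are placeholders.

```shell
# Remove an Availability Zone from a classic ELB's rotation.
# $1 = load balancer name, $2 = zone; both are placeholder arguments.
disable_zone () {
    aws elb disable-availability-zones-for-load-balancer \
        --load-balancer-name "$1" \
        --availability-zones "$2"
}

# Example: disable_zone my-load-balancer us-east-1b
```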

There have already been a number of responses here, and some of them offer good advice, but I think that in general your design is flawed. No matter how perfectly you design your shutdown procedure to make sure a client's connection is closed before shutting down a server, you're still vulnerable:

  1. The server could lose power.
  2. A hardware failure could cause the server to fail.
  3. A connection could be closed by a network issue.
  4. The client could lose its internet or WiFi connection.

I could go on, but my point is that instead of designing the system to always work correctly, design it to handle failures. If you design a system that can handle a server losing power at any time, you've created a very robust system. This isn't a problem with the ELB; it's a problem with your current system architecture.

I can't comment because of my low reputation, but here are some snippets I crafted that might be very useful for someone out there. They use the AWS CLI to check when an instance has been drained of connections.

You need an EC2 instance running the following Python server behind an ELB.

from flask import Flask
import time

app = Flask(__name__)

@app.route("/")
def index():
    return "ok\n"

@app.route("/wait/<int:secs>")
def wait(secs):
    time.sleep(secs)
    return str(secs) + "\n"

if __name__ == "__main__":
    app.run(
        host='0.0.0.0',
        debug=True)

Then run the following script from a local workstation against the ELB.

#!/bin/bash

which jq > /dev/null || {
   echo "Get jq from http://stedolan.github.com/jq"
   exit 1
}

# Fill in following vars
lbname="ELBNAME"
lburl="http://ELBURL.REGION.elb.amazonaws.com/wait/30"
instanceid="i-XXXXXXX"

getState () {
    aws elb describe-instance-health \
        --load-balancer-name $lbname \
        --instances $instanceid | jq '.InstanceStates[0].State' -r
}

register () {
    aws elb register-instances-with-load-balancer \
        --load-balancer-name $lbname \
        --instances $instanceid | jq .
}

deregister () {
    aws elb deregister-instances-from-load-balancer \
        --load-balancer-name $lbname \
        --instances $instanceid | jq .
}

waitUntil () {
    echo -n "Wait until state is $1"
    while [ "$(getState)" != "$1" ]; do
        echo -n "."
        sleep 1
    done
    echo
}

# Actual Dance
# Make sure instance is registered. Check latency until node is deregistered

if [ "$(getState)" == "OutOfService" ]; then
    register >> /dev/null
fi

waitUntil "InService"

curl $lburl &
sleep 1

deregister >> /dev/null

waitUntil "OutOfService"

A caveat that was not discussed in the existing answers is that ELBs also use DNS records with 60 second TTLs to balance load between multiple ELB nodes (each having one or more of your instances attached to it).

This means that if you have instances in two different availability zones, you probably have two IP addresses for your ELB, each with a 60s TTL on its A record. When you remove the final instances from one of those availability zones, your clients "might" still use the old IP address for at least a minute - and faulty DNS resolvers might behave much worse.

ELBs also use multiple IPs, with the same problem, when a single availability zone contains a very large number of instances - too many for one ELB server to handle. In that case ELB will create another server and add its IP to the list of A records, again with a 60-second TTL.
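You can observe this yourself by resolving the ELB's DNS name and watching the A records and TTLs change; the hostname in the example is a placeholder.

```shell
# List the A records (and their remaining TTLs) currently published
# for a hostname, using dig.
elb_records () {
    dig +noall +answer "$1" A
}

# Example: elb_records my-elb.us-east-1.elb.amazonaws.com
```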

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow