How can I prevent EC2 instance termination by Auto Scaling?

Question 1

Update

As noted by Ryan Walls (+1), AWS now provides Instance Protection to control whether Auto Scaling can terminate a particular instance when scaling in (see the introductory blog post Instance Protection for Auto Scaling for a walkthrough):

You can enable the instance protection setting on an Auto Scaling group or an individual Auto Scaling instance. When Auto Scaling launches an instance, the instance inherits the instance protection setting of the Auto Scaling group. [...]

It's worth noting that instance protection only applies to regular Auto Scaling scale-in events:

Instance protection does not protect Auto Scaling instances from manual termination through the Amazon EC2 console, the terminate-instances command, or the TerminateInstances API. Instance protection does not protect an Auto Scaling instance from termination if it fails health checks and must be replaced. Also, instance protection does not protect Spot instances in an Auto Scaling group from interruption.

As usual, the feature is available via the AWS Management Console (menu Actions->Instance Protection->Set Scale In Protection), the AWS CLI (set-instance-protection command), and the API (SetInstanceProtection API action).

The latter two options allow automating the scenario at hand: enable instance protection before running 'heavy processing' jobs, and disable it once they have finished so that the instance is eligible for termination again.
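The enable/disable pattern can be wrapped in a context manager. This is only a minimal sketch: the helper name `scale_in_protection` is made up for illustration, and `client` is assumed to behave like `boto3.client("autoscaling")`, whose `set_instance_protection` call corresponds to the SetInstanceProtection API action mentioned above.

```python
from contextlib import contextmanager


@contextmanager
def scale_in_protection(client, group_name, instance_id):
    """Protect one instance from scale-in for the duration of a job.

    `client` is assumed to expose set_instance_protection() like the
    boto3 Auto Scaling client does; the helper name is hypothetical.
    """
    client.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=True,
    )
    try:
        yield
    finally:
        # Always lift protection, even if the job raised, so the instance
        # becomes eligible for termination again.
        client.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=group_name,
            ProtectedFromScaleIn=False,
        )


# Hypothetical usage (group name, instance ID and job function are placeholders):
#
#   import boto3
#   with scale_in_protection(boto3.client("autoscaling"), "my-asg", "i-0123456789abcdef0"):
#       run_heavy_job()
```

Keep in mind the limits quoted above: this only guards against scale-in, not against manual termination or health-check replacement.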

Initial Answer

This functionality is currently not available for Auto Scaling of Amazon EC2 instances - while you are indeed able to Configure [an] Instance Termination Policy for Your Auto Scaling Group, the available policies do not include such a (fairly advanced) concept:

Auto Scaling provides the following termination policy options for you to choose from. You can specify one or more of these options in your termination policy.

OldestInstance — Specify this if you want the oldest instance in your Auto Scaling group to be terminated. [...]

NewestInstance — Specify this if you want the last launched instance to be terminated. [...]

OldestLaunchConfiguration — Specify this if you want the instance launched using the oldest launch configuration to be terminated. [...]

ClosestToNextInstanceHour — Specify this if you want the instance that is closest to completing the billing hour to be terminated. [...]

Default — Specify this if you want Auto Scaling to use the default termination policy to select instances for termination.

Question 2

I just successfully dealt with the problem of long-running jobs in an auto scaling group using the relatively recent lifecycle hook feature.

The problem with trying to choose an idle node to terminate, in my case, was that the process that chooses the idle node will race against processes that submit work to the nodes. In this case it's better to use a strategy where any node can be terminated, but termination happens gracefully so that no work is lost. You can then use all of the standard auto scaling policy stuff to manage scale-in and scale-out.

The termination lifecycle hook allows the user (or a process) to perform actions on the node after it has been placed into an intermediate state (labeled Terminating:Wait) by the auto scaling group. The user (or process) is then responsible for completing the lifecycle action via an AWS API call, resulting in the shutdown of the terminated EC2 instance.

The way I set this up, in short, is:

1. Create a role that allows auto scaling to post a message to an SQS queue.
2. Create an SQS queue for the termination messages.
3. Create a monitor script that runs as a service on each node. My script is a simple event-driven state machine that transitions in sequence from MONITORING (polling SQS for a termination message for the node) to DRAINING (polling a job queue until no work is being performed on the node) to TERMINATED (making the complete-lifecycle call).
4. Set up the standard configuration for event-driven AWS auto scaling; that is, create the CloudWatch alarms and the auto scaling policies for scale-in and scale-out.
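The monitor script from the list above could look roughly like the following. This is not the original script, just a sketch under assumptions: the function names are made up, the SQS message body is assumed to be the lifecycle hook's JSON notification delivered directly to the queue, and the three callables (`receive_bodies`, `node_is_busy`, `complete_action`) stand in for the SQS poll, the job-queue check and the complete-lifecycle call.

```python
import json
import time


def is_termination_notice(body, instance_id):
    """Return the parsed hook message if it tells this instance to terminate."""
    msg = json.loads(body)
    if (msg.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_TERMINATING"
            and msg.get("EC2InstanceId") == instance_id):
        return msg
    return None


def run_monitor(instance_id, receive_bodies, node_is_busy, complete_action,
                poll_seconds=30, sleep=time.sleep):
    """Walk MONITORING -> DRAINING -> TERMINATED for one node.

    receive_bodies():  returns a list of SQS message bodies (may be empty)
    node_is_busy():    True while work is still running on this node
    complete_action(): receives the parsed notice and completes the hook
    All three are hypothetical hooks supplied by the caller.
    """
    state, notice = "MONITORING", None
    while state != "TERMINATED":
        if state == "MONITORING":
            for body in receive_bodies():
                notice = is_termination_notice(body, instance_id)
                if notice:
                    state = "DRAINING"
                    break
            else:
                # No termination message for this node yet; poll again later.
                sleep(poll_seconds)
        elif state == "DRAINING":
            if node_is_busy():
                sleep(poll_seconds)
            else:
                complete_action(notice)
                state = "TERMINATED"
    return state


# With boto3, complete_action could be something like (assumption, untested here):
#
#   def complete_action(notice):
#       boto3.client("autoscaling").complete_lifecycle_action(
#           LifecycleHookName=notice["LifecycleHookName"],
#           AutoScalingGroupName=notice["AutoScalingGroupName"],
#           LifecycleActionToken=notice["LifecycleActionToken"],
#           LifecycleActionResult="CONTINUE",
#       )
```

The sketch leaves out details a real service would need, such as deleting handled messages and making sure the node stops accepting new work once it enters DRAINING.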

~~One hindrance to this approach is that lifecycle hook management isn't supported yet in the SDKs (boto, at least, doesn't support it AFAIK), nor are there CloudFormation resources for the hooks.~~

The relevant AWS documentation is here:

http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroupLifecycle.html

Question 3

Amazon has finally addressed this issue in a simpler way. There is now "instance protection" where you can mark your instance as protected and it will not be terminated during a "scale in".

See https://aws.amazon.com/blogs/aws/new-instance-protection-for-auto-scaling

Question 4

aws-cli is your best friend.

1. Disable your scale-down policy on your autoscaling group.
2. Create a cron job or scheduled task using aws-cli to:

2a. Get the EC2 instances associated with the autoscaling group http://docs.aws.amazon.com/cli/latest/reference/autoscaling/describe-auto-scaling-instances.html

2b. Next, monitor the CloudWatch statistics on the EC2 instances http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/US_SingleMetricPerInstance.html http://docs.aws.amazon.com/cli/latest/reference/cloudwatch/get-metric-statistics.html

2c. Terminate the idle EC2 instance(s) from your auto-scaling group http://docs.aws.amazon.com/cli/latest/reference/autoscaling/terminate-instance-in-auto-scaling-group.html
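The answer uses aws-cli; the same three steps (2a to 2c) can also be sketched in Python. The clients are assumed to behave like `boto3.client("autoscaling")` and `boto3.client("cloudwatch")`, and the function names, the 5% CPU threshold and the 30-minute window are arbitrary choices for illustration, not recommendations. Pagination of the instance listing is ignored for brevity.

```python
from datetime import datetime, timedelta, timezone


def average_cpu(cloudwatch, instance_id, minutes=30):
    """Mean of the 5-minute CPUUtilization averages, or None if no data."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    points = [p["Average"] for p in resp["Datapoints"]]
    return sum(points) / len(points) if points else None


def terminate_idle(autoscaling, cloudwatch, group_name, cpu_threshold=5.0):
    """Terminate in-service instances of the group whose CPU looks idle."""
    terminated = []
    # 2a. List the instances that belong to the Auto Scaling group.
    resp = autoscaling.describe_auto_scaling_instances()
    for inst in resp["AutoScalingInstances"]:
        if inst["AutoScalingGroupName"] != group_name:
            continue
        if inst["LifecycleState"] != "InService":
            continue
        # 2b. Check the CloudWatch statistics for this instance.
        cpu = average_cpu(cloudwatch, inst["InstanceId"])
        # Instances without datapoints (e.g. just launched) are left alone.
        if cpu is not None and cpu < cpu_threshold:
            # 2c. Terminate it and shrink the group by one.
            autoscaling.terminate_instance_in_auto_scaling_group(
                InstanceId=inst["InstanceId"],
                ShouldDecrementDesiredCapacity=True,
            )
            terminated.append(inst["InstanceId"])
    return terminated
```

Low CPU is only a proxy for "idle"; if the jobs are I/O-bound, a custom metric or a check against the job queue would likely be a better signal.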

Question 5

You can use Amazon CloudWatch to achieve this: http://aws.typepad.com/aws/2013/01/amazon-cloudwatch-alarm-actions.html. From the article:

You can use a similar strategy to get rid of instances that are tasked with handling compute-intensive batch processes. Once the CPU goes idle and the work is done, terminate the instance and save some money!

In this case, since you will be handling the termination, you will need to remove the scale-down policy. Also see another option: https://stackoverflow.com/a/19628453/432849.
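As a rough illustration of the alarm-action approach, the helper below builds the parameters for a CloudWatch alarm that terminates an instance once its CPU has stayed low for a while. The helper name, alarm name, threshold and evaluation window are made-up values; the keys are intended to match the parameters of the boto3 CloudWatch client's `put_metric_alarm`, and the `arn:aws:automate:<region>:ec2:terminate` action is the EC2 terminate action that alarm actions support.

```python
def idle_termination_alarm(instance_id, region, cpu_threshold=5.0, periods=6):
    """Build put_metric_alarm() parameters for a 'terminate when idle' alarm.

    The name, threshold and window are illustrative defaults only.
    """
    return {
        "AlarmName": f"terminate-idle-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                  # 5-minute datapoints
        "EvaluationPeriods": periods,   # 6 x 5 min = 30 min below threshold
        "Threshold": cpu_threshold,
        "ComparisonOperator": "LessThanThreshold",
        # EC2 action: terminate the instance when the alarm fires.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:terminate"],
    }


# Hypothetical usage (region and instance ID are placeholders):
#
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(
#       **idle_termination_alarm("i-0123456789abcdef0", "us-east-1")
#   )
```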