Question

At unpredictable times (on user request) I need to run a memory-intensive job. For this I launch a spot or on-demand instance and tag it as non_idle. When the job is done (which may take hours), I tag it as idle. Because AWS bills by the hour, I want to keep that instance alive until just before the next billable hour begins, in case another job comes in. If a job comes in, the instance should be reused and tagged as non_idle again. If no job comes in during that time, the instance should terminate.

Does AWS offer a ready solution for this? As far as I know, CloudWatch can't set alarms that should run at a specific time, never mind using the CPUUtilization or the instance's tags. Otherwise, perhaps I could simply set up, for every created instance, a Java timer or Scala actor that runs every hour after the instance is created and checks for the tag idle.

Solution

There is no readily available AWS solution for this fine-grained optimization, but you can use the existing building blocks to build your own, based on the launch time of the current instance (see Dmitriy Samovskiy's smart solution for deducing How Long Ago Was This EC2 Instance Started?).

Playing 'Chicken'

Shlomo Swidler has explored this optimization in his article Play “Chicken” with Spot Instances, albeit with a slightly different motivation in the context of Amazon EC2 Spot Instances:

AWS Spot Instances have an interesting economic characteristic that make it possible to game the system a little. Like all EC2 instances, when you initiate termination of a Spot Instance then you incur a charge for the entire hour, even if you’ve used less than a full hour. But, when AWS terminates the instance due to the spot price exceeding the bid price, you do not pay for the current hour.

The mechanics are the same of course, so you might be able to simply reuse the script he assembled, i.e. execute this script instead of or in addition to tagging the instance as idle:

#! /bin/bash
t=/tmp/ec2.running.seconds.$$
if wget -q -O $t http://169.254.169.254/latest/meta-data/local-ipv4 ; then
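    # the downloaded file's modification time stands in for the instance
    # launch time (Dmitriy's technique referenced above);
    # add 60 seconds artificially as a safety margin
    let runningSecs=$(( $(date +%s) - $(date -r "$t" +%s) + 60 ))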
    rm -f $t
    let runningSecsThisHour=$runningSecs%3600
    let runningMinsThisHour=$runningSecsThisHour/60
    let leftMins=60-$runningMinsThisHour
    # start shutdown one minute earlier than actually required
    let shutdownDelayMins=$leftMins-1
    # compare numerically: inside [[ ]], > and < compare strings, so 7, 8 and 9 would fail "< 60"
    if (( shutdownDelayMins > 1 && shutdownDelayMins < 60 )); then
        echo "Shutting down in $shutdownDelayMins mins."
        # TODO: Notify off-instance listener that the game of chicken has begun
        sudo shutdown -h +$shutdownDelayMins
    else
        echo "Shutting down now."
        sudo shutdown -h now
    fi
    exit 0
fi
echo "Failed to determine remaining minutes in this billable hour. Terminating now."
sudo shutdown -h now
exit 1
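
To make the arithmetic in the script easier to check, here is a minimal sketch that isolates it in a function. The function name shutdown_delay_mins is hypothetical and not part of Shlomo's script; it takes the instance's running time in seconds and prints the delay the script would pass to shutdown.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): given the running
# time in seconds, print the number of minutes to wait before shutting down.
shutdown_delay_mins() {
    local running_secs=$(( $1 + 60 ))                # same 60-second safety margin
    local secs_this_hour=$(( running_secs % 3600 ))  # seconds into the current billable hour
    local mins_this_hour=$(( secs_this_hour / 60 ))
    local left_mins=$(( 60 - mins_this_hour ))
    echo $(( left_mins - 1 ))                        # start shutdown one minute early
}

# An instance that has been up for 2 h 47 min: prints 11
shutdown_delay_mins $(( 2 * 3600 + 47 * 60 ))
```

With 2 h 47 min of uptime plus the safety margin, the instance is 48 minutes into its third billable hour, which leaves 12 minutes, and the shutdown is scheduled 11 minutes out.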

Once a job comes in you could then cancel the scheduled termination instead of or in addition to tagging the instance with non_idle as follows:

sudo shutdown -c

This is also the 'red button' emergency command during testing/operation; see e.g. Shlomo's warning:

Make sure you really understand what this script does before you use it. If you mistakenly schedule an instance to be shut down you can cancel it with this command, run on the instance: sudo shutdown -c
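
The two halves (cancel on job arrival, re-arm when the job finishes) could be combined in a small job wrapper. This is only a sketch under several assumptions that are not in the original answer: the tag key state (the question only names the values idle and non_idle), the AWS CLI being installed with permission to create tags, the path /usr/local/bin/chicken.sh for the shutdown script above, and the RUN variable used for dry runs.

```shell
#!/bin/bash
# Sketch of a job wrapper. Assumptions: tag key "state", AWS CLI available,
# shutdown script saved as /usr/local/bin/chicken.sh (hypothetical path).
# Set RUN=echo for a dry run that only prints the commands.
RUN=${RUN:-}

set_state() {   # set_state <instance-id> <idle|non_idle>
    $RUN aws ec2 create-tags --resources "$1" --tags "Key=state,Value=$2"
}

run_job() {     # run_job <instance-id> <command> [args...]
    local instance_id=$1; shift
    $RUN sudo shutdown -c               # cancel any pending shutdown
    set_state "$instance_id" non_idle
    "$@"                                # the memory-intensive job itself
    local rc=$?
    set_state "$instance_id" idle
    $RUN /usr/local/bin/chicken.sh      # re-arm the end-of-hour shutdown
    return $rc
}
```

A call such as run_job "$INSTANCE_ID" ./my_job.sh (both names are placeholders) would then keep the tag and the scheduled shutdown in step with each other, and return the job's own exit status.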

Adding CloudWatch to the game

You could take Shlomo's self-contained approach even further by integrating with Amazon CloudWatch, which recently added an option to Use Amazon CloudWatch to Detect and Shut Down Unused Amazon EC2 Instances. See the introductory blog post Amazon CloudWatch - Alarm Actions for details:

Today we are giving you the ability to stop or terminate your EC2 instances when a CloudWatch alarm is triggered. You can use this as a failsafe (detect an abnormal condition and then act) or as part of your application's processing logic (await an expected condition and then act). [emphasis mine]

Your use case is covered specifically in the section Application Integration:

You can also create CloudWatch alarms based on Custom Metrics that you observe on an instance-by-instance basis. You could, for example, measure calls to your own web service APIs, page requests, or message postings per minute, and respond as desired.

So you could leverage this new functionality by Publishing Custom Metrics to CloudWatch: publish a metric indicating that an instance should terminate (is idle), based on Dmitriy's launch time detection, and reset the metric once a job comes in and the instance should keep running (is non_idle). That way EC2 would take care of the termination, 2 out of 3 automation steps would move from the instance into the operations environment, and management and visibility of the automation process would improve accordingly.
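
As a rough illustration of the CloudWatch variant, the sketch below publishes a per-instance custom metric and creates an alarm whose action terminates the instance. It assumes the AWS CLI is available. The namespace Custom/Jobs, the metric name ShouldTerminate, the alarm name, the one-minute period, and the RUN dry-run variable are all illustrative choices, not something the original answer or the AWS blog post prescribes.

```shell
#!/bin/bash
# Sketch only. Assumed names: namespace "Custom/Jobs", metric "ShouldTerminate".
# Set RUN=echo for a dry run that only prints the commands.
RUN=${RUN:-}

publish_should_terminate() {   # publish_should_terminate <instance-id> <0|1>
    $RUN aws cloudwatch put-metric-data \
        --namespace Custom/Jobs --metric-name ShouldTerminate \
        --dimensions "InstanceId=$1" --value "$2"
}

create_terminate_alarm() {     # create_terminate_alarm <instance-id> <region>
    # Minimum statistic: a 0 published in the same period (job arrived) wins over a 1
    $RUN aws cloudwatch put-metric-alarm \
        --alarm-name "terminate-idle-$1" \
        --namespace Custom/Jobs --metric-name ShouldTerminate \
        --dimensions "Name=InstanceId,Value=$1" \
        --statistic Minimum --period 60 --evaluation-periods 1 \
        --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
        --alarm-actions "arn:aws:automate:$2:ec2:terminate"
}
```

The instance would publish 1 only when it is idle and close to the end of its billable hour, and 0 as soon as a job comes in. Because the alarm needs at least one full period to evaluate, the point at which 1 is first published has to leave some slack before the hour boundary.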

Licensed under: CC-BY-SA with attribution