Automating and Monitoring Your AWS Costs

Posted: February 13th, 2013 | Filed under: DevOps

Regarding costs, there are two things I find most important when managing an infrastructure in AWS. The two sections below cover automated mechanisms for monitoring your bill and for deciding whether you could be saving money somewhere.

Is the bill going to be a surprise?

You can log in to the AWS console and look at your estimated bill, but that requires memory, time, and clicking around in a horrible interface.

I recommend two things to track your costs.

1) Set a cloudwatch alarm at your expected bill level.

If you expect to spend, say, $80K per month, then simply set a cloudwatch alarm at that level. You'll get an email saying you've reached $80K – hopefully at the very end of the month, not in the middle. This page shows you how: http://docs.aws.amazon.com/AmazonCloudWatch...
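If you'd rather script that than click through the console, here's a minimal sketch using boto. The threshold, alarm name, and SNS topic ARN are placeholders to swap for your own; note that billing metrics are only published in us-east-1.

#!/usr/bin/env python
# Sketch: alarm on the total EstimatedCharges metric via boto.
# The $80K threshold and the SNS topic ARN are made-up placeholders.
import boto.ec2.cloudwatch
from boto.ec2.cloudwatch import MetricAlarm

# Billing metrics only live in us-east-1.
cw = boto.ec2.cloudwatch.connect_to_region('us-east-1')

alarm = MetricAlarm(
    name='estimated-monthly-bill',
    namespace='AWS/Billing',
    metric='EstimatedCharges',
    dimensions={'Currency': 'USD'},
    statistic='Maximum',
    comparison='>=',
    threshold=80000,       # dollars - your expected monthly bill
    period=21600,          # billing data only updates every few hours
    evaluation_periods=1,
    alarm_actions=['arn:aws:sns:us-east-1:123456789012:billing-alerts'],  # placeholder
)
cw.create_alarm(alarm)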

2) Graph your estimated charges.

You use graphite, right?

There is a cloudwatch CLI that you can use to fetch metrics from cloudwatch – including estimated billing, once you've enabled those metrics as described in the above link. Using mon-get-stats is really annoying, but it works. Here is the shell script I use to grab billing metrics and shove them into graphite:

#!/bin/sh
export AWS_CLOUDWATCH_HOME=/home/charlie/cloudwatch/CloudWatch-1.0.13.4
export JAVA_HOME=/usr/lib/jvm/default-java

# Get the timestamp from 5 hours ago, to avoid getting > 1440 metrics (which errors).
# also, remove the +0000 from the timestamp, because the cloudwatch cli tries to enforce
# ISO 8601, but doesn't understand it.
DATE=$(date --iso-8601=hours -d "5 hours ago" |sed s/\+.*//)

SERVICES='AmazonS3 ElasticMapReduce AmazonRDS AmazonDynamoDB AWSDataTransfer AmazonEC2 AWSQueueService'

for service in $SERVICES; do

COST=$(/home/charlie/cloudwatch/CloudWatch-1.0.13.4/bin/mon-get-stats EstimatedCharges --aws-credential-file ~/.ec2_credentials --namespace "AWS/Billing" --statistics Sum --dimensions "ServiceName=${service},Currency=USD" --start-time $DATE |tail -1 |awk '{print $3}')

if [ -z "$COST" ]; then
    echo "failed to retrieve $service metric from CloudWatch.."
else
    echo "stats.prod.ops.billing.ec2_${service} $COST `date +%s`" |nc graphite.example.com 2023
fi

done

# one more time, for the sum:
COST=$(/home/charlie/cloudwatch/CloudWatch-1.0.13.4/bin/mon-get-stats EstimatedCharges --aws-credential-file ~/.ec2_credentials --namespace "AWS/Billing" --statistics Sum --dimensions "Currency=USD" --start-time $DATE |tail -1 |awk '{print $3}')

if [ -z "$COST" ]; then
    echo "failed to retrieve EstimatedCharges metric from CloudWatch.."
    exit 1
else
    echo "stats.prod.ops.billing.ec2_total_estimated $COST `date +%s`" |nc graphite.example.com 2023
fi

You will have to install the Java-based cloudwatch CLI, and the ec2 credentials file has to be in a specific format – refer to the docs.

I run this from cron every 5 minutes, and the data is sent straight to graphite (not via statsd). Then, I can display the graph with something like:

http://graphite.example.com/render/?from=-7days&until=now&hideLegend=false&\
title=AWS%20estimated%20bill,%20monthly,%20in%20USD&lineWidth=3&lineMode=connected\
&target=legendValue(aliasByMetric(stats.prod.ops.billing.ec2_*)%2C%22last%22)\
&target=legendValue(alias(color(stats.prod.ops.billing.monthly_unused_RI_waste%2C%22red%22)\
%2C%22monthly%20cost%20of%20unused%20RIs%22)%2C%22last%22)

The final metric, stats.prod.ops.billing.monthly_unused_RI_waste, is pulled from rize.py (mentioned below), to show how much money is being spent on reserved instances that aren’t running (i.e. waste).

After all that, you can have a graph on the wall that shows your estimated monthly billing. Here is an example with totally made up dollar amounts:

Could I be saving money by doing something differently?

You likely can.

1) Spot instances

Spot instances are scary, especially if you aren't using the ELB autoscaler. But before your re-design and AMI-building project comes to fruition and you can use the autoscaler in a sane way, there are a few non-scary ways to use spot instances.

You can run one-off compute-heavy jobs (even via Elastic MapReduce!) using spot instances. If you bid just a few cents above the current price, your instances may get terminated and you may have to start over. This rarely happens, but it can. If you don't want them to disappear, I recommend bidding all the way up to the on-demand price for the instances. People have run spot instances for 6+ months without having them terminated.

Another strategy is to provision a cluster of instances for your workload using reserved instances, up to the number of instances you think you need to provide the appropriate response time (i.e. serve the traffic). You will be wrong about the maximum capacity of these instances, by the way. But that's OK – next, you provision double capacity using spot instances. In this use case, I'd bid 20% above the 30-day historical max market price. You'll be paying a fraction of the on-demand cost for these instances – depending on the type and number, you can often double the capacity of your cluster (half of it on spot instances) for just the cost of 1-2 on-demand instances.

I know you're thinking "when the market price shoots up and they all die, I don't have enough capacity!" I recommend using the autoscaler.. but barring that, another strategy is to provision 20% of the extra nodes with a spot bid price as high as the on-demand rate. Chances are, they will never be terminated. Or better yet, provision 20% above your wrongly-estimated max capacity with on-demand or reserved instances, just to be safe 🙂
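For the curious, here's roughly what that bidding strategy looks like with boto – a sketch only, where the AMI, key name, and instance type are made-up placeholders, and real code would want pagination and error handling:

#!/usr/bin/env python
# Sketch: bid 20% above the 30-day max spot price for an instance type.
# The AMI id, key name, and instance type are made-up placeholders.
from datetime import datetime, timedelta

import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Fetch the last 30 days of spot price history for this instance type.
start = (datetime.utcnow() - timedelta(days=30)).isoformat()
history = conn.get_spot_price_history(start_time=start,
                                      instance_type='m1.large',
                                      product_description='Linux/UNIX')

max_price = max(h.price for h in history)
bid = round(max_price * 1.2, 3)
print("30-day max: $%s, bidding: $%s" % (max_price, bid))

# Request a couple of spot instances at that bid.
conn.request_spot_instances(price=str(bid),
                            image_id='ami-12345678',    # placeholder AMI
                            count=2,
                            instance_type='m1.large',
                            key_name='my-key')          # placeholder key pair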

2) Reserved instances

You can save ~60% by reserving your on-demand EC2 instances for 1 year.

I wrote a python script to tell you how many instances you’re running, vs. how many are reserved, and it’s quite useful! Here:
https://github.com/manos/AWS-Reserved-Instances-Optimizer 

3) S3 lifecycle policies

S3 can be another huge cost in AWS. You have a few options for making sure your buckets don't grow out of control.

Depending on your use case, you can:

  • Set an expiry policy, so e.g. all objects in a bucket are deleted after 1 year
  • Set a lifecycle policy, so e.g. all objects in a bucket are archived to Amazon Glacier after 60 days

I have a particular use case (log files – many TB of them) that uses both of the above techniques.

For other use cases, perhaps the best strategy is to simply monitor how much space each bucket is using, to avoid surprises (sorry, no script for this one, I don't do it currently).
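Both of those can be set with a couple of API calls. Here's a rough sketch with boto – the bucket name and prefix are placeholders, and the exact lifecycle classes vary a bit between boto versions:

#!/usr/bin/env python
# Sketch: archive log objects to Glacier after 60 days and delete them
# after a year. The bucket name and prefix are made-up placeholders.
import boto
from boto.s3.lifecycle import Lifecycle, Rule, Expiration, Transition

s3 = boto.connect_s3()
bucket = s3.get_bucket('example-log-archive')   # placeholder bucket

lifecycle = Lifecycle()
lifecycle.append(Rule('archive-then-expire', prefix='logs/', status='Enabled',
                      transition=Transition(days=60, storage_class='GLACIER'),
                      expiration=Expiration(days=365)))

bucket.configure_lifecycle(lifecycle)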

4) Dead instances still cost money

If you’re not monitoring individual instances with an external monitoring system, you may not notice when an instance stops responding. They’re unresponsive, and not even in an ELB any longer – but they’re still costing you money! So run this from cron to email you if any instances are failing ec2 reachability checks – it even lists the latest ec2 instance events for each dead instance, so you don’t have to muck around in the web console:
https://github.com/manos/ec2-find-dead-instances/

#!/usr/bin/env python
#
# print a list of 'failed' instances in ec2.
# first and only argument is the region to connect to (default: us-east-1)
# run from cron for each region to check.
#
import boto.ec2
import sys
import collections

regions = boto.ec2.regions()
names = [region.name for region in regions]

try:
    if len(sys.argv) > 1:
        region = regions[names.index(sys.argv[1])]
    else:
        region = regions[names.index('us-east-1')]
except ValueError:
    sys.stderr.write("Sorry, the region '%s' does not exist.\n" % sys.argv[1])
    sys.exit(1)  # proper return value for a script run as a command

ec2 = region.connect()
stats = ec2.get_all_instance_status(filters={"system-status.reachability": "failed"})

if len(stats) > 0:
    print "The following instances show 'failed' for the ec2 reachability check: "

for stat in stats:
    reservation = ec2.get_all_instances(filters={'instance-id': stat.id})
    dead_instance = reservation[0].instances[0]

    print dead_instance.tags.get('Name'), stat.id, stat.zone, stat.state_name

    if isinstance(stat.events, collections.Iterable):
        print "\tmost recent events: ", [(event.code, event.description) for event in stat.events]

I could go on forever – but this is getting rather long 🙂

