
Automating and Monitoring Your AWS Costs

Posted: February 13th, 2013 | Filed under: DevOps | 4 Comments »

When it comes to costs, there are two things I find most important when managing infrastructure in AWS. The two sections below cover automated mechanisms for monitoring your bill and for deciding whether you could be saving money somewhere.

Is the bill going to be a surprise?

You can log in to the AWS console and look at your estimated bill, but that requires memory, and time, and clicking around in a horrible interface.

I recommend two things to track your costs.

1) Set a CloudWatch alarm at your expected bill level.

If you expect to spend, say, $80K per month, then simply set a CloudWatch alarm at that level. You’ll get an email, hopefully at the very end of the month (not in the middle), saying you’ve reached $80K. This page shows you how: http://docs.aws.amazon.com/AmazonCloudWatch...
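
If you prefer to script the alarm rather than click through the console, here is a minimal sketch using boto (the same library used by the dead-instance script further down). The threshold and SNS topic ARN are placeholders, and billing metrics still need to be enabled first:

#!/usr/bin/env python
# Sketch: alarm on the total EstimatedCharges metric (billing metrics live in us-east-1).
# Assumes billing metrics are enabled and the SNS topic already exists.
import boto.ec2.cloudwatch
from boto.ec2.cloudwatch.alarm import MetricAlarm

conn = boto.ec2.cloudwatch.connect_to_region('us-east-1')

alarm = MetricAlarm(
    name='estimated-charges-80k',
    namespace='AWS/Billing',
    metric='EstimatedCharges',
    statistic='Maximum',
    comparison='>=',
    threshold=80000,                 # your expected monthly bill
    period=21600,                    # billing metrics only update every few hours
    evaluation_periods=1,
    dimensions={'Currency': 'USD'},
    alarm_actions=['arn:aws:sns:us-east-1:123456789012:billing-alerts'])  # placeholder ARN

conn.create_alarm(alarm)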

2) Graph your estimated charges.

You use graphite, right?

There is a CloudWatch CLI that you can use to fetch metrics from CloudWatch – including estimated billing, once you’ve enabled those metrics as described in the link above. Using mon-get-stats is really annoying, but it works. Here is the shell script I use to grab billing metrics and shove them into graphite:

#!/bin/sh
export AWS_CLOUDWATCH_HOME=/home/charlie/cloudwatch/CloudWatch-1.0.13.4
export JAVA_HOME=/usr/lib/jvm/default-java

# Get the timestamp from 5 hours ago, to avoid getting > 1440 metrics (which errors).
# Also, remove the +0000 from the timestamp, because the CloudWatch CLI tries to enforce
# ISO 8601, but doesn't understand it.
DATE=$(date --iso-8601=hours -d "5 hours ago" | sed 's/+.*//')

SERVICES='AmazonS3 ElasticMapReduce AmazonRDS AmazonDynamoDB AWSDataTransfer AmazonEC2 AWSQueueService'

for service in $SERVICES; do

  # Per-service estimated charges (summed, in USD) since $DATE.
  COST=$($AWS_CLOUDWATCH_HOME/bin/mon-get-stats EstimatedCharges --aws-credential-file ~/.ec2_credentials --namespace "AWS/Billing" --statistics Sum --dimensions "ServiceName=${service},Currency=USD" --start-time $DATE | tail -1 | awk '{print $3}')

  if [ -z "$COST" ]; then
    echo "failed to retrieve $service metric from CloudWatch.."
  else
    echo "stats.prod.ops.billing.ec2_${service} $COST `date +%s`" | nc graphite.example.com 2023
  fi

done

# One more time, for the sum across all services:
COST=$($AWS_CLOUDWATCH_HOME/bin/mon-get-stats EstimatedCharges --aws-credential-file ~/.ec2_credentials --namespace "AWS/Billing" --statistics Sum --dimensions "Currency=USD" --start-time $DATE | tail -1 | awk '{print $3}')

if [ -z "$COST" ]; then
  echo "failed to retrieve EstimatedCharges metric from CloudWatch.."
  exit 1
else
  echo "stats.prod.ops.billing.ec2_total_estimated $COST `date +%s`" | nc graphite.example.com 2023
fi

You will have to install the Java-based CloudWatch CLI, and the EC2 credentials file has to be in a specific format – refer to the docs.
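
For reference, the file passed via --aws-credential-file is, per the AWS docs, just two lines (the keys below are AWS’s documentation placeholders):

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
AWSSecretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY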

I run this from cron every 5 minutes, and the data is sent straight to graphite (not via statsd). Then, I can display the graph with something like:

http://graphite.example.com/render/?from=-7days&until=now&hideLegend=false&\
title=AWS%20estimated%20bill,%20monthly,%20in%20USD&lineWidth=3&lineMode=connected\
&target=legendValue(aliasByMetric(stats.prod.ops.billing.ec2_*)%2C%22last%22)\
&target=legendValue(alias(color(stats.prod.ops.billing.monthly_unused_RI_waste%2C%22red%22)\
%2C%22monthly%20cost%20of%20unused%20RIs%22)%2C%22last%22)

The final metric, stats.prod.ops.billing.monthly_unused_RI_waste, is pulled from rize.py (mentioned below), to show how much money is being spent on reserved instances that aren’t running (i.e. waste).

After all that, you can have a graph on the wall that shows your estimated monthly billing. Here is an example with totally made up dollar amounts:

Could I be saving money by doing something differently?

Most likely, yes.

1) Spot instances

Spot instances are scary, especially if you aren’t using the ELB autoscaler. In the meantime, before your re-design and AMI-building project comes to fruition so you can use the autoscaler in a sane way, there are a few non-scary ways to use spot instances.

You can run one-off compute-heavy jobs (even via Elastic MapReduce!) using spot instances. If you bid just a few cents above the current price, your instances may get terminated and you may have to start over. This rarely happens, but it can. I recommend bidding all the way up to the on-demand price if you don’t want them to disappear. People have run spot instances for 6+ months without having them terminated.

Another strategy is to provision a cluster of instances for your workload using reserved instances, up to the number of instances you think you need to serve the traffic with an acceptable response time. You will be wrong about the maximum capacity of these instances, by the way. But that’s OK – next, you provision double the capacity using spot instances. In this use case, I’d bid 20% above the 30-day historical max market price. You’ll be paying a fraction of the on-demand cost for these instances – depending on the type and number, you can often double the capacity of your cluster (half of it on spot instances) for just the cost of 1-2 on-demand instances.

I know you’re thinking “when the market price shoots up and they all die, I don’t have enough capacity!” I recommend using the autoscaler.. but barring that, another strategy is to provision 20% of the extra nodes with a spot bid price as high as the on-demand rate. Chances are, they will never be terminated. Or better yet, provision 20% above your wrongly-estimated max capacity with on-demand or reserved instances, just to be safe :)
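
To make the “bid 20% above the 30-day max” idea concrete, here is a rough boto sketch. The instance type, AMI id, key name and security group are placeholders, and a full 30 days of price history may require paginating the API results:

#!/usr/bin/env python
# Sketch: bid 20% above the 30-day max spot price for a given instance type/AZ.
import datetime
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

start = (datetime.datetime.utcnow() - datetime.timedelta(days=30)).isoformat()
history = conn.get_spot_price_history(start_time=start,
                                      instance_type='m1.large',
                                      product_description='Linux/UNIX',
                                      availability_zone='us-east-1a')

max_price = max(h.price for h in history)
bid = round(max_price * 1.2, 3)
print "30-day max: $%s, bidding: $%s" % (max_price, bid)

conn.request_spot_instances(price=str(bid),
                            image_id='ami-xxxxxxxx',     # placeholder AMI
                            count=4,
                            instance_type='m1.large',
                            placement='us-east-1a',
                            key_name='my-key',           # placeholder
                            security_groups=['webapp'])  # placeholder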

2) Reserved instances

You can save ~60% by reserving your on-demand EC2 instances for 1 year.

I wrote a Python script (rize.py) that tells you how many instances you’re running vs. how many are reserved, and it’s quite useful! Here:
https://github.com/manos/AWS-Reserved-Instances-Optimizer 
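
Under the hood, the comparison boils down to counting running instances and active reservations per availability zone and instance type. Here’s a stripped-down boto sketch of that core idea – rize.py itself does considerably more and handles the edge cases, so treat this as illustration only:

#!/usr/bin/env python
# Sketch: compare running instances to active reservations, per (AZ, instance type).
from collections import defaultdict
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

running = defaultdict(int)
for reservation in conn.get_all_instances(filters={'instance-state-name': 'running'}):
    for instance in reservation.instances:
        running[(instance.placement, instance.instance_type)] += 1

reserved = defaultdict(int)
for ri in conn.get_all_reserved_instances(filters={'state': 'active'}):
    reserved[(ri.availability_zone, ri.instance_type)] += ri.instance_count

for key in sorted(set(running) | set(reserved)):
    az, itype = key
    print "%s %s: running=%d reserved=%d" % (az, itype, running[key], reserved[key])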

3) S3 lifecycle policies

S3 can be another huge cost in AWS. You have a few options for making sure your buckets don’t grow out of control.

Depending on your use case, you can:

  • Set an expiry policy, so e.g. all objects in a bucket are deleted after 1 year
  • Set a lifecycle policy, so e.g. all objects in a bucket are archived to Amazon Glacier after 60 days

I have a particular use case (log files – many TB of them) that uses both of the above techniques; a sketch of setting both with boto follows below. For other use cases, perhaps the best strategy is to simply monitor how much space each bucket is using, to avoid surprises (sorry, no script for this one, I don’t do it currently).
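
Here is a minimal boto sketch of both policies; the bucket name and prefixes are placeholders, and the day counts are just the examples from above:

#!/usr/bin/env python
# Sketch: expire one prefix after a year, archive another to Glacier after 60 days.
import boto
from boto.s3.lifecycle import Lifecycle, Rule, Expiration, Transition

conn = boto.connect_s3()
bucket = conn.get_bucket('my-log-bucket')  # placeholder bucket name

lifecycle = Lifecycle()

# Delete everything under logs/ once it is a year old.
lifecycle.append(Rule(id='expire-old-logs', prefix='logs/', status='Enabled',
                      expiration=Expiration(days=365)))

# Push everything under archive/ to Glacier after 60 days.
lifecycle.append(Rule(id='archive-to-glacier', prefix='archive/', status='Enabled',
                      transition=Transition(days=60, storage_class='GLACIER')))

bucket.configure_lifecycle(lifecycle)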

4) Dead instances still cost money

If you’re not monitoring individual instances with an external monitoring system, you may not notice when an instance stops responding. They are unresponsive, and not even in an ELB any longer – but they are still costing you money! So run this from cron to email you when instances are failing the EC2 reachability check – it even lists the latest EC2 instance events for each dead instance, so you don’t have to muck around in the web console:
https://github.com/manos/ec2-find-dead-instances/
#!/usr/bin/env python
#
# print a list of 'failed' instances in ec2.
# first and only argument is the region to connect to (default: us-east-1)
# run from cron for each region to check.
#
import boto.ec2
import sys
import collections

regions = boto.ec2.regions()
names = [region.name for region in regions]

try:
    if len(sys.argv) > 1:
        region = regions[names.index(sys.argv[1])]
    else:
        region = regions[names.index('us-east-1')]
except ValueError:
    sys.stderr.write("Sorry, the region '%s' does not exist.\n" % sys.argv[1])
    sys.exit(1)  # proper return value for a script run as a command

ec2 = region.connect()
stats = ec2.get_all_instance_status(filters={"system-status.reachability": "failed"})

if len(stats) > 0:
    print "The following instances show 'failed' for the ec2 reachability check: "

for stat in stats:
    reservation = ec2.get_all_instances(filters={'instance-id': stat.id})
    dead_instance = reservation[0].instances[0]

    print dead_instance.tags.get('Name'), stat.id, stat.zone, stat.state_name

    if isinstance(stat.events, collections.Iterable):
        print "\tmost recent events: ", [(event.code, event.description) for event in stat.events]

I could go on forever – but this is getting rather long :)



4 Comments on “Automating and Monitoring Your AWS Costs”

  1. Hassan Hosseini said at 13:34 on February 13th, 2013:

    Hey Charlie,

    Enjoyed the blog. I think maybe one step before this would be to project costs going forward even before creating a cloud account. We released a free simulation to help with that, it’s called PlanForCloud. Would love to get your thoughts on it :)

    Cheers,
    Hassan

  2. charlie said at 18:51 on February 13th, 2013:

    Hassan,

    Interesting. It took a few hours to import my account…

    I’m trying to update the sample deployment to all Reserved Instances, but keep running into issues – the storage amounts filled in during the import were decimal values, but when updating the deployment, it won’t accept decimals. Likewise, it won’t let me leave the usage for a stopped instance at 0. I finally got it to work, though :)

    Anyway, this seems pretty useful for giving people an idea of what costs will be – not only starting out, but if they decide to grow, e.g. to support some new infrastructure. That said, if the newly proposed usage was similar to other things I already run, I could arrive at the cost easily by multiplying :)

  3. charlie said at 18:55 on February 13th, 2013:

    Another thing.

    There seem to be no fewer than 10 “cloud cost” startups currently. I’ve not seen any of them attempt to provide a mechanism by which I can group resources together, to provide costs per service. Being able to allocate costs seems extremely valuable for any company.

    For example, I’d like to be able to define that a service “webapp1” uses the following things:
    EC2 instances in X security group, or with tag Y
    RDS instances in X security group
    DynamoDB instance X
    S3 buckets XYZ
    An ELB named Q
    etc, etc, etc.

    And this magical tool could tell me how much webapp1 costs to run.

    Now *that* I would pay for.

  4. Aaron Kaffen said at 17:05 on June 6th, 2013:

    @Charlie The use case of viewing costs for “webapp1” is actually possible with Cloudability (disclosure: I work there).

    You would simply use AWS tags to define the AWS resources associated with that application. Something like app=webapp1

    Then you could use Cloudability’s AWS Cost Analytics to create a report showing daily spending filtered on that tag.

    You’d have a report showing the costs for all of the services used to run Webapp1.

    It’s a fairly new feature, and it’d be great to have you try it out and give us some feedback. LMK if you’re interested.


