Automating and Monitoring Your AWS Costs

Posted: February 13th, 2013 | Filed under: DevOps

Regarding costs, there are two things I find most important when managing an infrastructure in AWS. The two sections below cover automated mechanisms to monitor your bill and to figure out whether you could be saving money somewhere.

Is the bill going to be a surprise?

You can log in to the EC2 console and look at your estimated bill, but that requires memory, and time, and clicking around in a horrible interface.

I recommend two things to track your costs.

1) Set a cloudwatch alarm at your expected bill level.

If you expect to spend, say, $80K per month, then simply set a cloudwatch alarm at that level. You’ll get an email, hopefully at the very end of the month (not in the middle), saying you’ve reached $80K. This page shows you how: http://docs.aws.amazon.com/AmazonCloudWatch...
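If you’d rather script it than click through the console, something like this should work with the newer unified aws CLI (a sketch only; the alarm name, threshold, and SNS topic ARN below are made up, and billing metrics have to be enabled first):

# Sketch: alarm when the month-to-date estimated bill crosses $80K.
# Billing metrics only live in us-east-1, and the SNS topic ARN here is a placeholder.
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name "estimated-bill-over-80k" \
    --namespace "AWS/Billing" \
    --metric-name EstimatedCharges \
    --dimensions Name=Currency,Value=USD \
    --statistic Maximum \
    --period 21600 \
    --evaluation-periods 1 \
    --threshold 80000 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts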

2) Graph your estimated charges.

You use graphite, right?

There is a cloudwatch CLI that you can use to fetch metrics from cloudwatch – including estimated billing, once you’ve enabled those metrics as described in the above link. Using the mon-get-stats command is really annoying, but it works. Here is the shell script I use to grab billing metrics and shove them into graphite:

#!/bin/sh
export AWS_CLOUDWATCH_HOME=/home/charlie/cloudwatch/CloudWatch-1.0.13.4
export JAVA_HOME=/usr/lib/jvm/default-java

# Get the timestamp from 5 hours ago, to avoid getting > 1440 metrics (which errors).
# also, remove the +0000 from the timestamp, because the cloudwatch cli tries to enforce
# ISO 8601, but doesn't understand it.
DATE=$(date --iso-8601=hours -d "5 hours ago" | sed 's/+.*//')

SERVICES='AmazonS3 ElasticMapReduce AmazonRDS AmazonDynamoDB AWSDataTransfer AmazonEC2 AWSQueueService'

for service in $SERVICES; do

    COST=$(/home/charlie/cloudwatch/CloudWatch-1.0.13.4/bin/mon-get-stats EstimatedCharges \
        --aws-credential-file ~/.ec2_credentials \
        --namespace "AWS/Billing" \
        --statistics Sum \
        --dimensions "ServiceName=${service},Currency=USD" \
        --start-time $DATE | tail -1 | awk '{print $3}')

    if [ -z "$COST" ]; then
        echo "failed to retrieve $service metric from CloudWatch.."
    else
        echo "stats.prod.ops.billing.ec2_${service} $COST `date +%s`" | nc graphite.example.com 2023
    fi

done

# one more time, for the sum:
COST=$(/home/charlie/cloudwatch/CloudWatch-1.0.13.4/bin/mon-get-stats EstimatedCharges \
    --aws-credential-file ~/.ec2_credentials \
    --namespace "AWS/Billing" \
    --statistics Sum \
    --dimensions "Currency=USD" \
    --start-time $DATE | tail -1 | awk '{print $3}')

if [ -z "$COST" ]; then
    echo "failed to retrieve EstimatedCharges metric from CloudWatch.."
    exit 1
else
    echo "stats.prod.ops.billing.ec2_total_estimated $COST `date +%s`" | nc graphite.example.com 2023
fi

You will have to install the Java-based cloudwatch CLI, and the ec2 credentials file has to be in a specific format – refer to the docs.
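For reference, the credential file those Java tools expect is just two lines, something like this (fake example keys, obviously):

AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE
AWSSecretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY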

I run this from cron every 5 minutes, and the data is sent straight to graphite (not via statsd). Then, I can display the graph with something like:

http://graphite.example.com/render/?from=-7days&until=now&hideLegend=false&\
title=AWS%20estimated%20bill,%20monthly,%20in%20USD&lineWidth=3&lineMode=connected\
&target=legendValue(aliasByMetric(stats.prod.ops.billing.ec2_*)%2C%22last%22)\
&target=legendValue(alias(color(stats.prod.ops.billing.monthly_unused_RI_waste%2C%22red%22)\
%2C%22monthly%20cost%20of%20unused%20RIs%22)%2C%22last%22)

The final metric, stats.prod.ops.billing.monthly_unused_RI_waste, is pulled from rize.py (mentioned below), to show how much money is being spent on reserved instances that aren’t running (i.e. waste).

After all that, you can have a graph on the wall that shows your estimated monthly billing. Here is an example with totally made up dollar amounts:

Could I be saving money by doing something differently?

Likely, you can.

1) Spot instances

Spot instances are scary, especially if you aren’t using the ELB autoscaler. In the meantime, before your re-design and AMI-building project comes to fruition so you can use the autoscaler in a sane way, there are a few non-scary ways to use spot instances.

You can run one-off compute-heavy jobs (even via Elastic MapReduce!) using spot instances. If you bid just a few cents above the current price, your instances may get terminated and you may have to start over. This rarely happens, but it may. I recommend bidding all the way up to the on-demand price for the instances, if you don’t want them to disappear. People have run spot instances for 6+ months without having them terminated.

Another strategy is to provision a cluster of instances for your workload using reserved instances, up to the number of instances you think you need to provide the appropriate response-time (i.e. serve the traffic). You will be wrong about the maximum capacity of these instances, by the way. But that’s OK – next, you provision double capacity using spot instances. In this use-case, I’d bid 20% above the 30-day historical max market price. You’ll be paying a fraction of the on-demand cost for these instances – depending on the type and number, you can often double the capacity of your cluster (1/2 with spot instances) for just the cost of 1-2 on-demand instances.

I know you’re thinking “when the market price shoots up and they all die, I don’t have enough capacity!” I recommend using the autoscaler.. but barring that, another strategy is to provision 20% of the extra nodes with a spot bid price as high as the on-demand rate. Chances are, they will never be terminated. Or better yet, provision 20% above your wrongly-estimated max capacity with on-demand or reserved instances, just to be safe 🙂
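Here’s a rough sketch of how you might compute that bid with the newer unified aws CLI rather than the Java tools (the instance type and the 20% margin are just examples, not a recommendation for your workload):

#!/bin/sh
# Sketch: suggest a spot bid 20% above the 30-day max market price for one instance type.
TYPE=${1:-m1.xlarge}
START=$(date --iso-8601=seconds -d "30 days ago")

MAX=$(aws ec2 describe-spot-price-history \
        --instance-types "$TYPE" \
        --product-descriptions "Linux/UNIX" \
        --start-time "$START" \
        --query 'SpotPriceHistory[].SpotPrice' \
        --output text | tr '\t' '\n' | sort -g | tail -1)

BID=$(echo "$MAX * 1.2" | bc -l)
echo "30-day max price for $TYPE: \$$MAX; suggested bid: \$$BID"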

2) Reserved instances

You can save ~60% by reserving your on-demand EC2 instances for 1 year.

I wrote a python script to tell you how many instances you’re running, vs. how many are reserved, and it’s quite useful! Here:
https://github.com/manos/AWS-Reserved-Instances-Optimizer 
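For a quick eyeball check before reaching for the script, something along these lines works with the newer unified aws CLI (just a sketch – rize.py does the real matching by AZ and instance type):

# What's running, grouped by AZ and instance type:
aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].[Placement.AvailabilityZone,InstanceType]' \
    --output text | sort | uniq -c

# What's reserved and currently active:
aws ec2 describe-reserved-instances \
    --filters Name=state,Values=active \
    --query 'ReservedInstances[].[AvailabilityZone,InstanceType,InstanceCount]' \
    --output text | sort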

3) S3 lifecycle policies

S3 can be another huge cost in AWS. You have a few options for making sure your buckets don’t grow out of control.

Depending on your use case, you can:

  • Set an expiry policy, so e.g. all objects in a bucket are deleted after 1 year
  • Set a lifecycle policy, so e.g. all objects in a bucket are archived to Amazon Glacier after 60 days
I have a particular use case (log files – many TB of them) that uses both of the above techniques.
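A policy combining both looks something like this sketch with the newer unified aws CLI (the bucket name and day counts are placeholders, not my actual settings):

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-then-expire-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": 60, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket my-log-bucket \
    --lifecycle-configuration file://lifecycle.json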
For other use cases, perhaps the best strategy is to simply monitor how much space each bucket is using, to avoid surprises (sorry, no script for this one, I don’t do it currently).

4) Dead instances still cost money

If you’re not monitoring individual instances with an external monitoring system, you may not notice when an instance stops responding. It’s unresponsive, and not even in an ELB any longer – but it’s still costing you money! So run this from cron to email you when instances are failing the ec2 reachability check – it even lists the latest ec2 instance events for each dead instance, so you don’t have to muck around in the web console:
https://github.com/manos/ec2-find-dead-instances/
#!/usr/bin/env python
#
# print a list of 'failed' instances in ec2.
# first and only argument is the region to connect to (default: us-east-1)
# run from cron for each region to check.
#
import boto.ec2
import sys
import collections

regions = boto.ec2.regions()
names = [region.name for region in regions]

try:
    if len(sys.argv) > 1:
        region = regions[names.index(sys.argv[1])]
    else:
        region = regions[names.index('us-east-1')]
except ValueError:
    sys.stderr.write("Sorry, the region '%s' does not exist.\n" % sys.argv[1])
    sys.exit(1)  # proper return value for a script run as a command

ec2 = region.connect()
stats = ec2.get_all_instance_status(filters={"system-status.reachability": "failed"})

if len(stats) > 0:
    print "The following instances show 'failed' for the ec2 reachability check: "

for stat in stats:
    # look up the full instance object so we can print its Name tag
    reservation = ec2.get_all_instances(filters={'instance-id': stat.id})
    dead_instance = reservation[0].instances[0]

    print dead_instance.tags.get('Name'), stat.id, stat.zone, stat.state_name

    # stat.events may be None when there are no scheduled events, so check before iterating
    if isinstance(stat.events, collections.Iterable):
        print "\tmost recent events: ", [(event.code, event.description) for event in stat.events]

I could go on forever – but this is getting rather long 🙂



ELBs with stickiness can get you into.. sticky situations

Posted: July 22nd, 2012 | Filed under: DevOps

AWS strongly recommends against using the stickiness features in their Elastic Load Balancer. Indeed, it’s nice to evenly distribute all requests and not worry about another layer of logic around “why a request went here.”

Unfortunately, in the real world, some applications really want stickiness. For example, at work we run a geoip service, which returns the location of an IP address. All 20+ nodes running this service have a local copy of the database, to avoid having an off-node dependency on the data. Lookups via the tornado-based service are not fast, however. So we toss varnish in front to cache 4+ GB worth of responses. If clients were hopping around between all nodes in an AZ, cache hit rates would suck. Point being, there are times that you want stickiness, and it doesn’t necessarily mean you have poor application design (so stop saying that, Amazon).

Now, when you point your CNAME at a region’s ELB CNAME, you’ll find that it returns 5 or more IP addresses. Some IPs are associated with individual AZs where you have live instances. There is no load balancing to determine which AZ gets which traffic; instead that’s handled by DNS (and to some extent, ELBs forwarding traffic between themselves). When a client hits the ELB(s) in an individual AZ, requests are balanced across all instances in that AZ. This means that you will observe all your repeated requests ending up in a single AZ, but rotating between all the instances there. Unless you enable stickiness, of course.
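You can see the multiple IPs for yourself with a quick DNS lookup against the ELB’s CNAME (the name below is made up):

# returns several A records, typically one or more per enabled AZ
dig +short my-service-1234567890.us-east-1.elb.amazonaws.com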

So you enable stickiness because you want requests from the same client to end up at the same node (for caching, in this example).

Then the inevitable happens — a few of your nodes in the same AZ have hardware failures. You notice the load getting way too high on the remaining 2 instances, and decide that you’d really prefer to balance out that traffic between three other AZs. Like any logical person, you remove the instances in that AZ from the ELB, expecting the ELB(s) to notice there are no longer any instances in that AZ, and stop sending traffic (there’s nothing to receive it!).

This is when you start hearing about failed requests.

If you only remove instances from an ELB, and don’t remove the AZ itself from the ELB, the cluster of ELBs will still try very hard to forward traffic toward the AZ associated with the shared sticky state they know about. Your traffic gets forwarded to an LB with no instances behind it, and dies.

The moral of the story is: if you use stickiness, and want to disable a zone, you must remove the ZONE from the ELB, not just all instances.
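With the newer unified aws CLI, removing the zone looks something like this (load balancer name and zone are placeholders):

# Detach the AZ itself, not just the instances in it:
aws elb disable-availability-zones-for-load-balancer \
    --load-balancer-name my-elb \
    --availability-zones us-east-1a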

That was an annoying lesson learned 😉

Quick side-note: if you use ELB-generated cookies for stickiness, the ELBs will remove your Cache-Control headers!

It’s best to configure varnish/apache/etc to set a Served-By cookie, and tell the ELB to use that. That also makes it really easy to tell which node served a request, if you’re debugging via ‘curl -v’ or similar.
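Setting that up is two calls with the newer unified aws CLI (the load balancer name, policy name, and port below are placeholders; the cookie itself still has to be set by varnish/apache on your end):

# Create an application-controlled stickiness policy keyed on our Served-By cookie...
aws elb create-app-cookie-stickiness-policy \
    --load-balancer-name my-elb \
    --policy-name served-by-stickiness \
    --cookie-name Served-By

# ...and attach it to the listener on port 80.
aws elb set-load-balancer-policies-of-listener \
    --load-balancer-name my-elb \
    --load-balancer-port 80 \
    --policy-names served-by-stickiness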

