
Turning SPOT GPS tracks into a google map

Posted: January 13th, 2014 | Filed under: Tricks & Tips

I’ve talked about this before, but I finally spent a little time making this better, and more generally useful for others.

This script will fetch GPS tracks from the SPOT API, store them forever, and generate a Google map plotting all points.

Here it is:

https://github.com/manos/SPOT-GPS-Tracker-Current-Location-on-Your-Website

And here is what it generates:



Automating and Monitoring Your AWS Costs

Posted: February 13th, 2013 | Filed under: DevOps

Regarding costs, there are two things I find most important when managing an infrastructure in AWS. The two sections below describe automated mechanisms for monitoring your bill, and for deciding whether you could be saving money somewhere.

Is the bill going to be a surprise?

You can log in to the EC2 console and look at your estimated bill, but that requires memory, time, and clicking around in a horrible interface.

I recommend two things to track your costs.

1) Set a cloudwatch alarm at your expected bill level.

If you expect to spend, say, $80K per month, then simply set a cloudwatch alarm at that level. You’ll get an email, ideally right at the end of the month (not in the middle), saying you’ve reached $80K. This page shows you how: http://docs.aws.amazon.com/AmazonCloudWatch...
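
If you’d rather script it than click around, roughly the same alarm can be created with boto 2.x. This is just a sketch – the alarm name, threshold, and SNS topic ARN are placeholders, and it assumes you’ve already enabled billing metrics:

# Sketch: create an EstimatedCharges alarm with boto 2.x.
# The alarm name, threshold, and SNS topic ARN below are placeholders.
import boto.ec2.cloudwatch
from boto.ec2.cloudwatch import MetricAlarm

# billing metrics only exist in us-east-1
cw = boto.ec2.cloudwatch.connect_to_region('us-east-1')

alarm = MetricAlarm(
    name='billing-over-80k',
    namespace='AWS/Billing',
    metric='EstimatedCharges',
    statistic='Maximum',
    comparison='>=',
    threshold=80000,
    period=21600,            # billing metrics only update every few hours
    evaluation_periods=1,
    dimensions={'Currency': ['USD']},
    alarm_actions=['arn:aws:sns:us-east-1:123456789012:billing-alerts'],
)
cw.create_alarm(alarm)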

2) Graph your estimated charges.

You use graphite, right?

There is a cloudwatch CLI that you can use to fetch metrics from cloudwatch – including estimated billing, once you’ve enabled those metrics as described in the link above. Using mon-get-stats is really annoying, but it works. Here is the shell script I use to grab billing metrics and shove them into graphite:

#!/bin/sh
export AWS_CLOUDWATCH_HOME=/home/charlie/cloudwatch/CloudWatch-1.0.13.4
export JAVA_HOME=/usr/lib/jvm/default-java

# Get the timestamp from 5 hours ago, to avoid getting > 1440 metrics (which errors).
# Also, remove the +0000 from the timestamp, because the cloudwatch CLI tries to enforce
# ISO 8601, but doesn't understand it.
DATE=$(date --iso-8601=hours -d "5 hours ago" | sed 's/+.*//')

SERVICES='AmazonS3 ElasticMapReduce AmazonRDS AmazonDynamoDB AWSDataTransfer AmazonEC2 AWSQueueService'

for service in $SERVICES; do

  # per-service estimated charges, in USD
  COST=$(/home/charlie/cloudwatch/CloudWatch-1.0.13.4/bin/mon-get-stats EstimatedCharges --aws-credential-file ~/.ec2_credentials --namespace "AWS/Billing" --statistics Sum --dimensions "ServiceName=${service},Currency=USD" --start-time "$DATE" | tail -1 | awk '{print $3}')

  if [ -z "$COST" ]; then
    echo "failed to retrieve $service metric from CloudWatch.."
  else
    echo "stats.prod.ops.billing.ec2_${service} $COST $(date +%s)" | nc graphite.example.com 2023
  fi

done

# one more time, for the sum across all services:
COST=$(/home/charlie/cloudwatch/CloudWatch-1.0.13.4/bin/mon-get-stats EstimatedCharges --aws-credential-file ~/.ec2_credentials --namespace "AWS/Billing" --statistics Sum --dimensions "Currency=USD" --start-time "$DATE" | tail -1 | awk '{print $3}')

if [ -z "$COST" ]; then
  echo "failed to retrieve EstimatedCharges metric from CloudWatch.."
  exit 1
else
  echo "stats.prod.ops.billing.ec2_total_estimated $COST $(date +%s)" | nc graphite.example.com 2023
fi

You will have to install the Java-based cloudwatch CLI, and the ec2 credentials file has to be in a specific format – refer to the docs.

I run this from cron every 5 minutes, and the data is sent straight to graphite (not via statsd). Then, I can display the graph with something like:

http://graphite.example.com/render/?from=-7days&until=now&hideLegend=false&\
title=AWS%20estimated%20bill,%20monthly,%20in%20USD&lineWidth=3&lineMode=connected\
&target=legendValue(aliasByMetric(stats.prod.ops.billing.ec2_*)%2C%22last%22)\
&target=legendValue(alias(color(stats.prod.ops.billing.monthly_unused_RI_waste%2C%22red%22)\
%2C%22monthly%20cost%20of%20unused%20RIs%22)%2C%22last%22)

The final metric, stats.prod.ops.billing.monthly_unused_RI_waste, is pulled from rize.py (mentioned below), to show how much money is being spent on reserved instances that aren’t running (i.e. waste).

After all that, you can have a graph on the wall that shows your estimated monthly billing. Here is an example with totally made up dollar amounts:

Could I be saving money by doing something differently?

Most likely, you can.

1) Spot instances

Spot instances are scary, especially if you aren’t using the ELB autoscaler. In the meantime, before your re-design and AMI-building project comes to fruition so you can use the autoscaler in a sane way, there are a few non-scary ways to use spot instances.

You can run one-off compute-heavy jobs (even via Elastic MapReduce!) using spot instances. If you bid just a few cents above the current price, your instances may get terminated and you may have to start over. This rarely happens, but it can. I recommend bidding all the way up to the on-demand price for the instances if you don’t want them to disappear. People have run spot instances for 6+ months without having them terminated.

Another strategy is to provision a cluster of instances for your workload using reserved instances, up to the number of instances you think you need to provide the appropriate response time (i.e. serve the traffic). You will be wrong about the maximum capacity of these instances, by the way. But that’s OK – next, you provision double the capacity using spot instances. In this use case, I’d bid 20% above the 30-day historical max market price. You’ll be paying a fraction of the on-demand cost for these instances – depending on the type and number, you can often double the capacity of your cluster (with the second half on spot instances) for just the cost of 1-2 on-demand instances. I know you’re thinking “when the market price shoots up and they all die, I don’t have enough capacity!” I recommend using the autoscaler.. but barring that, another strategy is to provision 20% of the extra nodes with a spot bid price as high as the on-demand rate. Chances are, they will never be terminated. Or better yet, provision 20% above your wrongly-estimated max capacity with on-demand or reserved instances, just to be safe :)
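
To make the “bid 20% above the 30-day historical max” idea concrete, here’s roughly what that looks like with boto 2.x. It’s a sketch, not production code – the AMI ID, instance type, and count are placeholders:

# Sketch: look at 30 days of spot price history for an instance type,
# bid 20% above the max, and request a few spot instances at that price.
# The AMI ID, instance type, and count are placeholders.
import datetime
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

start = (datetime.datetime.utcnow() - datetime.timedelta(days=30)).isoformat()
history = conn.get_spot_price_history(start_time=start,
                                      instance_type='m1.large',
                                      product_description='Linux/UNIX')

bid = max(h.price for h in history) * 1.2
print "bidding %.4f/hr for m1.large spot instances" % bid

conn.request_spot_instances(price=str(bid),
                            image_id='ami-12345678',
                            count=4,
                            instance_type='m1.large')

Remember that the price you pass is only your maximum bid – you pay the going market price, which is usually far below it.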

2) Reserved instances

You can save ~60% by reserving your on-demand EC2 instances for 1 year.

I wrote a python script to tell you how many instances you’re running, vs. how many are reserved, and it’s quite useful! Here:
https://github.com/manos/AWS-Reserved-Instances-Optimizer 
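
rize.py handles the details (AZ matching, offering types, and so on), but the core comparison it makes is simple enough to sketch with plain boto – this is just the gist, not the actual script:

# Sketch: count running instances per type vs. active reservations per type.
# The real rize.py also matches on availability zone, platform, etc.
from collections import Counter
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

running = Counter()
for reservation in conn.get_all_instances(filters={'instance-state-name': 'running'}):
    for instance in reservation.instances:
        running[instance.instance_type] += 1

reserved = Counter()
for ri in conn.get_all_reserved_instances(filters={'state': 'active'}):
    reserved[ri.instance_type] += int(ri.instance_count)

for itype in sorted(set(running) | set(reserved)):
    print "%-12s running=%-4d reserved=%-4d" % (itype, running[itype], reserved[itype])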

3) S3 lifecycle policies

S3 can be another huge cost in AWS. You have a few options for making sure your buckets don’t grow out of control.

Depending on your use case, you can:

  • Set an expiry policy, so e.g. all objects in a bucket are deleted after 1 year
  • Set a lifecycle policy, so e.g. all objects in a bucket are archived to Amazon Glacier after 60 days

I have a particular use case (log files – many TB of them) that uses both of the above techniques.

For other use cases, perhaps the best strategy is to simply monitor how much space each bucket is using, to avoid surprises (sorry, no script for this one, I don’t do it currently).
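
Both of those policies can be set with boto as well as through the console. A sketch, assuming a boto 2.x recent enough to know about Glacier transitions (the bucket name and prefixes are placeholders):

# Sketch: delete raw logs after a year, archive another prefix to Glacier
# after 60 days. Bucket name and prefixes are placeholders.
import boto
from boto.s3.lifecycle import Lifecycle, Expiration, Transition

s3 = boto.connect_s3()
bucket = s3.get_bucket('my-log-bucket')

lifecycle = Lifecycle()
# expire (delete) anything under logs/raw/ after 365 days
lifecycle.add_rule('expire-raw-logs', prefix='logs/raw/', status='Enabled',
                   expiration=Expiration(days=365))
# transition anything under logs/archive/ to Glacier after 60 days
lifecycle.add_rule('archive-to-glacier', prefix='logs/archive/', status='Enabled',
                   transition=Transition(days=60, storage_class='GLACIER'))

bucket.configure_lifecycle(lifecycle)

Rules are scoped to a prefix, so you can mix both behaviors within a single bucket.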

4) Dead instances still cost money

If you’re not monitoring individual instances with an external monitoring system, you may not notice when an instance stops responding. They’re unresponsive, and not even in an ELB any longer – but they’re still costing you money! So run this from cron to email you when instances are failing ec2 reachability checks – it even lists the latest ec2 instance events for each dead instance, so you don’t have to muck around in the web console:
https://github.com/manos/ec2-find-dead-instances/
#!/usr/bin/env python
#
# print a list of 'failed' instances in ec2.
# first and only argument is the region to connect to (default: us-east-1)
# run from cron for each region to check.
#
import boto.ec2
import sys
import collections

regions = boto.ec2.regions()
names = [region.name for region in regions]

try:
    if len(sys.argv) > 1:
        region = regions[names.index(sys.argv[1])]
    else:
        region = regions[names.index('us-east-1')]
except ValueError:
    sys.stderr.write("Sorry, the region '%s' does not exist.\n" % sys.argv[1])
    sys.exit(1)  # proper return value for a script run as a command

ec2 = region.connect()
stats = ec2.get_all_instance_status(filters={"system-status.reachability": "failed"})

if len(stats) > 0:
    print "The following instances show 'failed' for the ec2 reachability check: "

for stat in stats:
    reservation = ec2.get_all_instances(filters={'instance-id': stat.id})
    dead_instance = reservation[0].instances[0]

    print dead_instance.tags.get('Name'), stat.id, stat.zone, stat.state_name

    if isinstance(stat.events, collections.Iterable):
        print "\tmost recent events: ", [(event.code, event.description) for event in stat.events]

I could go on forever – but this is getting rather long :)



Starting out: a new approach to systems monitoring.

Posted: October 2nd, 2012 | Filed under: DevOps

OK, not new to some. Circonus does it this way, and so do some very large sites like Netflix.

But new to me, and certainly new to anyone currently using nagios/zenoss/zabbix/etc. Here’s the story:

The Idea

Metrics

At work (Krux), we have graphite and tons of graphs on the wall. We can see application-level response times in the same view as cache hit/miss rates and requests per second. That’s nice. It’s also not very proactive.

Monitoring

We also have cloudkick (think: nagios with an API). We have tons of plugins checking thresholds, running locally on each box. We recently re-evaluated our monitoring solution, and ultimately decided to write our own loosely coupled monitoring infrastructure using a variety of awesome tools. We migrated from cloudkick to collectd with a bunch of plugins we wrote, using a custom python library I wrote, called monitorlib (collectd and pagerduty parts). The functionality is basically the same: run scripts on each node every 60 seconds, check if some threshold is met, and alert directly to pagerduty. meh.

Combining

What I really want is a decision engine.

I want applications to push events, when they know something a poll-based monitoring script doesn’t.
I want to suppress threshold-based alerts, based on a set of rules, and only alert some people.
I want to check the load balancer to see how many nodes are healthy, before alerting that a single node went down.
I want to check response-time graphs in graphite, by polling the holt-winters confidence bands, and then alert based on configured rules.

Basically, we are in a world where we have great graphs, and old-school threshold-based alerts. I want to alert on graphs, but also much more – I want to combine multiple bits of information before paging someone at 2am.

How to get there

Going to the next level requires processing events, accepting event data from multiple sources, and configuring rules.

This blog post has some good ideas http://www.control-alt-del.org/2012/03/28/collectd-esper-amqp-opentsdbgraphite-oh-my/ and it outlines a few options.

Basically, I want *something* to sit and listen for events. I want all my collectd scripts to send data via HTTP POST (JSON), or protobufs, along with the status (ok, warn, error) every minute. Then, the *thing* that’s receiving these events, will decide – based on state it knows or gathers by polling graphite/load balancers/etc – whether to alert, update a status board, neither, or both.
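
For what it’s worth, the events I have in mind are tiny. Something like the sketch below – the endpoint URL and field names are made up for illustration, not any particular schema:

# Sketch of a check script pushing one event via HTTP POST (JSON).
# The endpoint URL and field names here are invented for illustration;
# whatever receives the events defines the real schema.
import json
import time
import urllib2

event = {
    'host': 'web01.example.com',
    'service': 'nginx.backend_response_time',
    'state': 'warn',              # ok | warn | error
    'metric': 0.45,               # seconds
    'time': int(time.time()),
    'tags': ['prod', 'collectd'],
}

req = urllib2.Request('http://events.example.com:5555/events',
                      json.dumps(event),
                      {'Content-Type': 'application/json'})
urllib2.urlopen(req, timeout=5)

The exact format matters less than the principle: every check pushes its state somewhere central every minute, instead of only making a local pass/fail decision.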

Building that *thing* is the hard part. There are Complex Event Processing (CEP) frameworks available, most notably Esper, which is written in Java. Using Esper directly requires writing a lot of Java. There is also a Google open source project called rocksteady, which looks like a bundle of code that was published but isn’t maintained; using rocksteady may help with the “ugh, I don’t want to write Java” aspect.

Then there is Riemann - this is what I’m starting with. After learning a bit of Clojure, it should provide immediate benefit. And it’s actively developed and the author is very responsive. We’ll see how it goes!

Final notes

I think what I’m trying to do is a bit different than most.

I don’t want to send all my data (graphite metrics – we do around 150K metrics/sec to our graphite cluster) through this decision engine. I want it to get *events* which would historically have been something to page or email about. Then, it needs to make decisions: check graphs as another source of data; check load balancers; re-check to make sure it’s still a problem; maybe even spin up new EC2 instances. I may also want to poll graphite periodically to check various things, perhaps with graphite-tattle.
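
As a tiny illustration of the “poll the confidence bands” idea: graphite’s render API will hand you holtWintersAberration() data as JSON, which makes the polling side almost trivial. The metric name here is made up:

# Sketch: ask graphite whether a metric has drifted outside its holt-winters
# confidence bands in the last 10 minutes. The metric name is made up.
import json
import urllib2

url = ('http://graphite.example.com/render/'
       '?target=holtWintersAberration(stats.prod.site.response_time)'
       '&from=-10min&format=json')

series = json.loads(urllib2.urlopen(url, timeout=10).read())

for s in series:
    # datapoints are [value, timestamp] pairs; value is null when data is missing
    aberration = [v for v, ts in s['datapoints'] if v]
    if aberration:
        print "%s is outside its confidence bands: %r" % (s['target'], aberration)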

At this point, I don’t know what else it can/should do. The first step is to send all alerts to the decision engine, and define rules. It shall grow from there :)

 



ELBs with stickiness can get you into.. sticky situations

Posted: July 22nd, 2012 | Filed under: DevOps

AWS strongly recommends against using the stickiness features in their Elastic Load Balancer. Indeed, it’s nice to evenly distribute all requests and not worry about another layer of logic around “why a request went here.”

Unfortunately, in the real world, some applications really want stickiness. For example, at work we run a geoip service, which returns the location of an IP address. All 20+ nodes running this service have a local copy of the database, to avoid having an off-node dependency on the data. Lookups via the tornado-based service are not fast, however. So we toss varnish in front to cache 4+ GB worth of responses. If clients were hopping around between all nodes in an AZ, cache hit rates would suck. Point being, there are times that you want stickiness, and it doesn’t necessarily mean you have poor application design (so stop saying that, Amazon).

Now, when you point your CNAME at a region’s ELB CNAME, you’ll find that it returns 5 or more IP addresses. Some IPs are associated with individual AZs where you have live instances. There is no load balancing to determine which AZ gets which traffic; instead that’s handled by DNS (and to some extent, ELBs forwarding traffic between themselves). When a client hits the ELB(s) in an individual AZ, requests are balanced across all instances in that AZ. This means you will observe all your repeated requests ending up in a single AZ, but rotating between all instances. Unless you enable stickiness, of course.

So you enable stickiness because you want requests from the same client to end up at the same node (for caching, in this example).

Then the inevitable happens — a few of your nodes in the same AZ have hardware failures. You notice the load getting way too high on the remaining 2 instances, and decide that you’d really prefer to balance that traffic across the three other AZs. Like any logical person, you remove the instances in that AZ from the ELB, expecting the ELB(s) to notice there are no longer any instances in that AZ and stop sending traffic there (there’s nothing to receive it!).

This is when you start hearing about failed requests.

If you only remove instances from an ELB, and don’t remove the AZ itself from the ELB, the cluster of ELBs will still try very hard to forward traffic toward the AZ associated with the shared sticky state they know about. Your traffic gets forwarded to an LB with no instances behind it, and dies.

The moral of the story is: if you use stickiness, and want to disable a zone, you must remove the ZONE from the ELB, not just all instances.

That was an annoying lesson learned ;)
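
In boto 2.x terms, the lesson is roughly the difference between these two calls (the load balancer name, instance IDs, and zone are placeholders):

# Sketch: pulling instances out of an ELB is not the same as pulling the AZ out.
# Load balancer name, instance IDs, and zone are placeholders.
import boto.ec2.elb

elb = boto.ec2.elb.connect_to_region('us-east-1')

# what I did: removes the instances, but with stickiness enabled, traffic
# for that AZ keeps arriving and has nothing to land on
elb.deregister_instances('geoip-prod', ['i-11111111', 'i-22222222'])

# what actually stops traffic being routed to the zone
elb.disable_availability_zones('geoip-prod', ['us-east-1c'])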

Quick side-note: if you use ELB-generated cookies for stickiness, the ELBs will remove your Cache-Control headers!

It’s best to configure varnish/apache/etc. to set a Served-By cookie, and tell the ELB to use that. That also makes it really easy to tell which node served a request, if you’re debugging via ‘curl -v’ or similar.



I give thee, cronlib and puppet-cron-analyzer

Posted: June 11th, 2012 | Filed under: DevOps

I’ve been working on a puppet cron analyzer tool, which is coming along nicely:

https://github.com/manos/Puppet-Cron-Analyzer

Its goal was to provide an analysis/map of cron runtimes, but it turns out that simply regex-searching across all crons in an infrastructure is the most useful part (and that part works now).

Also, to build this, I had to create a library to convert cron entries (like what you’d see on-disk) into normalized entries (with only lists of numbers). Cronlib also supports dumping a list of all timestamps a cron will run at (a huge list!), based on a days argument. See cron-analyze.py for a nice way to create a time_map, to avoid storing duplicates of these huge lists.

Cronlib: https://github.com/manos/Puppet-Cron-Analyzer/blob/master/cronlib.py

More to come as puppet-cron-analyzer progresses.



Connecting to existing buckets in S3 with boto, the right way

Posted: January 23rd, 2012 | Filed under: DevOps

Here’s another interesting tidbit.

If you have scripts that connect to S3, and you run out of buckets (Amazon only allows 100 buckets per account), you might get a nasty surprise.

See, you may have been using create_bucket(name-of-bucket) to get your bucket object. It’s undocumented as far as I can see, but apparently if you use create_bucket() on a bucket that already exists, it’ll return the Bucket object. That’s handy! Except it breaks if you’re unable to create more buckets (even though you aren’t really trying to create one). Sigh, so I refactored as such:

# old and busted: bucket = s3_conn.create_bucket(bucket_name)
# new hotness:
# iterate over Bucket objects and return the one matching string:
def find_s3_bucket(s3_conn, string):
    for i in s3_conn.get_all_buckets():
        if string in i.name:
            return i
Used as: bucket = find_s3_bucket(s3_conn, bucket_name)

There is likely a more elegant way, but hey this works.
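
For what it’s worth, boto’s S3Connection also has get_bucket(), which looks up an existing bucket without ever trying to create one, so it sidesteps the 100-bucket limit entirely. A sketch (the wrapper function name is mine, not boto’s):

# Sketch: get_bucket() only looks the bucket up, so it never hits the
# 100-bucket limit. The wrapper name is made up; get_bucket() is boto's.
from boto.exception import S3ResponseError

def get_existing_bucket(s3_conn, bucket_name):
    try:
        return s3_conn.get_bucket(bucket_name)
    except S3ResponseError:
        # bucket doesn't exist (or we don't have access to it)
        return None

Used as: bucket = get_existing_bucket(s3_conn, bucket_name)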

