Follow me on Twitter:

Starting out: a new approach to systems monitoring.

Posted: October 2nd, 2012 | Author: | Filed under: DevOps | Tags: , , | 2 Comments »

OK, not new to some. Circonus does it this way, and so do some very large sites like Netflix.

But new to me, and certainly new to anyone currently using nagios/zenoss/zabbix/etc. Here’s the story:

The Idea

Metrics

At work (Krux), we have graphite and tons of graphs on the wall. We can see application-level response times in the same view as cache hit/miss rates and requests per second. That’s nice. It’s also not very proactive.

Monitoring

We also have cloudkick (think: nagios with an API). We have tons of plugins checking thresholds, running locally on each box. We recently re-evaluated our monitoring solution, and ultimately decided to write our own loosely coupled monitoring infrastructure using a variety of awesome tools. We migrated from cloudkick to collectd with a bunch of plugins we wrote, using a custom python library I wrote, called monitorlib (collectd and pagerduty parts). The functionality is basically the same: run scripts on each node every 60 seconds, check if some threshold is met, and alert directly to pagerduty. meh.

Combining

What I really want is a decision engine.

I want applications to push events, when they know something a poll-based monitoring script doesn’t.
I want to suppress threshold-based alerts, based on a set of rules, and only alert some people.
I want to check the load balancer to see how many nodes are healthy, before alerting that a single node went down.
I want to check response-time graphs in graphite, by polling the holt-winters confidence bands, and then alert based on configured rules.

Basically, we are in a world where we have great graphs, and old-school threshold-based alerts. I want to alert on graphs, but also much more – I want to combine multiple bits of information before paging someone at 2am.

How to get there

Going to the next level requires processing events, accepting event data from multiple sources, and configuring rules.

This blog post has some good ideas http://www.control-alt-del.org/2012/03/28/collectd-esper-amqp-opentsdbgraphite-oh-my/ and it outlines a few options.

Basically, I want *something* to sit and listen for events. I want all my collectd scripts to send data via HTTP POST (JSON), or protobufs, along with the status (ok, warn, error) every minute. Then, the *thing* that’s receiving these events, will decide – based on state it knows or gathers by polling graphite/load balancers/etc – whether to alert, update a status board, neither, or both.

Building that *thing* is the hard part. There are Complex Event Processing (CEP) frameworks available, most notably Esper, written in Java. Using Esper requires writing a lot of Java. There is a google open source thing, which seems like a bundle of code published but not maintained, called rocksteady. Using rocksteady may help the “ugh, don’t want to Java” aspect.

Then there is Riemann – this is what I’m starting with first. After learning a bit of Clojure, it should provide immediate benefit. And it’s actively developed and the author is very responsive. We’ll see how it goes!

Final notes

I think what I’m trying to do is a bit different than most.

I don’t want to send all my data (graphite metrics – we do around 150K metrics/sec to our graphite cluster) through this decision engine. I want it to get *events* which would historically have been something to page or email about. Then, it needs to make decisions: check graphs as another source of data; check load balancers; re-check to make sure it’s still a problem; maybe even spin up new EC2 instances. I may also want to poll graphite periodically to check various things, perhaps with graphite-tattle.

At this point, I don’t know what else it can/should do. The first step is to send all alerts to the decision engine, and define rules. It shall grow from there 🙂

 


2 Comments »