
Starting out: a new approach to systems monitoring.

Posted: October 2nd, 2012 | Filed under: DevOps

OK, not new to some. Circonus does it this way, and so do some very large sites like Netflix.

But new to me, and certainly new to anyone currently using nagios/zenoss/zabbix/etc. Here’s the story:

The Idea

Metrics

At work (Krux), we have graphite and tons of graphs on the wall. We can see application-level response times in the same view as cache hit/miss rates and requests per second. That’s nice. It’s also not very proactive.

Monitoring

We also have cloudkick (think: nagios with an API). We have tons of plugins checking thresholds, running locally on each box. We recently re-evaluated our monitoring solution and ultimately decided to write our own loosely coupled monitoring infrastructure using a variety of awesome tools. We migrated from cloudkick to collectd with a bunch of plugins, built on a custom Python library I wrote called monitorlib (the collectd and pagerduty parts). The functionality is basically the same: run scripts on each node every 60 seconds, check whether some threshold is met, and alert directly to pagerduty. Meh.
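To make that concrete, here's a sketch of the pattern, using PagerDuty's generic events API. This is not the actual monitorlib code; the service key is a placeholder and the threshold is an arbitrary example:

#!/usr/bin/env python
# A hypothetical threshold check in the style described above, not
# the real monitorlib. Python 2, as was current at the time.
import json
import os
import urllib2

PAGERDUTY_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
SERVICE_KEY = "YOUR_SERVICE_KEY"  # placeholder
LOAD_THRESHOLD = 8.0              # arbitrary example threshold

def check_load():
    # 1-minute load average; the real plugins check many other things
    load1 = os.getloadavg()[0]
    return ("error" if load1 > LOAD_THRESHOLD else "ok"), load1

def page(description):
    payload = {"service_key": SERVICE_KEY,
               "event_type": "trigger",
               "description": description}
    req = urllib2.Request(PAGERDUTY_URL, json.dumps(payload),
                          {"Content-Type": "application/json"})
    urllib2.urlopen(req)

if __name__ == "__main__":
    status, load1 = check_load()
    if status != "ok":
        page("load average %.2f exceeds %.1f" % (load1, LOAD_THRESHOLD))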

Combining

What I really want is a decision engine.

  • I want applications to push events when they know something a poll-based monitoring script doesn't.
  • I want to suppress threshold-based alerts based on a set of rules, and only alert some people.
  • I want to check the load balancer to see how many nodes are healthy before alerting that a single node went down.
  • I want to check response-time graphs in graphite by polling the Holt-Winters confidence bands, and then alert based on configured rules (sketched below).
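That last one is less exotic than it sounds: graphite's render API will happily return holtWintersConfidenceBands() as JSON. A minimal sketch, assuming a hypothetical graphite host and metric name:

import json
import urllib2

GRAPHITE = "http://graphite.example.com"        # hypothetical host
METRIC = "stats.timers.api.response_time.mean"  # hypothetical metric

def latest(datapoints):
    # Return the most recent non-null value in a graphite series.
    values = [v for v, ts in datapoints if v is not None]
    return values[-1] if values else None

url = ("%s/render?target=holtWintersConfidenceBands(%s)&target=%s"
       "&from=-15min&format=json" % (GRAPHITE, METRIC, METRIC))
series = dict((s["target"], latest(s["datapoints"]))
              for s in json.loads(urllib2.urlopen(url).read()))

upper = series["holtWintersConfidenceUpper(%s)" % METRIC]
actual = series[METRIC]
if upper is not None and actual is not None and actual > upper:
    print("response time %.1f is above the upper confidence band %.1f"
          % (actual, upper))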

Basically, we are in a world where we have great graphs and old-school threshold-based alerts. I want to alert on graphs, but also much more – I want to combine multiple bits of information before paging someone at 2am.

How to get there

Going to the next level requires processing events, accepting event data from multiple sources, and configuring rules.

This blog post outlines a few options and has some good ideas: http://www.control-alt-del.org/2012/03/28/collectd-esper-amqp-opentsdbgraphite-oh-my/

Basically, I want *something* to sit and listen for events. I want all my collectd scripts to send data via HTTP POST (JSON), or protobufs, along with a status (ok, warn, error) every minute. Then the *thing* receiving these events will decide – based on state it knows, or gathers by polling graphite/load balancers/etc. – whether to alert, update a status board, neither, or both.
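Roughly the payload I have in mind looks like this. The field names are my own invention, not any published schema, and the endpoint is hypothetical:

import json
import socket
import time
import urllib2

DECISION_ENGINE = "http://decisions.example.com/events"  # hypothetical endpoint

event = {
    "host": socket.gethostname(),
    "service": "apache.response_time",
    "status": "warn",              # ok | warn | error
    "metric": 412.0,               # e.g. milliseconds
    "timestamp": int(time.time()),
    "ttl": 60,                     # consider the event stale after a minute
}

req = urllib2.Request(DECISION_ENGINE, json.dumps(event),
                      {"Content-Type": "application/json"})
urllib2.urlopen(req)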

Building that *thing* is the hard part. There are Complex Event Processing (CEP) frameworks available, most notably Esper, written in Java. Using Esper requires writing a lot of Java. There is also a Google open source project called rocksteady, which looks like a bundle of code that was published but isn't maintained; it might at least help with the “ugh, I don't want to write Java” aspect.

Then there is Riemann, which is where I'm starting. After learning a bit of Clojure, it should provide immediate benefit. It's actively developed, and the author is very responsive. We'll see how it goes!
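Getting events into Riemann from Python is about this simple. This assumes the bernhard client library (any client that speaks Riemann's protobuf protocol would do) and a hypothetical Riemann host:

import socket
import bernhard  # third-party Riemann client; an assumption on my part

client = bernhard.Client(host="riemann.example.com")  # hypothetical host
client.send({
    "host": socket.gethostname(),
    "service": "apache.response_time",
    "state": "warn",
    "metric": 412.0,
    "ttl": 60,
})

The alerting rules then live in Riemann's Clojure config, not in the checks themselves, which is exactly the decoupling I'm after.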

Final notes

I think what I’m trying to do is a bit different than most.

I don’t want to send all my data (graphite metrics – we do around 150K metrics/sec to our graphite cluster) through this decision engine. I want it to get *events* which would historically have been something to page or email about. Then, it needs to make decisions: check graphs as another source of data; check load balancers; re-check to make sure it’s still a problem; maybe even spin up new EC2 instances. I may also want to poll graphite periodically to check various things, perhaps with graphite-tattle.
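As one sketch of what such a decision might look like: before paging on a node-down event, ask the load balancer how many nodes are still in service. This uses boto's ELB API; the load balancer name and threshold are invented:

import boto.ec2.elb  # assumes an AWS environment with ELB, per our EC2 setup

MIN_HEALTHY = 3  # arbitrary: page only if the pool is actually thin

def should_page(event, elb_name="prod-web", region="us-east-1"):
    # Never page on an OK event.
    if event["status"] == "ok":
        return False
    conn = boto.ec2.elb.connect_to_region(region)
    states = conn.describe_instance_health(elb_name)
    healthy = sum(1 for s in states if s.state == "InService")
    # One node down with plenty of healthy capacity left: update the
    # status board, don't wake anyone at 2am.
    return healthy < MIN_HEALTHY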

At this point, I don’t know what else it can/should do. The first step is to send all alerts to the decision engine, and define rules. It shall grow from there 🙂

 



Back to Basics: Unix System Stats Utilities

Posted: February 24th, 2010 | Filed under: Linux / Unix

Unix and Linux systems have forever been obtuse and mysterious to many people. They generally don't have nice graphical utilities for displaying system performance information; you need to know how to coax out the information you want, and then how to interpret what you're given. Let's take a look at some common system tools that can provide tons of visibility into what the opaque OS is really doing.

Unfortunately, the same tools don’t exist universally across all Unix variants. A few commonly underused ones do, however, and that is what we’ll focus on first.

Disk Activity
A common source of “slowness” is disk I/O, or rather the lack of available I/O. On Linux especially, it can be difficult to diagnose. Often the load average will climb quickly, but without any corresponding processes in top eating much CPU. That's because Linux counts processes blocked on I/O toward the load average. I've seen load numbers in the tens of thousands on more than one occasion.

The easiest way to see what's happening to your disks is to run the ‘iostat’ program. Via iostat, you can see how many read and write operations are happening per device, how much CPU is being utilized, and how long each transaction takes. Many arguments are available for iostat, so do spend some time with the man page on your specific system. By default, running ‘iostat’ with no arguments produces a report of disk I/O since boot. To get a snapshot of “now,” add a trailing numeric argument, which tells iostat to gather statistics over that many seconds and report them.

Linux shows the number of blocks read or written per second, along with some useful CPU statistics. This is one particularly busy server:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.36    0.07    5.21   23.80    0.00   69.57

Device:    tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
sda      18.22     15723.35       643.25  65474958946   2678596632

Notice that iowait is at 23%. This means that 23% of the time this server is waiting on disk I/O. Solaris iostat output shows a similar thing, just represented differently (this is iostat -xnz):

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  295.3   79.7 5657.8  211.0  0.0 10.3    0.0   27.4   0 100 d101
  134.8   16.4 4069.8  116.0  0.0  3.5    0.0   23.3   0  90 d105

The %b column (percent of time the device is busy) shows that device d101 was busy servicing transactions 100% of the time. The average service time isn't good either: disk reads shouldn't take 27.4ms. Arguably, Solaris's output is more friendly to parse, since it gives the read rate in kilobytes rather than blocks. We can quickly calculate that this server is reading about 19KB per read by dividing the KB read per second by the reads per second: 5657.8 / 295.3 ≈ 19. In short: this disk array is being taxed by large amounts of read requests.

Vmstat
The ‘vmstat’ program is also universally available, and extremely useful. It, too, reports vastly different information depending on the operating system. The vmstat utility shows statistics about the virtual memory subsystem, or to put it simply: swap space. But it covers much more than just swap, as nearly every I/O operation involves the VM system when pages of memory are allocated. A disk write, a network packet send, and the obvious “program allocates RAM” all impact what you see in vmstat.

Running vmstat with the -p argument will print out disk statistics as well (on Linux, -p takes a partition to report on). In Solaris you get some disk information anyway, as seen below:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m2 m7   in   sy   cs us sy id
 0 0 0 7856104 526824 386 2401 0 0 0 0 0 3 0 0 0 16586 22969 12576  8  9 83
 1 0 0 7851344 522016 18 678 32 0 0 0 0 2 0 0 0 13048 11737 10197  7  6 86
 0 0 0 7843584 514128 76 3330 197 0 0 0 0 2 0 0 0 4762 131492 4441 16  8 76

A subtle but important difference between Solaris and Linux is that Solaris will start scanning for pages of memory that can be freed before it actually starts swapping RAM to disk. The ‘sr’ (scan rate) column will start increasing right before swapping takes place, and continue until some RAM is available. The usual items are available in all operating systems: swap space, free memory, pages in and out (careful, this doesn't mean swapping is happening), page faults, context switches, and CPU idle/system/user statistics. Once you know how to interpret these columns, you can quickly infer how your system is actually being used.
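Since pages in and out are so easy to misread as swapping, here's a small watcher sketch that keys off the columns that actually matter. It's written against Linux vmstat output, where the si/so columns report real swap activity; on Solaris you'd watch ‘sr’ instead:

import subprocess

# Run vmstat with a 5-second interval and flag any swap activity.
proc = subprocess.Popen(["vmstat", "5"], stdout=subprocess.PIPE)
columns = None
for line in iter(proc.stdout.readline, b""):
    fields = line.split()
    if b"si" in fields and b"so" in fields:
        columns = fields  # the header row tells us where si/so live
        continue
    if columns and len(fields) == len(columns):
        # Note: the first sample is averages since boot, so a stricter
        # version would skip it.
        si = int(fields[columns.index(b"si")])
        so = int(fields[columns.index(b"so")])
        if si or so:
            print("swapping! si=%d so=%d" % (si, so))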

The two main programs for finding “slowness” are therefore iostat and vmstat. Before the obligatory tangent into “what DTrace can do for you,” here are a few other tools that no Unix junkie should leave home without:

lsof
Lists open files (including network ports) for all processes
netstat
Lists all sockets in use by the system
mpstat
Shows CPU statistics (including IO), per-processor

DTrace
We cannot talk about system visibility without mentioning DTrace. Invented by Sun, DTrace provides dynamic tracing of everything about a system. It gives you the ability to ask nearly any arbitrary question about the state of a system, and works by enabling “probes” within the kernel. That sounds intimidating, doesn't it?

Let's say that we wanted to know what files were being read or written on our Linux server with the high iowait percentage. There's simply no way to know. Let's ask the same question of Solaris, and instead of learning DTrace first, we'll find something useful in the DTraceToolkit. In the kit, you'll find a few neat programs like iosnoop and iotop, which will tell you which processes are doing all the disk I/O operations. Neat, but we really want to know which files are being accessed so much. In the FS directory, the rfileio.d script provides exactly this: run it, and you'll see every file that's read or written, along with cache hit statistics. There's no way to get this information in other Unixes, and this is just one simple example of how DTrace is invaluable.

The script itself is about 90 lines, inclusive of comments, but the bulk of it deals with cache statistics. An excellent way to start learning DTrace is to simply read the DTraceToolkit scripts.

Don't worry if you're not a Solaris admin: DTrace is coming soon to a FreeBSD near you. SystemTap, a DTrace work-alike, will be available for Linux soon as well. Until then, and even afterward, the tools mentioned above will remain invaluable. If you can quickly get disk I/O statistics and see whether you're swapping, the majority of system performance problems are solved. DTrace also provides amazing application tracing functionality, but if you're looking at the application itself, you already know the slowness isn't likely caused by a system problem.

Soon, I'll publish a few DTrace tutorials.

Some things have surely been left out – discuss below!



Zenoss: We Can Ditch Nagios Now

Posted: February 14th, 2010 | Filed under: IT Management, Linux / Unix

Another perfect example of open source software gone commercial is Zenoss. As a full-featured network and service monitoring solution, Zenoss is one of the best monitoring tools available.

Most importantly, Zenoss combines two functions. First and foremost, an enterprise environment requires host and service monitoring, with notifications. Network monitoring really means checking services, checking that hosts are up (they ping), and possibly writing your own plugins to check various other aspects of a server or network device. Until now, Nagios has filled that role.

Second, once a decent monitoring solution is in place, time-based information becomes desirable. Memory and CPU usage are the most prevalent examples: if you're checking available swap space every so often with Nagios, you may know when you start running low. But it may be just as important to see a graph of the last week's usage. Tools like Cacti or Munin, which collect data frequently and use RRD graphs to display it, are very useful here.

Zenoss fills both roles, without the annoying shortcomings prevalent in the alternative solutions. Zenoss uses the terms Availability Monitoring and Performance Monitoring to describe these two fundamental roles.

Performance of monitoring tools is important, and often overlooked until it becomes a debilitating problem. For example, if you want to chart pretty RRD graphs of system statistics like available RAM or disk space, Munin is an option. Unfortunately it's all Perl, and designed in a way that prevents it from scaling to even a moderate number of hosts. Cacti is a bit better, but monitoring close to 100 hosts is painful with either option. Along comes Zenoss.

Zenoss is written in Python with a MySQL backend for storage, and by all accounts it appears to perform very well. The really great thing about corporate-backed open source is quality control. The community simply isn't responsible enough to say, “No, this won't work, re-implement it.” A company with QA is.

Speaking of features, Zenoss isn't missing many. Flexibility seems to be the top priority: it can monitor hosts with SNMP, Nagios agents, SSH, Windows WMI, and various other mechanisms. Some of the claimed features are a bit over-inflated, such as ZenPing (marketed as Network Topology Monitoring), but the feature set is rich nonetheless.

Zenoss’s primary functions involve four features:

  • Inventory Tracking
  • Availability Monitoring
  • Performance Monitoring
  • Event Monitoring and Management

Inventory tracking claims some sort of “configuration” reporting as well, but it seems very limited. Zenoss will discover your inventory and auto-populate a database. This is great for knowing which IP addresses are in use, for example, but means that “configuration” reporting is limited to an outside observer’s perspective. It can tell you which servers have a Web server running, but it certainly doesn’t deal with the configuration of the Web server. Of course, inventory tracking isn’t limited to automatically discovered information; there are manual input capabilities too.

Availability monitoring is basically Nagios, plus. It can ping, it can monitor Windows machines, and it can pretty much do whatever you need. Even your old Nagios plugins will work with Zenoss. It does generate reports, but much better ones than Nagios is capable of.
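That compatibility is cheap to provide because the Nagios plugin contract is tiny: print one line of status and exit 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. A minimal example in Python (the thresholds are arbitrary):

#!/usr/bin/env python
# check_load: a minimal Nagios-style plugin. The whole protocol is
# one line of output plus the exit code.
import os
import sys

WARN, CRIT = 4.0, 8.0  # 1-minute load thresholds; example values only

load1 = os.getloadavg()[0]
if load1 >= CRIT:
    print("CRITICAL - load average %.2f" % load1)
    sys.exit(2)
elif load1 >= WARN:
    print("WARNING - load average %.2f" % load1)
    sys.exit(1)
print("OK - load average %.2f" % load1)
sys.exit(0)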

Host monitoring, performance monitoring, or whatever you'd like to call it, is quite robust in Zenoss. Some would think it's light on features, but there's a good reason that Zenoss requires you to use SNMP: it's much more scalable than SSH'ing to each server every minute. A bit of up-front configuration is required, in that all your hosts will need SNMP configured and working, but it's completely worth it. Zenoss, too, uses RRD graphs, and it can generate events and alerts based on pre-defined thresholds.
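To get a feel for how lightweight SNMP polling is compared with SSH, here's a sketch that grabs a box's one-minute load average in a single UDP round trip, using net-snmp's snmpget. The hostname is a placeholder, and I'm assuming the default “public” community string:

import subprocess

HOST = "server01.example.com"      # placeholder
OID = "1.3.6.1.4.1.2021.10.1.3.1"  # UCD-SNMP-MIB laLoad.1 (1-minute load)

# -Ovq prints just the value, with no OID or type decoration.
out = subprocess.check_output(
    ["snmpget", "-v2c", "-c", "public", "-Ovq", HOST, OID])
print("1-minute load on %s: %s" % (HOST, out.strip().decode()))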

Finally we come to event monitoring. Zenoss is also encroaching on Splunk‘s territory a bit. It can combine syslog, availability monitoring alerts, SNMP traps, and even Windows event log data. Much like Splunk, Zenoss correlates similar events for easier viewing and troubleshooting. This is the portion that processes all events and generates alerts to pagers or e-mail, taking into account the escalation procedure you’ve defined.

To top it all off, the Zenoss Web interface is top-notch. It includes a customizable “dashboard” for monitoring, and everything is AJAX-enabled, giving a user experience similar to Splunk's or Gmail's.

Marketing fluff aside, Zenoss really does provide a wonderful product. It is, of course, open source and available for free.

At last year’s LISA conference, Zenoss gave a demonstration that sadly coincided with free beer time. Stumbling in toward the end, I demanded one of their free baseball caps, and sat to listen to the last few audience questions. One thing was very obvious: everyone in the room was excited about this product. If hardcore sysadmins are excited, you know this is something worthwhile.

Zenoss is very functional and full of features. It may even be possible to replace three separate pieces of software with this one product: your host inventory database, Nagios, and your performance monitoring tool of choice. Maybe even Splunk some day. We can't wait to see what features they will add next.

