Zenoss: We Can Ditch Nagios Now

Posted: February 14th, 2010 | Author: charlie | Filed under: IT Management, Linux / Unix, Networking

Another perfect example of open source software gone commercial is Zenoss. As a full-featured network and service monitoring solution, Zenoss is one of the best monitoring tools available.

Most importantly, Zenoss combines two functions. First and foremost, an enterprise environment requires host and service monitoring, with notifications. Network monitoring really means checking services, checking that hosts are up (they respond to ping), and possibly writing your own plugins to check various other aspects of a server or network device. Until now, Nagios has filled that role.

Second, once a decent monitoring solution is in place, getting time-based information becomes desirable. Memory and CPU usage are the most prevalent examples: if you’re checking available swap space every so often with Nagios, you may know when you start running low. But it may be just as important to see a graph of the last week’s usage. Tools like Cacti or Munin, which collect data frequently and use RRD graphs to display it, are very useful.

Zenoss fills both roles, without the annoying shortcomings prevalent in the alternative solutions. Zenoss uses the terms Availability Monitoring and Performance Monitoring to describe these two fundamental roles.

Performance of monitoring tools is important, and often overlooked until it becomes a debilitating problem. For example, if you want to chart pretty RRD graphs of system statistics like available RAM or disk space, Munin is an option. Unfortunately it’s all Perl, and designed in a way that prevents it from scaling to even a moderate number of hosts. Cacti is a bit better, but monitoring close to 100 hosts is painful with either option. Along comes Zenoss.

Zenoss is written in Python and uses MySQL for event storage (the device model lives in Zope’s object database, and performance data goes into RRD files), and by all accounts it performs very well. The really great thing about corporate-backed open source is quality control. The community simply isn’t responsible enough to say, “No, this won’t work, re-implement it.” A company with QA is.

Speaking of features, Zenoss isn’t missing many. Flexibility seems to be the top priority: it can monitor hosts with SNMP, Nagios agents, SSH, Windows WMI, and various other mechanisms. Some of the claimed features are a bit inflated, such as ZenPing (marketed as Network Topology Monitoring), but the feature set is rich nonetheless.

Zenoss’s primary functions involve four features:

  • Inventory Tracking
  • Availability Monitoring
  • Performance Monitoring
  • Event Monitoring and Management

Inventory tracking claims some sort of “configuration” reporting as well, but it seems very limited. Zenoss will discover your inventory and auto-populate a database. This is great for knowing which IP addresses are in use, for example, but means that “configuration” reporting is limited to an outside observer’s perspective. It can tell you which servers have a Web server running, but it certainly doesn’t deal with the configuration of the Web server. Of course, inventory tracking isn’t limited to automatically discovered information; there are manual input capabilities too.

Availability monitoring is basically Nagios, plus. It can ping, it can monitor Windows machines, and it can pretty much do whatever you need. Even your old Nagios plugins will work with Zenoss. It generates reports too, and much better ones than Nagios is capable of.

Host monitoring, performance monitoring, or whatever you’d like to call it, is quite robust in Zenoss. Some would think it’s light on features, but there’s a good reason that Zenoss steers you toward SNMP: it’s much more scalable than SSH’ing to each server every minute. A bit of up-front configuration is required, in that all your hosts will need SNMP configured and working, but it’s completely worth it. Zenoss also uses RRD graphs, and it can generate events and alerts based on pre-defined thresholds.
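
To make the threshold idea concrete, here is a minimal Python sketch of how a collected value might be checked against pre-defined limits and turned into events. It is not Zenoss code: the `Threshold` class, the `evaluate` function, and the swap example are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Threshold:
    """A hypothetical min/max limit on one collected metric."""
    name: str
    severity: str
    minimum: Optional[float] = None  # violated when value drops below this
    maximum: Optional[float] = None  # violated when value rises above this

    def violated_by(self, value: float) -> bool:
        if self.minimum is not None and value < self.minimum:
            return True
        if self.maximum is not None and value > self.maximum:
            return True
        return False


def evaluate(value: float, thresholds: List[Threshold]) -> List[str]:
    """Return one event string per threshold the value violates."""
    return [
        f"{t.severity}: {t.name} (value={value})"
        for t in thresholds
        if t.violated_by(value)
    ]


# Example: free swap in MB, as it might come back from an SNMP poll.
swap_thresholds = [Threshold("low swap", "warning", minimum=512.0)]
print(evaluate(256.0, swap_thresholds))
```

The same values that feed the check can be written to the RRD file for graphing, which is why one poll can serve both alerting and trending.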

Finally we come to event monitoring. Zenoss is also encroaching on Splunk’s territory a bit. It can combine syslog, availability monitoring alerts, SNMP traps, and even Windows event log data. Much like Splunk, Zenoss correlates similar events for easier viewing and troubleshooting. This is the portion that processes all events and generates alerts to pagers or e-mail, taking into account the escalation procedure you’ve defined.
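
Grouping similar events can be pictured roughly as follows. This is a simplified sketch, assuming that events from any source (syslog, traps, ping failures) have already been normalized into dictionaries; the choice of `device`, `component`, and `summary` as the grouping key is an assumption made for illustration, not Zenoss’s documented behaviour.

```python
def deduplicate(events):
    """Collapse repeated events into one record with a running count."""
    merged = {}
    for ev in events:
        # Hypothetical fingerprint: same device, component and message.
        key = (ev["device"], ev["component"], ev["summary"])
        if key in merged:
            merged[key]["count"] += 1
            merged[key]["last_seen"] = ev["time"]
        else:
            merged[key] = {
                **ev,
                "count": 1,
                "first_seen": ev["time"],
                "last_seen": ev["time"],
            }
    # Dicts keep insertion order, so results appear in first-seen order.
    return list(merged.values())


raw = [
    {"device": "web1", "component": "eth0", "summary": "link down", "time": 100},
    {"device": "web1", "component": "eth0", "summary": "link down", "time": 160},
    {"device": "db1", "component": "disk", "summary": "disk full", "time": 170},
]
for record in deduplicate(raw):
    print(record["device"], record["summary"], record["count"])
```

An operator then sees one line per problem with a repeat count, rather than one line per message.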

To top it all off, the Zenoss Web interface is top-notch. It includes a customizable “dashboard” for monitoring, and everything is AJAX-enabled. AJAX provides a user experience similar to that of Splunk and Google’s Gmail.

Marketing fluff aside, Zenoss really does provide a wonderful product. It is, of course, open source and available for free.

At last year’s LISA conference, Zenoss gave a demonstration that sadly coincided with free beer time. Stumbling in toward the end, I demanded one of their free baseball caps, and sat to listen to the last few audience questions. One thing was very obvious: everyone in the room was excited about this product. If hardcore sysadmins are excited, you know this is something worthwhile.

Zenoss is very functional and full of features. It may even be possible to replace three separate pieces of software with this one product: host inventory database, Nagios, and your performance monitoring tool of choice. Maybe even Splunk some day. We can’t wait to see what features they will be adding next.


10 Comments on “Zenoss: We Can Ditch Nagios Now”

  1. Mark Hinkle said at 22:39 on February 14th, 2010:

    Thanks for the thorough write-up and the kind words, Charlie. I think the point you make about configuration is interesting, but depending on your level of monitoring you can pull all sorts of configuration data on a device, e.g. for a Linux server: CPU, memory, routing tables, software installed, etc. Just use your favorite SSH ZenPack, or do the same for routers once you install the MIBs. Great article and thanks again!

  2. charlie said at 23:17 on February 14th, 2010:

    Thanks Mark!

    I think I was trying to get at the need for Configuration Management (puppet) and Monitoring convergence. Total convergence/integration, I mean. Hmm, I should do an article about that topic, actually – I think I will.

    (also, this may all sound familiar.. it was originally published on http://enterprisenetworkingplanet.com and another blog.. I’m currently consolidating all past articles here)

  3. Mark Hinkle said at 15:19 on February 15th, 2010:

    I thought I recognized the article. It was nice to see the article reposted. Look forward to reading this blog more often.

  4. mb said at 14:15 on February 18th, 2010:

    Actually, many still cannot drop Nagios. Zenoss still lacks a very basic feature which is important to many and is usually the reason for choosing Nagios, and it is our own reason for still keeping Nagios. We evaluate Zenoss about every 6 months because we want to use it, but it’s still lacking the following feature:

    - Manual dependency mappings which do not require Python scripting for layer 2 devices and applications.

    There are multiple requests in the forums and on the trac site for this feature; why it has been ignored is beyond my comprehension :(

  5. charlie said at 14:22 on February 18th, 2010:

    I was hoping someone would mention that :)

    You specify “manual” so it seems you’re aware that dependencies do work, sort of, if you feed Zenoss your routing table(s).

    I agree, though. Most people yearn for dependency mappings. I could understand leaving this out if they provided a crazily robust and innovative new auto-discovery system — but that’s near impossible to discover. Human intervention is pretty necessary to define these relationships.

    I do, however, believe that most IT environments can live without this feature. Define better alerting groups. And ok when something blows up you get bombarded with TXT messages. Just make sure to pay for all your employees’ unlimited TXT plans ;) ..the benefits of Zenoss in my opinion make this worth dealing with.

  6. mb said at 14:41 on February 18th, 2010:

    Some can live with that, but not everyone is in a position to decide for themselves. We have to provide a pro/con list for Zenoss vs. Nagios+Cacti vs. Microsoft System Center Operations Manager, and currently management is leaning towards MS, which is something I would hate to see happen. As we have grown a lot, we often see things we miss with our current Nagios/Cacti implementation which Zenoss could solve, but dependency is quite important to management because of reports.

    The availability reports will be “wrong” and indicate that a lot of our equipment is faulty and not just unreachable, for example because of that one switch which lost network connectivity.

    Basically it boils down to: fix or lose customers, and that alone should make it a pretty high priority task for Zenoss.

    (Sorry if this sounds like a flame report, but I’m pretty frustrated about this, being the open-source lover that I am.)

  7. Matt Ray said at 15:14 on February 22nd, 2010:

    How are the availability reports “wrong”, and have you opened a ticket? The Administration Guide documents how they’re calculated, and we’ve had people make incorrect assumptions about how availability is calculated by Zenoss before. Note that it’s based on critical ping events when the device is in the Production state. So if your devices go into a Decommissioned maintenance window, your Availability is not affected. You can easily write a new Availability report that makes changes to the inputs, and we’d love to have alternative implementations shared in reporting ZenPacks.

  8. mb said at 15:22 on February 22nd, 2010:

    If there are no dependencies and no topology awareness in the system, the reports will be “wrong”. Basically what it lacks is a state called “unreachable” and logic to differentiate between down and unreachable, which is impossible without dependency/topology awareness. A report like that is not good enough in an enterprise where you have to present the report to non-technical people.

    Though nagios does not provide a report like this, it provides the data to compile such a report, and we have written a nagios csv report parser which gives us such reports.
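
The down/unreachable distinction described in this comment can be sketched in a few lines of Python. This is an illustration only, assuming each host has at most one manually defined parent and that the parent map contains no loops; the `classify` function and host names are hypothetical and not taken from Nagios or Zenoss.

```python
def classify(ping_ok, parents):
    """Label each host "up", "down" or "unreachable".

    ping_ok: dict of host -> bool (did the host answer the last ping?)
    parents: dict of host -> parent host, or None for a top-level host
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if ping_ok[host]:
            states[host] = "up"
        else:
            parent = parents.get(host)
            # A silent host behind a failed parent cannot be blamed:
            # the monitoring server simply cannot see it.
            if parent is not None and state_of(parent) != "up":
                states[host] = "unreachable"
            else:
                states[host] = "down"
        return states[host]

    for host in ping_ok:
        state_of(host)
    return states


ping_ok = {"router": True, "switch": False, "web1": False, "web2": False}
parents = {"router": None, "switch": "router", "web1": "switch", "web2": "switch"}
print(classify(ping_ok, parents))
```

With this logic, only the switch would count against availability in a report; the two web servers behind it are recorded as unreachable instead of faulty.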

  9. mb said at 15:27 on February 22nd, 2010:

    Whoops, forgot to reply to your other stuff. I have opened a ticket for the dependency stuff in your trac system, and am currently looking into the dev docs for ZenPacks, but our experience with Python is limited.

    And I wanted to thank you, if you work for Zenoss, that you actually replied :) Most software devs usually just ignore rants like my previous comments.

  10. mlist said at 02:18 on February 24th, 2010:

    I agree with mb. I’m using Zenoss with satisfaction (I used Nagios for 3 years), but in my opinion the big gaps in Zenoss are:
    a) missing dependencies and topology. Manual configuration is absolutely a MUST. In Nagios this concept is named “Parent & Child Relationship”
    b) With Zenoss you cannot distinguish between “hard state” and “soft state” and you cannot configure alerts with the flexibility of Nagios. With nagios you have these parameters:
    -max check attempts (example: 3)
    -retry check interval (example: 1 minute)
    In this way you can say:
    send alerts ONLY if the “problem” is present for 3 consecutive checks that, in this case, means 3 minutes.
    With zenoss you have the “count” option in the “alerting rule” thus you can prevent emails but not the generation of events.
    Implementing this in Zenoss would help to reduce “false positives”, especially in large environments.
    c) Min/Max threshold
    Did you try to configure 2 Threshold like these?
    Critical = 10
    Warning = 5
    When the critical threshold is breached, Zenoss will generate 2 events instead of just one. With Zenoss 2.5 you can prevent this bad behaviour by configuring the “warning threshold” in this way:
    min=5
    max=10
    The logic here is very confusing!
    d) “Service group” concept
    Zenoss should provide the ability to configure a group of services to monitor. Otherwise, in large environments with many domains and with many Windows servers NOT in a domain, it is quite frustrating to manually add or disable the Windows services that Zenoss will try to monitor.
    e) Service dependency
    This would help to reduce the alarms but, above all, it would help to understand the “root cause”. Only people who have used this Nagios feature can understand the power of this concept.

    That said, Zenoss is a good product that makes it possible to monitor both Windows and *nix servers without much difficulty. Moreover, the fact that Zenoss managers continuously ask users what they think or what they suggest is very, very positive. I think that if the Zenoss developers build these features, Zenoss will become the best monitoring solution!
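
The “hard state” versus “soft state” behaviour from point b) can be sketched as below. This is a simplified model of the idea described in the comment, not Nagios source code; `track_states` and `alert_indexes` are hypothetical helper names, and the retry interval is left out so that each list entry simply represents one check.

```python
def track_states(results, max_check_attempts=3):
    """Map a sequence of check results (True = OK) to OK/SOFT/HARD states."""
    states = []
    failures = 0
    for ok in results:
        if ok:
            failures = 0
            states.append("OK")
        else:
            failures += 1
            # The problem only becomes HARD after enough consecutive failures.
            states.append("HARD" if failures >= max_check_attempts else "SOFT")
    return states


def alert_indexes(states):
    """Return the check numbers at which a problem first turns HARD."""
    return [
        i for i, state in enumerate(states)
        if state == "HARD" and (i == 0 or states[i - 1] != "HARD")
    ]


checks = [True, False, False, True, False, False, False, False]
states = track_states(checks)
print(states)
print(alert_indexes(states))
```

In this run the first two failures recover before the third attempt, so they never leave the soft state and no alert goes out; only the later run of failures produces a single alert.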

