Zenoss: We Can Ditch Nagios Now

Posted: February 14th, 2010 | Filed under: IT Management, Linux / Unix, Networking | 24 Comments

Another perfect example of open source software gone commercial is Zenoss. As a full-featured network and service monitoring solution, Zenoss is one of the best monitoring tools available.

Most importantly, Zenoss combines two functions. First and foremost, an enterprise environment requires host and service monitoring, with notifications. Network monitoring really means checking services, checking that hosts are up (they ping), and possibly writing your own plugins to check various other aspects of a server or network device. Until now, Nagios has filled that role.

Second, once a decent monitoring solution is in place, getting time-based information becomes desirable. Memory and CPU usage is the most prevalent example: if you’re checking available swap space every so often with Nagios, you may know when you start running low. But it may be just as important to see a graph of the last week’s usage. Tools like Cacti or Munin, which collect data frequently and use RRD graphs to display it, are very useful.

Zenoss fills both roles, without the annoying shortcomings prevalent in the alternative solutions. Zenoss uses the terms Availability Monitoring and Performance Monitoring to describe these two fundamental roles.

Performance of monitoring tools is important, and often overlooked until it becomes a debilitating problem. For example, if you want to chart pretty RRD graphs of system statistics like available RAM or disk space, Munin is an option. Unfortunately it’s all Perl, and designed in a way that prevents it from scaling to even a moderate number of hosts. Cacti is a bit better, but monitoring close to 100 hosts is painful with either option. Along comes Zenoss.

Zenoss is written in Python, and uses a MySQL backend for storage, and by all accounts it appears to perform very well. The really great thing about corporate-backed open source is quality control. The community simply isn’t responsible enough to say, “No, this won’t work, re-implement it.” A company with QA is.

Speaking of features, Zenoss isn’t missing many. Flexibility seems to be top priority–it can monitor hosts with SNMP, Nagios agents, SSH, Windows WMI, and various other mechanisms. Some of the claimed features are a bit inflated, such as ZenPing (marketed as Network Topology Monitoring), but the feature set is rich nonetheless.

Zenoss’s primary functions involve four features:

  • Inventory Tracking
  • Availability Monitoring
  • Performance Monitoring
  • Event Monitoring and Management

Inventory tracking claims some sort of “configuration” reporting as well, but it seems very limited. Zenoss will discover your inventory and auto-populate a database. This is great for knowing which IP addresses are in use, for example, but means that “configuration” reporting is limited to an outside observer’s perspective. It can tell you which servers have a Web server running, but it certainly doesn’t deal with the configuration of the Web server. Of course, inventory tracking isn’t limited to automatically discovered information; there are manual input capabilities too.

Availability monitoring is basically Nagios, plus. It can ping, it can monitor Windows machines, and it can pretty much do whatever you need. Even your old Nagios plugins will work with Zenoss. It also generates reports, and much better ones than Nagios is capable of.
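
The Nagios plugin convention is simple enough to sketch: a plugin prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The example below is a minimal illustration in Python, not something shipped with Nagios or Zenoss; the function names and the 80/90 percent thresholds are made up for the example.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an (exit_code, status_line) pair.

    The warn/crit defaults are illustrative assumptions.
    """
    if percent_used >= crit:
        return CRITICAL, "DISK CRITICAL - %.1f%% used" % percent_used
    if percent_used >= warn:
        return WARNING, "DISK WARNING - %.1f%% used" % percent_used
    return OK, "DISK OK - %.1f%% used" % percent_used


def check_disk(path="/"):
    """Measure usage of `path`, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print("DISK UNKNOWN - %s" % exc)
        return UNKNOWN
    code, line = evaluate(100.0 * usage.used / usage.total)
    print(line)
    return code
```

A real plugin would finish with `sys.exit(check_disk())` so that the monitoring system can read the exit status.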

Host monitoring, performance monitoring, or whatever you’d like to call it, is quite robust in Zenoss. Some would think it’s light on features, but there’s a good reason that Zenoss requires you to use SNMP: it’s much more scalable than SSH’ing to each server every minute. A bit of up-front configuration is required, in that all your hosts will need SNMP configured and working, but it’s completely worth it. Zenoss too uses RRD graphs, and it can generate events and alerts based on pre-defined thresholds.

Finally we come to event monitoring. Zenoss is also encroaching on Splunk’s territory a bit. It can combine syslog, availability monitoring alerts, SNMP traps, and even Windows event log data. Much like Splunk, Zenoss correlates similar events for easier viewing and troubleshooting. This is the portion that processes all events and generates alerts to pagers or e-mail, taking into account the escalation procedure you’ve defined.
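
To make “correlates similar events” a little more concrete, here is a rough sketch of the general idea: repeated events that share the same identifying fields collapse into one row with a running count. This is not Zenoss’s actual implementation; the `Event` fields and the choice of key are assumptions made for illustration.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical event fields, chosen only for this example.
    device: str
    component: str
    summary: str
    severity: int  # assumed scale: higher means worse


def correlate(events):
    """Collapse repeated events into one row each, with a count."""
    rows = OrderedDict()
    for ev in events:
        # Events with the same key are treated as duplicates.
        key = (ev.device, ev.component, ev.summary, ev.severity)
        if key in rows:
            rows[key]["count"] += 1
        else:
            rows[key] = {"event": ev, "count": 1}
    return list(rows.values())
```

Feeding three identical “connection refused” events for one Web server plus one unrelated database event yields two rows, the first with a count of 3, which is far easier to read than four separate lines.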

To top it all off, the Zenoss Web interface is top-notch. It includes a customizable “dashboard” for monitoring, and everything is AJAX-enabled, which provides a user experience similar to Splunk and Google’s Gmail.

Marketing fluff aside, Zenoss really does provide a wonderful product. It is, of course, open source and available for free.

At last year’s LISA conference, Zenoss gave a demonstration that sadly coincided with free beer time. Stumbling in toward the end, I demanded one of their free baseball caps, and sat to listen to the last few audience questions. One thing was very obvious: everyone in the room was excited about this product. If hardcore sysadmins are excited, you know this is something worthwhile.

Zenoss is very functional and full of features. It may even be possible to replace three separate pieces of software with this one product: host inventory database, Nagios, and your performance monitoring tool of choice. Maybe even Splunk some day. We can’t wait to see what features they will be adding next.


24 Comments on “Zenoss: We Can Ditch Nagios Now”

  1. 1 Mark Hinkle said at 22:39 on February 14th, 2010:

    Thanks for the thorough write-up and the kind words, Charlie. I think the point you make about configuration is interesting, but depending on your level of monitoring you can pull all sorts of configuration data on a device, e.g. for a Linux server: CPU, memory, routing tables, software installed, etc. Just use your favorite SSH ZenPack, or do the same for routers when you install the MIBs. Great article and thanks again!

  2. 2 charlie said at 23:17 on February 14th, 2010:

    Thanks Mark!

    I think I was trying to get at the need for Configuration Management (puppet) and Monitoring convergence. Total convergence/integration, I mean. Hmm, I should do an article about that topic, actually – I think I will.

    (also, this may all sound familiar.. it was originally published on http://enterprisenetworkingplanet.com and another blog.. I’m currently consolidating all past articles here)

  3. 3 Mark Hinkle said at 15:19 on February 15th, 2010:

    I thought I recognized the article. It was nice to see the article reposted. Look forward to reading this blog more often.

  4. 4 mb said at 14:15 on February 18th, 2010:

    Actually, many still cannot drop Nagios. Zenoss still lacks a very basic feature that is important to many, and that is usually the reason for choosing Nagios. It is also our own reason for keeping Nagios. We evaluate Zenoss about every 6 months because we want to use it, but it’s still lacking the following feature:

    – Manual dependency mappings that do not require Python scripting for layer 2 devices and applications.

    There are multiple requests in the forums and on the trac site for this feature; why it has been ignored is beyond my comprehension 🙁

  5. 5 charlie said at 14:22 on February 18th, 2010:

    I was hoping someone would mention that 🙂

    You specify “manual” so it seems you’re aware that dependencies do work, sort of, if you feed Zenoss your routing table(s).

    I agree, though. Most people yearn for dependency mappings. I could understand leaving this out if they provided a crazily robust and innovative new auto-discovery system — but that’s near impossible to discover. Human intervention is pretty necessary to define these relationships.

    I do, however, believe that most IT environments can live without this feature. Define better alerting groups. And ok when something blows up you get bombarded with TXT messages. Just make sure to pay for all your employees’ unlimited TXT plans 😉 ..the benefits of Zenoss in my opinion make this worth dealing with.

  6. 6 mb said at 14:41 on February 18th, 2010:

    Some can live with that, but not everyone is in a position to decide for themselves. We have to provide a pro/con list of Zenoss vs. Nagios+Cacti vs. Microsoft System Center Operations Manager, and management is currently leaning towards MS, which is something I would hate to see happen. As we have grown a lot, we often see things we miss with our current Nagios/Cacti implementation that Zenoss could solve, but dependencies are quite important to management because of reports.

    The availability reports will be “wrong” and indicate that a lot of our equipment is faulty and not just unreachable, for example because of that one switch which lost network connectivity.

    Basically it drills down to: fix or lose customers, and that alone should make it a pretty high priority task for zenoss.

    (sorry if this sounds like a flame report, but im pretty frustrated about this being the open-source lover that i am)

  7. 7 Matt Ray said at 15:14 on February 22nd, 2010:

    How are the availability reports “wrong”, and have you opened a ticket? The Administration Guide documents how they’re calculated, and we’ve had people make incorrect assumptions about how Zenoss calculates them before. Note that it’s based on critical ping events when the device is in the Production maintenance state. So if your devices go into a Decommissioned maintenance window, your Availability is not affected. You can easily write a new Availability report that makes changes to the inputs, and we’d love to have alternative implementations shared in reporting ZenPacks.

  8. 8 mb said at 15:22 on February 22nd, 2010:

    If there are no dependencies and no topology awareness in the system, the reports will be “wrong”. Basically, what it lacks is a state called “unreachable” and the logic to differentiate between down and unreachable, which is impossible without dependency/topology awareness. A report like that is not good enough in an enterprise where you have to present the report to non-technical people.

    Though nagios does not provide a report like this, it provides the data to compile such a report, and we have written a nagios csv report parser which gives us such reports.
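
The down/unreachable distinction discussed in this thread can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how Nagios or Zenoss implements it: each host has at most one parent, and the `parents` and `ping_ok` dictionaries are hypothetical inputs that a real system would get from its topology data and its ping checks.

```python
def classify(host, parents, ping_ok):
    """Return 'up', 'down', or 'unreachable' for `host`.

    parents: dict mapping each host to its parent (None for the root).
    ping_ok: dict mapping each host to True if it answered a ping.
    """
    if ping_ok[host]:
        return "up"
    # Walk up the dependency chain. If any ancestor is not answering,
    # we cannot tell whether this host itself is faulty.
    parent = parents.get(host)
    while parent is not None:
        if not ping_ok[parent]:
            return "unreachable"
        parent = parents.get(parent)
    # Every ancestor answers, so the fault is at this host.
    return "down"
```

With a topology of `core` → `switch1` → `web1`, a dead `switch1` is reported as down while `web1` behind it is reported as unreachable, so an availability report would blame only the switch.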

  9. 9 mb said at 15:27 on February 22nd, 2010:

    Whoops, forgot to reply to your other points. I have opened a ticket about the dependency issue in your trac system, and I am currently looking into the dev docs for ZenPacks, but our experience with Python is limited.

    And i wanted to thank you if you work for zenoss, that you actually replied 🙂 Most software devs usually just ignore rants like my previous comments.

  10. 10 mlist said at 02:18 on February 24th, 2010:

    I agree with mb. I’m using zenoss with satisfaction (I used Nagios for 3 years) but in my opinion the big lack of Zenoss is:
    a) missing dependencies and topology. Manual configuration is absolutely a MUST. In Nagios this concept is named “Parent & Child Relationship”
    b) With Zenoss you cannot distinguish between “hard state” and “soft state” and you cannot configure alerts with the flexibility of Nagios. With nagios you have these parameters:
    -max check attempts (example: 3)
    -retry check interval (example: 1 minute)
    In this way you can say:
    send alerts ONLY if the “problem” is present for 3 consecutive checks that, in this case, means 3 minutes.
    With zenoss you have the “count” option in the “alerting rule” thus you can prevent emails but not the generation of events.
    This implementation in Zenoss would help to reduce the “false positive” above all in large environment.
    c) Min/Max threshold
    Did you try to configure 2 Threshold like these?
    Critical = 10
    Warning = 5
    When critical threshold is breached, Zenoss will generate 2 events instead just one. With Zenoss 2.5 you can prevent this bad behaviour configuring the “warning threshold” in this way:
    min=5
    max=10
    The logic here is very confusing!
    -“Service group” concept
    Zenoss should provide the ability to configure a group of services to monitor; otherwise, in large environments with many domains and many Windows servers NOT in a domain, it is quite frustrating to manually add or disable the Windows services that Zenoss will try to monitor.
    -Service dependency
    This would help to reduce the alarms but, above all, it would help to understand the “root cause”. Only people who have used this Nagios feature can understand the power of this concept.

    That said, Zenoss is a good product that makes it possible to monitor both Windows and *nix servers without much difficulty. Moreover, the fact that Zenoss managers continuously ask users what they think or suggest is very positive. I think that if the Zenoss developers build these features, Zenoss will become the best monitoring solution!
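
The “max check attempts” rule from item b) in the comment above can be sketched as a small state machine: failures are counted as soft until the limit is reached, and only then does the problem become hard and trigger an alert. This is a simplified model for illustration, not Nagios source code; the class and method names are invented, and recovery notifications are not modeled.

```python
class CheckState:
    """Track consecutive failures for one service check."""

    def __init__(self, max_check_attempts=3):
        self.max_check_attempts = max_check_attempts
        self.failures = 0       # consecutive failed checks so far
        self.hard_down = False  # True once the problem has been alerted

    def record(self, ok):
        """Feed one check result; return True when an alert should be sent."""
        if ok:
            # Any success resets the counter and clears the hard state.
            self.failures = 0
            self.hard_down = False
            return False
        self.failures += 1
        if self.failures >= self.max_check_attempts and not self.hard_down:
            # Soft state becomes hard: alert exactly once.
            self.hard_down = True
            return True
        return False
```

With `max_check_attempts=3` and a one-minute retry interval, a service has to fail three checks in a row, roughly three minutes, before anyone is paged; a single blip followed by a successful check never alerts.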

  11. 11 Michael Hendrickx said at 09:41 on May 14th, 2010:

    I agree with the above. Zenoss is a very good product, but is lacking the flexibility that Nagios provides.
    While it is not there yet, I feel that Zenoss is on its way to overtaking Nagios, though.

  12. 12 Benny Chitambira said at 03:46 on May 17th, 2010:

    @Mlist

    I specifically refer to your post about the 3 cons that you listed about Zenoss. I want to point out that all of your cons (together with mb’s) are actually false.
    I would excuse the layer 2 dependency one however, but for the other two, we cannot excuse(confuse) lack of understanding of the product for lack of features.

    Generation of events is a background process, and that’s similar even with Nagios. What you require is to filter out events from your event view (similar to the filtering that you mentioned for Nagios), so if you don’t want to see events in your event view/dashboard when the count is less than 2, you can specify that in the filters.

    Moreso, there is the event transform, you can drop any event that meet a certain criteria and it will not be available in the status or history table.

    The other issue about thresholds is also false. Did you try range thresholds? this is available prior 2.5 as a zenpack, so there is still a solution for that.

    About dependencies, I take it that Nagios configuration itself is as taxing as a simple Python statement. So there is no excuse for those who do not want to copy-paste-edit a few Python lines to take care of dependencies. There is a lot of info in the FAQs and forum on how one can implement manual device dependencies.

    Its only that most users, after working with nagios for years and mastering it, fail to realise that they also need experience with any new system. We cant expect to just switch to zenoss and find every little trick we had in nagios to be automagically done in the same way/manner.

    i find most issues or sceptism come from seasoned nagios admins, (new horses are easier to teach new tricks …lol)

  13. 13 mlist said at 06:37 on May 17th, 2010:

    Hi Benny

    Honestly, I’m a little bit surprised by your reply, because when someone speaks about technical aspects, they should avoid comments based on personal preference.

    you say:
    1) “Generation of events…..is less than 2, you can specify that in the filters”

    Interesting is that I just found YOUR comment at http://community.zenoss.org/message/48705#48705 in which YOU say:
    —–
    you cannot use event count in a transform im afraid. count only gets updated when after transformation, so its not available before that
    —–

    So..what have I misunderstood?
    I simply said that you can prevent the email generation but NOT the event generation. And..please note that I’m speaking about the real generation of an event and not just of the filtering using the event console.
    This is quite a big limit in fact I have some problems. Chett Luther through the Enterprise support replied to me that he agree with me and told to me that Zenoss has already aware of this problem and will implement a kind of “quarantine” in next release.
    I’m a dummy user but…Chett Luther is a Zenoss guru. Thus…what are we speaking about?

    2) You say:
    “Moreso, there is the event transform…history table”

    I never said that this is not possible. I have lots of transforms that change severity, send to history, or drop.

    3) You say:
    “The other issue about….solution for that”

    I’m an Enterprise customer, and what I said has been confirmed by Chett Luther (whom you surely know) through Enterprise support, so I don’t understand your comment. The solution is not a ZenPack but the workaround I described (which, again, was provided by Chett). The point is that this is confusing, and Zenoss has already confirmed that it will be fixed in the future as a built-in function and not as a separate ZenPack. Isn’t it logical that this should be part of the base?

    4) You say:
    “About dependencies…device dependencies”.
    Zenoss confirmed to me (and there is a thread in the forum also…) that this is one of the most requested features. So…am I the only one who thinks that it is not possible to manage hundreds of dependencies by writing Python code? In my opinion you are not impartial due to your Zenoss passion.

    5) You say:
    “Its only that most users…way/manner”

    This is partly true, but it isn’t my case. I used Nagios for 3 years and have now been using Zenoss as my main NMS for 1 year and…although I must admit that my knowledge of it is less than my knowledge of Nagios, I think I can say with humility that I know Zenoss quite well now. I never thought that the switch to Zenoss would happen magically.

    To summarize:
    I think that when you compare two products, you should be as objective as possible, leave aside personal preferences, and rely on reliable data. My statements have been analyzed by the Zenoss company as well. Moreover, I have followed this topic with passion for many years and have found confirmation from many experts, perhaps a little more impartial than you.

  14. 14 Benny Chitambira said at 07:41 on May 17th, 2010:

    @Mlist
    True that I may be biased on my arguments (but then thats very common of most sys admins)
    While my first statement was addressing you, the rest of my argument was rather general, as I was trying to discourage users from dismissing products before they thoroughly evaluate them.

    I appreciate the fact that the environments that we work with are vastly different and their requirements will similarly differ.

    Clearly you are one of the many that really need layer 2 dependencies. I had already indicated my reservation for this one when I said I can excuse dependencies issue.

    My push is for zenoss core to come out strong on the strength of the community. Already there is an indication in that direction thats why i have a strong passion for it.

    Believe me, I have come across hard-core Nagios admins who think Zenoss is a non-starter and a no-brainer, so I thought your post was leaning towards that…..

  15. 15 marco said at 08:32 on May 17th, 2010:

    Benny

    Believe me…I like Zenoss in fact I purchased the Enterprise version and I’m migrating from Nagios and Hp Open View.
    Zenoss is a good product with more and more features compared to Nagios.
    That said, I simply pointed out some limits or aspects that should be improved and…in my opinion, Zenoss is heading in the right direction, and maybe in a couple of years we won’t need to discuss whether Zenoss is better than Nagios. Zenoss will probably be the killer application.
    But I think that constructive criticism is usually appreciated by people who seriously want to help improve products.
    That said, I must admit that it is human and normal to have some preference.
    Anyway, in the end I think that discussion is (always) a positive thing!

    bye
    Marco

  16. 16 Guest123 said at 16:51 on June 13th, 2010:

    I want to try some tool but still confused after googling this much.

  17. 17 Gust456 said at 15:07 on January 11th, 2011:

    go back to sleep then, thats what I did!

  18. 18 blake said at 13:26 on January 14th, 2011:

    Just for the record, in the post you say that Zenoss uses MySQL for storage but this is not 100% true. It only uses MySQL to store events. Performance data is stored in RRD files, and all other data (the bulk of Zenoss) is stored in ZODB.

  19. 19 Jack said at 18:09 on April 15th, 2011:

    Id have to say after playing around with free monitoring tools for the past 10 years I always end up back at zenoss. Though its class logic, odd gui and bugs have given me the shits over the years it does do the fundamental job of monitoring a systems nic, cpu,disk and mem along with processes/services well. Getting WMI working can be a pain in the rear on some windows boxes but being able to use the windows built in snmp is nice. I just hope that it stays free and will forever allow me to stay 1 step ahead of those crazy users coming running to tell me the internet is down and zenoss having already told me the proxy server is having a heart attack…CLEAR

  20. 20 Janderson said at 22:23 on April 18th, 2011:

    I still prefer nagios… I’m very satisfied of using nagios with centreon administration and reports for two years with 200 hosts and 800 services without spend nothing. The nagios checker plugin for firefox and nagroid for those with mobile are excellent. For those that wants better visual NagVis can do the job.

  21. 21 charlie said at 12:28 on October 2nd, 2012:

    Update: I’m now anti-nagios/zenoss/etc altogether. They just don’t scale, and are too monolithic. My latest blog post explains why I’ve changed my mind 🙂

    http://www.longitudetech.com/devops/starting-out-a-new-approach-to-systems-monitoring/

  22. 22 hobbit said at 08:03 on October 29th, 2012:

    it s like zenoss is a revolution because it combines monitoring (ping, mem disk etc) AND graphs.

    But guys, take a look at Xymon. it already does this kind of stuff for ages.

    http://www.xymon.com

    It is really scalable and works fine.

  23. 23 Travis Runyard said at 22:59 on November 19th, 2012:

    Haven’t evaluated Zenoss yet, but will be soon to monitor internal network, branch office and Amazon cloud environments. I just wanted to throw my comment out there about the repeated posts about Nagios lacking graphs. True, it does, but that can be handled by a plugin called Pnp4Nagios, which also uses RRD graphing. I am also monitoring WAN/LAN interfaces for bandwidth usage with MRTG, which is all collected by Nagios. That’s great news that all the old plugins will work in Zenoss. I will be using many of them for sure, but I am not looking forward to the learning curve on a new NMS ;-o

  24. 24 No said at 23:03 on November 19th, 2012:

    Everyone should be using graphite for graphing.

