May 2, 2010

Zenoss

I've spent quite some time working with Zenoss, and looking at the shiny Zenoss homepage with all its praising catchphrases, I felt I had to offer some criticism here. Sadly, most of it is negative, which makes me think once more how poorly integrated monitoring still is today. I apologize for not pointing out the good things about Zenoss, but I'm sure you'll find plenty of that elsewhere on the net. The information here is based on Zenoss Enterprise version 2.4.

Appliance

Zenoss provides a stack installer that includes everything you need, from OpenSSL to MySQL. This saves you from dependency problems and version mismatches, but it has some odd consequences. Integrating Zenoss with the rest of your software can be a pain. It always wants to access or start its internal MySQL server, even though you may already have a perfectly good MySQL cluster where you would rather put the Zenoss event data. Also, you have to su to the zenoss user whenever you work on Zenoss, otherwise you don't have the right profile for all the executables and libraries. When debugging, you have to take care to always test with the Zenoss versions of the programs; in some cases they do not exactly match the behavior of the versions included in your system.

The stack installer does its job but lacks real integration into the system. Upgrades and uninstallation clearly need a lot of manual work that could be avoided with a real package such as an RPM. The fact that Zenoss still doesn't put any effort into proper packaging raises the question of whether Zenoss' design is the fundamental obstacle to proper system integration, and if not, why they persist with this strategy.

Distribution

Zenoss supports distributed collectors, so you can split the load over multiple servers. But this involves a lot of network setup. It demands multiple open ports for connections in different directions, and you have to encrypt the traffic on your own with SSL. This is especially a problem in large and insecure networks, which are often exactly the environments where you need this type of monitoring. Unfortunately, computation load is not the only thing that is distributed in Zenoss. Data is split over three different stores: the configuration is kept in ZEO (a Zope Object Database), events go to a MySQL database, and performance data is stored in RRD files on the filesystem. Only the first two are centralized, and they need to be accessible by every collector. The RRD files are distributed (yes, again!) over all collectors, with each collector holding the files for the hosts it monitors. And of course, if you want to see nice graphs in the web interface, the server that serves the web interface needs a way to access the RRD files on all collectors.

This architecture not only makes Zenoss very hard to set up, but also requires a lot of work when you migrate something or hunt for errors.

Memory

Zenoss starts a number of different daemons for a collector. Each one runs in its own Python VM, so at startup it can easily consume 50-100MB of memory without doing any work yet. Additionally, some processes tend to consume much more memory under load; 300-500MB for a single process is no exception. Finally, Zenoss seems to be quite prone to memory leaks. I often caught processes using up to 2GB of memory, and after a restart they were happy with a quarter of that before leaking again after another week of duty. Be prepared for a restart of the Zenoss stack to take quite some time and to result in some alert events or gaps in the graphs. The choice of a dynamic, interpreted language really shows some of its downsides here.
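To catch leaking daemons before they reach the 2GB range, a small watchdog can compare each process's resident memory against a threshold. The following is only a sketch of that idea, not something Zenoss ships: the function names `rss_mib` and `over_limit` are made up for illustration, and it assumes a Linux host where `/proc/<pid>/status` contains a `VmRSS` line in kB.

```python
import re


def rss_mib(status_text):
    """Return the resident set size in MiB parsed from /proc/<pid>/status text.

    Returns None if there is no VmRSS line (e.g. for kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)) / 1024.0


def over_limit(status_text, limit_mib):
    """True if the process described by status_text uses more than limit_mib."""
    rss = rss_mib(status_text)
    return rss is not None and rss > limit_mib


if __name__ == "__main__":
    # Hypothetical usage: check the current process against a 500 MiB limit.
    with open("/proc/self/status") as handle:
        text = handle.read()
    print(rss_mib(text), over_limit(text, 500))
```

The 500 MiB limit is an assumption taken from the upper end of the "normal under load" range mentioned above; what you do when a daemon crosses it (alert, or restart during a quiet period) is up to you.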

Extensibility

This is one of Zenoss' strengths. Unfortunately, the ways in which you can extend it are diverse and can lead to a bit of chaos. ZenPacks give you so much freedom that you can pack any file or database object into them. But there is absolutely no check on whether the installed data still matches what was in the ZenPack. It also gets very difficult to trace which things were installed by which package. Installing, upgrading and versioning are completely up to you. And of course you need to install the ZenPacks on each distributed collector individually, or sync them from the web interface, which copies files over SSH and restarts everything, but you still have to do it for each collector.

There are a lot of methods to gather data in Zenoss; it even has its own syslog daemon you can stream to. But the integration of these methods lacks attention to detail in the implementation. SSH commands very often report an error even when the check did not really fail and the command merely timed out. This happens because there is no clean distinction between a failed state and an unknown state, as there is in Nagios. I have also seen hundreds of OIDs reported as errors just because there was a problem reading data from an SNMP agent.
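To illustrate the distinction, here is a minimal sketch of a check runner that follows the Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). It is not how Zenoss works internally; `run_check` is a hypothetical helper, and the point is only that a timeout or a command that could not be started is mapped to UNKNOWN instead of being treated as a failure.

```python
import subprocess
import sys

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(argv, timeout_s):
    """Run a check command and return a Nagios-style state.

    A timeout or a command that cannot be started says nothing about the
    monitored service, so both map to UNKNOWN rather than CRITICAL.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN
    except OSError:
        return UNKNOWN
    if proc.returncode in (OK, WARNING, CRITICAL):
        return proc.returncode
    # Any other exit code is outside the convention.
    return UNKNOWN


if __name__ == "__main__":
    # Hypothetical check that sleeps longer than the allowed timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_check(slow, 0.5))
```

With a separate UNKNOWN state, a flaky SSH connection or an unreachable SNMP agent could raise one "cannot check" event instead of flagging every dependent data point as an error.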

Behind the scenes

Rather than making you define every item you want to monitor, Zenoss brings a lot of mechanisms that gather data automatically, and it finds out many things all by itself. The consequence can be bad performance, until you notice that periodic port scans are not such a good idea in a network where the majority of ports are silently blocked by default. The concept of modeling devices, so that you don't have to gather certain data every time you do a check, is in some cases overused. For example, the size of a filesystem is modeled, but its usage is not. So if you grow a filesystem you can get an error such as 110% full, or -10% free, until the device is remodeled some hours later and the filesystem size is updated to the new value.
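The arithmetic behind the 110% figure is easy to reproduce. The numbers below are invented for illustration (a 100 GiB filesystem grown to 200 GiB and then filled to 110 GiB), and `percent_used` is a hypothetical helper, not a Zenoss function: the usage is measured live, but it is divided by the stale size from the last modeling run.

```python
GIB = 1024 ** 3


def percent_used(used_bytes, total_bytes):
    """Percentage of the filesystem in use, given a used and a total size."""
    return 100.0 * used_bytes / total_bytes


# Assumed example values, not taken from a real system.
modeled_total = 100 * GIB  # size recorded at the last modeling run
actual_total = 200 * GIB   # size after the filesystem was grown
used = 110 * GIB           # current usage, collected on every check

stale_used = percent_used(used, modeled_total)  # computed against stale size
fresh_used = percent_used(used, actual_total)   # computed against real size
stale_free = 100.0 - stale_used

if __name__ == "__main__":
    print(stale_used, stale_free, fresh_used)
```

Against the stale size the filesystem appears 110% full (that is, -10% free), while against the real size it is only 55% full. Collecting size and usage in the same check would avoid the inconsistency.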

Security

A monitoring server can easily become a security concern. It is a central device that can access data from all other devices. While the risk may be acceptable for certain monitoring methods, there are others, like SSH, where shell access leaves a lot of room for possible exploits and is difficult to restrict. The most important thing here is to handle stored passwords and keys securely. But in Zenoss I found various ways to retrieve passwords as a non-root user and also as a non-admin Zenoss user. One way is to execute predefined commands on devices, which gives you, for example, the clear-text SNMP password in the output log. Another way is to browse the Zope management interface behind Zenoss, which gives you low-level access to the data. A lot of this is also an issue of access management.

As already mentioned, other weaknesses include Zenoss' need for the root SSH password (although this seems to have changed in newer versions). It is needed to automatically update software and distribute configuration to all collectors. Another problem is operation in untrusted networks. Most daemons make unencrypted connections, leaving it up to you to wrap everything in SSL tunnels.
