May 12, 2010

Checking Extfs Online

For years extfs has been a reliable, stable and well performing filesystem. But there has always been a weak point in its design for operation on long-running servers with lots of data. It manifests as the mount count, or more precisely, the inability to check the filesystem online.

This is a real problem for servers with high uptime demands. The usual way is to reboot the machine (in case the filesystem is vital for system operation) or to remount the filesystem from time to time, and when the mount count reaches its limit the filesystem is checked. But the reboots and remounts not only interrupt the server's operation and increase the service downtime. The filesystem check must also complete before the filesystem can be mounted again. With terabytes of data a check can take multiple hours or even days, depending on your disk layout. And if this catches you at the wrong moment, e.g. when the server crashed and you need it back online fast, it can be a real pain. The alternative is to disable this behavior by setting the maximum mount count to 0. In that case no automatic checks are done, but which honest system administrator would claim to be able to regularly check all filesystems on all servers by hand?
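
For reference, roughly how the check policy can be inspected and disabled with tune2fs; the device name is only a placeholder:

    # show the current maximum mount count and check interval
    tune2fs -l /dev/sda1 | grep -Ei 'mount count|check'

    # disable both the mount-count-based and the time-based check
    tune2fs -c 0 -i 0 /dev/sda1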

With the recent growing use of ext4 as the standard filesystem in Linux distributions I asked myself if this shortcoming still exists today. Luckily things have changed for the better, and not just with the introduction of ext4, but already with the inclusion of the VFS lock patches in the 2.6 series of kernels. Ext4 still cannot do it on its own, but with the help of LVM there is a way. What is a simple fsck_ufs -B for the BSD user, and where the ZFS user only smiles at you and asks "What, you only check once a month? I do it all the time, it is called checksums.", is not that simple for the Linux user. The trick is to put the extfs on top of LVM and use snapshots. The extfs and LVM code in the kernel need to play together, so that when you take a snapshot everything is properly locked and consistent. Then you can run fsck on the snapshot while the original filesystem continues its operation. If fsck reports success you simply discard the snapshot again and know everything is fine.
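
A minimal sketch of the procedure, assuming a volume group vg0 with a logical volume data holding the ext filesystem (the names and the snapshot size are placeholders):

    # take a snapshot; the filesystem is frozen briefly so the snapshot is consistent
    lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data

    # check the snapshot read-only while the original filesystem stays mounted
    e2fsck -f -n /dev/vg0/data-snap

    # if the check came back clean, throw the snapshot away
    lvremove -f /dev/vg0/data-snap

If e2fsck complains, the snapshot only tells you that something is wrong; the actual repair still needs an offline check of the origin.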

The ugly thing is that LVM snapshots are a bit fragile. When you create the snapshot you must specify its size: the amount of space reserved to record the changes that happen after you made the snapshot. You have to estimate it properly, depending on how long you will need the snapshot and how much data will be modified in the meantime. This is not an easy task, even for a sysadmin. If the snapshot runs out of space you risk losing data.
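
One way to reduce the risk, using the same placeholder names as above, is to watch how full the snapshot gets and grow it before it overflows:

    # the Data% column shows how much of the snapshot space is already used
    lvs /dev/vg0/data-snap

    # grow the snapshot if it gets close to 100%
    lvextend --size +5G /dev/vg0/data-snap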

So, it is possible, but honestly not a very solid solution. I have never seen a distribution that implements this by default, even though it seems to me to be a rather essential task. So, how do you manage to check all the ext filesystems on your enterprise Linux servers that have hundreds of days of uptime ...?

May 2, 2010

Zenoss

I've spent quite some time now working with Zenoss, and looking at the shiny Zenoss homepage with all its praising catch phrases I felt I had to offer some criticism here. Sadly, most of my criticism is negative, which makes me think once more how poorly integrated monitoring still is today. I apologize for not pointing out the good things about Zenoss, but I'm sure you'll find plenty of that elsewhere on the net. The information here is based on Zenoss Enterprise version 2.4.

Appliance

Zenoss provides a stack installer that includes everything you need, from OpenSSL to MySQL. This saves you from dependency problems and version mismatches, but has some odd consequences. Integrating Zenoss with the rest of your software can be a pain. It always wants to access or start its internal MySQL server, but you may already have your perfect MySQL cluster where you want to put the Zenoss event data. Also, you have to su to the zenoss user whenever you do some work on Zenoss, otherwise you don't have the right environment for all the executables and libraries. When debugging you have to take care to always test with the Zenoss versions of the programs; in some cases they do not exactly match the behavior of the versions included in your system.
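
A quick sanity check, assuming the stack installer prepends its own bin directory to the zenoss user's PATH:

    # switch to the zenoss user and its environment
    su - zenoss

    # make sure you are really using the stack's binaries, not the system ones
    which python mysql openssl
    python -V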

The stack installer does its work, but lacks real integration into the system. Upgrades and de-installation clearly need a lot of manual work that could be avoided with a real package like an RPM. The fact that Zenoss still puts no effort into proper packaging today raises the question whether Zenoss' design is itself the fundamental obstacle to proper system integration, and if not, why they insist so much on this strategy.

Distribution

Zenoss supports distributed collectors, so you can split the load over multiple servers. But it involves a lot of network setup. It demands multiple open ports for connections in different directions, and encrypting the traffic has to be done on one's own with SSL. This is especially a problem in large and insecure networks, which is often exactly the kind of environment where you need this type of monitoring.

Unfortunately, computation load is not the only thing that is distributed in Zenoss. Data is split over three different databases. The configuration is stored in ZEO (a Zope Object Database), events go to a MySQL database, and performance data is stored in RRD files on the filesystem. Only the first two are centralized and need to be accessible by every collector. The RRD files are distributed (yes, again!) over all collectors, each one holding the files for the hosts it monitors. And of course if you want to see nice graphs in the web interface you need to access all collectors, more specifically you need a way to access the RRD files on all collectors from the server that serves the web interface.
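
One way to handle the encryption is to wrap the collector connections in SSH tunnels; a rough sketch, where zenoss-hub is a placeholder hostname and the port numbers are only examples for the hub and ZEO connections (check which ports your installation actually uses):

    # forward the hub and ZEO ports from the collector to the central server
    ssh -N -f \
        -L 8789:localhost:8789 \
        -L 8100:localhost:8100 \
        zenoss@zenoss-hub

    # the collector daemons are then pointed at localhost instead of the hub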

This architecture not only makes Zenoss very hard to set up, but also requires a lot of work whenever you migrate something or are looking for errors.

Memory

Zenoss starts various different daemons for a collector. Each one runs in its own Python VM, so at startup it can easily consume 50-100MB of memory without doing any work yet. Additionally, some processes tend to consume much more memory under load; 300-500MB for a single process is no exception. Finally, Zenoss seems to be quite friendly with memory leaks. I often caught processes using up to 2GB of memory, and after a restart they were happy with a fourth of that, before leaking again after another week of duty. Be prepared that restarting the Zenoss stack can take quite some time and result in some alert events or gaps in the graphs. The choice of a dynamic and interpreted language really shows some of its downsides here.
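
A simple way to keep an eye on this is to track the resident memory of everything running as the zenoss user; nothing Zenoss-specific about it:

    # resident memory per daemon, biggest first
    ps -u zenoss -o rss=,args= | sort -rn | head

    # total resident memory of the whole stack, in MB
    ps -u zenoss -o rss= | awk '{sum += $1} END {print sum / 1024 " MB"}'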

Extensibility

This is one of Zenoss' strengths. Unfortunately, the ways in which you can extend it are diverse and can lead to a small chaos. ZenPacks give you so much freedom that you can pack any file or database object into them. But there is absolutely no control over whether the installed data still matches what was in the ZenPack. Also, it gets very difficult to trace back which things were installed by which package. Installing, upgrading and versioning is completely up to you. And of course you need to install the ZenPacks on each distributed collector individually, or sync them in the web interface, which copies files over SSH and restarts everything, but you still have to do it for each collector.
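
To at least spot drift between collectors, the installed ZenPack lists can be compared; a sketch assuming the zenpack command with its --list option is available in the zenoss user's PATH, with placeholder hostnames:

    # compare the installed ZenPack lists of two collectors
    for h in collector1 collector2; do
        ssh zenoss@$h "zenpack --list" > /tmp/zenpacks.$h
    done
    diff /tmp/zenpacks.collector1 /tmp/zenpacks.collector2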

There are a lot of methods to gather data in Zenoss; it even has its own syslog daemon you can stream to. But the integration of the various methods lacks care for details in the implementation. SSH commands very often report as an error even when the check did not really fail but merely timed out. This happens because there is no clean distinction between a failed state and an unknown state, like Nagios has. I have also seen hundreds of OIDs reported as errors just because there was a problem reading data from an SNMP agent.
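
The syslog path at least is easy to feed; forwarding from a classic syslogd is a one-liner, where zenoss-collector is a placeholder and assuming the Zenoss syslog daemon listens on the default syslog port:

    # /etc/syslog.conf on the monitored host: send everything to the collector
    *.*    @zenoss-collector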

Behind the scenes

Rather than having to define every item you want to monitor, Zenoss brings a lot of mechanisms that automatically gather data, and it finds out things all by itself. The consequences can be bad performance, until you notice that periodic port scans are not such a good idea in a network where the majority of ports are silently blocked by default. The concept of modeling devices, so that you don't have to gather certain data every time you do a check, is in some cases overused. E.g. the size of a filesystem is modeled, but its usage is not. So if you grow a filesystem, the usage is still computed against the old modeled size, and you can get an error of e.g. 110% full, or -10% free, until some hours later the device is remodeled and the filesystem size is adapted to the grown value.

Security

A monitoring server can easily become a security concern. It is a central device that can access data from all other devices. Whereas for certain monitoring methods the risk might be within an acceptable range, there are others like SSH, where the shell access leaves a lot of room for possible exploits and is difficult to restrict. The most important thing here is to handle stored passwords and keys in a secure way. But in Zenoss I found various ways to retrieve passwords as non-root and also as a non-admin Zenoss user. One way is to execute predefined commands on devices, which gives you e.g. the clear text SNMP password in the output log. Another way is to browse the Zope management interface behind Zenoss, which gives you low level access to the data. A lot of this is also an issue of access management.
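
The SSH side at least can be tightened on the monitored hosts themselves; a sketch using a standard OpenSSH forced command, with the key and the allowed command as placeholders:

    # ~/.ssh/authorized_keys on the monitored host: the monitoring key may
    # only run one fixed command, no pty, no forwarding
    command="/usr/local/bin/monitoring-checks",no-pty,no-port-forwarding,no-X11-forwarding ssh-rsa AAAA... zenoss-monitoring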

As already mentioned, other weaknesses include Zenoss' need for the root SSH password (although this seems to have changed in newer versions). It is needed to automatically update software and distribute configuration to all collectors. Another problem is operation in untrusted networks: most daemons make unencrypted connections, leaving it up to you to wrap everything in SSL tunnels.