[collectd] collectd -> Nagios connectivity

Sun Jan 28 19:48:49 CET 2007

Hi,

I have been made aware that some people argue against collectd because
it ``doesn't work with Nagios''. While collectd isn't (and is not
supposed to be) a monitoring application, it is of course possible to
provide the collected values to an external application. I have written
a proof-of-concept implementation and would like to hear some opinions
on it. So if you're interested in that sort of thing, please read on.

I'll first tell you what I did, and later on I'll tell you why I did it
like that. I've added a new `write' plugin to collectd which stores the
values in memory. The plugin also opens a UNIX-socket on which it waits
for incoming connections. Clients connecting to the plugin can send
commands (the only one implemented right now is `GETVAL'), which the
plugin then handles.
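
A minimal sketch of such a client, in Python. The request format
("GETVAL", a space, the identifier, a newline), the one-line reply and
the function names are assumptions made for illustration; the
proof-of-concept may differ in all of these.

```python
import socket


def getval_on(sock, identifier):
    """Send one GETVAL request on a connected socket, return the reply line."""
    # Assumed request format: command, space, identifier, newline.
    sock.sendall(f"GETVAL {identifier}\n".encode("ascii"))
    reply = b""
    # Assumed reply format: a single newline-terminated line.
    while not reply.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        reply += chunk
    return reply.decode("ascii").strip()


def getval(socket_path, identifier):
    """Connect to the plugin's UNIX-socket and query one data-set."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        return getval_on(sock, identifier)
```

The socket path is whatever the plugin has been configured to use, so
it is passed in by the caller rather than hard-coded.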

With the `GETVAL'-command you can query the last values of a certain
data-set, e.g. `hostname/traffic/if_octets-eth0'. The return string is
something along the lines of: ``2 rx=3.95000000e5 tx=2.2360000e3''. The
values returned are always rates, i.e. counter values have been
converted to `gauge'-values.
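
To illustrate the return string, here is a small Python sketch that
parses it. It assumes the format is exactly what the example above
suggests: the number of values, followed by whitespace-separated
`name=value' pairs. The function name is hypothetical.

```python
def parse_getval_reply(line):
    """Parse a reply such as '2 rx=3.95000000e5 tx=2.2360000e3'."""
    count_field, *pairs = line.split()
    count = int(count_field)  # assumed: first field is the number of values
    values = {}
    for pair in pairs:
        name, _, raw = pair.partition("=")
        values[name] = float(raw)
    if len(values) != count:
        raise ValueError(f"expected {count} values, got {len(values)}")
    return values


print(parse_getval_reply("2 rx=3.95000000e5 tx=2.2360000e3"))
# prints {'rx': 395000.0, 'tx': 2236.0}
```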

Another small application has been written to perform the actual
Nagios checks. It connects to the UNIX-socket and queries the values of
a data-set which is given as a command line argument. Since more than
one value might be returned, there is a command line argument to select
one or more values (if omitted, all values are used). The remaining
values are then `consolidated' in one of three ways:
`none':    The range is applied to each value individually. If one
           range-check is `critical', the entire check will return
           critical.
`average': The average of all values is calculated and the range-check
           is done on this average.
`sum':     The sum of all values is calculated and the range-check is
           done on this sum.
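
The three modes can be sketched in Python as follows. This is only an
illustration of the idea, not the code of the check application; the
function name and the list-in, list-out interface are assumptions.

```python
def consolidate(values, mode="none"):
    """Reduce the selected values to the list that gets range-checked."""
    values = list(values)
    if mode == "none":
        return values  # every value is checked individually
    if not values:
        raise ValueError("no values to consolidate")
    if mode == "average":
        return [sum(values) / len(values)]
    if mode == "sum":
        return [sum(values)]
    raise ValueError(f"unknown consolidation mode: {mode!r}")


rates = [395000.0, 2236.0]  # rx and tx from the example reply
print(consolidate(rates, "none"))     # prints [395000.0, 2236.0]
print(consolidate(rates, "average"))  # prints [198618.0]
print(consolidate(rates, "sum"))      # prints [397236.0]
```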

Why so complicated? Because I think it's necessary:

While reading up on Nagios' plugin interface I started to wonder why
everybody sticks with this software: With its huge userbase, the
numerous plugins available and it being the successor of the already
popular `NetSaint', I thought that its plugin concept would be easy to
use, powerful and all in all superb. Well, that's not the case.

The straight-forward solution would have been to simply submit `passive
checks' to Nagios. This is not reasonable, because
- Nagios wants to know if a check was successful or not, which collectd
  doesn't know. Of course this could have been configured, but then the
  Nagios-configuration would have been torn apart and you couldn't
  configure the checks from within Nagios.
- If a computer or service disappears, an error should be generated.
  However, this would mean setting a timer within the plugin, resetting
  the timer whenever the value is updated and basically implementing all
  the monitoring stuff within collectd. This is not right.

So we need active checks, which Nagios clearly favors. The UNIX-socket
provides a fairly easy and flexible solution for this. The idea is to
expand the plugin in the future, e.g. to allow it to load/unload
plugins while the daemon is running.

One last annoyance: Nagios passes `ranges' to the plugins and expects
them to report `okay', `warning', or `critical' back. This complicates
argument parsing significantly: My first implementation that simply
returned the values as returned via the UNIX-socket was <200 lines. With
all these ranges and selecting DSes and stuff it's now >450 lines.
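
To give an idea of what the range handling involves, here is a Python
sketch. It assumes the range notation described in the Nagios plugin
development guidelines (`10', `10:', `~:10', `10:20', `@10:20') and the
usual exit codes 0, 1 and 2 for okay, warning and critical. It is not
the code of the check application, and all names are hypothetical.

```python
import math

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def parse_range(spec):
    """Return (low, high, alert_inside) for a range such as '@10:20'."""
    alert_inside = spec.startswith("@")
    if alert_inside:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
    else:
        low_s, high_s = "0", spec  # a bare '10' means 0..10
    low = -math.inf if low_s == "~" else float(low_s or 0)
    high = math.inf if high_s == "" else float(high_s)
    return low, high, alert_inside


def violates(value, spec):
    """True if the value should raise an alert for this range."""
    low, high, alert_inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if alert_inside else not in_range


def check(values, warning, critical):
    """Worst status over all values; one critical value wins."""
    status = OK
    for value in values:
        if violates(value, critical):
            return CRITICAL
        if violates(value, warning):
            status = WARNING
    return status


print(check([395000.0, 2236.0], "0:300000", "0:1000000"))  # prints 1
```

The real application additionally has to map the status to the plugin's
exit code and output line, which is part of why the argument handling
grows so quickly.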

I'm anxious to hear your comments and to be ripped apart by enthusiasts
;)
Regards,
-octo
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/