[collectd] Availability monitoring (replacing nagios with thresholds)
Trent W. Buck
twb-mailman-collectd at cyber.com.au
Thu Jun 23 07:07:44 CEST 2011
Currently I do availability monitoring with a simple nagios2 setup. I
want to replace this with collectd, consolidating performance and
availability monitoring into a single piece of infrastructure.

If you want to make me really happy, you can say "why don't you just
do <obvious thing>?" I darkly suspect that filters can do about a
third of this, but I definitely don't "get" them yet.

The core of this is covered by Thresholds, especially with the newer
Hits and Hysteresis (and Timeout) options. I especially love that I
can define an alert level for a resource type, and it automatically
applies to all hosts. (By comparison, in nagios each resource must be
enumerated explicitly. For example, if a new filesystem is added and
nobody tells nagios, nagios won't warn when it's full.) That's a
pretty killer feature for me, because most of my customers are too
small to have any real change management.

Here's the hurdle I'm currently stuck on:
- "escalation" of outages as they become more serious. For example,
if a customer's web server is down, it should email them
immediately, but it shouldn't whinge to *me* about it unless it's
down for an hour or more.
This is apparently a non-negotiable business requirement; customer
support contracts are written to assume it or something. And I'm
struggling to work out how to implement it.
First I thought of turning on "Persist true" and having the
email-sending handler email $customer when the status changes,
while counting the notifications since the last edge and notifying
me once that count exceeds an hour's worth of notifications.
The naïve implementation of this won't work because collectd fires
up exec/perl/python notification handlers asynchronously, and they
can't reliably store state between invocations (i.e. no closures).
I was starting to roll an elaborate framework to store
notification counts outside the process (like, in a write-mutex
database or in filesystem stamp files), when a co-worker suggested
just feeding the outage state back into collectd as a new
pseudo-plugin. Then, you can use a new set of thresholds to
monitor the "derived" resource, and the email-sending handler can
simply dispatch to $customer for normal resources, and to me for
notifications on derived resources.
So that's what I'm doing now, and while it's a bit fugly, it seems
to work. Attached is the Exec script I'm using and a half-assed
collectd.conf snippet to use it. (I couldn't get the perl plugin to
work, and my python-fu was weak enough to slow me down. I'll
optimize later if using exec gets too laggy.)
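
For readers without the attachment, here is a minimal Python sketch of
the feedback idea described above. It is not the attached derive.sh.
It assumes the header-plus-body notification format that the exec
plugin writes to a handler's stdin, and the PUTVAL line format that an
Exec script prints on stdout. The severity-to-number mapping, the
"derive-..." instance naming, and the function names are all
hypothetical. How the state travels from the notify invocation to the
read invocation (stamp files, a small database) is left out.

```python
# Hypothetical mapping from notification severity to a gauge value.
SEVERITY_VALUE = {"OKAY": 0, "WARNING": 1, "FAILURE": 2}


def parse_notification(text):
    """Split a notification into (header fields, message body).

    The header is one "Name: value" pair per line and ends at the
    first empty line; everything after that is the message.
    """
    head, _, body = text.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields, body.strip()


def derived_putval(fields, interval=10):
    """Build a PUTVAL line that reports the outage state as a gauge."""
    # Name the derived instance after the original resource
    # (an assumed naming scheme, not something collectd prescribes).
    keys = ("Plugin", "PluginInstance", "Type", "TypeInstance")
    instance = "_".join(fields[k] for k in keys if fields.get(k))
    value = SEVERITY_VALUE.get(fields.get("Severity", ""), 0)
    return 'PUTVAL "{}/derive-{}/gauge" interval={} N:{}'.format(
        fields["Host"], instance, interval, value)
```

The "notify" mode would call parse_notification() on stdin and record
the result somewhere; the "read" mode would print derived_putval() for
each recorded resource once per interval, so that a second set of
thresholds can watch the derived values.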
Here are the hurdles I've identified and worked around:
- nagios is pull-oriented, so the hub can be behind a NAT, but the
leaves can't. collectd is push-oriented, so the reverse applies.
I prefer the collectd way, so no problem there.
- nagios provides a "dashboard" of current outages.
I think this can be implemented by having a notification callback
in Exec/Perl/Python that writes out static (read-only) HTML files.
If the "derived resource" approach is used, it's just a cron job
that renders a dashboard from the derived resources once an hour or
whatever.
I don't care about providing an interactive web UI.
- nagios allows one to "acknowledge" outages, to stop receiving
periodic "it's still down" emails.
I don't care about this, because I don't use "Persist true".
- nagios allows one to "schedule" outages, so that no notification
is sent.
I was thinking of doing this by creating stamp files in
/var/lib/collectd/rrd/foo/bar.{begin,end}. The email handler will
not react to notifications of foo-bar when begin < now < end.
Creation of the stamp files can be partially automated by a helper
script if necessary.
- nagios understands "dependencies", e.g. you can tell it not to
complain about www.example.net's HTTP service being down if it
can't even ping www.example.net.
I'm not too worried about this because the only thing I use it for
currently is to say "don't complain about offsite things when the
local internet link is down", and I expect I can hard-code that
into the email-sending handler.
- alerting different people. For example, email me at example.net for
all example.net-related issues, but email customer at example.com for
all example.com issues.
This can be done with a simple case dispatch against Host (for
most resources), or TypeInstance (for ping) in the email-sending
handler.
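
Two of the workarounds in the list above, scheduled outages and
per-customer dispatch, come down to small decisions inside the
email-sending handler. A rough Python sketch follows, under these
assumptions: the begin/end times are taken from the stamp files'
mtimes, the routing table and its glob patterns are made up for
illustration, and `fields` is a dict of the notification's header
fields (Host, Plugin, TypeInstance, and so on).

```python
import fnmatch
import os

# Hypothetical routing table; the first matching pattern wins.
RECIPIENTS = [
    ("*.example.net", "me@example.net"),
    ("*.example.com", "customer@example.com"),
]
DEFAULT_RECIPIENT = "me@example.net"


def routing_key(fields):
    """Ping names its target in TypeInstance; other plugins use Host."""
    if fields.get("Plugin") == "ping":
        return fields.get("TypeInstance", "")
    return fields.get("Host", "")


def recipient_for(fields):
    """Case dispatch on the routing key, expressed as glob patterns."""
    key = routing_key(fields)
    for pattern, address in RECIPIENTS:
        if fnmatch.fnmatchcase(key, pattern):
            return address
    return DEFAULT_RECIPIENT


def stamp_mtime(path):
    """Return a stamp file's mtime, or None if the file is missing."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def in_scheduled_outage(now, begin, end):
    """True while begin < now < end; a missing stamp means no outage."""
    if begin is None or end is None:
        return False
    return begin < now < end
```

The handler would look up the .begin and .end stamps for the resource
with stamp_mtime(), drop the notification when in_scheduled_outage()
is true, and otherwise mail it to recipient_for(fields).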
PS: while writing this, some other wishlist items came to mind:
- Currently thresholds can be defined globally, or per
host/instance. It would be nice if I could also do it per "site",
basically by allowing wildcards in host/instance, such as <Host
"*.example.net"> or <Type "ping"> Instance "*.example.net".
- It might be nice to define global default Hits, Hysteresis and Timeout,
which are then overridden in specific thresholds if necessary.
Currently Hits and Hysteresis are hard-coded to be off globally
(AFAICT).
- I haven't found a good way to distinguish between OK/FAILURE
events caused by "missing" data vs. failed data. Possibly an
extra field in the rfc822-like stdin?
- Can't "see" in the handler what the threshold was (e.g. percent
vs. absolute value), except in the pre-formatted message body.
- the libesmtp plugin is currently too limited for my requirements,
so I've been referring to "email-sending handler" with the
assumption it's an in-house exec/perl/python script, but it might
be feasible to add enough useful-for-everybody features to the
libesmtp plugin that I can just use it instead.
-------------- next part --------------
# This config snippet is responsible for outage notifications (a la
# nagios). It defines alert thresholds for various resources, and
# sets up the "derive" pseudoplugin to allow two "layers" of alert
# notification.
LoadPlugin exec
<Plugin exec>
NotificationExec nobody "/etc/collectd/derive.sh" "-m" "notify"
Exec nobody "/etc/collectd/derive.sh" "-m" "read"
NotificationExec nobody "/etc/collectd/derive.sh" "-m" "email"
</Plugin>
# This part isn't actually tested yet; the test example was stupid and
# unreadable.
LoadPlugin threshold
<Threshold>
<Plugin derive>
WarningMax 2
FailureMax 3
Hits 5
Hysteresis 3
</Plugin>
<Plugin ping>
WarningMax 3
FailureMax 5
Hits 2
Hysteresis 2
</Plugin>
<Type "df">
DataSource "free"
WarningMin 10
FailureMin 5
Percentage true
Hits 60
Hysteresis 20
</Type>
</Threshold>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: derive.sh
Type: application/x-sh
Size: 3895 bytes
Desc: not available
URL: <http://mailman.verplant.org/pipermail/collectd/attachments/20110623/73aee39e/attachment.sh>