[collectd] Availability monitoring (replacing nagios with thresholds)

Thu Jun 23 07:07:44 CEST 2011

Currently I do availability monitoring with a simple nagios2 setup.  I
want to replace this with collectd, consolidating performance and
availability monitoring into a single piece of infrastructure.

If you want to make me really happy, you can say "why don't you just
do <obvious thing>?"  I darkly suspect that filters can do about a
third of this, but I definitely don't "get" them yet.

The core of this is covered by Thresholds, especially with the newer
Hits and Hysteresis (and Timeout) options.  I especially love that I
can define an alert level for a resource type, and it automatically
applies to all hosts.  (By comparison, in nagios each resource must be
enumerated explicitly.  For example, if a new filesystem is added and
nobody tells nagios, nagios won't warn when it's full.)  That's a
pretty killer feature for me, because most of my customers are too
small to have any real change management.

Here's the hurdle I'm currently stuck on:

  - "escalation" of outages as they become more serious.  For example,
    if a customer's web server is down, it should email them
    immediately, but it shouldn't whinge to *me* about it unless it's
    down for an hour or more.

    This is apparently a non-negotiable business requirement; customer
    support contracts are written to assume it or something.  And I'm
    struggling to work out how to implement it.

    First I thought of turning on "Persist true" and having the
    email-sending handler email $customer when the status changes,
    while counting the notifications since the last edge and
    notifying me once the count exceeds an hour's worth of
    notifications.

    The naïve implementation of this won't work because collectd fires
    up exec/perl/python notification handlers asynchronously, and they
    can't reliably store state between invocations (i.e. no closures).

    I was starting to roll an elaborate framework to store
    notification counts outside the process (like, in a write-mutex
    database or in filesystem stamp files), when a co-worker suggested
    just feeding the outage state back into collectd as a new
    pseudo-plugin.  Then, you can use a new set of thresholds to
    monitor the "derived" resource, and the email-sending handler can
    simply dispatch to $customer for normal resources, and to me for
    notifications on derived resources.

    So that's what I'm doing now, and while it's a bit fugly, it seems
    to work.  Attached is the Exec script I'm using and a half-assed
    collectd.conf snippet to use it.  (I couldn't get the perl plugin to
    work, and my python-fu was weak enough to slow me down.  I'll
    optimize later if using exec gets too laggy.)
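
To make the feedback loop concrete, here is a minimal Python sketch of
the idea.  It is not the attached derive.sh.  It relies only on the
Exec plugin's documented formats: notifications arrive on stdin as
"Key: value" header lines, a blank line, then the message, and values
are submitted by printing PUTVAL lines.  The numeric severity mapping,
the "derive-<plugin>" naming and the 60 second interval are
assumptions for illustration, and how the state travels from "notify"
mode to "read" mode is deliberately left out.

```python
# Assumed mapping from notification severity to a gauge value; the
# real script may encode outage state differently.
SEVERITY_LEVEL = {"OKAY": 0, "WARNING": 1, "FAILURE": 2}


def parse_notification(text):
    """Split an exec-plugin notification into (headers, message).

    The input is "Key: value" header lines, a blank line, then the
    free-form message body.
    """
    head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            headers[key.strip()] = value.strip()
    return headers, body.strip()


def derived_putval(headers, interval=60):
    """Format a PUTVAL line reporting the outage state as a gauge."""
    level = SEVERITY_LEVEL.get(headers.get("Severity", ""), 0)
    # Hypothetical identifier layout: plugin "derive", with the
    # original plugin name as the plugin instance.
    ident = "{}/derive-{}/gauge-state".format(
        headers.get("Host", "unknown"), headers.get("Plugin", "unknown"))
    return 'PUTVAL "{}" interval={} N:{}'.format(ident, interval, level)
```

A second set of thresholds on the "derive" plugin (with a suitably
large Hits) then decides when the derived value has been bad for long
enough to bother me.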

Here are the hurdles I've identified and worked around:

  - nagios is pull-oriented, so the hub can be behind a NAT, but the
    leaves can't.  collectd is push-oriented, so the reverse applies.

    I prefer the collectd way, so no problem there.

  - nagios provides a "dashboard" of current outages.

    I think this can be implemented by having a notification callback
    in Exec/Perl/Python that writes out static (read-only) HTML files.

    If the "derived resource" approach is used, it's just a cron job
    that renders a dashboard from the derived values once an hour or
    whatever.

    I don't care about providing an interactive web UI.

  - nagios allows one to "acknowledge" outages, to stop receiving
    periodic "it's still down" emails.

    I don't care about this, because I don't use "Persist true".

  - nagios allows one to "schedule" outages, so that no notification
    is sent.

    I was thinking of doing this by creating stamp files in
    /var/lib/collectd/rrd/foo/bar.{begin,end}.  The email handler will
    not react to notifications of foo-bar when begin < now < end.
    Creation of the stamp files can be partially automated by a
    helper script if necessary.

  - nagios understands "dependencies", e.g. you can tell it not to
    complain about www.example.net's HTTP service being down if it
    can't even ping www.example.net.

    I'm not too worried about this because the only thing I use it for
    currently is to say "don't complain about offsite things when the
    local internet link is down", and I expect I can hard-code that
    into the email-sending handler.

  - alerting different people.  For example, email me at example.net for
    all example.net-related issues, but email customer at example.com for
    all example.com issues.

    This can be done with a simple case dispatch against Host (for
    most resources), or TypeInstance (for ping) in the email-sending
    handler.
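
Two of the workarounds above (scheduled outages and per-customer
dispatch) boil down to small decisions inside the email-sending
handler.  A rough Python sketch follows, assuming "headers" is a dict
of the notification's "Key: value" lines.  The routing table, the
addresses, the glob patterns and the use of stamp-file mtimes as the
begin/end times are all assumptions for illustration; none of this is
tested against a live collectd.

```python
import fnmatch
import os
import time

# Hypothetical routing table; the first matching pattern wins.
ROUTES = [
    ("*.example.net", "me@example.net"),
    ("*.example.com", "customer@example.com"),
]


def recipient(headers, routes=ROUTES, default="me@example.net"):
    """Pick who to email, dispatching on Host (TypeInstance for ping)."""
    # For ping, Host is the machine doing the pinging; the target is
    # carried in TypeInstance.
    if headers.get("Plugin") == "ping":
        key = headers.get("TypeInstance", "")
    else:
        key = headers.get("Host", "")
    for pattern, address in routes:
        if fnmatch.fnmatch(key, pattern):
            return address
    return default


def in_scheduled_outage(host, resource, now=None,
                        base="/var/lib/collectd/rrd"):
    """True when begin < now < end, per the <resource>.begin/.end stamps.

    Assumes each stamp file's mtime holds the boundary time.
    """
    now = time.time() if now is None else now
    stem = os.path.join(base, host, resource)
    try:
        begin = os.path.getmtime(stem + ".begin")
        end = os.path.getmtime(stem + ".end")
    except OSError:
        return False  # no complete pair of stamp files: nothing scheduled
    return begin < now < end
```

The handler would call in_scheduled_outage() first and return early if
it is true, then send to whatever recipient() yields.  With mtimes as
the boundaries, "touch -d" would be enough to schedule an outage.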

PS: while writing this, some other wishlist items came to mind:

  - Currently thresholds can be defined globally, or per
    host/instance.  It would be nice if I could also do it per "site",
    basically by allowing wildcards in host/instance, such as <Host
    "*.example.net"> or <Type "ping"> Instance "*.example.net".

  - It might be nice to define global default Hits, Hysteresis and Timeout,
    which are then overridden in specific thresholds if necessary.
    Currently Hits and Hysteresis are hard-coded to be off globally
    (AFAICT).

  - I haven't found a good way to distinguish between OK/FAILURE
    events caused by "missing" data vs. failed data.  Possibly an
    extra field in the rfc822-like stdin?

  - Can't "see" in the handler what the threshold was (e.g. percent
    vs. absolute value), except in the pre-formatted message body.

  - the libesmtp plugin is currently too limited for my requirements,
    so I've been referring to "email-sending handler" with the
    assumption it's an in-house exec/perl/python script, but it might
    be feasible to add enough useful-for-everybody features to the
    libesmtp plugin that I can just use it instead.
-------------- next part --------------
# This config snippet is responsible for outage notifications (a la
# nagios).  It defines alert thresholds for various resources, and
# sets up the "derive" pseudoplugin to allow two "layers" of alert
# notification.

LoadPlugin exec
<Plugin exec>
  NotificationExec nobody "/etc/collectd/derive.sh" "-m" "notify"
  Exec             nobody "/etc/collectd/derive.sh" "-m" "read"
  NotificationExec nobody "/etc/collectd/derive.sh" "-m" "email"
</Plugin>

# This part isn't actually tested yet; the test example was stupid and
# unreadable.
LoadPlugin threshold
<Threshold>
  <Plugin derive>
    WarningMax 2
    FailureMax 3
    Hits       5
    Hysteresis 3
  </Plugin>
  <Plugin ping>
    WarningMax 3
    FailureMax 5
    Hits       2
    Hysteresis 2
  </Plugin>
  <Type "df">
    DataSource "free"
    WarningMin 10
    FailureMin 5
    Percentage true
    Hits       60
    Hysteresis 20
  </Type>
</Threshold>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: derive.sh
Type: application/x-sh
Size: 3895 bytes
Desc: not available
URL: <http://mailman.verplant.org/pipermail/collectd/attachments/20110623/73aee39e/attachment.sh>