[collectd] Thresholds: xxx has not been updated for yy seconds

Mon Mar 22 15:52:19 CET 2010

Hello,

I am using collectd to monitor 5 physical servers. Each one runs a
collectd instance and sends its data every 10 seconds to a central server,
which stores the data in RRD files on its filesystem and also monitors
itself. I can already draw nice graphs, but the next step was to be able to
trigger email alerts, and that is where my problem lies :)

The threshold configuration I currently use is at the bottom of this email.
I have three alerts configured (for testing at first):
- disk space used (alert when > 80%)
- load (alert when midterm is above 0.8)
- ping (only one is shown but I have more): alert when 5% of pings are lost
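
A minimal Python sketch of the three checks listed above. The function
names and sample readings are hypothetical, and the df check assumes the
percentage is computed as used / (used + free), which is one reading of
the Percentage option and not something confirmed here.

```python
def df_used_percent(used, free):
    """Share of used space in percent, assuming used + free is the total."""
    return 100.0 * used / (used + free)


def exceeds(value, failure_max):
    """True when a value is above the FailureMax limit."""
    return value > failure_max


# Hypothetical sample readings, not measured on any real server.
disk_alert = exceeds(df_used_percent(used=85.0, free=15.0), 80)
load_alert = exceeds(0.75, 0.8)    # midterm load
ping_alert = exceeds(0.10, 0.05)   # ping_droprate, 10% of pings lost
```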

My problem is that alerts are triggered very often because data from either
the load or the df plugin is missing from what the central server receives.
It is not always the same server, and when it happens it tends to happen
more than once. I tried several things to solve the problem:

RRD files were initially written to disk by collectd itself. I then
switched to rrdcached, which greatly reduced disk writes but did not solve
my problem (I thought disk access might be to blame, since I saw spikes of
up to 3MB/s).

My second attempt was to increase the thread count. From the documentation
I read that the number of threads can be set to the number of enabled
plugins in case of problems. I tried that (I use around 11 plugins) and
also tried raising it to 20 just as a test; the problem remains.

My last attempt was simply to check what was going on on the network by
running a network sniffer (ngrep) on the central server. The result is that
every 10s the collectd servers really do send data as they should, BUT not
all of these packets contain the load or df values (it may also happen with
other fields, but I did not check every one of them). The load and df
fields are sometimes included in only 1 out of 3 packets, meaning they are
sent every 30s instead of every 10s, but most of the time they are in every
other packet, so every 20s.
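
The effective update interval seen in such a capture can be worked out from
the packet timestamps. Below is a minimal Python sketch, assuming each
captured packet has already been reduced to a (timestamp, set of plugin
names) pair. The data and the update_gaps helper are hypothetical and do
not parse real ngrep output.

```python
def update_gaps(timestamps):
    """Seconds between consecutive sightings of one value, in time order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical capture: a packet arrives every 10 s, but the load value
# is only present in some of them.
packets = [
    (0, {"load", "ping"}),
    (10, {"ping"}),
    (20, {"load", "ping"}),
    (30, {"ping"}),
    (40, {"ping"}),
    (50, {"load", "ping"}),
]

load_seen = [ts for ts, plugins in packets if "load" in plugins]
ping_seen = [ts for ts, plugins in packets if "ping" in plugins]

load_gaps = update_gaps(load_seen)   # gaps of 20 s and 30 s
ping_gaps = update_gaps(ping_seen)   # a steady 10 s between sightings
```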

Example of alerts triggered:

9:44 <server>/load/load has not been updated for 20 seconds.
9:44 Received a value for <server>/load/load. It was missing for 20 seconds.

10:22 <server>/load/load has not been updated for 20 seconds.
10:22 Received a value for <server>/load/load. It was missing for 20 seconds.

10:24 <server>/load/load has not been updated for 20 seconds.
10:24 Received a value for <server>/load/load. It was missing for 20 seconds.
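
The 20 seconds in these messages is exactly twice the 10-second send
interval. Below is a minimal Python sketch of a staleness check that would
produce this pattern, assuming a value counts as missing once no update
has arrived for two intervals. That rule is an assumption about collectd's
behaviour, and all names and timestamps are hypothetical.

```python
INTERVAL = 10        # seconds between updates, as configured on the senders
MISSING_FACTOR = 2   # assumed: "missing" after two intervals without data


def is_missing(last_update, now, interval=INTERVAL, factor=MISSING_FACTOR):
    """True when no update arrived for at least factor * interval seconds."""
    return (now - last_update) >= factor * interval


def missing_events(update_times, check_times):
    """Return the check times at which the value would be reported missing."""
    events = []
    for now in check_times:
        seen = [t for t in update_times if t <= now]
        if seen and is_missing(max(seen), now):
            events.append(now)
    return events


# Hypothetical timeline: the value arrives roughly every 20 s, with the
# third update one second late; checks run every 10 s.
updates = [0, 20, 41]
checks = [0, 10, 20, 30, 40, 50, 60]
events = missing_events(updates, checks)
```

Under these assumptions a value that arrives every 20s sits right on the
limit, so a delay of a single second is enough to raise the alert, and the
next update clears it again.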

The strange thing is that the ping plugin never raises a "data are missing"
alert and only triggers an alert when it should. I really feel lost on that
:\

Thanks in advance for any help.

Threshold configuration:

<Threshold>
  <Host "xxxxx">
    <Plugin "ping">
      <Type "ping_droprate">
        Datasource "value"
        # 5 % ?
        FailureMax 0.05
      </Type>
    </Plugin>
  </Host>

  <Plugin "load">
    <Type "load">
      Datasource "midterm"
      FailureMax 0.8
    </Type>
  </Plugin>

  <Plugin "df">
    <Type "df">
      Datasource "used"
      FailureMax 80
      Percentage true
    </Type>
  </Plugin>
</Threshold>