[collectd] Thresholds: xxx has not been updated for yy seconds

Andrés J. Díaz ajdiaz at connectical.com
Mon Mar 22 22:02:25 CET 2010


On Mon, 22 Mar 2010 15:52:19 +0100
Schmurfy <schmurfy at gmail.com> wrote:

> Hello,

Hello Schmurfy :)

> [...]
> My problem is that alerts are triggered really often because either
> load or df plugin data are missing for data received (it is not
> always the same server and when it happens it tends to do it more
> than once), I tried different things to solve the problems:

There are a number of possible reasons for your problem; the most
common one is LAN latency. Since you are checking for missing pings,
I would guess that there were earlier problems in the network,
innit? ;)

> My last attempt at solving this was simply to check what was going on
> on the network by putting a network sniffer on the central server
> (ngrep), results are that every 10s the collectd servers really send
> the data as they should BUT not all of these packets contain the load
> or df value (it may also happen with other fields but I did not
> check every one of them), the load and df fields can sometimes be
> included in 1/3 packets meaning it is sent every 30s instead of 10s
> but most of the time it is simply sent half of the time so every 20s.

I assume that all clients are reporting at the same interval. Are there
any other odd messages in the collectd log? In some situations (when a
plugin fails) the collection of data can be stopped for a while, but in
these cases you should see a corresponding message in the collectd log.

Also be careful with the date and time on the nodes. I had some problems
in the past with non-synchronized hosts; in these cases RRD often fails
before the thresholds do, and the errors in the log are very verbose.
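
As a quick way to spot non-synchronized hosts, you could compare the
timestamp carried by each received value with the time at which the
central server received it. The following is only a rough sketch in
Python, not part of collectd: the function name, the host names, the
sample data and the 5 second tolerance are all made up for illustration.

```python
def skewed_hosts(samples, tolerance=5.0):
    """Return {host: skew_in_seconds} for hosts whose clock looks off.

    samples is an iterable of (host, sample_timestamp, received_timestamp),
    all in seconds since the epoch. The tolerance is an arbitrary
    illustrative value, not a collectd setting.
    """
    result = {}
    for host, sent, received in samples:
        skew = sent - received  # negative: the host clock is behind
        if abs(skew) > tolerance:
            result[host] = skew
    return result


# Hypothetical data: "web2" stamps its values about two minutes in the past.
samples = [
    ("web1", 1269291745.0, 1269291745.4),
    ("web2", 1269291625.0, 1269291745.5),
]

for host, skew in skewed_hosts(samples).items():
    print(f"{host}: clock differs by {skew:+.1f}s")
```

Note that network delay also shows up as a small negative skew here, so
only differences well above the usual latency are worth investigating.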

> The strange thing is that the ping plugin never raises a "data are
> missing" alert and only triggers an alert when it should, I really
> feel lost on that :\

Really, this is not unusual, because the ping plugin is dispatched by
the server and the problem appears to be in the communication with the
clients, so I think that we can discard network-related problems... :/

On the other hand, you have an easy workaround for thresholds (of
course this is not a solution). I recently posted a patch to change the
timeout used in thresholds, so you can use:

Timeout 3

to increase the time before values are reported as missing to 30s
(3 intervals of 10s).
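
To make the arithmetic concrete, here is a rough sketch in Python of how
the timeout relates to the gaps you saw with ngrep. This is not
collectd's actual code: the function name, the arrival times and the
strict "greater than" comparison are assumptions for illustration, and
the value of 2 intervals used for comparison is also an assumption about
the setting without the patch.

```python
def gaps_over_timeout(arrivals, interval=10.0, timeout=3):
    """Return the gaps (in seconds) between consecutive updates that
    exceed timeout * interval, i.e. gaps that would count as "missing".

    arrivals: sorted arrival times of one value, in seconds.
    interval: reporting interval in seconds (10s in this thread).
    timeout:  number of intervals to wait, as in "Timeout 3".
    """
    limit = timeout * interval
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return [gap for gap in gaps if gap > limit]


# Hypothetical arrival times: mostly every 20s, once after 30s.
arrivals = [0.0, 20.0, 40.0, 70.0, 90.0]

print(gaps_over_timeout(arrivals, timeout=2))  # limit is 20s
print(gaps_over_timeout(arrivals, timeout=3))  # limit is 30s
```

With a limit of 20s the single 30s gap is flagged, while with
"Timeout 3" none of these gaps is. Gaps longer than 30s would still
trigger the alert, which is why this is only a workaround.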

Regards,
  Andres



