[collectd] Thresholds: xxx has not been updated for yy seconds
Schmurfy
schmurfy at gmail.com
Tue Mar 23 17:43:20 CET 2010
I checked the logs of the collectd instances with problems but there is not
much in them, only some warnings about configuration blocks for disabled
plugins, which shows I have to do some cleaning after my tests :p
For now I will follow your suggestion and increase the number of misses
required for an alert to be triggered, but as you said it is only a
workaround. I still do not have any clue about why this happens: the servers
are not under heavy load and the network between them is fine (all are hosted
in the same datacenter, and when sniffing the network I can see the packets
every 10s, it is just that some counters are missing from them...).
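
A minimal sketch of how those gaps could be measured from a capture, assuming
each sniffed packet has already been reduced to a timestamp plus the set of
value names it carried. That parsing step is not shown, and the function name,
input format and sample data below are all hypothetical:

```python
from collections import defaultdict


def update_gaps(packets):
    """Return {value_name: [seconds between consecutive sightings]}.

    `packets` is an iterable of (timestamp_seconds, names) pairs, where
    `names` is the set of value names seen in that packet. This input
    format is an assumption, not ngrep's or collectd's own output.
    """
    last_seen = {}
    gaps = defaultdict(list)
    # Sort by timestamp only, so the sets of names are never compared.
    for timestamp, names in sorted(packets, key=lambda p: p[0]):
        for name in names:
            if name in last_seen:
                gaps[name].append(timestamp - last_seen[name])
            last_seen[name] = timestamp
    return dict(gaps)


# Made-up capture: one packet every 10s, but "load" only in every other one.
capture = [
    (0, {"ping", "load"}),
    (10, {"ping"}),
    (20, {"ping", "load"}),
    (30, {"ping"}),
    (40, {"ping", "load"}),
]
gaps = update_gaps(capture)
```

Any value whose gaps are larger than the configured interval is one that is
being skipped in some packets.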
Interval is set to 10s for every node and none of our monitored data
requires action within the minute, so getting an alert about a missing
machine only after the server has been missing for 60s is acceptable.
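
The arithmetic behind that can be sketched as follows, assuming a value counts
as missing once no update has arrived for (number of allowed intervals) x
(interval) seconds. This is a simplified model for illustration, not
collectd's actual code, and the function names are hypothetical:

```python
def seconds_until_missing(interval_s, timeout_iterations):
    """Seconds without an update before a value is treated as missing."""
    return interval_s * timeout_iterations


def is_missing(now, last_update, interval_s, timeout_iterations):
    """True if the last update is older than the allowed window (assumed rule)."""
    return (now - last_update) > seconds_until_missing(interval_s, timeout_iterations)


# With a 10s interval, 3 intervals give a 30s window and 6 give 60s.
window_3 = seconds_until_missing(10, 3)
window_6 = seconds_until_missing(10, 6)
```

Under this model a value that only shows up every 20s or 30s stays inside a
60s window, so it would no longer raise a "missing" alert.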
It still puzzles me and I really want to find a solution, but at least I can
use alerts now ^^
And I forgot to write it in my first email, but all servers
are synchronised with an external time server (several servers in fact, in
case of failure).
Thanks for your answer.
Julien A.
On 22 March 2010 22:02, Andrés J. Díaz <ajdiaz at connectical.com> wrote:
> On Mon, 22 Mar 2010 15:52:19 +0100
> Schmurfy <schmurfy at gmail.com> wrote:
>
> > Hello,
>
> Hello Schmurfy :)
>
> > [...]
> > My problem is that alerts are triggered really often because either
> > load or df plugin data are missing from the data received (it is not
> > always the same server, and when it happens it tends to happen more
> > than once). I tried different things to solve the problem:
>
> There are a number of reasons for your problem, the most common
> one is the LAN latency. As you are checking for missing pings,
> I would think that there were previous problems in the network,
> innit? ;)
>
> > My last attempt at solving this was simply to check what was going on
> > on the network by putting a network sniffer (ngrep) on the central
> > server. The result is that every 10s the collectd servers really do
> > send the data as they should, BUT not all of these packets contain the
> > load or df values (it may also happen with other fields but I did not
> > check every one of them). The load and df fields are sometimes
> > included in only 1 packet out of 3, meaning they are sent every 30s
> > instead of every 10s, but most of the time they are simply sent in half
> > of the packets, so every 20s.
>
> I assume that all clients are reporting at the same interval. Do you have
> any other weird messages in the collectd log? In some situations (when a
> plugin fails) the collection of data can be stopped for a while, but in
> these cases you can see the corresponding message in the collectd log.
>
> Also be careful with the date and time on the nodes. I had some problems
> in the past with non-synchronized hosts; in these cases RRD often fails
> before thresholds do, and the log errors are very verbose.
>
> > The strange thing is that the ping plugin never raises a "data are
> > missing" alert and only triggers an alert when it should, I really
> > feel lost on that :\
>
> Really this is not an unusual thing, because the ping plugin is
> dispatched by the server and the problem appears to be in the
> communication with the clients, so I think that we can discard
> network-related problems... :/
>
> On the other hand you have an easy workaround for thresholds (of
> course this is not a solution). I recently posted a patch to change the
> timeout used in thresholds, so you can use:
>
> Timeout 3
>
> to increase the checking time for missing values to 30s (3 intervals).
>
> Regards,
> Andres
>
>