<div>I checked the logs of the collectd instances that have the problem, but there is not much in them: only some warnings about configuration blocks for disabled plugins, which shows I have some cleaning up to do after my tests :p</div><div>
For now I will follow your suggestion and increase the number of missed intervals needed to trigger an alert, but as you said it is only a workaround. I still have no clue why this happens: the servers are not under heavy load and the network between them is fine (they are all hosted in the same datacenter, and when sniffing the network I can see the packets every 10s; it is just that some counters are missing from them...).</div>
<div><br></div><div>Interval is set to 10s on every node and none of our monitored data requires action within a minute, so getting an alert about a missing machine only after the server has been missing for as long as 60s is acceptable.</div>
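<div><br></div><div>As a rough sketch of that workaround, assuming the patched collectd exposes Timeout as a global collectd.conf option counted in intervals (the value 6 is only an illustration of the 60s figure above, not a tested setting):</div>
<pre>
# Sketch of global collectd.conf settings; assumes the Timeout patch is applied.
# Every node reports every 10 seconds.
Interval 10
# Assumed meaning: a value counts as missing after 6 intervals (6 x 10s = 60s).
Timeout 6
</pre>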
<div>It still puzzles me and I really want to find a solution, but at least I can use alerts now ^^</div><div>And I forgot to mention it in my first email, but all servers are synchronised with an external time server (several servers in fact, in case one fails).</div>
<div><br></div><div>Thanks for your answer.</div><div><br></div><div>Julien A.<br><br><div class="gmail_quote">On 22 March 2010 22:02, Andrés J. Díaz <span dir="ltr"><<a href="mailto:ajdiaz@connectical.com">ajdiaz@connectical.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">On Mon, 22 Mar 2010 15:52:19 +0100<br>
Schmurfy <<a href="mailto:schmurfy@gmail.com">schmurfy@gmail.com</a>> wrote:<br>
<br>
> Hello,<br>
<br>
Hello Schmurfy :)<br>
<br>
> [...]<br>
<div class="im">> My problem is that alerts are triggered really often because either<br>
> load or df plugin data are missing for data received (it is not<br>
> always the same server and when it happens it tends to do it more<br>
> than once), I tried different things to solve the problems:<br>
<br>
</div>There are a number of possible reasons for your problem; the most<br>
common one is LAN latency. As you are checking for missing pings,<br>
I would think that there were previous problems in the network,<br>
innit? ;)<br>
<div class="im"><br>
> My last attempt at solving this was simply to check what was going on<br>
> the networks by putting a network sniffer on the central server<br>
> (ngrep), results are that every 10s the collectd servers really send<br>
> the data as it should BUT not all of these packets contain the load<br>
> or df value (it may also happen with other fields but I did not<br>
> check every one of them), the load and df fields can sometimes be<br>
> included in 1/3 packets meaning it is sent every 30s instead of 10s<br>
> but most of the time it is simply sent half of the time so every 20s.<br>
<br>
</div>I assume that all clients are reporting at the same interval. Do you<br>
have any other weird messages in the collectd log? In some situations<br>
(when a plugin fails) the collection of data can be stopped for a while,<br>
but in these cases you will see the corresponding message in the collectd log.<br>
<br>
Also be careful with the date and time on the nodes. I had some problems in<br>
the past with non-synchronized hosts; in these cases RRD often fails<br>
before the thresholds do, and the log errors are very verbose.<br>
<div class="im"><br>
> The strange thing is that the ping plugin never raises a "data are<br>
> missing" alert and only triggers an alert when it should, I really<br>
> feel lost on that :\<br>
<br>
</div>Really this is not unusual, because the ping plugin is<br>
dispatched by the server and the problem appears to be in the<br>
communication with the clients, so I think that we can discard<br>
network-related problems... :/<br>
<br>
On the other hand, you have an easy workaround for thresholds (of<br>
course this is not a solution). I recently posted a patch to change the<br>
timeout used in thresholds, so you can use:<br>
<br>
Timeout 3<br>
<br>
to increase the checking time for missing values to 30s (3 intervals).<br>
<br>
Regards,<br>
Andres<br>
<br>
</blockquote></div><br></div>