Hello,<div><br></div><div>I am using collectd to monitor 5 physical servers, each running a collectd instance and sending its data every 10 seconds to another server, which stores the data in RRD files on its filesystem and also monitors itself.</div>
<div>I can already draw nice graphs, but the next step was to be able to trigger email alerts, and that is where my problem lies :)</div><div><br></div><div>The threshold configuration I currently use is at the bottom of this email. I have three alerts monitored (for testing at first):</div>
<div>- disk space used (alert when > 80%)</div><div>- load (alert when midterm is above 0.8)</div><div>- ping (only one is shown but I have more): alert when 5% of pings are lost</div><div><br></div><div>My problem is that alerts are triggered very often because load or df plugin data is missing from the data received. It is not always the same server, and when it happens it tends to happen more than once. I tried several things to solve the problem:</div>
<div><br></div><div>RRD files were initially written to disk by collectd itself. I then switched to rrdcached, which greatly reduced disk writes but did not solve my problem (I thought disk access might be to blame since I saw spikes of up to 3MB/s).</div>
<div><br></div><div>My second attempt was to increase the thread count. I read in the documentation that, in case of problems, the number of threads can be set to the number of enabled plugins. I tried that (I use around 11 plugins) and also tried raising it to 20 just as a test, but the problem remains.</div>
<div><br></div><div>My last attempt was simply to check what was going on on the network by running a network sniffer (ngrep) on the central server. The result is that the collectd servers really do send data every 10s as they should, BUT not all of these packets contain the load or df values (it may also happen with other fields, but I did not check every one of them). The load and df fields are sometimes included in only 1 of 3 packets, meaning they are sent every 30s instead of 10s, but most of the time they are included in every other packet, so every 20s.</div>
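<div><br></div><div>To make the pattern in a capture easier to see, the arrival times of one value (for example load) can be checked for gaps with a small script. This is only a minimal Python sketch: the function name find_gaps, the sample timestamps, and the default of two intervals before a gap is reported are assumptions chosen for illustration, not something taken from collectd or from my capture.</div>

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], interval: float = 10.0,
              timeout_intervals: int = 2) -> List[Tuple[float, float]]:
    """Return (arrival_time, gap) pairs for every gap between two
    consecutive arrivals that is at least timeout_intervals * interval."""
    limit = interval * timeout_intervals  # assumed reporting limit, in seconds
    ordered = sorted(timestamps)
    gaps = []
    # Compare each arrival with the one just before it
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap >= limit:
            gaps.append((current, gap))
    return gaps


# Hypothetical arrival times in seconds for a single value
arrivals = [0, 10, 20, 40, 50, 80]
print(find_gaps(arrivals))
```

<div>With these made-up timestamps the script reports a 20-second gap ending at second 40 and a 30-second gap ending at second 80, which is the kind of every-other-packet or one-in-three behaviour described above.</div>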
<div><br></div><div>Example of alerts triggered:</div><div><br></div><div>9:44 &lt;server&gt;/load/load has not been updated for 20 seconds.</div><div>9:44 Received a value for &lt;server&gt;/load/load. It was missing for 20 seconds.</div>
<div><br></div><div>10:22 &lt;server&gt;/load/load has not been updated for 20 seconds.</div><div>10:22 Received a value for &lt;server&gt;/load/load. It was missing for 20 seconds.</div>
<div><br></div><div>10:24 &lt;server&gt;/load/load has not been updated for 20 seconds.</div><div>10:24 Received a value for &lt;server&gt;/load/load. It was missing for 20 seconds.</div>
<div><br></div><div>The strange thing is that the ping plugin never raises a "data are missing" alert and only triggers an alert when it should. I really feel lost on that :\</div><div><br></div>
<div>Thanks in advance for any help.</div>
<div><br></div><div><br></div><div>Threshold configuration:</div><div><br></div>
<div>&lt;Threshold&gt;</div>
<div> &lt;Host "xxxxx"&gt;</div>
<div> &lt;Plugin "ping"&gt;</div>
<div> &lt;Type "ping_droprate"&gt;</div>
<div> Datasource "value"</div>
<div> # 5 % ?</div>
<div> FailureMax 0.05</div>
<div> &lt;/Type&gt;</div>
<div> &lt;/Plugin&gt;</div>
<div> &lt;/Host&gt;</div>
<div> </div>
<div> &lt;Plugin "load"&gt;</div>
<div> &lt;Type "load"&gt;</div>
<div> Datasource "midterm"</div>
<div> FailureMax 0.8</div>
<div> &lt;/Type&gt;</div>
<div> &lt;/Plugin&gt;</div>
<div> </div>
<div> &lt;Plugin "df"&gt;</div>
<div> &lt;Type "df"&gt;</div>
<div> Datasource "used"</div>
<div> FailureMax 80</div>
<div> Percentage true</div>
<div> &lt;/Type&gt;</div>
<div> &lt;/Plugin&gt;</div>
<div>&lt;/Threshold&gt;</div>