<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">Hi everyone.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I’m running collectd 5.1.0 in a client/server setup. The central monitoring server receives values through the network plugin, writes them out via write_graphite, and also exposes them to Nagios through the unixsock plugin. Everything works fine until the graphite server
gets overloaded and becomes unresponsive. Several seconds after that, the collectd server starts dropping data points from the cache, causing Nagios to emit a ton of spurious pages.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Here’s a sample of the log output from when I forced it to fail by stopping the carbon server:<o:p></o:p></p>
<p class="MsoNormal"> collectd[11038]: write_graphite plugin: send failed with status -1 (Broken pipe)<o:p></o:p></p>
<p class="MsoNormal"> collectd[11038]: write_graphite plugin: error with wg_send_message<o:p></o:p></p>
<p class="MsoNormal"> collectd[11038]: write_graphite plugin: Connecting to graphite.xxxxx.xxx:2003 failed. The last error was: Connection refused<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">My admittedly weak understanding is that the cache insert happens before the write plugins run (based on https://collectd.org/wiki/index.php/Chains#Pre-_and_post-cache_chains), so a failed write shouldn’t stop values from being stored in
the cache. I’ve tried a number of tricks to make it keep the values, like switching back to the old python plugin or writing a “null” plugin that always returns successfully and runs alongside write_graphite. I’m starting to go down the road of
terrible hacks to work around this, and there’s probably something fundamental I’m getting wrong.<o:p></o:p></p>
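<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">To make that assumption concrete: as I read the wiki page, values go into the cache first and are only then handed to the post-cache chain, whose default behavior should be roughly what the explicit chain below does (send everything to all loaded write plugins). This is only a sketch of my understanding rather than a tested config, and the chain name &#8220;PostCache&#8221; is an arbitrary label:<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">&nbsp;# Sketch only (assumption): spells out what I believe the default already does<o:p></o:p></p>
<p class="MsoNormal">&nbsp;PostCacheChain "PostCache"<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&lt;Chain "PostCache"&gt;<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;# Runs after the cache update; no Plugin option means all write plugins<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;Target "write"<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&lt;/Chain&gt;<o:p></o:p></p>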
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">My entire collectd.conf contains:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"> Hostname "monitoring.xxxxx.xxx"<o:p></o:p></p>
<p class="MsoNormal"> FQDNLookup true<o:p></o:p></p>
<p class="MsoNormal"> BaseDir "/var/lib/collectd"<o:p></o:p></p>
<p class="MsoNormal"> PluginDir "/usr/lib/collectd"<o:p></o:p></p>
<p class="MsoNormal"> TypesDB "/usr/share/collectd/types.db", "/usr/share/collectd/firefall.types.db"<o:p></o:p></p>
<p class="MsoNormal"> Interval 10<o:p></o:p></p>
<p class="MsoNormal"> ReadThreads 5<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"> Include "/etc/collectd/plugins/*.conf"<o:p></o:p></p>
<p class="MsoNormal"> Include "/etc/collectd/thresholds.conf"<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The config for write_graphite has:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"> LoadPlugin write_graphite<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">&nbsp;&lt;Plugin "write_graphite"&gt;<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&lt;Carbon&gt;<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Host "graphite.xxxxx.xxx"<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Port 2003<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Storerates true<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&nbsp;&nbsp;&lt;/Carbon&gt;<o:p></o:p></p>
<p class="MsoNormal">&nbsp;&lt;/Plugin&gt;<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">There are other config files for various read plugins, but I doubt they’re relevant. I haven’t upgraded to a more recent version yet, mostly because nothing in the changelogs seemed related to this. I’m hoping
this behavior has been seen before and that there’s a known solution.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Any ideas? I’m happy to supply more information if needed.<o:p></o:p></p>
</div>
</body>
</html>