[collectd] write_graphite failures causing cached values to drop

Erik Stambaugh estambaugh at red5studios.com
Wed Jan 30 03:43:59 CET 2013

Hi everyone.

I'm running collectd 5.1.0 in a client/server setup: a central monitoring server receives values through the network plugin, writes them out via write_graphite, and also exposes the unixsock plugin for Nagios.  Everything works fine until the Graphite server gets overloaded and becomes unresponsive.  Several seconds after that, the collectd server starts dropping data points from the cache, causing Nagios to emit a ton of spurious pages.
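
For anyone trying to reproduce this, one way to check whether a value has really left the cache is to ask the unixsock plugin directly with a GETVAL command. The following is only a rough sketch: the socket path and the example identifier are assumptions, not taken from the setup described here, and the helper names are made up for illustration.

```python
import socket


def parse_status(line):
    """Split a status line such as '1 Value found' into (count, message).

    In collectd's plain text protocol a negative count signals an error.
    """
    code, _sep, message = line.strip().partition(" ")
    return int(code), message


def getval(identifier, path="/var/run/collectd-unixsock"):
    """Ask the unixsock plugin for the cached values of one identifier.

    The default path is an assumption; use the SocketFile value from the
    unixsock plugin configuration. Returns (values, message), where values
    is None if collectd reports an error such as a missing value.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        stream = sock.makefile("rw")
        stream.write('GETVAL "%s"\n' % identifier)
        stream.flush()
        count, message = parse_status(stream.readline())
        if count < 0:
            return None, message
        values = {}
        for _ in range(count):
            name, _sep, raw = stream.readline().strip().partition("=")
            values[name] = float(raw)
        return values, message
```

Calling something like `getval("somehost/load/load")` (a hypothetical identifier) before and after stopping carbon should show whether the value is still in the cache.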

Here's a sample of the log output from when I forced it to fail by stopping the carbon server:
   collectd[11038]: write_graphite plugin: send failed with status -1 (Broken pipe)
   collectd[11038]: write_graphite plugin: error with wg_send_message
   collectd[11038]: write_graphite plugin: Connecting to graphite.xxxxx.xxx:2003 failed. The last error was: Connection refused

My admittedly weak understanding is that the cache insert happens before the write plugins run (based on https://collectd.org/wiki/index.php/Chains#Pre-_and_post-cache_chains), so failing to write shouldn't stop values from being stored in the cache.  I've tried a number of tricks to get it to keep the values, such as switching back to the old Python plugin or writing a "null" plugin that always returns successfully and runs alongside write_graphite.  I'm starting to go down the road of terrible hacks to work around this, and there's probably something fundamental I'm getting wrong.
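
To make the "null" plugin idea concrete, here is a minimal sketch of what such a write plugin could look like when written for collectd's python plugin. It is not the plugin actually used in the experiment above; the function name is made up, and the guarded import is only there so the file can also be loaded outside collectd.

```python
try:
    # The collectd module only exists inside collectd's embedded interpreter.
    import collectd
except ImportError:
    collectd = None


def null_write(values, data=None):
    """Accept every value list and discard it.

    Returning normally (without raising) is what tells collectd that the
    write succeeded.
    """
    return None


if collectd is not None:
    # Register the callback so collectd dispatches value lists to it.
    collectd.register_write(null_write)
```

A module like this would be loaded through the python plugin's `Import` option alongside write_graphite.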

My entire collectd.conf contains:

    Hostname "monitoring.xxxxx.xxx"
    FQDNLookup true
    BaseDir "/var/lib/collectd"
    PluginDir "/usr/lib/collectd"
    TypesDB "/usr/share/collectd/types.db", "/usr/share/collectd/firefall.types.db"
    Interval 10
    ReadThreads 5

    Include "/etc/collectd/plugins/*.conf"
    Include "/etc/collectd/thresholds.conf"

The config for write_graphite has:

    LoadPlugin write_graphite

    <Plugin "write_graphite">
        Host "graphite.xxxxx.xxx"
        Port 2003
        Storerates true
    </Plugin>

There are other config files for various read plugins, but I doubt they're relevant.  I haven't performed an upgrade to a more recent version yet, mostly since nothing related to this seemed to be mentioned in the changelogs.  I was hoping that this sort of behavior might be something that's been seen before, and there might be a known solution to it.

Any ideas?  I'm happy to supply more information if needed.