[collectd] Debugging NaN values being recorded

Mon Oct 22 18:43:56 CEST 2012

  I've got the following collectd arrangement:

     Solaris Zone 1 collectd --.
     Solaris Zone 2 collectd --+--  Linux collectd -> rrd
     Solaris Zone 3 collectd --+
     Solaris Zone 4 collectd --'

So four Solaris zones, which all exist on the same host server, 
reporting (via network plugin) to collectd running on Linux.  It 
actually works very well.

The binaries and configurations for all four zones are identical, except 
for Hostname.  Most of the stats are working fine, *except* for 
"fork_rate" from the processes plugin.

This is where it gets weird.

"fork_rate", because these are zones and not full VMs, is the exact same 
metric across all four.  So it's wasteful for me to be recording it four 
times, but not terribly so - and it helps avoid needing to flip pages 
when viewing the stats.

However, two of the zones are reporting "NaN" for that metric, while the 
other two are happily recording real, useful values.  Keep in mind that 
this is effectively the same number being sent by all four zones.  I 
wouldn't expect it to vary much depending on when each zone's collectd 
gets CPU time, and certainly not this consistently.

What is the best way to find out *why* RRD would reject a value?  
I've checked that the "heartbeat" of each rrd matches the interval, 
and I've tried turning up syslog verbosity, but there's a lot of 
traffic and it's hard to pick things out when I don't know what I'm 
looking for.
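
For comparing the rrd files themselves, something along these lines 
might help.  It's a rough Python sketch, assuming the `rrdtool` 
command-line tool is installed on the Linux side: it parses the text 
printed by `rrdtool info` and lists which settings (step, DS type, 
heartbeat, min, max) differ between a working zone's file and a NaN 
zone's file.  The function names and the file paths in the usage note 
are made up for illustration.  As far as I understand it, RRD stores a 
sample as unknown when updates arrive further apart than the heartbeat, 
or when the value falls outside the DS min/max, which is why those 
fields are the ones compared.

```python
import re
import subprocess

# Matches `rrdtool info` lines such as: ds[value].minimal_heartbeat = 20
DS_LINE = re.compile(r'^ds\[(?P<ds>[^\]]+)\]\.(?P<key>\w+) = (?P<val>.*)$')

# Per-data-source fields worth comparing between two files.
COMPARE_KEYS = ("type", "minimal_heartbeat", "min", "max")


def parse_rrd_info(text):
    """Return (step, {ds_name: {key: raw_string}}) from `rrdtool info` text."""
    step = None
    sources = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("step = "):
            step = int(line.split("=", 1)[1])
            continue
        m = DS_LINE.match(line)
        if m:
            sources.setdefault(m.group("ds"), {})[m.group("key")] = m.group("val").strip()
    return step, sources


def diff_settings(info_a, info_b):
    """Return a list of (setting, value_in_a, value_in_b) that differ."""
    step_a, ds_a = info_a
    step_b, ds_b = info_b
    diffs = []
    if step_a != step_b:
        diffs.append(("step", step_a, step_b))
    for name in sorted(set(ds_a) | set(ds_b)):
        for key in COMPARE_KEYS:
            va = ds_a.get(name, {}).get(key)
            vb = ds_b.get(name, {}).get(key)
            if va != vb:
                diffs.append((f"ds[{name}].{key}", va, vb))
    return diffs


def load_info(path):
    """Run `rrdtool info PATH` and parse its output (needs rrdtool on PATH)."""
    out = subprocess.run(["rrdtool", "info", path],
                         capture_output=True, text=True, check=True).stdout
    return parse_rrd_info(out)
```

Usage would be something like 
`diff_settings(load_info("zone1/fork_rate.rrd"), load_info("zone3/fork_rate.rrd"))` 
with the real paths substituted; an empty list means the two files are 
configured identically and the cause has to be elsewhere.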

Is there a means of detecting rrd rejections?
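
Short of a proper rejection log, one thing that can be scripted is 
finding *when* the unknowns were stored.  The sketch below (Python 
again, helper name invented for illustration) scans the text printed by 
`rrdtool fetch` and returns the timestamps of rows containing NaN.  It 
assumes the usual fetch layout of a header line with the DS names 
followed by `timestamp: value ...` rows, and it doesn't explain the 
cause by itself; it only narrows down which intervals to look for in 
syslog.

```python
import math
import subprocess


def unknown_rows(fetch_text):
    """Return (total_rows, [timestamps whose row contains a NaN])."""
    total = 0
    unknown = []
    for line in fetch_text.splitlines():
        ts, sep, rest = line.partition(":")
        # Skip the DS-name header and blank lines: only data rows
        # start with a numeric timestamp followed by a colon.
        if not sep or not ts.strip().isdigit():
            continue
        total += 1
        values = [float(v) for v in rest.split()]  # float() accepts "nan" and "-nan"
        if any(math.isnan(v) for v in values):
            unknown.append(int(ts))
    return total, unknown


def fetch_last_hour(path):
    """Run `rrdtool fetch PATH AVERAGE --start -1h` (needs rrdtool on PATH)."""
    return subprocess.run(
        ["rrdtool", "fetch", path, "AVERAGE", "--start", "-1h"],
        capture_output=True, text=True, check=True).stdout
```

Running `unknown_rows(fetch_last_hour(path))` against a good zone's 
file and a bad zone's file should show whether the bad ones are 
unknown for every interval or only some of them.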