[collectd] load average peaks periodically on unloaded host

Bruno Prémont bonbons at linux-vserver.org
Thu Mar 31 10:59:38 CEST 2011


Hi,

On Thu, 31 March 2011 Vincent McIntyre <vincent.mcintyre at gmail.com> wrote:
> I am running collectd 4.10 backported to debian lenny (official
> backport), amd64.
> 
> I have a new host that is doing something odd - every 100 minutes or
> so the load average peaks at about 1.0-1.5. It rises over a period of
> a minute or so, bobs around for a minute or two and then declines
> steadily back to 0.0 after about 5 minutes.

This smells like multiple processes becoming runnable at the same time...

> I started turning things off and found that collectd seems to be the
> culprit - the peaks go away entirely if I turn it off. If I turn off
> say half of the plugins, the load peak still occurs but with half the
> amplitude. I have a cron job printing the process table when the peak
> is occurring but nothing obvious shows up; the only process with %CPU
> larger than 0.0 is collectd. Neither does anything in the various
> plots (we use collection3), related to collectd or the other processes
> that are showing any activity (see Processes config below).

I would say this is due to the scheduling of the various threads used by
collectd.
The reported "load" varies across kernel versions: on some kernels you get
those peaks, on others you don't. What kernel are you running?

You could configure collectd to use fewer worker threads in order to avoid
that scheduling artifact.
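For example, in collectd.conf (a sketch; the ReadThreads option sets the
number of read threads, default 5 -- the value 2 here is just a guess, you
may have to experiment):

    # Fewer read threads means fewer plugin reads firing at once.
    ReadThreads 2

With fewer read threads the individual plugin reads get spread out in time
instead of all becoming runnable together at each interval.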

> Has anyone seen this before? Any debugging tips?

Yes, I've even seen machines where there was no CPU activity but the load
average kept climbing steadily (oscillating, but with a rising average).

Other, much more heavily loaded systems kept a small load value.

Apart from using a different kernel version or playing with scheduler
settings, I don't see much you could do... (but remember that load alone is
not a very good indicator; at best it's a hint to look at the other values)

Possibly multiple processes/threads block each other on some mutex in their
syscalls because they all want to do their work at the same time.

Regards,
Bruno

> Cheers
> Vince
> 
> Current collectd.conf:
> 
> FQDNLookup true
> LoadPlugin syslog
> <Plugin syslog>
>         LogLevel info
> </Plugin>
> Include "/etc/collectd/filters.conf"
> Include "/etc/collectd/thresholds.conf"
> LoadPlugin cpu
> LoadPlugin disk
> LoadPlugin interface
> LoadPlugin load
> LoadPlugin memory
> LoadPlugin processes
> LoadPlugin rrdtool
> LoadPlugin network
> <Plugin network>
>   server 1.2.3.4
> </Plugin>
> LoadPlugin ntpd
> <Plugin "ntpd">
>   Host "localhost"
>   Port "123"
>   ReverseLookups false
> </Plugin>
> <Plugin processes>
>        Process "collectd"
>        Process "ntpd"
>        ProcessMatch "atop" "atop"
>        ProcessMatch "exim4" "exim4"
> </Plugin>
