[collectd] load average peaks periodically on unloaded host

Fri Apr 1 00:32:16 CEST 2011

Hi Vincent,

On Fri, 01 April 2011 Vincent McIntyre <vincent.mcintyre at gmail.com> wrote:
> On 3/31/11, Bruno Prémont <bonbons at linux-vserver.org> wrote:
> > On Thu, 31 March 2011 Vincent McIntyre <vincent.mcintyre at gmail.com> wrote:
> 
> >> I started turning things off and found that collectd seems to be the
> >> culprit - the peaks go away entirely if I turn it off. If I turn off
> >> say half of the plugins, the load peak still occurs but with half the
> >> amplitude. I have a cron job printing the process table when the peak
> >> is occurring but nothing obvious shows up; the only process with %CPU
> >> larger than 0.0 is collectd. Nothing shows up in the various plots either
> >> (we use collection3), whether for collectd or for the other processes
> >> that show any activity (see Processes config below).
> >
> > I would say this is due to the scheduling of the various threads used by
> > collectd.
> > The "load" varies across different kernel versions e.g. for some kernels
> > you get those peaks, for others you don't. What kernel are you running on?
> >
> 
> It's the stock Debian Lenny  kernel, 2.6.26-2-amd64.

You might want to try a different kernel (even just a newer maintenance
release). Your kernel looks rather old ;)

> Should I not see thread activity reflected in, say, 'top', or in the
> process plugin plots for collectd? I don't see any peaks there really,
> certainly nothing with the same pattern in time. I'm not sure if the
> processes plugin tracks thread count or per-thread statistics; collection3
> does not show plots of quantities like this, anyway.
> 
> > You could configure collectd to use fewer worker threads to avoid that
> > scheduling artifact.
> >
> 
> Thanks, I'll give this a try.

Note that this may cause collectd to miss some data (e.g. if one or more of
the plugins take a long time to fetch their data).

> >> Has anyone seen this before? Any debugging tips?
> >
> > Yes, I've even seen machines where there was no CPU activity and load
> > average kept steadily climbing (going up and down but average climbing).
> >
> > Other, much more loaded systems kept a small load value.
> >
> > Apart from using a different kernel version or playing with scheduler
> > settings I don't see very much you could do... (but remember that load
> > alone is not a very good indicator, at best it's a hint to look at the
> > other values)
> >
> > Possibly multiple processes/threads end up blocking each other on some
> > mutex in their syscalls because they want to do their job at the same time.
> >
> 
> Ok. I raised this because I've not seen it before, though we have collectd
> on quite a few machines, some of which are running the same hardware & kernel.

These are all guesses/interpretations on my side based on what I saw - I
have no hard numbers to prove it.

Something on that machine probably causes multiple threads to compete for
resources at the same time (might even be collectd versus some other process).
Possibly just unlucky wake-up times.
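
To illustrate how wake-up timing alone could raise the load figure, here is a minimal sketch of a simplified load-average model. It assumes the kernel samples the number of runnable tasks every 5 seconds and feeds it into an exponentially damped 1-minute average; the real kernel uses fixed-point arithmetic and also counts uninterruptible tasks. The thread count, wake-up interval and busy time are made-up numbers, not measurements from this system.

```python
import math

SAMPLE_PERIOD = 5.0                            # seconds between load samples (assumed)
DECAY_1MIN = math.exp(-SAMPLE_PERIOD / 60.0)   # damping factor of the 1-minute average


def simulate_load(n_threads, interval, busy, offset, duration=600.0):
    """Return the modelled 1-minute load after `duration` seconds.

    All `n_threads` hypothetical threads wake at offset + k * interval and
    stay runnable for `busy` seconds, then sleep until the next wake-up.
    """
    load = 0.0
    steps = int(duration / SAMPLE_PERIOD)
    for i in range(1, steps + 1):
        t = i * SAMPLE_PERIOD
        # Time elapsed since the most recent wake-up of the threads.
        phase = (t - offset) % interval
        runnable = n_threads if phase < busy else 0
        load = load * DECAY_1MIN + runnable * (1.0 - DECAY_1MIN)
    return load


if __name__ == "__main__":
    # Five threads, each busy 10 ms out of every 10 s (about 0.1% CPU).
    print("aligned with sampling:", simulate_load(5, 10.0, 0.01, 0.0))
    print("shifted by 2.5 s:     ", simulate_load(5, 10.0, 0.01, 2.5))
```

In this model the CPU usage is the same in both runs; only the phase of the wake-ups relative to the sampling instants differs. The aligned case ends with a load above 2, the shifted case with a load of 0. If the two timers drift slowly against each other, one would expect the load to wander between those extremes, which may look like periodic peaks.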

I don't think there is much you can do about it...
The possible options I see:
 - try a different kernel whose scheduler/timers behave slightly
   differently
 - remove/disable power-saving features (like dynticks)
   (on the kernel side, but for some hardware also in the BIOS!)

Note that your other systems with the same HW/kernel might still have a
different runtime pattern (e.g. when the CPU has to work to service interrupts
or processes), which prevents recurring tasks from being aligned by idle
latencies. They are probably a bit less idle/bored :)

As long as the processes, cpu and disk plugins don't show unusual
activity/state, I would not care too much about load. At best, consider it a
hint to look at the various resources (indirectly) covered by load: load is
some magic value derived from the number of "runnable" processes, where 'D'
state (uninterruptible sleep) also counts as runnable.
Top can show threads (H), though it doesn't by default; ps can show them as
well. To look at the threads manually, see /proc/$(pidof collectd)/task/*.
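
A minimal sketch of such a manual check, assuming a Linux /proc where each thread has a directory under /proc/<pid>/task/ containing a `stat` file. The helper names are made up for illustration; the script just reads the state letter of every thread and counts those in 'R' or 'D' state, i.e. the ones that contribute to load.

```python
import os
import sys


def parse_stat_line(line):
    """Split one /proc/<pid>/task/<tid>/stat line into (tid, comm, state)."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' rather than on whitespace.
    left, _, right = line.rpartition(")")
    tid_str, _, comm = left.partition("(")
    return int(tid_str), comm, right.split()[0]


def thread_states(pid):
    """Return a list of (tid, comm, state) for every thread of `pid`."""
    task_dir = f"/proc/{pid}/task"
    states = []
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                states.append(parse_stat_line(fh.read()))
        except OSError:
            continue  # the thread exited between listdir() and open()
    return states


def count_load_relevant(states):
    """Count threads that are running/runnable ('R') or uninterruptible ('D')."""
    return sum(1 for _, _, state in states if state in ("R", "D"))


if __name__ == "__main__":
    # Usage: python thread_states.py <pid>
    found = thread_states(int(sys.argv[1]))
    for tid, comm, state in sorted(found):
        print(tid, comm, state)
    print("threads counting towards load:", count_load_relevant(found))
```

A single snapshot will usually show all threads sleeping; to catch the peak it would have to be run repeatedly, for example from the same cron job that prints the process table.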

Regards,
Bruno