[collectd] load average peaks periodically on unloaded host

Thu Mar 31 23:58:18 CEST 2011

Hi Bruno,

thanks for your comments.

On 3/31/11, Bruno Prémont <bonbons at linux-vserver.org> wrote:
> Hi,
>
> On Thu, 31 March 2011 Vincent McIntyre <vincent.mcintyre at gmail.com> wrote:

>> I started turning things off and found that collectd seems to be the
>> culprit - the peaks go away entirely if I turn it off. If I turn off
>> say half of the plugins, the load peak still occurs but with half the
>> amplitude. I have a cron job printing the process table when the peak
>> is occurring but nothing obvious shows up; the only process with %CPU
>> larger than 0.0 is collectd. Neither does anything in the various
>> plots (we use collection3), related to collectd or the other processes
>> that are showing any activity (see Processes config below).
>
> I would say this is due to the scheduling of the various threads used by
> collectd.
> The "load" varies across different kernel versions e.g. for some kernels
> you get those peaks, for others you don't. What kernel are you running on?
>

It's the stock Debian Lenny kernel, 2.6.26-2-amd64.

Should I not see thread activity reflected in, say, 'top', or in the
processes plugin plots for collectd? I don't see any peaks there,
certainly nothing with the same pattern in time. I'm not sure whether
the processes plugin tracks thread count or per-thread statistics;
collection3 does not show plots of such quantities, anyway.
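
One way to look at per-thread activity directly, independent of the
plugin, is to sample the scheduler state of each thread from /proc.
The following is only a rough sketch, assuming a Linux /proc layout
with /proc/<pid>/task/<tid>/stat; the function names are made up for
illustration. Threads in state R (runnable) or D (uninterruptible
sleep) are the ones that count towards the load average.

```python
import os
import time
from collections import Counter


def parse_task_stat(line):
    """Return (thread name, state letter) from one task 'stat' line."""
    # The name is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the last ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    name = head.split("(", 1)[1]
    state = tail.split()[0]
    return name, state


def thread_states(pid):
    """Count the threads of `pid` by scheduler state (R, S, D, ...)."""
    counts = Counter()
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                _, state = parse_task_stat(f.read())
        except (IOError, OSError):
            continue  # thread exited between listdir() and open()
        counts[state] += 1
    return counts


def watch(pid, interval=1.0, samples=60):
    """Print a timestamped state histogram every `interval` seconds."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), dict(thread_states(pid)))
        time.sleep(interval)
```

Calling watch() with the collectd PID while a peak is building would
show whether several threads sit in R or D at the same moment. A
one-second sampling interval can easily miss short bursts, so an
empty result would not rule the explanation out.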

> You could reduce collectd to less worker threads in order to not have that
> scheduling artifact.
>

Thanks, I'll give this a try.
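
A minimal sketch of that change, assuming the installed collectd
version supports the global ReadThreads option (the value 2 is an
arbitrary starting point, not a recommendation):

```
# collectd.conf, global section
# Fewer reader threads means fewer tasks can become runnable at once.
ReadThreads 2
```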

>> Has anyone seen this before? Any debugging tips?
>
> Yes, I've even seen machines where there was no CPU activity and load
> average kept steadily climbing (going up and down but average climbing).
>
> Other, much more loaded systems remained with small load value.
>
> Except using a different kernel version or playing with scheduler settings
> I don't see very much you could do... (but remember that load alone is not
> a very good indicator, at best it's a hint to look at the other values)
>
> Possibly multiple processes/threads get blocking each-other in some mutex
> in their syscalls because they want to do their job at same time.
>

OK. I raised this because I've not seen it before, though we have
collectd on quite a few machines, some of which are running the same
hardware & kernel.
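
To illustrate how a scheduling artifact could raise the load without
any visible CPU usage, here is a toy model. It assumes that the
1-minute load average is an exponentially damped average of the
number of runnable tasks, sampled roughly every 5 seconds. The
"daemon" is hypothetical: 5 threads that all become runnable for
50 ms every 10 seconds. None of these numbers are measured on the
affected host, and the real kernel code uses fixed-point arithmetic
and also counts uninterruptible tasks.

```python
import math

SAMPLE_PERIOD = 5.0  # assumed seconds between load samples
DECAY_1MIN = math.exp(-SAMPLE_PERIOD / 60.0)  # 1-minute damping factor


def update_load(load, active):
    """One step of the exponentially damped average."""
    return load * DECAY_1MIN + active * (1.0 - DECAY_1MIN)


def runnable_at(t, threads=5, interval=10.0, burst=0.05):
    """Hypothetical daemon: all threads runnable for `burst` seconds
    at the start of every `interval` seconds, idle otherwise."""
    return threads if (t % interval) < burst else 0


def simulate(duration, offset, **daemon):
    """Return the peak 1-minute load seen over `duration` seconds when
    the first sample is taken `offset` seconds after a daemon wakeup."""
    load = 0.0
    peak = 0.0
    t = offset
    while t < duration:
        load = update_load(load, runnable_at(t, **daemon))
        peak = max(peak, load)
        t += SAMPLE_PERIOD
    return peak


aligned = simulate(3600, offset=0.01)   # samples land inside the burst
misaligned = simulate(3600, offset=2.5)  # samples always miss the burst
print("aligned peak: %.2f, misaligned peak: %.2f" % (aligned, misaligned))
```

In this model each thread is busy only 0.5% of the time, yet the
aligned case settles at a load of about 2.6 while the misaligned case
stays at zero. If the two timers drift slowly against each other, the
load would rise and fall periodically, which is at least consistent
with the pattern described above.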

Thanks again,
Vince