[collectd] collectd 4.10.1 stops writing and high CPU

Jesse Reynolds jesse at bulletproof.net
Wed Jan 16 13:33:00 CET 2013


Hi Yves

Thanks for your reply! 

We are not using rrdcached. I was aware of it, but not in any detail, so thanks for the recommendation. I can see that even just decoupling the RRD file writes into a separate process has big benefits, e.g. being able to restart collectd without triggering a flush of all the RRD files. I'll look at using it on the next build. 
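
For anyone following along, this is roughly what I have in mind for the next build. It's an untested sketch; the socket path, journal directory and flush timings below are just placeholders for our setup:

    # run rrdcached alongside collectd:
    #   -w 1800  write out values older than 30 minutes
    #   -z 900   spread those writes over 15 minutes to avoid bursts
    #   -f 3600  flush files not touched for an hour
    rrdcached -l unix:/var/run/rrdcached.sock \
              -j /var/lib/rrdcached/journal \
              -b /var/lib/collectd/rrd -B \
              -w 1800 -z 900 -f 3600

    # and in collectd.conf, write via the daemon instead of the rrdtool plugin:
    LoadPlugin rrdcached
    <Plugin rrdcached>
      DaemonAddress "unix:/var/run/rrdcached.sock"
      DataDir "/var/lib/collectd/rrd"
      CreateFiles true
    </Plugin>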

So with issue #75, am I correct in thinking that it could explain the 75-minute freeze-up of RRD updates if a large number of RRDs were being created at that time? ... What about the fact that RRDs for the collectd server machine itself continued to be updated; does this rule out #75 as a possible cause? 

Perhaps a new set of RRDs needing to be written could have backed up the receive queue in the network plugin's thread and triggered a separate bug there? 

OK, I've just thought to look at memory usage on the machine. During the last occurrence of this problem, used memory increased linearly until collectd was restarted, which freed the memory, e.g. between 02:20 and 03:30 on this graph: http://f.cl.ly/items/0z2P0e3C0e1S3G0W3z3t/Screen%20Shot%202013-01-16%20at%2010.58.41%20PM.png

I suppose this could represent all the queued stats that were unable to be written to disk accumulating over time.
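
If it happens again before we rebuild, I might also try peering inside collectd through the unixsock plugin, to see whether values are still arriving and how stale they are. A rough sketch; the socket path is just an example and I haven't tried this against 4.10.1:

    # in collectd.conf:
    LoadPlugin unixsock
    <Plugin unixsock>
      SocketFile "/var/run/collectd-unixsock"
    </Plugin>

    # then, while the problem is happening:
    # how many value lists does the daemon currently know about?
    echo "LISTVAL" | socat - UNIX-CONNECT:/var/run/collectd-unixsock | wc -l

    # ask it to flush everything older than 60 seconds and watch whether disk writes resume
    echo "FLUSH timeout=60" | socat - UNIX-CONNECT:/var/run/collectd-unixsock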

And in regards to your other questions:

> About your CPU shooting through the roof, could you check whether it is busy full time or waiting for your disk? (iostat should help.)

Most of the time the filesystem is doing around 1,600 write operations per second. During the 75-minute period today when statistics were not being written, the disks were doing very little and there was no IO wait. The collectd process was using around 107% CPU as reported by top, so perhaps one thread was consuming 100% of one core, with the remaining 7% spread across other threads on other cores. Normally collectd uses only between 3% and 8% CPU. 
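
Next time it happens I'll also try to pin down which thread is spinning, roughly along these lines (this assumes gdb and sysstat are installed on the box and that pidof returns a single collectd PID):

    # per-thread CPU usage of the collectd process
    top -H -p $(pidof collectd)
    # or, with sysstat installed:
    pidstat -t -p $(pidof collectd) 5

    # a backtrace of every thread, to see whether the busy one is in the
    # network plugin, the rrdtool plugin, or somewhere else
    # (attaching gdb briefly pauses the process)
    gdb -p $(pidof collectd) -batch -ex "thread apply all bt"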

> 
> Are you sure your disk is not 99% full? (Performance is lower when a disk is nearly full.)

Yes. It's at 92% currently. 

> Are you sure your disk is not broken?

No :-) But I would be surprised. There are no IO errors logged at the OS level. ... It is a filesystem mounted from a SAN. The collectd server is running in an Ubuntu VM on ESXi, and the filesystem is a mapper device over four underlying disk image files on VMFS. (Yes, this is somewhat convoluted! Future stats servers will have a lot of local spindles and be physical machines.)
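
For completeness, "no IO errors" just means nothing turned up in the kernel log. I checked along these lines, on the assumption that a flaky path to the SAN would show up there:

    # anything nasty in the kernel ring buffer or syslog?
    dmesg | grep -iE 'i/o error|ext4-fs error|blocked for more than'
    grep -i 'i/o error' /var/log/kern.log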

> 
> With iostat, if you have a FS dedicated to the RRD files, have you checked that it is that FS, and not another one, that is performing slowly?

Yes :-) The device is dm-0. Most of the time it sits at around 1,600 write ops per second. When the problem occurred it dropped to around 15 write ops per second. Disk write time decreased from around 1.4 to around 0.2 while the problem was occurring, reflecting a lower load on the disks, I presume. After restarting collectd, both of these figures went back to normal within a few minutes. 
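
Those figures come from our monitoring graphs; next time I'll also watch the device live with something like the following (assuming iostat on this box accepts the dm-0 name as-is):

    # extended per-device stats for the RRD volume, every 5 seconds
    iostat -xk dm-0 5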

We don't use a separate filesystem for the RRDs; they just live on the root filesystem. collectd is all this box does. 

Cheers
Jesse

On 16/01/2013, at 5:45 PM, Yves Mettier <ymettier at free.fr> wrote:

> Hello,
> 
> Issue #75 is the first thing I'm thinking about.
> 
> Are you using rrdcached?
> If not, you should (though with so many RRDs, I'm sure you are).
> 
> If yes, try configuring your collectd to *not* create RRDs for some hours (maybe one or two days).
> If this is better for you, you are probably experiencing issue #75.
> 
> Have a look at https://github.com/collectd/collectd/issues/75.
> As far as I know, there is no "good" solution. Only some tips and tricks.
> 
> 
> If it is not issue #75, here are some ideas...
> 
> About your CPU shooting through the roof, could you check whether it is busy full time or waiting for your disk? (iostat should help.)
> 
> Are you sure your disk is not 99% full? (Performance is lower when a disk is nearly full.)
> Are you sure your disk is not broken?
> 
> With iostat, if you have a FS dedicated to the RRD files, have you checked that it is that FS, and not another one, that is performing slowly?
> 
> Note: I'm using 5.2, so I will not be able to help you much further than this.
> 
> Regards,
> Yves
> 
> On 2013-01-16 07:32, Jesse Reynolds wrote:
>> Hello
>> 
>> We have a collectd server that is writing to about 24,000 RRD files,
>> most of which are 15 MB each (with some at 30 MB and some at 45 MB),
>> about 480 GB of RRD files in all.
>> 
>> On occasion we are seeing disk writes drop right down to a trickle,
>> and at the same time collectd's CPU shooting through the roof. Once
>> collectd goes into this state it can be like this for hours, and the
>> RRD files are mostly not being updated in this time. The only way to
>> get things going again is to 'kill -9 <collectd's pids>' and start
>> collectd again.
>> 
>> The RRD files for data originating within this instance of collectd
>> (and not coming via the network plugin) are not interrupted, so it is
>> something to do with the network plugin, it seems.
>> 
>> Has anyone got any advice on how we might chase this problem down
>> further? We are on Ubuntu 12.04.
>> 
>> Is it possible to peer into collectd to see if it's a problem with
>> the network plugin, or the rrd plugin, or something else?
>> 
>> Thank you
>> Jesse
>> 
>> 
> 
> -- 
> - Homepage       - http://ymettier.free.fr                             -
> - GPG key        - http://ymettier.free.fr/gpg.txt                     -
> - C en action    - http://ymettier.free.fr/livres/C_en_action_ed2.html -
> - Guide Survie C - http://www.pearson.fr/livre/?GCOI=27440100673730    -
> 



