[collectd] collectd 4.10.1 stops writing and high CPU

Wed Jan 16 14:23:31 CET 2013

Hi Jesse,

On Wed, Jan 16, 2013 at 11:03:00PM +1030, Jesse Reynolds wrote:
> So with issue #75, am I correct in thinking that it could explain the
> 75 minute freeze up on RRD updates if a large number of RRDs were
> being created at that time? ... What about the fact that RRDs for the
> collectd server machine itself continued to be updated, does this rule
> out #75 as a possible cause? 

> OK, I've just thought to look at memory usage on the machine. During
> the last occurrence of this problem, used memory started increasing
> linearly until collectd was restarted, which free'd the memory. Eg
> between 02:20 and 03:30 on this graph:
> http://f.cl.ly/items/0z2P0e3C0e1S3G0W3z3t/Screen%20Shot%202013-01-16%20at%2010.58.41%20PM.png

The "network" plugin uses a single thread to dispatch values to the
daemon. If that thread gets stuck somewhere, received values will
accumulate in the "to be dispatched" queue. Since the resident set size
(RSS, memory consumption) of collectd is growing rapidly in this period,
this is likely what is happening.

If the thread is not truly stuck, just delayed a bit (say 100 ms per
value), then only 10 values received from the network can be dispatched
per second. With sufficiently many files, this would look like "nothing
is happening". The 5 read threads (the default) can still handle 500
files during the normal read interval (10 seconds), so locally collected
data looks like "everything is fine". In short, yes, it is possible that
this is related to #75.
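
To make that arithmetic concrete, here is a rough sketch. The function
names are mine and not part of collectd, and the 100 ms delay is only the
example figure from above, not a measurement:

```python
# Back-of-the-envelope throughput under a fixed per-value delay.
# All names here are illustrative; nothing below is collectd code.

def values_per_second(threads: int, delay_ms: int) -> float:
    """Upper bound on values handled per second if each one costs delay_ms."""
    return threads * 1000 / delay_ms


def values_per_interval(threads: int, delay_ms: int, interval_s: int) -> float:
    """Upper bound on values handled within one read interval."""
    return values_per_second(threads, delay_ms) * interval_s


# One network dispatch thread, 100 ms per value (assumed delay).
network_rate = values_per_second(threads=1, delay_ms=100)

# Five read threads (the default), same delay, 10 second interval.
local_per_interval = values_per_interval(threads=5, delay_ms=100, interval_s=10)
```

With these inputs the single dispatch thread manages 10 values per second,
while the five read threads get through 500 values per interval.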

> Yes :-) The device is dm-0. Most of the time it sits around 1,600
> write ops per second. When the problem occurred it dropped down to
> around 15 write ops per second. Disk write time decreased from around
> 1.4 to around 0.2 while the problem was occurring, reflecting a lower
> load on the disks i presume. After restarting collectd both these
> figures went back to normal after a few minutes. 

1600 I/O-ops/s is impressive :)

How do the I/O _bytes_ behave during these times? 15 writes per second
with 70 MByte each sums up to roughly 1 GByte/s being written … ;)

If this happens again, can you record collectd's I/O, especially which
files it opens? Something along these lines should do the trick:

# strace -ttt -e trace=open -o collectd.strace -p $COLLECTD_PID -s 2048
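
Once you have a trace, something like the following could summarize which
files were opened most often. This is only a sketch: it assumes the usual
strace line layout (timestamp, then open("path", flags) = result), and the
helper names are made up for illustration:

```python
import re
from collections import Counter

# Expected line shape (an assumption about the strace output):
#   1358342611.123456 open("/some/file.rrd", O_RDWR) = 7
# A leading PID column is tolerated in case the trace contains one.
OPEN_RE = re.compile(r'^(?:\d+\s+)?(\d+\.\d+)\s+open\("([^"]*)",')


def count_opens(lines):
    """Count how often each path appears in open() calls."""
    counts = Counter()
    for line in lines:
        match = OPEN_RE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def summarize(trace_path, top=20):
    """Print the most frequently opened paths found in a trace file."""
    with open(trace_path) as fh:
        for path, n in count_opens(fh).most_common(top):
            print(n, path)
```

Calling summarize("collectd.strace") would then print the 20 most
frequently opened paths, one per line.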

Best regards,
—octo
-- 
collectd – The system statistics collection daemon
Website: http://collectd.org
Google+: http://collectd.org/+
GitHub:  https://github.com/collectd
Twitter: http://twitter.com/collectd