[collectd] collectd 4.10.1 stops writing and high CPU

Thu Jan 17 03:26:24 CET 2013

Hi Florian

On 16/01/2013, at 11:53 PM, Florian Forster <octo at collectd.org> wrote:

> the "network" plugin is using one thread to dispatch values to the
> daemon. If that thread is getting stuck somewhere, received values will
> accumulate in the "to be dispatched" queue. Since the resident set size
> (RSS, memory consumption) of collectd is growing rapidly in this period,
> this is likely happening.

OK

> 
> If the thread is not truly stuck, just delayed a bit (say 100 ms), then
> only 10 values received from the network can be dispatched per second.
> This would seem like "nothing is happening" for sufficiently many files.
> The 5 read threads (the default) can still handle 500 files during the
> normal read interval (10 seconds), seeming like "everything is fine". In
> short, yes, it is possible that this is related to #75.

OK, that makes sense. 
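
The arithmetic in the quoted paragraph can be checked with a short sketch. The 100 ms delay is the hypothetical figure from the quote; treating each file read as also taking 100 ms is an assumption made here so that the read-thread figure works out, and the function names are illustrative only.

```python
# Back-of-the-envelope check of the figures in the quoted paragraph.
# Delays are given in whole milliseconds to keep the arithmetic exact.

def dispatch_per_second(delay_ms):
    """Values a single dispatch thread can hand over per second
    if each dispatch is delayed by `delay_ms`."""
    return 1000 / delay_ms

def reads_per_interval(threads, interval_s, delay_ms):
    """Files that `threads` read threads can handle in one read interval
    if each read takes `delay_ms` (an assumption, not a measured value)."""
    return threads * interval_s * 1000 / delay_ms

if __name__ == "__main__":
    # One dispatch thread, 100 ms per value
    print(dispatch_per_second(100))
    # Five read threads (the default), 10 second interval, 100 ms per read
    print(reads_per_interval(5, 10, 100))
```

With a single dispatch thread the same delay caps throughput at a handful of values per second, while the five read threads still get through hundreds of files per interval, which would match the symptoms described.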

> 
>> Yes :-) The device is dm-0. Most of the time it sits around 1,600
>> write ops per second. When the problem occurred it dropped down to
>> around 15 write ops per second. Disk write time decreased from around
>> 1.4 to around 0.2 while the problem was occurring, reflecting a lower
>> load on the disks, I presume. After restarting collectd both these
>> figures went back to normal after a few minutes. 
> 
> 1600 I/O-ops/s is impressive :)

This is another thing that we see happening sometimes: 

http://f.cl.ly/items/0N151O3F0k3y3s101n0U/Screen%20Shot%202013-01-17%20at%2011.28.40%20AM.png

This is peaking at 5,200 writes per second after 'the event'. The event here was a new server being added, which caused collectd to create 182 new RRD files claiming 3.2 GB of disk space. Perhaps this triggered issue #75: the writes are held back in the network plugin for a while, then the floodgates open and the write rate spikes for a time. Does that sound plausible to you?

> 
> How does I/O _bytes_ behave during these times? 15 writes per second
> with 70 MByte each sums up to roughly 1 GByte/s being written … ;)

The I/O bytes remain proportionate to the I/O ops: usually around 6 MB/s, dropping to around 60 kB/s during the problem, then back to 6 MB/s. So yeah, no real change discernible in the average transaction size.
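
As a rough check of the "no real change in transaction size" statement, the average bytes per write can be computed from the approximate figures in this thread. This is only a sketch: decimal units (1 MB = 1,000,000 bytes) are assumed, and the function name is illustrative.

```python
# Average size of one write, from throughput and operation rate.
# The inputs are the approximate figures quoted in this thread;
# decimal units (1 kB = 1000 bytes) are an assumption.

def avg_bytes_per_op(bytes_per_second, ops_per_second):
    return bytes_per_second / ops_per_second

if __name__ == "__main__":
    normal = avg_bytes_per_op(6_000_000, 1600)   # ~6 MB/s at ~1,600 ops/s
    degraded = avg_bytes_per_op(60_000, 15)      # ~60 kB/s at ~15 ops/s
    print(f"normal:   {normal:.0f} bytes per write")
    print(f"degraded: {degraded:.0f} bytes per write")
```

Both cases come out at roughly 4 kB per write, so the drop is in the number of writes, not in their size.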

> 
> If this happens again, can you record collectd's I/O, especially which
> files it opens? Something along these lines should do the trick:
> 
> # strace -ttt -e trace=open -o collectd.strace -p $COLLECTD_PID -s 2048

Will do! Thanks. 
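
Once a trace exists, something like the following could summarise which files were opened and how often. It is a minimal sketch that assumes the usual `strace -ttt` line layout (an epoch timestamp followed by the `open(...)` call and its return value); the regular expression and function names are illustrative and may need adjusting to the actual output.

```python
import re
import sys
from collections import Counter

# Assumed line layout, e.g.:
# 1358389584.123456 open("/some/path.rrd", O_RDWR) = 7
OPEN_RE = re.compile(
    r'^(?P<ts>\d+\.\d+)\s+open\("(?P<path>(?:[^"\\]|\\.)*)",'
    r'\s*(?P<flags>[^)]*)\)\s*=\s*(?P<ret>-?\d+)'
)

def parse_open_calls(lines):
    """Yield (timestamp, path, flags, return value) for each open() line."""
    for line in lines:
        m = OPEN_RE.match(line)
        if m:
            yield float(m["ts"]), m["path"], m["flags"], int(m["ret"])

def summarize(lines):
    """Count open() calls per path and per whole second."""
    per_path = Counter()
    per_second = Counter()
    for ts, path, _flags, _ret in parse_open_calls(lines):
        per_path[path] += 1
        per_second[int(ts)] += 1
    return per_path, per_second

if __name__ == "__main__":
    # Usage: python summarize_opens.py collectd.strace
    with open(sys.argv[1]) as trace:
        per_path, per_second = summarize(trace)
    for path, count in per_path.most_common(20):
        print(f"{count:6d}  {path}")
    if per_second:
        print("peak opens in one second:", max(per_second.values()))
```

Lines that do not match the pattern are skipped silently, so a count of zero more likely means the format assumption is wrong than that no files were opened.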

Cheers
Jesse