[collectd] [rrd-developers] rrdcached + collectd issues

Wed Oct 14 15:30:55 CEST 2009

Hi Thorsten,

On Tue, Oct 13, 2009 at 03:52:15AM -0700, Thorsten von Eicken wrote:
> I just restarted everything afresh to get a clean set of data. It's 
> already not looking pretty. Here's the set-up:
> - /usr/bin/rrdcached -w 3600 -z 3600 -f 7200 -t 2 -b /rrds -B -j 
> /rrds/journal -p /var/run/rrdcached/rrdcached.pid -l 127.0.0.1:3033
> - ~55k tree nodes, collected every 20 seconds
> - see the rrdcached-1*.png in http://www.voneicken.com/dl/rrd/

Thanks for the interesting graphs.

> What I see:
> - the system ran with half the load for 5 minutes at start-up before I 
> added the "second half"
> - the input is constant (see network rx pkts in last graph in 1c.png)

Yep.

> - rrdcached has ok cpu load for the first 15 minutes, then it really 
>   ramps up to using over half a cpu
> - collectd keeps and keeps growing after the first 15 minutes, it's 
>   clear that the degradation in "receive update" is due to rrdcached
>   and collectd has to start buffering (note how the first 15 minutes
>   were nice and flat)

Yes, this is the most interesting part. Apparently the high CPU usage is
related to the decreased rate at which values are sent from collectd to
RRDCacheD. (Not a surprise, but nice to see evidence nonetheless.)

> - the connection thread seems to be affected because the "receive
>   update" and "journal bytes" rates start to degrade

The question is, in my opinion, what is using so much CPU and what it is
being used for. I think there are two possible answers:

  - The connection thread itself is using a lot of CPU for *something*.

  - Something else is blocking the connection thread. The most likely
    shared resource is the cache_lock.
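
To illustrate the second possibility, here is a minimal sketch of how one
could measure the time a thread spends waiting for a shared lock. It is
plain Python, not RRDCacheD code: `cache_lock` is only a stand-in for the
real mutex, and `hold_lock`, `timed_acquire` and `measure_wait` are
hypothetical helper names.

```python
import threading
import time

# Stand-in for RRDCacheD's cache_lock (assumption: a plain mutex).
cache_lock = threading.Lock()


def timed_acquire(lock):
    """Acquire the lock and return how long the caller had to wait (seconds)."""
    start = time.monotonic()
    lock.acquire()
    return time.monotonic() - start


def hold_lock(lock, seconds, ready):
    """Hypothetical worker that keeps the lock busy for a while."""
    with lock:
        ready.set()          # tell the caller that the lock is now taken
        time.sleep(seconds)  # simulate long work done under the lock


def measure_wait(hold_seconds=0.2):
    """Return the time a second thread is blocked while the lock is held."""
    ready = threading.Event()
    worker = threading.Thread(target=hold_lock,
                              args=(cache_lock, hold_seconds, ready))
    worker.start()
    ready.wait()                         # the worker owns the lock from here on
    waited = timed_acquire(cache_lock)   # this plays the "connection thread"
    cache_lock.release()
    worker.join()
    return waited


if __name__ == "__main__":
    print("blocked for %.3f s" % measure_wait())
```

If the wait measured this way grows along with the CPU usage, that would
point to lock contention rather than to the connection thread itself.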

The queue_thread_main appears to spend its time waiting for queue_cond
to be signaled, with four updates per second on average. I'd guess this
thread is innocent.

Maybe an interesting observation is that the CPU utilization is growing
for about 3–4 minutes. So it looks like some resource is being used up
gradually.

The pagefaults are probably just RRDtool updating the files: RRDtool
uses memory mapped IO by default – I wouldn't be surprised if that
showed up as pagefaults. The end of memory growth at 3:40 / 1 GB is
exactly as I would have expected.
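
A small experiment along these lines, as a sketch only: it maps a
temporary file into memory, writes one byte per page and reports how many
page faults the process accumulated meanwhile. The helper names are
hypothetical, it needs a Unix-like system, and the exact count depends on
the kernel and the file system, so I make no claim about the numbers.

```python
import mmap
import resource
import tempfile


def page_faults():
    """Minor plus major page faults of this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt + usage.ru_majflt


def touch_mapped_file(pages=256):
    """Write one byte to every page of a memory mapped temporary file.

    Returns the number of page faults counted while writing.
    """
    size = pages * mmap.PAGESIZE
    with tempfile.TemporaryFile() as f:
        f.truncate(size)                      # file of the desired size
        with mmap.mmap(f.fileno(), size) as m:
            before = page_faults()
            for offset in range(0, size, mmap.PAGESIZE):
                m[offset] = 1                 # first touch of this page
            after = page_faults()
    return after - before


if __name__ == "__main__":
    print("page faults while writing:", touch_mapped_file())
```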

> - note that so far we haven't hit the end of the first hour, so no
>   flushes to disk yet

After that hour, the queue thread starts writing values in the same
two-stage manner (i.e. with a five minute delay). This decreases the
rate at which values are received at first, but it looks like that value
is recovering.

> Without being able to run any decent profiler I'm a bit stumped.

The search engine of my choice came up with this article [0], so maybe
setitimer(2) can be utilized. Maybe oprofile [1] could be interesting,
too. Unfortunately I haven't used either method myself yet.
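
To show the idea behind the setitimer(2) approach, here is a rough sketch
of a sampling profiler. It uses Python's `signal.setitimer` wrapper rather
than C, so it only demonstrates the mechanism and cannot be applied to
RRDCacheD as is. `busy` and `profile` are hypothetical names, and it has to
run in the main thread of a Unix process.

```python
import collections
import signal
import time

# Maps function name -> number of timer ticks that landed in that function.
samples = collections.Counter()


def on_sigprof(signum, frame):
    """SIGPROF handler: record which function was running at this tick."""
    if frame is not None:
        samples[frame.f_code.co_name] += 1


def busy(seconds):
    """Hypothetical hot spot: burn roughly `seconds` of CPU time."""
    end = time.process_time() + seconds
    count = 0
    while time.process_time() < end:
        count += 1
    return count


def profile(func, *args, interval=0.01):
    """Run func(*args) while sampling every `interval` seconds of CPU time."""
    samples.clear()
    old_handler = signal.signal(signal.SIGPROF, on_sigprof)
    # ITIMER_PROF counts CPU time used by the process and sends SIGPROF.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func(*args)
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)   # stop the timer
        signal.signal(signal.SIGPROF, old_handler)
    return dict(samples)


if __name__ == "__main__":
    for name, ticks in profile(busy, 0.5).items():
        print(name, ticks)
```

The functions with the most ticks are where the CPU time goes. A C
version would do the same with setitimer(2) and a SIGPROF handler that
records the interrupted program counter.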

Regards,
-octo

[0] <http://lkml.indiana.edu/hypermail/linux/kernel/0101.3/1516.html>
[1] <http://oprofile.sourceforge.net/about/>
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/