[collectd] [rrd-developers] rrdcached + collectd issues

Tue Oct 13 10:13:08 CEST 2009

Hi Thorsten,

I'm having a bit of a hard time replying to this message because it (and
the previous one) were sent as HTML-only. Could you maybe switch to
multipart or plain text messages? Thanks :)

On Sun, Oct 11, 2009 at 11:26:55PM -0700, Thorsten von Eicken wrote:
> (It's not yet possible to “watch” the length of this queue
> directly. I'll add some measurements to the Network plugin so we can
> see what's going on eventually …)

I have implemented this in the master branch of collectd. It should be
possible to back-port this change to 4.8 without much trouble.

> […] Then I threw yet another 30'000 tree nodes and corresponding
> updates at it. At that point, collectd started immediately to grow
> again linearly to over 600MB.

You're right, this *is* unexpected. Are you sure the backlog had been
worked off, or was the entire system just settling at a stable rate with
~250 MBytes worth of data in the queue?

> It's as if the previous 250MB of buffers hadn't been freed (in the
> malloc sense, I understand that the process size isn't going to
> shrink). Could it be that there is a bug?

We're talking about the resident set size (RSS) here, right? Because
*that* ought to decrease.

> - if rrdcached is restarted, collectd doesn't reconnect.

The collectd plugin calls “rrdc_connect” before each update. The
semantics of that function are to check whether a valid connection to
the daemon exists and to reconnect if necessary. If anything goes wrong
while sending / receiving data, other functions will simply close /
invalidate the connection, and it is supposed to be reopened in the next
iteration.

If the connection is not reestablished, my guess is that the socket
descriptor is not properly invalidated. I'll have to look further into
this though.
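
To illustrate the pattern described above, here is a minimal sketch in
Python (not the C code of the plugin or of the RRDtool client library).
The names `CachedClient`, `_ensure_connected`, `_invalidate` and `update`
are hypothetical; only the idea is taken from the description: check the
connection before each update, reconnect if necessary, and invalidate the
descriptor on any I/O error.

```python
class CachedClient:
    """Hypothetical client illustrating connect-before-each-update."""

    def __init__(self, connect):
        # `connect` is any callable returning a connected socket-like object.
        self._connect = connect
        self._sock = None  # None means "no valid connection"

    def _ensure_connected(self):
        # Reconnect only if there is no valid connection right now.
        if self._sock is None:
            self._sock = self._connect()

    def _invalidate(self):
        # Close and forget the socket so the next update reconnects.
        if self._sock is not None:
            try:
                self._sock.close()
            except OSError:
                pass
            self._sock = None

    def update(self, line):
        try:
            self._ensure_connected()
            self._sock.sendall(line.encode() + b"\n")
            return True
        except OSError:
            # Without this call the stale descriptor would be reused forever.
            self._invalidate()
            return False
```

In this sketch, leaving out the `_invalidate()` call in the error path
would produce exactly the symptom reported: every later update keeps
using the dead socket and no reconnect is ever attempted.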

> I know this is the case for TCP sockets but I'm pretty sure I observed
> it using the unix socket too.

It's the same logic. I'd be surprised if the behavior differed.

> - the -z parameter is nice, but not quite there yet.

Yeah, I think dynamically calculating an update rate is a better
solution for long caching intervals.

> I'm running with -w 3600 -z 3600 and the situation after the first
> hour is not pretty with a ton of flushes followed by a lull and a
> repeat after another hour.

That's unexpected (at least for me). With those settings I would have
expected the first hour to be memory only (i.e. no disk activity at all)
and after that basically uniformly distributed writes for an hour. Two
hours after start I'd expect a drop in writes which increases for an
hour and has its peak at three hours after start.

I've written a small script to mimic the behavior of RRDCacheD's “-z”
option. You can find the corresponding graph at [0]. The graph shows the
time in seconds on the x-axis and the updates per second on the y-axis.
The graph assumes that one million files should be written to every
3600+[0..3600) seconds. With fewer files the graph doesn't look as pretty
but the same overall characteristics still apply.

As you can see, the first hour is totally idle. The second hour shows
250–300 writes per second (expected value: 278 writes per second due to
uniform distribution over one hour). Between the second and the third
hour you have the increase, which drops below the expected value of one
million writes in 5400 seconds (185 writes per second). This is due to
the fact that a new (uniformly distributed) random jitter is added to
the update interval again: the resulting global distribution is no
longer uniform (it is the sum of two uniform random values).

The writes per second will converge eventually, but this takes a couple
of hours. You can find a graph showing a simulated day (24 hours)
at [1].
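
For anyone who wants to reproduce the general shape of these graphs, a
simulation along these lines can be sketched as follows. This is not the
script used for the graphs; it is a rough reconstruction which assumes
that every file is written 3600+[0..3600) seconds after its previous
write (or after start), with a fresh jitter each time. The function name
`simulate` and all its parameters are made up for this example, and the
default of 100,000 files (instead of one million) only keeps the run
time short.

```python
import random


def simulate(num_files=100_000, delay=3600, jitter=3600,
             duration=4 * 3600, bucket=60, seed=0):
    """Return the average writes per second, one entry per time bucket."""
    rng = random.Random(seed)
    counts = [0] * (duration // bucket)
    for _ in range(num_files):
        t = 0.0
        while True:
            # The next write happens `delay` plus a fresh jitter
            # drawn uniformly from [0, jitter) after the previous one.
            t += delay + rng.random() * jitter
            if t >= duration:
                break
            counts[int(t // bucket)] += 1
    return [c / bucket for c in counts]


if __name__ == "__main__":
    rates = simulate()
    per_hour = 3600 // 60  # number of 60-second buckets in one hour
    for hour in range(4):
        chunk = rates[hour * per_hour:(hour + 1) * per_hour]
        print(f"hour {hour + 1}: {sum(chunk) / len(chunk):.1f} writes/s")
```

Under these assumptions the first hour is idle and every file is written
exactly once during the second hour. The second write of each file is
offset by the sum of two uniform jitters, so those writes are spread over
two hours with a peak at three hours after start.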

> I'm wondering whether it would be difficult to change to an adaptive
> rate system, where given a -w 3600 and the current number of dirty
> tree nodes rrdcached computes the rate at which it needs to flush to
> disk and then does that.

I don't think that'd be hard. In fact, I think it wouldn't cause any
performance problems to re-calculate this value after each file that has
been written to disk.
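
The calculation itself is tiny. As a sketch only (the function names
`flush_rate` and `write_delay` are hypothetical and nothing like this
exists in RRDCacheD as far as this mail is concerned), spreading the
currently dirty nodes evenly over the “-w” window could look like this:

```python
def flush_rate(dirty_nodes, window=3600.0):
    """Writes per second needed to flush `dirty_nodes` within `window` s."""
    return dirty_nodes / window


def write_delay(dirty_nodes, window=3600.0):
    """Seconds to wait before the next write, or None if nothing is dirty.

    Meant to be re-evaluated after each file written to disk, so the
    rate follows the current number of dirty nodes.
    """
    if dirty_nodes <= 0:
        return None
    return window / dirty_nodes


if __name__ == "__main__":
    # 100,000 dirty nodes with -w 3600
    print(flush_rate(100_000), write_delay(100_000))
```

With 100,000 dirty nodes and a window of 3600 seconds this comes out at
roughly 28 writes per second, i.e. one write every 36 milliseconds.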

> I suspect it would be possible to push the system further if the
> various rrdcached threads could be decoupled better.

Do you have anything specific in mind? As far as I can tell the various
threads are pretty much as decoupled as they can safely be.

> Also, being able to put an upper bound on collectd memory would be
> smart 'cause it's clear that at some point the growth becomes
> self-defeating.

Sounds like a reasonable idea. Any idea which values to drop? The
oldest, the newest, either (chosen randomly), both?

> - I'm wondering how we could overcome the RRD working set issue.

Let's assume every RRD file has only one data source and you have
100,000 files. Then the total data cached should be:

    8 Byte * 100,000 files * 3600 / 20 seconds ⇒ 144 MByte

This should be possible *somehow* …
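
The arithmetic above, spelled out so that it can be replayed with other
numbers. The function name `cached_bytes` and its parameter names are
made up for this example; the values (8 bytes per value, one data
source, 20 second interval, one hour of caching) are the ones from the
formula. It counts raw values only, nothing else that may be stored
along with them.

```python
def cached_bytes(files, cache_seconds=3600, step_seconds=20,
                 bytes_per_value=8, data_sources=1):
    """Bytes of raw values held in memory for one caching period."""
    values_per_file = cache_seconds // step_seconds  # 180 for 3600 / 20
    return bytes_per_value * data_sources * files * values_per_file


if __name__ == "__main__":
    total = cached_bytes(100_000)
    print(total / 1e6, "MByte")  # 144.0 MByte
```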

> One idea that came to mind is to use the caching in rrdcached to
> convert the random small writes that are typical for RRDs to more of a
> sequential access pattern.

Well, the problem is that currently RRD files look like this on disk:

  [a0,a1,a2,a3,…,an] [b0,b1,b2,b3,…,bn] [c0,c1,c2,c3,…,cn]

To get a sequential access pattern, we'd have to reorder this to:

   a0,b0,c0 a1,b1,c1 a2,b2,c2 a3,b3,c3 … an,bn,cn 

I think the only way to achieve this is to have all that data in one
file. The huge problem here is adding new data: If we need to add
d[0,…,n] to the set above, almost *all* data has to be moved. And we're
not even touching several RRAs with differing resolutions. I think to
get this RRDtool / RRDCacheD would have to be turned into something much
more like a database system and less like a frontend for writing
separate files.
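
A small sketch of the two layouts, to make the cost of adding a series
concrete. It is purely illustrative: the function names are hypothetical,
and it models a single flat array of values with one resolution, not the
actual RRD file format.

```python
def interleave(series):
    """[[a0..an], [b0..bn], ...] -> [(a0, b0, ...), (a1, b1, ...), ...]"""
    return list(zip(*series))


def offset_interleaved(series_index, row, num_series):
    """Position of one value in the interleaved (row-major) layout."""
    return row * num_series + series_index


def moved_values(num_series, rows):
    """Count existing values whose offset changes when one series is added."""
    moved = 0
    for row in range(rows):
        for s in range(num_series):
            old = offset_interleaved(s, row, num_series)
            new = offset_interleaved(s, row, num_series + 1)
            if old != new:
                moved += 1
    return moved


if __name__ == "__main__":
    a, b, c = [1, 2, 3], [10, 20, 30], [100, 200, 300]
    print(interleave([a, b, c]))
    print(moved_values(num_series=3, rows=1000), "of", 3 * 1000, "values move")
```

In this model only the values of the very first row keep their
position; everything else has to be shifted to make room for the new
series.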

Regards,
—octo

[0] <http://verplant.org/temp/rrd_jitter.png>
[1] <http://verplant.org/temp/rrd_jitter_day.png>

P. S.: Those files will be deleted automatically in one month.
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/