[collectd] collectd rrdtool performance

Thu Dec 20 19:26:54 CET 2007

Florian Forster wrote:
> have you read the `Tuning RRDtool for performance' document in the
> RRDTool Wiki? If not, it may be of interest for you.
Yup, thanks.

> On Wed, Dec 19, 2007 at 08:29:37AM -0800, Thorsten von Eicken wrote:
>> Even if I set the RRDTool plugin cache to 60 seconds the situation is
>> not much better.
> 
> Really? I've heard and experienced that this setting reduces the IO load
> a lot: Since the disks always write blocks of 512 bytes it doesn't make
> much difference if you write one value or 64.
> 
>> The biggest issue I see is that 150 hosts = 10500 RRDs. I'm planning
>> to go ahead and reorganize a little how the RRD data is stored by
>> placing all related variables of a plugin into a single RRD as opposed
>> to the current scheme where almost every variable is in its own RRD.
> 
> Hm, I'd do some benchmarks first. The effect is that the DSes are stored
> nearby, basically the same effect the `CacheTimeout' option uses.

Good point. So I decided to do some more tests. First, I killed collectd 
and continued to watch the iostat output. It took ~120 seconds after 
kill -9 collectd for the disks to drop from 100% util to 0% util! The 
only explanation I have is that there were 120 seconds' worth of dirty 
blocks in the filesystem cache! (I'm using xfs.) I don't know whether 
this means that on a crash I'd lose 120 seconds' worth of data (I 
wouldn't care), or whether that's a more subtle artifact of xfs logging.

I then restarted collectd with a 120 second cache timeout. The disks 
stayed idle for about that time period and then ramped up to 100% util 
in ~40 seconds. They stayed there solid for >2 minutes until I killed 
collectd again. I then tried a 180 second timeout. Same thing, except 
that it took ~180 seconds for disk activity to show. Afterwards disks 
stayed at 100%. I had hoped to see a dip for ~60 seconds every 180 
seconds. No such thing...

The bottom line is interesting. If I run without cache timeout, the 
graphs are all up-to-date and I can't see any missing data. If I run 
with a cache timeout of 120 or 180 seconds, the graphs lag behind and 
the disks are just as hammered. Net result: cachetimeout>0 is worse than 
cachetimeout=0 in my case!
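
As a rough sanity check, the sketch below works out how many RRD updates per second the disks have to absorb on average for a given cache timeout. It assumes that each of the 10500 RRDs is flushed exactly once per timeout period, which is a simplification and not something checked against the rrdtool plugin; the function name is made up for illustration.

```python
def updates_per_second(num_rrds, flush_period_s):
    """Average RRD file updates per second if every RRD is written once per flush period."""
    if flush_period_s <= 0:
        raise ValueError("flush_period_s must be positive")
    return num_rrds / flush_period_s


NUM_RRDS = 10500  # 150 hosts, figure quoted earlier in this thread

# Timeouts tried in the tests above (60, 120 and 180 seconds).
for timeout_s in (60, 120, 180):
    rate = updates_per_second(NUM_RRDS, timeout_s)
    print(f"timeout {timeout_s:>3}s: {rate:6.1f} RRD updates/s")
```

Even at 180 seconds this is still roughly 58 file updates per second on average, and each update may touch more than one block, so a longer timeout by itself need not produce idle periods.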

For grins, this is 'top' on my box:
top - 13:21:52 up 51 days, 36 min,  2 users,  load average: 3.55, 3.29, 2.58
Tasks:  53 total,   3 running,  50 sleeping,   0 stopped,   0 zombie
Cpu0  :  1.9%us,  1.1%sy,  0.0%ni, 29.8%id, 66.8%wa,  0.0%hi,  0.0%si,  0.3%st
Cpu1  :  1.0%us,  2.3%sy,  0.0%ni, 86.1%id, 10.3%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:   7864320k total,  7848980k used,    15340k free,    59744k buffers
Swap:        0k total,        0k used,        0k free,  7070464k cached

This is iostat -x:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            5.90    0.00    0.85   89.71    0.00    3.55

Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz   await  svctm   %util
sdb        0.00    0.20   1.80  209.70   23.20   895.20      8.68    104.89  477.52   4.66   98.60
sdc        0.00    0.50   1.40  205.70   16.80   891.20      8.77     52.39  260.74   3.58   74.10
sda1       0.00    5.40   0.00    1.10    0.00    26.00     47.27      0.00    0.00   0.00    0.00
dm-0       0.00    0.00   3.10  407.70   39.60  1754.40      8.73    157.38  377.56   2.44  100.10
dm-1       0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00    0.00

dm-0 is the partition with the xfs filesystem, and it's striped across 
sdb and sdc.
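
To put the iostat numbers in perspective, dividing wkB/s by w/s gives the average size of a write request. The sketch below does that for the figures above; the helper name is just for illustration and is not part of any tool.

```python
def avg_write_kb(writes_per_s, write_kb_per_s):
    """Average write request size in kB, from iostat's w/s and wkB/s columns."""
    if writes_per_s == 0:
        return 0.0  # idle device, nothing to average
    return write_kb_per_s / writes_per_s


# (w/s, wkB/s) pairs copied from the iostat -x output above.
samples = {
    "sdb": (209.70, 895.20),
    "sdc": (205.70, 891.20),
    "dm-0": (407.70, 1754.40),
}

for device, (w_s, wkb_s) in samples.items():
    print(f"{device:5s} {avg_write_kb(w_s, wkb_s):.2f} kB per write")
```

All three devices come out at a bit over 4 kB per write, i.e. the load consists of many small writes rather than a few large ones, which fits the high await and %util values.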

Now I'm off trying to figure out how to upgrade to rrdtool > 1.2.23

TvE