[collectd] collectd rrdtool performance
Thorsten von Eicken
tve at voneicken.com
Thu Dec 20 19:26:54 CET 2007
Florian Forster wrote:
> have you read the `Tuning RRDtool for performance' document in the
> RRDTool Wiki? If not, it may be of interest for you.
> On Wed, Dec 19, 2007 at 08:29:37AM -0800, Thorsten von Eicken wrote:
>> Even if I set the RRDTool plugin cache to 60 seconds the situation is
>> not much better.
> Really? I've heard and experienced that this setting reduces the IO load
> a lot: Since the disks always write blocks of 512 bytes it doesn't make
> much difference if you write one value or 64.
>> The biggest issue I see is that 150 hosts = 10500 RRDs. I'm planning
>> to go ahead and reorganize a little how the RRD data is stored by
>> placing all related variables of a plugin into a single RRD as opposed
>> to the current scheme where almost every variable is in its own RRD.
> Hm, I'd do some benchmarks first. The effect is that the DSes are stored
> nearby, basically the same effect the `CacheTimeout' option uses.
Good point. So I decided to do some more tests. First, I killed collectd
and continued to watch the iostat output. It took ~120 seconds after
kill -9 collectd for the disks to drop from 100% util to 0% util! The
only explanation I have is that there was 120 seconds worth of dirty
blocks in the filesystem cache! (I'm using xfs.) I don't know whether
this means that on a crash I'd lose 120 seconds worth of data (I
wouldn't care), or whether that's a more subtle artifact of xfs logging.
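
One way to check the dirty-block explanation is to watch the kernel's own counters while the disks drain. The following is a minimal sketch, assuming a Linux box where /proc/meminfo exposes the `Dirty` and `Writeback` fields; the helper names `dirty_kb` and `watch_dirty` are made up for illustration.

```python
import re
import time


def dirty_kb(meminfo_text):
    """Parse /proc/meminfo-style text and return (Dirty, Writeback) in kB."""
    fields = {}
    for line in meminfo_text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)\s*kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    # Missing fields are reported as 0 rather than raising.
    return fields.get("Dirty", 0), fields.get("Writeback", 0)


def watch_dirty(path="/proc/meminfo", samples=5, interval_s=1.0):
    """Print the dirty and writeback page totals once per interval (Linux only)."""
    for _ in range(samples):
        with open(path) as f:
            dirty, writeback = dirty_kb(f.read())
        print("dirty=%d kB  writeback=%d kB" % (dirty, writeback))
        time.sleep(interval_s)
```

Calling `watch_dirty(samples=120)` right after killing collectd would show whether `Dirty` falls steadily over the two minutes. If it does, the disk activity is the page cache being flushed and not something specific to xfs logging.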
I then restarted collectd with a 120 second cache timeout. The disks
stayed idle for about that time period and then ramped up to 100% util
in ~40 seconds. They stayed there solid for >2 minutes until I killed
collectd again. I then tried a 180 second timeout. Same thing, except
that it took ~180 seconds for disk activity to show. Afterwards disks
stayed at 100%. I had hoped to see a dip for ~60 seconds every 180
seconds. No such thing...
The bottom line is interesting. If I run without cache timeout, the
graphs are all up-to-date and I can't see any missing data. If I run
with a cache timeout of 120 or 180 seconds, the graphs lag behind and
the disks are just as hammered. Net result: cachetimeout>0 is worse than
cachetimeout=0 in my case!
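
The hope of seeing a dip rests on a simple model: each RRD file costs one write per flush, so a longer cache timeout should mean proportionally fewer writes. A rough sketch of that model follows. The 10500 RRDs come from this thread; the 10 second collection interval is an assumption (collectd's default), since the interval actually in use is not stated here.

```python
def expected_writes_per_sec(num_rrds, interval_s, cache_timeout_s):
    """Naive model: every RRD file is written once per flush period.

    With no cache, the flush period is the collection interval; with a
    cache, it is the cache timeout (never shorter than the interval).
    """
    period = max(interval_s, cache_timeout_s)
    return num_rrds / period


NUM_RRDS = 10500   # from the thread: 150 hosts
INTERVAL_S = 10    # ASSUMPTION: collectd default interval, not stated in the thread

for timeout in (0, 60, 120, 180):
    rate = expected_writes_per_sec(NUM_RRDS, INTERVAL_S, timeout)
    print("CacheTimeout=%3d s -> ~%.1f file writes/s" % (timeout, rate))
```

Under these assumptions a 120 second timeout should cut the write rate by more than a factor of ten. The measurements above show no such reduction, so the model must be leaving something out, for example each flush touching more than one block or triggering reads.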
For grins, this is 'top' on my box:
top - 13:21:52 up 51 days, 36 min, 2 users, load average: 3.55, 3.29, 2.58
Tasks: 53 total, 3 running, 50 sleeping, 0 stopped, 0 zombie
Cpu0 : 1.9%us, 1.1%sy, 0.0%ni, 29.8%id, 66.8%wa, 0.0%hi, 0.0%si,
Cpu1 : 1.0%us, 2.3%sy, 0.0%ni, 86.1%id, 10.3%wa, 0.0%hi, 0.0%si,
Mem: 7864320k total, 7848980k used, 15340k free, 59744k buffers
Swap: 0k total, 0k used, 0k free, 7070464k cached
This is iostat -x:
avg-cpu:    %user    %nice  %system  %iowait   %steal    %idle
             5.90     0.00     0.85    89.71     0.00     3.55
Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await    svctm    %util
sdb          0.00     0.20     1.80   209.70    23.20   895.20     8.68   104.89   477.52     4.66    98.60
sdc          0.00     0.50     1.40   205.70    16.80   891.20     8.77    52.39   260.74     3.58    74.10
sda1         0.00     5.40     0.00     1.10     0.00    26.00    47.27     0.00     0.00     0.00     0.00
dm-0         0.00     0.00     3.10   407.70    39.60  1754.40     8.73   157.38   377.56     2.44   100.10
dm-1         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
dm-0 is the partition with the xfs filesystem, and it's striped across
sdb and sdc.
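
The iostat figures also give the average size of a write request: divide wkB/s by w/s. A small sketch using the numbers pasted above (the function name `avg_write_kb` is made up for illustration):

```python
def avg_write_kb(wkb_per_s, writes_per_s):
    """Average size of one write request in kB, from iostat's wkB/s and w/s."""
    return wkb_per_s / writes_per_s


# (w/s, wkB/s) copied from the iostat -x output in this message
SAMPLES = {
    "sdb": (209.70, 895.20),
    "sdc": (205.70, 891.20),
    "dm-0": (407.70, 1754.40),
}

for dev, (w_s, wkb_s) in SAMPLES.items():
    print("%-5s %.2f kB per write" % (dev, avg_write_kb(wkb_s, w_s)))
```

All three devices come out at a little over 4 kB per write, which matches the avgrq-sz column (about 8.7 sectors of 512 bytes). That would be consistent with one 4 kB filesystem block per request, although the block size of this xfs volume is not stated in the thread.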
Now I'm off trying to figure out how to upgrade to rrdtool > 1.2.23.