[collectd] Collect scalability

Wed Jan 7 20:31:41 CET 2009

Florian Forster wrote:
> Hi Jason,
> 
> On Tue, Jan 06, 2009 at 05:16:04PM -0600, Jason wrote:
>> What is collectd designed to be capable of, and has anyone had any
>> scalability issues with it on similar systems?
> 
> the limiting factor in such installations is RRDtool (or librrd in the
> case of collectd). So the interesting number is the number of RRD files
> collectd is writing to.
> 
> The biggest installation I personally have access to consists of over
> 37.000 files from over 700 nodes (a significant number of those nodes is
> network nodes collected via SNMP).
> 
> Values are cached in the rrdtool plugin for 300 seconds, so there are
> about 125 updates per second. The machines (there are two identical
> machines for fault tolerance - no load balancing is done) have 8 GBytes
> RAM each and six 10k RPM disks arranged as RAID 10 (or 0+1? I'm not
> sure..). The disks are formatted with EXT3 and are mounted with the
> `noatime,commit=60' options.

I have about 250 machines sending data to one server, which stores about
40,000 files.  They currently live on an XFS partition on a SAN accessed
over iSCSI, which is probably not the ideal setup.  Disk load is rather
high, so I'll try a longer cache timeout, though aggressive caching
increases memory use.  We may also consider throwing more hardware at the
problem and/or adding another machine to take over half the hosts.
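
The relation between cache timeout and write rate mentioned above can be
sketched as simple arithmetic.  This is only a rough estimate: it assumes
every RRD file is flushed exactly once per cache timeout and that flushes
are spread evenly over time.  The function name and the 600-second figure
are illustrative, not taken from collectd.

```python
def updates_per_second(num_files: int, cache_timeout_s: float) -> float:
    """Average RRD updates per second, assuming each file is
    written once per cache timeout (a simplifying assumption)."""
    if cache_timeout_s <= 0:
        raise ValueError("cache_timeout_s must be positive")
    return num_files / cache_timeout_s


# Figures from the thread: 37,000 files at a 300 s timeout, and
# 40,000 files at the same timeout versus a hypothetical 600 s one.
for files, timeout in [(37_000, 300), (40_000, 300), (40_000, 600)]:
    rate = updates_per_second(files, timeout)
    print(f"{files} files, {timeout} s cache: {rate:.0f} updates/s")
```

Under these assumptions 37,000 files at 300 seconds comes to roughly 123
updates per second, in line with the "about 125" quoted above, and
doubling the timeout halves the rate at the cost of holding twice as many
values in memory.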

> There have been two hard to find bugs in the last half year or so:
> 
> 1) A corrupt RRD file caused a SIGFPE in librrd, killing the
>    application. Since `collectdmon' automatically restarted the daemon
>    we did not notice that right away and wondered where the gaps in the
>    graphs came from and why the `unixsock' plugin worked so unreliably.
> 
>    Once the corrupt file was identified and removed, everything worked
>    as expected again.

We have recently been experiencing crashes which I have not yet
thoroughly investigated, but which may be similar to this.  Do you have a
script handy that would validate RRD files en masse and identify the
likely offenders?

Thanks,
Jason
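
A minimal sketch of the kind of bulk check asked about above, not a script
from this thread.  It assumes the `rrdtool` command-line tool is on the
PATH and runs `rrdtool info` on every `.rrd` file in a separate process,
so a crash inside librrd (such as a SIGFPE) kills only the child and shows
up as a negative return code.  The function names and parameters are
hypothetical.  Note that `info` may not touch the part of a file that is
corrupt; passing `subcommand="dump"` reads the whole file and is slower,
and neither is guaranteed to reproduce a crash that happens on update.

```python
import os
import subprocess


def check_rrd(path, rrdtool="rrdtool", subcommand="info", timeout_s=30):
    """Return None if the rrdtool call succeeds, else a short reason."""
    try:
        proc = subprocess.run(
            [rrdtool, subcommand, path],
            stdout=subprocess.DEVNULL,  # output is not needed, only the status
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    except OSError as exc:
        return f"could not run {rrdtool}: {exc}"
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a signal.
        return f"killed by signal {-proc.returncode}"
    if proc.returncode != 0:
        message = proc.stderr.decode(errors="replace").strip()
        return message or f"exit status {proc.returncode}"
    return None


def find_suspect_rrds(data_dir, rrdtool="rrdtool", subcommand="info"):
    """Walk data_dir and return (path, reason) for each file that fails."""
    suspects = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(root, name)
            reason = check_rrd(path, rrdtool, subcommand)
            if reason is not None:
                suspects.append((path, reason))
    return suspects
```

Calling `find_suspect_rrds()` with the directory the rrdtool plugin writes
to returns a list of files worth inspecting by hand.  Running it against a
copy of the data, or while the daemon is stopped, avoids reading files
that are being written at the same moment.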