[collectd] Collect scalability

Florian Forster octo at verplant.org
Wed Jan 7 14:18:59 CET 2009


Hi Jason,

On Tue, Jan 06, 2009 at 05:16:04PM -0600, Jason wrote:
> What is collectd designed to be capable of, and has anyone had any
> scalability issues with it on similar systems?

the limiting factor in such installations is RRDtool (or librrd in the
case of collectd). So the interesting number is the number of RRD files
collectd is writing to.

The biggest installation I personally have access to consists of over
37,000 RRD files from over 700 nodes (a significant number of which are
network devices polled via SNMP).

Values are cached in the rrdtool plugin for 300 seconds, so there are
about 125 updates per second. The machines (there are two identical
machines for fault tolerance - no load balancing is done) have 8 GBytes
RAM each and six 10k RPM disks arranged as RAID 10 (or 0+1? I'm not
sure..). The disks are formatted with EXT3 and are mounted with the
`noatime,commit=60' options.

There have been two hard-to-find bugs in the last half year or so:

1) A corrupt RRD file caused a SIGFPE in librrd, killing the
   application. Since `collectdmon' automatically restarted the daemon
   we did not notice that right away and wondered where the gaps in the
   graphs came from and why the `unixsock' plugin worked so unreliably.

   Once the corrupt file was identified and removed, everything worked
   as expected again.

2) As more and more nodes were added to the system, gaps in the graphs
   started to appear again. It turned out that an old problem had
   resurfaced: Because data was not read from the socket fast enough,
   the receive buffer filled up, ultimately leading to packets being
   discarded. Since the `network' plugin uses UDP, the data is not
   retransmitted and is lost.

   To work around this issue I added the following lines to
   `/etc/sysctl.conf'. They increase the default receive buffer to
   4 MBytes which is sufficient for now.
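The sysctl lines themselves did not survive in the archive. Assuming the
standard Linux kernel parameters for the default and maximum socket
receive buffer size, a configuration raising them to 4 MBytes would look
something like this (illustrative values, not the exact lines from the
original message):

```
# /etc/sysctl.conf: raise the UDP receive buffer to 4 MBytes
net.core.rmem_default = 4194304
net.core.rmem_max = 4194304
```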

   Currently, there are two threads in the `network' plugin: One reads
   packets from the socket and attaches them to a queue, the other
   dequeues the packets, decodes them and dispatches the parsed values
   to collectd. My guess right now is that under very high I/O load,
   locking the queue takes too long (this is the only potentially
   blocking call apart from `recv'). I'll probably change that code to
   work as follows:
   follows:
   -- 8< --
    while (true) {
      entry = recv ();
      enqueue (private_queue, entry);
      if (try_lock (lock)) {
        enqueue_all (public_queue, private_queue);
        private_queue = NULL;
        unlock (lock);
      }
    }
   -- >8 --
   The important change is that if `try_lock' cannot lock the queue, it
   returns immediately, thus never blocking the receive thread.

Does that help you? What kind of problem are you experiencing? What does
your setup look like?

Regards,
-octo
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/