[collectd] a collectd success story

Jason Pepas cell at ices.utexas.edu
Fri Nov 18 17:03:10 CET 2005


So, after the nfs capabilities were added to collectd, we were able to
track down the source of the heavy, constant load which was plaguing our
nfs server.

As you can see from these weekly graphs:

http://vertex.ices.utexas.edu:9999/weekly/Nov13-Nov20/index.html

we have dramatically reduced our load.  We went from a sustained load of
over 14,000 rpc (nfs) operations per second to just over 1,000 on
average.

The culprit turned out to be an older and poorly configured version of
"gamin" (the file alteration monitor, see
http://www.gnome.org/~veillard/gamin/).

Our solution was to create an rpm of the latest version (0.1.7), and
put the following in /etc/gamin/mandatory_rc on all clients:

    fsset nfs none
    fsset autofs none

(the autofs line may not be necessary)
(/etc/gamin/mandatory_rc was not supported until a recent version of
gamin)
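
When rolling this out to many clients, a small check can confirm that a
given configuration file already contains both directives.  A minimal
sketch in Python 3, assuming one directive per line with
whitespace-separated fields and that "#" starts a comment; the function
name has_fsset_none is made up for illustration and is not part of gamin:

```python
def has_fsset_none(config_text, fstypes=("nfs", "autofs")):
    """Return True if every filesystem type in fstypes has an
    'fsset <type> none' line in config_text (hypothetical helper)."""
    disabled = set()
    for line in config_text.splitlines():
        # Assumption: '#' starts a comment; drop it, then tokenize.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[0] == "fsset" and fields[2] == "none":
            disabled.add(fields[1])
    return all(fstype in disabled for fstype in fstypes)


sample = "fsset nfs none\nfsset autofs none\n"
print(has_fsset_none(sample))
```

On a client, the text would come from reading /etc/gamin/mandatory_rc.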

Additionally, the attached client/server Python script I wrote was
instrumental in tracking down the hosts which were causing most of the
problem.  Here is some sample output, which shows the average total nfs
activity of each host in ops/sec:

 *** top 20 offenders ***
3346 fire.ices.utexas.edu
3080 orinoco.ices.utexas.edu
1927 antigua.ices.utexas.edu
1798 super.ices.utexas.edu
1448 tronix.ices.utexas.edu
755 cozumel.ices.utexas.edu
463 reunion.ices.utexas.edu
305 tobago.ices.utexas.edu
265 cletus.ices.utexas.edu
237 otto.ices.utexas.edu
207 promise.ices.utexas.edu
161 velma.ices.utexas.edu
160 sally.ices.utexas.edu
135 water.ices.utexas.edu
108 sauron.ices.utexas.edu
99 nugloo.ices.utexas.edu
60 cancun.ices.utexas.edu
46 retina.ices.utexas.edu
38 carbon.ices.utexas.edu
27 aussie.ices.utexas.edu
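
To see how concentrated the load was, the sample output above can be
totalled directly.  A quick sketch in Python 3 (the attached scripts are
Python 2); the variable names are illustrative, and the totals cover only
the 20 hosts listed, not necessarily every client:

```python
sample_output = """\
3346 fire.ices.utexas.edu
3080 orinoco.ices.utexas.edu
1927 antigua.ices.utexas.edu
1798 super.ices.utexas.edu
1448 tronix.ices.utexas.edu
755 cozumel.ices.utexas.edu
463 reunion.ices.utexas.edu
305 tobago.ices.utexas.edu
265 cletus.ices.utexas.edu
237 otto.ices.utexas.edu
207 promise.ices.utexas.edu
161 velma.ices.utexas.edu
160 sally.ices.utexas.edu
135 water.ices.utexas.edu
108 sauron.ices.utexas.edu
99 nugloo.ices.utexas.edu
60 cancun.ices.utexas.edu
46 retina.ices.utexas.edu
38 carbon.ices.utexas.edu
27 aussie.ices.utexas.edu
"""

# Parse "rate hostname" lines into (rate, hostname) pairs.
entries = []
for line in sample_output.splitlines():
    rate, hostname = line.split()
    entries.append((int(rate), hostname))

total = sum(rate for rate, _ in entries)
top5 = sum(rate for rate, _ in sorted(entries, reverse=True)[:5])
print("total ops/sec across listed hosts:", total)
print("share from the top 5 hosts: %.0f%%" % (100.0 * top5 / total))
```

The 20 hosts add up to 14,665 ops/sec, which is consistent with the load
of over 14,000 mentioned earlier, and the five busiest hosts account for
roughly 79% of it.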

It was interesting that after installing the new version of gamin, having
the users log out, kill their gam_server process, and log back in was not
enough to fix the problem.  Each machine had to be rebooted for the
changes to take effect.  My guess is that this had something to do with
the fact that autofs doesn't appear to actually get restarted until you
reboot ("service autofs restart" appears to do nothing significant while
you have nfs filesystems automounted).

-jason pepas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: client-wrap.sh
Type: application/x-sh
Size: 84 bytes
Desc: not available
Url : http://mailman.verplant.org/pipermail/collectd/attachments/20051118/7604e443/client-wrap.sh
-------------- next part --------------
#!/usr/bin/python

# see http://gnosis.cx/publish/programming/sockets2.html

import time
import random
import socket
import sys

def randsleep(interval):
    rand_fudge = interval * random.random()
    myinterval = interval + rand_fudge
    then = time.time()
    while True:
        now = time.time()
        elapsed = now - then
        if elapsed >= myinterval:
            break
        else:
            remaining = myinterval - elapsed
            time.sleep(remaining)
    return elapsed

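def get_stats():
    # Sum the NFSv3 client call counters from the "proc3" line.  The
    # second field gives the number of counters that follow it.
    with open("/proc/net/rpc/nfs") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "proc3":
                numfields = int(fields[1])
                firstfield = 2
                return sum(int(x) for x in fields[firstfield:firstfield + numfields])
    # no proc3 line found (e.g. no NFSv3 client activity recorded)
    return 0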

previous_get_stats_time = time.time()
previous_stats = 0

def get_rates():
    now = time.time()
    stats = get_stats()

    global previous_get_stats_time
    elapsed = now - previous_get_stats_time

    global previous_stats
    delta_stats = stats - previous_stats

    stats_rate = delta_stats / elapsed

    previous_get_stats_time = now
    previous_stats = stats

    return (stats_rate, elapsed)

def print_stats():
    (stats_rate, stats_elapsed) = get_rates()
    print("%s %s" % (stats_elapsed, stats_rate))

# throw away the first set of values, as they are invalid.
get_rates()

USAGE = "Usage: %s (print)|(net hostname)" % sys.argv[0]

if len(sys.argv) < 2:
    print(USAGE)
    sys.exit(1)

if sys.argv[1] == "print":
    # print the local rate once per second
    while True:
        time.sleep(1)
        print(int(get_rates()[0]))
elif sys.argv[1] == "net" and len(sys.argv) > 2:
    # report the rate to the server named in sys.argv[2] on UDP port 1337,
    # at a random interval of 30 to 60 seconds
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        randsleep(30)
        (stats_rate, stats_elapsed) = get_rates()
        sock.sendto(str(int(stats_rate)).encode("ascii"), (sys.argv[2], 1337))
else:
    print(USAGE)
    sys.exit(1)
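
The rate calculation in the client script above boils down to summing the
counters on the "proc3" line and dividing the change between two samples
by the elapsed time.  A self-contained sketch of that arithmetic in
Python 3, using made-up sample lines instead of reading /proc/net/rpc/nfs;
the helper names and the numbers are illustrative only:

```python
def sum_proc3(text):
    """Sum the per-procedure counters on the 'proc3' line.

    The second field is the number of counters that follow it."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "proc3":
            count = int(fields[1])
            return sum(int(x) for x in fields[2:2 + count])
    return 0


def ops_per_second(before, after, elapsed):
    """Average operations per second between two counter totals."""
    return (after - before) / elapsed


# Two invented samples taken 30 seconds apart (4 counters for brevity;
# a real proc3 line has more).
first = "proc3 4 0 100 200 300\n"
second = "proc3 4 0 400 500 600\n"
rate = ops_per_second(sum_proc3(first), sum_proc3(second), 30.0)
print(int(rate))
```

Because the counters are cumulative, the first sample has nothing to be
compared against, which is why the script discards its first reading.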

-------------- next part --------------
#!/usr/bin/python

# see http://gnosis.cx/publish/programming/sockets2.html

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('', 1337))
rates = {}  # hostname -> most recently reported ops/sec
while True:
    (data, address) = sock.recvfrom(256)
    try:
        rate = int(data)
    except ValueError:
        continue  # ignore malformed datagrams
    try:
        hostname = socket.gethostbyaddr(address[0])[0]
    except socket.error:
        hostname = address[0]  # fall back to the address if reverse DNS fails
    rates[hostname] = rate
    # sort by rate, highest first
    items = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    print("\n\n\n *** top 20 offenders ***")
    for (hostname, rate) in items[:20]:
        print("%d %s" % (rate, hostname))
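
The wire format between the two scripts is simply the rate as ASCII
decimal digits in a single UDP datagram.  A minimal loopback sketch in
Python 3 that demonstrates one round trip; it binds to 127.0.0.1 on a port
chosen by the OS rather than port 1337, and the value 3346 is just taken
from the sample output:

```python
import socket

# Receiver: bind to an ephemeral loopback port (the real server uses 1337).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5.0)  # avoid blocking forever if the datagram is lost
port = receiver.getsockname()[1]

# Sender: one datagram carrying the rate as ASCII digits.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(str(3346).encode("ascii"), ("127.0.0.1", port))

data, address = receiver.recvfrom(256)
rate = int(data)  # int() accepts a bytes object of ASCII digits
print(rate, address[0])

sender.close()
receiver.close()
```

UDP gives no delivery guarantee, so an occasional lost report just means a
host's entry is refreshed on its next interval instead.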


