[collectd] exec plugin hang in getpwnam_r

Riches Jr, Robert M robert.m.riches.jr at intel.com
Fri Feb 22 22:36:27 CET 2013


(Thanks for the tip that helped lead to the dlclose() solution.)

Now, I'm running into an issue where the exec plugin's child process hangs in a call to getpwnam_r() from inside exec_child().  The hang happens reliably if I use upstart to start collectdmon, which starts collectd.  If I manually start collectdmon, everything works the way it is supposed to.  (This hang happened even before I added the dlclose() patch, and it happens earlier in exec_child() than anything done by the dlclose() patch.)

This is with collectd 5.2.0 on Ubuntu 12.04 LTS.  The 'top' command reports 32 logical CPUs.  I have only one item that the exec plugin should be calling.  If I gather correctly, my getpwnam_r() comes from glibc, not from the ifdef'ed function in common.c.

It appears this issue may be related to the following reports:

http://mailman.verplant.org/pipermail/collectd/2010-March/003650.html
https://github.com/collectd/collectd/issues/229
https://gist.github.com/jessereynolds/2878994
http://www.mail-archive.com/collectd@verplant.org/msg00524.html

and possibly

http://monkey.org/freebsd/archive/freebsd-threads/200307/msg00110.html

Here's the output of strace on a hung child process:

Process 2435 attached - interrupt to quit
futex(0x7fd431c3cdb0, FUTEX_WAIT_PRIVATE, 2, NULL

Here's the output of the gdb 'where' command on a hung child process:

(gdb) where
#0  0x00007fcf120d09bb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fcf120d591c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fcf120d4f2b in __nss_database_lookup ()
   from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fcf120d630c in __nss_passwd_lookup2 ()
   from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007fcf1208dac8 in getpwnam_r () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fcf0c6e38ed in exec_child (pl=0x18e5e90) at exec.c:303
#6  fork_child (pl=0x18e5e90, fd_in=<optimized out>, fd_out=<optimized out>,
    fd_err=0x7fcf09aca5e8) at exec.c:509
#7  0x00007fcf0c6e3f5e in exec_read_one (arg=0x18e5e90) at exec.c:560
#8  0x00007fcf12599e9a in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007fcf120c2cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

(I could compile with --enable-debug, but that's a significant effort and it doesn't appear it would add much info at this point.)
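
The backtrace shows the child still carrying the stack of the thread that forked it (exec_read_one under start_thread), blocked on a futex inside glibc's NSS lookup. One mechanism that would be consistent with this, offered only as an assumption and not confirmed for this case, is that some other thread held a libc-internal lock at the moment of fork(); the child inherits the lock in its locked state, but not the thread that would release it. The sketch below illustrates that general effect with a hypothetical demo_lock standing in for the libc lock. The names demo_lock, holder_thread, and child_sees_lock_held are invented for illustration and are not part of collectd or glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a lock that lives inside libc (assumption). */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int held_pipe[2];    /* holder thread -> main: "lock is now held" */
static int release_pipe[2]; /* main -> holder thread: "you may unlock"   */

/* Second thread: takes the lock and keeps it until told to release it. */
static void *holder_thread(void *unused)
{
    char c = 'x';
    ssize_t n;

    (void)unused;
    pthread_mutex_lock(&demo_lock);
    n = write(held_pipe[1], &c, 1);
    if (n == 1)
        n = read(release_pipe[0], &c, 1);
    (void)n;
    pthread_mutex_unlock(&demo_lock);
    return NULL;
}

/* Returns 1 if the forked child found demo_lock still locked,
 * 0 if the child could take it, -1 on setup failure. */
static int child_sees_lock_held(void)
{
    pthread_t tid;
    char c = 0;
    int status = 0;
    int ok;
    ssize_t n;
    pid_t pid;

    if (pipe(held_pipe) != 0 || pipe(release_pipe) != 0)
        return -1;
    if (pthread_create(&tid, NULL, holder_thread, NULL) != 0)
        return -1;
    /* Wait until the other thread really holds the lock. */
    if (read(held_pipe[0], &c, 1) != 1)
        return -1;
    assert(c == 'x');

    pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread exists here, so nobody will
         * ever unlock demo_lock. A blocking lock would wait forever;
         * trylock is used so the demo terminates. */
        _exit(pthread_mutex_trylock(&demo_lock) == 0 ? 0 : 1);
    }
    ok = (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status));

    /* Let the holder thread finish and clean up. */
    n = write(release_pipe[1], &c, 1);
    (void)n;
    pthread_join(tid, NULL);
    close(held_pipe[0]);
    close(held_pipe[1]);
    close(release_pipe[0]);
    close(release_pipe[1]);

    return ok ? WEXITSTATUS(status) : -1;
}
```

Calling child_sees_lock_held() from a small main() is enough to try this; the function is expected to report 1, meaning the child saw the lock as taken. On older glibc versions the program needs to be built with -pthread. Whether the lock in the trace above is held for this reason is not established by the data in this message.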

Here's /proc/$pid/status for a hung child process:

Name:   collectd
State:  S (sleeping)
Tgid:   128026
Pid:    128026
PPid:   128018
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
VmPeak:   624472 kB
VmSize:   624472 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      1484 kB
VmRSS:      1484 kB
VmData:   516584 kB
VmStk:       136 kB
VmExe:       164 kB
VmLib:     10916 kB
VmPTE:       268 kB
VmSwap:        0 kB
Threads:        1
SigQ:   5/256851
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180014202
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:      0-127
Mems_allowed:   00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        9
nonvoluntary_ctxt_switches:     0

Some documents on the web said getpwnam_r() could hang if using NIS or LDAP.  Here's my /etc/nsswitch.conf:

# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd:         compat
group:          compat
shadow:         compat

hosts:          files dns
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files

netgroup:       nis

If I gather correctly, 'compat' for the passwd, group, and shadow entries should be equivalent to 'files' if there are no exception items.  The 'netgroup: nis' entry is apparently from the stock Ubuntu installation.  As far as I am aware, there is no NIS active for this machine.

Is anyone here familiar with reasons getpwnam_r() might hang, or with a better solution than the 'sleep(1)' workaround mentioned in one of the URLs above?  (A sleep seems unlikely to help in my case, because the exec plugin only calls one thing.)
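
For illustration, one possible alternative to a sleep is to do the user lookup in the parent, before fork(), so that the child only calls functions that do not go through the NSS code. The following is a minimal sketch under that assumption, not the collectd implementation: resolve_user and spawn_as are hypothetical helper names, the buffer size is arbitrary, and error handling is reduced to the essentials.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pwd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: resolve a user name to uid/gid in the PARENT.
 * Returns 0 on success, -1 if the user is unknown or the lookup fails. */
static int resolve_user(const char *name, uid_t *uid, gid_t *gid)
{
    struct passwd pw;
    struct passwd *res = NULL;
    char buf[16384]; /* arbitrary size chosen for this sketch */

    assert(name != NULL && uid != NULL && gid != NULL);
    if (getpwnam_r(name, &pw, buf, sizeof(buf), &res) != 0 || res == NULL)
        return -1;
    *uid = res->pw_uid;
    *gid = res->pw_gid;
    return 0;
}

/* Hypothetical helper: fork and exec `path`, optionally switching to the
 * uid/gid that were resolved beforehand. After fork() the child calls
 * only setgid(), setuid(), execv() and _exit(). Returns the child's pid
 * to the parent, or -1 if fork() failed. */
static pid_t spawn_as(const char *path, char *const argv[],
                      int drop_privs, uid_t uid, gid_t gid)
{
    pid_t pid = fork();

    if (pid != 0)
        return pid; /* parent, or fork() error */

    /* Child: no name-service lookups from here on. */
    if (drop_privs && (setgid(gid) != 0 || setuid(uid) != 0))
        _exit(126);
    execv(path, argv);
    _exit(127); /* only reached if execv() failed */
}
```

A caller would run resolve_user() first, check its result, and only then call spawn_as() with the uid and gid it returned. In terms of the backtrace above, that would correspond to moving the getpwnam_r() call out of exec_child() and into the code that runs before fork_child() forks, which is an assumption about how the plugin could be restructured rather than a tested patch.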

Thanks,

Robert Riches
