[collectd] exec plugin hang in getpwnam_r
Riches Jr, Robert M
robert.m.riches.jr at intel.com
Fri Feb 22 22:36:27 CET 2013
(Thanks for the tip that helped lead to the dlclose() solution.)
Now, I'm running into an issue where the exec plugin's child process hangs in a call to getpwnam_r() from inside exec_child(). The hang happens reliably if I use upstart to start collectdmon, which starts collectd. If I start collectdmon manually, everything works as it should. (The hang happened even before I added the dlclose() patch, and it occurs earlier in exec_child() than anything the dlclose() patch touches.)
This is with collectd 5.2.0 on Ubuntu 12.04 LTS. The 'top' command reports 32 logical CPUs. I have only one item that the exec plugin should be calling. If I gather correctly, my getpwnam_r() comes from glibc, not from the ifdef'ed function in common.c.
It appears this may be related to the following issues:
http://mailman.verplant.org/pipermail/collectd/2010-March/003650.html
https://github.com/collectd/collectd/issues/229
https://gist.github.com/jessereynolds/2878994
http://www.mail-archive.com/collectd@verplant.org/msg00524.html
and possibly
http://monkey.org/freebsd/archive/freebsd-threads/200307/msg00110.html
Here's the output of strace on a hung child process:
Process 2435 attached - interrupt to quit
futex(0x7fd431c3cdb0, FUTEX_WAIT_PRIVATE, 2, NULL
Here's the output of the gdb 'where' command on a hung child process:
(gdb) where
#0  0x00007fcf120d09bb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fcf120d591c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fcf120d4f2b in __nss_database_lookup ()
   from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fcf120d630c in __nss_passwd_lookup2 ()
   from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007fcf1208dac8 in getpwnam_r () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fcf0c6e38ed in exec_child (pl=0x18e5e90) at exec.c:303
#6  fork_child (pl=0x18e5e90, fd_in=<optimized out>, fd_out=<optimized out>,
    fd_err=0x7fcf09aca5e8) at exec.c:509
#7  0x00007fcf0c6e3f5e in exec_read_one (arg=0x18e5e90) at exec.c:560
#8  0x00007fcf12599e9a in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007fcf120c2cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()
(I could compile with --enable-debug, but that's a significant effort and it doesn't appear it would add much info at this point.)
Here's /proc/$pid/status for a hung child process:
Name: collectd
State: S (sleeping)
Tgid: 128026
Pid: 128026
PPid: 128018
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups:
VmPeak: 624472 kB
VmSize: 624472 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 1484 kB
VmRSS: 1484 kB
VmData: 516584 kB
VmStk: 136 kB
VmExe: 164 kB
VmLib: 10916 kB
VmPTE: 268 kB
VmSwap: 0 kB
Threads: 1
SigQ: 5/256851
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180014202
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-127
Mems_allowed: 00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 9
nonvoluntary_ctxt_switches: 0
Some documents on the web said getpwnam_r() could hang if using NIS or LDAP. Here's my /etc/nsswitch.conf:
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.
passwd: compat
group: compat
shadow: compat
hosts: files dns
networks: files
protocols: db files
services: db files
ethers: db files
rpc: db files
netgroup: nis
If I gather correctly, 'compat' for the passwd, group, and shadow entries should be equivalent to 'files' if there are no exception items. The 'netgroup: nis' entry is apparently from the stock Ubuntu installation. As far as I am aware, there is no NIS active for this machine.
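To check the "no exception items" assumption, something like the following sketch could be used. It is not collectd code: count_compat_entries() is a hypothetical helper name, and it assumes that compat exception items are exactly the lines of a passwd-format file that begin with '+' or '-'.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of collectd): count the lines in a
 * passwd-format file that begin with '+' or '-'. Assumption: these are
 * the entries that make 'compat' behave differently from 'files'.
 * Returns the count, or -1 if the file cannot be opened. */
static int count_compat_entries(const char *path)
{
  assert(path != NULL);

  FILE *fh = fopen(path, "r");
  if (fh == NULL)
    return -1;

  int count = 0;
  int at_line_start = 1;
  int c;
  while ((c = fgetc(fh)) != EOF) {
    /* Only the first character of each line is significant. */
    if (at_line_start && (c == '+' || c == '-'))
      count++;
    at_line_start = (c == '\n');
  }

  fclose(fh);
  return count;
}
```

Calling count_compat_entries("/etc/passwd") and getting 0 would support treating 'compat' like 'files' for the passwd lookup on this machine. The same check could be run against /etc/group.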
Does anyone here know why getpwnam_r() might hang, or of a better solution than adding the 'sleep(1)' mentioned in one of the URLs above? (A sleep probably wouldn't help in my case, because the exec plugin calls only one thing.)
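One alternative that might be worth trying is sketched below, under an assumption I have not confirmed: that the child is waiting on a lock inside glibc's NSS code that some other thread held at the moment of fork(). If that is the case, doing the lookup in the parent and passing the uid/gid into the child would keep the child away from NSS entirely. The names resolve_user() and spawn_as_user() are hypothetical, this is not how exec.c is currently structured, and the sketch ignores supplementary groups.

```c
#include <assert.h>
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not collectd code: resolve uid and gid for `user`.
 * Returns 0 on success, -1 on a lookup error or an unknown user. */
static int resolve_user(const char *user, uid_t *uid, gid_t *gid)
{
  struct passwd pw;
  struct passwd *result = NULL;
  char buf[16384]; /* assumption: large enough for one passwd entry */

  assert(user != NULL && uid != NULL && gid != NULL);

  int status = getpwnam_r(user, &pw, buf, sizeof(buf), &result);
  if (status != 0 || result == NULL)
    return -1;

  *uid = result->pw_uid;
  *gid = result->pw_gid;
  return 0;
}

/* Hypothetical helper: run the program at absolute path `path` as `user`.
 * Returns the child's pid, or -1 if the lookup or the fork failed. */
static pid_t spawn_as_user(const char *user, const char *path,
                           char *const argv[])
{
  uid_t uid;
  gid_t gid;

  /* The NSS lookup happens in the parent, before fork(). */
  if (resolve_user(user, &uid, &gid) != 0)
    return -1;

  pid_t pid = fork();
  if (pid != 0)
    return pid; /* parent (pid > 0) or fork error (pid == -1) */

  /* Child: only async-signal-safe calls from here on. */
  if (setgid(gid) != 0 || setuid(uid) != 0)
    _exit(126);
  execv(path, argv);
  _exit(127); /* only reached if execv() failed */
}
```

The parent would call spawn_as_user() with the configured user and the absolute path of the script, then waitpid() on the returned pid as usual. Whether this actually avoids the hang depends on the assumption above being right.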
Thanks,
Robert Riches