[collectd] exec plugin stuck on mutex
Ryan Tomayko
r at tomayko.com
Tue Mar 9 20:46:20 CET 2010
On Tue, Mar 9, 2010 at 10:03 AM, Florian Forster <octo at verplant.org> wrote:
> Hi Ryan,
>
> On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote:
>> collectd 4.8.1, http://collectd.org/
>
> no applicable problem has been fixed in the Exec plugin in the meantime,
> so the problem should still exist in the master branch.
>
>> Once I notice the plugin has stopped reporting, I have an extra
>> process (28489) hanging around:
>>
>> $ pstree -apu 22935
>> collectdmon,22935 -P /var/run/collectdmon.pid -- -C
>> /etc/collectd/collectd.conf
>> collectd,22936 -C /etc/collectd/collectd.conf -f
>> collectd,28489 -C /etc/collectd/collectd.conf -f
>> {collectd},22937
>> {collectd},22938
>> {collectd},22939
>> {collectd},22940
>> {collectd},22941
>> {collectd},28487
>
>> That process seems to exist only when the exec plugin is no longer
>> reporting. Sometimes there's two of these processes.
>
> This looks like the code that is supposed to spawn a new instance of the
> script failed after fork(2) but before exec(2).
>
> There are various cases in which the exec(2) is not reached in
> "exec_child()", but they all emit an error message. I take it there is
> no error message somewhere in the logs or in syslog?
I'm embarassed to admit I have not had the logs turned on. There's
very likely an error message I'm not seeing. I'll switch them on
today and see what happens.
>> strace reports that the extra process is sitting in a mutex. It never
>> leaves this state:
>>
>> $ sudo strace -p 28489
>> Process 28489 attached - interrupt to quit
>> futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL
>
> There is a mutex in the exec plugin, but I doubt that this is the
> problem. It is held just before a thread is spawned (to set a flag) and
> just before that thread exits (to reset the flag). I don't see any way
> this could lead to a deadlock or starvation.
>
> I'm much more concerned about the SIGCHLD handler and the various
> waitpid(2)s in the code. I could see the controlling thread missing its
> child's signal and waiting forever. This shouldn't create weird new
> processes though.
>
>> Any ideas what might be going on here or information I could provide
>> to help find a root cause?
>
> I'm a bit puzzled by the described behavior, I have to admit. Maybe you
> could provide the "lsof -p $PID" output for one of those weird child
> processes?
Here's the lsof output:
# lsof -p 25335
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
collectd 25335 root cwd DIR 254,0 4096 650195 /var/lib/collectd
collectd 25335 root rtd DIR 254,0 4096 2 /
collectd 25335 root txt REG 254,0 155960 114923 /usr/sbin/collectd
collectd 25335 root mem REG 254,0 93016 262221 /lib/libgcc_s.so.1
collectd 25335 root mem REG 254,0 72568 262188 /lib/libresolv-2.7.so
collectd 25335 root mem REG 254,0 18752 262195 /lib/libnss_dns-2.7.so
collectd 25335 root mem REG 254,0 47520 262191
/lib/libnss_files-2.7.so
collectd 25335 root mem REG 254,0 421512 122929
/usr/lib/libgcrypt.so.11.4.4
collectd 25335 root mem REG 254,0 40560 139512
/usr/lib/collectd/network.so
collectd 25335 root mem REG 254,0 6624 139559
/usr/lib/collectd/memory.so
collectd 25335 root mem REG 254,0 12424 139514
/usr/lib/collectd/memcached.so
collectd 25335 root mem REG 254,0 5288 139551
/usr/lib/collectd/load.so
collectd 25335 root mem REG 254,0 7208 139530
/usr/lib/collectd/interface.so
collectd 25335 root mem REG 254,0 22128 139513
/usr/lib/collectd/exec.so
collectd 25335 root mem REG 254,0 13344 139548
/usr/lib/collectd/df.so
collectd 25335 root mem REG 254,0 6672 139586
/usr/lib/collectd/cpu.so
collectd 25335 root mem REG 254,0 8040 139515
/usr/lib/collectd/logfile.so
collectd 25335 root mem REG 254,0 1375536 262194 /lib/libc-2.7.so
collectd 25335 root mem REG 254,0 14616 262198 /lib/libdl-2.7.so
collectd 25335 root mem REG 254,0 130114 262190 /lib/libpthread-2.7.so
collectd 25335 root mem REG 254,0 119288 262197 /lib/ld-2.7.so
collectd 25335 root mem REG 254,0 13168 123277
/usr/lib/libgpg-error.so.0.3.0
collectd 25335 root 0r FIFO 0,6 1285434092 pipe
collectd 25335 root 1w FIFO 0,6 1285434093 pipe
collectd 25335 root 2w FIFO 0,6 1285434094 pipe
Not much useful in there that I can see.
> While looking into this I did find a path in "exec_read_one()" where the
> function returned without clearing the "PL_RUNNING" flag. I don't see
> how this could produce a child process, but maybe it's worth a try. The
> commit is 66c0d62 ([0]).
>
> Regards,
> —octo
> [0] <http://github.com/octo/collectd/commit/66c0d62a769d8bb363c8d19e82896d6cf5bdcc2b>
I'll give that a shot. Thanks a ton for all your help here.
Ryan
More information about the collectd
mailing list