[collectd] exec plugin stuck on mutex

Ryan Tomayko r at tomayko.com
Tue Mar 9 20:46:20 CET 2010


On Tue, Mar 9, 2010 at 10:03 AM, Florian Forster <octo at verplant.org> wrote:
> Hi Ryan,
>
> On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote:
>>     collectd 4.8.1, http://collectd.org/
>
> no applicable problem has been fixed in the Exec plugin in the meantime,
> so the problem should still exist in the master branch.
>
>> Once I notice the plugin has stopped reporting, I have an extra
>> process (28489) hanging around:
>>
>>     $ pstree -apu 22935
>>     collectdmon,22935 -P /var/run/collectdmon.pid -- -C
>> /etc/collectd/collectd.conf
>>       collectd,22936 -C /etc/collectd/collectd.conf -f
>>           collectd,28489 -C /etc/collectd/collectd.conf -f
>>           {collectd},22937
>>           {collectd},22938
>>           {collectd},22939
>>           {collectd},22940
>>           {collectd},22941
>>           {collectd},28487
>
>> That process seems to exist only when the exec plugin is no longer
>> reporting. Sometimes there's two of these processes.
>
> This looks like the code that is supposed to spawn a new instance of the
> script failed after fork(2) but before exec(2).
>
> There are various cases in which the exec(2) is not reached in
> "exec_child()", but they all emit an error message. I take it there is
> no error message somewhere in the logs or in syslog?

I'm embarassed to admit I have not had the logs turned on. There's
very likely an error message I'm not seeing.  I'll switch them on
today and see what happens.

>> strace reports that the extra process is sitting in a mutex. It never
>> leaves this state:
>>
>>     $ sudo strace -p 28489
>>     Process 28489 attached - interrupt to quit
>>     futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL
>
> There is a mutex in the exec plugin, but I doubt that this is the
> problem. It is held just before a thread is spawned (to set a flag) and
> just before that thread exits (to reset the flag). I don't see any way
> this could lead to a deadlock or starvation.
>
> I'm much more concerned about the SIGCHLD handler and the various
> waitpid(2)s in the code. I could see the controlling thread missing its
> child's signal and waiting forever. This shouldn't create weird new
> processes though.
>
>> Any ideas what might be going on here or information I could provide
>> to help find a root cause?
>
> I'm a bit puzzled by the described behavior, I have to admit. Maybe you
> could provide the "lsof -p $PID" output for one of those weird child
> processes?

Here's the lsof output:

# lsof -p 25335
COMMAND    PID USER   FD   TYPE DEVICE    SIZE       NODE NAME
collectd 25335 root  cwd    DIR  254,0    4096     650195 /var/lib/collectd
collectd 25335 root  rtd    DIR  254,0    4096          2 /
collectd 25335 root  txt    REG  254,0  155960     114923 /usr/sbin/collectd
collectd 25335 root  mem    REG  254,0   93016     262221 /lib/libgcc_s.so.1
collectd 25335 root  mem    REG  254,0   72568     262188 /lib/libresolv-2.7.so
collectd 25335 root  mem    REG  254,0   18752     262195 /lib/libnss_dns-2.7.so
collectd 25335 root  mem    REG  254,0   47520     262191
/lib/libnss_files-2.7.so
collectd 25335 root  mem    REG  254,0  421512     122929
/usr/lib/libgcrypt.so.11.4.4
collectd 25335 root  mem    REG  254,0   40560     139512
/usr/lib/collectd/network.so
collectd 25335 root  mem    REG  254,0    6624     139559
/usr/lib/collectd/memory.so
collectd 25335 root  mem    REG  254,0   12424     139514
/usr/lib/collectd/memcached.so
collectd 25335 root  mem    REG  254,0    5288     139551
/usr/lib/collectd/load.so
collectd 25335 root  mem    REG  254,0    7208     139530
/usr/lib/collectd/interface.so
collectd 25335 root  mem    REG  254,0   22128     139513
/usr/lib/collectd/exec.so
collectd 25335 root  mem    REG  254,0   13344     139548
/usr/lib/collectd/df.so
collectd 25335 root  mem    REG  254,0    6672     139586
/usr/lib/collectd/cpu.so
collectd 25335 root  mem    REG  254,0    8040     139515
/usr/lib/collectd/logfile.so
collectd 25335 root  mem    REG  254,0 1375536     262194 /lib/libc-2.7.so
collectd 25335 root  mem    REG  254,0   14616     262198 /lib/libdl-2.7.so
collectd 25335 root  mem    REG  254,0  130114     262190 /lib/libpthread-2.7.so
collectd 25335 root  mem    REG  254,0  119288     262197 /lib/ld-2.7.so
collectd 25335 root  mem    REG  254,0   13168     123277
/usr/lib/libgpg-error.so.0.3.0
collectd 25335 root    0r  FIFO    0,6         1285434092 pipe
collectd 25335 root    1w  FIFO    0,6         1285434093 pipe
collectd 25335 root    2w  FIFO    0,6         1285434094 pipe

Not much useful in there that I can see.

> While looking into this I did find a path in "exec_read_one()" where the
> function returned without clearing the "PL_RUNNING" flag. I don't see
> how this could produce a child process, but maybe it's worth a try. The
> commit is 66c0d62 ([0]).
>
> Regards,
> —octo
> [0] <http://github.com/octo/collectd/commit/66c0d62a769d8bb363c8d19e82896d6cf5bdcc2b>

I'll give that a shot. Thanks a ton for all your help here.

Ryan



More information about the collectd mailing list