[collectd] exec plugin stuck on mutex

Tue Mar 9 19:03:53 CET 2010

Hi Ryan,

On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote:
>     collectd 4.8.1, http://collectd.org/

No relevant problem has been fixed in the Exec plugin since that
release, so the issue should still exist in the master branch.

> Once I notice the plugin has stopped reporting, I have an extra
> process (28489) hanging around:
> 
>     $ pstree -apu 22935
>     collectdmon,22935 -P /var/run/collectdmon.pid -- -C /etc/collectd/collectd.conf
>       collectd,22936 -C /etc/collectd/collectd.conf -f
>           collectd,28489 -C /etc/collectd/collectd.conf -f
>           {collectd},22937
>           {collectd},22938
>           {collectd},22939
>           {collectd},22940
>           {collectd},22941
>           {collectd},28487

> That process seems to exist only when the exec plugin is no longer
> reporting. Sometimes there's two of these processes.

It looks as if the code that is supposed to spawn a new instance of the
script failed after fork(2) but before exec(2). That would leave a
forked copy of the daemon behind, which matches the pstree output.

There are several cases in which "exec_child()" never reaches the
exec(2), but they all emit an error message. I take it there is no error
message anywhere in the logs or in syslog?
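
To make the fork(2)/exec(2) window concrete, here is a minimal sketch of the pattern. It is not the plugin's "exec_child()"; the helper name "spawn_and_wait" and the exit code 127 are assumptions made for illustration. The point is that any code path in the child which does not end in a successful exec(2) should report the failure and terminate, otherwise a second copy of the parent process lingers.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not collectd code): fork, exec argv[0] in the child,
 * and wait for it. Returns the child's exit status, or -1 on failure. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1; /* fork(2) itself failed, no child exists */

    if (pid == 0) {
        /* Child: on success execvp() never returns. */
        execvp(argv[0], argv);

        /* Only reached if exec failed. Report with write(2), which is
         * async-signal-safe, then leave without running atexit handlers. */
        static const char msg[] = "exec failed\n";
        if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0) {
            /* nothing more we can do */
        }
        _exit(127);
    }

    /* Parent: wait for this specific child, retrying on EINTR. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with a program that does not exist returns 127 in this sketch, and the failure message shows up on stderr.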

> strace reports that the extra process is sitting in a mutex. It never
> leaves this state:
> 
>     $ sudo strace -p 28489
>     Process 28489 attached - interrupt to quit
>     futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL

There is a mutex in the exec plugin, but I doubt that this is the
problem. It is held just before a thread is spawned (to set a flag) and
just before that thread exits (to reset the flag). I don't see any way
this could lead to a deadlock or starvation.
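
The locking pattern described above can be sketched roughly as follows. All names here ("flag_lock", "running", "start_worker", "worker", "is_running") are made up for the example and the plugin's real structures differ. The mutex is only ever held for the few instructions that touch the flag.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical names; this only illustrates the set/reset pattern. */
static pthread_mutex_t flag_lock = PTHREAD_MUTEX_INITIALIZER;
static bool running = false;

void *worker(void *arg)
{
    (void)arg;
    /* ... the actual work would happen here, without the lock held ... */

    /* Just before the thread exits: reset the flag. */
    pthread_mutex_lock(&flag_lock);
    running = false;
    pthread_mutex_unlock(&flag_lock);
    return NULL;
}

/* Returns 0 if a worker was started, 1 if one is already running,
 * -1 if the thread could not be created. */
int start_worker(pthread_t *tid)
{
    /* Just before the thread is spawned: set the flag. */
    pthread_mutex_lock(&flag_lock);
    if (running) {
        pthread_mutex_unlock(&flag_lock);
        return 1;
    }
    running = true;
    pthread_mutex_unlock(&flag_lock);

    if (pthread_create(tid, NULL, worker, NULL) != 0) {
        /* Undo the flag so a later attempt is not blocked forever. */
        pthread_mutex_lock(&flag_lock);
        running = false;
        pthread_mutex_unlock(&flag_lock);
        return -1;
    }
    return 0;
}

bool is_running(void)
{
    pthread_mutex_lock(&flag_lock);
    bool r = running;
    pthread_mutex_unlock(&flag_lock);
    return r;
}
```

Because the lock is never held across a blocking call in this pattern, two threads cannot wait on each other through it.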

I'm much more concerned about the SIGCHLD handler and the various
waitpid(2)s in the code. I could see the controlling thread missing its
child's signal and waiting forever. This shouldn't create weird new
processes though.

> Any ideas what might be going on here or information I could provide
> to help find a root cause?

I'm a bit puzzled by the described behavior, I have to admit. Maybe you
could provide the "lsof -p $PID" output for one of those weird child
processes?

While looking into this I did find a path in "exec_read_one()" where the
function returned without clearing the "PL_RUNNING" flag. I don't see
how this could produce a child process, but maybe it's worth a try. The
commit is 66c0d62 ([0]).
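
The kind of bug fixed there can be illustrated like this. This is a simplified sketch, not the code from the commit: "struct program", "read_one", "pl_lock" and the numeric value of "PL_RUNNING" are assumptions, and unlike the plugin the sketch sets the flag inside the same function. What matters is that every exit path goes through the cleanup that clears the flag.

```c
#include <pthread.h>

#define PL_RUNNING 0x10 /* flag name from the mail; the value is made up */

/* Hypothetical stand-in for the plugin's per-program state. */
struct program {
    int flags;
};

static pthread_mutex_t pl_lock = PTHREAD_MUTEX_INITIALIZER;

/* simulate_error != 0 takes the early error path. */
int read_one(struct program *pl, int simulate_error)
{
    int status = 0;

    pthread_mutex_lock(&pl_lock);
    pl->flags |= PL_RUNNING;
    pthread_mutex_unlock(&pl_lock);

    if (simulate_error) {
        status = -1;
        /* A bare "return -1;" here would leave PL_RUNNING set, and the
         * program would never be started again. */
        goto cleanup;
    }

    /* ... read the child's output here ... */

cleanup:
    pthread_mutex_lock(&pl_lock);
    pl->flags &= ~PL_RUNNING;
    pthread_mutex_unlock(&pl_lock);
    return status;
}
```

With the single cleanup label, the flag is cleared whether the function succeeds or fails.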

Regards,
-octo

[0] <http://github.com/octo/collectd/commit/66c0d62a769d8bb363c8d19e82896d6cf5bdcc2b>
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/