[collectd] exec plugin stuck on mutex
Florian Forster
octo at verplant.org
Tue Mar 9 19:03:53 CET 2010
Hi Ryan,
On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote:
> collectd 4.8.1, http://collectd.org/
No relevant problem has been fixed in the Exec plugin in the meantime,
so the problem should still exist in the master branch.
> Once I notice the plugin has stopped reporting, I have an extra
> process (28489) hanging around:
>
> $ pstree -apu 22935
> collectdmon,22935 -P /var/run/collectdmon.pid -- -C
> /etc/collectd/collectd.conf
> collectd,22936 -C /etc/collectd/collectd.conf -f
> collectd,28489 -C /etc/collectd/collectd.conf -f
> {collectd},22937
> {collectd},22938
> {collectd},22939
> {collectd},22940
> {collectd},22941
> {collectd},28487
> That process seems to exist only when the exec plugin is no longer
> reporting. Sometimes there's two of these processes.
This looks like the code that is supposed to spawn a new instance of the
script failed after fork(2) but before exec(2).
There are various cases in which the exec(2) is not reached in
"exec_child()", but they all emit an error message. I take it there is
no error message somewhere in the logs or in syslog?
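To illustrate the window between fork(2) and exec(2), here is a minimal sketch. It is not the plugin's "exec_child()"; the names "spawn_program" and "wait_for_exit" and the exit code 127 are assumptions made up for this example. The point is only that every path in the child that does not reach a successful exec should report an error and exit.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not collectd code): fork, then exec argv[0] in the
 * child. Returns the child's PID, or -1 if fork(2) itself failed. */
static pid_t spawn_program(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: everything from here up to a successful execvp() is the
         * "after fork(2) but before exec(2)" window. */
        execvp(argv[0], argv);

        /* Only reached if exec failed. Report it and leave immediately;
         * write(2) and _exit(2) are safe to call in a forked child. */
        static const char msg[] = "exec failed\n";
        if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0) {
            /* nothing more we can do */
        }
        _exit(127);
    }

    return pid; /* parent */
}

/* Hypothetical helper: wait for one child, return its exit status or -1. */
static int wait_for_exit(pid_t pid)
{
    int status;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A child that neither executes the script nor prints such a message would have to be stuck somewhere before the exec call, which would match a process that sits there forever.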
> strace reports that the extra process is sitting in a mutex. It never
> leaves this state:
>
> $ sudo strace -p 28489
> Process 28489 attached - interrupt to quit
> futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL
There is a mutex in the exec plugin, but I doubt that this is the
problem. It is held just before a thread is spawned (to set a flag) and
just before that thread exits (to reset the flag). I don't see any way
this could lead to a deadlock or starvation.
I'm much more concerned about the SIGCHLD handler and the various
waitpid(2)s in the code. I could see the controlling thread missing its
child's signal and waiting forever. This shouldn't create weird new
processes though.
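For reference, one common way to avoid depending on a single signal per child is to reap in a loop with WNOHANG, since several pending SIGCHLDs may be delivered as one. The following is a generic sketch of that technique, not what the plugin currently does; "reap_children" is a name made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not collectd code): collect every child that has
 * already exited, without blocking. Returns the number of children reaped. */
static int reap_children(void)
{
    int reaped = 0;
    int status;

    for (;;) {
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            reaped++;  /* one exited child collected, look for more */
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue;  /* interrupted by a signal, just retry */

        /* pid == 0: children exist but none has exited yet.
         * pid < 0 with ECHILD: there are no children at all. */
        break;
    }
    return reaped;
}
```

Called from the place that reacts to SIGCHLD, or periodically, a loop like this picks up every exited child even if fewer signals arrive than children exit.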
> Any ideas what might be going on here or information I could provide
> to help find a root cause?
I'm a bit puzzled by the described behavior, I have to admit. Maybe you
could provide the "lsof -p $PID" output for one of those weird child
processes?
While looking into this I did find a path in "exec_read_one()" where the
function returned without clearing the "PL_RUNNING" flag. I don't see
how this could produce a child process, but maybe it's worth a try. The
commit is 66c0d62 ([0]).
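The kind of bug fixed there can be sketched as follows. This is not the code from the commit: the struct, the value of "PL_RUNNING", and the functions "start_reader" and "read_one" are assumptions for illustration only. The flag is set under the mutex before the thread is spawned, and every exit path of the thread goes through the same cleanup so the flag is always reset.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

#define PL_RUNNING 0x10 /* hypothetical flag value */

typedef struct {
    int flags;
    int fd; /* descriptor to read from; negative means "nothing to read" */
} program_t; /* hypothetical, much smaller than the real structure */

static pthread_mutex_t pl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: all exits, including the early one, pass through "done". */
static void *read_one(void *arg)
{
    program_t *pl = arg;
    char buf[256];

    if (pl->fd < 0)
        goto done; /* early exit: returning here directly would leave
                    * PL_RUNNING set forever */

    while (read(pl->fd, buf, sizeof(buf)) > 0) {
        /* a real reader would parse the data here */
    }

done:
    pthread_mutex_lock(&pl_lock);
    pl->flags &= ~PL_RUNNING;
    pthread_mutex_unlock(&pl_lock);
    return NULL;
}

/* Returns 0 if a thread was started, 1 if one is already running,
 * -1 if the thread could not be created. */
static int start_reader(program_t *pl, pthread_t *tid)
{
    pthread_mutex_lock(&pl_lock);
    if (pl->flags & PL_RUNNING) {
        pthread_mutex_unlock(&pl_lock);
        return 1;
    }
    pl->flags |= PL_RUNNING;
    pthread_mutex_unlock(&pl_lock);

    if (pthread_create(tid, NULL, read_one, pl) != 0) {
        pthread_mutex_lock(&pl_lock);
        pl->flags &= ~PL_RUNNING; /* undo, no thread will reset it */
        pthread_mutex_unlock(&pl_lock);
        return -1;
    }
    return 0;
}
```

With a stale flag, the "already running" branch would be taken on every later attempt, so the script would simply never be started again.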
Regards,
-octo
[0] <http://github.com/octo/collectd/commit/66c0d62a769d8bb363c8d19e82896d6cf5bdcc2b>
--
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/