[collectd] Strange issue with exec and unixsock plugin

XANi xani666 at gmail.com
Mon Aug 23 15:24:40 CEST 2010


On 2010-08-23 (Mon) at 15:05 +0200, Sebastian Harl wrote:

> On Mon, Aug 23, 2010 at 02:50:33PM +0200, XANi wrote:
> > On 2010-08-23 (Mon) at 13:42 +0200, Sebastian Harl wrote:
> > > On Mon, Aug 23, 2010 at 01:34:08PM +0200, XANi wrote:
> > > > On 2010-08-23 (Mon) at 13:11 +0200, Sebastian Harl wrote:
> > > > > On Mon, Aug 23, 2010 at 04:02:57AM +0200, XANi wrote:
> > > > > > So after running something like:  
> > > > > > while sleep 30 ; do /etc/init.d/collectd restart; done
> > > > > > after some time (sometimes a few minutes, sometimes an hour or more) I get
> > > > > > tons of collectd processes lying around (I've added the output of ps aux as
> > > > > > an attachment) and sometimes after restart.
> > > > > […]
> > > > > > It seems to trigger when both the exec and unixsock plugins are on; if I
> > > > > > turn off one of them it works fine. Ah, and I'm using 64-bit Debian
> > > > > > testing.
> > > > > 
> > > > > Uhm, strange. Could you please check (e.g. using "strace -p <pid>") what
> > > > > those collectd processes are doing? What's the parent of those processes
> > > > > (PPID in "ps ax -l" or use something like "ps axjf")? Are you able to
> > > > > kill those processes using signal SIGINT or SIGTERM?
> > > 
> > > > Ok so:
> > > > --
> > > > # ps ax |grep col
> > > > 4792 ?        SLsl   0:00 /usr/sbin/collectd -C /etc/collectd/collectd.conf -P /var/run/collectd.pid
> > > > 4800 ?        S      0:00 /usr/sbin/collectd -C /etc/collectd/collectd.conf -P /var/run/collectd.pid
> > > > --
> > > > Attached is the output of strace -t -ff -o /tmp/4792 -p 4792 and
> > > > strace -t -ff -o /tmp/4800 -p 4800
> > > > 
> > > > The parent of PID 4800 is 4792.
> > > > 4792 reacts to SIGTERM; for 4800 neither SIGTERM nor SIGQUIT works, only
> > > > SIGKILL.
> > > 
> > > > 4800.4800:
> > > > 13:25:33 futex(0x7fe9098f7550, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
> > > 
> > > Thanks. Looks like some kind of deadlock :-/ I'll look into that.
> > 
> > If you want, I can give you access to a VM with that bug already "triggered"
> > and root access so you can install debug tools; just send me your ssh pubkey.
> 
> Thanks. I'll have a look at the code first but I might come back to that
> offer after that ;-) Not quite sure when I'll have some time for that
> though. Possibly some time this week.
> 
> Cheers,
> Sebastian
> 


I've noticed it's much easier to trigger on a VM too (maybe because the
host is quite busy with other machines): on my desktop it sometimes takes
an hour or two to trigger, while on the VM it triggers after a few minutes
at most. I also noticed that the "locked" process runs as the user I told
the exec plugin to run the script as, so
Exec postfix "/usr/local/bin/a.pl"
results in:
template:~# ps aux |grep coll|grep -v grep
root      2469  0.0  0.2 162764  1436 ?        S<Lsl 15:22   0:00 /usr/sbin/collectd -C /etc/collectd/collectd.conf -P /var/run/collectd.pid
postfix   2476  0.0  0.2 101408  1168 ?        S<   15:22   0:00 /usr/sbin/collectd -C /etc/collectd/collectd.conf -P /var/run/collectd.pid
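
For anyone who wants to check for the same pattern, here is a minimal sketch of a filter that prints collectd processes whose parent is also a collectd process, together with the user they run as. The function name collectd_children and the four-column input format (PID, PPID, USER, COMMAND) are assumptions made for this illustration; they are not part of collectd.

```shell
#!/bin/sh
# Hypothetical helper: list collectd processes whose parent is itself a
# collectd process. Reads one process per line from stdin in the form
# "PID PPID USER COMMAND" and prints "PID PPID USER" for each match.
collectd_children() {
    awk '
        # Remember every process whose command ends in "collectd".
        $4 ~ /collectd$/ { pid[$1] = 1; ppid[$1] = $2; user[$1] = $3 }
        END {
            # A match is a collectd process whose parent PID is also
            # in the set of collectd PIDs.
            for (p in pid)
                if (ppid[p] in pid)
                    print p, ppid[p], user[p]
        }
    ' | sort -n
}

# Live usage (assumes a ps that supports -o with empty headers):
#   ps -e -o pid=,ppid=,user=,comm= | collectd_children
```

A lingering child is expected to show up with the exec user in the third column, as PID 2476 does in the listing above.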

Hope that helps :)

-- 
Mariusz Gronczewski (XANi) <xani666 at gmail.com>
GnuPG: 0xEA8ACE64
http://devrandom.pl