[collectd] randomly getting dl_open_worker assertion

Riches Jr, Robert M robert.m.riches.jr at intel.com
Thu Feb 21 17:01:18 CET 2013


(Apologies for the M$ formatting--haven't figured out how to make the company-provided email client quote/reply properly.)

Thanks for the info that the dlopen'ed shared object is not closed but remains open when fork() is called.  That and the #4578 glibc bug report do resonate with this situation.  Pressing '1' while running 'top' on the test machine shows 32 logical cpus, numbered 0 through 31, inclusive, so SMP-related concurrency issues definitely appear possible.  According to 'dpkg', it appears I'm running glibc 2.15-0ubuntu10.3 on this Ubuntu 12.04.2 system (plus updates).  The assertion failure happens around 30-50% of the time I attempt to start collectd with the exec plugin enabled.  I can almost always get a failure within about 2-7 attempts.
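
For anyone who wants to gather the same two data points (logical CPU count and libc version) from a program rather than from 'top' and 'dpkg', here is a minimal sketch in C. It assumes a glibc-based Linux system. The helper name describe_host is made up for illustration and is not part of collectd. Note that gnu_get_libc_version() reports the upstream glibc version string, not the distribution package revision that 'dpkg' shows.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <gnu/libc-version.h>

/* Hypothetical helper (not part of collectd): writes the number of
 * logical CPUs currently online and the running glibc version into buf.
 * Returns the snprintf() result, i.e. the length of the formatted text. */
int describe_host(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    /* _SC_NPROCESSORS_ONLN is a glibc extension: logical CPUs online. */
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);

    /* GNU-specific call; returns the upstream glibc version string. */
    const char *libc = gnu_get_libc_version();

    return snprintf(buf, len, "cpus=%ld glibc=%s", cpus, libc);
}
```

Calling describe_host(buf, sizeof buf) from a small main() and printing buf should give one line suitable for pasting into a bug report.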

Is there a possibility that the dlopen'ed shared object could be finalized or tidied up before the fork()?  I would think there must be lots of other programs out there that dlopen() a shared object and then later call fork() and exec...().

Thanks,

Robert Riches

-----Original Message-----
From: Florian Forster [mailto:octo at collectd.org] 
Sent: Thursday, February 21, 2013 2:04 AM
To: Riches Jr, Robert M
Cc: collectd at verplant.org
Subject: Re: [collectd] randomly getting dl_open_worker assertion

Hi Robert,

On Wed, Feb 20, 2013 at 04:21:16PM +0000, Riches Jr, Robert M wrote:
> [2013-02-20 08:01:31] exec plugin: exec_read_one: error = Inconsistency detected by ld.so: dl-open.c: 221: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!

this appears to be an assertion within glibc's implementation of dlopen(3). [0] It looks like this bug from 2007 could be related: [1]

> There doesn't seem to be any rhyme or reason as to whether I get the 
> expected result or the assertion failure.  I've googled for answers 
> until my keyboard is wearing out, but nothing has come up that shows 
> promise of a solution.

From what you describe, it feels like a concurrency issue. collectd uses dlopen() to load the plugins, including the exec plugin. This happens at start-up only; later the mechanism is no longer used, but the dlopen'ed shared objects are never closed, so they are still open when fork() is called.
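
The sequence described above (dlopen() at start-up, handle never closed, fork() and exec later) can be sketched in C roughly as follows. This is only an illustration of the pattern, not collectd's actual code: the function names load_plugin and run_child are hypothetical, and the sketch is single-threaded, so it is not expected to reproduce the assertion failure by itself.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for plugin loading: dlopen() the shared object
 * and deliberately keep the handle open (dlclose() is never called). */
void *load_plugin(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL)
        fprintf(stderr, "dlopen(%s): %s\n", path, dlerror());
    return handle;
}

/* Hypothetical stand-in for what an exec-style plugin does later:
 * fork(), exec the program in the child, wait for it in the parent.
 * Returns the child's exit status, or -1 on error. */
int run_child(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the dlopen'ed objects are still mapped at this point. */
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp() failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would first run load_plugin() on some shared object (for example "libm.so.6" on a glibc system) and afterwards run_child() with an argv such as { "true", NULL }. On older glibc versions the program has to be linked with -ldl.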

> Regarding the behavior when I run the real script that doesn't send 
> anything to stderr, […]

I don't think this is related to I/O. It sounds more like a problem between dlopen() and fork().

How many processors does the machine have on which this problem occurs?
Which libc are you using? Approximately, how often does this happen?

Best regards,
—octo

[0] <http://code.woboq.org/userspace/glibc/elf/dl-open.c.html#259>
[1] <http://www.sourceware.org/bugzilla/show_bug.cgi?id=4578>
--
collectd – The system statistics collection daemon
Website: http://collectd.org
Google+: http://collectd.org/+
GitHub:  https://github.com/collectd
Twitter: http://twitter.com/collectd

