[collectd] randomly getting dl_open_worker assertion

Thanks for the info that the dlopen'ed shared object is not closed but remains open when fork() is called.  That and glibc bug report #4578 do resonate with this situation.  Pressing '1' while running 'top' on the test machine shows 32 logical CPUs, numbered 0 through 31 inclusive, so SMP-related concurrency issues definitely appear possible.  According to 'dpkg', I'm running glibc 2.15-0ubuntu10.3 on this Ubuntu 12.04.2 system (plus updates).  The assertion failure happens in roughly 30-50% of my attempts to start collectd with the exec plugin enabled.  I can almost always get a failure within about 2-7 attempts.

Is there a possibility that the dlopen'ed shared object could be finalized or tidied up before the fork()?  I would think there must be lots of other programs out there that dlopen() a shared object and then later call fork() and exec...().


Robert Riches

Hi Robert,

On Wed, Feb 20, 2013 at 04:21:16PM +0000, Riches Jr, Robert M wrote:
> [2013-02-20 08:01:31] exec plugin: exec_read_one: error = Inconsistency detected by ld.so: dl-open.c: 221: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!

This appears to be an assertion within glibc's implementation of dlopen(3). [0] It looks like this bug from 2007 could be related: [1]

> There doesn't seem to be any rhyme or reason as to whether I get the 
> expected result or the assertion failure.  I've googled for answers 
> until my keyboard is wearing out, but nothing has come up that shows 
> promise of a solution.

From what you describe, it feels like a concurrency issue. collectd uses dlopen() to load its plugins, including the exec plugin. This happens at start-up only; the mechanism is not used afterwards, but the dlopen'ed shared objects are never closed, so they are still open when fork() is called.
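
To make the sequence described above concrete, here is a minimal sketch of the "dlopen(), keep the handle open, then fork() and exec" pattern. It is not collectd code: the helper name load_then_fork_exec() and the choice of "true" as the program to exec are assumptions made purely for illustration. It is single-threaded, so it only shows the pattern and is not expected to reproduce the assertion failure.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of collectd): dlopen() a shared object,
 * leave the handle open, then fork() and exec a trivial program in the
 * child. Returns the child's exit status, or -1 on any error. */
int load_then_fork_exec(const char *soname)
{
    void *handle = dlopen(soname, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }
    /* Deliberately no dlclose(): the shared object is still open
     * when fork() is called below. */

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: replace the process image. "true" is an assumed
         * stand-in for whatever script the parent would really run. */
        execlp("true", "true", (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }

    /* Parent: wait for the child and report how it exited. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling load_then_fork_exec("libm.so.6") from a small main() exercises the pattern; on older glibc versions the program has to be linked with -ldl. A closer approximation of the failing case would presumably have to call dlopen() and fork() from different threads at the same time, which this sketch does not attempt.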

> Regarding the behavior when I run the real script that doesn't send 
> anything to stderr, […]

I don't think this is related to I/O. It sounds more like a problem between dlopen() and fork().

How many processors does the machine have on which this problem occurs?
Which libc are you using? Approximately how often does this happen?

Best regards,

[0] <http://code.woboq.org/userspace/glibc/elf/dl-open.c.html#259>
[1] <http://www.sourceware.org/bugzilla/show_bug.cgi?id=4578>
