[collectd] randomly getting dl_open_worker assertion
Riches Jr, Robert M
robert.m.riches.jr at intel.com
Thu Feb 21 17:01:18 CET 2013
(Apologies for the M$ formatting--haven't figured out how to make the company-provided email client quote/reply properly.)
Thanks for the info that the dlopen'ed shared object is not closed but remains open when fork() is called. That and the #4578 glibc bug report do resonate with this situation. Pressing '1' while running 'top' on the test machine shows 32 logical CPUs, numbered 0 through 31 inclusive, so SMP-related concurrency issues definitely appear possible. According to 'dpkg', I'm running glibc 2.15-0ubuntu10.3 on this Ubuntu 12.04.2 system (plus updates). The assertion failure happens around 30-50% of the time I attempt to start collectd with the exec plugin enabled. I can almost always get a failure within about 2-7 attempts.
Is there a possibility that the dlopen'ed shared object could be finalized or tidied up before the fork()? I would think there must be lots of other programs out there that dlopen() a shared object and then later call fork() and exec...().
From: Florian Forster [mailto:octo at collectd.org]
Sent: Thursday, February 21, 2013 2:04 AM
To: Riches Jr, Robert M
Cc: collectd at verplant.org
Subject: Re: [collectd] randomly getting dl_open_worker assertion
On Wed, Feb 20, 2013 at 04:21:16PM +0000, Riches Jr, Robert M wrote:
> [2013-02-20 08:01:31] exec plugin: exec_read_one: error = Inconsistency detected by ld.so: dl-open.c: 221: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!
This appears to be an assertion within glibc's implementation of dlopen(3). It looks like this bug from 2007 could be related:
> There doesn't seem to be any rhyme or reason as to whether I get the
> expected result or the assertion failure. I've googled for answers
> until my keyboard is wearing out, but nothing has come up that shows
> promise of a solution.
From what you describe, it feels like a concurrency issue. collectd uses dlopen() to load the plugins, including the exec plugin. This happens at start-up only; later the mechanism is no longer used, but the dlopen'ed shared objects are never closed, so they are still open when fork() is called.
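To make the sequence concrete, here is a minimal sketch of "dlopen() at start-up, then fork() and exec later" in C. It is not collectd's actual code: the function name load_then_fork_exec and both of its parameters are made up for illustration, and on older glibc versions it needs to be linked with -ldl.

```c
/* Sketch only: not taken from collectd. Older glibc needs: cc sketch.c -ldl */
#include <dlfcn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Loads a shared object, leaves it open, then forks and execs a program.
 * Returns the child's exit status, or -1 on error.
 * Hypothetical helper; so_path and prog are illustrative parameters. */
int load_then_fork_exec(const char *so_path, const char *prog)
{
    /* Stand-in for plugin loading at start-up; the handle is never dlclose()d. */
    void *handle = dlopen(so_path, RTLD_NOW);
    if (handle == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the shared object stays mapped until exec replaces the image. */
        execl(prog, prog, (char *)NULL);
        _exit(127); /* reached only if exec failed */
    }

    /* Parent: wait for the child and report how it exited. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling load_then_fork_exec() with the path of any loadable shared object and the path of a small program exercises the same order of calls. Whether a loop around it would reproduce the assertion is not something this sketch claims.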
> Regarding the behavior when I run the real script that doesn't send
> anything to stderr, […]
I don't think this is related to I/O. It sounds more like a problem between dlopen() and fork().
How many processors does the machine have on which this problem occurs?
Which libc are you using? Approximately, how often does this happen?
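If it is easier than digging through the package manager, the first two questions can also be answered from a small C program. This is only a sketch and assumes a glibc-based system: describe_host is a made-up name, while sysconf(_SC_NPROCESSORS_ONLN) and gnu_get_libc_version() are the calls glibc provides for this.

```c
/* Sketch only: assumes glibc (gnu/libc-version.h is glibc-specific). */
#include <gnu/libc-version.h>
#include <stdio.h>
#include <unistd.h>

/* Writes "cpus=<n> glibc=<version>" into buf and returns the number of
 * online logical CPUs. Hypothetical helper, named for illustration. */
long describe_host(char *buf, size_t len)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    snprintf(buf, len, "cpus=%ld glibc=%s", ncpu, gnu_get_libc_version());
    return ncpu;
}
```

Note that gnu_get_libc_version() reports the upstream glibc version only, not the distribution's package revision.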