[collectd] failure :( caused by the network malfunction

Luboš Staněk lubek at users.sourceforge.net
Fri Nov 24 16:18:15 CET 2006


Hi Florian,
I do not know where the problem lies.

Florian Forster wrote:
> Hi,
>
> On Thu, Nov 23, 2006 at 07:18:03PM +0100, Luboš Staněk wrote:
>> The collectd was not able to gather any data from plugins due to
>> network timeouts.

At first I would have bet that it was caused by the ping plugin.

But the log contains records from the main loop ("Not sleeping...").
This means that the daemon's plugin_read_all() loop did run,
although it took more than 2 minutes.
In that case I would expect at least some records in the .rrd files.
But all .rrd files have only NaN records for more than an hour.
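
To make the main-loop behaviour above concrete, here is a minimal sketch in C of a loop pass that measures how long the reads took and skips the sleep when the step was overrun. It is not collectd's actual code: step_remaining(), run_pass() and dummy_read() are hypothetical names, and dummy_read() merely stands in for a function such as plugin_read_all().

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not collectd's code): seconds left in the current
 * step after a read pass, or 0 if the pass already used up the step. */
static double step_remaining(double elapsed, double step)
{
    assert(step > 0.0);
    return (elapsed >= step) ? 0.0 : step - elapsed;
}

/* Stand-in for a real read function such as plugin_read_all(). */
static void dummy_read(void)
{
}

/* Run one read pass and return how many seconds the caller should sleep
 * before the next pass. A return value of 0 means "do not sleep". */
static double run_pass(void (*read_all)(void), double step)
{
    time_t start = time(NULL);
    double elapsed, rest;

    read_all();
    elapsed = difftime(time(NULL), start);
    rest = step_remaining(elapsed, step);

    if (rest == 0.0)
        fprintf(stderr, "Not sleeping: pass took %.0f s, step is %.0f s\n",
                elapsed, step);
    return rest;
}
```

With an assumed step of 10 seconds, a pass that took 130 seconds gives step_remaining(130.0, 10.0) == 0.0, so the next pass starts immediately, while a 3 second pass leaves 7 seconds to sleep.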

I will try to simulate similar failure and test it with the debug build.


> 
> I think with this number of plugins we should start thinking about
> threads more seriously. Sebastian has already investigated a little in
> this direction and my biggest concern, `errno', doesn't seem to be such
> a problem after all.
> 
> I'll investigate this issue further..
> 

The number of plugins does contribute to exceeding the step time. But
the problem has much more to do with blocking calls, which exceed the
step time by a far larger margin.

I have mentioned threads several times before.

I did some investigation too. Although I would like to see kernel
threads used because of their multiprocessor support, it seems that
the best option will be 'pth' because of its multiplatform support.
Because of the nature of 'pth', all plugins would have to be checked for
blocking calls, and those calls replaced with the available 'pth_' functions.
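
As a rough illustration of why threads help with blocking calls, here is a minimal sketch in C. Note that it uses POSIX threads (pthreads), not 'pth', simply because pthreads are available without an extra library. The struct plugin record and the functions plugin_read_thread() and read_all_threaded() are hypothetical and do not reflect collectd's real plugin interface. The sleep() call stands in for a blocking network call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

#define MAX_PLUGINS 16

/* Hypothetical plugin record; collectd's real structures differ. */
struct plugin {
    const char *name;
    unsigned delay; /* seconds the simulated read blocks */
    int done;       /* set to 1 once the read has finished */
};

/* Thread body: perform one (simulated) blocking read for one plugin. */
static void *plugin_read_thread(void *arg)
{
    struct plugin *p = arg;

    sleep(p->delay); /* stands in for a blocking call, e.g. a timeout */
    p->done = 1;
    return NULL;
}

/* Start one thread per plugin, wait for all of them, and return the
 * number of plugins whose read completed. */
static int read_all_threaded(struct plugin *plugins, int n)
{
    pthread_t tids[MAX_PLUGINS];
    int started = 0, finished = 0;

    assert(n >= 0 && n <= MAX_PLUGINS);

    for (int i = 0; i < n; i++) {
        if (pthread_create(&tids[i], NULL, plugin_read_thread,
                           &plugins[i]) != 0)
            break; /* stop spawning; join what was started so far */
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        finished += plugins[i].done;
    }
    return finished;
}
```

In this sketch the whole pass takes roughly as long as the slowest read rather than the sum of all reads. With a non-preemptive library such as 'pth' the structure would be similar, but the blocking call itself would have to be a 'pth_' variant, otherwise it would stall every thread.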

Best regards,
Lubos


