[collectd] Exporting the collectd's loop variable for plugins

Luboš Staněk lubek at users.sourceforge.net
Mon Nov 6 15:58:22 CET 2006


Hi Florian,

Florian Forster wrote:
> On Sun, Nov 05, 2006 at 07:07:56PM +0100, Luboš Staněk wrote:
>> Well, the callback functions only handle the need to clean up the
>> plugin's resources which would be good of course. It does not break
>> the plugin's reading/submitting/writing process.
> 
> The idea behind the `loop' variable was to not break the process. You
> could, of course, pass a reference to `loop' to `plugin_read_all' and
> let it check if the termination has been initiated before calling each
> plugin's read function. But I don't quite see _why_ this would be an
> advantage yet. It speeds up the shutdown process, but why's that good/
> important?
> 

I consider it important because collectd is the slowest daemon to stop
(when the collect step happens to be running at that moment). As the
number of plugins grows, shutdown will take several tens of seconds in
the worst case, waiting for all plugins to finish the collect step.
Because collectd currently cannot reload its configuration, a restart is
the only way to run it with different settings.

Skipping the remaining tasks in the collecting step is not a problem,
because if we are restarting, the daemon will start sooner and will also
start to collect sooner.
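
To make the idea concrete, here is a minimal sketch in Python (a model
only, not collectd's C code). It assumes a flag that plays the role of
the `loop' variable and a read loop that checks it before each plugin,
as discussed for `plugin_read_all'. The names `stop_requested' and
`read_all' are hypothetical.

```python
import threading

# Hypothetical stand-in for collectd's `loop' variable: it is set once a
# termination signal has been received.
stop_requested = threading.Event()


def read_all(plugins, stop=stop_requested):
    """Call each plugin's read function in order, but stop early once
    shutdown has been requested.

    `plugins` is a list of (name, read_function) pairs. Returns the names
    of the plugins that were actually read in this collect step.
    """
    done = []
    for name, read in plugins:
        if stop.is_set():
            break  # skip the remaining plugins in this collect step
        read()
        done.append(name)
    return done
```

With such a check the shutdown delay is bounded by the slowest single
plugin rather than by the sum of all plugins in the step.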

I have a slightly modified init script, but "killproc $prog" is the same
as in the contrib and other FC/RH init scripts. The killproc sequence
is as follows (usleep 100000 = 0.1 seconds):

kill -TERM $pid
usleep 100000
if checkpid $pid && sleep 1 &&
	checkpid $pid && sleep 3 &&
	checkpid $pid ; then
	kill -KILL $pid >/dev/null 2>&1
	usleep 100000
fi

It waits approximately 4.1 seconds plus some overhead (probably 5-6
seconds in total) for the process to handle the -TERM, then it kills the
process. That is why the "Exiting normally" message is often missing
from my log.
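
The timing of the killproc sequence quoted above can be worked out with
a small Python sketch. The delays are taken from that sequence; the
function names are hypothetical, and the overhead of running checkpid
itself is ignored.

```python
from itertools import accumulate

# Delays from the killproc sequence, in milliseconds:
# usleep 100000, sleep 1, sleep 3.
DELAYS_MS = [100, 1000, 3000]


def check_times_ms(delays=DELAYS_MS):
    """Times after `kill -TERM` at which checkpid looks at the process."""
    return list(accumulate(delays))


def gets_killed(shutdown_ms, delays=DELAYS_MS):
    """True if a daemon needing `shutdown_ms` to exit is still alive at
    the last check and therefore receives `kill -KILL`."""
    return shutdown_ms > check_times_ms(delays)[-1]


print(check_times_ms())    # [100, 1100, 4100]
print(gets_killed(5000))   # True: a 5 s collect step is cut off
```

So any collect step that is still running 4.1 seconds after the -TERM
ends with -KILL, and the daemon never reaches its normal exit path.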


>> The collect step time requirements will increase with more plugins.
> 
> I'm afraid so. Since most of that is IO (or at least I hope it is IO)
> threads could significantly speed up one collection cycle. There has
> been quite some discussion a while back, but nothing has been done in
> this direction yet.
> 

I have read it already.
Kernel threads would be my preferred solution because of their
multiprocessor support, but I cannot judge their portability to other
systems.
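
As a rough illustration of why threads help when the collect step is
mostly IO, here is a Python sketch (again a model, not collectd code).
The readers only sleep to imitate waiting on a socket or a disk, and all
names are hypothetical. It shows the overlap of IO waits only; it says
nothing about multiprocessor scaling or portability.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_reader(name, delay=0.05):
    """Build a fake read function that blocks for `delay` seconds."""
    def read():
        time.sleep(delay)  # stands in for waiting on a socket or disk
        return name
    return read


def read_sequential(readers):
    """Run the read functions one after another."""
    return [read() for read in readers]


def read_threaded(readers, max_workers=8):
    """Run the read functions in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read) for read in readers]
        return [future.result() for future in futures]
```

With four such readers the sequential version takes about the sum of the
delays, while the threaded version takes about as long as the slowest
single reader.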


>> BTW I am writing one. :)
> 
> You make me curious ;)
> 

It has also come up in the discussions.
It is the BIND 9 statistics output (see the BIND Administrator
Reference Manual, Chapter 6, section 6.2.14.15).
I consider it more valuable than the one you mentioned (dnsgraph?).
I do not want to collect what clients are querying, but rather the
server's load. The server can split the global statistics by zone and
view, and one interesting statistic covers queries for the CHAOS class,
if you enable it.


>> I am getting also "Not sleeping because `timeval_sub_timespec'
>> returned non-zero!" on a slower machine during cron.daily jobs and
>> probably clamav scans.
> 
> That doesn't sound good :/
> 


I started collectd (3.10.2) with the fixed ntpd plugin on November 3rd,
in the evening. It runs on a really decent machine (Athlon64 X2 3800, 2
GB RAM, SATA 2 RAID 1, FC4 i386, serving the Samba domain, domain time
server, printing, mail, proxy, software compilation...).
collectd: local, apache, apcups, cpu, cpufreq, df, disk, hddtemp, load,
memory, nfs, ntpd, ping, processes, sensors (ExtendedNaming :)), swap,
traffic, users.

The first "Not sleeping..." occurred during cron.daily.
The second "Not sleeping..." occurred during simultaneous domain logins
from about three stations (Samba was serving huge roaming profiles).
The last log line shows that I restarted the daemon with a modified
configuration; "Exited normally" was missing.
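
A possible reading of the "Not sleeping..." message can be modelled in
Python. This is a sketch under an assumption that is not confirmed in
this thread: that the message is logged when the next scheduled wake-up
already lies in the past, so the time difference cannot be computed. The
step of 10 and the cycle durations are illustrative values only.

```python
def time_to_sleep(next_wakeup, now):
    """Seconds left until `next_wakeup`, or None if it has already
    passed (the model's analogue of a non-zero return value)."""
    remaining = next_wakeup - now
    if remaining < 0:
        return None
    return remaining


def overrun_cycles(step, durations):
    """Simulate collect cycles on a fixed schedule.

    `durations[i]` is how long cycle i takes. Returns the indices of the
    cycles that finished after their scheduled wake-up, i.e. the cycles
    after which the loop would not sleep.
    """
    now = 0
    next_wakeup = 0
    overruns = []
    for index, duration in enumerate(durations):
        next_wakeup += step      # wake-up scheduled after this cycle
        now += duration          # the collect step runs
        wait = time_to_sleep(next_wakeup, now)
        if wait is None:
            overruns.append(index)   # "Not sleeping..." would be logged
        else:
            now += wait              # sleep until the scheduled wake-up
    return overruns


print(overrun_cycles(10, [2, 14, 3, 3]))   # [1]
```

Under that assumption, a single collect step delayed by heavy disk or
CPU load (cron.daily, large profile transfers) is enough to produce one
such message, after which the loop catches up again.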

Here is the relevant extract from collectd's log:

Nov  3 22:41:07 test1 collectd[6879]: cpufreq found 2 cpu(s)
Nov  4 04:02:24 test1 collectd[6879]: Cannot open `/proc/25769/stat': No such file or directory
repeated 21 times for 2 minutes
Nov  5 04:15:21 test1 collectd[6879]: Not sleeping because `timeval_sub_timespec' returned non-zero!
Nov  5 04:23:01 test1 collectd[6879]: Cannot open `/proc/10497/stat': No such file or directory
repeated 55 times for 5 minutes
Nov  6 07:50:31 test1 collectd[6879]: Not sleeping because `timeval_sub_timespec' returned non-zero!
Nov  6 08:30:02 test1 collectd[6879]: Cannot open `/proc/12157/stat': No such file or directory
Nov  6 09:30:33 test1 collectd[6879]: Cannot open `/proc/13582/stat': No such file or directory
Nov  6 10:43:05 test1 collectd[6879]: Cannot open `/proc/21588/stat': No such file or directory
Nov  6 10:43:15 test1 collectd[6879]: Cannot open `/proc/24352/stat': No such file or directory
Nov  6 10:43:15 test1 collectd[6879]: Cannot open `/proc/24353/stat': No such file or directory
Nov  6 10:43:45 test1 collectd[6879]: Cannot open `/proc/4690/stat': No such file or directory
Nov  6 10:43:45 test1 collectd[6879]: Cannot open `/proc/4691/stat': No such file or directory
Nov  6 10:44:55 test1 collectd[6879]: Cannot open `/proc/10758/stat': No such file or directory
Nov  6 11:21:32 test1 collectd[12883]: cpufreq found 2 cpu(s)

Best regards,
Lubos


