[collectd] ntpd plugin complaining [not FIXED yet]

Thu Nov 2 11:10:43 CET 2006

Hi,
I do not have much time to test it. Moreover, I am working on a
production server, so it is a bit dangerous to fiddle with it.
I will try to run the debug build there.

Florian Forster wrote:
> Hi again ;)
> 
> On Wed, Nov 01, 2006 at 09:00:07PM +0100, Luboš Staněk wrote:
>> I did the same modification on ntpd like you yesterday.
>> The rest of the remark was that I got rid of the famous "getnameinfo
>> failed: ai_family not supported" but I found out another problem.
>> My .rrd directory contains files like (also time_dispersion and
>> time_offset .rrd files):
>> delay-::.rrd
>> delay-::c021:0:100:0:9a99:9999.rrd
>> delay-0:6e73::6e65:7400:7265:6e74.rrd
>> delay-0.0.0.0.rrd
>> delay-100::.rrd
>> delay-119.121.0.0.rrd
>> delay-::145.2.0.0.rrd
> 
> There are some really weird hosts in there :/
> 

No, those records are garbage.

>> Moreover the number of files increases.
> 
> You mean the number of files increases constantly? How many files per
> hour, day or whatever reasonable do get created?
> 

Whenever ntpd returns garbage in the response.
It is not so frequent; it probably comes from ntpd restarts. While
ntpd keeps its list of servers unchanged, only the same garbage is
reported, mainly :: and 0.0.0.0.

>> Some of them look like reference clocks. The rest seems to be some
>> garbage. I would bet for a text in some of them.
> 
> Hm, the addresses used for reference clocks start with 0x7F7F or, in the
> usual decimal notation, 127.127. These names should be resolved
> correctly.
> 

I know, it was only a guess.
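
The guess about text in the garbage can be checked by looking at the raw
bytes behind the reported addresses. Below is a minimal sketch in Python
(not part of collectd; the function name is made up for illustration) that
renders the bytes of an address as ASCII, with '.' for non-printable bytes:

```python
import ipaddress

def address_bytes_as_text(addr: str) -> str:
    """Render the raw bytes of an IPv4/IPv6 address as ASCII.

    Bytes outside the printable ASCII range are shown as '.'.
    """
    packed = ipaddress.ip_address(addr).packed
    return "".join(chr(b) if 32 <= b < 127 else "." for b in packed)

# Two of the garbage names taken from the .rrd file list above.
for name in ("0:6e73::6e65:7400:7265:6e74", "119.121.0.0"):
    print(name, "->", address_bytes_as_text(name))
```

Fragments such as "ns" and "net" in the result would be consistent with
string data ending up in the address field.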

>> So the conclusion is that you must validate the fields before
>> processing an IPv6 address.
> 
> I've double checked with `ntpdc's source: They don't do it.. I wonder
> what they are doing different?
> 

I have looked into the ntpdc source and noticed special handling of
responses from old ntpd versions. Could that be the reason?
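
A field validation step of the kind suggested above could look roughly like
the following. This is only a sketch in Python, not the plugin's code; the
function name and the category labels are assumptions. It rejects the
unspecified addresses (:: and 0.0.0.0) and recognizes reference clocks by
the 127.127.x.x prefix mentioned earlier. It cannot catch garbage that
happens to look like a normal address, such as 119.121.0.0.

```python
import ipaddress

# Reference clocks use pseudo addresses in 127.127.0.0/16.
REFCLOCK_NET = ipaddress.ip_network("127.127.0.0/16")

def classify_peer_address(addr: str) -> str:
    """Return 'invalid', 'garbage', 'refclock' or 'peer' (hypothetical labels)."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return "invalid"       # not parseable as an IP address at all
    if ip.is_unspecified:
        return "garbage"       # :: or 0.0.0.0
    if ip.version == 4 and ip in REFCLOCK_NET:
        return "refclock"
    return "peer"

for addr in ("::", "0.0.0.0", "127.127.1.0", "130.149.17.21"):
    print(addr, "->", classify_peer_address(addr))
```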

>> First I thought that one of the servers returns the garbage.  I
>> checked all of them one by one. No troubles with any of more than 30.
>> Later I returned to the previous ntp.conf and the garbage appeared
>> again.
> 
> So as long as you have only one server configured everything's fine? Did
> you configure a specific server or the round-robin address?
> 

Yes.
When I configure ntpd with fixed servers, no garbage is collected.
I used 36 servers in the last test.

>> Nov  1 20:18:20 ls collectd[31163]: rrd_update failed:
>>   ntpd/time_offset-::.rrd: illegal attempt to update using time
>>   1162408699 when last update time is 1162408699 (minimum one second
>>   step)
> 
> This most likely means, that more than one `host' with the address `::'
> is being reported.
> 

The same kind of garbage shows up in the query response repeatedly. It
comes from the ps enumeration loop (after the REQ_PEER_LIST_SUM query in
ntpd_read()).
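
To illustrate why repeated entries with the same name trigger the
rrd_update error: two samples for the same file within one second cannot
both be written. A sketch in Python (not what collectd does; the function
name and the sample values are made up) that keeps only the first sample
per name and timestamp:

```python
def drop_same_second_duplicates(samples):
    """Keep the first (name, timestamp, value) sample per (name, timestamp)."""
    seen = set()
    kept = []
    for name, timestamp, value in samples:
        key = (name, timestamp)
        if key in seen:
            continue           # a second update in the same second would fail
        seen.add(key)
        kept.append((name, timestamp, value))
    return kept

# Hypothetical samples: the address '::' is reported twice in one cycle.
samples = [
    ("::", 1162408699, 0.1),
    ("::", 1162408699, 0.2),
    ("tik.cesnet.cz", 1162408699, 0.006),
]
print(drop_same_second_duplicates(samples))
```

This only hides the symptom; the duplicated garbage entries themselves
would still need to be filtered out.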

>> It seems that the problem is caused by the combination of the RR DNS
>> servers .pool.ntp.org and ntpd server.
> 
> I'm using `pool.ntp.org' myself, and I usually configure between two and
> three servers that are being chosen randomly at startup. I've never had
> these issues so there's at least one additional factor.
> 

Maybe you could look at the Red Hat patches to ntpd.
The source RPM is available from any Fedora Core mirror:
http://fedora.redhat.com/download/mirrors.html

>> I have not found any information about ntpd's behavior in such case.
>> But it seems it returns invalid information in the query response. It
>> maybe prepares to switch the servers and returns partially filled
>> structures.
> 
> The structures seem to be filled with random bytes, at least part of the
> structures: They all seem to contain some null-bytes on either end of
> the structure..
> 

It is still a mystery to me.

>> I am sorry but I have not found a solution so far. I am glad that I
>> have found the problem source.
> 
> Could you run the following two commands and send their output? These
> commands essentially do what the plugin is doing - with and without name
> translation:
>   ntpdc -c peers
>   ntpdc -n -c peers
> 

This is the report with my previous setup (no RR servers):
=LOCAL(0)        127.0.0.1       10   64  377 0.00000  0.000000 0.03072
=ntps1-0.cs.tu-b 213.220.195.229  1 1024  377 0.04590 -0.012662 0.12175
=swisstime.ee.et 213.220.195.229  2 1024  377 0.03712 -0.010287 0.12184
*tik.cesnet.cz   213.220.195.229  1 1024  377 0.00601 -0.011426 0.12178
=kamelot2.dkm.cz 213.220.195.229  2 1024  377 0.00711 -0.025134 0.12172

=127.127.1.0     127.0.0.1       10   64  377 0.00000  0.000000 0.03072
=130.149.17.21   213.220.195.229  1 1024  377 0.04590 -0.012662 0.12175
=129.132.2.21    213.220.195.229  2 1024  377 0.03712 -0.010287 0.12184
*195.113.144.201 213.220.195.229  1 1024  377 0.00601 -0.011426 0.12178
=62.24.64.33     213.220.195.229  2 1024  377 0.00711 -0.025134 0.12172

and the corresponding .rrd files:
delay/time_dispersion/time_offset-kamelot2.dkm.cz.rrd
delay/time_dispersion/time_offset-ntps1-0.cs.tu-berlin.de.rrd
delay/time_dispersion/time_offset-swisstime.ee.ethz.ch.rrd
delay/time_dispersion/time_offset-tik.cesnet.cz.rrd
delay/time_dispersion/time_offset-LOCAL.rrd
frequency_offset-loop.rrd
time_offset-error.rrd
time_offset-loop.rrd

This is the report with RR servers 0/1/2.fedora.pool.ntp.org:
=LOCAL(0)        127.0.0.1       10   64  377 0.00000  0.000000 0.03061
=server213-171-2 213.220.195.229  3   64  377 0.04135 -0.068969 0.05424
=sense.xs4all.nl 213.220.195.229  3   64  377 0.04305 -0.067047 0.05289
*ntps1-0.cs.tu-b 213.220.195.229  1   64  377 0.04399 -0.057553 0.06734
=swisstime.ee.et 213.220.195.229  2   64  377 0.03549 -0.059151 0.05942
=tik.cesnet.cz   213.220.195.229  1   64  377 0.00627 -0.063582 0.05707
=kamelot2.dkm.cz 213.220.195.229  2   64  377 0.00462 -0.085522 0.05063
=ns1.kamino.fr   213.220.195.229  2   64  377 0.03474 -0.080320 0.04622

=127.127.1.0     127.0.0.1       10   64  377 0.00000  0.000000 0.03061
=213.171.221.57  213.220.195.229  3   64  377 0.04135 -0.068969 0.05424
=213.84.230.57   213.220.195.229  3   64  377 0.04305 -0.067047 0.05289
*130.149.17.21   213.220.195.229  1   64  377 0.04399 -0.057553 0.06734
=129.132.2.21    213.220.195.229  2   64  377 0.03549 -0.059151 0.06537
=195.113.144.201 213.220.195.229  1   64  377 0.00627 -0.063582 0.05707
=62.24.64.33     213.220.195.229  2   64  377 0.00462 -0.085522 0.05063
=213.246.63.72   213.220.195.229  2   64  377 0.03474 -0.080320 0.05424

and the corresponding .rrd files:
delay/time_dispersion/time_offset-kamelot2.dkm.cz.rrd
delay/time_dispersion/time_offset-ntps1-0.cs.tu-berlin.de.rrd
delay/time_dispersion/time_offset-swisstime.ee.ethz.ch.rrd
delay/time_dispersion/time_offset-tik.cesnet.cz.rrd
delay/time_dispersion/time_offset-LOCAL.rrd
delay/time_dispersion/time_offset-server213-171-221-57.live-servers.net.rrd
and the garbage:
delay-::.rrd
delay-0.0.0.0.rrd
delay-0:0:100:0:100::.rrd
delay-121.0.0.9.rrd
delay-160.178.60.0.rrd
delay-20b2:3c00:20b2:3c00:4147:4500::.rrd
delay-::7779:2e6e:6574:72:656e:7463.rrd
delay-::8084:2ec1:0:0:8084:2e41.rrd

and the number of files keeps increasing with additional garbage addresses.

The 2 servers below are reported by ntpdc but are missing here:
=sense.xs4all.nl 213.220.195.229  3   64  377 0.04305 -0.067047 0.05289
=ns1.kamino.fr   213.220.195.229  2   64  377 0.03474 -0.080320 0.04622

=213.84.230.57   213.220.195.229  3   64  377 0.04305 -0.067047 0.05289
=213.246.63.72   213.220.195.229  2   64  377 0.03474 -0.080320 0.05424
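
The comparison between the ntpdc report and the .rrd files can be
reproduced with a short script. This is a sketch in Python; the function
names are made up, and it assumes that the name ntpdc prints is a
(possibly truncated) prefix of the host name used for the .rrd file:

```python
import re

PEERS_OUTPUT = """\
=LOCAL(0)        127.0.0.1       10   64  377 0.00000  0.000000 0.03061
=server213-171-2 213.220.195.229  3   64  377 0.04135 -0.068969 0.05424
=sense.xs4all.nl 213.220.195.229  3   64  377 0.04305 -0.067047 0.05289
*ntps1-0.cs.tu-b 213.220.195.229  1   64  377 0.04399 -0.057553 0.06734
=swisstime.ee.et 213.220.195.229  2   64  377 0.03549 -0.059151 0.05942
=tik.cesnet.cz   213.220.195.229  1   64  377 0.00627 -0.063582 0.05707
=kamelot2.dkm.cz 213.220.195.229  2   64  377 0.00462 -0.085522 0.05063
=ns1.kamino.fr   213.220.195.229  2   64  377 0.03474 -0.080320 0.04622
"""

# Host parts of the non-garbage .rrd file names listed above.
RRD_HOSTS = [
    "kamelot2.dkm.cz",
    "ntps1-0.cs.tu-berlin.de",
    "swisstime.ee.ethz.ch",
    "tik.cesnet.cz",
    "LOCAL",
    "server213-171-221-57.live-servers.net",
]

def peer_names(output):
    """Extract the peer name column, dropping the flag and a '(N)' unit suffix."""
    names = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name = line[1:].split()[0]   # first character is the status flag
        names.append(re.sub(r"\(\d+\)$", "", name))
    return names

def peers_without_rrd(output, rrd_hosts):
    """Return peers whose name is not a prefix of any .rrd host name."""
    return [n for n in peer_names(output)
            if not any(host.startswith(n) for host in rrd_hosts)]

print(peers_without_rrd(PEERS_OUTPUT, RRD_HOSTS))
```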

I have attached an excerpt of the collectd debug log. The debug version
died several times, the last call being ntpd_submit().
Look at the srcaddr reports. They really do not look like IP
addresses or reference clocks.

I also left in the log of the new sensors (last patch version) for
you to look at. BTW, the plugin works like a charm.

Best regards,
Lubos
-------------- next part --------------
A non-text attachment was scrubbed...
Name: collectd.log.gz
Type: application/x-gzip
Size: 2785 bytes
Desc: not available
Url : http://mailman.verplant.org/pipermail/collectd/attachments/20061102/973fd4fc/collectd.log-0001.bin