[collectd] collectd with network plugin: ~60 updates per client fails per day

Thomas D. whissi at whissi.de
Mon Dec 2 11:44:15 CET 2013


Hi,

in our network we see ~60 failed updates per client each day.

Logs from collectd (running on "logger3") which is running the network
plugin as server:

> Dec  2 08:16:49 logger3 collectd[7186]: rrdcached plugin: rrdc_update (/var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd, [1385968599:63072:373977], 1) failed with status -1.

Logs from strace attached to rrdcached (also running on "logger3"; I am
just including the parts containing "if_packets.rrd" from "web4"):

> 7203  08:16:49.168136 read(3, "update /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd 1385968599:63072:373977\n", 8192) = 111
> 7203  08:16:49.168194 write(3, "-1 illegal attempt to update using time 1385968599.000000 when last update time is 1385968609.000000 (minimum one second step)\n", 127) = 127

Going backwards in logs (reversed order!):

> 7203  08:16:49.164592 write(8, "date /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-tx_flow_control_xoff.rrd 1385968609:0\nwrote /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-tx_restart_queue.rrd\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_smbus.rrd 1385968609:103\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_csum_offload_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-tx_smbus.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-corr_ecc_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-uncorr_ecc_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_dma_failed.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/load/load.rrd 1385968609:0.000000:0.010000:0.050000\nupdate /var/lib/collectd/rrd/web4.examp
le.org/iptables-filter-net2fw/ipt_bytes-Blacklist.rrd 1385968609:6768\nupdate /var/lib/collectd/rrd/web4.example.org/iptables-filter-net2fw/ipt_packets-Blacklist.rrd 1385968609:130\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-dropped_smbus.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_packets.rrd 1385968609:59694\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_packets.rrd 1385968609:28090\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_bytes.rrd 1385968609:4673356\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_bytes.rrd 1385968609:4341247\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_broadcast.rrd 1385968609:31269\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_broadcast.rrd 1385968609:1\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_multicast.rrd 1
385968609:12\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_multicast.rrd 1385968609:7\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_dropped.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-multicast.rrd 1385968609:12\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-collisions.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_length_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_over_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_octets.rrd 1385968609:61081416:457187844\nupdate /var/lib/collectd/rrd/web4.example.org/interface-interna
l0/if_errors.rrd 1385968609:0:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-external0/if_octets.rrd 1385968609:4673356:4341247\nupdate /var/lib/collectd/rrd/web4.example.org/interface-external0/if_errors.rrd 1385968609:0:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd 1385968609:63073:373985\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_no_buffer_count.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_crc_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-external0/if_packets.rrd 1385968609:59694:28090\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_aborted_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_frame_errors.rrd 1385968609:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_c", 4096) = 4096

> 7203  08:16:49.163859 write(3, "0 errors, enqueued 1 value(s).\n", 31) = 31
> 7203  08:16:49.163806 read(3, "update /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd 1385968609:63073:373985\n", 8192) = 111

> 22310 08:16:29.171935 write(8, "nal0/derive-tx_heartbeat_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_abort_late_coll.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_single_coll_ok.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_multi_coll_ok.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_long_length_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_short_length_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_flow_control_xon.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_restart_queue.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_short_length_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.
org/ethstat-external0/derive-tx_timeout_count.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-internal0/derive-rx_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_align_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_flow_control_xoff.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_tcp_seg_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_flow_control_xon.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_tcp_seg_good.rrd 1385968589:66\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_flow_control_xon.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_csum_offload_good.rrd 1385968589:19609\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_c
sum_offload_errors.rrd 1385968589:16\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_flow_control_xoff.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_header_split.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-alloc_rx_buff_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-dropped_smbus.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_dma_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_dma_failed.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-tx_smbus.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-rx_smbus.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-uncorr_ecc_errors.rrd 1385968589:0\nupdate /var/lib/collect
d/rrd/web4.example.org/ethstat-external0/derive-rx_hwtstamp_cleared.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/iptables-filter-net2fw/ipt_packets-Blacklist.rrd 1385968589:130\nupdate /var/lib/collectd/rrd/web4.example.org/iptables-filter-net2fw/ipt_bytes-Blacklist.rrd 1385968589:6768\nupdate /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-corr_ecc_errors.rrd 1385968589:0\nupdate /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd 1385968589:63071:373969\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap-used.rrd 1385968579:0.000000\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap-free.rrd 1385968579:4294963200.000000\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap-cached.rrd 1385968579:0.000000\nupdate /var/lib/collectd/rrd/web8.example.org/swap/swap_io-in.rrd 1385968579:0\nwrote /var/lib/collectd/rrd/web4.example.org/ethstat-external0/derive-corr_ecc_errors", 4096) = 4096

> 7203  08:16:29.171206 write(3, "0 errors, enqueued 1 value(s).\n", 31) = 31
> 7203  08:16:29.171154 read(3, "update /var/lib/collectd/rrd/web4.example.org/interface-internal0/if_packets.rrd 1385968589:63071:373969\n", 8192) = 111

Looks like rrdcached didn't received data for "if_packets.rrd" for
"web4" on "08:16:39" but received two updates for this file on
"08:16:49" (the first update (08:16:49.163806) seems to be the "regular"
expected update; the seconds update (08:16:49.168136) seems to be the
missing data from "08:16:39").

The question is "WHY"? Is it a time problem?

ntp logs from logger3 (which is also the only time source for web4):

> Dec  2 05:55:17 logger3 ntpd[9910]: adjusting local clock by 0.000272s
> [...]
> Dec  2 08:08:00 logger3 ntpd[9911]: reply from 192.53.103.103: offset 0.000466 delay 0.024575, next query 1526s
> Dec  2 08:16:10 logger3 ntpd[9911]: reply from 131.188.3.222: offset 0.000921 delay 0.020624, next query 1594s
> Dec  2 08:23:27 logger3 ntpd[9911]: reply from 195.145.119.188: offset -0.003621 delay 0.006547, next query 161
> [...]
> Dec  2 08:57:20 logger3 ntpd[9910]: adjusting local clock by 0.000502s

ntp logs from web4:

> Dec  2 07:23:26 web4 ntpd[11310]: adjusting local clock by 0.000297s
> Dec  2 07:50:20 web4 ntpd[11311]: reply from logger3: offset -0.000051 delay 0.000207, next query 1566s
> Dec  2 08:16:26 web4 ntpd[11311]: reply from logger3: offset -0.000069 delay 0.000172, next query 1540s
> Dec  2 08:42:06 web4 ntpd[11311]: reply from logger3: offset -0.000077 delay 0.000185, next query 1508s
> Dec  2 09:07:14 web4 ntpd[11311]: reply from logger3: offset 0.000417 delay 0.000179, next query 1534s
> Dec  2 09:32:48 web4 ntpd[11311]: reply from logger3: offset 0.000419 delay 0.000227, next query 1560s
> Dec  2 09:32:48 web4 ntpd[11310]: adjusting local clock by -0.000046s

No changes to the clock are logged, offset before and after 08:16 looks
fine, so time should be OK, no?

web4 is just an example. We see this for *every* client. Also, the
plugins/files will vary.


Well yes,

  86400 seconds / 10 seconds interval * 8 plugins
  = 69120 updates per 24h per client

So if only 60 updates per client and day will fail this isn't very much
I guess. But I want to know if this is "normal" or not.

Also: Would you recommend to run the TimeDiff plugin on the "server" to
prevent this?


We are running collectd v5.4.0 on server and client.

The collectd server (which is acting as time source for the clients)
currently receives data from 10 clients.

All servers are located in the same rack, they are connected through an
internal HP ProCurve Switch 2824.

Any feedback/comment is appreciated. Thanks.


-Thomas



More information about the collectd mailing list