[collectd] loss of data on collectd-server
eric fauser
ef_cd(a)apa.at
Wed Jan 3 11:54:12 CET 2007
hi all
i have a little problem regarding client to server rrd-updates.
it seems that collectd-server does not collect all the data that arrives via
unicast udp
(or maybe the rrdtool part could not update the corresponding rrd-file(s)).
a look into the cpu-1.rrd file shows that most intervals contain no data (nan):
>rrdtool fetch cpu-1.rrd AVERAGE --start now-1000
===
1167817420: 1.0000000000e-01 0.0000000000e+00 1.0000000000e-01
9.9300000000e+01 5.0000000000e-01
1167817430: 1.0000000000e-01 0.0000000000e+00 2.0000000000e-01
9.8900000000e+01 9.0000000000e-01
1167817440: 5.0000000000e-01 0.0000000000e+00 4.0000000000e-01
9.8700000000e+01 4.0000000000e-01
1167817450: 0.0000000000e+00 0.0000000000e+00 0.0000000000e+00
9.9500000000e+01 5.0000000000e-01
1167817460: nan nan nan nan nan
1167817470: nan nan nan nan nan
1167817480: nan nan nan nan nan
1167817490: 1.0000000000e-01 0.0000000000e+00 1.0000000000e-01
9.7600000000e+01 2.3000000000e+00
1167817500: 3.0000000000e-01 0.0000000000e+00 3.0000000000e-01
9.8900000000e+01 5.0000000000e-01
1167817510: nan nan nan nan nan
1167817520: nan nan nan nan nan
1167817530: nan nan nan nan nan
1167817540: nan nan nan nan nan
1167817550: nan nan nan nan nan
1167817560: nan nan nan nan nan
1167817570: nan nan nan nan nan
1167817580: nan nan nan nan nan
1167817590: nan nan nan nan nan
1167817600: nan nan nan nan nan
1167817610: nan nan nan nan nan
1167817620: nan nan nan nan nan
1167817630: nan nan nan nan nan
1167817640: nan nan nan nan nan
1167817650: nan nan nan nan nan
1167817660: nan nan nan nan nan
1167817670: nan nan nan nan nan
1167817680: nan nan nan nan nan
1167817690: nan nan nan nan nan
1167817700: nan nan nan nan nan
1167817710: nan nan nan nan nan
1167817720: nan nan nan nan nan
1167817730: nan nan nan nan nan
===
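to quantify the gaps instead of eyeballing them, the fetch output can be scanned for rows in which every value is nan. the following is only a minimal python sketch: the function names are made up for illustration, and it assumes the output format shown above, including the line wrapping introduced by the mail client.

```python
import math

def parse_fetch(text):
    """Parse `rrdtool fetch` output into a list of (timestamp, values) rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        head, sep, rest = line.partition(":")
        if sep and head.isdigit():
            # a new row: "<unix time>: <value> <value> ..."
            rows.append((int(head), [float(v) for v in rest.split()]))
        elif rows:
            # no leading timestamp: treat it as the wrapped tail of the previous row
            try:
                tail = [float(v) for v in line.split()]
            except ValueError:
                continue  # separators such as "===" or other non-numeric lines
            rows[-1][1].extend(tail)
    return rows

def gap_timestamps(rows):
    """Return the timestamps of rows in which all values are nan."""
    return [ts for ts, vals in rows if vals and all(math.isnan(v) for v in vals)]

# a few rows copied from the output above
SAMPLE = """\
1167817450: 0.0000000000e+00 0.0000000000e+00 0.0000000000e+00
9.9500000000e+01 5.0000000000e-01
1167817460: nan nan nan nan nan
1167817470: nan nan nan nan nan
1167817480: nan nan nan nan nan
1167817490: 1.0000000000e-01 0.0000000000e+00 1.0000000000e-01
9.7600000000e+01 2.3000000000e+00
"""

if __name__ == "__main__":
    print(gap_timestamps(parse_fetch(SAMPLE)))
```

comparing the returned timestamps with the timestamps inside the captured packets shows whether data that did arrive was never written.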
so the next thing i checked was whether the data arrives via the network.
tcpdump shows that the packets from this client do arrive on the nic, with
timestamps (1167817650 to 1167817710) that are nan in the rrd file above:
>tcpdump -x host apaXoses1
===
10:47:30.860342 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
0x0000: 4500 0050 0000 4000 3f11 c75b c29e 85aa E..P..@.?..[....
0x0010: c2e8 6910 0696 64e2 003c 4771 6370 7520 ..i...d..<Gqcpu.
0x0020: 3120 3131 3637 3831 3736 3530 3a32 3939 1.1167817650:299
0x0030: 3934 383a 3334 3a32 3336 3739 323a 3137 948:34:236792:17
0x0040: 3835 3534 3135 383a 3134 3238 3033 3800 8554158:1428038.
10:47:40.862894 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
0x0000: 4500 0050 0000 4000 3f11 c75b c29e 85aa E..P..@.?..[....
0x0010: c2e8 6910 0696 64e2 003c 506f 6370 7520 ..i...d..<Pocpu.
0x0020: 3120 3131 3637 3831 3736 3630 3a32 3939 1.1167817660:299
0x0030: 3934 393a 3334 3a32 3336 3739 333a 3137 949:34:236793:17
0x0040: 3835 3535 3135 333a 3134 3238 3034 3100 8555153:1428041.
10:47:50.865423 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
0x0000: 4500 0050 0000 4000 3f11 c75b c29e 85aa E..P..@.?..[....
0x0010: c2e8 6910 0696 64e2 003c 456f 6370 7520 ..i...d..<Eocpu.
0x0020: 3120 3131 3637 3831 3736 3730 3a32 3939 1.1167817670:299
0x0030: 3934 393a 3334 3a32 3336 3739 353a 3137 949:34:236795:17
0x0040: 3835 3536 3134 353a 3134 3238 3034 3700 8556145:1428047.
10:48:00.868163 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
0x0000: 4500 0050 0000 4000 3f11 c75b c29e 85aa E..P..@.?..[....
0x0010: c2e8 6910 0696 64e2 003c 436e 6370 7520 ..i...d..<Cncpu.
0x0020: 3120 3131 3637 3831 3736 3830 3a32 3939 1.1167817680:299
0x0030: 3935 333a 3334 3a32 3336 3739 383a 3137 953:34:236798:17
0x0040: 3835 3537 3132 383a 3134 3238 3035 3800 8557128:1428058.
10:48:10.870616 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
0x0000: 4500 0050 0000 4000 3f11 c75b c29e 85aa E..P..@.?..[....
0x0010: c2e8 6910 0696 64e2 003c 4d6c 6370 7520 ..i...d..<Mlcpu.
0x0020: 3120 3131 3637 3831 3736 3930 3a32 3939 1.1167817690:299
0x0030: 3935 333a 3334 3a32 3336 3739 383a 3137 953:34:236798:17
0x0040: 3835 3538 3132 313a 3134 3238 3036 3400 8558121:1428064.
10:48:20.873137 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
0x0000: 4500 0050 0000 4000 3f11 c75b c29e 85aa E..P..@.?..[....
0x0010: c2e8 6910 0696 64e2 003c 4b6b 6370 7520 ..i...d..<Kkcpu.
0x0020: 3120 3131 3637 3831 3737 3030 3a32 3939 1.1167817700:299
0x0030: 3935 333a 3334 3a32 3336 3739 393a 3137 953:34:236799:17
0x0040: 3835 3539 3130 393a 3134 3238 3037 3600 8559109:1428076.
10:48:30.876795 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
0x0000: 4500 0050 0000 4000 3f11 c75b c29e 85aa E..P..@.?..[....
0x0010: c2e8 6910 0696 64e2 003c 5473 6370 7520 ..i...d..<Tscpu.
0x0020: 3120 3131 3637 3831 3737 3130 3a32 3939 1.1167817710:299
0x0030: 3935 343a 3334 3a32 3336 3739 393a 3137 954:34:236799:17
0x0040: 3835 3630 3130 303a 3134 3238 3038 3300 8560100:1428083.
===
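the ascii column of the capture suggests a simple text payload: type, instance, then a timestamp followed by colon-separated counters, terminated by a nul byte. a minimal python sketch that decodes such a datagram follows. the layout is inferred only from the packets shown here, not taken from the collectd documentation, and `parse_payload` is a hypothetical helper name.

```python
def parse_payload(payload: bytes):
    """Split "<type> <instance> <time>:<v1>:<v2>..." (NUL-terminated) into parts.

    The layout is an assumption read off the tcpdump output above.
    """
    text = payload.rstrip(b"\x00").decode("ascii")
    type_, instance, data = text.split(" ", 2)
    timestamp, *values = data.split(":")
    return type_, instance, int(timestamp), [int(v) for v in values]

# UDP payload of the first captured packet (the bytes after the 20-byte IP
# header and the 8-byte UDP header), copied from the hex dump.
FIRST_PACKET = bytes.fromhex(
    "6370 7520 3120 3131 3637 3831 3736 3530 3a32 3939"
    "3934 383a 3334 3a32 3336 3739 323a 3137"
    "3835 3534 3135 383a 3134 3238 3033 3800"
)

if __name__ == "__main__":
    print(parse_payload(FIRST_PACKET))
```

decoded this way, each cpu packet carries five counters, which matches the five columns that rrdtool fetch prints for cpu-1.rrd.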
also the ls command shows that something must be wrong:
some rrd-files are still being updated (timestamp 10:48),
but others are no longer updated (timestamps 10:45 or 10:46)
> ll
===
-rw-r--r-- 1 root root 1477784 Jan 3 10:45 cpu-0.rrd
-rw-r--r-- 1 root root 1477784 Jan 3 10:45 cpu-1.rrd
-rw-r--r-- 1 root root 591992 Jan 3 10:46 df-appl-IMdb.rrd
-rw-r--r-- 1 root root 591992 Jan 3 10:48 df-appl.rrd
-rw-r--r-- 1 root root 591992 Jan 3 10:48 df-boot.rrd
-rw-r--r-- 1 root root 591992 Jan 3 10:46 df-dev-shm.rrd
-rw-r--r-- 1 root root 591992 Jan 3 10:48 df-dev-vx.rrd
-rw-r--r-- 1 root root 591992 Jan 3 10:48 df-root.rrd
-rw-r--r-- 1 root root 2363576 Jan 3 10:48 disk-104-0.rrd
===
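the same check can be scripted so that it can be repeated while the problem is occurring. a minimal sketch in python, assuming the rrd files of one client live in a single directory; `stale_rrd_files` and the 60-second threshold are illustrative choices, not collectd settings.

```python
import os
import time

def stale_rrd_files(directory, max_age=60, now=None):
    """Return (name, age in seconds) for .rrd files not modified within max_age."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".rrd"):
            continue
        age = now - os.path.getmtime(os.path.join(directory, name))
        if age > max_age:
            stale.append((name, int(age)))
    return stale

if __name__ == "__main__":
    # run from inside the directory that holds the client's rrd files
    for name, age in stale_rrd_files("."):
        print(f"{name}: last update {age}s ago")
```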
in /var/log/messages i regularly get an error because the collectd-client
sends duplicate stats for df dev-vx (df-dev-vx.rrd). these errors stop when
collectd-server stops updating some rrd-files
(another hint that there is a problem with some data?)
===
Jan 3 10:30:10 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:30:20 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:30:30 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:30:40 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:30:50 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:31:00 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:38:20 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:39:00 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan 3 10:39:20 apaXmgm01 collectd[11682]: rrd_update failed:
apaXoses1/df-dev-vx.rrd: illegal attempt to
===
>tcpdump -x host apaXoses1
===
10:48:30.876240 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 28
0x0000: 4500 0038 0000 4000 3f11 c773 c29e 85aa E..8..@.?..s....
0x0010: c2e8 6910 0696 64e2 0024 ad75 6466 2064 ..i...d..$.udf.d
0x0020: 6576 2d76 7820 3131 3637 3831 3737 3130 ev-vx.1167817710
0x0030: 3a30 3a34 3039 3600 :0:4096.
10:48:30.876294 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 28
0x0000: 4500 0038 0000 4000 3f11 c773 c29e 85aa E..8..@.?..s....
0x0010: c2e8 6910 0696 64e2 0024 ad75 6466 2064 ..i...d..$.udf.d
0x0020: 6576 2d76 7820 3131 3637 3831 3737 3130 ev-vx.1167817710
0x0030: 3a30 3a34 3039 3600 :0:4096.
===
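the two datagrams above carry the same type, instance and timestamp, and rrdtool only accepts an update whose timestamp is newer than the last one stored, so the second one has to fail. a minimal python sketch for spotting such duplicates in a list of captured payloads; the payload layout is only inferred from the capture, and `update_key` and `find_duplicates` are hypothetical helpers.

```python
from collections import Counter

def update_key(payload: bytes):
    """Return (type, instance, timestamp) for one payload.

    Assumes the "<type> <instance> <time>:<values>" layout seen in the capture.
    """
    text = payload.rstrip(b"\x00").decode("ascii")
    type_, instance, data = text.split(" ", 2)
    return type_, instance, int(data.split(":", 1)[0])

def find_duplicates(payloads):
    """Return the keys that occur more than once, with their counts."""
    counts = Counter(update_key(p) for p in payloads)
    return {key: n for key, n in counts.items() if n > 1}

# payloads decoded from the packets captured at 10:48:30
CAPTURED = [
    b"df dev-vx 1167817710:0:4096\x00",
    b"df dev-vx 1167817710:0:4096\x00",
    b"cpu 1 1167817710:299954:34:236799:178560100:1428083\x00",
]

if __name__ == "__main__":
    print(find_duplicates(CAPTURED))
```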
we use collectd-3.10.4 and rrdtool-1.2.15 on a new server
that receives 50 collectd-client streams.
(i had the same problem on another machine, but after
2-3 restarts of the collectd-server the problem went away.)
does anyone have an idea what to check next, or how to get more debug
information?
or does anyone know what to actually do ;)
thanks
eric