[collectd] loss of data on collectd-server

eric fauser ef_cd(a)apa.at
Wed Jan 3 11:54:12 CET 2007


hi all

i have a little problem regarding client to server rrd-updates.

it seems that the collectd-server does not store all the data that arrives
via unicast udp (or maybe the rrdtool part cannot update the corresponding
rrd file(s)).

a look into the cpu-1.rrd file shows long stretches with no data:

>rrdtool fetch cpu-1.rrd AVERAGE --start now-1000
===
1167817420: 1.0000000000e-01 0.0000000000e+00 1.0000000000e-01 9.9300000000e+01 5.0000000000e-01
1167817430: 1.0000000000e-01 0.0000000000e+00 2.0000000000e-01 9.8900000000e+01 9.0000000000e-01
1167817440: 5.0000000000e-01 0.0000000000e+00 4.0000000000e-01 9.8700000000e+01 4.0000000000e-01
1167817450: 0.0000000000e+00 0.0000000000e+00 0.0000000000e+00 9.9500000000e+01 5.0000000000e-01
1167817460: nan nan nan nan nan
1167817470: nan nan nan nan nan
1167817480: nan nan nan nan nan
1167817490: 1.0000000000e-01 0.0000000000e+00 1.0000000000e-01 9.7600000000e+01 2.3000000000e+00
1167817500: 3.0000000000e-01 0.0000000000e+00 3.0000000000e-01 9.8900000000e+01 5.0000000000e-01
1167817510: nan nan nan nan nan
1167817520: nan nan nan nan nan
1167817530: nan nan nan nan nan
1167817540: nan nan nan nan nan
1167817550: nan nan nan nan nan
1167817560: nan nan nan nan nan
1167817570: nan nan nan nan nan
1167817580: nan nan nan nan nan
1167817590: nan nan nan nan nan
1167817600: nan nan nan nan nan
1167817610: nan nan nan nan nan
1167817620: nan nan nan nan nan
1167817630: nan nan nan nan nan
1167817640: nan nan nan nan nan
1167817650: nan nan nan nan nan
1167817660: nan nan nan nan nan
1167817670: nan nan nan nan nan
1167817680: nan nan nan nan nan
1167817690: nan nan nan nan nan
1167817700: nan nan nan nan nan
1167817710: nan nan nan nan nan
1167817720: nan nan nan nan nan
1167817730: nan nan nan nan nan
===
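
to make gaps like these easier to spot in longer outputs, here is a minimal python sketch that scans the text printed by `rrdtool fetch` and reports each run of all-nan rows. the function name `find_gaps` is made up for illustration, and the sketch assumes each row sits on one line in the form `timestamp: v1 v2 ...`.

```python
import math


def find_gaps(fetch_output):
    """Return (first_ts, last_ts, row_count) for each run of all-NaN rows."""
    gaps = []
    run = []  # timestamps of the NaN run currently being collected
    for line in fetch_output.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # header, blank line, delimiter or wrapped continuation
        values = [float(v) for v in rest.split()]
        if values and all(math.isnan(v) for v in values):
            run.append(int(head))
            continue
        if run:
            gaps.append((run[0], run[-1], len(run)))
        run = []
    if run:  # output ended in the middle of a gap
        gaps.append((run[0], run[-1], len(run)))
    return gaps
```

for the output above this should report two gaps: 3 rows starting at 1167817460 and 23 rows starting at 1167817510.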

so the next thing i checked was whether the data arrives over the network.
tcpdump shows that the packets from this client do arrive on the nic:

>tcpdump -x host apaXoses1
===
10:47:30.860342 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
        0x0000:  4500 0050 0000 4000 3f11 c75b c29e 85aa  E..P..@.?..[....
        0x0010:  c2e8 6910 0696 64e2 003c 4771 6370 7520  ..i...d..<Gqcpu.
        0x0020:  3120 3131 3637 3831 3736 3530 3a32 3939  1.1167817650:299
        0x0030:  3934 383a 3334 3a32 3336 3739 323a 3137  948:34:236792:17
        0x0040:  3835 3534 3135 383a 3134 3238 3033 3800  8554158:1428038.
10:47:40.862894 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
        0x0000:  4500 0050 0000 4000 3f11 c75b c29e 85aa  E..P..@.?..[....
        0x0010:  c2e8 6910 0696 64e2 003c 506f 6370 7520  ..i...d..<Pocpu.
        0x0020:  3120 3131 3637 3831 3736 3630 3a32 3939  1.1167817660:299
        0x0030:  3934 393a 3334 3a32 3336 3739 333a 3137  949:34:236793:17
        0x0040:  3835 3535 3135 333a 3134 3238 3034 3100  8555153:1428041.
10:47:50.865423 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
        0x0000:  4500 0050 0000 4000 3f11 c75b c29e 85aa  E..P..@.?..[....
        0x0010:  c2e8 6910 0696 64e2 003c 456f 6370 7520  ..i...d..<Eocpu.
        0x0020:  3120 3131 3637 3831 3736 3730 3a32 3939  1.1167817670:299
        0x0030:  3934 393a 3334 3a32 3336 3739 353a 3137  949:34:236795:17
        0x0040:  3835 3536 3134 353a 3134 3238 3034 3700  8556145:1428047.
10:48:00.868163 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
        0x0000:  4500 0050 0000 4000 3f11 c75b c29e 85aa  E..P..@.?..[....
        0x0010:  c2e8 6910 0696 64e2 003c 436e 6370 7520  ..i...d..<Cncpu.
        0x0020:  3120 3131 3637 3831 3736 3830 3a32 3939  1.1167817680:299
        0x0030:  3935 333a 3334 3a32 3336 3739 383a 3137  953:34:236798:17
        0x0040:  3835 3537 3132 383a 3134 3238 3035 3800  8557128:1428058.
10:48:10.870616 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
        0x0000:  4500 0050 0000 4000 3f11 c75b c29e 85aa  E..P..@.?..[....
        0x0010:  c2e8 6910 0696 64e2 003c 4d6c 6370 7520  ..i...d..<Mlcpu.
        0x0020:  3120 3131 3637 3831 3736 3930 3a32 3939  1.1167817690:299
        0x0030:  3935 333a 3334 3a32 3336 3739 383a 3137  953:34:236798:17
        0x0040:  3835 3538 3132 313a 3134 3238 3036 3400  8558121:1428064.
10:48:20.873137 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
        0x0000:  4500 0050 0000 4000 3f11 c75b c29e 85aa  E..P..@.?..[....
        0x0010:  c2e8 6910 0696 64e2 003c 4b6b 6370 7520  ..i...d..<Kkcpu.
        0x0020:  3120 3131 3637 3831 3737 3030 3a32 3939  1.1167817700:299
        0x0030:  3935 333a 3334 3a32 3336 3739 393a 3137  953:34:236799:17
        0x0040:  3835 3539 3130 393a 3134 3238 3037 3600  8559109:1428076.
10:48:30.876795 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 52
        0x0000:  4500 0050 0000 4000 3f11 c75b c29e 85aa  E..P..@.?..[....
        0x0010:  c2e8 6910 0696 64e2 003c 5473 6370 7520  ..i...d..<Tscpu.
        0x0020:  3120 3131 3637 3831 3737 3130 3a32 3939  1.1167817710:299
        0x0030:  3935 343a 3334 3a32 3336 3739 393a 3137  954:34:236799:17
        0x0040:  3835 3630 3130 303a 3134 3238 3038 3300  8560100:1428083.
===
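
the payload of these packets is plain text, so it can be decoded and lined up with the rrd rows. the sketch below is only an illustration: the field layout (`type instance time:value:value...` plus a trailing nul byte) is inferred from the dump above, not taken from the collectd documentation, and `parse_payload`, `to_cet` and the fixed +1h offset are assumptions. note that the two characters shown before `cpu` in the ascii column (e.g. `Gq`) are the udp checksum bytes, not part of the payload.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset, assumed from the "CET" in the mail header.
CET = timezone(timedelta(hours=1))


def parse_payload(payload):
    """Split one datagram payload into type, instance, time and values.

    Layout is inferred from the tcpdump output only; values are kept as
    strings because the dump does not say how they should be typed.
    """
    text = payload.rstrip(b"\x00").decode("ascii")
    type_, instance, data = text.split(" ", 2)
    timestamp, *values = data.split(":")
    return {
        "type": type_,
        "instance": instance,
        "time": int(timestamp),
        "values": values,
    }


def to_cet(timestamp):
    """Format an epoch timestamp as local time at UTC+1."""
    return datetime.fromtimestamp(timestamp, CET).strftime("%Y-%m-%d %H:%M:%S")
```

for the first packet above, the payload timestamp 1167817650 converts to 10:47:30 CET, the same second tcpdump captured it, and that timestamp is one of the nan rows in the rrdtool fetch output. so the sample reached the nic but did not end up in the rrd file.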


the ls output also shows that something must be wrong:
some rrd files are still being updated (timestamp 10:48),
but others got no update (timestamps 10:45 or 10:46)

> ll
===
-rw-r--r--  1 root root 1477784 Jan  3 10:45 cpu-0.rrd
-rw-r--r--  1 root root 1477784 Jan  3 10:45 cpu-1.rrd
-rw-r--r--  1 root root  591992 Jan  3 10:46 df-appl-IMdb.rrd
-rw-r--r--  1 root root  591992 Jan  3 10:48 df-appl.rrd
-rw-r--r--  1 root root  591992 Jan  3 10:48 df-boot.rrd
-rw-r--r--  1 root root  591992 Jan  3 10:46 df-dev-shm.rrd
-rw-r--r--  1 root root  591992 Jan  3 10:48 df-dev-vx.rrd
-rw-r--r--  1 root root  591992 Jan  3 10:48 df-root.rrd
-rw-r--r--  1 root root 2363576 Jan  3 10:48 disk-104-0.rrd
===
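
the same check can be scripted instead of reading the ls timestamps by eye. this is a minimal sketch: the function name `stale_rrd_files` and the age threshold are made up for illustration, and the directory is whatever per-host directory the server writes its rrd files to.

```python
import os
import time


def stale_rrd_files(directory, max_age_seconds, now=None):
    """Return (filename, age_in_seconds) for .rrd files not modified recently."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".rrd"):
            continue
        age = now - os.path.getmtime(os.path.join(directory, name))
        if age > max_age_seconds:
            stale.append((name, int(age)))
    return stale
```

with the 10 second interval seen in the dumps, something like `stale_rrd_files(".", 60)` run inside the client's directory would list the files that have missed several updates in a row.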


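in /var/log/messages i always got an error because the collectd-client
sends duplicate stats for df dev-vx (the tcpdump ascii column shows it as
"udf.dev-vx"; the leading "u" is the last byte of the udp checksum).
these errors stop when the collectd-server stops updating some of the
rrd files (another hint that there is a problem with some of the data?)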
===
Jan  3 10:30:10 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:30:20 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:30:30 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:30:40 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:30:50 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:31:00 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:38:20 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:39:00 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
Jan  3 10:39:20 apaXmgm01 collectd[11682]: rrd_update failed: apaXoses1/df-dev-vx.rrd: illegal attempt to
===

>tcpdump -x host apaXoses1
===
10:48:30.876240 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 28
        0x0000:  4500 0038 0000 4000 3f11 c773 c29e 85aa  E..8..@.?..s....
        0x0010:  c2e8 6910 0696 64e2 0024 ad75 6466 2064  ..i...d..$.udf.d
        0x0020:  6576 2d76 7820 3131 3637 3831 3737 3130  ev-vx.1167817710
        0x0030:  3a30 3a34 3039 3600                      :0:4096.
10:48:30.876294 IP apaXoses1.1686 > apaXmgm01.25826: UDP, length 28
        0x0000:  4500 0038 0000 4000 3f11 c773 c29e 85aa  E..8..@.?..s....
        0x0010:  c2e8 6910 0696 64e2 0024 ad75 6466 2064  ..i...d..$.udf.d
        0x0020:  6576 2d76 7820 3131 3637 3831 3737 3130  ev-vx.1167817710
        0x0030:  3a30 3a34 3039 3600                      :0:4096.
===
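
the two packets above carry the same timestamp for the same file, and rrdtool only accepts an update whose timestamp is later than the last one already stored, so the second one has to fail. to find such duplicates in a longer capture, here is a minimal sketch. it assumes the payloads have already been extracted as text in the `type instance time:values` layout inferred from the dumps, and `find_rejected_updates` is a made-up name.

```python
def find_rejected_updates(payloads):
    """Return ((type, instance), timestamp) for each update that is not
    newer than the last accepted one for the same type and instance."""
    last_seen = {}
    rejected = []
    for text in payloads:
        type_, instance, data = text.rstrip("\x00").split(" ", 2)
        timestamp = int(data.split(":", 1)[0])
        key = (type_, instance)
        if key in last_seen and timestamp <= last_seen[key]:
            rejected.append((key, timestamp))
        else:
            last_seen[key] = timestamp
    return rejected
```

if only df dev-vx shows up in the result, the duplicates come from the client side and are probably a separate issue from the missing cpu data.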


we use collectd-3.10.4 and rrdtool-1.2.15 on a new server
that receives 50 collectd client streams.
(i had the same problem on another machine, but after
2-3 restarts of the collectd-server the problem went away)

does anyone have an idea what to check next, or how to get more debug
information? or does anyone know what to actually do ;)

thanks
eric 



