[collectd] Issues with network buffers with Solaris

Eric LeBlanc eleblanc at taleo.com
Thu Jun 5 20:41:22 CEST 2008


Hi again,

I tried to narrow down the bug and I found out that it seems 
collectd  "re-initialize" all plugins without verifiying if some data are 
still in memory ?

To summarize that, here's what I did:

First, I added this:

DEBUG("RET_BUFFER: %d\n", *ret_buffer_len);

into function write_part_values() in network.c, at line 382, just after this 
following line:

  *ret_buffer_len -= packet_len;

Then, I compiled and fired up collectd.  Here's the log (cleaned for better 
reading):

#############
# FIRST PART
#############
[2008-06-05 14:23:06] plugin: plugin_read_all: Signalling `read_cond'
[2008-06-05 14:23:06] [thread #    5] plugin: plugin_read_thread: Handling 
exec
[2008-06-05 14:23:06] [thread #    5] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:06] [thread #    3] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:06] [thread #    4] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:06] [thread #    2] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:06] [thread #    6] plugin: plugin_read_thread: Waiting on 
read_cond.
#############


Nothing wrong, we continue...



#############
# SECOND PART
#############
[2008-06-05 14:23:06] plugin_dispatch_values: time = 1212690186; interval = 
60; host = hostXXXXX; plugin = disk_stats_md10; plugin_instance = ; type = 
sol_disk_iops; type_instance = ;
[2008-06-05 14:23:06] uc_update: hostXXXXX/disk_stats_md10/sol_disk_iops: 
ds[0] = 0.000000
[2008-06-05 14:23:06] uc_update: hostXXXXX/disk_stats_md10/sol_disk_iops: 
ds[1] = 0.000000
[2008-06-05 14:23:06] uc_update: hostXXXXX/disk_stats_md10/sol_disk_iops: 
ds[2] = 0.000000
[2008-06-05 14:23:06] network plugin: cache_check: key = 
hostXXXXX/disk_stats_md10/sol_disk_iops; time = 1212690186; retval = 0

[2008-06-05 14:23:06] RET_BUFFER: 916

[2008-06-05 14:23:06] plugin_dispatch_values: time = 1212690186; interval = 
60; host = hostXXXXX; plugin = disk_stats_md10; plugin_instance = ; type = 
sol_disk_throughput; type_instance = ;
[2008-06-05 14:23:06] uc_update: 
hostXXXXX/disk_stats_md10/sol_disk_throughput: ds[0] = 0.000000
[2008-06-05 14:23:06] uc_update: 
hostXXXXX/disk_stats_md10/sol_disk_throughput: ds[1] = 0.000000
[2008-06-05 14:23:06] uc_update: 
hostXXXXX/disk_stats_md10/sol_disk_throughput: ds[2] = 0.000000
[2008-06-05 14:23:06] network plugin: cache_check: key = 
hostXXXXX/disk_stats_md10/sol_disk_throughput; time = 1212690186; retval = 0

[2008-06-05 14:23:06] RET_BUFFER: 859

[2008-06-05 14:23:06] plugin_dispatch_values: time = 1212690186; interval = 
60; host = hostXXXXX; plugin = disk_stats_md10; plugin_instance = ; type = 
sol_disk_avg_svctime; type_instance = ;
[2008-06-05 14:23:06] uc_update: 
hostXXXXX/disk_stats_md10/sol_disk_avg_svctime: ds[0] = 0.000000
[2008-06-05 14:23:06] uc_update: 
hostXXXXX/disk_stats_md10/sol_disk_avg_svctime: ds[1] = 0.000000
[2008-06-05 14:23:06] network plugin: cache_check: key = 
hostXXXXX/disk_stats_md10/sol_disk_avg_svctime; time = 1212690186; retval = 0

[2008-06-05 14:23:06] RET_BUFFER: 810
#############


As you can see, the RET_BUFFER continue to drop as long as data are 
accumulated.



#############
# THIRD PART
#############
[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690188; interval = 
60; host = hostXXXXXX; plugin = mem_stats; plugin_instance = ; type = 
sol_free_list; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/mem_stats/sol_free_list: ds[0] = 
10553800.000000
[2008-06-05 14:23:08] network plugin: cache_check: key = 
hostXXXXXX/mem_stats/sol_free_list; time = 1212690188; retval = 0

[2008-06-05 14:23:08] RET_BUFFER: 2

[2008-06-05 14:23:08] network plugin: flush_buffer: send_buffer_fill = 1022
[2008-06-05 14:23:08] network plugin: network_send_buffer: buffer_len = 1022
#############


That's fine, since we almost reached the max bufsize of the array, data are 
sent to the collectd server which does all the work to put it in RRD data.

Then, just after:



#############
# FOURTH PART
#############
[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval = 
60; host = hostXXXXXX; plugin = mem_stats; plugin_instance = ; type = 
sol_swapin
g; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/mem_stats/sol_swaping: ds[0] = 
0.000000
[2008-06-05 14:23:08] network plugin: cache_check: key = 
hostXXXXXX/mem_stats/sol_swaping; time = 1212690128; retval = 0

[2008-06-05 14:23:08] RET_BUFFER: 942

[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval = 
60; host = hostXXXXXX; plugin = mem_stats; plugin_instance = ; type = 
sol_scanra
te; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/mem_stats/sol_scanrate: ds[0] = 
0.000000
[2008-06-05 14:23:08] network plugin: cache_check: key = 
hostXXXXXX/mem_stats/sol_scanrate; time = 1212690128; retval = 0

[2008-06-05 14:23:08] RET_BUFFER: 910

[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval = 
60; host = hostXXXXXX; plugin = network_stats; plugin_instance = ; type = 
sol_tc
p_stats; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[0] 
= 149.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[1] 
= 1.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[2] 
= 1210.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[3] 
= 10886.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[4] 
= 10819.000000
[2008-06-05 14:23:08] network plugin: cache_check: key = 
hostXXXXXX/network_stats/sol_tcp_stats; time = 1212690128; retval = 0

[2008-06-05 14:23:08] RET_BUFFER: 823

[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval = 
60; host = hostXXXXXX; plugin = network_stats; plugin_instance = ; type = 
sol_ud
p_stats; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_udp_stats: ds[0] 
= 3.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_udp_stats: ds[1] 
= 5.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_udp_stats: ds[2] 
= 0.000000
[2008-06-05 14:23:08] network plugin: cache_check: key = 
hostXXXXXX/network_stats/sol_udp_stats; time = 1212690128; retval = 0

[2008-06-05 14:22:08] RET_BUFFER: 772

#############


Very good, the buffer is flushed and re-filled with new data according to 
RET_BUFFER until ...


#############
# FIFTH PART
#############
[2008-06-05 14:23:16] plugin: plugin_read_all: Signalling `read_cond'
[2008-06-05 14:23:16] [thread #    6] plugin: plugin_read_thread: Handling 
exec
[2008-06-05 14:23:16] [thread #    6] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:16] [thread #    2] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:16] [thread #    3] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:16] [thread #    5] plugin: plugin_read_thread: Waiting on 
read_cond.
[2008-06-05 14:23:16] [thread #    4] plugin: plugin_read_thread: Waiting on 
read_cond.

[2008-06-05 14:24:01] kstat chain has been updated
[2008-06-05 14:24:06] kstat chain has been updated

[2008-06-05 14:24:06] plugin_dispatch_values: time = 1212690186; interval = 
60; host = hostXXXXXX; plugin = disk_stats_md10; plugin_instance = ; type = 
sol_
disk_iops; type_instance = ;
[2008-06-05 14:24:06] uc_update: hostXXXXXX/disk_stats_md10/sol_disk_iops: 
ds[0] = 0.000000
[2008-06-05 14:24:06] uc_update: hostXXXXXX/disk_stats_md10/sol_disk_iops: 
ds[1] = 0.000000
[2008-06-05 14:24:06] uc_update: hostXXXXXX/disk_stats_md10/sol_disk_iops: 
ds[2] = 0.000000
[2008-06-05 14:24:06] network plugin: cache_check: key = 
hostXXXXXX/disk_stats_md10/sol_disk_iops; time = 1212690186; retval = 0

[2008-06-05 14:23:06] RET_BUFFER: 916
#############

Take a look at "RET_BUFFER", Instead of sending data, they seems to be 
discarded because the 'master plugin' did a 'reset' to the network plugin 
according to "kstat chain has been updated" .

Indeed, I didn't see something like this:

[2008-06-05 14:23:08] network plugin: network_send_buffer: buffer_len = 1022

That would tell me that the data are sent before resetting.

Am I wrong?

Thank you very much!

Eric




On Wednesday 04 June 2008 15:53, Eric LeBlanc wrote:
> Hi,
>
> I installed collectd on a linux server (which create rrd) and on a solaris.
> Both run on 4.4.1.
>
> It seems that there's a limit with buffers in network plugin on Solaris
> client.  Or at least, if we have more than 1024 chars, the remaining is
> discarded (it's my pure guess).
>
> Before all, here is the type.db config:
>
> sol_disk_iops                   read_iops:GAUGE:0:4294967295,
> write_iops:GAUGE:0:4294967295, total_iops:GAUGE:0:4294967295
> sol_disk_throughput             read_throughput:GAUGE:0:4294967295,
> write_throughput:GAUGE:0:4294967295, total_throughput:GAUGE:0:4294967295
> sol_disk_avg_svctime            total_iowait:GAUGE:0:4294967295,
> total_ioactive:GAUGE:0:4294967295
> sol_disk_avg_iops               wait_iops_avg:GAUGE:0:4294967295,
> active_iops_avg:GAUGE:0:4294967295, total_iops_avg:GAUGE:0:4294967295
> sol_tcp_stats                   tcpActiveOpen:GAUGE:0:4294967295,
> tcpPassiveOpen:GAUGE:0:4294967295, tcpACurrEstab:GAUGE:0:4294967295,
> tcpInSeg_rate:GAUGE:0:4294967295, tcpOutSeg_rate:GAUGE:0:4294967295
> sol_udp_stats                   InDatagrams_rate:GAUGE:0:4294967295,
> OutDatagrams_rate:GAUGE:0:4294967295, InErrors:GAUGE:0:4294967295
> sol_iface_stats                 rx_rate:GAUGE:0:4294967295,
> rxerrs:GAUGE:0:4294967295, tx_rate:GAUGE:0:4294967295,
> txerrs:GAUGE:0:4294967295
> sol_cpu_process                 running:GAUGE:0:4294967295,
> blocked:GAUGE:0:4294967295, waiting:GAUGE:0:4294967295
> sol_cpu_usage                   user:GAUGE:0:100, system:GAUGE:0:100,
> idle:GAUGE:0:100
> sol_avail_swap                  free_swap:GAUGE:0:4294967295
> sol_free_list                   free_list:GAUGE:0:4294967295
> sol_swaping                     swapout:GAUGE:0:4294967295
> sol_scanrate                    scanrate:GAUGE:0:4294967295
>
>
> Here is the example of print to the collectd daemon:
>
>
> hostXXXXX/cpu_usage/sol_cpu_usage interval=2 N:93:7:0
> hostXXXXX/mem_stats/sol_avail_swap interval=2 N:26754864
> hostXXXXX/mem_stats/sol_free_list interval=2 N:10950552
> hostXXXXX/mem_stats/sol_swaping interval=2 N:0
> hostXXXXX/mem_stats/sol_scanrate interval=2 N:0
> hostXXXXX/disk_stats_md10/sol_disk_iops interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md10/sol_disk_throughput interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md10/sol_disk_avg_svctime interval=2 N:0.0:0.0
> hostXXXXX/disk_stats_md10/sol_disk_avg_iops interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md30/sol_disk_iops interval=2 N:0.0:1.5:1.5
> hostXXXXX/disk_stats_md30/sol_disk_throughput interval=2 N:0.0:16.0:16
> hostXXXXX/disk_stats_md30/sol_disk_avg_svctime interval=2 N:0.0:9.1
> hostXXXXX/disk_stats_md30/sol_disk_avg_iops interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md40/sol_disk_iops interval=2 N:0.0:42.9:42.9
> hostXXXXX/disk_stats_md40/sol_disk_throughput interval=2 N:0.0:343.2:343.2
> hostXXXXX/disk_stats_md40/sol_disk_avg_svctime interval=2 N:2.6:7.2
> hostXXXXX/disk_stats_md40/sol_disk_avg_iops interval=2 N:0.1:0.3:0.4
> hostXXXXX/iface_stats/sol_iface_stats interval=2 N:317:0:230:0
> hostXXXXX/network_stats/sol_tcp_stats interval=2 N:6:0:1165:61:60
> hostXXXXX/network_stats/sol_udp_stats interval=2 N:0:0:0
>
> Now, below, take a look at the modification time (6th column), on Linux
> server, which is receiving all statistical data from the Solaris box:
>
> ./cpu_usage:
> -rw-r--r--  1 root root 745344 Jun  4 15:36 sol_cpu_process.rrd
> -rw-r--r--  1 root root 745344 Jun  4 15:36 sol_cpu_usage.rrd
> ./disk_stats_md10:
> -rw-r--r--  1 root root 745344 Jun  4 15:35 sol_disk_avg_iops.rrd
> -rw-r--r--  1 root root 497384 Jun  4 15:35 sol_disk_avg_svctime.rrd
> -rw-r--r--  1 root root 745344 Jun  4 15:35 sol_disk_iops.rrd
> -rw-r--r--  1 root root 745344 Jun  4 15:35 sol_disk_throughput.rrd
> ./disk_stats_md30:
> -rw-r--r--  1 root root 745344 Jun  4 15:35 sol_disk_avg_iops.rrd
> -rw-r--r--  1 root root 497384 Jun  4 15:35 sol_disk_avg_svctime.rrd
> -rw-r--r--  1 root root 745344 Jun  4 15:35 sol_disk_iops.rrd
> -rw-r--r--  1 root root 745344 Jun  4 15:35 sol_disk_throughput.rrd
> ./disk_stats_md40:
> -rw-r--r--  1 root root 745344 Jun  4 15:21 sol_disk_avg_iops.rrd
> -rw-r--r--  1 root root 497384 Jun  4 15:21 sol_disk_avg_svctime.rrd
> -rw-r--r--  1 root root 745344 Jun  4 15:21 sol_disk_iops.rrd
> -rw-r--r--  1 root root 745344 Jun  4 15:21 sol_disk_throughput.rrd
> ./iface_stats:
> -rw-r--r--  1 root root 993304 Jun  4 15:36 sol_iface_stats.rrd
> ./mem_stats:
> -rw-r--r--  1 root root 249424 Jun  4 15:36 sol_avail_swap.rrd
> -rw-r--r--  1 root root 249424 Jun  4 15:36 sol_free_list.rrd
> -rw-r--r--  1 root root 249424 Jun  4 15:35 sol_scanrate.rrd
> -rw-r--r--  1 root root 249424 Jun  4 15:35 sol_swaping.rrd
> ./network_stats:
> -rw-r--r--  1 root root 1241264 Jun  4 15:35 sol_tcp_stats.rrd
> -rw-r--r--  1 root root  745344 Jun  4 15:35 sol_udp_stats.rrd
>
> Sometimes, it's the rrd files from network_stats that aren't updated,
> sometime it's disk_stats_md40 (which is currently the case).
>
> If you want the debug log of the collectd daemon or if you need any others
> details, just ask me.
>
> Thank you very much for your help!
>
> E.

-- 
Eric LeBlanc <eleblanc at taleo.com>
Unix System Administrator
Taleo inc.



More information about the collectd mailing list