[collectd] Issues with network buffers with Solaris
Eric LeBlanc
eleblanc at taleo.com
Thu Jun 5 20:41:22 CEST 2008
Hi again,
I tried to narrow down the bug and I found out that it seems
collectd "re-initialize" all plugins without verifiying if some data are
still in memory ?
To summarize that, here's what I did:
First, I added this:
DEBUG("RET_BUFFER: %d\n", *ret_buffer_len);
into function write_part_values() in network.c, at line 382, just after this
following line:
*ret_buffer_len -= packet_len;
Then, I compiled and fired up collectd. Here's the log (cleaned for better
reading):
#############
# FIRST PART
#############
[2008-06-05 14:23:06] plugin: plugin_read_all: Signalling `read_cond'
[2008-06-05 14:23:06] [thread # 5] plugin: plugin_read_thread: Handling
exec
[2008-06-05 14:23:06] [thread # 5] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:06] [thread # 3] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:06] [thread # 4] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:06] [thread # 2] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:06] [thread # 6] plugin: plugin_read_thread: Waiting on
read_cond.
#############
Nothing wrong, we continue...
#############
# SECOND PART
#############
[2008-06-05 14:23:06] plugin_dispatch_values: time = 1212690186; interval =
60; host = hostXXXXX; plugin = disk_stats_md10; plugin_instance = ; type =
sol_disk_iops; type_instance = ;
[2008-06-05 14:23:06] uc_update: hostXXXXX/disk_stats_md10/sol_disk_iops:
ds[0] = 0.000000
[2008-06-05 14:23:06] uc_update: hostXXXXX/disk_stats_md10/sol_disk_iops:
ds[1] = 0.000000
[2008-06-05 14:23:06] uc_update: hostXXXXX/disk_stats_md10/sol_disk_iops:
ds[2] = 0.000000
[2008-06-05 14:23:06] network plugin: cache_check: key =
hostXXXXX/disk_stats_md10/sol_disk_iops; time = 1212690186; retval = 0
[2008-06-05 14:23:06] RET_BUFFER: 916
[2008-06-05 14:23:06] plugin_dispatch_values: time = 1212690186; interval =
60; host = hostXXXXX; plugin = disk_stats_md10; plugin_instance = ; type =
sol_disk_throughput; type_instance = ;
[2008-06-05 14:23:06] uc_update:
hostXXXXX/disk_stats_md10/sol_disk_throughput: ds[0] = 0.000000
[2008-06-05 14:23:06] uc_update:
hostXXXXX/disk_stats_md10/sol_disk_throughput: ds[1] = 0.000000
[2008-06-05 14:23:06] uc_update:
hostXXXXX/disk_stats_md10/sol_disk_throughput: ds[2] = 0.000000
[2008-06-05 14:23:06] network plugin: cache_check: key =
hostXXXXX/disk_stats_md10/sol_disk_throughput; time = 1212690186; retval = 0
[2008-06-05 14:23:06] RET_BUFFER: 859
[2008-06-05 14:23:06] plugin_dispatch_values: time = 1212690186; interval =
60; host = hostXXXXX; plugin = disk_stats_md10; plugin_instance = ; type =
sol_disk_avg_svctime; type_instance = ;
[2008-06-05 14:23:06] uc_update:
hostXXXXX/disk_stats_md10/sol_disk_avg_svctime: ds[0] = 0.000000
[2008-06-05 14:23:06] uc_update:
hostXXXXX/disk_stats_md10/sol_disk_avg_svctime: ds[1] = 0.000000
[2008-06-05 14:23:06] network plugin: cache_check: key =
hostXXXXX/disk_stats_md10/sol_disk_avg_svctime; time = 1212690186; retval = 0
[2008-06-05 14:23:06] RET_BUFFER: 810
#############
As you can see, the RET_BUFFER continue to drop as long as data are
accumulated.
#############
# THIRD PART
#############
[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690188; interval =
60; host = hostXXXXXX; plugin = mem_stats; plugin_instance = ; type =
sol_free_list; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/mem_stats/sol_free_list: ds[0] =
10553800.000000
[2008-06-05 14:23:08] network plugin: cache_check: key =
hostXXXXXX/mem_stats/sol_free_list; time = 1212690188; retval = 0
[2008-06-05 14:23:08] RET_BUFFER: 2
[2008-06-05 14:23:08] network plugin: flush_buffer: send_buffer_fill = 1022
[2008-06-05 14:23:08] network plugin: network_send_buffer: buffer_len = 1022
#############
That's fine, since we almost reached the max bufsize of the array, data are
sent to the collectd server which does all the work to put it in RRD data.
Then, just after:
#############
# FOURTH PART
#############
[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval =
60; host = hostXXXXXX; plugin = mem_stats; plugin_instance = ; type =
sol_swapin
g; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/mem_stats/sol_swaping: ds[0] =
0.000000
[2008-06-05 14:23:08] network plugin: cache_check: key =
hostXXXXXX/mem_stats/sol_swaping; time = 1212690128; retval = 0
[2008-06-05 14:23:08] RET_BUFFER: 942
[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval =
60; host = hostXXXXXX; plugin = mem_stats; plugin_instance = ; type =
sol_scanra
te; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/mem_stats/sol_scanrate: ds[0] =
0.000000
[2008-06-05 14:23:08] network plugin: cache_check: key =
hostXXXXXX/mem_stats/sol_scanrate; time = 1212690128; retval = 0
[2008-06-05 14:23:08] RET_BUFFER: 910
[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval =
60; host = hostXXXXXX; plugin = network_stats; plugin_instance = ; type =
sol_tc
p_stats; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[0]
= 149.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[1]
= 1.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[2]
= 1210.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[3]
= 10886.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_tcp_stats: ds[4]
= 10819.000000
[2008-06-05 14:23:08] network plugin: cache_check: key =
hostXXXXXX/network_stats/sol_tcp_stats; time = 1212690128; retval = 0
[2008-06-05 14:23:08] RET_BUFFER: 823
[2008-06-05 14:23:08] plugin_dispatch_values: time = 1212690128; interval =
60; host = hostXXXXXX; plugin = network_stats; plugin_instance = ; type =
sol_ud
p_stats; type_instance = ;
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_udp_stats: ds[0]
= 3.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_udp_stats: ds[1]
= 5.000000
[2008-06-05 14:23:08] uc_update: hostXXXXXX/network_stats/sol_udp_stats: ds[2]
= 0.000000
[2008-06-05 14:23:08] network plugin: cache_check: key =
hostXXXXXX/network_stats/sol_udp_stats; time = 1212690128; retval = 0
[2008-06-05 14:22:08] RET_BUFFER: 772
#############
Very good, the buffer is flushed and re-filled with new data according to
RET_BUFFER until ...
#############
# FIFTH PART
#############
[2008-06-05 14:23:16] plugin: plugin_read_all: Signalling `read_cond'
[2008-06-05 14:23:16] [thread # 6] plugin: plugin_read_thread: Handling
exec
[2008-06-05 14:23:16] [thread # 6] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:16] [thread # 2] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:16] [thread # 3] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:16] [thread # 5] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:23:16] [thread # 4] plugin: plugin_read_thread: Waiting on
read_cond.
[2008-06-05 14:24:01] kstat chain has been updated
[2008-06-05 14:24:06] kstat chain has been updated
[2008-06-05 14:24:06] plugin_dispatch_values: time = 1212690186; interval =
60; host = hostXXXXXX; plugin = disk_stats_md10; plugin_instance = ; type =
sol_
disk_iops; type_instance = ;
[2008-06-05 14:24:06] uc_update: hostXXXXXX/disk_stats_md10/sol_disk_iops:
ds[0] = 0.000000
[2008-06-05 14:24:06] uc_update: hostXXXXXX/disk_stats_md10/sol_disk_iops:
ds[1] = 0.000000
[2008-06-05 14:24:06] uc_update: hostXXXXXX/disk_stats_md10/sol_disk_iops:
ds[2] = 0.000000
[2008-06-05 14:24:06] network plugin: cache_check: key =
hostXXXXXX/disk_stats_md10/sol_disk_iops; time = 1212690186; retval = 0
[2008-06-05 14:23:06] RET_BUFFER: 916
#############
Take a look at "RET_BUFFER", Instead of sending data, they seems to be
discarded because the 'master plugin' did a 'reset' to the network plugin
according to "kstat chain has been updated" .
Indeed, I didn't see something like this:
[2008-06-05 14:23:08] network plugin: network_send_buffer: buffer_len = 1022
That would tell me that the data are sent before resetting.
Am I wrong?
Thank you very much!
Eric
On Wednesday 04 June 2008 15:53, Eric LeBlanc wrote:
> Hi,
>
> I installed collectd on a linux server (which create rrd) and on a solaris.
> Both run on 4.4.1.
>
> It seems that there's a limit with buffers in network plugin on Solaris
> client. Or at least, if we have more than 1024 chars, the remaining is
> discarded (it's my pure guess).
>
> Before all, here is the type.db config:
>
> sol_disk_iops read_iops:GAUGE:0:4294967295,
> write_iops:GAUGE:0:4294967295, total_iops:GAUGE:0:4294967295
> sol_disk_throughput read_throughput:GAUGE:0:4294967295,
> write_throughput:GAUGE:0:4294967295, total_throughput:GAUGE:0:4294967295
> sol_disk_avg_svctime total_iowait:GAUGE:0:4294967295,
> total_ioactive:GAUGE:0:4294967295
> sol_disk_avg_iops wait_iops_avg:GAUGE:0:4294967295,
> active_iops_avg:GAUGE:0:4294967295, total_iops_avg:GAUGE:0:4294967295
> sol_tcp_stats tcpActiveOpen:GAUGE:0:4294967295,
> tcpPassiveOpen:GAUGE:0:4294967295, tcpACurrEstab:GAUGE:0:4294967295,
> tcpInSeg_rate:GAUGE:0:4294967295, tcpOutSeg_rate:GAUGE:0:4294967295
> sol_udp_stats InDatagrams_rate:GAUGE:0:4294967295,
> OutDatagrams_rate:GAUGE:0:4294967295, InErrors:GAUGE:0:4294967295
> sol_iface_stats rx_rate:GAUGE:0:4294967295,
> rxerrs:GAUGE:0:4294967295, tx_rate:GAUGE:0:4294967295,
> txerrs:GAUGE:0:4294967295
> sol_cpu_process running:GAUGE:0:4294967295,
> blocked:GAUGE:0:4294967295, waiting:GAUGE:0:4294967295
> sol_cpu_usage user:GAUGE:0:100, system:GAUGE:0:100,
> idle:GAUGE:0:100
> sol_avail_swap free_swap:GAUGE:0:4294967295
> sol_free_list free_list:GAUGE:0:4294967295
> sol_swaping swapout:GAUGE:0:4294967295
> sol_scanrate scanrate:GAUGE:0:4294967295
>
>
> Here is the example of print to the collectd daemon:
>
>
> hostXXXXX/cpu_usage/sol_cpu_usage interval=2 N:93:7:0
> hostXXXXX/mem_stats/sol_avail_swap interval=2 N:26754864
> hostXXXXX/mem_stats/sol_free_list interval=2 N:10950552
> hostXXXXX/mem_stats/sol_swaping interval=2 N:0
> hostXXXXX/mem_stats/sol_scanrate interval=2 N:0
> hostXXXXX/disk_stats_md10/sol_disk_iops interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md10/sol_disk_throughput interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md10/sol_disk_avg_svctime interval=2 N:0.0:0.0
> hostXXXXX/disk_stats_md10/sol_disk_avg_iops interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md30/sol_disk_iops interval=2 N:0.0:1.5:1.5
> hostXXXXX/disk_stats_md30/sol_disk_throughput interval=2 N:0.0:16.0:16
> hostXXXXX/disk_stats_md30/sol_disk_avg_svctime interval=2 N:0.0:9.1
> hostXXXXX/disk_stats_md30/sol_disk_avg_iops interval=2 N:0.0:0.0:0
> hostXXXXX/disk_stats_md40/sol_disk_iops interval=2 N:0.0:42.9:42.9
> hostXXXXX/disk_stats_md40/sol_disk_throughput interval=2 N:0.0:343.2:343.2
> hostXXXXX/disk_stats_md40/sol_disk_avg_svctime interval=2 N:2.6:7.2
> hostXXXXX/disk_stats_md40/sol_disk_avg_iops interval=2 N:0.1:0.3:0.4
> hostXXXXX/iface_stats/sol_iface_stats interval=2 N:317:0:230:0
> hostXXXXX/network_stats/sol_tcp_stats interval=2 N:6:0:1165:61:60
> hostXXXXX/network_stats/sol_udp_stats interval=2 N:0:0:0
>
> Now, below, take a look at the modification time (6th column), on Linux
> server, which is receiving all statistical data from the Solaris box:
>
> ./cpu_usage:
> -rw-r--r-- 1 root root 745344 Jun 4 15:36 sol_cpu_process.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:36 sol_cpu_usage.rrd
> ./disk_stats_md10:
> -rw-r--r-- 1 root root 745344 Jun 4 15:35 sol_disk_avg_iops.rrd
> -rw-r--r-- 1 root root 497384 Jun 4 15:35 sol_disk_avg_svctime.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:35 sol_disk_iops.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:35 sol_disk_throughput.rrd
> ./disk_stats_md30:
> -rw-r--r-- 1 root root 745344 Jun 4 15:35 sol_disk_avg_iops.rrd
> -rw-r--r-- 1 root root 497384 Jun 4 15:35 sol_disk_avg_svctime.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:35 sol_disk_iops.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:35 sol_disk_throughput.rrd
> ./disk_stats_md40:
> -rw-r--r-- 1 root root 745344 Jun 4 15:21 sol_disk_avg_iops.rrd
> -rw-r--r-- 1 root root 497384 Jun 4 15:21 sol_disk_avg_svctime.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:21 sol_disk_iops.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:21 sol_disk_throughput.rrd
> ./iface_stats:
> -rw-r--r-- 1 root root 993304 Jun 4 15:36 sol_iface_stats.rrd
> ./mem_stats:
> -rw-r--r-- 1 root root 249424 Jun 4 15:36 sol_avail_swap.rrd
> -rw-r--r-- 1 root root 249424 Jun 4 15:36 sol_free_list.rrd
> -rw-r--r-- 1 root root 249424 Jun 4 15:35 sol_scanrate.rrd
> -rw-r--r-- 1 root root 249424 Jun 4 15:35 sol_swaping.rrd
> ./network_stats:
> -rw-r--r-- 1 root root 1241264 Jun 4 15:35 sol_tcp_stats.rrd
> -rw-r--r-- 1 root root 745344 Jun 4 15:35 sol_udp_stats.rrd
>
> Sometimes, it's the rrd files from network_stats that aren't updated,
> sometime it's disk_stats_md40 (which is currently the case).
>
> If you want the debug log of the collectd daemon or if you need any others
> details, just ask me.
>
> Thank you very much for your help!
>
> E.
--
Eric LeBlanc <eleblanc at taleo.com>
Unix System Administrator
Taleo inc.
More information about the collectd
mailing list