[collectd] overflow of procstat_t cpu_user/system_counter in processes

james at jwarner.org james at jwarner.org
Tue Jul 14 19:30:13 CEST 2009


Hi,

First of all, I want to say that I have been using collectd for a while and
I really like the product.

However, when I was reading the source for the processes plugin, I noticed
that the cpu_user_counter and cpu_system_counter variables in
ps_read_process are unsigned long long, while the procstat_t fields
cpu_user_counter and cpu_system_counter are only unsigned long.

I did some testing (on CentOS 5.2 and Slackware 12.x). If a single process
uses close to 100% user time on a single-core machine, the downcast to
unsigned long looks like it works fine.

However, if a process runs 2 threads at 100% user time on a multicore box,
the downcast to unsigned long looks like it overflows.
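To illustrate the kind of truncation I mean, here is a minimal sketch. It
assumes a platform where unsigned long is 32 bits wide and uses uint32_t as
a stand-in for it. The helper names are made up for this example and are
not taken from the collectd source. Under that assumption the microsecond
counter wraps after 2^32 microseconds, which is roughly 71.6 minutes of
accumulated CPU time, and two busy threads accumulate CPU time about twice
as fast as one.

```c
#include <stdint.h>

/* Hypothetical helper: the jiffies-to-microseconds conversion quoted in
 * this mail, done in 64-bit arithmetic. hz plays the role of CONFIG_HZ. */
static unsigned long long jiffies_to_usec(unsigned long long jiffies,
                                          unsigned long long hz)
{
    return jiffies * 1000000ULL / hz;
}

/* Hypothetical helper: store a 64-bit microsecond count in a 32-bit field.
 * Assumption: uint32_t stands in for a 32-bit unsigned long. The cast
 * keeps only the low 32 bits, i.e. the value modulo 2^32. */
static uint32_t store_in_32bit_counter(unsigned long long usec)
{
    return (uint32_t)usec;
}
```

With hz = 100, 429496 jiffies (4294960000 microseconds) still fit in 32
bits, but 429497 jiffies (4294970000 microseconds) wrap around to 2704.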

Is this harmless given how the final value that is written to disk is
calculated, or is this a bug?

I have attached a patch that converts the unsigned longs in procstat_t to
unsigned long longs, and a C program that generates the load that I
described. I tested it briefly and it looks like it works, but I confess
that I don't understand the collectd source all that well yet (you know,
at all :).

Another alternative to patching things the way that I have here would be to
remove the conversion from jiffies to microseconds
(cpu_user_counter = cpu_user_counter * 1000000 / CONFIG_HZ;). However, I
didn't do that in this instance because I think it would require additional
patches to the web frameworks that display the data, and I was hoping to
keep things simple. That said, I'm not sure what the value is in converting
to microseconds here, since /proc/<pid>/stat seems to report data in
hundredths of a second (USER_HZ?) on both Slackware and CentOS. Converting
the reported value to microseconds adds a level of precision that isn't
obtainable from /proc.
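As a rough illustration of the precision point, the sketch below computes
how many microseconds one clock tick represents. It queries the tick rate
with sysconf(_SC_CLK_TCK) rather than using collectd's CONFIG_HZ; that
substitution, the fallback value of 100, and the helper names are
assumptions made for this example only.

```c
#include <unistd.h>

/* Hypothetical helper: clock ticks per second as reported by the system.
 * Assumption: fall back to 100 if sysconf() reports an error. */
static long clock_ticks_per_second(void)
{
    long hz = sysconf(_SC_CLK_TCK);
    return (hz > 0) ? hz : 100;
}

/* Hypothetical helper: microseconds represented by a single tick.
 * Every converted counter value is a multiple of this step. */
static unsigned long long usec_per_tick(long hz)
{
    return 1000000ULL / (unsigned long long)hz;
}
```

At 100 ticks per second each tick is 10000 microseconds, so the converted
counters only ever change in steps of 10000.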


Anyway, if you could take a look and let me know whether this is a bug or
something that I am simply failing to understand, that would be great.

Thanks,

James Warner
-------------- next part --------------
A non-text attachment was scrubbed...
Name: processes.cpu_counter_overflow.patch
Type: application/octet-stream
Size: 1247 bytes
Desc: not available
Url : http://mailman.verplant.org/pipermail/collectd/attachments/20090714/9a6b265d/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: thread_test.c
Type: application/octet-stream
Size: 525 bytes
Desc: not available
Url : http://mailman.verplant.org/pipermail/collectd/attachments/20090714/9a6b265d/attachment-0001.obj 

