[collectd] overflow of procstat_t cpu_user/system_counter in processes

James Warner james at jwarner.org
Wed Jul 15 17:23:36 CEST 2009


Florian,

Thanks for looking into this and explaining.  I was thoroughly confused
about how this could be working (and yet the final value always seemed
correct).  I'm still fairly sure that I haven't completely grasped the way
the processes plugin works, but for my purposes at the moment I don't
really have to.

Thanks again for the explanation.

James

> Hi James,
>
> On Tue, Jul 14, 2009 at 10:30:13AM -0700, james at jwarner.org wrote:
>> However, when I was reading the source for the processes plugin I
>> noticed that the cpu_user_counter and cpu_system_counter value in
>> ps_read_process are unsigned long long values and that the procstat_t
>> values for cpu_user_counter and cpu_system_counter are unsigned long
>> only.
>
> This is done on purpose, but I wouldn't be at all surprised if there was
> a bug in there somewhere.
>
> The basic problem here is that we want to add counters to one another. If
> all counters have the same size, all works well enough:
>
>   32bit = (32bit + 32bit + ... + 32bit) mod 2^32
>
> This also works if the counters being added up are larger than the
> destination:
>
>   32bit = (64bit + 64bit + ... + 64bit) mod 2^32
>
> What does not work is if there is one counter which is smaller than the
> destination counter:
>
>   64bit = (32bit + 32bit + ... + 32bit) mod 2^64  <--- WRONG!
>
> If one of the 32bit counters overflows, the code will think the 64bit
> counter overflowed, too, resulting in a huge spike.
>
> It isn't a problem if a counter is added, either: You can assume it was
> there all along but was zero all the time. It *is* a problem if a
> counter is removed, though. And I currently don't see how that case is
> handled (if it's handled at all).
>
> Maybe the easiest and most straightforward method would be to simply
> calculate a rate for each PID and then add all those rates to a private
> counter.
>
> Another problem is that `unsigned long' may be 64bit wide on 64bit
> architectures. If the counters provided by the operating system are only
> 32bit wide, we will have problems as described above.
>
> Regards,
> -octo
> --
> Florian octo Forster
> Hacker in training
> GnuPG: 0x91523C3D
> http://verplant.org/
>
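
The width mismatch described in the quoted explanation above can be
reproduced with a few lines of C.  This is only a minimal sketch, not code
from the processes plugin: the function names are made up for illustration,
and it assumes two 32bit source counters and a consumer that computes the
increase between two readings modulo the width of the destination.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of collectd. */

/* 32bit = (32bit + 32bit) mod 2^32 */
static uint32_t sum_into_32(uint32_t a, uint32_t b) {
  return (uint32_t)(a + b);
}

/* 64bit = (32bit + 32bit): the sum itself can never wrap at 2^64. */
static uint64_t sum_into_64(uint32_t a, uint32_t b) {
  return (uint64_t)a + (uint64_t)b;
}

/* Increase between two readings, assuming the counter wraps at 2^32. */
static uint32_t increase_32(uint32_t prev, uint32_t cur) {
  return (uint32_t)(cur - prev);
}

/* Increase between two readings, assuming the counter wraps at 2^64. */
static uint64_t increase_64(uint64_t prev, uint64_t cur) {
  return cur - prev;
}
```

With the first source going from UINT32_MAX - 5 to 4 (it wrapped after ten
more ticks) and the second going from 100 to 110, the 32bit sums are 94 and
114, so increase_32 reports the correct 20.  The 64bit sums are 2^32 + 94
and 114, so increase_64 reports 2^64 - 2^32 + 20, which is the huge spike.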

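The per-PID idea from the quoted mail could look roughly like the sketch
below.  It is a simplified variant under stated assumptions, not the
plugin's actual code: it adds each PID's increase per read (the rate times
the interval) rather than a time-normalized rate, it assumes the operating
system's counters are 32bit wide, and pid_state_t and account_reading are
hypothetical names.

```c
#include <stdint.h>

/* Hypothetical per-PID bookkeeping, not a collectd structure. */
typedef struct {
  uint32_t last_raw; /* previous raw reading for this PID */
  int have_last;     /* zero until the first reading has been seen */
} pid_state_t;

/* Fold one new raw reading for a PID into the private 64bit total.
 * The increase is computed at the source width (mod 2^32), so a wrap
 * of the per-PID counter never reaches the total. */
static void account_reading(pid_state_t *st, uint32_t raw, uint64_t *total) {
  if (st->have_last)
    *total += (uint32_t)(raw - st->last_raw);
  st->last_raw = raw;
  st->have_last = 1;
}
```

Because only increases are added, a process that exits simply stops
contributing and its state can be dropped; the total never moves
backwards, which sidesteps the removed-counter case.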



