[collectd] Persist parameter (for threshold) doesn't work as expected

Florian Forster octo at verplant.org
Tue Mar 18 15:40:05 CET 2008


Hi Dieter,

On Tue, Mar 18, 2008 at 02:11:44PM +0100, Dieter Bloms wrote:
> > yes, now I get the OKAY notification, but now I don't get any
> > notification about the midterm and longterm, if I set Persist false.
> 
> same with Persist true.
> I only get notifications for shortterm, but I need the longterm

The problem at hand is a bit messy. I'll try to explain what it is;
maybe someone has a good idea how to handle all this:

All values are first stored in a `cache'. Since the cache holds the old
value of a data set, it can translate counter-values to rates. This is
necessary for threshold checking, since comparing a raw counter value to
some threshold is nonsense.
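
A rough sketch of that cache step, written in Python for brevity rather than collectd's C. The names `RateCache` and `update` are made up for illustration, and counter wrap-around is deliberately ignored:

```python
class RateCache:
    """Remembers the previous counter reading per key to derive rates."""

    def __init__(self):
        self._last = {}  # key -> (timestamp, counter value)

    def update(self, key, timestamp, counter):
        """Store the new reading; return the rate per second, or None
        if there is no usable previous reading."""
        previous = self._last.get(key)
        self._last[key] = (timestamp, counter)
        if previous is None:
            return None  # first reading: nothing to compare against
        old_time, old_counter = previous
        elapsed = timestamp - old_time
        if elapsed <= 0:
            return None  # no time passed, so no meaningful rate
        return (counter - old_counter) / elapsed
```

Only the returned rate, never the raw counter, would then be handed to the threshold check.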

After that, the rate (or the plain gauge value, if applicable) of each
data source within the data set is compared to the threshold. The code
then considers the worst data source and, if needed, sends out a
notification.[1]
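
The "worst data source" selection can be sketched roughly as follows, again in Python and not taken from the collectd source. The function names and the OKAY/WARNING/FAILURE constants are assumptions for illustration (FAILURE is what footnote [1] calls `error'), and only upper limits are checked:

```python
OKAY, WARNING, FAILURE = 0, 1, 2

def classify(value, warning_max, failure_max):
    """Map one value to a severity level using upper limits only."""
    if value > failure_max:
        return FAILURE
    if value > warning_max:
        return WARNING
    return OKAY

def worst_data_source(names, values, warning_max, failure_max):
    """Return (name, level) of the worst data source in the data set.

    The comparison is strict, so on a tie the first data source wins.
    Returns (None, OKAY) if everything is within limits.
    """
    worst_name, worst_level = None, OKAY
    for name, value in zip(names, values):
        level = classify(value, warning_max, failure_max)
        if level > worst_level:
            worst_name, worst_level = name, level
    return worst_name, worst_level
```

With the numbers from footnote [1], `worst_data_source(["shortterm", "midterm", "longterm"], [7.0, 9.0, 4.0], 2.0, 8.0)` picks `midterm' at the failure level.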

Last but not least, once per interval the entire cache is searched for
``missing values'', i. e. values that were not updated recently.
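
As a sketch, that periodic search boils down to something like the following. `find_missing` and `max_age` are hypothetical names; what exactly counts as "recently" is an assumption left to the caller here:

```python
def find_missing(last_update, now, max_age):
    """Return the keys whose last update is older than max_age seconds.

    last_update maps a value's identifier to the time of its last update.
    """
    return sorted(key for key, stamp in last_update.items()
                  if now - stamp > max_age)
```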

The core of the problem is that the ``state'' is saved per _data set_,
i. e. for a set of data sources. So if `longterm' was too high and is
now okay again, the code has no way to determine _which_ data source
used to be bad and is now good again.

I think the best solution is to have thresholds and states on a per data
source basis. That way you could specify that you are only interested in
`longterm' but not in `shortterm' and `midterm'.
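
To make the idea concrete, here is a minimal sketch of per-data-source state, in Python rather than C. `PerSourceState` and its `update` method are invented for this example, and unseen data sources are assumed to start out as "okay":

```python
class PerSourceState:
    """Tracks the last known state per (data set, data source) pair."""

    def __init__(self):
        self._state = {}  # (data set, data source) -> state string

    def update(self, data_set, source, state):
        """Record the new state; return (source, old, new) on a change,
        otherwise None."""
        key = (data_set, source)
        old = self._state.get(key, "okay")  # assumption: default is okay
        self._state[key] = state
        if old != state:
            return (source, old, state)
        return None
```

Because the key includes the data source, a transition back to "okay" can name the data source that recovered, which is exactly what the per-data-set state cannot do.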

To implement that I'd turn every entry in the threshold tree into a
linked list and allow multiple definitions of the otherwise same
threshold. The code then uses the first matching threshold from that
list. That could be used like this:
  <Threshold>
    <Type "load">
      DataSource "longterm"
      FailureMax 4.0
    </Type>
    <Type "load">
      # The other data sources
      FailureMax 12.0
    </Type>
  </Threshold>
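
The first-match lookup could work roughly like this. This is a sketch only: a plain Python list stands in for the linked list, and `THRESHOLDS` and `find_threshold` are hypothetical names, with `None` meaning "no DataSource given, matches any":

```python
# Mirrors the example configuration: entries are kept in config order.
THRESHOLDS = {
    "load": [
        {"data_source": "longterm", "failure_max": 4.0},
        {"data_source": None, "failure_max": 12.0},  # the other data sources
    ],
}

def find_threshold(thresholds, type_name, data_source):
    """Return the first threshold entry that applies, or None."""
    for entry in thresholds.get(type_name, []):
        if entry["data_source"] in (None, data_source):
            return entry
    return None
```

Since the first match wins, the specific `longterm' entry has to come before the catch-all one.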

Does any of that sound reasonable?

Regards,
-octo

[1] If, for instance, you have the following thresholds:
    WarningMax 2.0
    FailureMax 8.0
    Now assume the current load (shortterm, midterm, longterm) is
    (7.0, 9.0, 4.0). A notification with the severity `error' will be
    created, complaining that `midterm' is too high.
    If more than one data source is outside the failure threshold, one
    notification will be sent for the first such data source.
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/