[collectd] RFC: Changes to data sources and naming schema
octo at collectd.org
Mon Sep 23 20:12:36 CEST 2013
[TLDR: Do you have a use-case for raw counter values?]
Good morning everybody,
we had a great time at the Hackathon in Berlin yesterday. Thanks
again to everyone!
Amongst the ideas we discussed were some fundamental changes to the way
metrics are represented. These ideas might eventually result in a
collectd version 6, but don't hold your breath just yet – no actual
coding has been done in that direction, we're just collecting design
ideas at the moment.
1) Get rid of multiple "data sources" per metric.
Some metrics, e.g. the "if_octets" metric from the "interface" plugin
and the "load" metric from the "load" plugin, have multiple "data
sources". The "if_octets" metric, for example, has the data sources
"rx" and "tx" for received and transmitted bytes.
We would like to remove this functionality altogether. Rather than one
metric with two values, we would like the "interface" plugin to create
two metrics with one value each. Since version 5.0 this is mostly how
metrics are defined anyway and only a few cases are left; now we would
like to actually remove the functionality. We reached a consensus on
this, so it's essentially a done deal.
Advantages:
* A lot of collectd code becomes a lot easier (fewer bugs)
* A lot of front-end and graphing code becomes a lot easier (more
and better front-ends)
* Mapping of collectd metrics to names used by other systems,
e.g. Graphite, is easier / more consistent
* Splitting up existing RRD files by "data source" is a solved
problem; writing a migration script is fairly simple
* A point which causes much confusion for new users is resolved
* Building a backwards compatibility layer for this is going to be
straightforward
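To make the proposed split concrete, here is a small sketch (hypothetical
structures and names, in Python rather than collectd's C, not the actual
collectd API): a metric carrying two data sources is turned into two
single-valued metrics, using the data source name as the type instance:

```python
# Hypothetical sketch -- not actual collectd code or API.
# A value list with multiple data sources, as dispatched today:
if_octets = {
    "host": "example.com",
    "plugin": "interface",
    "plugin_instance": "eth0",
    "type": "if_octets",
    "values": {"rx": 1200, "tx": 340},  # two data sources
}

def split_data_sources(metric):
    """Turn one multi-valued metric into several single-valued ones,
    keeping all identifying fields and moving the data source name
    into the type instance."""
    result = []
    for ds_name, value in metric["values"].items():
        single = dict(metric, type_instance=ds_name,
                      values={"value": value})
        result.append(single)
    return result

# After the split: two metrics ("rx" and "tx"), one value each.
parts = split_data_sources(if_octets)
```

A migration script for existing RRD files would do essentially the same
mapping once per file.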
2) Calculate the rate of counters / DERIVEs early on and after that only
handle gauge values.
Right now, values come in four flavors: GAUGE and DERIVE, and two more
special cases which are hardly ever used. These numbers are passed
through the daemon as they are, i.e.:
* The CPU plugin gets a counter of how many ticks / jiffies the CPU
has spent in user mode since some unspecified time in the past.
* This number is "dispatched" as a DERIVE type value.
* The output plugins will write this absolute number.
However, in the case of DERIVE (and COUNTER) values, these absolute
numbers are meaningless on their own. In order to do anything meaningful
with them, the difference between two values (and their respective
times) is calculated, which results in the averaged _rate_ of change.
This is what output plugins do when their "StoreRates" setting is
enabled. But not only there: threshold checking, scaling, aggregation –
all of these operate on the _rate_ rather than the absolute number.
We would like to change the way DERIVEs are handled within collectd:
Instead of keeping the original absolute values, we would like to
calculate the rate as early as possible, possibly within the read
plugins, and only handle the rate from there on.
We only came up with one use case where having the raw counter values is
beneficial: if you want to calculate the average rate over an arbitrary
time span, it's easier to look up the raw counter values at the two
points in time and go from there. However, you can also sum up the
individual rates to reach the same result. And when counter resets /
overflows occur within that interval, integrating over / summing the
rates is actually the simpler approach.
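As a sketch of the calculation itself (a hypothetical helper, not
collectd's actual implementation): the rate is the difference of two
counter readings divided by the difference of their timestamps, and a
negative difference is treated as a counter wrap of a known bit width:

```python
# Hypothetical sketch of early rate calculation -- not collectd code.

def counter_rate(prev_value, prev_time, value, time, width=64):
    """Average rate of change between two counter readings.
    A negative delta is assumed to be a wrap-around of a counter
    with the given bit width."""
    delta = value - prev_value
    if delta < 0:
        delta += 1 << width  # counter wrapped around
    return delta / (time - prev_time)

# Plain case: 200 ticks in 10 seconds -> 20 ticks/s.
print(counter_rate(1000, 0, 1200, 10))                  # 20.0

# 32-bit counter wrap: the raw delta is negative, the rate is still right.
print(counter_rate(2**32 - 100, 0, 100, 10, width=32))  # 20.0
```

Once this conversion has happened, everything downstream – thresholds,
aggregation, output plugins – only ever sees gauge-like rate values.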
Do you have any other use-case for raw counter values?
Advantages:
* Handling of values becomes easier.
* The rate is calculated only once, in contrast to potentially several
times, which might be more efficient (currently each rate conversion
involves a lookup call).
* Together with (1), this removes the need for having the "types.db"
file, which could then be dropped entirely. We were in wild agreement
that this would be a worthwhile goal.
Disadvantages:
* The original raw value is lost. It can be reconstructed except for a
(more or less) constant offset, though.
3) Changes to the naming schema.
This is the topic we discussed the most, and where opinions were the
most diverse. Currently, collectd has a very static naming schema
consisting of host, plugin, type and two optional fields, "plugin
instance" and "type instance". This works well in many cases, but has
some drawbacks and limitations. For example, the Varnish plugin has to
squeeze both the Varnish server and the subcomponent into the "plugin
instance", which is not ideal.
We discussed two alternatives:
* Use a path (or, expressed more sciency, an ordered list of strings)
to identify metrics. A CPU metric could look like this:
["example.com", "cpu", "0", "idle"]
* Use an (unordered) set of key-value pairs to identify metrics. You
can think of this as a JSON object that only has string members, if
you like. We would likely make at least two fields mandatory, for
example "source" (or "host") and "metric" (or "name"). A CPU metric
could look like this, for example:
{
  "source": "example.com",  // required
  "metric": "cpu usage",    // required
  "cpu-id": "0",            // optional
  "cpu-state": "idle"       // optional
}
When filtering or aggregating, the first option would require using
indexes, for example "get metrics where index 0 is 'example.com'". Here,
"index 0" refers to the "source". The second alternative would allow us
to use names (rather than indexes) to refer to a specific part of the
name, e.g. "get metrics where 'source' is 'example.com'".
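To make that difference concrete, here is a small sketch (hypothetical,
in Python rather than collectd's C) of filtering under the second
alternative, where parts of the name are addressed by key instead of by
position:

```python
# Hypothetical sketch of key-value metric identities -- not collectd code.
metrics = [
    {"source": "example.com", "metric": "cpu usage",
     "cpu-id": "0", "cpu-state": "idle"},
    {"source": "example.com", "metric": "cpu usage",
     "cpu-id": "1", "cpu-state": "user"},
    {"source": "other.example.net", "metric": "cpu usage",
     "cpu-id": "0", "cpu-state": "idle"},
]

def select(metrics, criteria):
    """'get metrics where <key> is <value>' for every given key."""
    return [m for m in metrics
            if all(m.get(key) == value for key, value in criteria.items())]

# "get metrics where 'source' is 'example.com'"
print(len(select(metrics, {"source": "example.com"})))  # 2
```

Under the first (path-based) alternative, the same query would have to
hard-code that "index 0" happens to be the source.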
I'd love to hear what people think about the entire topic of naming.
There are good reasons for either schema, and there are also good
reasons for staying with the current concept and living with its flaws.
Which schema would meet your needs best, and why? What are those needs?
collectd – The system statistics collection daemon