[collectd] [PATCH] New plugin - lpar

Aurélien Reynaud collectd at wattapower.net
Fri Aug 20 01:06:32 CEST 2010


Hello Florian,


That's a lot of questions! I'll try my best to answer them. Fortunately
I had to ask myself the same questions while developing this plugin... ;-)

> Manuel has recently sent a patch for "Workload Partitioning" (WPAR),
> also an AIX virtualization technique. Could you or Manuel enlighten me
> how these things relate to one another? Would it make sense to combine
> both plugins into one?

LPARs (Logical Partitions) and WPARs (Workload Partitions) are indeed
similar as they are both virtualization techniques. Please note that I'm
not familiar with WPARs, so I may be somewhat mistaken in my comparison.

I suppose WPARs are best described as the AIX equivalent of Solaris
zones. They provide a virtual AIX environment within a host AIX system.
Each WPAR is given access to a certain amount of memory, CPU power, and
a subset of the host's filesystem. WPARs are isolated from each other
using techniques like process isolation or some kind of chroot. Resource
allocation is under the control of the host AIX.

On the other hand, LPARs are a hardware solution. A hypervisor logically
assembles CPU units, memory, network interfaces, PCI buses, and SCSI or
Fibre Channel adapters into multiple logical machines. Resource
allocation and isolation are done at the hardware level.

You can have a look here for a more complete comparison:
http://santosh-aix.blogspot.com/2007/12/comparing-wpars-with-lpars.html

One can imagine WPARs running inside an LPAR (and some sysadmins
probably do run them that way, though I don't).
IMHO these concepts are different enough to warrant separate plugins.


As an example, imagine an IBM Power system with 16 CPUs. Using the
hypervisor, we create an LPAR which is entitled to 0.2 processor
capacity and 2 virtual processors. The AIX instance running on this LPAR
will see 2 CPUs, each with the equivalent power of 0.1 physical CPU.

Under heavy load, the standard cpu plugin will show the 2 CPUs as 100%
busy. The lpar plugin will peak at 0.2 physical CPU.

More interesting is the case where the CPUs are mostly idle. The standard
cpu plugin shows the 2 CPUs as 50% idle. In fact the hypervisor steals
most of the idle CPU cycles and gives them to other, CPU-hungry LPARs.
The lpar plugin will clearly show this: 0.1 physical CPU used
(user+sys+wait), maybe 0.02 idle, and an empty gap up to the 0.2
entitlement.
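
To make the arithmetic explicit, here is a rough sketch of how the
physical CPU figures can be derived from the perfstat counters (field
names are those of perfstat_partition_total_t; the numbers in the
comments are the ones from the example above, not real measurements):

    #include <libperfstat.h>

    /* Sketch only: derive physical CPU usage between two perfstat samples.
     * puser/psys/pwait/pidle and timebase_last are all expressed in the
     * same hardware ticks, so their ratios are dimensionless and directly
     * give physical CPUs. */
    static void physical_cpu_usage (const perfstat_partition_total_t *prev,
                                    const perfstat_partition_total_t *cur,
                                    double *busy, double *idle, double *entitled)
    {
        u_longlong_t dlt_busy = (cur->puser + cur->psys + cur->pwait)
                              - (prev->puser + prev->psys + prev->pwait);
        u_longlong_t dlt_idle = cur->pidle - prev->pidle;
        u_longlong_t dlt_tb   = cur->timebase_last - prev->timebase_last;

        *busy     = (double)dlt_busy / (double)dlt_tb;            /* ~0.10 here */
        *idle     = (double)dlt_idle / (double)dlt_tb;            /* ~0.02 here */
        *entitled = (double)cur->entitled_proc_capacity / 100.0;  /*  0.20 here */
    }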

> 
> I have a couple of questions / comments regarding the data being
> collected, too:
> 
>   * You are calculating the time difference yourself and calculating a
>     rate from that. I'd prefer to use a DERIVE or COUNTER data source
>     type for this kind of data rather than converting the counters to a
>     "gauge" in the plugin.

Well, I find using a counter/derive more elegant myself; I just fail to
see how it can work here. The original counters are expressed as
'processor time spent in xxx mode', where time is measured not in
seconds but in custom, CPU-clock-dependent 'ticks'.

This hardware dependence disappears if we calculate the rate within the
plugin, as time_base is in ticks too.

If I'm not mistaken, with the raw counters the graphs would show CPU
usage scaled by a factor of 'ticks_per_second', which we cannot
compensate for as this value isn't known outside the host running the
plugin.

I can imagine applying this factor to the raw counter before returning
it, but this may have funny effects when the original counter
overflows/wraps while the scaled value is still far away from the
32/64-bit limit.

However, I have no deep knowledge of how rrdtool really handles all
this, and will welcome your advice...
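
For the record, here is roughly what dispatching the raw tick counter as
a DERIVE would look like (a sketch only; it assumes a matching "derive"
entry in types.db, and the resulting graphs would then be in ticks per
second rather than physical CPUs):

    #include "collectd.h"
    #include "common.h"
    #include "plugin.h"

    /* Sketch: hand the raw tick counter to collectd as a DERIVE data source
     * and let rrdtool compute the rate.  That rate would be in ticks/second,
     * i.e. scaled by the host-specific timebase frequency. */
    static void lpar_submit_derive (const char *type_instance, derive_t ticks)
    {
        value_t value;
        value_list_t vl = VALUE_LIST_INIT;

        value.derive = ticks;

        vl.values = &value;
        vl.values_len = 1;
        sstrncpy (vl.host, hostname_g, sizeof (vl.host));
        sstrncpy (vl.plugin, "lpar", sizeof (vl.plugin));
        sstrncpy (vl.type, "derive", sizeof (vl.type));
        sstrncpy (vl.type_instance, type_instance, sizeof (vl.type_instance));

        plugin_dispatch_values (&vl);
    }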

> 
>   * What's the deal with the minimum, entitled, maximum "proc capacity"?
>     Is that something that actually does change often? It sounds more
>     like a static configuration thing. Why do you divide that number by
>     100? Is that some magical number required here?

As shown in the example above, entitlement is the processor capacity
each LPAR is allocated by the hypervisor for its use. Once set, it does
not change on its own, but the admin can dynamically adjust it to meet
workload changes. I find it useful to have both CPU usage and
entitlement on the graphs: this makes it possible to tell at a glance
whether the CPU resources are sufficient or overkill at a given time.

Minimum and maximum proc capacity are the lower and upper values between
which entitlement can be freely adjusted. These are really static
values, as changing them requires a reboot of the LPAR. I agree they
should probably be removed from the plugin.

These values are expressed in processor units, which are 1/100th of a
physical processor, hence the division.
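
Concretely (just an illustration, using the field names from
perfstat_partition_total_t):

    /* Processor units are hundredths of a physical processor, so e.g. an
     * entitled_proc_capacity of 20 means 0.20 physical CPUs. */
    double entitled = (double)lparstats.entitled_proc_capacity / 100.0;
    double minimum  = (double)lparstats.min_proc_capacity      / 100.0;
    double maximum  = (double)lparstats.max_proc_capacity      / 100.0;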


> * Why do you use the chassis serial number as plugin instance? I'd
>     expect that this information would be either assigned to the host
>     name or that the partition's ID ("lpar_id") would be used as plugin
>     instance. If the partition is moved to another system, the physical
>     ID you're using changes, and this seems to be on purpose. I'd
>     however expect that you'd look for something that *doesn't* change
>     to identify the partition. Something like:
>       hostname         = "lpar_pool-%x", pool_id
>       plugin          = "lpar"
>       plugin_instance = "partition-%x", lpar_id / "global"
>     What am I missing?

You're right, there is a secret purpose to this. ;-)
I would like to graph the total CPU usage of the chassis itself, by
adding up the individual metrics of each participating LPAR. But there
is no static list of LPARs, since they can be moved across chassis. So I
need to know which RRDs are to be considered when graphing a given
chassis.

Another possibility would be a config option like "ReportByChassis =
true", which would tell the plugin to use the chassis' serial number
instead of the hostname. The rrd layout would then not be:

lpar_hostname/lpar-chassis_serial/cpu.rrd

but:

chassis_serial/lpar-lpar_hostname/cpu.rrd
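
If that route were taken, wiring up such an option could look roughly
like this (a sketch of the idea only, not part of the submitted patch;
it assumes the usual collectd.h/common.h/plugin.h includes):

    static int report_by_chassis = 0;

    static const char *config_keys[] = { "ReportByChassis" };
    static int config_keys_num = 1;

    /* Simple collectd config callback: accept "true"/"false" for the
     * hypothetical ReportByChassis option. */
    static int lpar_config (const char *key, const char *value)
    {
        if (strcasecmp ("ReportByChassis", key) == 0)
        {
            report_by_chassis = (strcasecmp ("true", value) == 0);
            return (0);
        }
        return (-1);
    }

    void module_register (void)
    {
        plugin_register_config ("lpar", lpar_config,
                                config_keys, config_keys_num);
        /* ... plus the read callback registration already in the patch ... */
    }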

> 
>   * Why not use the name included in the struct,
>     (perfstat_partition_total_t *)->name?

This is the name of the LPAR as it is known by the hypervisor, i.e.
basically a label for a group of hardware resources. It does not
necessarily match the hostname of the AIX instance running on it,
although most sysadmins prefer it when that is the case, for
consistency's sake.

> 
>   * What's this code doing?:
> > +			dlt_pit = lparstats.pool_idle_time - last_pool_idle_time;
> > +			total = (double)lparstats.phys_cpus_pool;
> > +			idle  = (double)dlt_pit / XINTFRAC / (double)delta_time_base;
>     Why don't you use the "pool_busy_time" member?

The code sample at
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.prftools/doc/prftools/prftools07.htm
uses lparstats.pool_idle_time. I chose to stick with it and calculate
busy as total - idle, rather than trying to convert nanoseconds to
physical CPUs on my own, given the (crystal clear) API definitions:

  pool_idle_time: Number of clock tics a processor in the shared pool was idle.
  pool_busy_time: Summation of busy (non-idle) time accumulated across all partitions in the pool (nano seconds).

After all, who am I to argue with IBM coders?
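
For reference, with the sample's approach the busy figure simply falls
out as the complement (sketch only; variable names are those of the
quoted code):

    dlt_pit = lparstats.pool_idle_time - last_pool_idle_time;
    total   = (double)lparstats.phys_cpus_pool;                     /* CPUs in the shared pool */
    idle    = (double)dlt_pit / XINTFRAC / (double)delta_time_base; /* idle, in physical CPUs  */
    busy    = total - idle;                                         /* busy, in physical CPUs  */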



Regards,

Aurélien Reynaud





