As of Version 3.4.2, collectl can detect dynamic changes to a CPU's state, in other words a CPU
going offline or coming back online.
When one or more CPUs is found to be offline, collectl will include a message in
an output header to indicate this. Furthermore, when displaying CPU numbers in headers, the
numbers of any offline CPUs will be changed to Xs to indicate this has occurred, such as in the
following:
[root@node02 mjs]# ./collectl.pl -scj
waiting for 1 second sample...
# *** One or more CPUs disabled ***
#<--------CPU--------><-----------------Int------------------>
#cpu sys inter ctxsw Cpu0 Cpu1 Cpu2 CpuX Cpu4 Cpu5 Cpu6 Cpu7
0 0 1051 49 1000 17 0 0 4 0 0 29
As the state changes, the headers will change accordingly. If there is a state change between
headers, it won't be seen until the next headers are displayed. If displaying detail data,
one CAN tell the state has changed: when looking at only CPU data, ALL percentages for
a CPU that is offline will display as zeros, and when looking at interrupts, the CPU number will be
changed to an X in the header (see Restrictions below).
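To observe this behavior on a test system, a CPU can be taken offline and brought back through the
kernel's standard sysfs interface. Note this is a kernel facility, not a collectl feature; it requires
root, cpu3 below is just an example, and cpu0 usually cannot be offlined:
cat /sys/devices/system/cpu/cpu3/online        # show current state: 1=online, 0=offline
echo 0 > /sys/devices/system/cpu/cpu3/online   # take cpu3 offline
echo 1 > /sys/devices/system/cpu/cpu3/online   # bring cpu3 back online
Running ./collectl.pl -scj in another window while toggling the state will show the headers change
at the next header interval.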
When logging to a file, if any CPUs are found to be offline when collectl starts, that number
will be written to the file header in the field CPUsDis, and a new flag, D, will
be added to the Flags field. However, even though the header only reflects the state at startup,
one will still see the same effects of a CPU state change in the output during playback.
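Since collectl's raw files are gzipped text with a plain-text header, one quick way to check for
these fields is simply to look at the top of the file. The file name below is only a made-up example
and the exact header field names can vary by version:
zcat /tmp/node02-20110101-000000.raw.gz | head -20 | grep -iE 'flag|cpu'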
Restrictions
If a CPU goes offline after collectl has started and one is logging to disk, it will not
be noted in the file header.
When monitoring process data, the output header will indicate if a CPU was found to be offline at the
time collectl started as well as if one goes offline while collectl is running. However, if the state
changes and you're not explicitly displaying CPU data, there will be no other indication of dynamic
CPU state changes reported.
If you are only monitoring interrupt data and there is a state change, things will get very messy.
Since users typically monitor interrupts and CPU data at the same time, it is not felt to be worth
the extra effort or processing overhead to try to accommodate this rare case.
Caution: Large NUMA systems and the performance impact with kernels >= 2.6.32
Systems with higher CPU and NUMA node counts will now see additional monitoring
overhead, which can be as much as a factor of 50 or more, depending on the actual hardware
configuration! The specific reason is that it is now much more expensive to read /proc/stat.
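One way to see this effect directly, independent of collectl, is to time a batch of reads of
/proc/stat on both a small and a large system; the count of 1000 below is arbitrary:
time for i in $(seq 1 1000); do cat /proc/stat > /dev/null; done
The difference between the two classes of systems should mirror the collectl numbers below.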
In this first example, run on a 4 core/2 node system running RHEL 5.3, we measure the
system overhead for monitoring CPU stats at about 3 seconds/day, using the standard
way to measure collectl overhead:
time collectl -sc -i0 -c 8640 -f/tmp
real 0m2.879s
user 0m1.908s
sys 0m0.913s
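For reference, 8640 samples correspond to a full day of 10-second samples (8640 x 10 = 86400 seconds),
which is where the "about 3 seconds/day" figure comes from; -i0 simply takes those samples back to
back instead of waiting a day for the answer.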
In this example we're measuring 1/10th the number of samples (I get impatient) on a 48 core/8 node
system running Fedora 14, which uses a 2.6.35 kernel. This alone is over 5 times the overhead
of the previous example and, normalized to a full day, would be over 50 times the load:
time collectl -sc -i0 -c 864 -f/tmp
real 0m16.783s
user 0m3.003s
sys 0m13.523s
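Doing the same normalization here: 864 samples took about 16.8 seconds of real time, so a full day's
8640 samples would be roughly 168 seconds, compared to roughly 3 seconds on the smaller system, i.e.
more than 50 times the load.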
This overhead will also be felt when monitoring memory, but by a
much smaller factor if only doing -sm. The -sM switch will suffer a similar fate to CPU
stats, though more in the range of a factor of 10.
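Since -sM reports per-memory-node (NUMA) detail, a similar back-of-the-envelope check is to time
reads of the per-node meminfo files, assuming your kernel exposes them and keeping in mind this is
only an approximation of what collectl actually reads:
time for i in $(seq 1 1000); do cat /sys/devices/system/node/node*/meminfo > /dev/null; done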