
CPU Monitoring

Introduction

As of version 3.4.2, collectl can detect dynamic changes to a CPU's state, in other words a CPU going offline or coming back online. When one or more CPUs is found to be offline, collectl will include a message in the output header to indicate this. Furthermore, when displaying CPU numbers in headers, the number of any offline CPU will be changed to an X, as in the following:
[root@node02 mjs]# ./collectl.pl -scj
waiting for 1 second sample...
# *** One or more CPUs disabled ***
#<--------CPU--------><-----------------Int------------------>
#cpu sys inter  ctxsw Cpu0 Cpu1 Cpu2 CpuX Cpu4 Cpu5 Cpu6 Cpu7
   0   0  1051     49 1000   17    0    0    4    0    0   29
As the state changes, the headers will change accordingly. If a state change occurs between headers, it won't be seen until the next headers are displayed. When displaying detail data, however, one CAN tell the state has changed: if looking only at CPU data, ALL percentages for an offline CPU will display as zeros, and if looking at interrupts, that CPU's number will be changed to an X in the header (see Restrictions below).
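One way to see this for yourself on a test system is to force a CPU offline through sysfs and watch the per-CPU detail data. This is only an illustration, not something collectl requires; the cpu3 path below is an arbitrary choice, and it needs root as well as a kernel with CPU hotplug support:

echo 0 > /sys/devices/system/cpu/cpu3/online     # take cpu3 offline
collectl -sC -i1                                 # cpu3's percentages should now display as zeros
echo 1 > /sys/devices/system/cpu/cpu3/online     # bring cpu3 back online when done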

When logging to a file, if any CPUs are found to be offline when collectl starts, that count will be written to the file header in the field CPUsDis and a new flag, D, will be added to the Flags field. One will still see the same effects of a CPU state change in the output during playback.
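As a quick sanity check, one can look for these fields in the header lines at the top of the recorded file. This assumes the default gzipped raw file; the filename below is only a placeholder:

zcat node02-20110915-000000.raw.gz | head -25 | grep -iE 'Flags|CPUsDis'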

Restrictions
If a CPU goes offline after collectl has started and one is logging to disk, it will not be noted in the file header.

When monitoring process data, the output header will indicate if a CPU was found to be offline at the time collectl started as well as during processing. However, if the state changes and you're not explicitly displaying CPU data, there will be no indication of the dynamic CPU state change.

If you are only monitoring interrupt data and there is a state change, things will get very messy. Since users typically monitor interrupts and CPU data at the same time, it is not felt to be worth the extra effort or processing overhead to try to accommodate this rare case.

Caution: Large NUMA systems and the performance impact with kernels >= 2.6.32

Systems with higher CPU and NUMA node counts will now see additional monitoring overhead, which can be as much as a factor of 50 or more, depending on the actual hardware configuration! The specific reason is that reading /proc/stat has become more expensive on these kernels.
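Since collectl is a perl script, a rough way to gauge this cost on a given system, independent of collectl itself, is to time a large number of /proc/stat reads from perl; the loop count here is arbitrary:

time perl -e 'for (1..10000) { open F, "<", "/proc/stat" or die; my @lines = <F>; close F }'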

In this first example, run on a 4 core/2 node system running rhel5.3, we measure the system overhead for monitoring cpu stats at about 3 seconds/day, using the standard way to measure collectl overhead, namely collecting a full day's worth of samples as fast as possible (8640 samples corresponds to one day at the standard 10-second interval):

time collectl -sc -i0 -c 8640 -f/tmp
real    0m2.879s
user    0m1.908s
sys     0m0.913s
In this next example we're measuring 1/10th the number of samples (I get impatient) on a 48 core/8 node system running fedora 14, which is a 2.6.35 kernel. This alone takes over 5 times as long as the previous example which, normalized to a full day's worth of samples, would be over 50 times the load:
time collectl -sc -i0 -c 864 -f/tmp
real    0m16.783s
user    0m3.003s
sys     0m13.523s
This overhead will also be felt when monitoring memory, but by a much smaller factor if only using -sm. The -sM switch suffers a similar fate to the cpu stats, though more in the range of a factor of 10.
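To see how large that factor is on a particular system, the same timing technique shown above can be applied to the memory switches; the interval, sample count, and output directory are simply the ones used in the earlier examples:

time collectl -sm -i0 -c 864 -f/tmp
time collectl -sM -i0 -c 864 -f/tmp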
updated Sept 15, 2011