Hardware performance counters the easy way: quickstart likwid-perfctr

For a performance engineer, just measuring function runtime is not enough. Time is the most useful statistic for finding which function is the bottleneck, but it is mostly useless when it comes to understanding why that function is slow.

That is where hardware performance counters come into play. Hardware performance counters are special counters in the CPU that measure all sorts of things, e.g. the number of data cache misses, instruction cache misses, retired instructions, cycles, branch mispredictions, etc. Using them effectively is the key to understanding why a segment of code is slow.

Unfortunately, hardware performance counters are not easy to use. Reading raw counters doesn’t help much unless they are paired with other counters. Also, not all CPUs support all counters. Luckily, this is where LIKWID comes into play.

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

Welcome to LIKWID

LIKWID is an open-source performance monitoring and benchmarking suite for Linux that abstracts some of the differences between different manufacturers. LIKWID consists of many tools, but in this post we focus on likwid-perfctr, a tool used to read hardware performance counters. It supports many CPU types: Intel, AMD, ARM and IBM. The full list of supported CPUs is available here.

LIKWID is easy to install from the Linux repositories. If your CPU happens to be unsupported, send a support request on the LIKWID GitHub, like I did here. In my case, it took the maintainers one day to add support for my CPU.

After the installation, a tool called likwid-perfctr should be available on your system. This is the tool we will use to get easily readable information from the hardware performance counters.

If you are interested in the hardware performance counters for the whole program, the process is simple. Here is an example of the command you can use:

$ likwid-perfctr -C 0 -g MEM git status
--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
CPU type:	Intel Kabylake processor
CPU clock:	2.11 GHz
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
Group 1: MEM
+-----------------------+---------+------------+
|         Event         | Counter | HWThread 0 |
+-----------------------+---------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  |   16678566 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  |   13957793 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  |   35989800 |
|       DRAM_READS      | MBOX0C1 |     663153 |
|      DRAM_WRITES      | MBOX0C2 |     122973 |
+-----------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.0247 |
|        Runtime unhalted [s]       |     0.0066 |
|            Clock [MHz]            |   818.1673 |
|                CPI                |     0.8369 |
|  Memory load bandwidth [MBytes/s] |  1721.2553 |
|  Memory load data volume [GBytes] |     0.0424 |
| Memory evict bandwidth [MBytes/s] |   319.1842 |
| Memory evict data volume [GBytes] |     0.0079 |
|    Memory bandwidth [MBytes/s]    |  2040.4395 |
|    Memory data volume [GBytes]    |     0.0503 |
+-----------------------------------+------------+

We ran the git status command through likwid-perfctr and read the counter group related to memory, selected through -g MEM. In order for LIKWID to work, the program needs to be “pinned” to a CPU core. This is done through the option -C 0, which pins git to core zero.

From the output produced by likwid-perfctr, we can see that our program moved data to and from memory at a combined rate of 2040 MB/s and transferred a total of about 50 MB.

LIKWID hardware counter groups

The counters in likwid-perfctr are organized into performance groups. In our previous example, we read the performance group MEM. You can see the list of all performance groups with likwid-perfctr -a:

$ likwid-perfctr -a
    Group name	Description
--------------------------------------------------------------------------------
   FALSE_SHARE	False sharing
            L2	L2 cache bandwidth in MBytes/s
       L3CACHE	L3 cache miss rate/ratio
      TLB_DATA	L2 data TLB miss rate/ratio
          UOPS	UOPs execution info
     TLB_INSTR	L1 Instruction TLB miss rate/ratio
CYCLE_ACTIVITY	Cycle Activities
     FLOPS_AVX	Packed AVX MFLOP/s
           TMA	Top down cycle allocation
           MEM	L3 cache bandwidth in MBytes/s
      FLOPS_DP	Double Precision MFLOP/s
          DATA	Load to store ratio
        ICACHE	Instruction cache miss rate/ratio
         CLOCK	Power and Energy consumption
  CYCLE_STALLS	Cycle Activities (Stalls)
        ENERGY	Power and Energy consumption
       L2CACHE	L2 cache miss rate/ratio
      FLOPS_SP	Single Precision MFLOP/s
     UOPS_EXEC	UOPs execution
            L3	L3 cache bandwidth in MBytes/s
      RECOVERY	Recovery duration
   UOPS_RETIRE	UOPs retirement
        MEM_SP	L3 cache bandwidth in MBytes/s
        DIVIDE	Divide unit information
        BRANCH	Branch prediction miss rate/ratio
        MEM_DP	Overview of arithmetic and main memory performance
    UOPS_ISSUE	UOPs issueing

The names of the groups are mostly self-explanatory. The available groups differ between CPU models, and as far as I can tell, Intel is much better supported than AMD.

Marker API

The information you can collect as described previously is fine, but you can collect similar information with perf stat or Intel’s VTune. However, one distinguishing feature of LIKWID is its Marker API. It allows you to take the same measurements as already described, but on a code segment instead of the whole program.

Here is the short example used to demonstrate the Marker API:

#define LIKWID_PERFMON
#include <likwid.h>

#include <vector>

float sum(std::vector<float>& arr, int repeat_count) {
    float result = 0.0f;
    for (int k = 0; k < repeat_count; k++) {
        LIKWID_MARKER_START("Compute");
        for (size_t i = 0; i < arr.size(); i++) {
            result += arr[i];
        }
        LIKWID_MARKER_STOP("Compute");
    }
    return result;
}

int main(int argc, char** argv) {
    ...
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_THREADINIT;

    float res = sum(test_array, 16);
    ...

    LIKWID_MARKER_CLOSE;
}

We need to include the header likwid.h, but before including it, we need to define a macro called LIKWID_PERFMON. Without it, the calls to LIKWID functions resolve to empty macros and nothing is measured. We can define the macro in the source, as in this example, or by passing the option -DLIKWID_PERFMON to the compiler.

Next, we need to initialize LIKWID in the main function with LIKWID_MARKER_INIT (and LIKWID_MARKER_THREADINIT). Before the program exits, we need to clean up the resources with LIKWID_MARKER_CLOSE.

We surround the region of code we want to measure with the LIKWID_MARKER_START and LIKWID_MARKER_STOP markers. Every time our program reaches LIKWID_MARKER_START, it reads the state of the hardware counters; when it reaches LIKWID_MARKER_STOP, it reads the state again. The differences between the start and stop values are the numbers we are interested in.

You need to provide a region name as the parameter of both LIKWID_MARKER_START and LIKWID_MARKER_STOP. The name must not contain spaces and is later used in the report. In our case, the name is Compute.

When you link your program, you need to pass -llikwid to the linker to resolve the LIKWID symbols, for example: g++ -O2 -DLIKWID_PERFMON likwid-example.cpp -o likwid-example -llikwid.


Running your program with Marker API

Running your program with the Marker API is very similar to running it without. The difference is that you need to pass the -m switch on the command line to collect the data. Here is the command line and the output of our program:

$ likwid-perfctr -m -C 0 -g MEM ./likwid-example
--------------------------------------------------------------------------------
CPU name:	Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
CPU type:	Intel Kabylake processor
CPU clock:	2.11 GHz
--------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------
Region Compute, Group 1: MEM
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   0.563254 |
|     call count    |         16 |
+-------------------+------------+

+-----------------------+---------+------------+
|         Event         | Counter | HWThread 0 |
+-----------------------+---------+------------+
|   INSTR_RETIRED_ANY   |  FIXC0  | 1275133000 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  | 1460581000 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  | 1173268000 |
|       DRAM_READS      | MBOX0C1 |  141401800 |
|      DRAM_WRITES      | MBOX0C2 |    1036858 |
+-----------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.5633 |
|        Runtime unhalted [s]       |     0.6914 |
|            Clock [MHz]            |  2629.6957 |
|                CPI                |     1.1454 |
|  Memory load bandwidth [MBytes/s] | 16066.8573 |
|  Memory load data volume [GBytes] |     9.0497 |
| Memory evict bandwidth [MBytes/s] |   117.8136 |
| Memory evict data volume [GBytes] |     0.0664 |
|    Memory bandwidth [MBytes/s]    | 16184.6708 |
|    Memory data volume [GBytes]    |     9.1161 |
+-----------------------------------+------------+

At the top of the region output we see the region name Compute, as given to LIKWID_MARKER_START. We can also see the total runtime of our region (RDTSC Runtime [s]: 0.563254) and the number of invocations (call count: 16). The number of invocations corresponds to the number of times we entered the region.

As far as the metrics are concerned, our program transferred about 9.1 GB of data from memory at a speed of 16.1 GB/s. Let’s check whether the numbers match. The size of our test array is 128 million elements, and each element is 4 bytes in size. This means each compute pass transfers 512 MB of data. With 16 iterations, that adds up to 8 GB; we measured 9.1 GB, which is reasonably close.

A word of caution

There are a few things I think are important to know when using LIKWID.

LIKWID has a certain overhead. It is typically small, but if you enter and exit regions many times, it accumulates and can become large enough to skew the measurement results. So don’t use LIKWID to measure really short sequences of code. In our example, placing the markers inside the repeat loop enters the region 16 times; placing them around the whole loop would enter it only once, reducing the marker overhead at the cost of per-iteration detail.

The second warning is not related to LIKWID, but to hardware performance counters in general. The counters vary a lot between different CPU types, and they can sometimes show misleading numbers. So take all information produced by LIKWID with a grain of salt. My recommendation is not to rely on absolute values, as we did here when we measured memory data volume. Instead, measure the original version, make a small modification, measure the modified version, and compare the results. This makes more sense.

Final Words

LIKWID, through its tool likwid-perfctr, is a really nice and simple way to get information from the hardware performance counters. In this post we covered the most basic scenario, which should help you get started quickly. We didn’t cover measurements on multithreaded applications here; for this and other advanced topics, we refer you to the LIKWID documentation.

LIKWID comes with other tools as well. You can find a very comprehensive introduction to LIKWID in a blog post written by Pramod Kumbhar. The LIKWID documentation also contains much useful information.

