| UPCOMING  WORKSHOPS Online: • AVX Vectorization Workshop: Nov 24th – 25th, 9 AM – 5 PM CET, info [at] johnnysswlab [dot] com (Register, Register for AVX Workshop, Hi! I would like to register for AVX Vectorization Workshop taking place on Nov 24th and 25th, 9 AM – 5 PM CET ), More info… • NEON Vectorization Workshop: TBD, info [at] johnnysswlab [dot] com (Express interest, Expressing Interest for NEON Workshop, Hi! I am interested in NEON Vectorization Workshop. Please inform me about the dates once this information is known.) , More info… | 
Developers want flexible code, right? Ideally, we want to write our code once and reuse it many times. Flexibility comes from parametrization: we provide parameters to our code, and depending on the provided parameters, the same code behaves a bit differently.
Flexibility, however, comes with a performance tag. If you want your program to be both flexible and fast, you will need to pay attention to a few details to get it right. In this post we give two examples of how flexibility doesn’t necessarily have to sacrifice good performance.
qsort vs std::sort
Consider the qsort function from the C library. The signature of this function is like this:
qsort(void *base, size_t nitems, size_t size, int (*compar)(const void *, const void*))
To allow for flexibility, the function takes a pointer to the array base, number of elements in the array nitems, size of each element size and a pointer to the comparison function compar. This function is very flexible, it works with integers, floats, characters or custom structs, as long as we provide good parameters to it.
From the performance point of view, however, it doesn’t look that great. The comparison function will be called many times, and typically it is a very simple function. Compiler cannot inline function pointers and the price of the call will typically outweigh the price of the comparison. The size of an element in the array are known only at runtime, so the compiler cannot do any optimizations to make the calculation of the addresses faster.
C++  takes a different approach. The C++ counterpart of qsort is std::sort, but std::sort allows parameters to be specified as template parameters. Here is the signature of std::sort:
template< class RandomIt, class Compare > void sort( RandomIt first, RandomIt last, Compare comp );
Function std::sort has two template parameters: RandomIt and Compare. Depending on the provided template parameters, the compiler will generate an implementation of std::sort for integers, another implementation for floats, and a third implementation that sorts integers in descending order.
The template version is faster. Because of the templates, the compiler can inline the Compare function. It can also generate efficient code to calculate element addresses. And all these optimizations do wonders. In our measurement, std::sort took 1990 ms on an array of ten million integers compared to qsort which took 3190 ms.
If you are developing in a constrained environment, where code size matters, you will want to use qsort. If performance is important, you will want to use std::sort. The template mechanism allows a very efficient way to move computations from runtime to compile-time and this really impacts the sorting performance.
Move unnecessary computations outside of the loop
Consider a following example (link to the full source is available here):
void calculate_array(float a[], float b[], int n, bool set_b) {
    float r1, r2;
    for (int i = 0; i < n; i++) {
        calculate(i, &r1, &r2);
        if (operation == ADD) {
            a[i] += r1;
            if (set_b) {
                b[i] += r2;
            }
        } else if (operation == SUBSTRACT) {
            a[i] -= r1;
            if (set_b) {
                b[i] -= r2;
            }
        } else { ... }
    }
}This loop is written with flexibility in mind. There are two parameters that determine the flow of the loop. The first one is operation and the second one is set_b. Under the assumption that function calculate doesn’t change the value of operation or set_b, then each iteration is evaluating the conditions with operation and set_b, although they never change.
If the compiler manages to inline function calculate, it might figure out that operation and set_b are loop invariant, i.e. their values never change inside the loop. It can then decide to perform an optimization called loop unswitching. It can move the invariant conditions outside of the loop and generate a separate loop for each of the moved conditions. Like this:
if (operation == ADD) {
    if (set_b) {
        for (int i = 0; i < n; i++) {
            calculate(i, &r1, &r2);
            a[i] += r1;
            b[i] += r2;
        }
    } else {
        for (int i = 0; i < n; i++) {
            calculate(i, &r1, nullptr);
            a[i] += r1;
        }
    }
} else { ... }The loops are now free of conditional statements.1. Typically, the compilers will often do this kind of optimizations if they are able to inline all the functions from the loop body and the code size doesn’t grow too much. But if our function calculate is in another compilation unit, the compiler will not be able to make an assumption that operation never changes and cannot move that evaluation outside of the loop.
In our example of 10 million floats, when we made the function calculate non-inlinable, the loop took 76 ms to execute, with inlining it took 50 ms, and when we explicitly moved out all the loop invariant conditions the loop took 14 ms.2. In our example, the speedup is substantial3!
The downside is code bloat: originally we had one loop, now we have six of them. That’s why compilers are shy when it comes to this kind of transformation. They don’t know the importance of a loop and doing this kind of transformation has the potential to increase the code size several times. I haven’t checked, but maybe the profile-guided optimizations might help the compiler to identify the loops where this transformation can be useful.
Another problem is that sometimes, due to code complexity, compilers can not rule out that the condition inside the loop doesn’t change. The optimization gets skipped, and your possibly critical loop runs slower than it needs to be.
Templates to the rescue
In C++, you can force loop unswitching with templates. Instead of passing arguments operation and set_b as regular function parameters, pass them as template parameters:
template <operation_e operation, bool set_b>
void calculate_array(float a[], float b[], int n) { 
    float r1, r2;
    for (int i = 0; i < n; i++) {
        calculate(i, &r1, &r2);
        if (operation == ADD) {
            a[i] += r1;
            if (set_b) { b[i] += r2; }
        } else if (operation == SUBSTRACT) {
            a[i] -= r1;
            if (set_b) { b[i] -= r2; }
        } else /* if (operation == REPLACE) */ {
            a[i] = r1;
            if (set_b) { b[i] = r2; }
        }
    }
}The source code can remain the same. If we provide ADD for operation and true for set_b, the compiler will remove all dead code and will generate a function like this:
template <>
void calculate_array<ADD, true>(float a[], float b[], int n) { 
    for (int i = 0; i < n; i++) {
        calculate(i, &r1, &r2);
        a[i] += r1;
        b[i] += r2;
     }
}The dead code is removed, the loop invariant branches are gone. The code base is maintanable and the performance is as good as it gets.
Edge cases
Sometimes you cannot use templates to perform loop unswitching. Consider the following loop:
void calculate_array(float a[], float b[], int n, int m) {
    for (int i = 0; i < n; i++) {
        calculate(i, &r1, &r2);
        a[i] = r1;
        if (m) b[i + m] = r2;
    }
}The parameter m is loop invariant, but we cannot pass it is a template parameter since each different version of m would instantiate another function. In that case, we can do the loop unswitching manually:
void calculate_array(float a[], float b[], int n, int m) {
    if (m) {
        for (int i = 0; i < n; i++) {
            calculate(i, &r1, &r2);
            a[i] = r1;
            b[i + m] = r2;
        }
    } else {
        for (int i = 0; i < n; i++) {
            calculate(i, &r1, nullptr);
            a[i] = r1;
        }
    }
}Conclusion
Normally you don’t want to do loop unswitching by hand, as it is often best to leave it to the compiler. But for performance-critical loops, you should definitely inspect the source code for loop invariant conditions, and move them outside of the loop. Compilers often fail to do that for us, especially in the case of loops with complex bodies.
Templates are ideal for this purpose, but you can also use macros if you are developing in C.
Generally, you don’t want loops with complex bodies on the hot path, you want tight loops. The tighter the loop, the easier it is for the compiler to optimize and for the hardware to execute. So moving everything unneeded outside of the loop body can, for example, enable auto-vectorization. Vectorized code is much faster than its non-vectorized counterpart. Tight loops also better fit instruction caches and are more hardware friendly.
- Some might argue that conditions that always evaluate to the same value are cheap, which is true, but the compiler has it much more difficult to generate efficient code with conditions in it. [↩]
- Command line for the testing: ./loop_unswitching -o add -b 1 -m add[↩]
- As always with performance, in this example the speed up is substantial, but that doesn’t necessarily have to be in your case. Therefore, always measure! [↩]

