Crash course introduction to parallelism: SIMD Parallelism

In the previous post we’ve covered parallel algorithms, and we talked about what kind of problems does parallelism efficiently solve. This time we will talk about what hardware resources developers have at their disposal to achieve parallelism, what are the benefits and limitations of each of them. Today on the menu is SIMD.

SIMD (single instruction multiple data) parallelism

SIMD parallelism is nothing new and all desktop CPUs and many embedded CPUs have it nowadays. The CPUs implement SIMD parallelism through SIMD instructions. Compared to regular instructions, a SIMD instructions (short for Single Instruction Multiple Data) processes more than one piece of data. A standard, scalar load to a register will load a single value, instruction add will add a single value from one register to another, etc. On the other hand, a SIMD load will load several values to a special SIMD register; a vector add instruction will add together several values, etc.

To illustrate this, consider the following scalar loop:

for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i];

In this loop we iterate over three arrays and adding the value of the arrays B and C to array A.

If we were to parallelize this code with SIMD, the code would translate to the following version:

for (int i = 0; i < N; i+=4) {
    vec4 B_tmp = load4(B + i); // Loads 4 integers into a SIMD register B_tmp
    vec4 C_tmp = load4(C + i);
    vec4 A_tmp = add4(B_tmp, C_tmp); // Adds together vectors of four integers
    store4(A_tmp, A + i); // Stores four integers

The CPU has a load instruction that can load four (or eight, or sixteen) consecutive values at once to a special SIMD register. Then it performs add on two SIMD registers which actually runs four additions in parallel. In the end, storing four results is also done in parallel. Note that each of these operations correspond roughly to one instruction.

Everything in this example is very performance efficient; loading, adding, and storing four values happens at the same time it would take for a regular scalar version operating on one value to complete.

Visualization of SIMD (single instruction multiple data) and SISD (single instruction single data) instructions. In the case of SIMD, one instruction works on a vector of two values compared to SISD, where one instructions works with only one value

Typically, the CPU has a fixed size SIMD register: 128 bits (SSE), 256 bits (AVX), or 512 bits (AVX512). For example, SSE SIMD on x86 processor can process 2 doubles or 4 ints or 8 shorts or 16 chars. The smaller the type you are processing, the more parallelism can be extracted. In 2020, modern desktop CPUs will typically have SIMD width of 256 bits, high-end server CPUs will have a width of 512 bits. Embedded CPUs mostly have a width of 128 bits.

SIMD Advantages

There are few good things about SIMD. The first is that it is available on almost all CPUs. Only really low-end embedded CPUs will be without SIMD nowdays.

The second is that SIMD is really the cheapest way to perform parallelization. Other parallelization techniques, such as multithreading or offloading to a graphic card come with a “warm-up” cost. When your input is small, it might cost you more to start threads or copy data to the graphic card than to do the actual computation.

And the last, but the most important thing, the compilers can automatically generate SIMD instructions for various loops in your programs in a process called autovectorization, without any intervention on your side. Setting the right compilation flags turns on autovectorization. Note that not all loops will be vectorized, and this depends on many factors: loop carried dependencies, conditional statements in the loop body, data memory layout, etc.

Autovectorization is an important topic for a performance engineer. Unfortunately, the compilers are conservative when it comes to vectorization and they will sometimes miss vectorization opportunities.1 On the upper side, the developer can often help the compiler by restructuring the code so effective vectorization can take place. We will cover autovectorization in one of the upcoming posts.

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

SIMD Drawbacks

SIMD become popular with MMX extensions to Pentium processors at the end of the 90s. The idea was to speed up graphical applications: applications where the program would apply the same transformation to the stream of data. These include: decoding and encoding images and videos, 3D graphics, processing audio, etc. So this is the area where SIMD really shines and where the performance advantages are the greatest.

Of course SIMD can be used for other applications as well, but often it happens it is not worth the improvements and in those cases the compilers switch back to scalar code.

Here are a few examples of SIMD drawbacks: cases where SIMD is impossible to use or inefficient.

Drawback: Management of conditionals

The first drawback is related to how SIMD handles if clauses in your program. Consider the following example:

for (int i = 0; i < N; i++) {
    if (A[i] >= 1) {
        A[i] = A[i] - B[i];

SIMD doesn’t have a concept of branching, but this is often simulated using bit masking. The above loop translates to the following SIMD sequence:

vec4 ones = { 1, 1, 1, 1 }
for (int i = 0; i < N; i+= 4) {
    vec4 A_tmp = load4(A + i);
    vec4 B_tmp = load4(B + i);
    // Calculates the result for each element in the vector, 
    // even those that won't be needing it.
    vec4 res = sub4(A_tmp, B_tmp);
    // Calculates the predicate, i.e. those elements of the
    // vector that will load the new value
    vec4<bool> greater_than_one = operator>=(A_tmp, ones);
    // Conditional load: loads the new value only for those
    // elements where greater_than_one is true.
    A_tmp = load_if_condition(A_tmp, greater_than_one);
    store4(A_tmp, A + i)

As you can see, the SIMD version of the loop translates to calculating A[i] - B[i] both when needed and not, and then filtering out those that are unneeded in load_if_condition.

If you have complicated nested ifs inside your loop body, your program will need to calculate all the operations inside each of ifs body and then filter out the irrelevant result. This makes SIMD much less useful, so the compilers will generally opt for non-SIMD versions of those loops. There are several ways to remove branches we already covered earlier and most of those techniques are applicable with SIMD.

Drawback: can efficiently process only simple types

The second drawback of SIMD is that it can only process simple types such as integers, floats, shorts, etc. So, if your algorithm searches for data in a binary tree, performs hash map lookups, etc, SIMD will be useless in these scenarios. With other parallelization techniques you could run several searches simultaneously, but not with SIMD (or at least, not easily).

Drawback: memory layout is important

If you are C++ developer following an object-oriented paradigm, you predominantly use classes as types. Even though, classes consist of simple types: integers, floats, etc, SIMD is of little use in this kind of environment. Consider the following example:

class student {
    std::string name;
    double average_mark;

double calculate_average_mark(const std::vector<student>& students) {
    double sum = 0.0;
    for (int i = 0; i < students.size(); i++) {
        sum += students[i].average_mark;
    return sum / students.size();

The code consists of a class student with two fields: name and an average_mark. It also consists of a function calculate_average_mark which takes the array of student instances and calculate the average mark of all students.

Let’s assume that the sizeof(std::string) is eight, and size(double) is also eight. In a vector of class student, float values containing the average mark are non-consecutive, i.e. they are located at memory offsets 8, 24, 40, etc. Even though it is possible to use SIMD in this example, most compilers will switch to scalar code because doing SIMD on a non-consecutive memory layout is slow.2

When your program is actually processing data that is important to be processed fast, you can rearrange your data in memory so that it can be processed more efficiently. Game developers have been doing it for years in a paradigm called data-oriented design, that which focuses more on data and operations, and less on abstractions and encapsulations.

The basic idea revolves around using struct-of-arrays instead of array-of-structs. We define a class called students that has two arrays, one for names and one for average marks.

class students {
    std::vector<std::name> names;
    std::vector<double> average_marks;

    double calculate_average_mark() {
        double sum = 0;
        for (int i = 0; i < average_marks.size(); i++) {
            sum += average_marks[i];
        return sum / average_marks.size();

The compiler with the right switches will generate SIMD instructions to calculate the average mark since the doubles containing the average marks are laid out sequentially in memory.

The speed difference? Here it is:

As you can see, changing the memory layout made the program run two times faster. Enabling SIMD made the program run again two times faster, so with SIMD and modified memory layout the program ran four times faster than the original.

In light of these numbers it is easy to understand why game developers are running away from the object-oriented paradigm. The game cannot get released if it renders slow, and this simple example demonstrates how OOP is wasteful to the resources.

Final Words

To effectively utilize SIMD, a few requirements need to be met: a loop with a large enough trip count, optimal memory layout, and simple types. And when you have these things in place, with the right compiler switches, the performance of your program will really go off.

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

Featured image is taken from Erik Engheim’s RISC-V Vector Instructions vs ARM and x86 SIMD with the permission of the author.

  1. Because of this, there are special programming languages where the developer can take explicit control over vectorization. One such example is ISPC. []
  2. This happens because additional instructions called gather instructions needs to be used to collect values from memory to the register, and also because the non-linear memory layout experiences a much larger data cache miss rate []

Leave a Reply

Your email address will not be published. Required fields are marked *