*We at Johny’s Software Lab LLC are experts in performance. If performance is in any way concern in your software project, feel free to contact us.*

We talked about data caches in several post (here, here and here), but I cannot stress enough the importance of undestanding how the data cache works for the speed of your code. In this post we will **talk about performance of matrix multiplication**. There are two reasons why we selected matrix multiplication: the **basic algorithm itself is very simple so it is simple to explain what’s going on in terms of performance**. The second reason is the operation’s omnipresence: it is used in video processing, physic simulation, machine learning, finance and in many other places. **Understanding this simple example can help you write faster code in general, when the proper opportunity arises**.

## Simple matrix multiplication

By definition, the matrix multiplication is a very simple algorithm. In the following examples we give source code for square matrices for simplicity sake, but the proposed techniques can be also applied to rectangular matrices.

Here is the source code of the matrix multiplication algorithm, as it is spelled out in linear algebra books:

for (int i = 0; i < n; i++) { for (int j = 0; j < n; j++) { c[i][j] = 0; for (int k = 0; k < n; k++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } }

This implementation is very readable and elegant. The result for a single cell `c[i][j]`

is obtained like this:

`c[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j] + ... + a[i][n-1] * b[n-1][j]`

**To calculate the result for the cell c[i][j], we multiply each element in the row i of matrix a with the corresponding element in the column j of matrix b. Then we add up all the resulting multiplications**.

However, this naive implementation is not performance friendly. And here is why.

*Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available. Need help with software performance? Contact us!*

## Memory Access Patterns and Performance

As we already talked in the post about parallel algorithms, **a good memory access pattern is important for performanc**e. There are four memory access patterns:

**Constant access pattern**: your program is accessing the same memory location over and over. Compilers can typically replace access to a constant memory location using a register. The fastest kind of memory access, but not very useful.**Sequential access pattern**: you are iterating through an array of a vector of simple types, such as int, double or char. This access pattern is very good for performance.**Strided access pattern**: in this context, a*stride*is the difference between two elements in memory your program is accessing. If your program is accessing every second element of an int array, we say that the stride is 2. This access pattern typically happens in three cases:- Your program is iterating through an array of a vector of simple types, but the difference between two neighboring elements is larger than one.
- Your program is iterating through an array of classes or structs – in this case the size of the struct is the stride.
- Your program is iterating through a matrix column-wise. If the matrix is store row-wise, then the stride is the row length.

- From the performance point of view strided access is not optimal, but it could be worse. The bigger the stride, the worse the performance.
**Random access pattern**: you are accessing memory unpredictably. This happens if you access data in fast random access data structures, such as trees or hash maps, or you are dereferencing pointers. This access pattern is very bad for software performance.

There exist techniques to move from worse to better access patterns. For example, moving away from Array-Of-Structs to Struct-Of-Arrays is equivalent to moving away from strided memory access pattern to sequential memory access pattern. In this post we investigate approaches to improve the memory access pattern for matrix multiplication,

## Analysis of Naive Matrix Multiplication

What the information presented in the previous section has to do with matrix multiplication performance? Let’s analyse the matrix multiplication algorithm again:

for (int i = 0; i < n; i++) { for (int j = 0; j < n; j++) { c[i][j] = 0; for (int k = 0; k < n; k++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } }

When analyzing algorithms, we should focus on the lines that execute the most often. In our case, it is the statement on line 5. Let’s analyze the access pattern of array accesses with regards to the innermost loop (loop over `k`

):

**Access**. Variable`c[i][j]`

represent a constant access`k`

increases by one, but we are always accessing the same element.**Access**. When the variable`a[i][k]`

represents a sequential access`k`

increases by one, we are accessing the neighboring memory address. This is also called “running through the matrix row-wise”**Access**. When the variable`b[k][j]`

represents a strided access with stride`n`

`k`

increases by one, the program is accessing memory address which is`n`

elements away from the memory address accesses in the previous iteration. This is also called “running through the matrix column-wise”.

Accesses to arrays `a`

and `c`

are fine, but access to `b`

is inefficient because it is a strided access. **Ideally, we would like to have only constant accesses and sequential accesses**. A technique called *loop interchange* can help fix some of the problems related to bad memory access pattern.

## Loop Interchange

Let’s introduce two new terms that we need before explaining loop interchange. We say that a **loop nest (several nested loops) is perfectly nested if all the statements doing any processing are in the innermost loop**.

// Perfectly nested loop nest for (int i = 0; i < n; i++) { for (int j = 0; j < n; j+) { c[i][j] = 0; } } // Imperfectly nested loop nest for (int i = 0; i < n; i++) { int sum = 0; // This statement prevents perfect nesting for (int j = 0; j < n; j+) { sum += c[i][j]; } a[i] = sum; // This statement also prevents perfect nesting }

The loop nest on lines 2-6 is perfectly nested. The loop nest on lines 9-15 is imperfectly nested because statements `int sum = 0;`

(line 10) and `a[i] = sum;`

(line 14) prevent perfect nesting.

We say that the **loop iterations are independent of one another if there is no data dependency between the current iteration and previous iterations**.

// Loop iterations are independent of each other for (int i = 0; i < n; i++) { for (int j = 0; j < n; j+) { c[i][j] = a[i] * b[j]; } } // Loop iteration in [i][j] depends on work done in previous iterations for (int i = 1; i < n; i++) { for (int j = 0; j < n; j+) { // Iteration [i][j] depends on iteration [i - 1][j] c[i][j] = c[i - 1][j] + 1; } }

In the above example, the statement on line 4 doesn’t depend on the values calculated by the same statement in the previous iterations. On the contrary, the statement on line 12 depends on the values calculated in the previous iterations. We cannot execute iteration `i`

before we have executed `i - 1`

. The loop on line 9, for instance, cannot run in reverse, i.e. the loop over `i`

cannot go from `n - 1`

to `1`

.

Now that we have the necessary prerequisites to explain loop interchange, here is the explanation: **if the loop nest is perfectly nested**^{1} **and there are no dependencies between the iterations of the loop**^{2}, **we can exchange the loops in any way we like**. For example, if our code has three nested loops: loop over `i`

, loop over `j`

and loop over `k`

, where the loop over `i`

is outermost and the loop over `k`

is the innermost, we can exchange the loops, for example: to make the loop over `k`

outermost and loop over `i`

innermost.

**In our naive multiplication example, however, our three loops are not perfectly nested**. The line in the innermost loop `c[i][j] = c[i][j] + a[i][k] * b[k][j]`

depends on `c[i][j]`

being properly initialized to zero. The good news is that we can make them perfectly nested by moving the initialization of `c[i][j]`

to a separate loop, like this:

for (int i = 0; i < n; i++) { for (int j = 0; j < n; j++) { c[i][j] = 0; } } for (int i = 0; i < n; i++) { for (int j = 0; j < n; j++) { for (int k = 0; k < n; k++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } }

After this transformation, if you look at the loop nest on line 7-13, you will see that the statement `c[i][j] = c[i][j] + a[i][k] * b[k][j]`

(line 10) doesn’t depend on the previous iterations of the loop. We can perform the loop interchange of loops over `i`

, `j`

and `k`

in any way we like and the results will remain the same.

So, how to use loop interchange for our benefit? If the innermost loop would iterate over `j`

, this would mean that we are getting rid of the strided accesses in the innermost loop completely. We can perform the loop interchange between the loops over `j`

and loops over `k`

and we get the following code:

for (int i = 0; i < n; i++) { for (int k = 0; k < n; k++) { for (int j = 0; j < n; j++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } }-

We performed loop interchanged and converted the loop nest (i, j, k) to loop nest (i, k, j). We *interchanged* the loops over `j`

and the loop over `k`

. We eliminated strided memory accesses in the statement in the innermost loop (line 4), now only constant (`a[i][k]`

) and sequential (`c[i][j]`

and `b[k][j]`

) memory access patterns remain.

This code is less readable than the original. The intention is not clear. However, it is faster. Here is the benchmarking result for a small matrix (240 x 240), medium matrix (1200 x 1200) and large matrix (1680 x 1680).

Matrix size | Naive version runtime | Interchanged version runtime |
---|---|---|

240 x 240 | 0.049 s | 0.006 s |

1200 x 1200 | 5.84 s | 1.22 s |

1680 x 1680 | 38.22 s | 3.59 s |

Interchanged version is more than 10 times faster than the naive version of matrix multiplication. The results are clear on the side of the version with loop interchange.

*Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available. Need help with software performance? Contact us!*

## Memory Efficacy

In the example of our matrix multiplication, if the matrix is large enough, each element of the matrix will be brought from the main memory to the CPU’s data cache `n`

times. On modern CPUs, the bus between the CPU and the memory is one of the biggest bottlenecks. This means that the huge amount of data that needs to be transferred between the CPU and the memory will cause the matrix multiplication to be slower than it needs to be.

There is one useful property of data caches we can exploit to decrease the amount of traffic between the CPU and memory: **if the program has accessed a certain memory location, a second access to the same location happening soon after the original access will be cheap**. This happens because the data we are accessing the second time is probably still in the data cache.

In the case of our matrix multiplication with loop interchange, we don’t exploit this property of the data cache. For example, for access to `a[0][0]`

the CPU will need to load the row `c[0][*]`

. To access `a[0][1]`

, the CPU will again need to load the same row again. Loading the whole row `c[0][*]`

will happen `n`

times. If the matrix is large enough, the row `c[0][*]`

will probably not be in the data cache when needed again, so the CPU will need to refetch it from the main memory.

## Loop Tiling

**Ideally, we would like to reuse data already present in the cache as much as possible. A technique we can use to achieve this is called loop tiling or loop blocking**. It goes like this. We need the whole row

`c[0][*]`

for both `a[0][0]`

and `a[0][1]`

. Also, we need the whole row `b[0][*]`

for both `a[0][0]`

and `a[1][0]`

.While iterating over `c[0][*]`

and `b[0][*]`

, instead of going to the end of the row and using `a[0][0]`

all the time, let’s stop after processing 2 elements, and then move to use `a[0][1]`

. After processing another 2 elements, let’s process `a[1][0]`

and use another 2 elements, etc. We say that the *tile size* for this kind of partitioning is 2. The access pattern looks like this:

Block | Iterator variables | Memory access pattern for tile_size = 2 |
---|---|---|

1 | i = 0 k = 0 j = 0 .. 1 | a[0][0] <= c[0][0] b[0][0] a[0][0] <= c[0][1] b[0][1] |

2 | i = 0 k = 1 j = 0 .. 1 | a[0][1] <= c[0][0] b[1][0] a[0][1] <= c[0][1] b[1][1] |

3 | i = 1 k = 0 j = 0 .. 1 | a[1][0] <= c[1][0] b[0][0] a[1][0] <= c[1][1] b[0][1] |

4 | i = 1 k = 1 j = 0 .. 1 | a[1][1] <= c[1][0] b[1][0] a[1][1] <= c[1][1] b[1][1] |

**When accessing the matrix in this manner, we see data reuse**. Calculations in blocks 1 and 2 use the same values from array `c`

. Calculations in blocks 1 and 3 use the same values from array b. A similar observation applies to blocks 3 and 4, and, blocks 2 and 4. The second time our program accesses the data, the data will probably be in the data cache.

**To rewrite a program to use loop tiling, we pick a constant tile size or block size.** In the previous example tile size was 2, but for good performance it will be larger.

**We iterate through the matrix tile by tile**. In the case of our matrix multiplications, our iteration space is three-dimensional, and each tile has

`(tile_size)`^{3}

elements. Let’s assume the tile size is 8, then the first tile starting coordinates (i, j, k) will be (0, 0, 0) and end coordinates (7, 7, 7). There will be a tile that starts at (0, 0, 8), a tile that starts at (0, 8, 8), and a tile that starts at (8, 0, 0) and so on. Inside a tile, the values of (i, j, k) increase by one, e.g. (8, 0, 8) -> (8, 0, 9) -> (8, 0, 10) -> … -> (8, 0, 15) -> (8, 1, 8) -> … -> (8, 7, 15) -> (9, 0, 8) -> … (15, 7, 15).Let’s rewrite our program to use loop tiling^{3}. Our code looks like this^{4}:

for (int ii = 0; ii < n; ii += tile_size) { for (int kk = 0; kk < n; kk += tile_size) { for (int jj = 0; jj < n; jj += tile_size) { for (int i = ii; i < ii + tile_size; i++) { for (int k = kk; k < kk + tile_size; k++) { for (int j = jj; j < jj + tile_size; j++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } } } } }

**The three outermost loops (lines 1-3) share a similar appearance**. Their indexes are called `ii`

, `jj`

and `kk`

. All three of them go from zero to `n`

, and all the indexes increase by `tile_size`

.

**The three innermost loops (lines 4-6) also share a similar appearance**. All three have a constant number of iterations which is `tile_size`

. The loop over `i`

starts from `ii`

and ends with `ii + tile_size`

. The loop over `j`

starts from `jj`

and ends with `jj + tile_size`

. A similar observation applies to the loop over `k`

.

Traditionally, loop variables in the inner loops will have names `i`

, `j`

, etc, whereas the loop variables in the outer loops will have names where the letter is duplicated `ii`

, `jj`

, etc. There are also other ways of naming the variables for loop tiling, e.g. outer loop variables can have names `i0`

, `j0`

and `k0`

.

Loop tiling is traditionally done in this way: **for an outer loop with a loop variable ii, going from zero to n with increment tile_size, there is a corresponding inner loop with a loop variable i going from ii to ii + tile_size with increment 1**. In our example, you can see three such pairs: pair

`i/ii`

(lines 1 and 4), pair `j/jj`

(lines 2 and 5) and pair `k/kk`

(lines 3 and 6).Let’ compare all three versions when it comes to runtimes and the amount of data exchanged between the memory and the data cache:

Matrix size | Naive multiplication | Interchanged multiplication | Tiled multiplication (tile size = 12) |
---|---|---|---|

240 x 240 | Runtime: 0.049 s Transferred: 0.098 GB | Runtime: 0.006 s Transferred: 0.004 GB | Runtime: 0.005 s Transferred: 0.002 GB |

1200 x 1200 | Runtime: 5.84 s Transferred: 9.77 GB | Runtime: 1.22 s Transferred: 8.35 GB | Runtime: 0.59 s Transferred: 1.03 GB |

1680 x 1680 | Runtime: 38.22 s Transferred: ??? | Runtime: 3.59 s Transferred: 30.59 GB | Runtime: 1.64 s Transferred: 3.37 GB |

**The version of the algorithm we used in testing**^{5} **(with tile size 12) is more than 2x faster than the interchanged version**. **The amount of traffic between the main memory and the CPU has decreased dramatically**, in the case of 1680×1680 matrices, the **amount has decreased 9 times**.

A note regarding matrix multiplication and making this algorithm run on several CPUs: Algorithms that don’t transfer a lot of data between the memory and the data cache typically run better when modified to run on several CPU cores. The bandwidth between the memory and the CPU is a shared resource for all CPU cores and can be a bottleneck if all CPU cores use too much of it. The loop tiling algorithm in principle should scale better to multiple CPU cores than the loop interchanged algorithm since it uses less memory bandwidth.

*Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available. Need help with software performance? Contact us!*

## The problems with loop tiling

I have made the story about loop tiling intentionally simple so it can be easy for you to follow it. There are several problems with loop tiling that we didn’t mention in the previous sections.

### Edge cases

In the previous section, we assumed that the matrix size is a multiple of tile size. This is, however, not always the case. **The tiles close to the edges of the matrix will be smaller than other tiles and this needs to be treated as an edge case**.

One way to fix it is to allow smaller tiles. E.g. instead of iterating from `kk`

to `kk + tile_size`

, we would iterate from `kk`

to `MIN(kk + tile_size, n)`

. The problem with this approach is that the tile size is not a compile-time constant anymore. This translates to a less efficient assembly by the compiler or less efficient manual assembly.

Another solution is to treat edge tiles as special cases. We took the original loop tiling matrix multiplication and rewrote it to work for all matrix sizes:

for (int ii = 0; ii < n; ii += tile_size) { for (int kk = 0; kk < n; kk += tile_size) { for (int jj = 0; jj < n; jj += tile_size) { for (int i = ii; i < MIN(ii + tile_size, n); i++) { for (int k = kk; k < MIN(kk + tile_size); k++) { int j_end = jj + tile_size; if (j_end <= n) { for (int j = jj; j < j_end; j++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } else { for (int j = jj; j < n; j++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } } } } } }

On line 5 we use `MIN(kk + tile_size)`

to adjust the tile size for the edge case. In the innermost loop, however, we have an if statement (line 7) that disambiguates between the cases: the regular case (line 7) which executes most often and the edge case (line 11) which executes rarely. We can optimize the regular case for speed, and keep the edge case simple.

### Efficient matrix multiplication that uses loop tiling can only be written using assembler or intrinsics.

To illustrate loop tiling, we gave an example code in C. Note, however, that we didn’t use that example for testing. Why? The compilers are really good when it comes to optimizing loops with many iterations that iterate sequentially over arrays, which is the case for the loop interchanged version. Unfortunately, they are really bad when it comes to optimizing three short nested loops with a fixed trip count, which is the case of the tiled version. Therefore, **we had to rewrite the tiling version using AVX intrinsics to get good performance for the tiling, without it, the version written in plain C is actually slower than the interchanged version**:

Matrix size | Interchanged version | Tiled version in C (tile size 12) | Tiled version using AVX intrinsics (tile size 12) |
---|---|---|---|

240×240 | 0.007 s | 0.024 s | 0.006 s |

1200×1200 | 1.26 s | 2.37 s | 0.61 s |

1680×1680 | 3.74 s | 6.47 s | 1.67 s |

For peak performance, the CPUs need to run vectorized code. The compiler easily vectorizes the loop interchanged version, but not the loop tiled version. Rewriting the code using AVX intrinsics helps fix this issue.

### Picking the good value for tile size is architecture-specific

**The choice of tile size constant is crucial for good performance**. In our example, we used one constant for all three loops: loop over

`i`

, loop over `j`

and loop over `k`

. However, you can use three different constants for three different loops. Even if you find the perfect combination of three constants for a specific matrix size, it might happen that the same constant is not the fastest for another matrix size.**The value for tile size of each loop will depend on the size of the data caches your CPU has as well as cache associativity**. It is very hard (or maybe impossible) to theoretically determine the value for

*tile size*; usually, these values are determined experimentally.

We measured the performance of loop tiling for matrix multiplication with large matrices (1680×1680) on two different CPUs using different tile sizes. Here are the results:

Tile size | Intel i5-10210U Runtimes | AMD A8-4500M Runtimes |
---|---|---|

8 | 2.88 s | 15.07 s |

12 | 1.67 s | 15.95 s |

24 | 2.23 s | 26.44 s |

48 | 1.94 s | 17.00 s |

60 | 2.28 s | 23.93 s |

80 | 2.70 s | 22.06 s |

In the above measurement, Intel Intel i5-10210U had the best runtime when tile size was 12, compared to AMD A8-4500M which had the best runtime when the tile size was 8.

**In case you wish to speed up matrix multiplication by distributing the loop tiling algorithm to multiple CPU cores**, it might happen that **you will need to pick yet another value for tile size**. The bigger the tile size, the larger the stress on the data cache; when this algorithm runs on two CPUs that share the same data cache, the two CPUs might evict data from the cache from each other. In that case, you will want to decrease the tile size to decrease the pressure on the data cache.

### Architectural details

**There are several architectural details** that most people never heard of, e.g. loop alignment, available resources in CPU, loop size, etc, **that affect the performance of matrix multiplication**. Moreover, different CPU architectures have different values for these parameters. If peak performance is needed, the code will need to be written in assembler, since only there can we explicitly control instruction placement and instruction scheduling.

For these reasons, all the libraries that implement matrix multiplication (such as OpenBLAS or Intel Math Kernel Library) have several implementations for different CPUs and use CPU dispatching to pick the best one for the current architecture.

## Summary

In this post we explored efficient matrix multiplication. It is a simple example, but the **techniques we presented can be applied in many algorithms that work on matrices**.

The rules of the efficient matrix multiplications are simple: use the data cache as optimally as possible, and reuse data already available in the data cache as much as possible. Loop interchange as a technique is very simple, and yet it can be used to speed up places in your code where you are looping through your matrix in an inefficient**, column-wise **manner.

**Loop tiling is more complex**, but if you are into peak performance, this is the way to go. It** allows you to reuse data already in the cache**. Instead of visiting your matrix row by row, you will visit the matrix in blocks. This decreases the amount of data that the CPU needs to bring from the memory to the cache and positively impacts performance.

Need help with software performance? Contact us!

- If the loops are not perfectly nested, sometimes you can make them perfectly nested by moving the commands preventing the perfect nesting into separate loops. [↩]
- There are cases where loop interchange can be done even when there are dependencies between the loop iterations. [↩]
- This is the conceptual version of loop tiling written in C. Unfortunately, the compilers won’t translate this into efficient assembly. The version we used for testing is written using AVX intrinsics, only in that case does it make sense to compare loop tiling with loop interchange. Later we explain why. [↩]
- There is an assumption in our code that
`n`

is divisible by`tile_size`

[↩] - The version we used for testing loop tiling is not the C version we just gave as an example, but we had to rewrite it using AVX intrinsic to get these numbers. We talk about why is this the case later [↩]