Link Time Optimizations: New Way to Do Compiler Optimizations

Early in their career every C/C++ developer has had an eureka moment: the discovery of optimization options in their compiler. A discovery that there is GCC compiler offers -O0 optimization lever for regular debugging and -O3 for fast release code will be one of the moment I will never forget. I just needed to compile my program with -O3 and everything would run faster without any effort on my side.

Many years have passed since and I believe everyone compiles their code in the similar way: we compile our .c or .cpp files using a compiler, with a certain optimization option (-O0 or O3) to get object files (.o). Then linker takes all our object files and merges them into one big executable or a library.

In this story, linker is the last piece of the puzzle, and compared to what the compiler is doing, its job is much simpler. But doing things this way we are missing out on some important optimization opportunities. So, ladies and gentlemen, allow me to explain.

Motivation

As I said earlier, compiler makes many optimizations on the code. Allow me to present two of them: inlining and loop merging. Have a look at the following code:

int find_min(const int* array, const int len) {
    int min = a[0]; 
    for (int i = 1; i < len; i++) {
        if (a[i] < min) { min = a[i]; }
    }
    return min;
}

int find_max(const int* array, const int len) {
    int max = a[0]; 
    for (int i = 1; i < len; i++) {
        if (a[i] > max) { max = a[i]; }
    }
    return min;
}

void main() {
    int* array, len, min, max;
    initialize_array(array, &len);
    min = find_min(array, len);
    max = find_max(array, len);
    ...
}

This is a very simple code that initializes an array and then finds its minimum and maximum. Please pay attention to functions find_min and find_max and notice they are very similar. When we compile optimized version of our program, the compiler might inline functions find_min and find_max. Inlining simply means to remove the calls to those functions and instead copy their bodies in the place of the call. After having done the inlining, the compiler sees that both find_min and find_max loop over the same range and there are no data dependencies between them so it decides to do loop merging. It merges loops from find_min and find_max into a single loop. This decreases amount of work the CPU needs to do and also increases data cache utilization. So the optimized version might look like this:

void main() {
    int* array, len, min, max;
    initialize_array(array, &len);
    min = a[0]; max = a[0];  // compiler inlined find_min and find_max
    for (int i = 0; i < len; i++) {  // compiler merged two loops int one
        if (a[i] < min) { min = a[i]; }
        if (a[i] > max) { max = a[i]; }
    }
    ...
}

In the optimized version calls to functions find_min and find_max have completely disappeared. This saves some calling time. Furthermore, the compiler merged loops from find_min and find_max and now we are looping just once.

These two optimizations alone will benefit the performance. But everything we talked about here can happen if main, find_min and find_max are in the same compilation unit (in C and C++ compilation unit is a file since compilers work file by file).

If find_min and find_max are defined in find.c and main is defined in main.c, the compiler will not be able to perform the optimization. While the compiler is compiling main.c it is aware of find_min and find_max only through their signature. It knows these two functions exist, but it doesn’t know how do they look internally or their addresses. It will leave placeholders to call these two functions, and later the linker will resolve calls to those two functions with actual addresses. If the compiler and linker work in this way, they miss out on important optimization potential.

You probably have guessed: why doesn’t the linker do some optimizations? Linker is aware of multiple compilation units and certainly could do them. Well, ladies and gentlemen, allow me to introduce:

Link Time Optimizations

Introduction

Link time optimizations (abbr. LTO) are just that, optimizations done during linking. Linker is aware of all the compilation units and can optimize more compared to what conventional compiler can do. The most important optimizations that linker does are inlining and code locality improvements. As already said, inlining just means copying body of a function in the place of the call instead of calling the function directly. Linker inlines functions from other compilation units if that makes sense. When the function is inlined, the linker has an opportunity to do other optimizations, similar to what compilers do when they compile a singe file.

Regular compilers try to figure out which functions execute often and which functions execute rarely. Depending on that they move the functions that execute often close to one another in memory in order to improve code locality. Additionally, if a function A() calls function B() it makes sense to put them close to one another in memory for the same reason. Better locality means less page faults1 and faster code. Linkers are much better aware how the functions from different compilation units call one another and therefore can put them in memory so that they optimize for better code locality.

The result of LTO is a binary which is typically a few percent faster than the original and a few percent smaller. This is not much but for performance sensitive programs can be a big deal.

There is a downside to LTO however. Compiling and linking with LTO is much slower and uses more memory, and this doesn’t scale well. For large projects, e.g Chrome browser or Firefox it requires a dedicated machine with a lot of memory to link. The reason is quite simple, it’s because linker now has to keep in memory all compilation units.

Also, total compilation time with LTO enabled is longer and I think it is one of the reasons why we haven’t seen its wide adoption (developers are not patient people). It is very simple to enable LTO, and later we will enable it on a some projects and measure the results.

Under the hood

You can skip this and move to next chapter as this is not important to work with LTO.

So how does LTO work? Before LTO, linkers were in change only of linking. Linkers would take all the generated object files, for each file they would resolve calls to functions in other object files, create a table of exported functions and finally output the executable to disk.

Now with LTO the linker takes a lot of responsibilities compilers used to have, namely the optimization. Common object files contain binary code, but it is impossible to do LTO on the binary code. So the compiler needs to be run with LTO enabled in order to produce an object file that can be used for LTO. When compiler is started in LTO mode, it writes in special .lto sections the intermediate representation of the code. Intermediate representation is the thing on which the compilers do the optimizations; now when the optimization step is moved to linking phase, this information is necessary for the linker as well.

The line between compilers and linkers gets blurry. I wouldn’t be surprised if one day it disappears and we are left with only one tool.

How to enable link time optimizations?

LTO has been supported on compilers for a long time; GCC 4.7 which is eight years old supports it and I assume that all the other popular compilers support LTO for the same time.

To enable LTO, follow these simple steps:

  • Add option -flto to the invocation of compiler.
  • Add option -flto to the invocation of the linker. Additionally, you need to add all options from the compiler invocations to the invocation of the linker. So if you called your compiler with “-march=i486 -O3 -fno-stack-protector“, you will need to pass the same options to the linker.

Now you compile your program as regular. Unless you are using a very old version of the compiler, you shouldn’t expect any problems here.

Experiments

Now let’s get our hands dirty. I already made some tests on ProjectX2, the commercial project I am working on, so I am going to recount my experiences with it. Second thing we are going to investigate is the impact of LTO on ffmpeg open source software used for media files manipulation.

ProjectX

ProjectX is a medium sized project written in C++. There is no special team which is in charge of tools or performance3 and this corresponds to what we typically see in most medium sized commercial projects. It is compiled using GCC 4.9.4 and without LTO. There is one component of the system which is performance critical and I measured how the performance of that component behaves with respect to LTO. I picked a single test since testing is slow and repeated it 10 times. Here are the results:

PhaseWithout LTOWith LTODifference
Test runtime151.7s138.7s+9.2%
Binary size
– libA
– libB
– libC
– libD

362kB
1698kB
31.6MB
1204kB

292kB
1394kB
23.8MB
1144kB

+24%
+21.8%
+32.7%
+5.2%
Compile time for
a single .cpp file
20.2s214.2s10.6 x slower
Linking time for libC
(18 .o files)
1.6s63.9s39.9x slower
Linking memory
consumption for libC
119MB709MB5.95x more
Comparison between LTO and non-LTO compilation

As the above table shows, the test runtime has decreased by 9.2%, and binaries sizes have decreased by 20% on average.

But the resources needed for compilation have increased dramatically. Compiling a .cpp file is 10 times slower, linking is 40 times slower and memory consumption needed for linking is almost 6 times higher. Please note that GCC 4.9 is really old and LTO has been correctly implemented just in that version. Let’s try a next experiment, maybe we’ll have more luck there.

ffmpeg

ffmpeg is a famous library for processing audio and video. We used ffmpeg 4.2.3 available from the web site and compiled it with GCC 8 and CLANG 9. The configure script of ffmpeg has an option to enable LTO, so that’s just what we did. ffmpeg uses some hand written assembler, we disabled this in order to measure the impact of LTO on optimizing C code. The configure line we used is:

./configure --enable-lto --disable-inline-asm --disable-x86asm

Enabling LTO on GCC went easily. On CLANG however we needed to tweak the configure command line to force it to use LLD linker4. We also wanted to try out ThinLTO, a feature of CLANG that promises the same performance gains as regular LTO and compiles in time comparable to LTO.

To test the performance we asked ffmpeg to convert a mpegts container with h264 video and aac audio to matroska container with mpeg4 video and ac3 audio. The input file was 708MB large and the output was 349MB. Here are the results of our measurements:

PhaseGCC 8 no-LTOGCC 8 LTOCLANG 9 no-LTOCLANG 9 LTO
Conversion time1198s1188s884s891s
ffmpeg size17.1MB17.7MB18.7MB20.5MB
Compilation time
(make -j4)
626s1369s646s1112s
Compiling ffmpeg.cc5.5s1.1s4.9s2.6s
Linking ffmpeg time13.1s1132s7.7s368s
Linking ffmpeg memory287MB381MB203MB988MB
LTO to no-LTO comparison on various aspects

I was expecting something else, but facts are stubborn things. First thing we see from the above table, is that CLANG generates a much faster binary compared to GCC for both non-LTO and LTO. The version of CLANG was one year younger than GCC, but the difference is still remarkable. GCC generates smaller binaries, and LTO binaries are bigger than non-LTO binaries for both compilers.

Compilation takes double the time with LTO enabled for both compilers. Linking time is huge with LTO for both compilers, on GCC it’s 86 times slower, on CLANG 48 times slower.

This experiment has shown that, for ffmpeg at least, enabling LTO doesn’t bring expected speed ups.

Final Words

Original assumption was that the turning on LTO will be followed with an improvements in speed of a few percent, decrease in binary size of a few percent, and spending more time and needing more memory for compilation and linking.

On ProjectX, with a several year old toolchain that did happen. ffmpeg tells completely different story. What I assume to have happened is that ffmpeg has already been highly optimized for speed; the developers moved all the small functions into headers so they can be easily inlined by the compiler. Also ffmpeg is written in C which is much easier to keep low-overhead than it is the case with C++. to Most of the optimization potential has been used already and linker can’t do much more.

I think that for most projects that haven’t been aggressively optimized for speed, it makes sense to give LTO a try. I would expect to see the promises made by LTO related to speed and binary size fulfilled most of the time. But at the end, the ultimate indication if LTO should be enabled or not are performance numbers on your specific project.

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

Further Read

Optimizing real-world applications with GCC Link Time Optimization

LLVM Blog: ThinLTO: Scalable and Incremental LTO

Honza Hubička’s Blog: GCC 9: Link-time and inter-procedural optimization improvements

  1. Page faults happen when your program calls another function or tries to access another part of memory, but this part of memory hasn’t been loaded from the disk yet. In that case the operating system needs to load the missing part from the disk which results in program slowdown []
  2. ProjectX is a made up name, since I am not allowed to disclose the details of the project []
  3. Large software project, for example Chromium or Firefox have teams that are only in charge of tweaking system performance. []
  4. LLD linker is a linker provided by LLVM; LLVM also provides CLANG. This linker understands CLANG intermediate representation out of the box and can link LTO enabled object files []

Leave a Reply

Your email address will not be published. Required fields are marked *