We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
I think every engineer who has been working in software development long enough has experienced something like this: a program that ran at an acceptable speed six months ago no longer runs fast enough. Sometimes bisecting the repository can help discover the culprit, but often the performance degraded a little bit with each new commit or new feature. What is the explanation for this? And are there any ways to remedy it?
In this post we talk about the reasons why software becomes slower as new features are added or the data set grows. Ideally, we would like two things:
- The performance of our program doesn’t change as features are added, and
- The runtime of the program grows linearly with the data set size.
In practice, neither is guaranteed. In the rest of this post we analyze the factors that make your program run slower than expected when new features are added or the data set grows.
Architectural Issues

Architectural issues are performance problems that appear because the software’s architecture wasn’t designed with performance in mind. They can be quite nasty, because fixing them requires coordination between several teams, API rewrites, and changes in many places.
One of the important considerations in API design is simplicity. But simplicity should not be the only consideration. When it comes to performance, a component’s API should have the following properties:
- Minimize the number of times component A has to talk to component B
- Minimize the amount of data exchanged between component A and component B
Chatty Components

Components that talk to each other a lot are called “chatty”. This happens when component A, instead of sending a batch of data to component B for processing, sends pieces one by one. Trying to keep an API simple often results in an API that can process only one piece of data at a time. There are, however, several problems with chatty components:
- A component is implemented as a collection of functions. A function call takes some time, and if the function body does little work, the overhead of the call can be greater than the work done by the function.
- Functions inhibit compiler optimizations. Not only is there the overhead of the function call, but in the presence of a function call the compiler must be conservative and skip many useful optimizations. We talk about this here.
- If the component is thread-safe, a mutex will protect its critical section. Mutex locking and unlocking is another source of overhead, even greater than a function call, and it is especially noticeable with short functions.
These problems are all related to the overhead of the chatting. The overhead may be acceptable while the two components live in the same process, but if at some point they need to be moved into different processes (for security, for example) or onto two different computers, the cost of communication can become huge.
There are also additional problems with chatty components that go deeper:
- Processing data in batches is more efficient than processing it piece by piece. A component can organize its data better if it knows in advance how much data it is going to process.
Example of malloc/free
Let’s take the example of the system allocator. The system allocator is a component shared by all other components in the program. Its API is very simple:
- malloc to allocate memory and
- free to deallocate memory.
Let’s say we are building a large binary tree, and we need to make 1 million calls to malloc. What happens is:
- There will be one million calls to malloc – the price of function calls
- One million calls to mutex lock and mutex unlock – the price of locking1
- After a call to malloc, the compiler must assume that the state of global memory may have changed, so it has to omit certain compiler optimizations.
On the system allocator side, the allocator could have organized its memory much more efficiently if it knew in advance that 1 million identical memory blocks would be requested.
A more efficient version of malloc would allow us to specify how many chunks of memory we want to allocate in a single call. This removes the overhead of function calls and mutex locks. Also, the allocator can calculate how much memory it needs and request enough memory from the operating system in advance.
Data Copying and Data Conversions
Neither data copying nor data conversions do any useful work. Data copying just moves data from one place to another. Data conversions translate data from its input memory representation to a different memory representation, without essentially modifying it.
Usually, component designers start off by designing the component’s API. As part of the API, they choose a certain data representation for the input and output data. They might design component A so that it offers preallocated blocks of memory for the user to fill in: the user fills in the data, and the component then processes it. If you have to make component A work with component B, and component B outputs data into its own preallocated buffer, then you will need to copy the output data of component B into the input buffer of component A. This copy is not essential and does no useful work.
A similar story applies to data conversions. If two components do not share an identical data representation, some kind of data conversion has to happen, and that conversion costs time.
As new components are added to the system, the price of data conversions and data copying can become substantial. Component designers should agree in advance on a common data format in order to avoid unneeded data conversions and data copying.
Contention

Contention is a fancy word for getting stuck in a line waiting for something to happen. In the world of components, contention happens when an instance of a component is oversubscribed: there are just too many requests, and the component cannot process them all in time.
If a component has some internal state and is used in a multithreaded system, modifications of its internal data must be protected by a mutex. With the mutex in place, there is some probability that a component’s user gets stuck waiting for the component to become available because someone else is using it.
In the case of contention (oversubscription) on the mutex, the waiting times can explode. The details are captured by Kingman’s formula, but the essence is that the average waiting time depends on the percentage of time the mutex is busy. The dependence is not linear; it looks something like this:
Although the exact parameters of the curve depend on other factors as well, its shape remains essentially the same. Once utilization grows beyond about 85%, the waiting time starts to grow dramatically. The difference in mean waiting time between 60% and 70% utilization is very small; the difference between 85% and 95% utilization is huge!
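For reference, Kingman’s approximation for the mean waiting time in a single-server queue, where $W$ is the mean wait, $\rho$ the utilization, $c_a$ and $c_s$ the coefficients of variation of inter-arrival and service times, and $\tau$ the mean service time, is:

```latex
W \;\approx\; \left(\frac{\rho}{1-\rho}\right)
              \left(\frac{c_a^2 + c_s^2}{2}\right) \tau
```

The $\rho/(1-\rho)$ factor is what makes waiting times blow up as utilization approaches 100%: going from 85% to 95% utilization roughly triples that factor.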
Example of logger
Now, take for example a logger component in typical software. Since all the components share the logger, and the logger writes to a single file, its critical section has to be protected with a mutex. Everything works fine as long as the other components don’t overuse the logger. But if at some point the logger’s utilization grows above a certain threshold (let’s say 95%), the whole program can grind to a halt. Even components that have only a single line to write to the log file will wait in the mutex queue for a very long time! You can enable verbose output in your system’s logger to see this behavior for yourself.
Kingman’s formula is so important that its repercussions can be seen everywhere: a database with a high utilization rate is a source of endless waiting, and so is an overloaded server. In a bank, if the teller is busy more than 90% of the time, a huge queue forms. If a road’s utilization rate is 95%, it will be congested, etc.
Algorithmic Issues

Another type of issue that contributes to a program’s slowdown is algorithmic. In my experience, if you profile your program with a small data set, you will see several different functions in the performance profile; when the data set is very large, typically only one or two functions dominate the profile. Why?
In your code you have several loops that do the bulk of the work. Some of them have O(n) complexity, some O(n log n), and some O(n²). When the data set is small, all these loops can have similar runtimes. But for a large data set, the loops with O(n²) complexity dominate the runtime profile, and the influence of the loops with lower complexity completely disappears.
So, when the data set is large enough, the loops with the highest computational complexity will consume all of your program’s time. Which brings us to the second point.
Loops with computational complexity greater than O(n log n) scale badly. The good ol’ complexity law is ruthless: if your loop has O(n²) complexity, then with a large enough data set it will eventually become the bottleneck. If you need to process large data sets with algorithms of complexity O(n²), O(n³), etc., you will have to resort to some form of massively parallel architecture, like supercomputers or accelerators.
The system allocator (the implementation behind malloc/free and new/delete) is a resource shared by all the components in a single process. If you introduce a new piece of code (a component or a library) that makes heavy use of the system allocator, this automatically increases memory fragmentation and data cache misses for all the other components in the system. This means that after you introduce the new component, completely unrelated components will run slower because of the increase in memory fragmentation and the data cache miss rate.
A piece of code that uses raw pointers, unique pointers, shared pointers, binary trees (std::set) or linked lists (std::list) will typically make a lot of calls to the system allocator. This kind of code tends to make all the other components in your program run slower.
A partial way to mitigate this is either to use a better allocator (tcmalloc, etc.) or to allocate all the memory for your critical data structures from a dedicated pool of memory. You can read more about it in our post about memory allocation.
Compiler Issues

Modern software performance is closely tied to compiler optimizations. Software compiled without compiler optimizations can be 10 times slower than the same software compiled with full optimizations.
However, compilers are not almighty. Internally, they often rely on heuristics and pattern matching to generate optimal assembly, and these things break easily.
For example, the compiler might have vectorized a critical loop to make it run faster. Or it might have inlined a function in a critical loop to avoid a function call and enable additional optimizations. But both vectorization and inlining are somewhat fragile: adding a single statement can break either of them, making the hot loop noticeably slower.
The break can happen when new code is added, but also between compiler versions or between different compilers. Compiler testing is rigorous for functionality, but less so for performance. This means that a loop that was fast with the previous version of the compiler might be slow now, because of bugs or simply because one of the critical heuristics changed.
Writing simple, easy-to-understand code is the best way to go, since the compilers are best optimized for generating efficient assembly for this type of code.
Hardware Issues

Hardware issues happen because, as the software evolves, it becomes less and less efficient at using hardware resources.
More code means more instruction cache misses
The instruction cache is a small cache memory inside the CPU that is used to speed up fetching the instructions from the main memory. If the instruction is in the instruction cache, it can be executed quickly, otherwise, the CPU needs to wait for it to be fetched from the main memory.
Of course, only the instructions that the CPU is currently using or has been using recently are kept in the instruction cache. As the CPU continues executing instructions, the instructions that haven’t been executed for a longer time are evicted from the cache to make place for new instructions.
The problem of instruction cache misses is directly related to program size (the program binary plus all the libraries it uses). The larger the program, the more it suffers from instruction cache misses. The penalty is not large, however; typically, every time a program doubles in size, it gets a bit slower (let’s say 1–2%).
Some programs are more prone to this problem than others. Programs that quickly move from one function to another exhibit it more; this is the case with many OOP programs (written with classes). It is usually not a problem for programs where the bulk of the time is spent inside a single loop or a few loops.
One can partially mitigate this problem through clever code arrangement: functions that call one another should be close together in memory. One approach is to record the behavior of the program on a live system, and then use tools to rearrange the function memory layout in a more hardware-friendly way. You can use profile-guided optimizations for this, or the BOLT tool.
Another approach is to recompile your code with the -Os (optimize for size) compilation flag. If your program’s bottleneck is the instruction cache, it can run faster when compiled with -Os than when compiled with aggressive optimizations (-O3). Profile-guided optimizations can also help, because they compile hot code with aggressive optimizations and cold code optimized for size.
Larger workloads mean more data cache misses
Modern CPUs are equipped with data caches – small on-CPU memories used to access data quickly. If the data the CPU is currently working on is in the data cache, access will be fast; otherwise, it will be slow. We talk about data caches and their influence on performance in much more detail here.
If your program uses random-memory-access data structures, like trees, hash maps or linked lists, the speed of access to a single piece of data depends a lot on the size of the data structure. When the data structure is small, it fits in the data cache. As it grows, the data cache miss rate grows, and performance gets worse.
Here is an experiment: we perform eight million searches on hash maps of various sizes. Here are the runtimes as a function of the hash map size:
As you can see, the decrease in performance is not linear; instead, there are large drops at the points where the hash map stops fitting into the L1, L2 and L3 caches, respectively.
There is no solution for the problem of data cache misses in very large random access data structures. The problem can be diminished to some extent, but not completely eliminated. You can find more information about it here.
Larger classes mean more data cache misses
Adding another member to a class may not seem like a big thing, but if the class is being processed in the hot loop of your code, each addition decreases the performance of the hot loop a bit.
The reason is again the data cache. Data is fetched from memory in blocks, typically 64 bytes in size. If the class contains members that are not used in the hot loop, those members are nevertheless fetched into the data cache. The speed of class processing decreases with each unused member added to the class.
We already covered the problem of class size and performance earlier; here is how the performance of class processing depends on the class size. The code of the measured methods (among them calculate_surface_visible) did not change at all; we only added padding to emulate larger classes:
The runtime when the class is smallest (20 bytes) is 31 milliseconds, but it quickly grows as more members are added to the class.
The workaround for this is good problem decomposition. If your hot loop processes instances of a class, make sure the loop uses all the members of the class; move unused members to other classes. This guarantees optimal usage of the memory subsystem.
- Many allocators manage to avoid locking most of the time by keeping a per-thread cache of memory blocks
- source: https://www.allaboutlean.com/kingman-formula/utilization-and-waiting-time/