We at Johny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
Sometimes, when profiling your program, you will see the text `Unknown` as the function name. Although there can be other reasons for this (e.g. the binary was built without debugging symbols), it often means that the operating system is doing work on behalf of the program. The profiler doesn’t know how to resolve function names that belong to the kernel, which is why it reports `Unknown`. Sometimes you will instead see the name of the kernel function where your program is spending time. Either way, if the program’s runtime is unacceptable, you should take a peek and see what is going on.
When the operating system is executing code for the program, regardless of the type of that code, we say that the program is running in kernel space, or that it is running in system mode¹. System mode is a mode in which kernel code runs with elevated privileges, performing various services for your program. In this and the following post we investigate programs that spend a lot of time in kernel space, the techniques you can use to find the bottlenecks, and how to fix the problems.
Why do programs go into system mode?
There are two reasons why a program goes into system mode:
- The program made a system call: it is asking for a service from the operating system, e.g. allocating memory, reading a file or sending data over a TCP connection.
- The program has gotten into a state where it cannot move forward without the kernel’s help: e.g. the requested memory page is not in memory, a floating-point error occurred, or an instruction is not supported and has to be emulated. The instruction your program was executing trapped, and the kernel took over to fix the issue.
In this post we will talk about how to figure out which system calls your program is spending time in, and what you can do to fix the issues. In the next post we will talk about how to figure out if your program is slow because it keeps getting interrupted by the kernel, and what to do about it.
How to figure out if your program is spending too much time in kernel space?
On Linux, it is really simple to measure how much time your program is spending in system mode using the `time` utility. Here is an example:
```
$ time ./tmcstepper -s 32 -i 1810
...
real	0m57,327s
user	0m51,219s
sys	0m6,108s
```
In the above output, the program ran for 57 seconds, out of which it spent 6 seconds in system mode (the line beginning with `sys`).
Which system calls are taking the most time?
A Linux utility called `strace` can help you analyze how your program interacts with the kernel through system calls. The utility is very powerful and we wrote about it in a previous post. For this post, we will use the `-fc` switch to get statistical information about system calls. We run our program by prefixing it with `strace -fc`, like this:
```
$ strace -fc ./tmcstepper -s 32 -i 1810
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98,97    0,264169          23     11032           brk
  0,49    0,001315          28        46           munmap
  0,31    0,000822          12        67           mmap
  0,05    0,000141          28         5           openat
  0,04    0,000100          12         8           mprotect
  0,03    0,000089          14         6           pread64
  0,03    0,000086           3        23           write
  0,03    0,000068          17         4           read
  0,02    0,000061          12         5           close
  0,02    0,000061          10         6           fstat
  0,00    0,000007           3         2         1 arch_prctl
  0,00    0,000000           0         1         1 access
  0,00    0,000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0,266919                 11206         2 total
```
After the program finishes, `strace` prints the time spent in the various system calls. Our program, `tmcstepper`, spent a lot of time in the `brk` system call. Please note that `strace` reported only 0.26 seconds spent in system calls; this is a bug in `strace`, and the time actually spent is much longer.
What code made the system call?
One of the tools that can tell you the lines where a system call was made is `perf trace`. Unfortunately, the tool is not straightforward to use, and explaining it would take a whole post. If you need to use it, there is information on how to do so online. Just remember, the command to record the trace is:

```
perf record -e 'syscalls:sys_enter_*' --call-graph dwarf ./tmcstepper
```

for all system calls, or

```
perf record -e 'syscalls:sys_enter_brk' --call-graph dwarf ./tmcstepper
```

for the `brk` system call only.
Note, however, that most of the time you don’t need this information. Regular profiling tells you the name of the function where your program spends time in kernel space. If you know what the system call does, it should be fairly easy to find the line that is making it.
How to fix the problem?
Fixing the problem depends on the type of system call causing the slowdown. The idea is the following: google the system call to see what it does; this should give you a rough idea of where to look for solutions. We talked about system calls in the post about `strace`; here we give a short summary of how you can resolve some common problems related to system calls:
- System calls that work with memory (`brk`, `mmap`, `munmap`): these calls are typically made by the system allocator. Decrease the pressure on the system allocator, or find a better one. We talked about this in the post about the price of memory allocation.
- System calls that work with files (`read`, `write`): these calls are typically made by functions that work with files, e.g. `printf` in C, or file streams in C++. Make sure you are not flushing too often (in C++ this happens, for example, if you write `std::endl`), and make sure your program reads and writes in larger blocks. If you are using random access to read data from a file, consider memory-mapped files.
There are other groups of system calls that we didn’t cover here, since I am not experienced enough to give recommendations for those types of problems. However, a pattern emerges: a smaller number of system calls made by your program almost always translates to a faster runtime.
In our example, the program was spending a lot of time in the `brk` system call. By investigating the source code we determined that the program was allocating and deallocating memory in a short function called many times. A possible solution was to replace the system allocator. We used the `LD_PRELOAD` environment variable to make our program use the `tcmalloc` allocator. Here is the output of the `time` command with the new allocator:
```
$ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
$ time ./tmcstepper -s 32 -i 1810
...
real	0m52,218s
user	0m52,146s
sys	0m0,048s
```
As you can see, the time spent in system mode dropped significantly, from 6.1 s to 0.05 s, and the program’s total runtime fell as well, from 57 s to 52 s. Originally there were 11032 calls to `brk`; after this change there are 22.
You can use the `time` utility to quickly check whether your program is spending time in system mode, and, if this is unexpected, you can use `strace` to figure out which system calls take the most time. From that point the story gets more complicated: figuring out how to decrease the time spent in system mode is not always obvious, and will depend on the problematic system call itself.
¹ There is a difference between these two terms in other contexts, but here we use them interchangeably.