Debugging performance issues in kernel space: system calls

Sometimes, when profiling your program, you will see the text Unknown as a function name. Although there can be other reasons for this (e.g. the binary was built without debugging symbols), it often means that the operating system is doing work on behalf of the program. The profiler doesn’t know how to resolve function names that belong to the kernel, and this is why it reports Unknown. Other times you will see the name of the kernel function where your program is spending time. Either way, if the program’s runtime is unacceptable, you should take a peek and see what is going on.

When the operating system is executing code on behalf of the program, regardless of the type of that code, we say that the program is running in kernel space, or that it is running in system mode1. System mode is a mode where kernel code runs with elevated privileges, performing various services for your program. In this and the following post we investigate programs that spend a lot of time in kernel space and the techniques you can use to find the bottlenecks and fix the problems.

Why do programs go into system mode?

There are two reasons why a program goes into system mode:

  • The program made a system call: the program is asking for a service from the operating system, e.g. allocating memory, reading a file or sending data over a TCP connection.
  • The program has somehow gotten into a state where it cannot move forward without the kernel’s help: e.g. the requested memory page is not in memory, there was a floating-point error, or an instruction is not supported and has to be emulated. The instruction your program was executing trapped, and the kernel took over to fix the issue.

In this post we will talk about how to figure out which system calls your program is spending time in, and what you can do to fix the issues. In the next post we will talk about how to figure out if your program is slow because it keeps getting interrupted by the kernel, and what to do about it.

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

How do you figure out if your program is spending too much time in kernel space?

On Linux, it is really simple to measure how much time your program spends in system mode using the time utility. Here is an example:

$ time ./tmcstepper -s 32 -i 1810
...
real	0m57,327s
user	0m51,219s
sys	0m6,108s

In the above output, the program ran for 57 seconds, out of which it spent 6 seconds in system mode (the line beginning with sys).

Which system calls are taking the most time?

A Linux utility called strace can help you analyze how your program interacts with the kernel through system calls. The utility is very powerful and we wrote about it in a previous post. For this post, we will use the -fc switch to get statistical information about system calls. We run our program by prefixing it with strace -fc, like this:

$ strace -fc ./tmcstepper -s 32 -i 1810
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98,97    0,264169          23     11032           brk
  0,49    0,001315          28        46           munmap
  0,31    0,000822          12        67           mmap
  0,05    0,000141          28         5           openat
  0,04    0,000100          12         8           mprotect
  0,03    0,000089          14         6           pread64
  0,03    0,000086           3        23           write
  0,03    0,000068          17         4           read
  0,02    0,000061          12         5           close
  0,02    0,000061          10         6           fstat
  0,00    0,000007           3         2         1 arch_prctl
  0,00    0,000000           0         1         1 access
  0,00    0,000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0,266919                 11206         2 total

After the program finishes, strace prints the time spent in the various system calls. Our program, tmcstepper, spent most of its time in the brk system call. Note that strace reported only 0.27 seconds spent in system calls; this is a bug in strace’s accounting, and the time actually spent is much longer.

What code made the system call?

One of the tools that can tell you the exact lines where a system call was made is perf trace. Unfortunately, the tool is not straightforward to use, and explaining it would take a whole post. If you need it, there is information online on how to use it. Just remember, the command to record the trace is:

  • perf record -e 'syscalls:sys_enter_*' --call-graph dwarf ./tmcstepper for all system calls,
  • perf record -e 'syscalls:sys_enter_brk' --call-graph dwarf ./tmcstepper for brk system call.

Note, however, that most of the time you don’t need this information. Regular profiling tells you the name of the function where your program spends time in kernel space. If you know what the system call does, it should be fairly easy to find the line that is making it.


How to fix the problem?

Fixing the problem depends on the type of system call that is causing the slowdown. The idea is the following: google the system call to see what it does; this should give you a rough idea of where to look for solutions. We talked about system calls in the post about strace; here is a short summary of how you can resolve some of the problems related to system calls:

  • System calls that work with memory (brk, mmap, munmap): these calls are typically made by the system allocator. Decrease the pressure on the system allocator, or find a better one. We talked about this in the post about the price of memory allocation.
  • System calls that work with files (read, write): these calls are typically made by functions that work with files, e.g. fprintf and printf in C, or file streams in C++. Make sure you are not flushing too often (in C++ this happens if you write std::endl), and have your program read and write in larger blocks. If you are using random access to read data from a file, consider memory-mapped files.

There are other groups of calls that we didn’t cover here, since I am not experienced enough to give recommendations for those types of problems. However, a pattern emerges: a smaller number of system calls made by your program almost always translates to a faster runtime.

In our example, the program was spending a lot of time in the brk system call. By investigating the source code, we determined that the program was allocating and deallocating memory in a short function that was called many times. One possible solution was to replace the system allocator. We used the LD_PRELOAD environment variable to make our program use the tcmalloc allocator. Here is the output of the time command with the new allocator:

$ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
$ time ./tmcstepper -s 32 -i 1810
...
real	0m52,218s
user	0m52,146s
sys	0m0,048s

As you can see, the time spent in system mode dropped significantly, from 6.1 s to 0.05 s, and the program’s total runtime fell as well, from 57 s to 52 s. Originally there were 11032 calls to brk; after this change there are only 22.

Conclusion

You can use the time utility to quickly check if your program is spending time in system mode, and, if this is unexpected, you can use strace to figure out which system calls take the most time. From that point the story gets more complicated: figuring out how to decrease the time spent in system mode is not always obvious and will depend on the problematic system call itself.


  1. There is a difference between these two terms in other contexts, but here we use them interchangeably.
