We at Johny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
When you first join an already existing project as a developer, you will often find yourself asking “what is this component doing?”, “how does this algorithm work?” or “what is the point of this function/class?”. Before you start digging into the complicated nest of functions and function calls, allow us to introduce an easier way to get a glimpse of what your program is doing and where it is spending time: speedscope.app.
The basic idea behind speedscope
The idea behind speedscope is simple: record the execution of your program using a profiler1 and then use speedscope to visualize what it was doing. The visualizations are very nice; here is an example of one:
This graph visualizes the conversion of a .mp4 file to a .ts file in ffmpeg. For those who are unfamiliar with ffmpeg, it is a command-line tool used to convert between video formats. FFmpeg is written in C, and we used Linux’s perf profiler to collect the data for speedscope. The above screenshot is taken from the full example here.
The image displays call stacks taken between 242 ms and 258 ms since the program started executing. We can observe, for instance, that at time 244 ms from the program start the call stack looked like this:
main() -> transcode() -> reap_filters() -> do_video_out() -> avcodec_send_frame_() -> ...
In this example the call stacks are arranged in the order they were collected. Depending on how we arrange the call stacks, we can get different information. So what kinds of information are available?
Time Order: What my program is doing over time
In the upper left corner there are three buttons; one of them is Time Order. This option arranges call stacks in time, as they were recorded. The main benefit of this option is that you can see what your program was doing as it was running: which functions were running at the beginning, when the program started; which functions were running in the middle, when the program was doing useful work; and which functions were running before the program terminated.
While navigating through the chart, if you zoom out completely to look at the full program execution, the chart looks messy (as in the image above). The devil is in the details here: you need to explore, zoom in, and scroll left and right to find the repeating patterns. It requires a bit of manual investigation, but it is much easier than reading the code or the textual profiler output.
In the example of ffmpeg, the program spends most of its running time in the transcode function, but most of the work is not done there; it is done in the other functions that transcode calls. If we zoom in and do a bit of left and right scrolling, we can see the pattern:
In this image, where we zoomed in on the time between 630 ms and 670 ms, we see that the program is calling the same two functions in a loop, one of them being process_input_packet. Judging by the name, we can guess that one function produces the data for the other.
Click here to visit the example and play around with it. IMPORTANT NOTICE: initially, when you open the example, you will see a graph where all call stacks are marked with ??? ([unknown]). This is because you are viewing the profile of the perf thread, the profiler thread that was running to record ffmpeg. To display the example correctly, you need to select the ffmpeg thread: move your mouse to the upper-middle part of speedscope, and a list will open where you can select it.
Left Heavy: Where is my program spending time
The Left Heavy view sorts the call stacks by how frequently functions appear in them. This view allows you to see which functions are taking the most time. The functionality is very similar to the flame graphs we already talked about, but I think speedscope is easier to use and has more features.
In our ffmpeg example, the Left Heavy view looks like this:

The call stacks are now sorted by frequency, not by time. By looking at this graph, we see that the function transcode is doing the bulk of the work. To do it, it calls two functions, one of them being process_input_packet. If we wanted to speed up the conversion, the most obvious candidates would be the functions called by transcode.
Sandwich: Execution Statistics
The Sandwich view is similar to the Left Heavy view in that it tells you which functions are taking the most time, but it uses a table instead of a chart. Here is a screenshot of the Sandwich mode for our ffmpeg example:
The table lists all functions with their execution times. Note that it consists of two columns, Total and Self. The Total column tells you how much time your program spent in the function and its children. The Self column tells you how much time your program spent in the function itself, excluding the time spent in its children. In the above image, we can see that diff_pixels_c spent 140.88 ms in itself. Since the total number is also 140.88 ms, the function didn’t call other functions. On the other hand, the function encode_thread executed for 475.08 ms in total, but spent only 47.13 ms doing its own work; the rest of the time was spent in its children.
If you are focusing on performance improvement, you want to dig into functions that have a high Self value first.
Practical Considerations About Speedscope
Profiling and understanding multithreaded programs is more difficult than profiling single-threaded programs2. By just looking at the graph that speedscope made, we can see which functions are taking the most time. But for multithreaded programs that is not enough. What is missing are the inter-thread dependencies: for example, a state where one thread is blocked and waiting for another thread to complete.
Profiler and data collection
We used the perf profiler on Linux to collect the data for ffmpeg and display it in speedscope. The process is fairly simple and it is documented in speedscope’s manual. But a few words need to be said about data collection, because the quality of the collected data determines the quality of the graph.
First we collected the profile information using the following commands:
$ perf record --call-graph lbr -F 4000 ./program_to_execute arg1 arg2
$ perf script -i perf.data | speedscope -
Notice the two command-line switches --call-graph and -F. The --call-graph switch tells perf how to collect call stacks during the program execution. There are basically three ways:
fp – FP is short for frame pointer. This is a low-overhead way of collecting profile samples. perf uses the frame pointer register to create the function call stack, but for this to work, your program needs to be compiled with -fno-omit-frame-pointer3. This compiler switch is not enabled by default, but if you are compiling for x86-64, I highly encourage you to enable it in all your builds. Inlined functions will not be displayed in the profile. There is no limit on the size of the call stack.
dwarf – This type of sample collection is much more computationally intensive. The profiler collects raw stack samples (each stack sample is 8 kB large). This can lead to a huge profiler output file that takes a lot of time to analyze. Also, because only 8 kB of the stack is collected for each sample, if the function stack is larger than that, parts of the call stack will be missing4. Yet, the information collected this way is the most precise, since it comes from the debugging information present in the binary. To use it, the program should be compiled with debug symbols (the -g option in GCC and Clang).
lbr – LBR is short for last branch records, a special set of registers that perf can use to reconstruct the function call stack without debugging info and without the frame pointer. It is fast, but it only works on recent Intel processors. It has a limit on the call stack depth, but this didn’t create problems in our testing.
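To make the three collection modes concrete, here is a sketch of how each would be invoked. The program name myprog and the source file main.c are placeholders, and the compile lines assume GCC or Clang:

```shell
# fp: compile with frame pointers preserved, then record using them
gcc -O2 -g -fno-omit-frame-pointer -o myprog main.c   # myprog/main.c are placeholder names
perf record --call-graph fp ./myprog

# dwarf: compile with debug symbols; the optional second argument raises the
# per-sample stack dump from the default 8192 bytes to 16384 bytes
gcc -O2 -g -o myprog main.c
perf record --call-graph dwarf,16384 ./myprog

# lbr: no special compile flags needed, but requires a recent Intel CPU
perf record --call-graph lbr ./myprog

# In all three cases, the recording is displayed the same way:
perf script -i perf.data | speedscope -
```

These are illustrative invocations, not a recipe tied to any particular build system; only the --call-graph argument changes between the three modes.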
If your program is a short-running one, I recommend using dwarf since it gives the most information. If it is a long-running one, you can use lbr. If the collected information doesn’t seem good, you can revert to dwarf with a lower sampling rate.
The second important option passed to perf is the -F option, which sets the sampling rate: the number of samples taken per second. The default is 4000, but sometimes you will need to change this. For short-running programs, you will want to increase this number to get more precise information. For long-running programs, you will want to decrease it, since the profile database can grow huge!
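As an illustration (the program names here are placeholders), the sampling rate might be adjusted like this:

```shell
# Short-running tool: sample 10000 times per second for finer detail
perf record --call-graph lbr -F 10000 ./short_tool

# Long-running service: sample only 99 times per second to keep perf.data small
perf record --call-graph lbr -F 99 ./long_running_service
```

An odd frequency such as 99 is a common choice for long recordings, so that sampling does not run in lockstep with periodic activity in the program.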
If you have large functions, you will want to see the amount of time your program is spending on individual lines, not only on functions. You can do this if you collected your runtime stacks using dwarf.
To achieve this, you record the program runtime as already explained. Next, instead of calling perf script as before, you call it like this:
$ perf script -F +srcline > out.txt
$ perf-addlines.pl out.txt > out.perf
$ speedscope out.perf
perf script -F +srcline adds line numbers to the output file containing the stack traces. Unfortunately, speedscope cannot consume this raw format; instead, you need to convert it to a format speedscope actually understands. For this you use the perf-addlines.pl script, which you can download here.
My impression of speedscope.app is great, and I use it all the time! If you are looking at an unfamiliar code base, it will help you quickly visualize and understand it. If you want to see what is slowing down your program, the application offers that as well. It is very light and simple to use.
Of course, the quality of the information will depend on the profiler, so it is important to configure the profiler properly. Bad profiler data is more difficult to spot than too much data. If the profiler data looks illogical, or something is missing or not where you would expect it to be, I would recommend increasing the sampling frequency or trying another of perf’s call stack collection methods.
- Profilers are tools that run in the background alongside your program, collecting function call stacks many times per second. Developers can use this information later to better understand what the program is doing and find possible performance-related problems. There are many profilers available.
- In order to better illustrate speedscope’s capabilities, we compiled ffmpeg without multithreading support.
- This is the name of the switch in Clang and GCC; other compilers have a switch with similar semantics.
- Sometimes the stack is large because of stack-allocated data; rewriting those parts to use heap-allocated data helped us collect a good profile using this method.