Speedscope: visualize what your program is doing and where it is spending time

When you first join an already existing project as a developer, you will often find yourself asking “what this component is doing?”, “how does this algorithm work?” or “what is the point of this function/class?”. Before you start digging into the complicated nest of functions and function calls, allow us to introduce an easier way to get a glimpse of what your program is doing and where it is speeding time: speedscope.app.

Speedscope.app is a JavaScript application that runs in your browser and can visualize the behavior of your program. It can also be installed locally. It supports visualization of many programming languages: JavaScript, Ruby, Python, Go, Rust, C/C++, and .NET languages. Here we will give an overview of its features, for information on how to collect data for the visualization, we refer you to the speedscope website.

The basic idea behind speedscope

The idea behind speedscope is simple: record the execution of your program using a profiler1 and then use speedscope to visualize what it was doing. Visualizations are very nice, here is an example of one:

This graph visualizes the conversion of a .mp4 file to .ts file in ffmpeg. For those who are unfamiliar with ffmpeg, it is a command-line tool used to convert between video formats. FFmpeg is written in C, and we used Linux’ perf profiler to collect the data for speedscope. The above screenshot is taken from the full example here.

The image displays call stacks taken between 242 ms and 258 ms since the program started executing. We can observe, for instance, that at time 244 ms from the program start the call stack looked like this: main() -> transcode() -> reap_filters() -> do_video_out() -> avcodec_send_frame_() -> ...

In this example the call stacks are arranged in the order they were collected. Depending on how we arrange the call stacks, we can get different information. So what kind of information are available?

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

Time Order: What my program is doing in time

In the upper left corner there are three buttons, one of them is Time Order. This options arranges call stacks in time, as they were recorded. The main benefit of this option is that you can see what your program was doing as it was running. Which functions were running at the beginning when the program was started, which functions were running in the middle when the program was doing useful work and which functions were running before the program terminated.

While navigating through the chart, if you zoom out completely to have a look at the full program execution, these charts look messy (as in the image above). The devil is in the details here, you need to explore, zoom in, scroll left and right to find the repeating patterns. It requires a bit of manual investigation work, but it is much easier to do than looking at the code or at the textual profiler output.

In the example of ffmpeg, the program spends most of its running time in the transcode function, but most of the work is not done there, but in other functions that transcode calls. If we zoom in, do a bit of left and right scrolling we can see the pattern:

In this image, where we zoomed in time between 630 ms and 670 ms, we see that the program is calling the same two functions in loop: reap_filters and proces_input_packets. By the name, we can guess that one function produces the data for another function.

Click here to visit the example and play around with it. IMPORTANT NOTICE: Initially, when you open the example, you will see a graph where all call stacks are marked with ??? ([unknown]). This is because you are viewing the profile of perf thread, the profiler thread that was running to record ffmpeg. To display the example you need to select the ffmpeg thread. Move your mouse to the upper-middle part of speedscope, a list will open where you can select ffmpeg_g thread.

Left Heavy: Where is my program spending time

The Left Heavy view sorts the call stack by the frequency functions appear in it. This view allows you to see which functions are taking the most time. The functionality is very similar to the flame graphs we already talked about, but I think speedscope is easier to use and has more features.

In our ffmpeg example, the Left Heavy view looks like this:

The call stacks are now sorted by frequency, not the time. By looking at this graph, we see that function transcode is doing the bulk of the work. To do it, it calls two functions reap_filters and process_input_packet. If we wanted to speed up the conversion, the most obvious candidates would be functions called by encode_thread, estimate_motion_thread or decode_slice.

Sandwich: Execution Statistics

The Sandwich view is similar to the Left Heavy view, in that it will tell you which functions are taking the most time, but it uses a table instead of a chart. Here is the screenshot from the Sandwich mode for our ffmpeg example.

The table lists all functions with their execution times. Note that it consists of two columns Total and Self. The Total column tells you how much time your program spent in the function and its children. The Self column tells you how much time your program spent in the function, excluding the time spent in the children. In the above image, we can see that diff_pixels_c spent 140.88 ms in itself. Since the total number is also 140.88 ms, the function didn’t call other functions. On the other hand, the function encode_thread executed totally for 475.08 ms, but spent only 47.13 ms doing its own work; the rest of the time was spent in its children.

If you are focusing on performance improvement, you want to dig into functions that have a high Self value first.

Practical Considerations About Speedscope

Multithreaded programs

Profiling and understanding multithreaded programs is more difficult than single-threaded programs2. By just looking at the graph that speedscope made, we can see which functions are taking the most time. But for multithreaded programs that is not enough. What is missing is inter-thread dependencies, a state where one thread is blocked and waiting for another thread to complete.

Profiler and data collection

We used the perf profile in Linux to collect the data for ffmpeg and display it in Speedscope. The process is fairly simple and it is documented in the Speedscope’s manual. But, a few words need to be seed about data collection because the quality of data collection determines the quality of the graph.

First we collected the profile information using following commands:

$ perf record --call-graph lbr -F 4000 ./program_to_execute arg1 arg2
$ perf script -i perf.data | speedscope -

Notice the command line and two switches --call-graph and -F. The command-line switch --call-graph tells perf how to collect call stacks during the program execution. There are basically three ways:

  • fp – FP is short for frame pointer. This is a low overhead way of collecting profile samples. perf uses frame pointer register to create the function call stack, but for this to work, your program needs to be compiled with -fno-omit-frame-pointer3. This compiler switch is not enabled by default, but if you are compiling for x86-64, I highly encourage you to enable it in all your builds. Inlined functions will not be displayed in the profile. There is no limit in the size of the call stack.
  • dwarf – This type of sample collection is much more computationally intensive. The profiler collects the stack samples (each stack sample is 8 kB large). This can lead to a huge profiler output file that takes a lot of time to analyze. Also, because only 8 kB of the stack is collected for each sample, if the function stack is larger than this number, parts of the call stack will be missing4. Yet, the information collected this way is the most precise since it comes from debugging information present in the file.
  • lbr – LBR is short for last branch records. Last branch records is a special set of registers that perf can use to reconstruct the function’s call stack without debugging info and without frame pointer. It is fast, but it only works on recent Intel’s processors. It has a limit to a stack depth, but this didn’t create problems in our testing.

If your program is a short running one, I recommend to use dwarf since it gives the most information. If it is a long-running, you can use fp or lbr. If the information collected doesn’t seem good, you can revert to dwarf with a lower sampling rate.

The second important option passed to perf is -F option, which is the sampling rate. The sampling rate is the number of samples taken per second. The default option is 4000, but sometimes you will need to change this. For short-running programs, you will want to increase this number to get more precise information. For long-running programs, you will want to decrease the number since the profile database can grow huge!

Final Words

My impression of speedscope.app is great and I use it all the time! If you are looking at unfamiliar code base, it will help you quickly visualize and understand it. If you want to see what is slowing down your program, the application offers that as well. It is very light and simple to use.

Of course, the quality of information will depend on the profiler. It is important to configure the profiler properly. Bad profiler data is more difficult to spot than too much data. If the profiler data looks illogical, or something is missing or not in a place where you would expect it to be, I would recommend increasing the sampling frequency or try another of perf’s call stack collection methods.

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

  1. Profilers are tools that run in the background with your program, and they collect function call stacks many times per second. The developers can use this information later to better understand what the program is doing and find possible problems related to performance. There are many profilers available, some []
  2. In order to better illustrate the speedscope’s capabilities, we compiled ffmpeg without multithreading support []
  3. This is the name of this switch on CLANG and GCC, but other compilers have a switch with a similar semantics. []
  4. Sometimes the stack is large because of the stack-allocated data, rewriting those parts to use heap-allocated data helped us collect the good profile using this method. []

Leave a Reply

Your email address will not be published. Required fields are marked *