Speedscope: visualize what your program is doing and where it is spending time

When you first join an existing project as a developer, you will often find yourself asking “what is this component doing?”, “how does this algorithm work?” or “what is the point of this function/class?”. Before you start digging into the complicated nest of functions and function calls, allow us to introduce an easier way to get a glimpse of what your program is doing and where it is spending time: speedscope.app.

Speedscope.app is a JavaScript application that runs in your browser and visualizes the behavior of your program. It can also be installed locally. It supports profiles from many programming languages: JavaScript, Ruby, Python, Go, Rust, C/C++, and .NET languages. Here we give an overview of its features. For information on how to collect data for the visualization, we refer you to the speedscope website.

The basic idea behind speedscope

The idea behind speedscope is simple: record the execution of your program using a profiler¹ and then use speedscope to visualize what it was doing. Here is an example of such a visualization:

This graph visualizes the conversion of an .mp4 file to a .ts file in ffmpeg. For those who are unfamiliar with ffmpeg, it is a command-line tool used to convert between video formats. FFmpeg is written in C, and we used Linux's perf profiler to collect the data for speedscope. The above screenshot is taken from the full example here.

The image displays call stacks taken between 242 ms and 258 ms since the program started executing. We can observe, for instance, that at time 244 ms from the program start the call stack looked like this: main() -> transcode() -> reap_filters() -> do_video_out() -> avcodec_send_frame_() -> ...

In this example the call stacks are arranged in the order they were collected. Depending on how we arrange the call stacks, we can get different information. So what kind of information is available?

Time Order: What my program is doing in time

In the upper left corner there are three buttons; one of them is Time Order. This option arranges call stacks in time, as they were recorded. The main benefit of this option is that you can see what your program was doing as it was running: which functions were running at the beginning when the program started, which were running in the middle when the program was doing useful work, and which were running just before the program terminated.

If you zoom out completely to look at the full program execution, these charts look messy (as in the image above). The devil is in the details here: you need to explore, zoom in, and scroll left and right to find the repeating patterns. It requires a bit of manual investigation, but it is much easier than reading the code or the textual profiler output.

In the ffmpeg example, the program spends most of its running time in the transcode function, but most of the work is done not in transcode itself but in the functions it calls. If we zoom in and scroll a bit left and right, we can see the pattern:

In this image, where we zoomed in on the interval between 630 ms and 670 ms, we see that the program calls the same two functions in a loop: reap_filters and process_input_packet. Judging by the names, we can guess that one function produces the data for the other.
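Spotting such a repeating pattern by eye can also be approximated in code. The following is only a minimal sketch, assuming the profile has already been parsed into time-ordered samples; the `samples` list and the `callee_runs` helper are hypothetical and are not part of speedscope or perf.

```python
from itertools import groupby

# Hypothetical time-ordered samples: (timestamp in ms, call stack from root to leaf).
samples = [
    (630.00, ("main", "transcode", "reap_filters", "do_video_out")),
    (630.25, ("main", "transcode", "reap_filters", "do_video_out")),
    (630.50, ("main", "transcode", "process_input_packet")),
    (630.75, ("main", "transcode", "reap_filters")),
    (631.00, ("main", "transcode", "process_input_packet")),
    (631.25, ("main", "transcode", "process_input_packet")),
]


def callee_runs(samples, parent):
    """Run-length encode the direct callee of `parent` across time-ordered samples."""
    callees = []
    for _, stack in samples:
        if parent in stack:
            i = stack.index(parent)
            # The frame directly below `parent`, or None if `parent` is the leaf.
            callees.append(stack[i + 1] if i + 1 < len(stack) else None)
    # Consecutive samples with the same callee collapse into one run.
    return [(name, len(list(group))) for name, group in groupby(callees)]


runs = callee_runs(samples, "transcode")
for name, count in runs:
    print(f"{name}: {count} consecutive sample(s)")
```

An alternating sequence of runs, as in this made-up data, is the textual counterpart of the loop that is visible in the Time Order view.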

Click here to visit the example and play around with it. IMPORTANT NOTICE: Initially, when you open the example, you will see a graph where all call stacks are marked with ??? ([unknown]). This is because you are viewing the profile of the perf thread, the profiler thread that was running to record ffmpeg. To display the example you need to select the ffmpeg thread. Move your mouse to the upper-middle part of speedscope; a list will open where you can select the ffmpeg_g thread.

Left Heavy: Where is my program spending time

The Left Heavy view sorts the call stacks by how frequently each function appears in them. This view allows you to see which functions are taking the most time. The functionality is very similar to the flame graphs we already talked about, but I think speedscope is easier to use and has more features.

In our ffmpeg example, the Left Heavy view looks like this:

The call stacks are now sorted by frequency, not by time. Looking at this graph, we see that the function transcode is doing the bulk of the work. To do it, it calls two functions: reap_filters and process_input_packet. If we wanted to speed up the conversion, the most obvious candidates would be the functions called by encode_thread, estimate_motion_thread or decode_slice.
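To make the idea of sorting by frequency more concrete, here is a rough sketch of how such a layout could be computed. It is not speedscope's actual implementation; the `build_tree` and `left_heavy` helpers and the sample stacks are assumptions made for illustration.

```python
def build_tree(stacks):
    """Merge root-to-leaf stacks into a tree; each node counts the samples passing through it."""
    root = {"weight": 0, "children": {}}
    for stack in stacks:
        node = root
        node["weight"] += 1
        for frame in stack:
            node = node["children"].setdefault(frame, {"weight": 0, "children": {}})
            node["weight"] += 1
    return root


def left_heavy(node):
    """Return children as (name, weight, children) tuples, heaviest first at every level."""
    ordered = sorted(node["children"].items(), key=lambda kv: kv[1]["weight"], reverse=True)
    return [(name, child["weight"], left_heavy(child)) for name, child in ordered]


# Hypothetical samples in the order they were recorded.
stacks = [
    ("main", "transcode", "process_input_packet"),
    ("main", "transcode", "reap_filters"),
    ("main", "transcode", "reap_filters"),
    ("main", "transcode", "process_input_packet"),
    ("main", "transcode", "reap_filters"),
]

layout = left_heavy(build_tree(stacks))
print(layout)
```

Identical stacks are merged regardless of when they were sampled, so the time information is lost, but the heaviest call paths end up on the left.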

Sandwich: Execution Statistics

The Sandwich view is similar to the Left Heavy view, in that it will tell you which functions are taking the most time, but it uses a table instead of a chart. Here is the screenshot from the Sandwich mode for our ffmpeg example.

The table lists all functions with their execution times. Note that it has two columns, Total and Self. The Total column tells you how much time your program spent in the function and its children. The Self column tells you how much time your program spent in the function itself, excluding the time spent in its children. In the above image, we can see that diff_pixels_c spent 140.88 ms in itself. Since its total is also 140.88 ms, the function didn’t call other functions. On the other hand, the function encode_thread ran for a total of 475.08 ms, but spent only 47.13 ms doing its own work; the rest of the time was spent in its children.

If you are focusing on performance improvement, you want to dig into functions that have a high Self value first.
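The difference between Self and Total can be illustrated with a few lines of code. This is a simplified sketch under the assumption that every sample stands for a fixed slice of time; the `self_and_total` helper and the sample data are hypothetical and the numbers do not come from the ffmpeg profile.

```python
from collections import Counter


def self_and_total(stacks, ms_per_sample):
    """Attribute each sample to the leaf (Self) and to every function on the stack (Total)."""
    self_ms, total_ms = Counter(), Counter()
    for stack in stacks:
        if not stack:
            continue
        # Self time: only the function that was executing when the sample was taken.
        self_ms[stack[-1]] += ms_per_sample
        # Total time: every function on the stack, counted once even if it recurses.
        for fn in set(stack):
            total_ms[fn] += ms_per_sample
    return self_ms, total_ms


# Hypothetical samples; 0.25 ms per sample corresponds to a 4000 Hz sampling rate.
stacks = [("main", "encode_thread", "diff_pixels_c")] * 3 + [("main", "encode_thread")]
self_ms, total_ms = self_and_total(stacks, ms_per_sample=0.25)

for fn in total_ms:
    print(f"{fn}: total={total_ms[fn]} ms, self={self_ms[fn]} ms")
```

In this made-up data, diff_pixels_c has equal Self and Total values because it is always the leaf, while encode_thread has a Total value much larger than its Self value.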

Practical Considerations About Speedscope

Multithreaded programs

Profiling and understanding multithreaded programs is more difficult than single-threaded programs². By just looking at the graph that speedscope made, we can see which functions are taking the most time. But for multithreaded programs that is not enough. What is missing is inter-thread dependencies, a state where one thread is blocked and waiting for another thread to complete.

Profiler and data collection

We used the perf profiler on Linux to collect the data for ffmpeg and display it in speedscope. The process is fairly simple and it is documented in speedscope’s manual. But a few words need to be said about data collection, because the quality of the collected data determines the quality of the graph.

First, we collected the profile information using the following commands:

$ perf record --call-graph lbr -F 4000 ./program_to_execute arg1 arg2
$ perf script -i perf.data | speedscope -

Notice the command line and two switches --call-graph and -F. The command-line switch --call-graph tells perf how to collect call stacks during the program execution. There are basically three ways:

fp – FP is short for frame pointer. This is a low-overhead way of collecting profile samples. perf uses the frame pointer register to reconstruct the function call stack, but for this to work, your program needs to be compiled with -fno-omit-frame-pointer³. This compiler switch is not enabled by default, but if you are compiling for x86-64, I highly encourage you to enable it in all your builds. Inlined functions will not be displayed in the profile. There is no limit on the size of the call stack.
dwarf – This type of sample collection is much more computationally intensive. The profiler copies a chunk of the stack with each sample (each stack sample is 8 kB large). This can lead to a huge profiler output file that takes a lot of time to analyze. Also, because only 8 kB of the stack is collected for each sample, if the function stack is larger than this, parts of the call stack will be missing⁴. Yet, the information collected this way is the most precise since it comes from debugging information present in the file. To use it, the program should be compiled with debug symbols (the -g option in GCC and Clang).
lbr – LBR is short for last branch records. Last branch records are a special set of registers that perf can use to reconstruct the function’s call stack without debugging info and without frame pointers. It is fast, but it only works on recent Intel processors. It has a limit on the stack depth, but this didn’t create problems in our testing.

If your program is short-running, I recommend using dwarf since it gives the most information. If it is long-running, you can use fp or lbr. If the information collected doesn’t seem good, you can fall back to dwarf with a lower sampling rate.
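If you switch between these modes often, it can help to build the perf command line programmatically. The sketch below only assembles the arguments and does not run perf; the `perf_record_argv` helper is hypothetical, and the sampling rate of 1000 for the long-running case is an illustrative value, not a recommendation from our measurements.

```python
import shlex

# Call-graph modes described in the list above.
CALL_GRAPH_MODES = ("fp", "dwarf", "lbr")


def perf_record_argv(mode, freq_hz, command):
    """Build the argument list for `perf record` with a call-graph mode and sampling rate."""
    if mode not in CALL_GRAPH_MODES:
        raise ValueError(f"unknown call-graph mode: {mode!r}")
    # "--" separates perf's own options from the profiled command and its arguments.
    return ["perf", "record", "--call-graph", mode, "-F", str(freq_hz), "--", *command]


program = ["./program_to_execute", "arg1", "arg2"]
short_run = perf_record_argv("dwarf", 4000, program)  # short-running program
long_run = perf_record_argv("fp", 1000, program)      # long-running program, lower rate

print(shlex.join(short_run))
print(shlex.join(long_run))
```

The resulting list can be passed to `subprocess.run` if you want to launch the profiler from a script.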

The second important option passed to perf is -F, which sets the sampling rate. The sampling rate is the number of samples taken per second. The default value is 4000, but sometimes you will need to change this. For short-running programs, you will want to increase this number to get more precise information. For long-running programs, you will want to decrease it since the profile database can grow huge!
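A quick back-of-the-envelope calculation shows why the sampling rate matters for the dwarf method. The sketch below is a rough upper-bound estimate for a single thread that is running the whole time; it assumes that the 8 kB stack copy means 8192 bytes per sample and ignores everything else perf stores. The `estimate_profile` helper is hypothetical.

```python
def estimate_profile(rate_hz, duration_s, stack_dump_bytes=8192):
    """Return (sample count, bytes of copied stack) for one continuously running thread."""
    samples = rate_hz * duration_s
    return samples, samples * stack_dump_bytes


# One minute of execution at the sampling rate of 4000 samples per second.
samples, stack_bytes = estimate_profile(rate_hz=4000, duration_s=60)
print(f"{samples} samples, about {stack_bytes / 2**30:.2f} GiB of stack copies")

# The same minute at a lower, illustrative rate of 500 samples per second.
fewer_samples, fewer_bytes = estimate_profile(rate_hz=500, duration_s=60)
print(f"{fewer_samples} samples, about {fewer_bytes / 2**20:.0f} MiB of stack copies")
```

The data volume grows linearly with both the sampling rate and the running time, which is why long-running programs usually call for a lower rate.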

Line numbers

If you have large functions, you will want to see how much time your program is spending on individual lines, not only on functions. You can do this if you collected your runtime stacks using the dwarf option.

To achieve this, you record the program execution as already explained. Next, instead of piping perf script directly into speedscope, you run the following:

$ perf script -F +srcline > out.txt
$ perf-addlines.pl out.txt > out.perf
$ speedscope out.perf

The command perf script -F +srcline adds line numbers to the output file containing stack traces. Unfortunately, speedscope cannot consume it in this raw format; you need to convert it to a format speedscope understands. For this you use the perf-addlines.pl script, which you can download here.

Final Words

My impression of speedscope.app is great and I use it all the time! If you are looking at an unfamiliar code base, it will help you quickly visualize and understand it. If you want to see what is slowing down your program, the application offers that as well. It is very lightweight and simple to use.

Of course, the quality of information will depend on the profiler. It is important to configure the profiler properly. Bad profiler data is more difficult to spot than too much data. If the profiler data looks illogical, or something is missing or not where you would expect it to be, I would recommend increasing the sampling frequency or trying another of perf’s call stack collection methods.

¹ Profilers are tools that run in the background with your program, and they collect function call stacks many times per second. The developers can use this information later to better understand what the program is doing and find possible problems related to performance. There are many profilers available, some
² In order to better illustrate the speedscope’s capabilities, we compiled ffmpeg without multithreading support
³ This is the name of this switch on CLANG and GCC, but other compilers have a switch with a similar semantics.
⁴ Sometimes the stack is large because of the stack-allocated data, rewriting those parts to use heap-allocated data helped us collect the good profile using this method.

9 comments

Mark Hansen says:
October 1, 2021 at 7:36 am
> What is missing is inter-thread dependencies, a state where one thread is blocked and waiting for another thread to complete.
Hi, thanks for the post. If you like in-browser perf profile visualisers and you like speedscope, you should check out Firefox Profiler too. It runs in-browser (in any browser, not just Firefox) and you can import “perf script” format, and it has a great multi-track view to see multiple threads at a time. Enjoy!
1. Ivica Bogosavljević says:
  October 1, 2021 at 7:42 am
  Speedscope was good for me because you can fire it up simply from the command line, but Firefox Profiler is great too! One thing I do miss, however, is the profiler which prints statistics on the line level, not only on the function level. Perf command line can do that too, but the output is awful 🙁
  On Linux, there is also hotspot. There is no good profiler for Windows though 🙁
JuYi says:
January 2, 2023 at 4:06 am
I think Speedscope is a kind of tool likes [chrome://tracing/]
Cath Developer says:
June 2, 2023 at 3:49 pm
Hi Johnny,
I really enjoyed reading your blog post about Speedscope. I’ve been using it for a few weeks now and it’s been a great help in identifying performance bottlenecks in my code.
I have one question: is there a way to export the data from Speedscope so that I can import it into another tool for further analysis?
1. Ivica Bogosavljević says:
  June 2, 2023 at 7:02 pm
  The data you get as an output from perf.script can be used elsewhere. Do you have any specific tool in mind?
Anupam Kapoor says:
February 11, 2024 at 5:56 am
i generally find tracy-profiler: https://github.com/wolfpld/tracy to be much better than others. allows me to insert minor ‘instrumentation’ in the code, and then perform, an online + offline analysis of program execution profile. very highly recommended.
1. Ivica Bogosavljević says:
  February 17, 2024 at 9:10 pm
  I skimmed the profiler documentation. This seems to be a more serious profiler with instrumentation intended for game development primarily. You need to add some instrumentation code, which is fine. I think these two profilers target different needs.
  If you are familiar with this compiler and are interested in writing an introductory post to it, let me know. We could publish it on this blog, or you can publish it elsewhere and I would just add link to it.
Suresh Palaniappan says:
October 2, 2024 at 12:28 pm
Hi, How to integrate this tool for a C Based IOT Project without any RTOS. Is there a guide to follow on how to integrate this. Thanks in advance,
1. Ivica Bogosavljević says:
  October 29, 2024 at 7:36 am
  If you are thinking about bare metal devices, the main issue would be to produce the trace stacks. Do you know how to produce the trace stacks on your system?