2-minute read: the Magic Touch of Parallel Algorithms

In the previous article we made our program run faster by replacing std::endl with '\n' when writing to a file. Our flamegraph for the same program, after this modification, looks like this:

As you can see, what takes most of the time in our program now is the function sort_lines. Let’s have a look at the source code of this function:

 void sort_lines(std::vector<std::string>& lines) {
     std::sort(lines.begin(), lines.end());

This is a simple call to std::sort, so we could naively conclude there is nothing possible to optimize. However, this is not true.

Almost all modern CPUs have more than one core nowadays but the standard C++ library is somewhat stuck in the past. Sorting can be done on more than one CPU simultaneously, and there exist parallel implementations of std::sort that can take advantage of more than one core.

Parallel algorithms in the standard library

If you are C++17, the std::sort algorithm takes an additional parameter called ExecutionPolicy. Developers can use it to specify what flavor of the algorithm they prefer. Here is our modified implementation:

void sort_lines(std::vector<std::string>& lines) {
    std::sort(std::execution::par_unseq, lines.begin(), lines.end());

With std::execution::par_unseq we are instructing the standard C++ library to use the parallel version of std::sort algorithm. Many algorithms of the standard C++ library accept this parameter, e.g. std::sort, std::find, std::find_if, std::count, std::for_each, std::transform, std::transform_reduce and many other.

If you are using an older C++ version, GCC’s standard C++ library offers Parallel Mode. Parallel mode is not enabled by default, and you can enable it in two ways:

  • Recompile your program with -D_GLIBCXX_PARALLEL. This will instruct the standard library to use parallel algorithms whenever available or
  • Use the algorithms from __gnu_parallel namespace, in our case __gnu_parallel::sort.

Our current implementation looks like this:

void sort_lines(std::vector<std::string>& lines) {
#if __cplusplus >= 201703L
    std::sort(std::execution::par_unseq, lines.begin(), lines.end());
    __gnu_parallel::sort(lines.begin(), lines.end());

We are choosing the std::sort implementation depending on the version of C++ version.

The flamegraph (with GCC’s parallel mode libstdc++) looks like this:

Flamegraph with libstdc++ parallel sort

The runtime of sort_lines function went down from 440 ms to 150 ms. The whole program’s runtime went down from 650 ms to 390 ms with a very simple change!

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.

1 comment / Add your comment below

Leave a Reply

Your email address will not be published. Required fields are marked *