 # 2-minute read: the Magic Touch of Parallel Algorithms

We at Johny’s Software Lab LLC are experts in performance. If performance is in any way concern in your software project, feel free to contact us.

In the previous article we made our program run faster by replacing `std::endl` with `'\n'` when writing to a file. Our flamegraph for the same program, after this modification, looks like this:

As you can see, what takes most of the time in our program now is the function `sort_lines`. Let’s have a look at the source code of this function:

``` void sort_lines(std::vector<std::string>& lines) {
std::sort(lines.begin(), lines.end());
}```

This is a simple call to `std::sort`, so we could naively conclude there is nothing possible to optimize. However, this is not true.

Almost all modern CPUs have more than one core nowadays but the standard C++ library is somewhat stuck in the past. Sorting can be done on more than one CPU simultaneously, and there exist parallel implementations of `std::sort` that can take advantage of more than one core.

### Parallel algorithms in the standard library

If you are C++17, the `std::sort` algorithm takes an additional parameter called `ExecutionPolicy`. Developers can use it to specify what flavor of the algorithm they prefer. Here is our modified implementation:

```void sort_lines(std::vector<std::string>& lines) {
std::sort(std::execution::par_unseq, lines.begin(), lines.end());
}```

With `std::execution::par_unseq` we are instructing the standard C++ library to use the parallel version of `std::sort` algorithm. Many algorithms of the standard C++ library accept this parameter, e.g. `std::sort`, `std::find`, `std::find_if`, `std::count`, `std::for_each`, `std::transform`, `std::transform_reduce` and many other.

If you are using an older C++ version, GCC’s standard C++ library offers Parallel Mode. Parallel mode is not enabled by default, and you can enable it in two ways:

• Recompile your program with `-D_GLIBCXX_PARALLEL`. This will instruct the standard library to use parallel algorithms whenever available or
• Use the algorithms from `__gnu_parallel` namespace, in our case `__gnu_parallel::sort`.

Our current implementation looks like this:

```void sort_lines(std::vector<std::string>& lines) {
#if __cplusplus >= 201703L
std::sort(std::execution::par_unseq, lines.begin(), lines.end());
#else
__gnu_parallel::sort(lines.begin(), lines.end());
#endif
}
```

We are choosing the `std::sort` implementation depending on the version of C++ version.

The flamegraph (with GCC’s parallel mode libstdc++) looks like this:

The runtime of `sort_lines` function went down from 440 ms to 150 ms. The whole program’s runtime went down from 650 ms to 390 ms with a very simple change!
1. Rama says: