In the previous article we made our program run faster by replacing std::endl
with '\n'
when writing to a file. Our flamegraph for the same program, after this modification, looks like this:

As you can see, what takes most of the time in our program now is the function sort_lines
. Let’s have a look at the source code of this function:
void sort_lines(std::vector<std::string>& lines) { std::sort(lines.begin(), lines.end()); }
This is a simple call to std::sort
, so we could naively conclude there is nothing possible to optimize. However, this is not true.
Almost all modern CPUs have more than one core nowadays but the standard C++ library is somewhat stuck in the past. Sorting can be done on more than one CPU simultaneously, and there exist parallel implementations of std::sort
that can take advantage of more than one core.
Parallel algorithms in the standard library
If you are C++17, the std::sort
algorithm takes an additional parameter called ExecutionPolicy
. Developers can use it to specify what flavor of the algorithm they prefer. Here is our modified implementation:
void sort_lines(std::vector<std::string>& lines) { std::sort(std::execution::par_unseq, lines.begin(), lines.end()); }
With std::execution::par_unseq
we are instructing the standard C++ library to use the parallel version of std::sort
algorithm. Many algorithms of the standard C++ library accept this parameter, e.g. std::sort
, std::find
, std::find_if
, std::count
, std::for_each
, std::transform
, std::transform_reduce
and many other.
If you are using an older C++ version, GCC’s standard C++ library offers Parallel Mode. Parallel mode is not enabled by default, and you can enable it in two ways:
- Recompile your program with
-D_GLIBCXX_PARALLEL
. This will instruct the standard library to use parallel algorithms whenever available or - Use the algorithms from
__gnu_parallel
namespace, in our case__gnu_parallel::sort
.
Our current implementation looks like this:
void sort_lines(std::vector<std::string>& lines) { #if __cplusplus >= 201703L std::sort(std::execution::par_unseq, lines.begin(), lines.end()); #else __gnu_parallel::sort(lines.begin(), lines.end()); #endif }
We are choosing the std::sort
implementation depending on the version of C++ version.
The flamegraph (with GCC’s parallel mode libstdc++) looks like this:

The runtime of sort_lines
function went down from 440 ms to 150 ms. The whole program’s runtime went down from 650 ms to 390 ms with a very simple change!
Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.
nice… thanks