Class Size, Member Layout and Speed

Classes are the normal way C++ developers organize their data. Sometimes a class has few data members, other times many, depending on how we translate our problem into classes. Inside a class, developers order the data members according to some criterion: for example, grouping members by usage pattern (those used together are declared together) or putting the most important ones at the beginning.

We might ask ourselves: do these choices matter for the speed of our program? Let’s find out.

Introduction

The first thing to know about your class is that its memory footprint[1] is determined by the number and size of its data members. The second is that the compiler will lay out the data members in memory in exactly the order you declare them. An example:

class point {
public:
    int x;
    int y;
};

class rectangle1 {
public:
    bool visible;
    point p1;
    point p2;
};

class rectangle2 {
public:
    point p1;
    point p2;
    bool visible;
};

Let’s assume that sizeof(bool) is 4 and sizeof(int) is 4. Then the size of class point is 8, and the size of both rectangle1 and rectangle2 is 20. In rectangle1, the member visible sits at offset 0 from the beginning of the class, p1 at offset 4 and p2 at offset 12. In rectangle2, p1 sits at offset 0, p2 at offset 8 and visible at offset 16.

From a functional point of view, the class size and its layout are completely irrelevant. But from a performance point of view, they do matter. To verify this, let’s run an experiment.

Like what you are reading? Follow us on LinkedIn or Twitter and get notified as soon as new content becomes available.
Need help with software performance? Contact us!

The experiment

To test the performance of various class sizes and memory layouts, we use the class rectangle from the previous example, with slight modifications (more on that later). We wrote two functions: calculate_surface_all, which sums up the surfaces of all rectangles in a vector, and calculate_surface_visible, which sums up the surfaces of only the visible rectangles.

template <typename R>
int calculate_surface_visible(const std::vector<R>& rectangles) {
    int sum = 0;
    for (std::size_t i = 0; i < rectangles.size(); i++) {
        if (rectangles[i].is_visible()) {
            sum += rectangles[i].surface();
        }
    }
    return sum;
}

template <typename R>
int calculate_surface_all(const std::vector<R>& rectangles) {
    int sum = 0;
    for (std::size_t i = 0; i < rectangles.size(); i++) {
        sum += rectangles[i].surface();
    }
    return sum;
}

The difference between them is that calculate_surface_all accesses only the members p1 and p2 (the top-left and bottom-right points), while calculate_surface_visible accesses the member visible as well.

To simulate different memory layouts of the class rectangle, we added padding between the data member visible and the other data members. We also wanted to keep the class size constant, so we added more padding at the end. In real-world code, these paddings would be other member variables. The definition of our class rectangle looks like this:

template <int pad1_size, int pad2_size>
class rectangle {
   public:
    rectangle(point p1, point p2, bool visible)
        : m_visible(visible), m_p1(p1), m_p2(p2) {}
    bool is_visible() const { return m_visible; }
    int surface() const { return (m_p2.x - m_p1.x) * (m_p2.y - m_p1.y); }

   private:
    bool m_visible;
    int m_padding1[pad1_size];
    point m_p1;
    point m_p2;
    int m_padding2[pad2_size];
};

The results

Class size

So, does the runtime of our two functions depend on the class size? Here are the results:

For the smallest class size (20 bytes), both functions are at their fastest. As the size of class rectangle grows, the functions take more and more time to complete. In the worst case, calculate_surface_visible is two times slower than in the best case, and calculate_surface_all is five times slower. Why is this so?

Memory accesses on modern CPUs go through a caching mechanism. Every time our program accesses a single byte, the whole block of data containing it (typically 64 bytes, called a cache line) is brought into the cache as well. Accesses to memory inside the same block are very fast, but accesses that fall outside the block are slow.

When the class is small (20 bytes), three instances of the class fit in a single cache block. Accessing any data member of one rectangle instance also loads the data for two neighboring instances; we basically get those accesses for free. As the class size grows, the hardware still loads full blocks into the data cache, but our program touches less and less of each block: the data is in the cache, yet it is never used. Instead, we ask the hardware to load another batch of data from a different block. This is what slows down the computation for large class sizes.


Padding between member visible and members p1 and p2

What happens when we access two data members in the same instance, but they are not close to each other in memory? In our case, we added padding between the member visible and the members p1 and p2, and we control the size of that padding. Let’s measure how much time our functions take to perform the task, depending on the size of the padding:

The runtime of calculate_surface_all doesn’t depend on the padding between visible and the other data members. The runtime of calculate_surface_visible does depend on it, but only when both the padding and the class size are large: a larger gap between the member visible and the members p1 and p2 translates to slower computation. The effect becomes visible for classes larger than 128 bytes and grows more pronounced as the class size increases.

Conclusion

How do these results translate to performance tips for a C++ engineer? Here are a few rules of thumb if you want to achieve good performance in your C++ program:

  • Focus on hot classes: classes that your program spends a lot of time processing.
  • Keep hot classes small: move rarely used members into separate classes.
  • Alternatively, extract the hot data from larger classes into separate smaller classes and keep those in a dedicated vector. Don’t process the hot data by iterating over the larger classes; iterate over the dedicated vector instead.
  • In the class definition, group together data members that are accessed together.

In C++, small changes go a long way and, if done right, can make your program run a few times faster.

NOTE: The conclusion of this post inevitably leads to the Entity-Component-System paradigm and Data-Oriented Design, which I plan to cover in an upcoming post.


Featured image: https://www.educative.io/edpresso/what-is-a-cpp-abstract-class

  1. The memory footprint of a class is the amount of memory that a single instance of the class consumes.
