The price of dynamic memory: Allocation

When it comes to memory usage, there are two types of programs. The first type consists of programs that allocate memory in large blocks. Normally, these programs know how much memory they will need and can allocate it in advance. They hold their memory in arrays or vectors and typically (though not always) access it linearly. These programs might even use dynamic memory, but when they do, they typically call malloc only a few times during the lifetime of the program. In other words, memory allocation and memory organization are not usually a limiting factor in these programs.

The second type of program uses memory differently. The program might use data structures that require allocating a large number of chunks through malloc or new. Or there might be message passing between different parts of the program, with the messages allocated dynamically. Whatever the case may be, the program allocates a large number of chunks, uses them for some time and then returns them to the system.

Today we are going to talk about this second type of program. When you profile such a program, you might notice that malloc and free come up as places where your program spends time. This is not unexpected, but depending on how much time is spent there, you might need to take action if performance is important.

Please note that there are two things to consider when talking about the performance of programs that use dynamic memory. One is allocation speed, i.e. how fast the program can allocate and deallocate memory. This depends mostly on how good the implementations of malloc and free are.

The other is access speed, i.e. how fast you can access the memory returned by the allocator. Access speed depends on the hardware memory subsystems, and some allocators are better at allocating memory in a way that keeps those subsystems happy.

In this post we will talk about allocation speed, and access speed will be the topic of a follow-up post.

Why are malloc and free slow?

To answer the above question, we first need to understand how malloc and free are commonly implemented. Most of the time, the implementations of malloc and free are provided by the C standard library, although in some cases programs use third-party allocation libraries or implement allocation themselves.

Why are malloc and free slow: memory fragmentation

For most allocators, the allocator reserves one or more large blocks of memory from the operating system. Out of one such block the allocator carves out smaller chunks to service the requests made by the program. There are many algorithms for managing the smaller chunks within a large block, and they differ in speed (how fast the allocation is) and memory consumption.

An example of a large block the allocator uses to allocate small chunks. Chunks in green are already allocated. When the allocator needs to allocate a new chunk, it traverses the list starting from Free Blocks Start. To optimize for speed the allocator returns the first chunk of appropriate size (first-fit algorithm). To optimize for memory consumption, the allocator returns the chunk whose size most closely matches the chunk size requested by the calling code (best-fit algorithm).

So one of the reasons why allocators are slow is that the allocation algorithm needs some time to find an available chunk of a given size. But that is not all. As time progresses, it gets harder and harder to find a chunk of the appropriate size. The reason for this is called memory fragmentation.
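To make the search concrete, here is a minimal sketch of a first-fit walk over a singly linked list of free chunks. This is an illustration only, not the implementation of any particular allocator; the free_chunk structure and the lack of chunk splitting are simplifying assumptions.

#include <cstddef>

// Simplified free-list node: each free chunk records its size and a
// pointer to the next free chunk. Real allocators keep more metadata.
struct free_chunk {
    std::size_t size;
    free_chunk* next;
};

// First-fit: walk the list and return the first chunk that is large enough.
// The longer the list grows (more fragmentation), the longer this walk takes.
free_chunk* first_fit(free_chunk*& free_list_start, std::size_t requested) {
    free_chunk* prev = nullptr;
    for (free_chunk* cur = free_list_start; cur != nullptr; cur = cur->next) {
        if (cur->size >= requested) {
            // Unlink the chunk from the free list and hand it out.
            if (prev) prev->next = cur->next;
            else      free_list_start = cur->next;
            return cur;    // a real allocator would split off the remainder
        }
        prev = cur;
    }
    return nullptr;        // no chunk of sufficient size was found
}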

To illustrate memory fragmentation, consider the following scenario. The heap consists of five chunks, numbered from 1 to 5 and each 16 bytes in size. Your program allocates all five chunks, and after some time it returns chunks 1, 3 and 4.

Illustration of memory fragmentation. Each block is 16 bytes in size. Green blocks are taken, white blocks are available.

If your program wanted to allocate a chunk of 32 bytes (two consecutive 16-byte chunks), it would need more time to find a block of that size, since the only candidate spans positions 3 and 4. And if your program wanted to allocate a block of 48 bytes, the allocation would fail even though 48 bytes are available in the large block (chunks 1, 3 and 4), because those chunks are not contiguous.

This is a very simplified example, and you can imagine what happens when the block your program takes memory from is large and there are many small allocations and deallocations. The malloc and free functions get slower and slower because the block of the appropriate size sits further and further down the chain of available blocks. Additionally, if your program wants to allocate a large block, that allocation might fail, or your program's memory requirements will grow larger and larger.

Memory fragmentation on a real-life embedded system. Green blocks are taken, white blocks are available.

Memory fragmentation is a serious problem for long-running systems. It causes programs to become slower and slower or to run out of memory. Some time ago we were working on a TV box project. There was a test that changed the channel every 10 seconds for 24 hours. At the beginning of the test, it took 1 second for the video to start running after the channel change command. After 24 hours it took 7 seconds to do the same. The reason? Memory fragmentation.

There are several ways to tackle memory fragmentation:

  • There are systems that will occasionally restart in order to avoid it. Creating a program or a system that is restartable and that will restore its state after restart can be a challenge.
  • Some programs preallocate all the needed memory at start time and then completely dispense with dynamic memory. The problem is that for some programs it is not possible to allocate all memory in advance because this information is not available at start.
  • Some programs use special allocators that promise lower heap fragmentation.
  • Another technique is to cache some memory chunks instead of returning them to the allocator with free. This avoids calls to malloc and free at the expense of somewhat increased memory requirements.

Of the techniques just mentioned, some completely eliminate memory fragmentation, but they are not applicable everywhere. Others slow down the onset of memory fragmentation problems, but cannot eliminate them completely.

Why are malloc and free slow: thread synchronization

In the case of multithreaded programs (and nowadays many programs are multithreaded), malloc and free must be thread-safe. The simplest way to make them thread-safe is to introduce mutexes to protect the critical section of those functions, but this comes with a cost. Mutex locks and unlocks are expensive operations on multiprocessor systems, and even though the allocators can quickly find a memory block of appropriate size, synchronization eats away the speed advantage.

Many allocators promise good performance in multithreaded systems (tcmalloc and jemalloc, mentioned below, are well-known examples). They normally achieve this in two ways:

  • They reserve some memory per thread. If a block of memory is only ever accessed from a single thread, there is no need for synchronization.
  • They maintain a per-thread cache of recently used memory chunks. This removes the need for synchronization for the same reason as above.

Both of the above strategies work well for most programs, but for programs that allocate memory in one thread and release it in another, the worst-case runtime can be bad. Please note also that these allocation strategies generally increase the program's memory consumption. A sketch of the per-thread cache idea follows below.
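To illustrate the per-thread cache idea, here is a minimal sketch that uses a thread_local free list of fixed-size chunks. The names (fast_alloc, fast_free) and the fixed chunk size are assumptions made for this example; this is not how any specific allocator is actually written, but it shows why the fast path needs no locking.

#include <cstddef>
#include <cstdlib>

namespace {
struct cached_chunk { cached_chunk* next; };

// Each thread keeps its own list of recently freed chunks of one fixed
// size, so the fast path needs no synchronization at all.
thread_local cached_chunk* tl_cache = nullptr;
thread_local int tl_cache_count = 0;
constexpr int tl_cache_limit = 64;
constexpr std::size_t chunk_size = 64;   // assumed fixed allocation size
}

void* fast_alloc() {
    if (tl_cache) {                       // fast path: this thread's cache
        void* p = tl_cache;
        tl_cache = tl_cache->next;
        --tl_cache_count;
        return p;
    }
    return std::malloc(chunk_size);       // slow path: the shared allocator
}

void fast_free(void* p) {
    if (tl_cache_count < tl_cache_limit) {   // fast path: keep it for this thread
        auto* c = static_cast<cached_chunk*>(p);
        c->next = tl_cache;
        tl_cache = c;
        ++tl_cache_count;
    } else {
        std::free(p);                        // slow path: return it to the heap
    }
}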

If your program is single-threaded, the allocator might perform thread synchronization that is not needed at all. If performance is important, look into allocators that avoid this overhead.


Memory Allocation Patterns

If your program is running slow and this is due to dynamic memory, you will need to understand how your program allocates memory. Specifically:

When does your program allocate memory?

Does your program allocate all memory at the beginning and then keep it until it exits? Or does it allocate and deallocate memory throughout its run time? Or alternatively, are there specific places in your code that allocate memory in bursts and keep it only for a short time?

In case your program allocates all of its memory at the beginning, you should consider dispensing with dynamic memory allocation completely and instead allocating everything statically. This is often done in small embedded systems to keep the executable small (introducing malloc and free increases the size of the binary). It is also done in real-time systems, because real-time systems need to guarantee certain time constraints and malloc and free make this difficult.
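As a rough sketch of this idea (the message type, pool size and function names are made up for the example), all storage can be reserved statically and handed out from a fixed pool, so malloc is never called at run time:

#include <array>
#include <cstddef>

// Hypothetical message type; the maximum number of messages that can be
// alive at once is assumed to be known at design time.
struct message { int id; char payload[64]; };

constexpr std::size_t max_messages = 128;

// The whole pool lives in static storage; no dynamic allocation happens.
static std::array<message, max_messages> message_pool;
static std::size_t messages_in_use = 0;

message* acquire_message() {
    if (messages_in_use < max_messages)
        return &message_pool[messages_in_use++];
    return nullptr;   // pool exhausted: report an error instead of allocating
}

A real implementation would also track released entries (for example with a free list over the pool), but the point is that the memory footprint is fixed and known up front.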

In case your program allocates and deallocates memory steadily throughout its lifetime, there is not much you can do about it except invest in a good allocator.

In case your program allocates memory in short bursts and then deallocates it soon afterward, you should reconsider the allocation strategy if speed is important. For example, std::map (like other STL containers) allows you to specify an allocator as one of its template parameters. If you look up the full definition of the std::map type, you will see the following:

template<
    class Key,
    class T,
    class Compare = std::less<Key>,
    class Allocator = std::allocator<std::pair<const Key, T> >
> class map;

One of the template parameters is Allocator, and it defaults to std::allocator. If you want to replace the allocator of std::map, you will need to write a class with a few methods that can serve as a replacement for std::allocator.

So, if you are using std::map to process a large number of elements and then destroying it soon afterward, you can provide a custom allocator that allocates from one dedicated block. Here is an example of how to do it:

#include <sys/mman.h>   // for mmap and munmap

template <typename _Tp>
class zone_allocator
{
private:
    _Tp* my_memory;
    int free_block_index;
    static constexpr int mem_size = 1000*1024*1024;
public:

    zone_allocator()
    {
        // Reserve a large zone from the OS; physical pages are committed
        // only when the memory is actually touched.
        my_memory = reinterpret_cast<_Tp*>(mmap(0, mem_size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0));
        free_block_index = 0;
    }

    ~zone_allocator()
    {
        munmap(my_memory, mem_size);
    }

    // ... member typedefs (pointer, size_type, rebind, ...) omitted ...

    pointer allocate(size_type __n, const void* = 0)
    {
        // Bump allocation: hand out the next free slot in the zone.
        pointer result = &my_memory[free_block_index];
        free_block_index += __n;
        return result;
    }

    void deallocate(pointer __p, size_type __n)
    {
        // We deallocate everything when the allocator is destroyed
    }
    // ... remaining std::allocator interface omitted ...
};

std::map<int, my_class, std::less<int>, zone_allocator<std::pair<const int, my_class>>> my_map;

In the above example, the allocator uses mmap to reserve a large block of memory from the OS (don't worry, it will not consume all that RAM unless the program actually touches it). The block is returned to the system when the instance of zone_allocator is destroyed. The allocate method simply returns the first available chunk in the block, and the deallocate method doesn't do anything. Allocation is fast and deallocation is a no-op.

This approach works well if we fill the std::map at the beginning and then don't remove anything from it. The tree1 will be very compact in memory, which is good for the cache hit rate and performance. It will not cause any memory fragmentation in the rest of the system, since the whole block is separate from the memory blocks used by regular malloc.

But if we remove elements from the std::map, no memory is released and memory consumption will grow. In that case we can implement deallocation in the deallocate method, but allocation becomes more complex as well.
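One possible direction, sketched below under the assumption that the container always allocates single nodes (as std::map does) and that a node is at least as large as a pointer, is to keep a free list of returned nodes inside the zone. The free_list member is an addition to the zone_allocator shown above:

// Additional member in zone_allocator:
//     pointer free_list = nullptr;   // head of the list of returned nodes

pointer allocate(size_type __n, const void* = 0)
{
    if (__n == 1 && free_list) {
        // Reuse a previously returned node instead of growing the zone.
        pointer result = free_list;
        free_list = *reinterpret_cast<pointer*>(free_list);
        return result;
    }
    pointer result = &my_memory[free_block_index];
    free_block_index += __n;
    return result;
}

void deallocate(pointer __p, size_type __n)
{
    if (__n == 1) {
        // Store the link to the next free node inside the freed node itself.
        *reinterpret_cast<pointer*>(__p) = free_list;
        free_list = __p;
    }
    // Multi-element allocations are still only released when the whole
    // zone is destroyed.
}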

Does your program allocate memory in many small chunks or fewer large chunks?

The quality of the allocator matters more when your program allocates many small chunks than when it allocates a few larger chunks, and even more so if your program is constantly allocating and deallocating chunks.

What does the allocation and deallocation pattern look like?

Does your program allocate many memory blocks, do something with them, and then release them all? Or does it most often run in a cycle where it allocates a few chunks and then releases them?

Memory fragmentation will cause allocation and deallocation to slow down when you keep allocating and deallocating a small number of chunks. To remedy this you can use a technique called memory chunk caching. The idea is to cache memory chunks when your program releases memory instead of calling free. The next time you need more chunks, you serve them from the cache instead of allocating them. Here is an example of how to do it:

#include <cstddef>   // for std::size_t
#include <cstdlib>   // for std::malloc and std::free

class chunk {
private:
    // C++17 inline statics; in older C++ these need out-of-class definitions.
    inline static int chunk_count = 0;
    static const int max_chunk_count = 30;
    inline static chunk* first_chunk = nullptr;

    chunk* next_chunk;
    // ... other data members ...
public:
    void* operator new(std::size_t size) {
        if (first_chunk) {
            // Serve the request from the cache of previously freed chunks.
            void* res = first_chunk;
            first_chunk = first_chunk->next_chunk;
            chunk_count--;
            return res;
        } else {
            return std::malloc(size);
        }
    }

    void operator delete(void* p) {
        if (chunk_count < max_chunk_count) {
            // Instead of freeing the chunk, push it onto the cache list.
            chunk* c = reinterpret_cast<chunk*>(p);
            c->next_chunk = first_chunk;
            first_chunk = c;
            chunk_count++;
        } else {
            std::free(p);
        }
    }
};

In the above example, class chunk keeps a cache of up to 30 chunks. When the cache is not empty, operator new serves the request from the cache; otherwise it allocates additional memory with malloc. Operator delete saves the memory chunk in the cache unless the cache is full. Unless there are more than 30 instances of chunk alive at any given time, the program will call malloc and free only at the beginning.

Many allocators use a similar principle internally. But to benefit from it, a few conditions need to be met. To illustrate: if your program allocates 10,000 instances of a class and then releases them all, a chunk cache of 50 instances will be useless. This technique makes sense only when the number of live instances is bounded and the chunk cache size is close to that bound; otherwise your program will waste memory without any benefit.

Is memory allocated and deallocated in the same order?

If it is possible to allocate and deallocate memory chunks in the same order, this guarantees low memory fragmentation and fast allocation. Sometimes this can be achieved with additional development effort.
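To see why same-order (FIFO) allocation and deallocation is cheap, consider this sketch of a ring-buffer allocator with fixed-size slots. The slot size, slot count and function names are assumptions for the example; the point is that both operations are O(1) and no fragmentation can occur:

#include <cassert>
#include <cstddef>

constexpr std::size_t slot_size = 64;     // assumed fixed chunk size
constexpr std::size_t slot_count = 1024;

alignas(std::max_align_t) static unsigned char ring[slot_count][slot_size];
static std::size_t head = 0;   // next slot to hand out
static std::size_t tail = 0;   // oldest slot still in use
static std::size_t used = 0;

void* ring_alloc() {
    if (used == slot_count) return nullptr;   // ring is full
    void* p = ring[head];
    head = (head + 1) % slot_count;
    ++used;
    return p;
}

// Valid only if chunks are released in the same order they were allocated.
void ring_free(void* p) {
    assert(p == ring[tail]);                  // FIFO discipline: no search needed
    tail = (tail + 1) % slot_count;
    --used;
}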

Does your program allocate memory in one thread and then deallocate it in another?

One possibility is to allocate and deallocate all the memory in the same thread. The other is to allocate memory in one thread and deallocate it in another. From the performance point of view, moving chunks from one thread to another lowers performance with some allocators because it incurs additional synchronization. If possible, try to avoid it.

This happens because many allocators keep per-thread lists of chunks in order to avoid thread synchronization. When chunks switch threads, some synchronization is needed and a performance penalty has to be paid.

Is your program a long-running program?

If your program is a long-running program, slowdowns due to memory fragmentation can be severe. There are several ways to tackle this:

Some programs are designed so that they can be safely restarted. This is one way to fix the problem, but restoring the state of a long-running program can be challenging. Nevertheless, it is often done in embedded systems; some systems even have a requirement that they must be able to restart within a very short time limit (e.g. a few hundred milliseconds).

It is possible to completely avoid the performance penalties related to memory fragmentation on systems with virtual memory, but in the worst case your program may consume much more memory than it actually needs. The idea is as follows:

For each chunk size (8 bytes, 16 bytes, 24 bytes, 32 bytes, etc.) the allocator reserves a block of memory. When asked to allocate memory, the allocator picks the block depending on the requested chunk size. Since all chunks in a block have the same size, the allocator can just pick the first free chunk and return it. With this scheme it is possible to allocate and deallocate memory in O(1) time.

Illustration of an allocator that picks the block to allocate from depending on chunk size.

Note, however, that memory gets fragmented this way too, and a lot of it can be wasted. In the worst-case scenario your program can consume far more memory than it needs. An example: the operating system typically allocates memory in pages of 4 kB each. In the worst case there will be only one chunk in use per memory page, which means that around 4000 bytes of each page go to waste. That's very bad, but luckily it happens rarely.
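Here is a minimal sketch of the size-class idea described above (the sizes, refill strategy and names are illustrative, not taken from a real allocator). Because every chunk in a pool has the same size, the first free chunk always fits, and both allocation and deallocation are O(1):

#include <cstddef>
#include <cstdlib>

// One pool per size class (8, 16, 24, 32, ... bytes). The free chunks of a
// pool are kept in a singly linked list threaded through the chunks themselves.
struct size_class_pool {
    struct node { node* next; };
    std::size_t chunk_size;
    node* free_list = nullptr;

    void* allocate() {
        if (free_list) {                   // O(1): pop the first free chunk
            node* n = free_list;
            free_list = n->next;
            return n;
        }
        return std::malloc(chunk_size);    // sketch: a real pool would refill
                                           // from a freshly mapped page instead
    }

    void deallocate(void* p) {             // O(1): push the chunk back
        node* n = static_cast<node*>(p);
        n->next = free_list;
        free_list = n;
    }
};

// Picking the pool: round the requested size up to the next multiple of 8.
inline std::size_t size_class_of(std::size_t bytes) {
    return (bytes + 7) / 8;                // 1-8 bytes -> class 1, 9-16 -> class 2, ...
}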

A few words about allocators

All the problems just mentioned are common to every system that allocates and deallocates memory dynamically. There are several open-source allocators available on Linux that solve these problems one way or another, but no allocator, as far as I know, can solve all the problems completely.

When you are picking an allocator for your system, there are three properties on which every allocator makes compromises:

  • Allocation speed: the speed at which the program can request and release chunks from the allocator. Note that the speed of both malloc and free matters.
  • Memory consumption: what percentage of memory gets wasted with each allocated chunk. The allocator needs to keep some accounting information for each chunk, and that takes up space. Additionally, if the allocator optimizes for allocation speed, it can leave some memory unused. And finally, some memory becomes unavailable due to memory fragmentation.
  • Cache locality: when a chunk comes out of the allocator, we would like it to be in the data cache, since this speeds up later accesses. Allocators that pack their internal blocks tightly and avoid memory losses generally have better cache locality. We will talk about this in a follow-up article; it is mentioned here just for completeness.

Allocators on Linux

When you use malloc and free in your program (or new and delete in C++), it is normally the C standard library that implements those functions. Its allocator is called the GNU allocator and it is based on ptmalloc. Apart from it, there are several other open-source allocators commonly used on Linux: tcmalloc (by Google), jemalloc (by Facebook), mimalloc (by Microsoft), the Hoard allocator, ptmalloc and dlmalloc.

The GNU allocator is not among the most efficient allocators, but it does have one advantage: its worst-case runtime and memory usage will not be too bad. Since the worst case happens rarely anyway, it is definitely worth investigating whether we can do better.

Other allocators claim to be better in speed, memory usage or cache locality. Still, when you are picking the right one for you, you should consider:

  • Is your program single-threaded or multithreaded?
  • Do you need maximum allocation speed or minimum memory consumption? What kind of trade-off are you willing to make between these two?
  • Do you want to replace the allocator for the whole program or only the most critical parts?

The experiment

Well, this time there will not be any experiments. The reason is simple: real-world programs differ too much from one another, and an allocator that performs well under one load might behave quite differently under another.

If you are interested in seeing numbers, I suggest you read Testing Memory Allocators: ptmalloc2 vs tcmalloc vs hoard vs jemalloc While Trying to Simulate Real-World Loads. The author compared several allocators on a test load that tries to simulate a real-world workload. We will not repeat the results of his analysis here, but his conclusion matches ours: allocators are similar to one another, and testing your particular application matters more than relying on synthetic benchmarks.

A few more words on using allocators in your program

If you are running under Linux, allocators are available in your distribution's repositories, and you don't need to recompile your application in order to benefit from one of them. You can use the environment variable LD_PRELOAD to replace the default allocator with a custom one, which is a quick way to check whether an allocator fits your needs:

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4  ./my_program

In the above example, we use LD_PRELOAD to override the GNU allocator with tcmalloc.

All the allocators can be fine-tuned to run better on a particular system, but the default configuration should be enough for most use cases. Fine-tuning can be done at compile-time or run-time, through environment variables, configuration files or compilation options.

Normally an allocator provides implementations of malloc and free that replace the functions of the same name provided by the C standard library. This means that every dynamic allocation your program makes goes through the new allocator.

However, it is possible to keep the default malloc and free implementations alongside the custom implementations provided by your chosen allocator. Allocators can provide prefixed versions of malloc and free for this purpose. For example, jemalloc allows you to specify a custom prefix for malloc and free. If you specify the prefix je_, the allocator will provide the functions je_malloc and je_free. In this case, malloc and free are left unchanged, and you can use je_malloc and je_free to route only some parts of your program through jemalloc.
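A short sketch of what this looks like in practice, assuming jemalloc was built with its --with-jemalloc-prefix=je_ configure option and linked into the program (the je_-prefixed functions exist only under that assumption):

#include <jemalloc/jemalloc.h>   // declares je_malloc/je_free for a prefixed build
#include <cstdlib>

void hot_path()
{
    // Performance-critical part: allocate through jemalloc explicitly.
    void* buffer = je_malloc(4096);
    // ... use buffer ...
    je_free(buffer);
}

void everything_else()
{
    // The rest of the program keeps using the default allocator.
    void* buffer = std::malloc(4096);
    // ... use buffer ...
    std::free(buffer);
}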

Final Words

We presented several techniques to make your program allocate memory faster. Off-the-shelf allocators have the benefit of being very easy to set up, and you can see the improvements within minutes. However, you will need to introduce new libraries into your program, which can sometimes pose a problem.

The other techniques presented are also very powerful. The zone allocator is very fast when applicable. Chunk caching is fast as well and easy to implement. Both techniques also decrease overall memory fragmentation.

Each of these techniques approaches allocation a bit differently, and at the end of the day, it is your program and your requirements that will determine which one suits you best.


Further Reading

Scalable memory allocation using jemalloc

Testing Memory Allocators: ptmalloc2 vs tcmalloc vs hoard vs jemalloc While Trying to Simulate Real-World Loads

  1. std::map is implemented as a red-black tree
