Avoiding Heap Allocations With Static Thread Locals


In which I show how to use static thread locals to reuse function locals between invocations, and why this can be especially important when allocating on the heap.

Let’s say you are writing parallel software to process something. It has a single Processor object, with a member function process(). This function gets called from several threads. To complete its task, it depends on a Helper object (ignore the output for now).

#include <iostream>
#include <thread>

using namespace std;

//Helper is defined first, since Processor uses it by value
class Helper
{
public:
    Helper(char id)
    {
        cout << "Helper created for thread " << id;
    }
};

class Processor
{
public:
    void process(char data, int iteration)
    {
        cout << "Iteration " << iteration << " ";
        Helper h(data);
        //Use the helper object
        cout << endl;
    }
};

The data parameter stands in for some data structure that is unique to each thread, so we need a unique Helper for each thread. We assume that the Helper object never has to change inside a thread.

In this first example, the Helper object is a local variable, but this means a new instance gets created and destroyed for each invocation. If process() is called a lot, we might want to avoid this. Here is how to solve that using static thread locals:

    void process_threadlocal(char data, int iteration)
    {
        cout << "Iteration " << iteration << " ";
        static thread_local Helper* h = nullptr; 
        if (h == nullptr)
            h = new Helper(data);
        //Use the helper object
        cout << endl;
    }

(Note that the null check is there because I am using GCC, which doesn't support proper C++11 thread_local yet. In the future (or on some other compilers) you should be able to just write static thread_local Helper* h = new Helper(data);, or even better: static thread_local Helper h(data);.)
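
On a compiler with full C++11 thread_local support, the whole function might look something like this (just a sketch of what that last one-liner would give you):

    void process_threadlocal(char data, int iteration)
    {
        cout << "Iteration " << iteration << " ";
        //With full support, the constructor runs exactly once per thread,
        //the first time this line is reached on that thread
        static thread_local Helper h(data);
        //Use the helper object
        cout << endl;
    }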

Let’s have a look at the client code. The following main simulates three threads (named A, B and C) running 3 iterations of process each, first using the regular process(), then using process_threadlocal():

int main() 
{
    Processor processor;

    cout << endl << "LOCAL" << endl;
    for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
        thread([&]() {
            for (size_t i = 0; i < 3; ++i) {
                processor.process(thread_id, i);
            }
        }).join();
    }

    cout << endl << "STATIC THREAD LOCAL:" << endl;
    for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
        thread([&]() {
            for (size_t i = 0; i < 3; ++i) {
                processor.process_threadlocal(thread_id, i);
            }
        }).join();
    }

}

(For simplicity, we wait for each thread to finish before starting the next.)
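
If you instead want the threads to actually run concurrently, you could store them and join them all at the end, something like this sketch (thread_id must then be captured by value, and it needs #include <vector>):

    vector<thread> threads;
    for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
        //Capture processor by reference, but thread_id by value,
        //since the loop keeps running while the threads execute
        threads.emplace_back([&processor, thread_id]() {
            for (size_t i = 0; i < 3; ++i) {
                processor.process_threadlocal(thread_id, i);
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }

With concurrent threads the output lines would of course interleave, which is why the sequential version above is the one that produces the output below.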

Here is the output:

LOCAL:
Iteration 0 Helper created for thread A
Iteration 1 Helper created for thread A
Iteration 2 Helper created for thread A
Iteration 0 Helper created for thread B
Iteration 1 Helper created for thread B
Iteration 2 Helper created for thread B
Iteration 0 Helper created for thread C
Iteration 1 Helper created for thread C
Iteration 2 Helper created for thread C

STATIC THREAD LOCAL:
Iteration 0 Helper created for thread A
Iteration 1
Iteration 2
Iteration 0 Helper created for thread B
Iteration 1
Iteration 2
Iteration 0 Helper created for thread C
Iteration 1
Iteration 2

As you can see, with a plain local variable a new Helper is created on every invocation, but with a static thread local it is created only once per thread.

But you said heap allocations!

I did. In the first example, I allocate the Helper on the stack. This might be a performance hit in itself, at least if the Helper constructor or destructor is expensive. What is far worse, however, is if Helper is allocated on the heap, or if any objects it creates are (or any objects created by those, and so on).

To be thread safe, operator new needs to synchronize access to the shared heap. That means every time one thread allocates something on the heap, other threads that need the heap can be blocked. This can ruin your parallel performance. Andrew Binstock wrote a very interesting article about this in Dr. Dobb's Go Parallel, which I highly recommend.
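
To make the difference concrete, here is a sketch (the name process_heap is mine, not part of the example above) of a version that pays that cost on every call:

    void process_heap(char data, int iteration)
    {
        cout << "Iteration " << iteration << " ";
        //Every call takes a round trip through the global allocator,
        //which has to synchronize with allocations from other threads
        Helper* h = new Helper(data);
        //Use the helper object
        delete h;
        cout << endl;
    }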

But, but, memory leak!

Nicely spotted. There is a memory leak in my program, since I never delete the Helpers created in process_threadlocal. This might or might not matter, depending on your situation. The program will never leak more than one object per thread, so you might simply ignore the problem, since the memory is freed when the program exits anyway.

If Helper needs to free some other resource, however, or do something else in its destructor, one thing you could do is wrap the Helper* pointer in a unique_ptr. Then the destructor will be called for each Helper when its thread exits and the static thread_local unique_ptr<Helper> is destroyed.
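
On a compiler with full support for C++11 thread_local, that could look something like this (a sketch; it also needs #include <memory>):

    void process_threadlocal(char data, int iteration)
    {
        cout << "Iteration " << iteration << " ";
        //The unique_ptr is constructed once per thread and owns the Helper,
        //so ~Helper() runs when the unique_ptr is destroyed at thread exit
        static thread_local unique_ptr<Helper> h(new Helper(data));
        //Use the helper object
        cout << endl;
    }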

The reason I am not doing that in this example is that GCC doesn't support non-trivial thread local objects yet (it doesn't have C++11 thread_local, but uses its own __thread, which I have defined thread_local to).
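
Concretely, that define is just a macro along these lines (my sketch of the workaround mentioned above):

//GCC's __thread only supports trivially constructible and destructible types,
//which is why the examples use a raw pointer and a null check instead
#define thread_local __thread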

And finally, the usual disclaimer: Don’t go optimizing like this unless you have profiler data to back that decision. Until proven otherwise, readable code always wins!

As usual, the code for this blog post is available on GitHub. I have only tested it with GCC, but I guess it should work in Visual Studio 2012 as well.

If you enjoyed this post, you can subscribe to my blog, or follow me on Twitter.