Avoiding Heap Allocations With Static Thread Locals


In which I show how to use static thread locals to reuse function locals between invocations, and why this can be especially important when allocating on the heap.

Let’s say you are writing parallel software to process something. It has a single Processor object, with a member function process(). This function gets called from several threads. To complete its task, it depends on a Helper object (ignore the output for now).

class Helper
{
public:
    Helper(char id)
    {
        cout << "Helper created for thread " << id;
    }
};

// Helper is defined first so that Processor::process() can use it.
class Processor
{
public:
    void process(char data, int iteration)
    {
        cout << "Iteration " << iteration << " ";
        Helper h(data);
        //Use the helper object
        cout << endl;
    }
};

The data parameter stands in for some data structure that is unique to each thread, so we need a unique Helper for each thread. We assume that the Helper object never has to change inside a thread.

In this first example, the Helper object is a local variable, but this means a new instance gets created and destroyed for each invocation. If process() is called a lot, we might want to avoid this. Here is how to solve that using static thread locals:

    void process_threadlocal(char data, int iteration)
    {
        cout << "Iteration " << iteration << " ";
        static thread_local Helper* h = nullptr; 
        if (h == nullptr)
            h = new Helper(data);
        //Use the helper object
        cout << endl;
    }

(Note that the null check is there because I am using GCC, which doesn’t support proper C++11 thread_local yet. In the future (or on some other compilers) you should be able to just type static thread_local Helper* h = new Helper(data);, or even better: static thread_local Helper h(data);)
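For readers on a compiler with full C++11 thread_local support, here is a minimal sketch of the version without the null check. The helpers_created counter, the id() accessor and the count_helpers() driver are my own illustrative additions, not part of the example above, and I am assuming a toolchain where non-trivial thread locals work.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative counter (an assumption for this sketch): total Helpers constructed.
std::atomic<int> helpers_created{0};

class Helper
{
public:
    explicit Helper(char id) : id_(id) { ++helpers_created; }
    char id() const { return id_; }
private:
    char id_;
};

// The block-scope thread_local is initialized the first time each thread
// reaches the declaration, so no explicit null check is written here.
char process_threadlocal(char data)
{
    static thread_local Helper h(data);
    return h.id();
}

// Hypothetical driver: runs `threads` threads one after another, each calling
// process_threadlocal() `calls` times, and returns how many Helpers were built.
int count_helpers(int threads, int calls)
{
    helpers_created = 0;
    for (int t = 0; t < threads; ++t) {
        const char id = static_cast<char>('A' + t);
        std::thread([id, calls] {
            for (int i = 0; i < calls; ++i) {
                const char seen = process_threadlocal(id);
                // Each thread keeps seeing the Helper it built on its first call.
                assert(seen == id);
                (void)seen;
            }
        }).join();
    }
    return helpers_created.load();
}
```

Calling count_helpers(3, 3) should report one construction per thread rather than one per call.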

Let’s have a look at the client code. The following main simulates three threads (named A, B and C) running 3 iterations of process each, first using the regular process(), then using process_threadlocal():

int main() 
{
    Processor processor;

    cout << endl << "LOCAL:" << endl;
    for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
        thread([&]() {
            for (size_t i = 0; i < 3; ++i) {
                processor.process(thread_id, i);
            }
        }).join();
    }

    cout << endl << "STATIC THREAD LOCAL:" << endl;
    for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
        thread([&]() {
            for (size_t i = 0; i < 3; ++i) {
                processor.process_threadlocal(thread_id, i);
            }
        }).join();
    }

}

(For simplicity, we wait for each thread to finish before starting the next.)

Here is the output:

LOCAL:
Iteration 0 Helper created for thread A
Iteration 1 Helper created for thread A
Iteration 2 Helper created for thread A
Iteration 0 Helper created for thread B
Iteration 1 Helper created for thread B
Iteration 2 Helper created for thread B
Iteration 0 Helper created for thread C
Iteration 1 Helper created for thread C
Iteration 2 Helper created for thread C

STATIC THREAD LOCAL:
Iteration 0 Helper created for thread A
Iteration 1
Iteration 2
Iteration 0 Helper created for thread B
Iteration 1
Iteration 2
Iteration 0 Helper created for thread C
Iteration 1
Iteration 2

As you can see, each thread gets its own Helper in both cases, but with static thread locals it is created only once per thread instead of once per call.
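If you want the threads to actually overlap, one option is to keep them in a container and join them afterwards. The sketch below is an assumption-laden variant of the demo, not the code from this post: the created vector, its mutex and run_in_parallel() are hypothetical names, and the Helper records its id instead of printing so that the output cannot get mangled.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative bookkeeping: ids of the threads that constructed a Helper.
std::mutex created_mutex;
std::vector<char> created;

class Helper
{
public:
    explicit Helper(char id)
    {
        // Several threads may construct their Helper at the same time.
        std::lock_guard<std::mutex> lock(created_mutex);
        created.push_back(id);
    }
};

void process_threadlocal(char data)
{
    static thread_local Helper h(data);
    (void)h; // Use the helper object
}

// Starts threads A, B and C so that they can overlap, joins them all, and
// returns the sorted ids of the Helpers that were constructed.
std::vector<char> run_in_parallel(int iterations)
{
    assert(iterations > 0);
    created.clear(); // no worker threads are running yet
    std::vector<std::thread> threads;
    for (char id = 'A'; id <= 'C'; ++id) {
        // Capture id by value: the loop variable changes while threads run.
        threads.emplace_back([id, iterations] {
            for (int i = 0; i < iterations; ++i)
                process_threadlocal(id);
        });
    }
    for (std::thread& t : threads)
        t.join();
    std::sort(created.begin(), created.end());
    return created;
}
```

The capture by value matters here: capturing the loop variable by reference is only safe when each thread is joined before the loop moves on.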

But you said heap allocations!

I did. In the first example, I allocate the Helper on the stack. This might be a performance hit in itself, at least if the Helper constructor or destructor is expensive. What is far worse, however, is if Helper is allocated on the heap, or if any objects it creates are (or any objects created by those, and so on).

To be thread safe, operator new has to synchronize access to the heap, and in a simple implementation that means locking the entire heap. All other threads that need to access the heap then block every time a thread allocates something on it. This can ruin your parallel performance. Andrew Binstock wrote a very interesting article in Dr. Dobb’s Go Parallel about this, which I highly recommend.
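To get a feel for how often a function reaches the allocator, you can count calls to operator new. The following is a rough sketch under my own assumptions: it replaces the global operator new with a counting version, and the Helper here owns a std::vector as a stand-in for "an object that allocates". It measures allocation counts only, not lock contention.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Counts every trip through the (replaced) global operator new.
std::atomic<long> allocations{0};

void* operator new(std::size_t size)
{
    ++allocations;
    if (void* p = std::malloc(size ? size : 1))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Hypothetical Helper whose constructor allocates a buffer on the heap.
struct Helper
{
    Helper() : buffer(64) {}
    std::vector<int> buffer;
};

// A fresh Helper, and therefore a fresh heap allocation, on every call.
int process_local(int i)
{
    Helper h;
    h.buffer[i % 64] = i;
    return h.buffer[(i * 7) % 64];
}

// The Helper and its buffer are created once per thread and then reused.
int process_threadlocal(int i)
{
    static thread_local Helper h;
    h.buffer[i % 64] = i;
    return h.buffer[(i * 7) % 64];
}

// Returns how many allocations happened during `calls` invocations.
long allocations_during(int (*process)(int), int calls)
{
    assert(calls > 0);
    const long before = allocations.load();
    for (int i = 0; i < calls; ++i)
        process(i);
    return allocations.load() - before;
}
```

Comparing allocations_during(process_local, 100) with allocations_during(process_threadlocal, 100) should show roughly one allocation per call for the local version and only a handful for the thread local one.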

But, but, memory leak!

Nicely spotted. There is a memory leak in my program, since I never delete the Helpers created in process_threadlocal. This might or might not matter, depending on your situation. The program will never leak more than one object per thread, so you might just ignore the problem as memory is freed at the exit of the program anyway.

If Helper needs to free some other resource, however, or do something else in its destructor, one thing you could do is wrap the Helper* pointer in a unique_ptr. Then the destructor will be called for each Helper when its thread exits and that thread’s static thread_local unique_ptr<Helper> is destroyed.

The reason I am not doing that in this example is that GCC doesn’t support non-trivial thread local objects yet (it doesn’t have C++11 thread locals, but uses its own __thread, which I have defined thread_local to).
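On a compiler that does support non-trivial thread locals, the unique_ptr variant could look something like the sketch below. The created and destroyed counters, the Counts struct and run_threads() are hypothetical additions of mine, there only to make the destructor calls observable.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

// Illustrative counters (assumptions for this sketch).
std::atomic<int> created{0};
std::atomic<int> destroyed{0};

class Helper
{
public:
    explicit Helper(char) { ++created; }
    ~Helper() { ++destroyed; } // stands in for releasing some other resource
};

void process_threadlocal(char data)
{
    // One unique_ptr per thread; it deletes its Helper when the thread exits.
    static thread_local std::unique_ptr<Helper> h(new Helper(data));
    (void)h; // Use the helper object
}

struct Counts
{
    int created;
    int destroyed;
};

// Runs `threads` threads one after another and reports the counters afterwards.
Counts run_threads(int threads, int calls)
{
    assert(threads > 0 && calls > 0);
    created = 0;
    destroyed = 0;
    for (int t = 0; t < threads; ++t) {
        const char id = static_cast<char>('A' + t);
        std::thread([id, calls] {
            for (int i = 0; i < calls; ++i)
                process_threadlocal(id);
        }).join();
    }
    return Counts{created.load(), destroyed.load()};
}
```

After run_threads(3, 3), both counters should be 3: one Helper per thread, each destroyed when its thread finished.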

And finally, the usual disclaimer: Don’t go optimizing like this unless you have profiler data to back that decision. Until proven otherwise, readable code always wins!

As usual, the code for this blog post is available on GitHub. I have only tested it with GCC, but I guess it should work in Visual Studio 2011 as well.


4 thoughts on “Avoiding Heap Allocations With Static Thread Locals”

  1. What is it supposed to optimize exactly?
    You’re adding branching overhead by doing if (h == nullptr).

    Also, did you mean: static thread_local Helper* h = new Helper(data); ??? The static keyword is supposed to bring in the “if not initialized yet, do it” logic.

    Finally I’m not even sure anything runs in parallel, looking at the output it’s all in sequence.

    1. Thanks a lot for your comments Gregory!

      > What is it supposed to optimize exactly?

      The main point is avoiding doing heap allocations in each function invocation to avoid locking the heap, but the technique can also be useful if it is expensive to create or destroy the Helper objects or any objects indirectly created by them.

      > You’re adding branching overhead by doing if (h == nullptr).

      Yes I am. Whether this matters or not depends on the rest of the function body.

      > Also, did you mean: static thread_local Helper* h = new Helper(data); ??? The static keyword is supposed to bring in the “if not initialized yet, do it” logic.

      That doesn’t work in GCC with static thread locals:

      error: ‘h’ is thread-local and so cannot be dynamically initialized

      Do you know if it is supposed to work in C++11?

      > Finally I’m not even sure anything runs in parallel, looking at the output it’s all in sequence.

      No, nothing runs in parallel here, but that is intentional. If you look at the code in the for-loops, I put a .join() at the end of each thread. I did this for two reasons:

      1: Simpler code. I don’t have to save the threads in a container to be able to join all of them later.
      2: To avoid mangling the output.

      I don’t however see how it is a problem that the threads are running in sequence?

  2. > Do you know if it is supposed to work in C++11?

    Nope, really I overlooked C++11’s threading additions and I’m not so impressed. Last time I had a look I didn’t read any restrictions concerning thread_local and dynamic initialization so I believe static thread_local Helper* h = new Helper(data); is possible.

    But as you noticed, this page http://gcc.gnu.org/projects/cxx0x.html confirms GCC doesn’t actually support thread_local.
    And __thread is not C++11 thread_local, otherwise you would have been able to do static thread_local Helper h(data); and be done.

    > I don’t however see how it is a problem that the threads are running in sequence?

    I indeed overlooked the .join() call but… Well, why use threads in the first place? :) Output could have been out of order while still showing a single helper is created by thread. And somehow you’re validating an approach about threading code while having sequential execution.

    All in all, if Helper is stateful and requires complex construction/destruction maybe it’s not so much a helper in the first place and it shouldn’t be an optimization spot. That’s the concern I had when reading your example code and decided to share my very own opinion with you. That sample code advocates language constructs not so expert C++ programmers would use blindly without realizing what’s going on under the hood. And you’re basing your advice on GCC which doesn’t have the true C++11 thread_local support.

    1. > I indeed overlooked the .join() call but… Well, why use threads in the first place? :) Output could have been out of order while still showing a single helper is created by thread.

      If I ran everything in one thread, the result of the program would change, only one helper would be created. The output of the demonstration would be identical to a program using a normal static, without thread_local, and I think that would miss the point of the article.

      > And somehow you’re validating an approach about threading code while having sequential execution.

      My point isn’t really to validate the approach, it is to demonstrate a practical application of it. But even though I have sequential execution I think it still validates the approach. It is clear from the output that one Helper is created per thread.

      Here is the output with a function that uses static without thread_local:

      STATIC:
      Iteration 0 Helper created for thread A
      Iteration 1
      Iteration 2
      Iteration 0
      Iteration 1
      Iteration 2
      Iteration 0
      Iteration 1
      Iteration 2

      So no matter if the threads are running in parallel or not, they still demonstrate the point.

      (I updated the code on GitHub if you want to have a look.)

      > All in all, if Helper is stateful and requires complex construction/destruction maybe it’s not so much a helper in the first place and it shouldn’t be an optimization spot.

      That’s a very good point.

      > That’s the concern I had when reading your example code and decided to share my very own opinion with you. That sample code advocates language constructs not so expert C++ programmers would use blindly without realizing what’s going on under the hood.

      Hopefully they are reading the comments then! :)

      > And you’re basing your advice on GCC which doesn’t have the true C++11 thread_local support.

      The approach is valid both with GCC __thread and thread_local, but I’ll update the post with a note about the unnecessary if.

      By the way, do you know how the C++ runtime is supposed to handle the “one time only” initialization? Won’t it have to use some sort of branching to do a null check under the hood anyway?
