In which I show how to use static thread locals to reuse function locals between invocations, and why this can be especially important when allocating on the heap.
Let’s say you are writing parallel software to process something. It has a single Processor object with a member function, process(). This function gets called from several threads. To complete its task, it depends on a Helper object (ignore the output for now).
#include <iostream>
using namespace std;

class Helper {
public:
    Helper(char id) { cout << "Helper created for thread " << id; }
};

class Processor {
public:
    void process(char data, int iteration) {
        cout << "Iteration " << iteration << " ";
        Helper h(data);  // created and destroyed on every call
        // Use the helper object
        cout << endl;
    }
};
The data parameter stands in for some data structure that is unique to each thread, so we need a unique Helper for each thread. We assume that the Helper object never has to change inside a thread.
In this first example, the Helper object is a local variable, which means a new instance gets created and destroyed for each invocation. If process() is called a lot, we might want to avoid this. Here is how to solve that using static thread locals:
void process_threadlocal(char data, int iteration) {
    cout << "Iteration " << iteration << " ";
    // One pointer per thread; it starts out null in every new thread.
    static thread_local Helper* h = nullptr;
    if (h == nullptr)
        h = new Helper(data);
    // Use the helper object
    cout << endl;
}
(Note that the null check is there because I am using GCC, which doesn’t support proper C++11 thread_local yet. In the future (or on some other compilers) you should be able to just type static thread_local Helper* h = new Helper(data);, or even better: static thread_local Helper h(data);)
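For reference, here is a minimal sketch of that last form on a compiler with full C++11 thread_local support (GCC 4.8 and later, as far as I know). The construction counter and the count_helpers_for_three_threads() driver are hypothetical additions for illustration and are not part of the original program; output is left out so the count can be checked directly.

```cpp
#include <atomic>
#include <thread>

// Hypothetical counter, added only to make the number of constructions observable.
std::atomic<int> helpers_created{0};

// Stand-in for the article's Helper; it counts constructions instead of printing.
class Helper {
public:
    explicit Helper(char id) : id_(id) { ++helpers_created; }
    char id() const { return id_; }
private:
    char id_;
};

// The initializer runs the first time each thread reaches the declaration,
// so no explicit null check is needed.
char process_threadlocal(char data) {
    static thread_local Helper h(data);
    return h.id();
}

// Hypothetical driver: three threads, three calls each, run one after another.
int count_helpers_for_three_threads() {
    helpers_created = 0;
    for (char id = 'A'; id <= 'C'; ++id) {
        std::thread([id] {
            for (int i = 0; i < 3; ++i) {
                process_threadlocal(id);
            }
        }).join();
    }
    return helpers_created;
}
```

Calling count_helpers_for_three_threads() should report three constructions for nine calls, one per thread.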
Let’s have a look at the client code. The following main simulates three threads (named A, B and C) running 3 iterations of process each, first using the regular process(), then using process_threadlocal():
#include <thread>

int main() {
    Processor processor;

    cout << endl << "LOCAL:" << endl;
    for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
        // join() right away, so the threads run one after another
        thread([&]() {
            for (size_t i = 0; i < 3; ++i) {
                processor.process(thread_id, i);
            }
        }).join();
    }

    cout << endl << "STATIC THREAD LOCAL:" << endl;
    for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
        thread([&]() {
            for (size_t i = 0; i < 3; ++i) {
                processor.process_threadlocal(thread_id, i);
            }
        }).join();
    }
}
(For simplicity, we wait for each thread to finish before starting the next.)
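If you would rather let the threads overlap, one option is to keep them in a container and join them all at the end. The following is only a sketch under my own assumptions: CountingHelper, process_once() and run_concurrently() are hypothetical names, and the helper counts constructions instead of printing so that there is no output to interleave.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical counter so the result does not depend on console output.
std::atomic<int> constructions{0};

// Hypothetical stand-in for Helper that only counts how often it is built.
struct CountingHelper {
    explicit CountingHelper(char) { ++constructions; }
};

void process_once(char data) {
    // Constructed once per thread, on that thread's first call.
    static thread_local CountingHelper h(data);
    (void)h;  // the real code would use the helper here
}

// Starts all threads first, then joins them, so they may run in parallel.
int run_concurrently(int thread_count, int iterations) {
    constructions = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([t, iterations] {
            for (int i = 0; i < iterations; ++i) {
                process_once(static_cast<char>('A' + t));
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return constructions;
}
```

The lambda captures t by value here, because the loop variable keeps changing while the earlier threads are still running.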
Here is the output:
LOCAL:
Iteration 0 Helper created for thread A
Iteration 1 Helper created for thread A
Iteration 2 Helper created for thread A
Iteration 3 Helper created for thread A
Iteration 4 Helper created for thread A
Iteration 0 Helper created for thread B
Iteration 1 Helper created for thread B
Iteration 2 Helper created for thread B
Iteration 3 Helper created for thread B
Iteration 4 Helper created for thread B
Iteration 0 Helper created for thread C
Iteration 1 Helper created for thread C
Iteration 2 Helper created for thread C
Iteration 3 Helper created for thread C
Iteration 4 Helper created for thread C
STATIC THREAD LOCAL:
Iteration 0 Helper created for thread A
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 0 Helper created for thread B
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 0 Helper created for thread C
Iteration 1
Iteration 2
Iteration 3
Iteration 4
As you can see, one Helper is created for each thread, but by using static thread locals, it is only created once per thread.
But you said heap allocations!
I did. In the first example, I allocate the Helper on the stack. This might be a performance hit in itself, at least if the Helper constructor or destructor is expensive. What is far worse, however, is if Helper is allocated on the heap, or if any objects it creates are (or any objects created by the ones it creates, and so on).
To be thread safe, operator new has to synchronize access to the heap, and in many implementations that means taking a lock on the entire heap. Other threads that need the heap then block every time a thread allocates something. This can ruin your parallel performance. Andrew Binstock wrote a very interesting article about this in Dr. Dobb’s Go Parallel, which I highly recommend.
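The same trick applies to heap-backed members of a helper. As a rough sketch (process_with_scratch() and its scratch buffer are hypothetical, not part of the original code), a thread-local std::vector keeps its capacity between calls, so after the first call a thread only goes back to the heap when it needs more room than before:

```cpp
#include <cstddef>
#include <vector>

// Returns the address of the scratch storage so that the reuse is observable.
const char* process_with_scratch(std::size_t n) {
    // One buffer per thread, kept alive between invocations.
    static thread_local std::vector<char> scratch;
    scratch.clear();         // drops the contents but keeps the capacity
    scratch.resize(n, 'x');  // reallocates only if n exceeds the capacity
    return scratch.data();
}
```

Calling it with 1024 and then with 512 on the same thread should hand back the same address, since the second call fits in the storage allocated by the first.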
But, but, memory leak!
Nicely spotted. There is a memory leak in my program, since I never delete the Helpers created in process_threadlocal. This might or might not matter, depending on your situation. The program will never leak more than one object per thread, so you might just ignore the problem, as memory is freed at the exit of the program anyway.
If Helper needs to free some other resource, however, or do something else in its destructor, one thing you could do is wrap the Helper* pointer in a unique_ptr. Then the destructor will be called for each Helper when its thread exits (or, for the main thread, when the program terminates) and that thread’s static thread_local unique_ptr<Helper> is destroyed.
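Here is a small sketch of the unique_ptr variant, assuming a compiler with full C++11 thread_local support. TrackedHelper, process_owned() and live_helpers_after_thread_exit() are hypothetical names; the live-object counter is there only to show that the destructor runs by the time the thread has been joined.

```cpp
#include <atomic>
#include <memory>
#include <thread>

// Hypothetical counter of helpers that currently exist.
std::atomic<int> live_helpers{0};

// Hypothetical stand-in for a Helper with a destructor that must run.
class TrackedHelper {
public:
    explicit TrackedHelper(char) { ++live_helpers; }
    ~TrackedHelper() { --live_helpers; }
};

void process_owned(char data) {
    // The unique_ptr is thread-local, so it is destroyed (and deletes the
    // helper it owns) when the owning thread exits.
    static thread_local std::unique_ptr<TrackedHelper> h(new TrackedHelper(data));
    (void)h;  // the real code would use *h here
}

// Runs three calls on a fresh thread and reports how many helpers remain.
int live_helpers_after_thread_exit() {
    std::thread([] {
        for (int i = 0; i < 3; ++i) {
            process_owned('A');
        }
    }).join();
    return live_helpers;
}
```

Once the thread has been joined, no helpers should be left alive, so nothing leaks and the destructor gets a chance to release whatever the helper holds.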
The reason I am not doing that in this example is that GCC doesn’t support non-trivial thread local objects yet. (It doesn’t have C++11 thread locals, but uses its own __thread, which I have defined thread_local to be.)
And finally, the usual disclaimer: Don’t go optimizing like this unless you have profiler data to back that decision. Until proven otherwise, readable code always wins!
As usual, the code for this blog post is available on GitHub. I have only tested it with GCC, but I guess it should work in Visual Studio 2011 as well.
What is it supposed to optimize exactly?
You’re adding branching overhead by doing if (h == nullptr).
Also, did you mean: static thread_local Helper* h = new Helper(data); ??? The static keyword is supposed to bring in the “if not initialized yet, do it” logic.
Finally I’m not even sure anything runs in parallel, looking at the output it’s all in sequence.
Thanks a lot for your comments Gregory!
> What is it supposed to optimize exactly?
The main point is avoiding heap allocations in each function invocation, to avoid locking the heap, but the technique can also be useful if it is expensive to create or destroy the Helper objects or any objects indirectly created by them.
> You’re adding branching overhead by doing if (h == nullptr).
Yes I am. Whether this matters or not depends on the rest of the function body.
> Also, did you mean: static thread_local Helper* h = new Helper(data); ??? The static keyword is supposed to bring in the “if not initialized yet, do it” logic.
That doesn’t work in GCC with static thread locals:
error: ‘h’ is thread-local and so cannot be dynamically initialized
Do you know if it is supposed to work in C++11?
> Finally I’m not even sure anything runs in parallel, looking at the output it’s all in sequence.
No, nothing runs in parallel here, but that is intentional. If you look at the code in the for-loops, I put a .join() at the end of each thread. I did this for two reasons:
1: Simpler code. I don’t have to save the threads in a container to be able to join all of them later.
2: To avoid mangling the output.
I don’t however see how it is a problem that the threads are running in sequence?
> Do you know if it supposed to work in C++11?
Nope, I really overlooked C++11’s threading additions and I’m not so impressed. Last time I had a look I didn’t read about any restrictions concerning thread_local and dynamic initialization, so I believe static thread_local Helper* h = new Helper(data); is possible.
But as you noticed, this page http://gcc.gnu.org/projects/cxx0x.html confirms GCC doesn’t actually support thread_local.
And __thread is not C++11 thread_local, otherwise you would have been able to do static thread_local Helper h(data); and be done.
> I don’t however see how it is a problem that the threads are running in sequence?
I indeed overlooked the .join() call but… Well, why use threads in the first place? :) Output could have been out of order while still showing a single helper is created by thread. And somehow you’re validating an approach about threading code while having sequential execution.
All in all, if Helper is stateful and requires complex construction/destruction, maybe it’s not so much a helper in the first place and it shouldn’t be an optimization spot. That’s the concern I had when reading your example code and decided to share my very own opinion with you. That sample code advocates language constructs that not-so-expert C++ programmers would use blindly without realizing what’s going on under the hood. And you’re basing your advice on GCC, which doesn’t have true C++11 thread_local support.
> I indeed overlooked the .join() call but… Well, why use threads in the first place? :) Output could have been out of order while still showing a single helper is created by thread.
If I ran everything in one thread, the result of the program would change: only one helper would be created. The output of the demonstration would be identical to a program using a normal static, without thread_local, and I think that would miss the point of the article.
> And somehow you’re validating an approach about threading code while having sequential execution.
My point isn’t really to validate the approach, it is to demonstrate a practical application of it. But even though I have sequential execution, I think it still validates the approach. It is clear from the output that one Helper is created per thread.
Here is the output with a function that uses static without thread_local:
STATIC:
Iteration 0 Helper created for thread A
Iteration 1
Iteration 2
Iteration 0
Iteration 1
Iteration 2
Iteration 0
Iteration 1
Iteration 2
So no matter if the threads are running in parallel or not, they still demonstrate the point.
(I updated the code on GitHub if you want to have a look.)
> All in all, if Helper is stateful and requires complex construction/destruction, maybe it’s not so much a helper in the first place and it shouldn’t be an optimization spot.
That’s a very good point.
> That’s the concern I had when reading your example code and decided to share my very own opinion with you. That sample code advocates language constructs not so expert C++ programmers would use blindly without realizing what’s going on under the hood.
Hopefully they are reading the comments then! :)
> And you’re basing your advice on GCC which doesn’t have the true C++11 thread_local support.
The approach is valid both with GCC __thread and thread_local, but I’ll update the post with a note about the unnecessary if.
By the way, do you know how the C++ runtime is supposed to handle the “one time only” initialization? Won’t it have to use some sort of branching to do a null check under the hood anyway?