In which I show how to use static thread locals to reuse function locals between invocations, and why this can be especially important when allocating on the heap.
Let’s say you are writing parallel software to process something. It has a single Processor object with a member function process(). This function gets called from several threads. To complete its task, it depends on a Helper object (ignore the output for now).
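    #include <iostream>
    using namespace std;

    class Helper {
    public:
        // Announces each construction so the output shows when it happens.
        Helper(char id) { cout << "Helper created for thread " << id; }
    };

    class Processor {
    public:
        void process(char data, int iteration) {
            cout << "Iteration " << iteration << " ";
            Helper h(data);  // constructed and destroyed on every call
            // Use the helper object
            cout << endl;
        }
    };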
The data parameter stands in for some data structure that is unique to each thread, so we need a unique Helper for each thread. We assume that the Helper object never has to change inside a thread.
In this first example, the Helper object is a local variable, which means a new instance gets created and destroyed for each invocation. If process() is called a lot, we might want to avoid this. Here is how to solve that using static thread locals:
    void process_threadlocal(char data, int iteration) {
        cout << "Iteration " << iteration << " ";
        // One pointer per thread; it starts out null in each new thread.
        static thread_local Helper* h = nullptr;
        if (h == nullptr)
            h = new Helper(data);
        // Use the helper object
        cout << endl;
    }
(Note that the null check is there because I am using GCC, which doesn’t support proper C++11 thread_local yet. In the future (or on some other compilers) you should be able to just type static thread_local Helper* h = new Helper(data);, or even better: static thread_local Helper h(data);)
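With a compiler that implements C++11 thread_local in full, the simpler form from the note above could look roughly like this. It is only a sketch: this Helper counts constructions instead of printing, and helpers_created and run_on_new_thread are names made up for the illustration, not part of the original code.

```cpp
#include <atomic>
#include <thread>

// Illustration only: counts how many Helper objects have been constructed.
std::atomic<int> helpers_created{0};

// Stand-in for the post's Helper; it counts instead of printing.
class Helper {
public:
    explicit Helper(char id) : id_(id) { ++helpers_created; }
    char id() const { return id_; }
private:
    char id_;
};

// h is initialized the first time each thread reaches the declaration
// and is then reused by every later call made on that same thread.
char process_threadlocal(char data) {
    static thread_local Helper h(data);
    return h.id();
}

// Hypothetical driver: runs `iterations` calls on a fresh thread and
// waits for that thread to finish.
void run_on_new_thread(char data, int iterations) {
    std::thread([=] {
        for (int i = 0; i < iterations; ++i) {
            process_threadlocal(data);
        }
    }).join();
}
```

Calling run_on_new_thread('A', 5) and then run_on_new_thread('B', 5) should leave helpers_created at 2: one Helper per thread, regardless of the number of iterations.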
Let’s have a look at the client code. The following main simulates three threads (named A, B and C) running 5 iterations of process each, first using the regular process(), then using process_threadlocal():
    #include <thread>

    int main() {
        Processor processor;

        cout << endl << "LOCAL:" << endl;
        for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
            // join() right away, so the by-reference captures stay valid
            thread([&]() {
                for (int i = 0; i < 5; ++i) {
                    processor.process(thread_id, i);
                }
            }).join();
        }

        cout << endl << "STATIC THREAD LOCAL:" << endl;
        for (char thread_id = 'A'; thread_id <= 'C'; ++thread_id) {
            thread([&]() {
                for (int i = 0; i < 5; ++i) {
                    processor.process_threadlocal(thread_id, i);
                }
            }).join();
        }
    }
(For simplicity, we wait for each thread to finish before starting the next.)
Here is the output:
LOCAL:
Iteration 0 Helper created for thread A
Iteration 1 Helper created for thread A
Iteration 2 Helper created for thread A
Iteration 3 Helper created for thread A
Iteration 4 Helper created for thread A
Iteration 0 Helper created for thread B
Iteration 1 Helper created for thread B
Iteration 2 Helper created for thread B
Iteration 3 Helper created for thread B
Iteration 4 Helper created for thread B
Iteration 0 Helper created for thread C
Iteration 1 Helper created for thread C
Iteration 2 Helper created for thread C
Iteration 3 Helper created for thread C
Iteration 4 Helper created for thread C
STATIC THREAD LOCAL:
Iteration 0 Helper created for thread A
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 0 Helper created for thread B
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 0 Helper created for thread C
Iteration 1
Iteration 2
Iteration 3
Iteration 4
As you can see, the first version creates a new Helper on every call, whereas the static thread local version creates it only once per thread.
But you said heap allocations!
I did. In the first example, I allocate the Helper on the stack. This might be a performance hit in itself, at least if the Helper constructor or destructor is expensive. What is far worse, however, is if Helper is allocated on the heap, or if any objects it creates are (or any objects created by the ones it creates, etc.).
To be thread safe, operator new has to synchronize access to the heap, and a simple implementation does this by locking the entire heap. That means all other threads that need the heap block every time one thread allocates something. This can ruin your parallel performance. Andrew Binstock wrote a very interesting article about this in Dr. Dobb's Go Parallel, which I highly recommend.
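The same trick applies to any heap-backed local, not just Helper. Here is a small sketch, assuming a function that needs scratch space on every call; sum_of_squares and scratch_buffer are hypothetical names invented for this example. Keeping the vector thread local means that, once it has grown, later calls on the same thread can reuse its storage instead of going back to the heap.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns this thread's own scratch vector.
// Its heap storage survives between calls made on the same thread.
std::vector<long>& scratch_buffer() {
    thread_local std::vector<long> buffer;
    return buffer;
}

// Hypothetical worker that needs n elements of scratch space per call.
long sum_of_squares(std::size_t n) {
    std::vector<long>& scratch = scratch_buffer();
    scratch.clear();  // removes the elements but keeps the capacity
    for (std::size_t i = 0; i < n; ++i) {
        scratch.push_back(static_cast<long>(i * i));
    }
    return std::accumulate(scratch.begin(), scratch.end(), 0L);
}
```

After one call with a large n, later calls with a smaller or equal n on that thread fit in the existing capacity, so they should not need to allocate at all. Whether this pays off in practice depends on your allocator, so measure first.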
But, but, memory leak!
Nicely spotted. There is a memory leak in my program, since I never delete the Helpers created in process_threadlocal. This might or might not matter, depending on your situation. The program will never leak more than one object per thread, so you might just ignore the problem, as memory is freed at the exit of the program anyway.
If Helper needs to free some other resource, however, or do something else in its destructor, one thing you could do is wrap the Helper* pointer in a unique_ptr. Then the destructor will be called for each Helper when its thread exits and that thread's static thread_local unique_ptr<Helper> is destroyed.
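A sketch of the unique_ptr variant, assuming a compiler with full C++11 thread_local support. This Helper tracks how many instances are alive instead of printing, and live_helpers and live_after_thread are names made up for the illustration.

```cpp
#include <atomic>
#include <memory>
#include <thread>

// Illustration only: number of Helper objects currently alive.
std::atomic<int> live_helpers{0};

// Stand-in for the post's Helper with a destructor that has work to do.
class Helper {
public:
    explicit Helper(char id) : id_(id) { ++live_helpers; }
    ~Helper() { --live_helpers; }
    char id() const { return id_; }
private:
    char id_;
};

// Each thread gets its own unique_ptr, initialized on that thread's
// first call. When the thread exits, the unique_ptr is destroyed and
// deletes the Helper it owns.
char process_threadlocal(char data) {
    static thread_local std::unique_ptr<Helper> h(new Helper(data));
    return h->id();
}

// Hypothetical check: runs two calls on a fresh thread, joins it, and
// reports how many Helpers are still alive afterwards.
int live_after_thread(char data) {
    std::thread([=] {
        process_threadlocal(data);
        process_threadlocal(data);
    }).join();
    return live_helpers.load();
}
```

On the implementations I know of, live_after_thread('A') should return 0, since the Helper is cleaned up as part of the thread's exit. A Helper created on the main thread this way would live until the program exits.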
The reason I am not doing that in this example is that GCC doesn’t support non-trivial thread local objects yet. (It doesn’t have C++11 thread locals, but it has its own __thread, which I have defined thread_local as.)
And finally, the usual disclaimer: Don’t go optimizing like this unless you have profiler data to back that decision. Until proven otherwise, readable code always wins!
As usual, the code for this blog post is available on GitHub. I have only tested it with GCC, but I guess it should work in Visual Studio 2011 as well.