How to avoid includes in headers


I have now written twice about why you should minimize the use of include in header files. As one reader on Reddit politely put it (“crap article”), it is about time I write a post about how to do that.

There is mainly two things you can do:

  1. Reduce the number of headers you include.
  2. Reduce the contents of those headers.

1: Reducing the number of includes

Why do you need includes in the header in the first place? When you compile a file that includes header A, all the names that are used in header A need to be defined. So if header A uses an name defined in header B, header A should include header B (or else anyone who includes A will have to include B as well).

What do I mean by “using” a name then? Basically everything that mentions the name, except as a pointer or reference. Some examples of using a name:

SomeClass f; //Creating an object
some_function(); //Calling a function
class Derived : public SomeClass; //Inheriting from a class

Some examples of mentioning a name but not using it (only needs access to a declaration, not the definition):

SomeClass* f1; //Declaring a variable of pointer type
SomeClass& f2; //Declaring a variable of reference type

As soon as you want to use those though, they must be completely defined:

f1->function(); //Needs definition
f2->member; //Needs definition

Also note that the new type of enum, (enum class) only needs to be declared to be used, as long as you don’t actually mention its value:

SomeEnum e; //Declaring a variable of oldschool enum type needs definition.
SomeEnumClass e; //Declaring a variable of newschool enum type does not need definition...
SomeEnumClass e = SomeEnumClass::VALUE; //...but using a value does.

So given this knowledge, how do we reduce the number of includes in headers?

The first thing to do is to make sure you only include what is necessary to get the header file to compile. If you have a class definition in class.h and the implementation in class.cpp, the latter will typically need includes that are not needed by the header, and so should be put in the cpp. A simple way to check that you don’t have any unnecessary includes is to create an empty cpp file that only includes class.h. If it compiles, all the necessary headers are there. Then you can try commenting out includes in class.h and see which ones actually need to be there. Move the rest down to class.cpp.

The next thing to do is to check whether you actually need access to the definitions, or if a declaration will do. Try commenting out one #include at a time, and checking which names the compiler complains about. Then see if you are actually using that name, or if you are merely holding a pointer or a reference to an object of the type. If so, move the include down to the .cpp file, and add a forward declaration. Here’s an example:

#include "stuff.h"

class Ohlson
{
public:
    void doThings(Stuff* stuff);
};

The include stuff.h in not needed, as we never use stuff. Instead, we can use a forward declaration:

class Stuff; //Forward declaration instead of include

class Ohlson
{
public:
    void doThings(Stuff* stuff);
};

In the definition of Ohlson::doThings(Stuff* stuff) however (typically in ohlson.cpp), we probably need access to the definition of Stuff and need to include it.

So one of the things that will reduce the number of includes in headers is using pointers/references instead of values. There are other considerations to design as well though, so I would not advocate moving from values to pointers without considering the full picture.

Another thing that helps is to have the definition of functions in the .cpp file instead of the .h file. This however prevents inlining, so again, use caution.

A final technique that could help is the Pimpl Idiom. Briefly explained, instead of keeping your private members in the header, you keep them in a struct in the .cpp file. This means you don’t need access to their definitions in the header file, and can move their respective includes down to the .cpp file as well.

Reducing the content of includes

The other thing you can do to reduce the negative effects of includes in headers, if you cannot avoid them, is to reduce the content of those included headers. Here are some techniques:

C++ supports several paradigms, but a good recommendation for most is to depend on abstractions instead of implementations. In object oriented programming, this is done by programming to interfaces. C++ doesn’t have an explicit interface concept, but a class with only pure virtual methods is the same thing. Interfaces should also be kept small and focused. This means an interface should be very quick to compile, reducing the pain of having it included in many files. You will however still need to recompile all the files that include it when it changes, but the more focused it is, the fewer files will depend on it.

A generalization of the above is to avoid large inheritance hierarchies, even if you aren’t able to only use pure interfaces.

If you absolutely need to include a file in your header, but it is one that you own, you can reduce the content of that file by applying the techniques mentioned in the first part of this article to that file. (This is just section 1 seen from the other side of the table.)

A less straightforward technique is to be a bit clever with your templates. Templates are usually completely defined in header files, meaning that a change in the implementation triggers a recompile of everything that uses that template. There is however ways to move the implementation of templates out of header files, for instance as in this Dr. Dobbs article. C++11’s extern template can also help reduce compilation time. Avoiding the use of templates completely of course also solves the problem, but there was probably a reason why you wanted them in the first place.

One final technique that cannot go unmentioned is precompiled headers. While I don’t like what they do to dependencies, I consider them a necessary evil in medium to large projects to keep compile-times reasonable. Precompiled headers are very well explained in the first link in this paragraph, but here’s a short version: Headers can take a long time to compile. If they also change rarely, it would be nice to be able to cache them, and this it what precompiled headers does. In Visual Studio for instance, you put these includes in a file conventionally called stdafx.h, and then include stdafx.h in all your cpp files instead of the actual headers you need. The compiler then only has to compile them once. Headers from the standard library are typically placed here, and third party headers are also usually a good fit.

Those are the techniques I could think of, but I am sure I missed some. Which are you favourite ones? Which did I miss? Which of them are silly? Please use the comment section below.

If you enjoyed this post, you can subscribe to my blog, or follow me on Twitter.

Undefined Behaviour — Worse Than its Reputation?


Last week I wrote about The Difference Between Unspecified and Undefined Behaviour. This week I’d like to expand a bit more on the severity of undefined behaviour. If however you have a lot of time, instead go read A Guide to Undefined Behavior in C and C++ by John Regehr of the University of Utah, and then What Every C Programmer Should Know About Undefined Behavior by Chris Lattner of the LLVM project, as they cover this material in much more depth (and a lot more words!) than I do here.

To expand on the example from last week, what is the output of this program?

int main()
{
    int array[] = {1,2,3};
    cout << array[3] << endl;
    cout << "Goodbye, cruel world!" << endl;
}

A good guess would be a random integer on one line, then “Goodbye, cruel world!” on another line. A better guess would be that anything can happen on the first line, but then “Goodbye, cruel world!” for sure is printed. The answer is however that we can’t even know that, since If any step in a program’s execution has undefined behavior, then the entire execution is without meaning. [Regehr p.1].

This fact has two implications that I want to emphasize:

1: An optimizing compiler can move the undefined operation to a different place than it is given in the source code
[Regehr p.3] gives a good example of this:

int a;

void foo (unsigned y, unsigned z)
{
  bar();
  a = y%z; //Possible divide by zero
}

What happens if we call foo(1,0)? You would think bar() gets called, and then the program crashes. The compiler is however allowed to reorder the two lines in foo(), and [Regehr p.3] indeed shows that Clang does exactly this.

What are the implications? If you are investigating a crash in your program and never see the results of bar(), you might falsely conclude that the bug in the sourcecode must be before bar() is called, or in its very beginning. To find the real bug in this case you would have to turn off optimization, or step through the program in a debugger.

2: Seemingly unrelated code can be optimized away near a possible undefined behaviour
[Lattner p.1] presents a good example:

void contains_null_check(int *P) {
  int dead = *P;
  if (P == 0)
    return;
  *P = 4;
}

What happens if P is NULL? Maybe some garbage gets stored in int dead? Maybe dereferencing P crashes the program? At least we can be sure that we will never reach the last line, *P = 4 because of the check if (P == 0). Or can we?

An optimizing compiler applies its optimizations in series, not in one omniscient operation. Imagine two optimizations acting on this code, “Redundant Null Check Elimination” and “Dead Code Elimination” (in that order).

During Redundant Null Check Elimination, the compiler figures that if P == NULL, then int dead = *P; results in undefined behaviour, and the entire execution is undefined. The compiler can basically do whatever it wants. If P != NULL however, there is no need for the if-check. So it safley optimizes it away:

void contains_null_check(int *P) {
  int dead = *P;
  //if (P == 0)
    //return;
  *P = 4;
}

During Dead Code Elimination, the compiler figures out that dead is never used, and optimizes that line away as well. This invalidates the assumption made by Redundant Null Check Elimination, but the compiler has no way of knowing this, and we end up with this:

void contains_null_check(int *P) {
  *P = 4;
}

When we wrote this piece of code, we were sure (or so we thought) that *P = 4 would never be reached when P == NULL, but the compiler (correctly) optimized away the guard we meticulously had put in place.

Concluding notes
If you thought undefined behaviour only affected the operation in which it appears, I hope I have convinced you otherwise. And if you found the topic interesting, I really recommend reading the two articles I mentioned in the beginning (A Guide to Undefined Behavior in C and C++ and What Every C Programmer Should Know About Undefined Behavior). And the morale of the story is of course to avoid undefined behaviour like the plague.

If you enjoyed this post, you can subscribe to my blog, or follow me on Twitter.

Don’t be Afraid of Returning by Value, Know the Return Value Optimization


In which I argue you shouldn’t be afraid of returning even large objects by value.

If you have somewhat large collections of somewhat large objects in a performance-critical application, which of the following functions would you prefer?

void getObjects(vector<C>& objs);
vector<C> getObjects();

The first version looks faster, right? After all, the second one returns a copy of the vector, and to do that, all the elements have to be copied. Sounds expensive! Better then, to pass inn a reference to a vector that is filled, and avoid the expensive return.

The second version is however easier to use, since it communicates more clearly what it does, and does not require the caller to define the vector to be filled. Compare

    doSomethingWith(getObjects());

against the more cubmersome

    vector<C> temp;
    getObjects(temp);
    doSomethingWith(temp);

Sounds like a classic tradeoff between speed and clarity then. Except it isn’t! Both functions incur the exact same number of copies, even on the lowest optimization levels, and without inlining anything. How is that possible? The answer is the Return Value Optimization (RVO), which allows the compiler to optimize away the copy by having the caller and the callee use the same chunk of memory for both “copies”.

If you got the point, and take my word for it, you can stop reading now. What follows is a somewhat lengthy example demonstrating the RVO being used in several typical situations.

Example
Basically, I have a class C, which counts the times it is constructed or copy constructed, and a library of functions that demonstrate slightly different ways of returning instances of C.

Here are the getter functions:

C getTemporaryC() {
	return C();
}

C getLocalC() {
	C c;
	return c;
}

C getDelegatedC() {
	return getLocalC();
}

vector<C> getVectorOfC() {
	vector<C> v;
	v.push_back(C());
	return v;
}

I then call each of these functions, measuring the number of constructors and copy constructors called:

int main() {
	C c1;
	print_copies("1: Constructing");

	C c2(c1);
	print_copies("2: Copy constructing");

	C c3 = getTemporaryC();
	print_copies("3: Returning a temporary");

	C c4 = getLocalC();
	print_copies("4: Returning a local");

	C c5 = getDelegatedC();
	print_copies("5: Returning through a delegate");

	vector<C> v = getVectorOfC();
	print_copies("6: Returning a local vector");
}

Update: I used gcc 4.5.2 to test this. Since then, people have tested using other compilers, getting less encouraging results. Please see the comments, and the summary table near the end.

This is the result:

1: Constructing used 0 copies, 1 ctors.
2: Copy constructing used 1 copies, 0 ctors.
3: Returning a temporary used 0 copies, 1 ctors.
4: Returning a local used 0 copies, 1 ctors.
5: Returning through a delegate used 0 copies, 1 ctors.
6: Returning a local vector used 1 copies, 1 ctors.

Discussion
1 and 2 are just there to demonstrate that the counting works. In 1, the constructor is called once, and in 2 the copy constructor is called once.

Then we get to the interesting part; In 3 and 4, we see that returning a copy does not invoke the copy constructor, even when the initial C is allocated on the stack in a local variable.

Then we get to 5, which also returns by value, but where the initial object is not allocated by the function itself. Rather, it gets its object from calling yet antother function. Even this chaining of methods doesn’t defeat the RVO, there is still not a single copy being made.

Finally, in 6, we try returing a container, a vector. Aha! A copy was made! But the copy that gets counted is made by vector::push_back(), not by returning the vector. So we see that the RVO also works when returning containers.

A curious detail
The normal rule for optimization used by the C++ standard is that the compiler is free to use whatever crazy cheating tricks it can come up with, as long as the result is no different from the non-optimized code. Can you spot where this rule is broken? In my example, the copy constructor has a side effect, incrementing the counter of copies made. That means that if the copy is optimized away, the result of the program is now different with and without RVO! This it what makes the RVO different from other optimizations, in that the compiler is actually allowed to optimize away the copy constructor even if it has side effects.

Conclusion
This has been my longest post so far, but the conclusion is simple: Don’t be afraid of returning large objects by value! Your code will be simpler, and just as fast.

UPDATE: Several people have been nice enough to try the examples in various compilers, here is a summary of the number of copies made in examples 3-6:

Compiler Temporary Local Delegate Vector SUM Contributed by
Clang 3.2.1 0 0 0 1 1 Anders S. Knatten
Embarcadero RAD Studio 10.1 U. 2 (clang) bcc32c/bcc64 0 0 0 1 1 Eike
GCC 4.4.5 0 0 0 1 1 Anders S. Knatten
GCC 4.5.2 0 0 0 1 1 Anders S. Knatten
GCC 4.5.2 -std=c++0x 0 0 0 1 1 Anders S. Knatten
GCC 4.6.4 -std=c++0x 0 0 0 1 1 Anders S. Knatten
GCC 4.7.3 -std=c++0x 0 0 0 1 1 Anders S. Knatten
Visual Studio 2008 0 0 0 1 1 Anders S. Knatten
Visual Studio 2010 0 0 0 1 1 Dakota
Visual Studio 2012 0 0 0 1 1 Dakota
Visual Studio 2013 Preview 0 0 0 1 1 Dakota
Visual Studio 2005 0 0 0 2 2 Dakota
IBM XL C/C++ for AIX, V10.1 0 0 0 2 2 Olexiy Buyanskyy
IBM XL C/C++ for AIX, V11.1 (5724-X13) 0 0 0 2 2 Olexiy Buyanskyy
IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72) 0 0 0 2 2 Olexiy Buyanskyy
Embarcadero RAD Studio 10.1 Update 2 (prev gen) bcc32 0 1 1 2 4 Eike
Embarcadero RAD Studio XE relase build 0 1 1 2 4 Rob
Sun C++ 5.8 Patch 121017-14 2008/04/16 0 1 1 2 4 Bruce Stephens
Sun C++ 5.11 SunOS_i386 2010/08/13 0 1 1 2 4 Asgeir S. Nilsen
Sun C++ 5.12 SunOS_sparc Patch 148506-18 2014/02/11 0 1 1 2 4 Olexiy Buyanskyy
Visual C++ 6 SP6 (Version 12.00.8804) [0-3] 0 1 1 2 4 Martin Moene
HP ANSI C++ B3910B A.03.85 0 1 2 2 5 Bruce Stephens

UPDATE 2: Thomas Braun has written a similar post, including more intricate examples and move semantics. Read it here (pdf).

You can download all the example code from this post at Github.

If you enjoyed this post, you can subscribe to my blog, or follow me on Twitter.

How ++c can be Faster than c++


In for-loops, people will tell you to always increment using ++c instead of c++. Why exactly is this? And will it affect your program in any way?

The observable difference between ++c and c++ is that the former evaluates to c+1, and the latter to c:

int main() {
    int c = 1;
    cout << ++c << endl;  // 2
    cout << c++ << endl;  // still 2
    cout << c << endl;  // 3
}

First of all, to get that question out of the way: This does not mean that the first alternative will increment c at the beginning of the iteration. Both alternatives increment c at the end of each iteration:

int main() {
    for (int c = 0; c < 1; c++) {
        cout << c << endl; // 0
    }
    for (int c = 0; c < 1; ++c) {
        cout << c << endl; // still 0
    }
}

So what then, is the point? Speed is.

Consider how the ++c operator would be implemented. No one will ever use the old value of c, so the implementer is free to increment c and then provide a reference to the freshly updated object. Now consider c++. Here, the implementer needs to provide a reference to the not-updated object, but still increment it. A common way to do this is to make an unmodified copy of c, increment the original c, and then return the copy. See the following example:

struct C {
    int _i;
    C(int i = 0) : _i(i){}

    C& operator++() {
        _i++;
        return *this;
    }

    C operator++(int) {
        C tmp(_i);
        _i++;
        return tmp;
    }
};

ostream& operator<< (ostream& os, const C& c) {
    os << c._i;
    return os;
}

int main () {
    C c;
    cout << c << endl; //0
    cout << c++ << endl; //0
    cout << ++c << endl; //2
}

As you can see, this introduces unnecessary overhead, especially for complex classes.

But what about built in types? For these, there will typically be no overhead using c++, but if you get into the habit of always using ++c, you won’t forget when you are incrementing more complex classes.

And as @tfheen points out, “[cases like the above example will be optimized away by] any compiler less than 30 years old.”, so this advice is probably not that relevant anymore. But now at least you know the reasoning when you see it.