Today we are open-sourcing warp, a fast preprocessor for the C and C++ languages, written by Walter Bright in a joint project with Facebook.

Companies that work with large C++ code bases pay close attention to build times, as they have a major impact on engineers’ productivity. At Facebook, we have a sophisticated build system that uses distributed compilation in conjunction with several cache layers backed by memcache. Part of this architecture relies on caching preprocessed files.

Conventional wisdom has it that preprocessing time is a negligible part of building a C++ binary. C++ is notoriously difficult to parse, which means that C++ parsers tend to be large and slow compared to their counterparts for other languages. Code generation is also quite the time sink, especially if you’re using optimizations. Finally, linking is a large serial step at the end of each build, and using the gold linker only helps so much. Therefore, it would seem that the lowly task of going through the included files and expanding macros would take a comparatively short time and be essentially impossible to improve upon.

Not so! Replacing gcc’s preprocessor with warp has led to significant improvements of our end-to-end build times (including linking). Depending on a variety of circumstances, we measured debug build speed improvements ranging from 10% all the way to 40%, all in complex projects with massive codebases and many dependencies. That’s not per-file speed, but rather global times measured for scenarios like “build after changing a header file.”

The reason for this, as well as the genesis story of warp, is best told by Walter Bright himself, whom Facebook commissioned to write warp.

Andrei Alexandrescu: Our experiments indicate solid overall build speed improvements by improving C++ preprocessing speed alone. As someone who spent his entire career writing compilers, do you find that surprising?

Walter Bright: Not at all, but to really figure out why, one needs some historical perspective. When C was created, RAM was a scarce commodity, so many programs wouldn’t fit in memory. Therefore, a common approach back then was to do multi-pass processing – a complex program would be divided into multiple, simpler batch stages, each going over the output of the previous stage and producing a file for the next stage. The specification of the C preprocessor follows that mold. For example, the very first step is to blindly replace trigraphs. The second removes each backslash followed by a newline. And so on and so forth. This design allowed each of these steps to be carried out by a separate program, but it is poorly suited to today’s compilers, with their large memories and parallelism-hungry CPUs.

And then there’s #include, a primitive textual inclusion mechanism designed with multistage batch processing in mind that has scaled poorly to today’s demands. In C and especially C++, it’s common that relatively short programs inflate to staggering sizes after preprocessing due to #included files. The net result is that (a) the early stages of C and C++ compilation need to “look” at each byte in the source file many times (I managed to get it down to five but not less); and (b) there are many, many bytes to look at. Later compilation stages stash data in higher-level structures better suited for lookup (tables, trees), but simply getting parsing off the ground takes a lot of work.

AA: Speaking of #include, could you give a little detail on how you handle the include guard trick in warp?

WB: Include guards are a tremendous optimization of preprocessing speed, so warp of course detects and supports the pattern, in addition to #pragma once. What’s interesting, however, is that the analysis you and I conducted together revealed how many files in the standard library and popular libraries such as Boost do not use include guards. Sometimes that is semantically necessary, and other times the author seems to have considered the file too small to warrant include guards. However, it turns out that calls to functions such as fstat and open are time-intensive enough to make any reduction in these calls measurable. At an extreme, we found an innocuous program that ended up including the same Boost header more than 7,000 times. So warp caches files (once opened and loaded) for possible use later. That accelerates a lot of builds significantly.
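To illustrate the caching idea, here is a minimal sketch in D (hypothetical code with a made-up loadCached function, not warp’s actual cache): the first lookup reads the file, and every later #include of the same path is satisfied from memory without touching the file system.

```d
import std.file : readText;

// Hypothetical sketch: remember each file's contents the first time it is
// opened, so repeated #includes cost a hash lookup instead of open/fstat/read.
private string[string] fileCache;   // path -> file contents

string loadCached(string path)
{
    if (auto hit = path in fileCache)
        return *hit;                 // already loaded: no file-system calls
    auto contents = readText(path);  // first time: read it once
    fileCache[path] = contents;
    return contents;
}
```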

AA: I’ve been looking through the warp source code closely, and I confess I was quite surprised. Each language has its own approach to text transformation–for example, high-performance processors written in C usually go with tight loops, unwieldy switch statements, and even the occasional assembler routine. Yet warp uses the pipes-and-filters (or, in D parlance, ranges-and-algorithms) style. In many ways, warp is close in structure to the clean functional style promoted by modern Haskell libraries. What motivated this change in style, and were you cramped by that choice, given that speed, not elegance, was the primary requirement for warp?

WB: It’s great that warp is open source, meaning anyone can pop the hood and see exactly how it works. There are two parts to responding to this. First, I’ll explain how I wrote warp quickly (for I indeed had a working prototype in two weeks); then, I’ll explain how I made it fast.

I would have approached warp very differently 10 years ago. I would have probably used C-style hand-written code for everything, and I am not above using goto (laughs). But I am always on the lookout for improving the “sound” of my coding, as I think any programmer should be. The reality is that a C preprocessor is fundamentally a filter program, so range-based algorithms and pipe composition fit it best. For warp, then, I hoped to get a lot of mileage from composing with range-based algorithms everywhere, and that is indeed what happened.

The neat thing about ranges and algorithms is that they are highly “componentized,” meaning ranges and algorithms can be inserted, removed, and replaced at will. The customized matching of the ranges to the algorithms is then done by the compiler at compile time. The ability to swap them in and out is key to rapidly experimenting with different versions and measuring how they perform.
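As a toy illustration of that style (not code taken from warp), here is a two-stage pipeline in D; either stage can be replaced without touching the other, and the compiler fuses them at compile time:

```d
import std.algorithm : equal, filter, map;
import std.ascii : isWhite, toUpper;

void main()
{
    auto source = "int x = 42;";

    // Stage 1 drops whitespace; stage 2 transforms each character.
    // Each stage is an independent component that can be swapped out
    // while the rest of the pipeline stays untouched.
    auto pipeline = source.filter!(c => !c.isWhite)
                          .map!(c => c.toUpper);

    // No intermediate buffers: elements flow through on demand.
    assert(pipeline.equal("INTX=42;"));
}
```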

But there’s another great aspect to ranges and algorithms. I started work by designing little utility algorithms, like ones to scan an identifier, or lex an integer. Since the algorithms were templates, their input and output ranges could be swapped out at compile time for specialized unit test versions. Unit testing then becomes simple and well-encapsulated: input could be easily “mocked up” to test every line of the algorithm. Mock testing at its best!
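Here is a small hedged sketch in D of what such a utility might look like (lexInteger is a hypothetical name, not warp’s actual lexer); because the input is just any range of characters, the unit test can mock it with a plain string:

```d
import std.ascii : isDigit;
import std.range.primitives : empty, front, popFront, isInputRange;

// Hypothetical sketch: read a decimal integer off the front of any input
// range of characters, advancing the range past the digits.
long lexInteger(R)(ref R input) if (isInputRange!R)
{
    long value = 0;
    while (!input.empty && input.front.isDigit)
    {
        value = value * 10 + (input.front - '0');
        input.popFront();
    }
    return value;
}

unittest
{
    // The input range is "mocked up" with an ordinary string.
    auto src = "1234; rest";
    assert(lexInteger(src) == 1234);
    assert(src == "; rest");   // the range now points past the number
}
```

The unittest block is compiled in with dmd’s -unittest switch, so each little algorithm carries its own tests.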

D compilers have a built-in coverage analyzer, which made it simple to add unit tests until every line was exercised. While that doesn’t prove the algorithm is bug-free, in my experience having 100% coverage makes for very few bugs found later. And, as we all know, the earlier bugs are found, the faster the development process proceeds. Coverage analysis and unit tests proved critically important in meeting the tight schedule. The development cycle here was:

edit-compile-coverage

Once the program is working, it’s time to turn to speed. What’s the secret to making code fast?

AA: Knowing you, I suspect you’ll say…

WB: “Use the profiler, Luke!”

AA: …something along those lines.

WB: I can guarantee that you are wrong about where your code is spending most of its time if you haven’t run a profiler on it. Profiling is so terribly important I built support for it into the D compiler. I, of course, was unsurprised that I was surprised at where the time was being spent. The development loop at this stage became:

edit-compile-profile

It’s sad how many programmers I run into who never use a profiler, and are convinced they can optimize without one. If you are one of those people, you’re wrong. The awful truth is that your code speed is always going to be third rate, and won’t even qualify for the team. It isn’t unusual to be able to double the speed of a program with a few quick changes the first time a profiler is used.

AA: But this gets to an important related question. Clearly using better, more modular abstractions in conjunction with good testing practices helps productivity. But what about efficiency? Isn’t composing ranges and algorithms impacting the attainable performance negatively? And to ask a question that must be asked, shouldn’t you have descended down to C if performance was a paramount goal? Didn’t the profiler indicate that the abstractions you used are “lossy” when it comes to performance?

WB: Should one choose C over D for maximum performance? My answer is an emphatic no, and I bring as evidence the fact that warp is significantly faster than the preprocessor in my own older Digital Mars C/C++ compiler tool chain, which was already the fastest on the market, and on which I had spared no effort optimizing and tuning over months and years. Modern compiler technology working in concert with D language fundamentals (white-box templates, simple object models) takes high-level compositions of templated code and optimizes them down to code that is virtually indistinguishable from what a C programmer would write for maximum efficiency.

The neato thing about D is that by separating the code into ranges and algorithms, it becomes easy to keep trying different range implementations and different algorithms. I tried all sorts of things. Like breeding racehorses, I kept the profile winners and culled the losers, step by step, through the code. That would have been a lot more difficult to do in other languages.

One can only get so far with that, though. Then comes the real stuff that separates the programmers who medal from the ones who just show up for the party. And that is–you’ve got to look at the assembler dumps from the top 10 functions on the profiler hit list. I know what you’re thinking: “eww, I’m using a compiled language to get away from assembler.” I’ve heard all the excuses. They’re from people who think hot-rodding a car is swapping out the car computer ROM with one they got from a catalog. Real hot-rodders pry the cylinder heads off and start using a die grinder on them. Programmers who medal look at the assembler output of the compiler, and start tuning the source code to generate asm code that runs hot.

For example, getting the hot variables into registers is of paramount importance. If they’re not, figure out why and adjust the code accordingly. Are the functions that must be inlined actually getting inlined? Are the functions that must not be inlined actually not being inlined? (Counter-intuitively, some functions muck up speed by being inlined. Those are functions that are rarely called, but if they’re inlined, the compiler enregisters their operations, starving the actual hot code of registers.)

Unfortunately, at the moment D doesn’t have a language feature to prevent inlining, but I’m not above using dirty tricks, and so I wrote those functions in such a way as to prevent them from being inlined.
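One common dodge, shown here purely as an illustration and not necessarily the trick warp uses, is to route the call through a mutable function pointer; because the compiler generally cannot prove the pointer’s target, it has to emit a real call:

```d
// Hypothetical example of keeping a cold function out of line.
int coldPath(int x) { return x * 2 + 1; }       // rarely taken

// Because this module-level pointer could be reassigned at run time,
// optimizers will generally not inline calls made through it.
int function(int) coldPathPtr = &coldPath;

int hotLoop(int x)
{
    if (x < 0)
        return coldPathPtr(x);   // indirect call stays a real call
    return x + 1;                // the hot path keeps its registers
}
```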

It’s true that tuning a program for maximum performance tends to compromise readability and makes for some odd-looking code. But warp could not be successful if it didn’t improve upon its competition.

AA: So it seems certain compiler optimizations are key to achieving good performance with idiomatic D code. What are those?

WB: warp is heavily dependent on modern compiler optimizations, and takes full advantage of them. In particular, it relies on:

  1. Inlining.
  2. Enregistering structs even if those structs are larger than a single register. (Unfortunately, dmd doesn’t do this, but gdc and ldc do.) Some back-and-forth experimenting is necessary to ensure the struct layout tickles that optimization.
  3. Passing/returning POD structs in registers. Again, this is worth checking.

In particular, the Textbuf structure is designed to be enregistered in two 64-bit registers. With optimizations (2) and (3), D ranges become every bit as efficient as C++ iterators. (D ranges contain two pieces of information, while C++ iterators contain only one.) I was initially worried about this, and even benchmarked an iterator instead of a range–it didn’t run any faster.
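For a concrete picture, here is a hedged sketch (not warp’s Textbuf) of a range whose entire state is two 64-bit words, a pointer and a length, which gdc and ldc can keep in registers across calls:

```d
// Hypothetical two-word range: with optimizations (2) and (3) above, the
// whole struct can live in two 64-bit registers.
struct CharSlice
{
    immutable(char)* ptr;   // 8 bytes
    size_t length;          // 8 bytes

    @property bool empty() const { return length == 0; }
    @property char front() const { return *ptr; }
    void popFront() { ++ptr; --length; }
}

unittest
{
    auto s = "abc";
    auto r = CharSlice(s.ptr, s.length);
    assert(!r.empty && r.front == 'a');
    r.popFront();
    assert(r.front == 'b');
}
```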

AA: Another interesting question is how you dealt with garbage collection. D allows alternative approaches to memory management, but its standard library still has a ways to go toward weaning itself off of GC use.

WB: Yeah, I can almost hear a bunch of readers chiming in: “But I thought GC languages were slow!” The thing is, D regards GC as just another tool. Very little of the D core language actually requires GC support. As in all high performance applications, regardless of how memory is managed, careful attention must be paid to where, when, and why memory is allocated.

The fastest memory allocation strategy is to do no allocation at all. The next fastest is to allocate on the stack. Low-level routines that need to write results to buffers don’t know who needs those buffers, how long-lived they must be, the size of them, etc. Therefore, by writing to OutputRanges, which are supplied as a template parameter, all these decisions are pushed “up” the function activation stack. Sometimes this makes the need for intermediate result buffers disappear altogether, or (more usually) they can be allocated as a local array on the stack.

warp has a custom OutputRange for this purpose called Textbuf, which is passed a local array for storage. If that storage overflows, it automatically fails over to using malloc/free.
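The following is a minimal sketch of that failover idea, not warp’s actual Textbuf (which lives in the open-source repository and handles more cases): an output range that writes into a caller-supplied local array and only reaches for malloc when that array runs out.

```d
import core.stdc.stdlib : malloc, realloc, free;
import core.stdc.string : memcpy;

// Hypothetical sketch of a stack-first output range.
struct StackBuf
{
    private char[] local;   // caller-supplied stack storage
    private char*  heap;    // null until the local buffer overflows
    private size_t cap;     // capacity of the heap buffer
    private size_t len;     // characters written so far

    this(char[] storage) { local = storage; }
    @disable this(this);    // no copies: keeps ownership of `heap` simple

    // OutputRange interface: accept one character at a time.
    void put(char c)
    {
        if (heap is null)
        {
            if (len < local.length) { local[len++] = c; return; }
            // Fail over: move what we have into a malloc'd buffer.
            cap = local.length * 2 + 16;
            heap = cast(char*) malloc(cap);
            memcpy(heap, local.ptr, len);
        }
        else if (len == cap)
        {
            cap *= 2;
            heap = cast(char*) realloc(heap, cap);
        }
        heap[len++] = c;
    }

    const(char)[] data() const
    {
        return heap is null ? local[0 .. len] : heap[0 .. len];
    }

    ~this() { if (heap !is null) free(heap); }
}

unittest
{
    char[8] tmp;                  // deliberately small, to exercise the failover
    auto sink = StackBuf(tmp[]);
    foreach (c; "hello, warp")    // 11 characters overflow the 8-byte buffer
        sink.put(c);
    assert(sink.data == "hello, warp");
}
```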

All in all, the garbage collector hasn’t even been on the short list of efficiency-related issues in warp, though it clearly might be for other kinds of programs. The notable thing here is that D offers a variety of abstractions for allocating memory that greatly reduce both the amount of litter created and the difficulty of eliminating it.

AA: What do you think are warp’s best areas of future improvement?

WB: warp is currently able to preprocess many source files, one after the other, in a single command. The speed savings for this are:

  1. warp is set up/torn down only once for the entire set of files, rather than once for each source file.
  2. warp caches all file lookups and source file contents. Since multiple source files often #include the same set of headers, those files are already loaded and ready to be reused.

Since preprocessing multiple source files is an inherently parallelizable process, a further speed improvement for warp would be to assign a thread to each source file.
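A hedged sketch of what that might look like with std.parallelism follows; preprocessFile here is a hypothetical stand-in for running the preprocessor over one translation unit:

```d
import std.parallelism : parallel;

// Hypothetical placeholder for preprocessing a single source file.
void preprocessFile(string path)
{
    // ... run the preprocessor over `path` ...
}

void preprocessAll(string[] sources)
{
    // Each loop body runs on a worker thread from the default task pool,
    // so independent source files are preprocessed concurrently.
    foreach (src; parallel(sources))
        preprocessFile(src);
}
```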

I’m sure all the oil isn’t squeezed out of that olive yet. Medaling is only a temporary victory. The best thing is that warp is open-source. I appreciate that Facebook is open-sourcing major technology components, and I fully expect to be surprised by others’ ideas going forward. Mr. Sulu, ahead Warp factor 5!

Acknowledgments

Many thanks to Walter for this interview, and to Nicholas Ormrod and Bryan O’Sullivan for reviewing drafts of this post.

And join the D language community for the D Conference 2014 on May 21-23 in Menlo Park, CA.

