Monday, April 21, 2014

Linktime optimization in GCC, part 1 - brief history

Link-time optimization (LTO) is one of the most actively developed features of GCC.
Since 2010 I have been working on getting LTO to work well with real world applications, and with the GCC 4.9 release candidate out, perhaps it is a good time to share my experiences.

I plan to write about getting Firefox (Update: part 2 is here), Libreoffice, Chromium and other large applications to work with LTO. This will hopefully give a broader idea of what kind of setup and portability issues one can hit. I will also show some benchmarks. Today, however, I will start with a short review of the history and current status of the implementation of link-time optimization in GCC.

Update: See also Martin Liška's post on building Gentoo with LTO.

Classical organization of the toolchain

The classical UNIX toolchain optimizes on a translation-unit basis (that is, more or less, per source file). The compiler parses the source, produces a representation in the intermediate language (IL), optimizes it and eventually generates assembly. The assembly file is turned into an object file by the assembler. The resulting binary is produced by the linker. At runtime the dynamic linker links multiple DSOs together and executes the program. Finally, the dynamic linker may be used again via dlopen to load more libraries/plugins. Practically all non-trivial optimizations happen in the compiler, while some limited tricks are performed by both the assembler and the linker.

Early simple compilers internally worked on a statement basis; it was often the case that the whole program would not fit into memory (I will call this compilation scheme statement-at-a-time). Statement-at-a-time compilation greatly limits the optimizations one can perform. Generally one can do only those transformations that are obvious from considering a single statement in isolation (such as expression simplification), though some basic inter-statement optimization is possible too, such as basic common subexpression elimination or peephole optimization. Such optimizations can be implemented, at least in a limited form, as a single pass through the instruction stream without ever looking back.

Extending the scope of compilation from a statement to a function (function-at-a-time compilation) permits most classical optimizations, including constant propagation, common subexpression elimination, loop optimization, register allocation, advanced instruction selection and instruction scheduling.

GCC started with a mixture of the statement-at-a-time and function-at-a-time approaches. Frontends generated intermediate code on a per-statement basis and immediately lowered it to a very low-level (close to assembly) intermediate language called RTL (Register Transfer Language). According to Richard Stallman this was mostly done to deal with the stack size and memory limitations of the M68k machine he developed GCC on. The rest of the compiler worked function-at-a-time. Even early GCCs were thus able to do basic function-wide optimizations, and by the late 2.9 series the optimizers had developed into a strong low-level optimization engine that was more portable to new architectures than any other compiler available at that time. The optimization queue consisted primarily of a strengthened form of common subexpression elimination (global value numbering), loop optimization (loop unrolling, strength reduction), jump optimization (including conditional jump generation), instruction combining, register allocation and peephole optimization.

Cygnus was a pioneering free software company that made its living from porting GCC to new architectures. The main focus in the late 90s was on making the originally CISC-centric optimizers work well with RISC and later VLIW instruction sets, and on generally improving the set of intra-procedural optimizations. An important development change happened in 1997 with the introduction of the EGCS fork, which enabled some deeper changes in the compiler.

Compiling C++

In the 90s and early 2000s, the popularity of C++ was on the rise while the language itself was still a quickly moving target (C++ was standardized in 1998 with an update in 2003, and the C++ ABI took a few extra years to settle down). This brought challenges not only for the developers of the C++ front-end, library and runtime (a separate topic I do not know enough about to write on), but soon enough for the backend developers, too.

In the late 90s GCC was converted to function-at-a-time parsing. This was motivated mostly by the need for a reasonable implementation of C++ templates. Here one wants to keep a close-to-source representation (generally called an Abstract Syntax Tree, or AST) of templates until instantiation happens. Just like a preprocessor macro, a template alone has no semantics that can easily be represented in a classical compiler IL. This is hard to implement in a front-end that has no internal representation of the whole function body.

Dealing with abstraction penalty

C++ programmers relied more and more on the compiler's ability to remove the abstraction penalty. It became essential to reliably inline functions, optimize out unreachable function bodies and do higher-level transformations on function bodies (such as replacing data stored in instances of objects by scalar variables after all methods were inlined). C++ programmers started to make strong assumptions about the compiler's ability to clean up their constructs, assumptions that were often not very well thought out from a compiler developer's perspective.

Function inlining, one of the main offenders, was not GCC's strongest point, as the compiler still worked in function-at-a-time mode: the GCC backend was organized as a library used by the frontend whenever it needed to produce something into the assembly file. The order in which things were compiled was thus completely out of the backend's control, and the scope for inter-procedural optimization in general was extremely limited.

Bottom-up inlining

This is how the impossible (inter-procedural optimization within a backend that generally works one function at a time) was implemented.

After a function arrived from the frontend, the backend performed the following steps:
  1. The inliner inlined calls to all functions whose bodies it had previously memoized.
  2. Early optimization passes performed some basic optimizations (dead code removal, jump optimization).
  3. The inliner checked whether the optimized function was small enough, i.e. whether the number of instructions was less than 8*(8+num_parameters), and if so, it remembered the function for further inlining. Functions declared inline were remembered if their bodies did not exceed 600 instructions.
Such an inliner implementation is called bottom-up: if you imagine the callgraph as a tree growing down, the inliner starts at the bottom (the leaves) and propagates inline decisions upwards. Obviously such an inliner could inline only in the forward direction (the function must have been defined before it was used), and it was also believed to consume a lot of memory, because it remembered the expanded function bodies with all inline decisions already applied.
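The forward-order limitation can be illustrated with a small, hypothetical C example (the function names are invented): with a bottom-up inliner, only calls to functions whose bodies have already been seen can be expanded.

```c
/* A hypothetical illustration of the forward-order limitation of the
   old bottom-up inliner: a call could only be inlined if the callee's
   body had already been seen and memoized. */
static int add_one(int x) { return x + 1; }

int forward_call(int x)
{
    return add_one(x);        /* add_one already seen: could be inlined */
}

static int add_two(int x);    /* only a declaration at this point */

int backward_call(int x)
{
    return add_two(x);        /* body not yet seen: stayed an out-of-line call */
}

static int add_two(int x) { return x + 2; }
```

Both versions are semantically identical; the difference was purely in what the old inliner could see at the time the caller was compiled.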

Top-down inlining (GCC 3.0)

The old inliner was rewritten in 1999 by CodeSourcery (originally for C++ only, later ported to other languages) to work on the high-level form. In the new implementation inlining was moved into the front-end (in fact, both inliners coexisted for a few years before the new inliner was ported to all front-ends).

At that time, the C++ front-end already saved function bodies in an AST-like representation to handle templates. It also did not output most functions immediately after parsing them. The majority of functions were kept in high-level form until the very end of compilation, when a basic reachability analysis was performed to eliminate dead functions and to avoid the unnecessary work of lowering them into the RTL representation. This was done because C++ include files tend to contain a lot of unused inline functions.

Adding inlining was conceptually simple. At the end of the compilation, the compiler started with the functions that were clearly necessary (i.e. their address was taken, they were exported from the current compilation unit, and so on). To every function the following steps were applied:
  1. It was decided whether the function was inlinable, and if so, its body was saved for later use.
  2. Calls to all small enough inlinable functions were inlined, and recursively also all such calls within the inlined function bodies.
  3. The function was optimized and output to assembly. The expanded function body was released from memory.
  4. Every function mentioned in the assembly output was added to the list of necessary functions.
This approach is called top-down because it works from the root of the tree to the leaves.

One problem was how to define small functions: the high-level representation used C++-style statements rather than assembler-like instructions. The initial implementation simply counted every statement as 10 instructions on average; functions declared inline were inlined if their size was less than 500 in this estimate, and functions were auto-inlined if their size was less than 100.

PR8361, tramp-3d and other compile time/memory use disasters

The new inliner turned out to have serious issues. One of the more famous examples at that time was bug PR8361 (filed in 2002 by Gerald Pfeifer), which exposed an exponential explosion in the compiler's memory usage while building a template-heavy C++ program. This bug defeated solutions based on simple hacks to GCC and nicely exposed the problems of the new design.

The first problem is the use of inline in C++ code. Functions declared in headers are often implicitly inline only because of developer laziness, and the inline keyword in general is not as useful as one would expect. This means that the compiler cannot honor all inline functions in the program: inlining all of them would make the compiler explode on the vast majority of modern C++ sources. We had some relatively heated discussions about it in 2005, and of course making the compiler selective about inline functions made low-level developers angry. For this reason the always_inline attribute was born.
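The attribute survives to this day. A minimal sketch of its use (the helper below is an invented example): GCC inlines an always_inline function regardless of its size heuristics and reports an error if inlining is impossible.

```c
/* clamp01 is a made-up helper; the attribute is the point here.
   GCC expands always_inline calls even at -O0 and even when the
   regular inliner heuristics would reject the function. */
static inline int __attribute__((always_inline)) clamp01(int x)
{
    return x < 0 ? 0 : (x > 1 ? 1 : x);
}

int normalize(int x)
{
    /* Guaranteed to be expanded inline, not a heuristic decision. */
    return clamp01(x);
}
```

This gives low-level developers the hard guarantee that the plain inline keyword no longer provides.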

A new, and much more evil, testcase was prepared by Richard Guenther (Biener), who later became one of the most prolific GCC developers. His PhD research project was implemented using POOMA, and to this day it serves as a very useful benchmark for new inliner patches. SUSE's benchmarking server nicely shows how the performance steadily improved in 2006-2009 until the most important problems were finally resolved.

Unit-at-a-time (GCC 3.4)

Originally as an experiment to have something to do during the first SuSE Labs conference, I started working on reorganizing GCC to work on a unit-at-a-time basis. At that time it was already obvious that function-at-a-time optimization brought limitations we did not like. Most importantly, function inlining worked only in the forward order: if you called a function before defining it, it would not be inlined.

In GCC 3.4 I merged the bare bones of the new callgraph module (described in this paper), which made the compiler first build the intermediate language representation of the whole unit and only then start optimizing.

The new inliner was something I put together as a simple proof-of-concept inter-procedural optimization pass. It used a simple greedy algorithm that inlined all small enough functions in the order given by a badness metric (num_calls*size) until the overall unit growth or function growth limits were hit. The algorithm optimized for the number of functions fully inlined into their callees. The limits were basically a security precaution to avoid exploding on inline bombs and exposing non-linearities of the compiler on very large function bodies.
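The greedy scheme can be sketched roughly like this (a toy model, not GCC's actual code; the structure, field names and limits are all invented): repeatedly pick the not-yet-inlined call edge with the lowest badness and expand it, stopping once the unit growth limit would be exceeded.

```c
/* Toy model of a greedy inliner driven by a badness metric.
   All names and numbers are invented for illustration. */
struct inline_edge {
    int num_calls;   /* number of call sites referencing the callee */
    int size;        /* estimated callee body size in instructions */
    int inlined;     /* set once the edge has been inlined */
};

/* Returns the final unit size after greedy inlining. */
int greedy_inline(struct inline_edge *edges, int n, int unit_size, int limit)
{
    for (;;) {
        int best = -1;
        long best_badness = 0;
        for (int i = 0; i < n; i++) {
            long badness;
            if (edges[i].inlined)
                continue;
            badness = (long)edges[i].num_calls * edges[i].size;
            if (best < 0 || badness < best_badness) {
                best = i;
                best_badness = badness;
            }
        }
        if (best < 0)
            break;                               /* nothing left to inline */
        if (unit_size + edges[best].num_calls * edges[best].size > limit)
            break;                               /* unit growth limit hit */
        unit_size += edges[best].num_calls * edges[best].size;
        edges[best].inlined = 1;
    }
    return unit_size;
}
```

The growth limit is what keeps such a pass from exploding on inline bombs: however attractive the remaining edges look, expansion stops once the budget is spent.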

A bit more detail about the early design appears in the 2006 GCC summit paper and in the 2007 GCC summit paper.

The inliner heuristics were then incrementally improved into their present form, and the callgraph module has grown into a full inter-procedural optimization framework with additional optimizations implemented, such as constant propagation, function specialization, partial inlining, inter-procedural points-to analysis (still not enabled by default) and others. The inliner still remains its most important component, as it probably is in every optimizing compiler.

Combining translation units via --combine (GCC 3.4)

GCC 3.4 was released in 2004. Since the late 90s the better commercial compilers, and more recently the open-sourced Open64, had been capable of inter-module optimization via various implementations of link-time optimization. Here GCC was held back by both technical and political problems.

Classical link-time optimization is implemented by storing the intermediate language into "fake" object files. To make this possible, one needs a well-defined representation of the whole compilation unit and the whole program. Because the backend had grown up from per-function compilation, it did not really have this.

At the same time, the Free Software Foundation was concerned about the consequences of adding a well-defined on-disk representation of the GCC IL that could be used to hook GCC into proprietary front-ends and back-ends without violating GPL 2.

The first attempt at intermodule optimization was implemented by Apple developer Geoff Keating. His implementation extended the C front-end to allow parsing multiple source files at once and presenting them to the backend as one translation unit. While this feature allowed some early experiments with whole-program optimization, it also had serious limitations. It worked for C only, and plans to extend it to C++ were never realized. It nevertheless worked well for many simpler programs, and it was also able to detect programming errors that would go unnoticed with the traditional scheme.

The new middle end (tree-SSA, GCC 4.0)

These optimizations became increasingly difficult to implement on GCC's low-level intermediate program representation. For this reason a new middle-end was implemented. The tree-SSA project was probably the most important change that has happened to GCC recently, but that is a topic for a different article ;)

LLVM (2003), Open64 (2002)

At the time the Tree-SSA merge was being prepared, Chris Lattner presented at the 2003 GCC Summit a proposal for the integration of LLVM into GCC as a new middle-end that would take the place of the (not yet finished) Tree-SSA project. This never happened, and as a result we now have two major open source compilers.

One of the major trading points at that time was the implementation of LTO. People made guesses about whether it would be harder to modify GCC's architecture for LTO or to bring LLVM up to the features and richness of backends that GCC had. It is definitely interesting to look back at that from today's point of view.

Update: Here are the slides of Rafael Espindola's talk about the history of LTO in LLVM.

As a less known fact, Open64 was open sourced in 2000, working as an alternative GCC backend with full LTO support.

Early days of LTO implementation in GCC (2005)

The political obstacles to LTO and compiler plugins were lifted by GPL 3.0. In 2005 an official proposal, "Link-Time Optimization in GCC: Requirements and High-Level Design", was posted to the GCC mailing list. Work on LTO started jointly at Google, IBM and CodeSourcery.

The proposal took lessons from LTO designs in other compilers and set out several important requirements, starting with maximal transparency to the user (basically, LTO should require only the -flto command line option), a well-defined GNU Virtual Machine (GVM) and a number of necessary changes across the GNU toolchain (assembler, linker and such).

WHOPR: Whole program optimization (2007)

Implementing the original LTO proposal was an extreme amount of work. While progress was made, some parts, like the type representation in DWARF format, proved to be progressing too slowly. Moreover, Google started to consider an interesting problem: how to make LTO scale to really large applications, like the ones used internally by the company.

For this reason a number of changes to the project were made. A new proposal, WHOPR - Fast and Scalable Whole Program Optimizations in GCC, was born. It dropped some of the initial plans, most importantly the well-defined GNU Virtual Machine, while adding new ones. Streaming of GCC's IL (GIMPLE) was implemented instead. This has performance advantages, because it permits saving a lot of compiler-specific information that does not need to be rebuilt at link time.


WHOPR introduced an approach to distributed link-time compilation in which the serial linking stage is minimized. This fit well with the basic direction of the inter-module optimization framework I was working on.

LIPO

LIPO is an interesting project that apparently made a large difference at Google. It enables inter-module optimization based on profile feedback. Instead of doing the classical and expensive whole-program analysis at link time, the basic decisions are made at runtime of the program being trained. The dynamic callgraph is built and the inlining decisions are made. The dynamic callgraph considers only the portions of the program that matter, and is thus a lot smaller. Profile-driven inlining decisions can also be a lot more precise and reliable. Later, when recompiling with profile feedback, the compiler is told the inlining decisions and can parse the extra source files necessary.

While this idea seems a bit backwards compared to the traditional design, it makes a lot of sense in Google's environment. Compilation needs to be highly parallelized and to fit in relatively small memory limits. On the other hand, the runtime construction of the callgraph is less constrained. In this situation LIPO seems to easily outperform traditional LTO.

LTO merge (GCC 4.5)

LIPO's success at Google got the development of WHOPR/LTO into trouble (at least that is my outsider's understanding :). Most of the WHOPR core developers were moved to other projects and were finishing it in their free time. The project nevertheless succeeded in being merged into GCC in 2009, even if it was not in perfect shape. GCC 4.5 was released with LTO working well enough for SPEC benchmarks and bootstrap, while the actual WHOPR implementation was not really functional. Richard Biener did a huge amount of work on getting LTO into working shape.

Update: Solving correctness issues

The ability to store the intermediate language to disk and load it back is only one of the problems a developer of a link-time optimization framework needs to solve. There are many other (and less understood) challenges.

An important part of the problem is to define precisely the semantics of the combined units and to revisit the rules and assumptions individual optimizers make. Most language standards deal with one compilation unit alone (well, at least this is mostly true for C; C++ has the One Definition Rule). Consequently one has to think about what happens when there are two types of the same name with different meanings coming from different units, or what happens when a pointer to a data structure in one language (C, C++, Fortran, Ada, Java) is accessed as a data structure in another language. One also has to deal with conflicting optimization flags (like -fstrict-aliasing used in one unit but not in another), because all the code is going to be mixed together by inter-module inlining.

In GCC 4.5 we went a long way for single-language and mixed C/C++ programs. The compiler was partly cleaned up from hooks allowing front-ends to customize backend behaviour, and basic type merging was implemented. Many of the issues were however left unresolved and had to be dealt with later. Some of them, in particular the command line issues, do not have a completely satisfactory solution to the present day.
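One concrete example of the flag problem: type punning through pointer casts is undefined in units compiled with -fstrict-aliasing, so a helper written for a -fno-strict-aliasing unit may be miscompiled once inter-module inlining moves it into stricter surroundings. The memcpy idiom below (an illustrative sketch, not code from GCC) stays valid under any combination of flags:

```c
#include <string.h>
#include <stdint.h>

/* Well-defined in every unit, regardless of -f[no-]strict-aliasing:
   memcpy is the portable way to reinterpret the bits of an object. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* By contrast, `*(float *)&bits` is only safe in units compiled with
   -fno-strict-aliasing, an assumption LTO inlining can silently break. */
```

Code written this way keeps its meaning no matter which unit's flags end up governing the inlined copy.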

Binutils support

LTO is a major change and requires modifications throughout the whole toolchain. The linker and other utilities that work with object files need to be taught to operate on LTO object files, which do not contain any actual assembly. In the GNU toolchain this was realized by plugin support (implemented by Cary Coutant). The plugin adds a simple but flexible API to glue a compiler into the gold linker, useful for all LTO-enabled compilers. Unlike approaches that work around linker changes, this provides very precise information to the compiler about which symbols may or may not be used by non-LTO code. This is very important for optimization.

The linker plugin support was later added to the traditional GNU-ld (by Dave Korn), enabling LTO support on non-ELF targets including Windows.

One interesting aspect of the linker plugin design is that it allows one-pass linking. Most other designs dispatch the compiler from the linker and, once the compiler delivers the object files back, redo the linking. Gold and GNU ld implement one-pass linking. This has however caused issues, and so an alternative two-pass linking implementation was done by H.J. Lu.

Finally, the ar, nm and ranlib utilities were extended with plugin support by Rafael Espindola, and gcc convenience wrappers were added by Andi Kleen. I am still not completely happy about the need for wrappers, but I will write more about that next time.

Getting WHOPR to work (GCC 4.6)

With the basic infrastructure in shape, many GCC developers continued on the necessary cleanups and improvements. The linker plugin became the default and we worked on many correctness issues.

To have a specific goal, I decided to work towards making GCC compile Firefox. This was motivated by a post from Firefox developer Taras Glek about a massive performance regression in GCC 4.5 (which turned out to be a problem with compiling C++ applications with -Os). I started to look into the performance issues of Firefox and found them puzzling and interesting.

In the 2010 GCC summit paper we reported success linking Firefox in a reasonable time (4 minutes on a 24-core machine) and with 4GB of memory use. (Without the WHOPR mode, linking took 19 minutes and 8GB of RAM.) The resulting binary worked and showed a small, 1% improvement on dromaeo CSS benchmarks as well as a 6% code size reduction. Not many compilers are capable of this.

Firefox is a quickly growing target: a year later I needed 8GB of memory to build it with the same compiler, so GCC 4.7 had to be optimized carefully to squeeze it into 3GB. Today Firefox compiled with GCC 4.7 needs over 20GB of RAM. Some observations on this can be found in these 2013 slides.

In GCC 4.6, -fwhopr was removed and its behaviour became the default with -flto. The non-WHOPR path can still be executed via -flto-partition=none.
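A minimal build sequence then looks roughly as follows (a sketch; the file and function names are invented, the flags are as in GCC 4.6 and later):

```c
/* Sketch of a WHOPR build (file names invented):
 *
 *   gcc -O2 -flto -c speed.c main.c        # stream GIMPLE into the .o files
 *   gcc -O2 -flto speed.o main.o           # default: partitioned WHOPR link
 *   gcc -O2 -flto -flto-partition=none speed.o main.o   # monolithic path
 *
 * With -flto, a small helper like this one can be inlined into callers
 * that live in a different source file: */
int speed_kmh(int metres, int seconds)
{
    return metres * 36 / (10 * seconds);
}
```

The per-file compilations are cheap and parallel; the real code generation happens at link time, partitioned across worker processes in the WHOPR mode.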

Kernel compilation (GCC 4.7+)

Another long-running project was Andi Kleen's effort to get the Linux kernel to build with LTO. For a kernel developer, he did a remarkable job adding some missing features to the toolchain, most importantly support for incremental linking. His effort is described in these 2013 slides. The changes were considered for 3.15-rc1 but missed the window, apparently due to lack of evidence of benefits to the end user.

GCC 4.7 also had a rewrite of inter-procedural constant propagation and a new inliner analysis that is aware of the context a function is being inlined into (and, for example, can anticipate which parts of the function body will be optimized out).

Progress on correctly compiling non-trivial C++ programs was made by reimplementing thunk support.

Symbol table, scalability and stability improvements (GCC 4.8)

One of the long-lasting GCC oddities was the lack of a symbol table data structure. This made it harder to develop the representation of some features tightly related to it, including aliases and thunks. For this reason we reorganized the callgraph module into a new symbol table data structure that enables better LTO optimization of programs involving those features.

GCC 4.8 also saw the usual work on memory reductions and LTO speedups. Since the LTO codegen had started to be in shape, it was also the first release where the inliner went through re-tuning with link-time optimization in mind.

Today (GCC 4.9)

GCC 4.9 is being released this week (with release candidate 1 out since last week). It is a pretty big release with many internal changes and API cleanups.

The most important changes in LTO (besides the usual bugfixes) involve the new type merging implementation by Richard Biener and many compile time and memory use improvements. Firefox can once again be linked below the 4GB limit despite its ongoing growth. In fact the memory use is now less than twice the memory needed by the linker alone, which is not a bad result given how much more information GCC handles.

An important user-visible change is the new default of slim LTO files instead of fat ones. GCC 4.8 and earlier produced both the intermediate language and the assembly into every object file compiled with -flto. This made it possible to handle LTO objects with non-LTO-aware tools (ar, nm, ranlib and the linker), but it also doubled compile time, since everything was compiled twice. Now that binutils, libtool and other support has matured, it is finally possible to avoid this expense.
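The difference is only in what the object file contains; the commands below are a sketch (file names invented) of how fat objects or the plugin-aware wrappers can still be used when non-LTO-aware tools are unavoidable:

```c
/* Sketch of the fat vs. slim distinction (file names invented):
 *
 *   gcc -O2 -flto -c util.c                    # GCC 4.9: slim object, GIMPLE only
 *   gcc -O2 -flto -ffat-lto-objects -c util.c  # GCC 4.8 behaviour: GIMPLE + assembly
 *   gcc-ar rcs libutil.a util.o                # plugin-aware wrapper instead of ar
 *
 * The source itself does not change; only the object file contents do. */
int util_twice(int x)
{
    return 2 * x;
}
```

Fat objects pay the double-compilation cost once per file; the wrappers avoid it by teaching the archiver to read the GIMPLE sections through the linker plugin.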

Martin Liška contributed a pass for profile-driven function reordering. This pass should enable faster startup of large applications by reducing the portion of the binary that needs to be read into memory and making the accesses more sequential.

Another new feature is the devirtualization module, which allows early removal of unreachable virtual methods (this makes a huge difference for Firefox compilation) and also improves code, as described in another post.

GCC is now known to compile working Firefox, Chromium, Libreoffice, the Linux kernel and a good part of the Gentoo distribution with LTO enabled.

What next?

LTO is far from complete. One of the main missing features, in my opinion, is consistent handling of optimization options and debug info matching the non-LTO compilation. To get reasonable performance benefits, the compiler needs to be re-tuned for the new whole-program optimization mode, and some applications need to be adjusted as well.

There are also plans to merge the LTO-streaming-based LIPO from the Google branch and Martin Liška's code unification pass. I also hope the plans for a well-defined virtual machine will be revived. In the longer term I am planning to extend WHOPR to support fast recompilation in the compile-edit cycle.

Next time, I will continue the series with getting Firefox to build with LTO.
