Wednesday, August 20, 2014

Devirtualization in C++, part 6 - asking user for help

In previous posts of the series I described most of the new GCC devirtualization machinery. To cut the story short, GCC implements two kinds of devirtualization: the full and speculative one.

Full devirtualization replaces given polymorphic by a direct call (that can get inlined later).  Speculative devirtualization is a weaker form turning polymorphic call to a conditional testing whether the target is the predicted one and going by direct call path if it happens to be so and doing the indirect call otherwise.

Whole performing speculative devirtualization the compiler can play somehwat unsafe and make assumptions that are not valid 100% of cases. The main assumption made is that there are no new derived types introduced by code not visible to compiler. With link time optimization this is usually true, but not necessarily so - new derived types can be introduced by libraries or plugins. While compiler can generally attempt to track what instances may not be introduced by external code, this analysis is difficult and it is a question whether it can be implemented with reasonable precision within statically optimizing compiler.

The following example shows speculative devirtualization:
struct A {virtual void foo() {}};
void test (struct A *a)
{
  a->foo();
}
 Here GCC 4.9 produces:
test(A*):
        movq    (%rdi), %rax
        movq    (%rax), %rax
        cmpq    $a::foo(), %rax
        jne     .L5
        ret
.L5:
        jmp     *%rax
Instead of:
test(A*):
        .cfi_startproc
        movq    (%rdi), %rax
        movq    (%rax), %rax
        jmp     *%rax
If the target of polymorphic call happens to be a::foo(), the speculatively devirtualized code is significantly faster saving the call overhead (and enabling further optimizations).

As I measured in part 5, the speculative devirtualization can capture significant percentage of calls full devirtualization can not and is correct about its prediction in wast majority of cases.

C++ provides for ways that may be used to inform compiler that it has a complete information on derived types:
  1. declaring a type in an anonymous namespace,
  2. declaring a type in local scope (within function body),
  3. using the final specifier on a type, or,
  4. using the final specifier on a method.
Among these hints the first one is the most powerful. Declaring a type in the anonymous namespace not only ensures that no new derived types will appear in other translation units, but also makes it impossible for code in the other compilation units to access the data stored in a given type. This enables other optimizations, such as changing the memory layout. This is however not implemented yet in GCC.

In this post I will focus on C++11 final specifier and a way how compiler can aid user to annotate the code to improve code quality.

New warnings

The example given can be "improved" either by annotating struct A as final:
struct A final {virtual void foo() {}};
void test (struct A *a)
{
  a->foo();
Or by annotating the method foo as final:
struct A {virtual void foo() final {}};
void test (struct A *a)
{
  a->foo();
}
Both allows full devirtualization. Compiling both examples with GCC 4.9 will lead into an empty function test (that is a result of the full devirtualization and inlining the empty function A::foo):
test(A*):
       ret
For this reason I added two warnings: -Wsuggest-final-types and -Wsuggest-final-methods.

As name suggest, the warnings tells users about types or methods where final specifier would improve code quality. Both warnings are selective and do not buzz about all types with no derived types nor about all virtual methods with no override. They also do not warn in the cases where the compiler can easily determine the optimization itself (thus they are necessarily compiler version and optimization setting sensitive and thus not good candidates for -Werror).
$ gcc -O2 -Wsuggest-final-types -Wsuggest-final-methods t.C
t.C:1:8: warning: Declaring type ‘struct A’ final would enable devirtualization [-Wsuggest-final-types]
 struct A {virtual void foo() {}};
        ^
t.C:1:24: warning: Declaring method ‘virtual void A::foo()’ final would enable devirtualization [-Wsuggest-final-methods]
 struct A {virtual void foo() {}};
                        ^
The implementation on the top of the existing devirtualization machinery is quite straighforward. A (Speculative) devirtualization is implemented by filling in a polymorphic call context for a given call.  This context consists of outer type that is the type of instance the virtual table belongs to, offset within the outer type where the virtual table pointer is located and two flags.  First (maybe_in_construction) specify that type may be in construction and thus virtual functions of base types needs to be considered while the second (maybe_derived_type) specify that the instance actually may be of a type derived from the outer type. In most cases maybe_derived_type is set, because the type is determined from references in the source code.

Once filled in, the polymorphic call context is then turned into a list of methods that may be called by a given polymorphic call. This is done by recursive walk of the type inheritance tree as described in part 4 of the series. The resulting list is either complete or incomplete. Incomplette lists contains all possible targets within the current compilation unit, but the call may lead to virtual functions defined outside. Complete lists contains all possible targets. Incomplete lists are useful for speculation and profile prediction. Complete lists can drive full devirtualization as well as optimization propagating information across the callgraph.

When the new warnings are requested, the speculative devirtualization pass looks for contextes that have maybe_derived_type set, that contain precisely one method but they are incoplete. If such context is found then there are two options:
  1. If outer type has no derivations, -Wsuggest-final-types will output a warning that the type may be declared final
  2. If the method has no override, -Wsuggest-final-methods will suggest declaring it final.

How useful are the warnings?

The main purpose of the final keyword is not to assist code optimization. It is meant to be an syntactic sugar making APIs more robust. Unlike in Java codebases, the use of final keyword in C++ is not very widespread because it was introduced to C++ only recently. There are large code bases that was developed in pre-C++11 era and programmers are not used to the new feature.

I do not expect users to blindly annotate every type or method suggested. Instead I would advise to first try -Wsuggest-final-types and inspect individual types found.  Once suitable ones are annotated -Wsuggest-final-methods can be used to annotate suitable methods: annotating a type final renders final specifiers on its methods pointless.

The warnings can be used with both normal compilation and with link time optimization.  The second makes them a lot more precise: during the link time optimization the compiler has more information about derivations of a given type. Moreover more devirtualization and inlining happens in link time optimization builds.

Prioritizing the importantce

As an experiment I produced warnings from compilation of Firefox's libxul
library.  -Wsuggest-final-types identified 1772 types and -Wsuggest-final-methods 7215 methods (because of a non-linearity in GCC's line information lookup it takes about 5 minutes to just output them out of compiler and it would take ages to inspect all of them by hand!).  It is not a surprise that Firefox developers was unimpressed and asked me whether I can prioritize it somehow.

For this reason I implemented simple accounting into devirtualization pass and made it to output the warnings in the priority order, where priority is given by number of calls that would be fully devirtualized by the annotation either statically or dynamically when profile feedback is available.  The Firefox warnings now looks as follows:
/aux/hubicka/firefox/editor/libeditor/html/nsHTMLEditor.h:66:0: warning: Declaring type ‘struct nsHTMLEditor’ final would enable devirtualization of 575 calls [-Wsuggest-final-types]
 class nsHTMLEditor : public nsPlaintextEditor,
 ^
/aux/hubicka/firefox/docshell/base/nsDocShell.h:124:0: warning: Declaring type ‘struct nsDocShell’ final would enable devirtualization of 216 calls [-Wsuggest-final-types]
 class nsDocShell : public nsDocLoader,
 ^
/aux/hubicka/firefox/intl/icu/source/common/unicode/uniset.h:276:0: warning: Declaring type ‘struct UnicodeSet’ final would enable devirtualization of 146 calls [-Wsuggest-final-types]
 class U_COMMON_API UnicodeSet : public UnicodeFilter {
 ^
../../dist/include/mozilla/Selection.h:48:0: warning: Declaring type ‘struct Selection’ final would enable devirtualization of 131 calls [-Wsuggest-final-types]
/aux/hubicka/firefox/intl/icu/source/common/unicode/unistr.h:245:0: warning: Declaring type ‘struct UnicodeString’ final would enable devirtualization of 115 calls [-Wsuggest-final-types]
 class U_COMMON_API UnicodeString : public Replaceable
 ^
/aux/hubicka/firefox/intl/icu/source/i18n/unicode/decimfmt.h:663:0: warning: Declaring type ‘struct DecimalFormat’ final would enable devirtualization of 92 calls [-Wsuggest-final-types]
 class U_I18N_API DecimalFormat: public NumberFormat {
 ^
/aux/hubicka/firefox/netwerk/protocol/http/nsHttpChannel.h:40:7: warning: Declaring type ‘struct nsHttpChannel’ final would enable devirtualization of 65 calls [-Wsuggest-final-types]
 class nsHttpChannel : public HttpBaseChannel
 ^
...
 Followed by method warnings:
/aux/hubicka/firefox/dom/bindings/CallbackObject.cpp:32:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 148 calls [-Wsuggest-final-methods]
 NS_IMPL_CYCLE_COLLECTING_ADDREF(CallbackObject)
 ^
/aux/hubicka/firefox/content/media/MediaDecoder.cpp:1440:0: warning: Declaring method ‘GetReentrantMonitor’ final would enable devirtualization of 108 calls [-Wsuggest-final-methods]
 ReentrantMonitor& MediaDecoder::GetReentrantMonitor() {
 ^
/aux/hubicka/firefox/dom/base/nsGlobalWindow.cpp:9294:0: warning: Declaring method ‘GetExistingListenerManager’ final would enable devirtualization of 103 calls [-Wsuggest-final-methods]
 nsGlobalWindow::GetExistingListenerManager() const
 ^
/aux/hubicka/firefox/dom/base/nsGlobalWindow.cpp:9281:0: warning: Declaring method ‘GetOrCreateListenerManager’ final would enable devirtualization of 101 calls [-Wsuggest-final-methods]
 nsGlobalWindow::GetOrCreateListenerManager()
 ^
/aux/hubicka/firefox/dom/bindings/CallbackObject.cpp:33:0: warning: Declaring method ‘Release’ final would enable devirtualization of 76 calls [-Wsuggest-final-methods]
 NS_IMPL_CYCLE_COLLECTING_RELEASE(CallbackObject)
 ^
/aux/hubicka/firefox/intl/icu/source/common/unistr.cpp:398:0: warning: Declaring virtual destructor of ‘struct UnicodeString’ final would enable devirtualization of 68 calls [-Wsuggest-final-methods]
 UnicodeString::~UnicodeString()
 ^
/aux/hubicka/firefox/intl/icu/source/common/uvector.cpp:86:0: warning: Declaring virtual destructor of ‘struct UVector’ final would enable devirtualization of 63 calls [-Wsuggest-final-methods]
 UVector::~UVector() {
 ^
/aux/hubicka/firefox/editor/libeditor/html/nsHTMLEditor.cpp:3189:0: warning: Declaring method ‘DeleteNode’ final would enable devirtualization of 49 calls [-Wsuggest-final-methods]
 nsHTMLEditor::DeleteNode(nsIDOMNode* aNode)
 ^
Overall 9270 polymorphic calls (out of 68162 in libxul binary) can be devirtualized by declaring types final, while declaring top 100 calls will handle 3876 virtual calls.

13952 calls (23% of all polymorphic calls in libxul binary) can be devirtualized by declaring methods final. The distribution is however significantly more flat - declaring first 500 methods would devirtualize only 4976 calls.  As the quoted warning suggest, however many cases arise from macros and template declarations, so there is some hope this can be handled more centrally.

Firefox experiment

Wiill developers follow the warnings? Fortunately I did not need to wait too long to figure out :)

Based on warnings I posted to GCC IRC, Trevor Saunders prepared a patch adding couple 145 finals into Firefox sources and filled in PR1047696. I found interesting to read the reactions of other developers who wonder why the annotation is a good idea. Trevor found that roughly 3600 indirect calls are saved by his patch - not far from my estimate above. Reportedly the patch is on its way to upstream. 

Using profile feedback

Can we do better? Well, with profile feedback we can. I implemented this in this patch. Firefox has profiledbuild mode where the browser train itself on renderring various pages and rebuilds with precise execution counts information on every basic blocks.  With this GCC can prioritize warnings by execution counts rather by static counts. With this, only 10 types are identified as how candidates:
/aux/hubicka/trunk-install/bin/ld: warning: hidden symbol 'pixman_add_triangles' in /aux/hubicka/build-firefox5-inst7-marxin-profile-nolt/toolkit/library/build/../../../gfx/cairo/libpixman/src/pixman-trap.o is referenced by DSO /usr/lib/x86_64-linux-gnu/libcairo.so
/aux/hubicka/firefox-martin/gecko-dev/docshell/base/nsDocShell.h:125:0: warning: Declaring type ‘struct nsDocShell’ final would enable devirtualization of 576 calls executed 8840 times [-Wsuggest-final-types]
 class nsDocShell : public nsDocLoader,
 ^
/aux/hubicka/firefox-martin/gecko-dev/content/base/src/nsScriptLoader.h:38:0: warning: Declaring type ‘struct nsScriptLoader’ final would enable devirtualization of 64 calls executed 608 times [-Wsuggest-final-types]
 class nsScriptLoader : public nsIStreamLoaderObserver
 ^
/aux/hubicka/firefox-martin/gecko-dev/xpcom/base/nsCycleCollector.cpp:2019:7: warning: Declaring type ‘struct GCGraphBuilder’ final would enable devirtualization of 20 calls executed 262 times [-Wsuggest-final-types]
 class GCGraphBuilder : public nsCycleCollectionTraversalCallback,
       ^
/aux/hubicka/firefox-martin/gecko-dev/layout/base/nsCaret.h:23:0: warning: Declaring type ‘struct nsCaret’ final would enable devirtualization of 130 calls executed 228 times [-Wsuggest-final-types]
 class nsCaret : public nsISelectionListener
 ^
/aux/hubicka/firefox-martin/gecko-dev/netwerk/protocol/http/nsHttpConnectionInfo.h:32:7: warning: Declaring type ‘struct nsHttpConnectionInfo’ final would enable devirtualization of 2 calls executed 100 times [-Wsuggest-final-types]
 class nsHttpConnectionInfo
       ^
/aux/hubicka/firefox-martin/gecko-dev/layout/base/../../content/base/src/nsXMLHttpRequest.h:183:0: warning: Declaring type ‘struct nsXMLHttpRequest’ final would enable devirtualization of 124 calls executed 78 times [-Wsuggest-final-types]
 class nsXMLHttpRequest : public nsXHREventTarget,
 ^
/aux/hubicka/firefox-martin/gecko-dev/parser/html/nsHtml5StreamListener.h:31:0: warning: Declaring type ‘struct nsHtml5StreamListener’ final would enable devirtualization of 8 calls executed 70 times [-Wsuggest-final-types]
 class nsHtml5StreamListener : public nsIStreamListener,
 ^
/aux/hubicka/firefox-martin/gecko-dev/xpfe/appshell/src/nsWebShellWindow.h:26:0: warning: Declaring type ‘struct nsWebShellWindow’ final would enable devirtualization of 62 calls executed 12 times [-Wsuggest-final-types]
 class nsWebShellWindow : public nsXULWindow,
 ^
/aux/hubicka/firefox-martin/gecko-dev/dom/base/nsScreen.h:20:0: warning: Declaring type ‘struct nsScreen’ final would enable devirtualization of 16 calls executed 2 times [-Wsuggest-final-types]
 class nsScreen : public mozilla::DOMEventTargetHelper
 ^
and 31 methods with one few executed over 100 times:
/aux/hubicka/firefox-martin/gecko-dev/docshell/base/nsDocShell.h:157:0: warning: Declaring method ‘_ZThn424_N10nsDocShell17GetIsBrowserOrAppEPb’ final would enable devirtualization of 4 calls executed 7878 times [-Wsuggest-final-methods]
     NS_DECL_NSIDOCSHELL
 ^
/aux/hubicka/firefox-martin/gecko-dev/image/src/imgRequestProxy.cpp:93:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 70 calls executed 4924 times [-Wsuggest-final-methods]
 NS_IMPL_ADDREF(imgRequestProxy)
 ^
/aux/hubicka/firefox-martin/gecko-dev/dom/base/nsGlobalWindow.h:386:0: warning: Declaring method ‘_ZThn32_N14nsGlobalWindow10GetRealTopEPP12nsIDOMWindow’ final would enable devirtualization of 6 calls executed 1590 times [-Wsuggest-final-methods]
   NS_DECL_NSIDOMWINDOW
 ^
/aux/hubicka/firefox-martin/gecko-dev/layout/base/nsPresContext.cpp:332:0: warning: Declaring method ‘Release’ final would enable devirtualization of 302 calls executed 1112 times [-Wsuggest-final-methods]
 NS_IMPL_CYCLE_COLLECTING_RELEASE_WITH_LAST_RELEASE(nsPresContext, LastRelease())
 ^
/aux/hubicka/firefox-martin/gecko-dev/dom/base/nsGlobalWindow.h:386:0: warning: Declaring method ‘_ZThn32_N14nsGlobalWindow13GetRealParentEPP12nsIDOMWindow’ final would enable devirtualization of 8 calls executed 826 times [-Wsuggest-final-methods]
   NS_DECL_NSIDOMWINDOW
 ^
/aux/hubicka/firefox-martin/gecko-dev/dom/bindings/CallbackObject.cpp:33:0: warning: Declaring method ‘Release’ final would enable devirtualization of 3262 calls executed 670 times [-Wsuggest-final-methods]
 NS_IMPL_CYCLE_COLLECTING_RELEASE(CallbackObject)
 ^
/aux/hubicka/firefox-martin/gecko-dev/content/base/src/nsScriptLoader.cpp:179:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 14 calls executed 608 times [-Wsuggest-final-methods]
 NS_IMPL_ISUPPORTS(nsScriptLoader, nsIStreamLoaderObserver)
 ^
The execution counts are actually surprisingly small. Partly because developers still tends to try to keep polymorphic calls out of the hot paths and partly because the train run is actually not very computationally intensive. I would be very interested if the benefits can be measured in some practical benchmarks.

Next time I plan to write on determing dynamic types of heap allocated memory.

Monday, April 21, 2014

Linktime optimization in GCC, part 2 - Firefox

This time I will write about building Firefox with LTO. Firefox is one of largest binaries on my hard drive (beaten only by Chromium) so it is an excellent stress test. Taras Glek and I started to work on getting Firefox to build well in 2009 (see original paper from 2010), however for all those years I was mostly chasing correctness and compile time/memory usage issues. Markus Trippelsdorf was instrumental to keep the project going. In parallel Rafael Espindola did same effort of getting Firefox to work with LLVM and LTO (Update: see also his recent slides) that got into useful shape about a year later.

Update: I solved the LLVM LTO crash by commenting out all SSE3 code from SkBitmapProcState_opts_SSSE3.cpp. It seems to be re-incarnation of Firefox's PR759683 and LLVM's PR13000. I also added GCC 4.8 runtime and code size tests.

Update: With help from Firefox devevelopers I also measured tp5o benchmark. It shows rendering times of popular pages from 2011.

Looking into Firefox let me to understand that needs of large applications are different from what one sees in standard benchmarks (SPEC2006). It motivated several changes in GCC, including making -Os cope well with C++, adding code layout logic, optimization to handle better static construction and destruction and to improve dealing with PIC interposition rules.

It may be surprising, but despite my work in the area, this is first time I actually run some real benchmarks (except for Dromaeo I did two weeks ago). Benchmarking large applications is not easy to do and I never had time to poke with the setup. As I have learned recently, Firefox has actually pretty thorough benchmarking tool called talos.

I run quite few benchmarks and builds, but it would be interesting to do more. Due to unforeseen changes in my plan this weekend I however decided to publish what I have  now and followup on individual topics.

This post should be also useful guide to those who want to try other bigger projects with LTO. Firefox contains a lot of third party libraries and has very diverse code base. Problems hit here are probably good portions of problems you hit on other bigger applications.

For LTO history, see part 1 of this series.

Update: See also Martin Liška post on building Gentoo with LTO.

Prerequisities

I will not duplicate the build instructions from Mozilla site. However for successful LTO build one needs the following additional things:
  1. Binutils with plugin enabled. This can be checked by
    $ ld --help | grep plugin
      -plugin PLUGIN              Load named plugin
      -plugin-opt ARG             Send arg to last-loaded plugin
    I personally use gold from GNU Binutils 2.24.51.20140405 but any reasonably recent binutils either with GNU ld or gold should do the job. One needs to however enable the plugin while configuring since it is off by default.
  2. GCC 4.7+ that is configured with plugin support (i.e. built with the plugin enabled LD). For sane memory use, GCC 4.9 is recommended, it requires about 5 times less RAM.

    To test if GCC runs with plugin support, try to link simple hello world application and do:
    $ gcc -O2 t.c -flto --verbose 2>&1 | grep collect2.*plugin
    If the output is non-empty you are having plugin set up correctly.
  3. Clang needs to be built with gold llvm plugin enabled, too. Follow the official instructions.
  4. ar/nm/ranlib wrappers on the path that dispatch to the real ar/nm/ranlib with plugin argument.

    This is somewhat anoying issue. GCC ships with gcc-ar/gcc-nm/gcc-ld, but it is not very easy to convince build script to use them. Adding to path a simple script calling gcc-* equivalent will lead to infinite recursion. I use the following hack:
    $ cat ar
    #! /bin/sh
    command=$1
    shift
    /aux/hubicka/binutils-install/bin/ar $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat nm
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/nm --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat ranlib
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/ranlib $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*
    In the LLVM case I use the same wrappers, just the plugin path is /aux/hubicka/llvm-install/lib/LLVMgold.so.

    Alternatively the plugin can be installed to the default plugin sear path (/usr/lib/bfd-plugins). This solution has some limitations however. One needs not-yet-released binutils to make this work with GCC and one can not easily install multiple plugins. I hope this situation will change in not so distant future.

    Forgetting these steps will give you undefined symbols as soon as you start to link static library build with -flto and GCC 4.9+ or -flto -fno-fat-lto-objects with earlier compilers.
  5. Disk space for the checkout, about 3GB
  6. Disk space for build directory: 0.9GB for normal build (3GB with debug info), 1.3GB for LTO build with GCC (1.6GB with debug info) and 0.72GB for LTO build with LLVM (2.2GB with debug info).
  7. At least 4GB of RAM (if your machine has 4GB, some extra swap is needed to pull out linker when LTO optimizer is working) for GCC LTO and 8GB for LLVM LTO.
  8. At least 3GB of /tmp space for GCC's WHOPR streaming. If you use tmpfs, you may want to redirect GCC's temporary files to a different directory by setting TMPDIR.

Be sure you actually use LTO and have plugin configured right

One of common mistakes is to use GCC -flto without linker plugin. This has three important consequences.
  1. GCC will always build fat LTO files and your builds will take twice as long then needed
  2. GCC will not have visibility information from linker and will not optimize very well across symbols exported from the binary. You may change this with -fwhole-program, but this solution is hard to use and still has some negative impact on code quality. It is a lot better to use linker plugin and the -fvisibility=hidden if you build PIC library.
  3. If you produce static library of LTO files and you link it without the plugin, the link time optimization on those will not happen. You will get the same code as without -flto. I believe many users who report no benefits and/or ability to build very complex apps actually fall into this trap.
I hope binutils will be patched to diagnose these cases more gratefully.

    Be sure you get link-time optimization flags right.

    Both GCC and LLVM perform some optimizations at compile-time and some optimizations at link-time. Ideally the compiler should pass down the flags from compile-time to link-time but GCC (and I believe LLVM, too) is not yet fully reorganized to support per-function flags (I plan improve this for next release).

    Worse yet, GCC does some magic, like trying to get -fPIC flag right, that does not really do what it should, but it hides some of obvious to detect mistakes. Richard Biener improved it a bit for 4.9, but the sollution is still not very good.

    It is important to pass proper optimization flag to the final invocation of compiler that is used to link the binary. Often it can be done by adding your optimization flags to LDFLAGS. Failing to do this will probably result in not very well optimized binary.

      Compilers

      I used unpatched GCC 4.9 release candidate 1. I configured my compiler with:
      --prefix=/aux/hubicka/49-install --disable-plugin --enable-checking=release --disable-multilib --with-build-config=bootstrap-lto
      And built with make profiledbootstrap. If you want to use LTO and FDO together, you may want to get this patch for PR47233 that did not make it into 4.9.0 and will land 4.9.1 only. It reduces LTO+FDO memory use by about 4GB.

      I also use relatively fresh LLVM checkout 206678 (Apr 19). Eric Christopher recommended me to try recent LLVMs since the memory usage with LTO was reduced. I also briefly tried 3.4 that seemed to behave similarly modulo needing more memory. The wrong code issues I mention here as well as the crash on LTO without -march=native happens in both cases. I configured with
      --enable-optimized --disable-assertions --prefix=/aux/hubicka/llvm-install 
      and rebuilt by itself. My understanding is that this should give me fastest binary. Release builds  seems to be with assertions enabled that seems to cause noticeable slowdowns.

      To both compiler paths I install the shell wrappers for ar/nm/ranlib as mentioned above.

      Firefox changes needed

      The actual changes to Firefox are tracked by our tracking PR45375. Here are changes that actually landed upstream I write about mostly to give an idea what source code changes are needed:
      1. Patches to mark symbols used by top-level asm statements by __used__ attribute. This change is needed because unlike the statement level asm statements, the toplevel asm statements have no way of annotating them with symbols used. For this reason the symbols in ASM attributes are not visible to linker and GCC prior code is generated. If you won't use explicit __used__ attribute, GCC will freely optimize out the symbols leading to link errors.

        This solution somewhat sucs. LLVM has built in ASM parser that makes this transparent in some cases, but not in others.

        The patch is linked from Firefox's PR826481
      2. Patch to libvpx to avoid grepping bytecode assembly files.

        The problem here is that libvpx produce assembly file and then greps it for specific code in it. With -flto the assembly file consists of compressed LTO bytecode and thus the grepping fails.  One needs to pass -fno-lto to the specific unit.

        More in Firefox's PR763495
      Patches not pushed upstream
      1. A hack to prevent configure script to be confused by LTO and disable support for visibilities. This is linked here. Without this patch one gets working Firefox, but code quality is lousy.

        Confusion of confiugre scripts is unfortunately common problem for LTO. Many of the scripts was written with traditional compilation model in mind.
      Martin Liška and Markus Trippelsdorf put together bare bones of GCC LTO FAQ.

        Firefox configuration

        My mozconfig is as follows:
        ac_add_options --enable-application=browser
        ac_add_options --enable-update-channel=${MOZ_UPDATE_CHANNEL}

        ac_add_options --enable-update-packaging
        ac_add_options --disable-debug-symbols
        ac_add_options --disable-optimize
        ac_add_options --disable-tests
        export MOZ_TELEMETRY_REPORTING=1
        export MOZILLA_OFFICIAL=1
        mk_add_options MOZ_MAKE_FLAGS="-j16"
        export MOZ_PACKAGE_JSSHELL=1
        MYFLAGS="-O3 -march=native -flto"
        export CFLAGS=$MYFLAGS
        export CXXFLAGS=$MYFLAGS    
        export LDFLAGS="-O3 $MYFLAGS"              
        mk_add_options MOZ_OBJDIR=/aux/hubicka/final2-gcc-firefox-49-lto-O3-native-debug
        I use --disable-optimize to be able consistently set flags to all files by MYFLAGS. For LLVM LTO build one also need in some cases --disable-elfhack (or one gets segfaults). I will discuss effect of enabling-disabling debug symbols briefly in the next section.

        LLVM LTO builds crashes in instruction selection with LTO. I suppose it is because some units are build with custom -march flag with use of SSE intrincisc but this information gets lost to LTO. To get binary built I link with -march=native. I am in the progress of looking for testcase to report the issue. The resulting binary do not work for me, so all binaries benchmarked are with generic tuning.

        To get build with profile feedback, one additionally need to enable tests.

        Getting parallelism right

        If you want to get parallel build with LTO, it is better to parallelize the linktime optimization, too. This is currently supported by GCC only. In GCC you can pass parallelism to -flto parameter. I use -flto=16 that gets me 16 processes. There is no need for passing higher values. Parlallelism increases the memory use, so you may need to use lesser parallelism than is your CPU count if you are running into swap storms.

        Other important thing to consider is correlation with make's parallelism. If you build with -j16 then make may execute 16 linking commands in parallel that may each execute 16 worker processes killing your machine with 256 parallel compilations. For Firefox it is not an issue as there is only one dominating binary, for other projects you may want to be cautious.

        GCC support -flto=jobserv (implemented by Andi Kleen) that lets make to drive the whole process. There are few downsides of this:
        1. You need to adjust your Makefile to add + in front of the rule executing link command, otherwise GNU make will not pass down the necessary jobserver information. A lot of users seem to not read the docs far enough to notice this trap and get serial links.
        2. GCC 4.9 newly parallelizes the streaming stage and because I do not know how to communicate with Make's jobserver, this stage won't get parallelized, yet. GNU Make's jobserver seems easy to deal with and patches are welcome :)
        3. If you cut&paste command from Make's output, the linking will be serial because jobserv is not run.
        We will need some cooperation with GNU make maintainers to get these problems hammered out.  In general I think -flto should default to an "auto" mode that will by default detect number of CPUs available unless it knows about Make's jobserver and Make's jobserver should be available without the extra Makefile change.

        The parallelism is reached by breaking program into fixed set of partitions after the inter-procedural analysis is completed. The number of partitions is not controlled by -flto and is 32 by default. If you want higher parallelism or smaller memory footprint, you may increase number of partitions by --param lto-partitions=n. This will affect resulting binary but it should not really decrease code quality by a considerable fraction. The number is fixed to ensure that builds on different machines gets precisely the same output.

          Building

          To build Firefox normally, one use
          make -f client.mk 
          For the profiled build one needs X server running (I use tightvnc) and use  
          make -f client.mk profiledbuild
          If X is not running, the build passes but the resulting binary is crap, because all it is optimized for is to output the error message about missing X.

          My build machine

          My machine (courtesy of IUUK) has plenty RAM (64GB) and AMD Opteron(TM) Processor 6272 with 16 threads running 2100Mhz. I use tmpfs for /tmp directory and run modified Debian Wheezy.

          Compile time

          Wall time


          One observation is that GCC's LTO is no longer coming for extreme compile time costs. The slim LTO files avoid double optimization and the build gets parallelized pretty. The extra wall time caused by -flto=16 is comparable to switch from -O1 to -O2.

          LLVM without LTO clearly builds faster. Its -O3 compilation level is about as fast as -O0 in GCC, while GCC's -O1 is just bit slower and other optimization levels are significantly slower. With LTO LLVM builds a lot slower due to lack of any parallelism in the build stage. Also LLVM's optimization queue is organized in a way that virtually all optimization are run twice; once at compile-time and then again at link-time. While this may account extra cost, I think it is not too bad due to speed and relative simplicity of the LLVM's optimization queue. It is more problematic in a way that optimization done at compile time may inhibit optimizations at link time and I expect this will be revisited in future.

          Will debug info kill my builds?

          Other interesting topic is debug info. While GCC basically carries all expenses of debug info even with -g0. LLVM doesn't. The LTO memory usage increases many times with -g and build time with -O3 -flto -march=native -g is 37m46s (2266s), so about 46% slower. GCC needs 15m20s (extra 11%). I left out these from the graph to avoid rescaling.

          For non-LTO builds GCC -O3 -g compile time is 12m41s (761s), about 18% slower than -g0, LLVM -O3 -g compile time is 9m35s (575s), about 13% slower.

          Previous releases

          To put this into further perspective, GCC 4.7 needs 35m1s (2101s) to complete the job with -O3-flto=16 -g, GCC 4.8 needs 40m10s (1810s). Firefox build with GCC 4.7 and LTO crashes on startup for me (earlier checkouts worked). I did not investigate the problem, yet.

          CPU time

          Parallelism plays important role: while compilations can be easily done in parallel via make -j16, linking often gets stuck on one binary.  Here is how system+user time compares on individual compilation:


          Here it is visible that LTO compilation does bit more work for both compilers, but WHOPR is able to keep overall build time in bounds.

          Compiler's memory use

          Here I present some data collected from vmstat plotted by a script by Martin Liška (who originally got the idea from Andi Kleen). Red line in CPU utilization graph shows the case where only one CPU is used.

          GCC 4.9

          GCC memory/cpu use at -O0


          GCC memory/cpu use at -O3


          GCC memory/cpu use at -O3 -g

          In the graphs one can actually see that relatively a lot of time (100s) is spent before actual compilation begins. It is make parsing through Makefile and executing bunch of python scripts. The same happens after 600s to 700s where the same is repeated for different make target. The little peak at the end 700s is gold linking the final libxul.so library. We will soon see this like peak to grow up :)
          GCC memory/cpu use at -O3 -flto
          Here the memory usage during actual compilation is lowered to about 4GB and compilations are finised in about 500s (faster than with -O3 in the previous graph).
          What follows is a memory mt. Everest climb in memory usage graph starting just after 500s.  One can see the serial link stage running from 500s to 600s

          Mount Everest climb in memory usage starting just after 500s is the LTO optimization libxul.so, the largest Firefox's library:
          1. WPA (whole program analysis) type merging: The steeper walk up to 550s is compiler reading summaries all of the compilation units, merging types and symbol tables.
          2. IPA optimization: The slow walk up is inter-procedural optimization, dominated by inliner decisions.
          3. WPA streaming: Just after 600s the serial stage is finished and the steep peak up is parallel streaming and shipping new object files into copmilatoin. You can see the CPU usage to go up, too.
          4. Local transformation: Shortly after the steep valley the actual compilation starts, executed in parallel for another almost 200s.
          While the memory usage in my graph peaks 16GB, you can see it all happens in the parallel stage.  If you link Firefox at 8GB machine you can either increase --param lto-partitions or decrease -flto parallelism. Both will make the mt. Everest to sink.

          The little hill just before the mt. Everest (around 500s) has about the same shape for a good reason: it is linking of the javascript interpreter shell, the second largest binary in Firefox. It is a good example that bit of build machinery reorg would probably help to hide most of the LTO build time expenses. All one would need would be to overlap the javascript shell link with the libxul.so link. Something that is not needed in the normal build, since the link time of javascript shell is almost immediate.

          An unfortunate debug info surprise

          GCC -O3 -flto=16 -g
          This graph was supposed to show, that debug info is almost for free. This is true up to about 700s. Then it shows quite large increase in peak memory use (16Gb to 25GB) for debugging enabled in the local transformation stage. Sadly this is a new problem with recent Firefox checkout I did not noticed until now. I am looking into an solution.

          If I have 8GB or 4GB of RAM?

          The following are the worst case (debug info + -O3) builds with reduced parallelizm to 8 and 4. I regularly build firefox on my notebook with 8GB of RAM and 4 threads. On 4GB machine the build is possible, but one would need to either go to -flto=1 or play curefuly with partitioning. Again, Firefox is quite a monster.
          GCC -O3 -flto=8 -g

          GCC -O3 -flto=4 -g
          GCC -O3 -flto=16 --param lto-partitions=16 -g

          Older GCC releases


          GCC 4.7 with -O3 -flto=16 -g
          To get bit of perspective, this shows improvements since GCC4.7. Fat object files accounts for increasing the compilation stage to almost 1000s. The type merging is the footstep of the hill followed by slow inliner decision and streaming (the shanky part of graph, around 1500s). The WPA stage needs 8GB instead of 4GB in GCC 4.9.

          Again this graph shows that there is a new problem with debug info with recent Firefox checkout. Again this is fixable by reducing parallelism.

          GCC 4.8 -O3 -flto=16 -g
          While GCC 4.8 links almost twice as fast compared to 4.7, the memory usage is still bad (serial stage has about the same memory use and local compilation needs 5GB more)

          LLVM

          LLVm memory/CPU use at -O0

          LLVM memory/CPU use at -O3
          LLVM memory/CPU use at -O3 -flto -march=native
          LLVM is done with compilation in about 350 seconds, that is quite surprisingly similar to GCC's time.I would expect here cleang to beat GCC's C++ FE hands down, but apparently a lot of compiler time difference nowdays is in the back-end and also by the integrated assembler. Or perhaps it is because more optimizations are performed at LLVM's compile time. Then it follows by serial link stage.

          Without LTO LLVM uses about 1/3rd to 1/4th of memory, while with LTO it needs about twice as much (given that the link stage is not parallel at all). This is the case of compilation with debug symbols disabled only.
          LLVM with -O3 -g

          LLVM with -O3 -flto -g
          This graph shows issues with debug info memory use. LLVM goes up to 35GB. LLVM developers are also working on debug info merging improvements (equivalent to what GCC's type merging is) and the situation has improved in last two releases until the current shape. Older LLVM checkouts happily run out of 60GB memory & 60GB swap on my machine.

          File Size

          I generated the file size comparison splitting the binary into text, relocations, EH and data that accounts also the other small sections.I made EH data to appear gray, since they most of time don not need to be loaded into memory and thus their size is less critical then the size of other sections.



          One thing that LTO is good for is code size optimization. This is not as clearly visible on libxul.so, because it uses gold to do some of dead code removal, but will become more clear in other applications.

          In GCC's LTO at -Os bring about 5% reduction to both code and data segment (hover over graph to get precise numbers) and 20% reduction to EH data. For -O2/-O3 the data segment reduction stays, but the code is expanded by inlining. This is just the out-of-the-box configuration. The inliner is by default not really well behaving on large applications. Some of it I already fixed for next GCC release here. For now I recommend to reduce the inliner expansion by --param inline-unit-growth=5. This brings binaries close to non-LTO -Os, but without the runtime expenses. I wanted to make this default for GCC 4.9 but did not have chance to complete the necessary bencharking, so it is something I want to address this development cycle (in early tests done in mid 2013 there was issues with SPEC2k6 xalancbmk benchmark). I will follow on this and show some tests.

          UPDATE: I rebuilt firefox without -function-sections -fdata-sections and gold's section removal + identical code merging (which require the sections). The overall binary reduction with the default configuration is 43% (in LTO build)

          For LLVM LTO usually expands the binary including the -Os compilation, at -O3 it is about 30%. I believe it is mostly by inliner seeing too many cross-module inlining cases, too. LLVM has no global growth limit on inliner. What happens with LTO is that almost all functions become inlinable and simple minded inliner is just too happy and inlines them all. I believe that this is partly reduced by fact that LLVM does code expanding inlining at compile time that consequently disables some of cross-module inlining at link-time. Did not double-check this theory, yet.

          The data segment is reduced by about 3% and it is smaller than data segment of GCC produced binaries by about 6%. I tracked down part of the problem to GCC aligning virtual tables as if they were arrays for faster access by vector instructions. This is quite pointless and will make patch to avoid it. Other problem is the lack of optimization pass to remove write only variables that I made patch for next release cycle. Last difference seems to be LLVM bit more aggressive about optimizing out C++ type infos that GCC believe are forced to stay by C++ ABI. I looked into this and I lean towards conlcusion it is LLVM bug, but I am not expert here and further analysis are needed. Some of the symbols has appeared forced in recent LLVM checkout.

          One thing to observe is that GCC seems to produce considerably smaller code segment with -Os than LLVM (14%). As it will be shown later, LLVM's -Os code is however faster than GCC's -Os. It is the nature of the switch to trade all performance for code size however. -O2 code segment is smaller in LLVM (5%). I did not look into this in detail, yet, but it is my long time feeling that -O2 may need bit of revisiting from large application developer point of view. A lot of code generation decisions are tuned based on benchmarks of medium sized apps (SPEC2k6) that makes code expanding optimizations to look cool. I did bit work on this for GCC 4.9 and generic tuning, but more needs to be done. The size difference further increase at -O3 that I consider OK, since -O3 is about producing fast binary at any cost. As it is shown in benchmarks, -O3 in GCC does bring benefits, while for LLVM it seems to be close to -O2 performance.

          GCC also produce bigger EH tables (27%) at -O2. Partly this is concious change in GCC 4.9 where push/pop instructions are used to pass on-stack arguments. This produce shorter code and work well with modern CPUs having stack engines but the unwind info bloats. LLVM also seems to have bug where it optimizes away the EH region in:
          #include <stdio.h>
          void
          t()
          {
          }

          void
          q()
          {
            try {t();} catch(int) {printf ("catch\n");}
          }
          Even with -fPIC -O2. This is not valid, because t can be interposed by dynamic linker to different implementation that does throw. I filled in PR for this and will double check if those two are the sole reasons for the difference.

          Runtime benchmarks

          I used talos to collect some runtime performance data. Sadly I do not have LLVM LTO data, because the binary crashes (and moreover it is native build rather than generic). I plan to followup on this and do the comparison once I hammer out the problems.

          Startup time tests

          First benchmark is ts_paint that starts the browser and waits for the window to be displayed (in my understanding). I tried also bechmark called ts, but it does not produce results.

          Before each talos run I flushed the cache. Talos is running the benchmark several times, so it gave me one cold startup run and several hot startups (I hope). In the graph bellow is the cold startup and average of hot startups. Cold startup have more noise in it, because it is just one run and I suppose it also depends how well the binary landed on the hdd.

          I am not entirely happy about this benchmark and will write more later. The main thing is that Firefox uses madvise call to fully load every library disabling the kernel's page demand loading. Martin Liška implemented function reordering pass that, with profile feedback, should allow to load just small portion of the binary. I will followup on this later. See also Taras's blog post on this topic.

          The startup tests seems to be insensitive to LTO except for case when profile feedback was used (FDO saves 7%, LTO additional 4%). Cold startup shows some sensitivity to file size, but apparently my HDD is fast.

          Opening window


          Opening window is a nice test to show compiler flags, since it should fit in cache. Here LTO wins bt 5% at -O2, 2.3% at -O3 and 9.8% with FDO (that by itself brings 6.4%). One of two (Update: three) benchmarks where LLVM performs better than GCC -O2, I am not sure if it is within a noise factor.

          Scrolling


          Scrolling has similar properties to window opening: just bunch of C++ code that fits in RAM and thus is sensitive to compiler optimization.

          SVG rendering




          Here I used tsvgx test (that is said to be better than tsvg) and since it tests multiple images, I made geometric average.


          Canvas mark


          Again a gemoetric average, I did not investigate yet, why some runs has failed. Will update on it.

            Dromaeo CSS


            A geometric average of individual benchmarks.

            Dromaeo DOM


            This is the only benchmark where LLVM is not approximately in GCC -O1 to -O2 range and in fact LLVM -O2 outperforms GCC -O2 by 7%. I profiled it and again it was caused by GCC not inlining by PIC interposition rules, so the difference should disappear if the problem is fixed at LLVM side.

            Allowing the inline by adding explicit inline keyword makes GCC -O3 to match score of GCC -O3 -flto=16 where the inline happens then based on linker plugin feedback. Again someting I plan to followup on. There is bit more going on that I need to investigate deeper. For GCC, the crictical inline happens only at -O3 -flto, while LLVM gets it right at -O2 despite the fact that its -O2 binary is shorter than GCC's. The inline decisions are significantly different making it a bit harder to track down.

            Update: By mistake, I reported large improvement for -flto -O3 with LLVM. This was actually caused by fact that I pulled in some changes into firefox tree while looking for workaround for the compiler crash. One of them was patch improving DOM. I rerun the tests on original tree.

            Update: Loading common pages


            This test seems to be only one that seems to show that -O3 is not always a win due to code size bloat.

            So what we can improve?

            LTO in GCC has quite large room for improvement. I have patches for next release to improve inliner's behaviour and reduce code size considerably without giving up on the speed. There is also still a low hanging fruit on imroving memory usage and compile time. Re-tunning backend and applications themselves for the new environment will need more work.

            Comparing to LLVM it seems we should try to reduce amount of information stored into LTO object files that is almost twice of what LLVM streams. Part of the extra information is used for optimization and debug info, but the streaming was not very curefuly analyzed for what really is important where: it is more or less pickled in-memory representation. There is ongoing effort to cleanup GCC, modernize the source base that will hopefully enable some deeper cleanups, like simplifying the type representation, that will in turn improve memory footprint. Clearly speeding up C++ frontend and putting our datastructures on the diet can do wonders. (I would be curious to see clang retargeted to GCC backend).

            I also have quite few plans for profile feedback. I plan to reorgnize instrumentation so with LTO one will not need to recompile everything twice, only need to relink and hopefully AutoFDO and LIPO will be merged in.

            Next

            Hope you found the tests informative and I will try to followup on the unresolved questions in the near future. I also plan to write post on Libreoffice and for that I would welcome suggestions for benchmarking procedure.

            Linktime optimization in GCC, part 1 - brief history

            Link-time optimization (LTO) is one of the most actively developed features of GCC.
            Since 2010, I am working on getting LTO to work well with real world applications and with GCC-4.9 release candidate out, perhaps it is a good time to share my experiences.

            I plan to write about getting Firefox (Update: part 2 is here), Libreoffice, Chromium and other large applications to work with LTO. This will hopefully give an broader idea what kind of setup & portability issues one can hit. I will also show some benchmarks. Today I will however start with a short review of history and current status of implementation of link-time optimizations in GCC.

            Update: See also Martin Liška post on buliding Gentoo with LTO.

            Classical organization of the toolchain

            Classical UNIX toolchain optimize at translation unit basis (that is, more or less, a source file). The compiler parses the source, produces a representation in the intermediate language (IL), optimizes it and eventually generates assembly. Assembly file is turned into an object file by the assembler. Resulting binary is produced by the linker. At runtime the dynamic linker links multiple DSOs together and executes a program. Finally dynamic linker may be used again via dlopen to load more libraries/plugins. Practically all non-trivial optimizations happens in the compiler, while some limited tricks are performed by both assembler and linker.

            Early simple compilers internally worked on statement basis; often it was the case that the whole program would not fit into the memory (I will call this compilation scheme statement-at-a-time). Statement-at-a-time compilation greatly limits optimizations one can perform. Generally one can do only those transformations that are obvious from considering single statement in the isolation (such as expression simplification) though some basic inter-statement optimization is possible, too, such as basic common subexpression elimination or peephole optimization. Such optimizations can be implemented, at least in the limited form, as a single pass through the instruction stream without ever looking back.

            Extending the scope of the compilation from statement to function (function-at-a-time compilation) permits most of classical optimizations including constant propagation, common subexpression elimination, loop optimization, register allocation, advanced instruction selection and instruction scheduling.

            GCC started with a mixture of statement-at-a-time approach and function-at-a-time. Frontends generated intermediate code per-statement basis and immediately lowered it to very low-level (close to assembly) intermediate language called RTL (Register Transfer Language). According to Richard Stallman this was mostly done to deal with limitations of the stack size and memory of M68k machine he developed GCC on. The rest of the compiler worked at function-at-a-time. Even early GCCs was thus able to do basic function wide optimizations and until late 2.9 GCC series the optimizers has developed into strong low level optimization engine that was more portable to new architectures than any other compilers available at that time. The optimization queue consited primarily of strenghtened form of constant subexpression elimination (global value numbering), loop optimization (loop unrolling, strength reduction), jump optimization (including conditional jump generation), instruction combining, register allocation, and peephole optimization.

            Cygnus was a pioneering free software company that made its living from porting GCC to new architectures. The main focus in late 90s was put into making the originally CISC centric optimizers work well with RISC and later VLIW instruction sets and to generally improve set of intra-procedural optimizations. Important development change happened in 1997 with introduction of EGCS fork which enabled some of deeper changes in the compiler.

            Compiling C++

            In 90s and early 2000s, the popularity of C++ was on the rise while the language itself was still a quickly moving target (C++ was standarized in 1998 with an update in 2003 and C++ ABI took few extra years to settle down). This brought challenges not only for the developers of C++ front-end, library and runtime (that is a separate topic I do not know enough about to write it), but soon enough for the backend developers, too.

            In late 90's GCC was converted to function-at-a-time parsing. This was motivated mostly by a need of reasonable implementation of C++ templates. Here one want to keep close to source representation (generally called Abstract Syntax Tree, or AST) of templates until instantiation happens. Just like preprocessor macro, the template alone has no semantics that may be easily represented by classical compiler IL. This is hard to implement within a front-end that has no internal representation of the whole function body.

            Dealing with abstraction penalty

            C++ programmers more and more relied on compiler's ability to remove the abstraction penalty. It became essential to reliably inline functions, optimize out unreachable function bodies and do higher level transformation on function bodies (such as to replace data stored in instances of objects by scalar variables after all methods was inlined). C++ programers started to make strong assumptions about compiler's ability to cleanup their constructions while that are often not very well thought out from compiler developer's perspective.

            As one of main offenders, function inlining was not a strongest point of GCC, which still worked in function-at-a-time mode: GCC backend was organized as a library that was used by frontend whenever it needed to produce something into the assembly file. For that purpose the order of things compiled was completely out of backend's control and the scope of inter-procedural optimizations in general was extremely limited.

            Bottom-up inlining

            This is how the impossible (interprocedural optimization within a backend that generally works at one function at a time) was implemented.

            After function arrived from the frontend, backend performed following steps:
            1. Inliner inlined function calls of all functions it did previously memoized body of.
            2. Some early optimization passes performed some basic optimizations (dead code removal, jump optimization)
            3. Inliner checked if the optimized function is small enough, i.e. the number of instructions was less than 8*(8+num_parameters) and if so, it remembered the function for further inlining. Functions declared inline was remembered if their body did not exceed 600 instructions.
            Such inliner implementation is called bottom-up - if you imagine callgraph as a tree growing down, the inliner starts from the bottom (leaves) and propagates inline decision upwards. Obviously such inliner was limited to inline only in forwarding direction (the function must have been defined before used) and also was believed to consume a lot of memory because it remembered the expanded function bodies with all inline decisions already applied.

            Top-down inlining (GCC 3.0)

            The old inliner was rewritten in 1999 by CodeSoucery (originally for C++ only and later ported to other languages) to work on the high level form. In the new implementation inlining was moved into the front-end (in fact, both inliners has coexisted for few years before the new inliner was ported to all front-ends).

            At that time, C++ front-end already saved bodies of functions in the AST-like representation to handle templates. It also did not output most of functions just after it finished its parsing. Majority of functions was remembered in high-level form till very end of compilation when basic reachability analysis was performed to eliminate dead functions and to avoid unnecesary work of lowering them into the RTL representation. This was done because C++ include files tends to contain a lot of unused inline functions.

            Adding inlining was conceptually simple. At the end of the compilation, the compiler started with functions that was clearly necessary (i.e. their address was taken, they was exported from current compilation unit and so on). For every function the following steps was applied:
            1. It was decided if function is inlinable and of so, its body was saved for later use.
            2. Calls to all small enough inlinable functions was inlined and recursively also all such calls within the inlined function bodies.
            3. The function was optimized and output to assembly. The expanded function body was released from memory.
            4. Every function that got mentioned in the assembly output was added into list of functions that are necessary.
            This approach is called top-down because it works from root of the tree to the leaves.

            One problem was how to define small functions - the high level representation did use C++ style statements and not assembler like instructions. Initial implementation simply counted every statement to have 10 instructions at average and the functions declared inline was inlined if their size was less than 500 in this estimate. Functions was autoinlined if their size was less than 100.

            PR8361, tramp-3d and other compile time/memory use disasters

            The new inliner has turned out to have serious issues. One of more famous examples, at that time was bug PR8361 (filled in 2002 by Gerald Peiffer) that exposed exponential explosion in memory usage of the compiler while building template heavy C++ program. This bug defeated a solution using simple hacks to GCC and nicely exposed problems of the new design.

            First problem is use of inline in C++ code. Functions declared in header are often implicitly inline only because of developer's lazyness and the inline keyword in general is not as useful as one would expect. This means that compiler can not honor all inline functions in the program - inlining those alone will make compiler to explode on was majority of modern C++ sources. We had some relatively heated discussions about it in 2005 and of course making compiler to be selective about inline functions has made low level developers angry. For this reason always_inline attribute was born.

            New, and much more evil, testcase was prepared by Richard Guenther (Biener), who later became one of most profilic GCC developers. His PhD research project was implemented using pooma and till today it serves as very useful benchmark for new inliner patches. SUSE's benchmarking server nicely shows how the performance was steadily improving in 2006-2009 until the most important problems was finally resolved.

            Unit-at-a-time (GCC 3.4)

            Originally as an experiment to do something during the first SuSE-labs conference, I started to work on reorganizing GCC to work at unit-at-a-time basis. At that time it was already obvious that function-at-a-time optimization brings limitations we did not like. Most importantly the function inlining worked only in forwarding order - if you called function before defining it, it would not be inlined.

            In GCC 3.4 I merged in bare bones of the new callgraph module (described in this paper) that made compiler to first build intermediate language representation of the whole unit and then start optimizing.

            The new inliner was something I put together as a simple proof-of-concept experiment of inter-procedural optimization pass. It used simple greedy algorithm that inlined all small enough functions in the order given by badness metrics (num_calls*size) until global unit growth of function growth limits was hit. The algorithm was optimizing for number of functions fully inlined into their callees. The limits was basically a security precautions to avoid exploding on inline bombs and exposing nonlinearity of the compiler on very large function bodies..

            Bit more details about the early design appear in 2006 GCC summit paper and in  2007 GCC summit paper.

            The inliner heuristics was the incrementally improved into the today form and also the callgarph module has grown up into a full inter-procedural optimization framework with additional optimizations implemented, such as constant-propagation, function specialization, partial inlining, inter-procedural points-to analysis (still not enabled by default) and others. The inliner still remains the most important component of it, probably as in every optimizing compiler.

            Combining translation units via --combine (GCC 3.4)

            GCC 3.4 was released in 2004. Since late 90's better commercial compilers and recently open-sourced Open64 was already capable of inter-module optimization via various implementations of link-time optimization. Here GCC was held back by both technical and political problems.

            Clasical link-time optimization is implemented by storing the intermediate language into the "fake" object files. To make this possible, one need to have well defined representation of the whole compilation unit and the whole program. Because the backend has grown-up from the per-function compilation, it did not really had this.

            At a same time, the Free Software Foundation was concerned about the consequences of adding a well-defined on-disk representation of GCC IL that could be used to hook GCC into proprietary front-ends and back-ends without violating GPL 2.

            First attempt to get intermodule was implemented by Apple developer Geoff Keating. His implementation extended the C front-end to allow parsing of multiple source files at once and presenting them to backend as one translation unit. While this feature allowed some early experiments with whole program optimization, it also had serious limitations. It worked for C only and plans to extend it for C++ was never realized. It however worked well for many simpler programs and it was also able to detect programming errors that would be unnoticed with traditional scheme.

            The new middle end (tree-SSA, GCC 4.0)

            These optimizaitions became increasingly difficult to implement on GCC's low level intermediate program representation. For this reason a new middle-end was implemented. The tree-SSA project was probably the most important change that happened to GCC recently, but a topic for a different article ;)

            LLVM (2003), Open64 (2002)

            At the time Tree-SSA merge was being prepared, Chris Lattner presented at 2003 GCC Summit an proposal of itegration of LLVM into GCC as a new middle-end that would take place of the (not yet finished) Tree-SSA project. This never happened and as a result we now have two major open source compilers.

            One of major trading point at that time was implementation of LTO. People made guesses if it will be harder to modify GCC architecture for LTO or if it would be harder to complete LLVM to features and richness of backends GCC had. It is definitely interesting to observe that from today point of view.

            Update: Here are slides of Rafael Espindola talk about history of LTO in LLVM.

            As a less known fact, Open64, was open sourced 2000 working as an alternative GCC backend with full LTO support.

            Early days of LTO implementation in GCC (2005)

            The political limitations to LTO and compiler plugins was lifted by GPL 3.0. In 2005 an official proposal "Link-Time Optimization in GCC: Requirements and High-Level design" was posted to GCC mailing list. Work on LTO started jointly at Google, IBM and Codesoucery.

            The proposal took lessons from LTO designs in other compilers and proposed several important requirements starting from maximal transparency to user (basically LTO should require only -flto command line option), having well defined GNU Virtual Machine (GVN) and number of necessary changes across the GNU toolchain (assembler, linker and such).

            WHOPR: Whole program optimization (2007)

            Implementing LTO original proposal was extreme amount of work. While progress was made, some parts, like type representation in DWARF format, has shown to be progressing too slowly. Moreover Google started to consider an interesting problem on how to make LTO scale to really large applications, like one used internally by the compay.

            For this reason number of changes to the project was made. New proposal WHOPR - Fast and Scalable Whole Program Optimizations in GCC was born and it dropped some of initial plans, most importantly the well defined GNU Virtual Machine while it added new. The streaming of GCC's IL (gimple) was implemented. This has performance advantages because it permits to save a lot of compiler specific imformation that does not need to be rebuilt at link time.


            WHOPR introduced an approach to distributed link-time compilation, where the serial linking stage is minimalized. This fit well with the basic direction of the Intermodule optimization framework I was working on.

            LIPO

            LIPO is an interesting project by that apparently large difference for Google. It enables intermodule optimization based on profile feedback. Instead of doing classical and expensive whole program analysis at link-time, the basic decisions are actually done at runtime of the program being trained. The dynamic callgraph is built and the inlining decisions are made. The dynamic callgraph considers only portions of program that matters and thus it is a lot smaller. Also profile driven inlining decisions can be a lot more precise and reliable. Later, when recompiling with profile feedback, the compiler is told the inliner decisions and can parse the extra source files necessary.

            While this idea seems a bit backwards to the traditional design, it makes a lot of sense in Google's environment. Compilation needs to be highly parallelized and it needs to fit in relatively small memory limits. On the other hand the runtime construction of callgrpah is less constrained. In this situation LIPO seems to easily outperform traditional LTO.

            LTO merge (GCC 4.5)

            LIPO's success in Google got development on WHOPR/LTO into trouble (at least that is my off-sider's understanding :). Most of WHOPR core developers were moved to other projects and they were finishing it in their free time. The project however succeeded in being merged to GCC in 2009, even if it was not in perfect shape. GCC 4.5 was released with LTO working well enough for SPEC benchmarks and bootstrap, while the actual WHOPR implmentation was not really functional. Richard Biener did huge amount of work on getting the LTO into working shape.

            Update: Solving correctness issues

            Having ability to store intermediate language to disk and load back is one problem a developer of link time optimization framework needs to solve. There are many other (and less understood) challenges.

            Important part of the problem is to define precisely the semantics of the combined units and revisit rules and assumptions individual optimizers make. Most language standards deal with one compilation unit alone (well, at least this is mostly true for C, C++ has One Definition Rule). Consequently one has to think what happens when there are two types of the same name with different meaning coming from different units, what happens when pointer to datastructure in one language (C, C++, Fortran, Ada, Java) is accessed as datastructure in another language. One has to deal with conflicting optimization flags (like -fstrict-aliasing used in one unit, but not in another) because all the code is going to be mixed together by the inter-module inlining.

            At GCC 4.5 we want long way for one language and C/C++ mixed programs. Compiler was partly cleaned up from hooks allowing front-ends to customize backend behaviour and the basic type merging was implemented. Many of the issues however was left unresolved and had to be dealt with later.Some of them, in particular the command line issues, do not have completely satisfactory solution till present day.

            Binutils support

            LTO optimization is major change and it requires changes thorough the whole toolchain. Linker and other utilities to work with object files needs to be told to operate with LTO object files that are not containing any actual assembly. In GNU toolchain this was realized by  plugin support (implemented by Cary Coutant). Plugins adds a simple, but flexible API, to glue compiler into the gold linker that is useful for all LTO enabled compilers. Unlike approaches working around the linker changes this provide very precise information to the compiler about what symbols may or may not be used by non-LTO code. This is very important for optimization.

            The linker plugin support was later added to the traditional GNU-ld (by Dave Korn), enabling LTO support on non-ELF targets including Windows.

            One interesting aspect of the linker plugin design is the fact that it allows one pass linking. Most other designs dispatch compiler from the linker and once compiler delivers object files back, the linking is re-done. Gold and GNU-LD implements one pass linking. This has however caused an issue and thus alternative, two-pass linking, implementation was done by H. J. Lu.

            Final ar,nm and ranlib utilities was extended for plugin support, by Rafael Espindola, and gcc convenience wrappers added by Andi Kleen. I am still not completely happy about the need for wrappers, but I will write more about that next time.

            Getting WHOPR to work (GCC 4.6)

            Having basic infrastructure in shape, many GCC developers continued on necessary cleanups and improvements. The linker plugin became the default and we worked on many correctness issues.

            To get an specific goal I decided to work towards making GCC to compile Firefox. This was motivated by a post from Firefox developer Taras Glek about massive performance regression in GCC 4.5 (which has turned out to be problem with compiling C++ applications with -Os). I started to look into performance issues of Firefox and found them puzzling and interesting.

            In 2010 GCC summit paper we reported a success on linking Firefox in reasonable time (4 minutes on 24core machine) and 4GB of memory use. (Without the WHOPR mode linking took 19 minutes 8GB of RAM). The resulting binary did work and showed small, 1% improvement on dromaeo CSS benchamrks as well as 6% code size reduction. Not many compilers are capable of this.

            Firefox is a quickly growing target - a year later I needed 8GB of memory to build it with the same compiler,so GCC 4.7 had to be optimized carefully to squeeze it into 3GB. Today firefox, compiled with GCC 4.7 needs over 20GB of RAM. Some observations on this can be found in these 2013 slides.

            In gcc 4.6 -fwhopr was removed and became default with -flto. The non-whopr path can still be executed via -flto-partition=none.

            Kernel compilation (GCC 4.7+)

            Other long running project was Andi Kleen's effort to get Linux kernel to build with LTO. For a kernel developer, he made remarkable job on adding some missing features to the toolchain, mostly importantly the support for incremental linking. His effort is described in 2013 slides. The changes was considered for 3.15rc1 but missed the window apparently due to lack of evidence of benefits to the end-user.

            GCC 4.7 also had a rewrite of interprocedural constant propagation and a new inliner analysis that is aware of the context function is being inlined to (and, for example, can anticipate what parts of function body will be optimized out).

            Progress on correctly compiling non-trivial C++ program was made by reimplementing thunks support.

            Symbol table, scalability and stability improvements (GCC 4.8)

            One of long lasting GCC oddities was the lack of a symbol table datastructure. This made it harder to develop representation of some features that are tightly related to it, including aliases and thunks. For this reason we reorganized the callgraph module into a new symbol table datastructure that enables a better LTO optimization in programs involving those features.

            GCC 4.8 also had the usual work on memory reductions, LTO speedups. Since the LTO codegen started to be in shape, it was also the first release where the inliner went through re-tunning with link time optimization being considered.

            Today (GCC 4.9)

            GCC 4.9 is being released this week (with release candidate 1 being out since last week). It is pretty big release with many internal changes and API cleanups.

            The most important changes in LTO (besides the usual bugfixes) involve new type merging implementation by Richard Biener and many of compile time and memory use improvements. Firefox once again can be linked bellow 4GB limit despite its ongoing growth. In fact the memory use is now less than two-fold of memory needed by the linker alone that is not too bad result given how much more information GCC handle.

            An important user visible change is the new default to slim LTO files instead of fat.  GCC 4.8 and earlier produced both the intermediate language as well as assembly into every object file copmiled with -flto. This made it possible to handle LTO objects with non-LTO aware tools (ar, nm, ranlib and the linker) but it also doubles compile time, since everything is compiled twice. Since the binutils, libtool and other support matured, it is finally possible to avoid this expense.

            Martin Liška contributed pass for profile driven function reordering. This pass should enable faster startup of large applications by reducing the portion of binary that needs to be read into memory and making the accesses more sequential.

            Another new feature is the devirtualization module that allows early removal of unreachable virtual methods (that makes a huge difference on Firefox compilation) and also improves code as described in other post.

            GCC is now known to compile working Firefox, Chromium, Libreoffice, kernel and good a part of Gentoo distribution with LTO enabled.

            What next?

            LTO is far from being complete. One of main lacking features, by my opinion, is consistent handling of optimization options and debug info matching the non-LTO compilation. To get resonable performance benefits, the compiler needs to be re-tuned for the new whole program optimization mode and also some applications needs to be adjusted.

            There is also plan to merge LTO streaming based LIPO from Google branch, Martin Liška's code unification pass. I also hope the plans for well define virtual machine will be revived. In longer term I am planning to extend WHOPR to support fast recompilation in compile-edit cycle.

            Next time, i will write about series