Thursday, September 18, 2014

Devirtualization in C++, part 7 (Enforcing One Definition Rule)

This time I am going to write about a feature that I implemented as a side effect of the devirtualization machinery: the verification of the One Definition Rule on types.

The One Definition Rule (ODR) is one of the more interesting cases where C and C++ differ in a very subtle way. Wikipedia describes it as follows:
In short, the ODR states that:
  1. In any translation unit, a template, type, function, or object can have no more than one definition. Some of these can have any number of declarations. A definition provides an instance.
  2. In the entire program, an object or non-inline function cannot have more than one definition; if an object or function is used, it must have exactly one definition. You can declare an object or function that is never used, in which case you don't have to provide a definition. In no event can there be more than one definition.
  3. Some things, like types, templates, and extern inline functions, can be defined in more than one translation unit. For a given entity, each definition must be the same. Non-extern objects and functions in different translation units are different entities, even if their names and types are the same.
Some violations of the ODR must be diagnosed by the compiler. Other violations, particularly those that span translation units, are not required to be diagnosed.[1]
This means that C++ type names are global and should not be re-used in different ways in different source files. (In C doing so is perfectly valid and common; types across units are considered equivalent if their structures are.)

One of the motivations for the ODR is to make name mangling possible. Type names are used, for example, as program-wide identifiers to distinguish functions and methods of the same name but taking different types of arguments (function overloading):
struct A {int a;};
int getval (struct A a) {return a.a;}
struct B {char a;};
int getval (struct B a) { return a.a;}
This compilation unit will result in exporting two function symbols: _Z6getval1A and _Z6getval1B. The A and B come from the name of the argument's type.

Now, obviously, if another unit defines a completely different structure A and a completely different getval taking it as an argument, that function will also be called _Z6getval1A and things will fail miserably. General violations of the ODR cannot be diagnosed by a compiler working on a single translation unit in isolation. If one is lucky, the ODR violation turns into a linker error about a duplicated symbol name. In the presence of inline functions, however, this will not happen and one of the two functions will be eliminated from the program. This may result in the invocation of a function on wrong data, crashing the program in random ways.
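
For illustration, here is a minimal sketch of such a clash (contents invented; each snippet is a separate translation unit using inline functions):
struct A {int a;};
inline int getval (struct A a) {return a.a;} // emitted as _Z6getval1A
int use1 () {struct A x = {42}; return getval (x);}
and elsewhere:
struct A {char a;}; // an unrelated type re-using the name
inline int getval (struct A a) {return a.a;} // also emitted as _Z6getval1A
int use2 () {struct A x = {1}; return getval (x);}
Each unit is valid in isolation, but at link time only one copy of _Z6getval1A survives, so either use1 or use2 ends up calling getval on the wrong layout of A.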

-detect-odr-violations in gold

One ODR checking tool I am aware of is gold's -detect-odr-violations:
gold uses a heuristic to find potential ODR violations: if the same symbol is seen defined in two different input files, and the two symbols have different sizes, then gold looks at the debugging information in the input objects. If the debugging information suggests that the symbols were defined in different source files, gold reports a potential ODR violation. This approach has both false negatives and false positives. However, it is reasonably reliable at detecting problems when linking unoptimized code. It is much easier to find these problems at link time than to debug cases where the wrong symbol is used.

Ian Lance Taylor: New ELF linker, Proceedings of the GCC Developer's Summit 2008
This is a very simplistic analysis, but the linker can hardly do much better: it lacks knowledge of types. Moreover there are some valid cases where the difference can happen, and thus the warning can produce false positives.
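
In essence the quoted heuristic boils down to something like the following (a sketch with invented types, not gold's actual code):
#include <cstddef>
#include <string>

struct symbol_def {
  std::string name;   // symbol name
  size_t size;        // size of the definition in the object file
  std::string source; // source file recorded in the debug info
};

// Two definitions of one symbol are suspicious when their sizes differ
// and the debug info points at different source files.
bool potential_odr_violation (const symbol_def &a, const symbol_def &b)
{
  return a.name == b.name && a.size != b.size && a.source != b.source;
}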

-Wodr: detecting ODR violations in GCC

As discussed in the earlier post about construction of the type inheritance graph, establishing type equivalency based on names is an important step in building a cross-compilation-unit type inheritance graph. It is tricky to do for templates and thus in GCC 4.8 I used the simple trick of identifying types based on the mangled names of their associated virtual tables. This works only on types with virtual methods, and thus with the help of our C++ maintainer, Jason Merrill, I extended the mechanism to all class and union types: before streaming out LTO data I simply invoke a C++ frontend langhook to compute the mangled name of a type and save it into the LTO object file. At link time I can use the names to establish equivalence of all named types. This way the middle-end does not need to know details of C++ namespaces and templates, with the exception of being able to identify types in anonymous namespaces and disable merging for those.

This streaming is not free; it seems to add about 2% of extra memory use and compile time for the mangled names. One day I plan to use the mechanism to strengthen the type-based alias analysis for C++ and also to reduce redundancies in the debug info produced (the second is actually done by LLVM to help fight memory usage explosion).

Once ODR-based type equivalency is established, I can actually compare the types that become equivalent as part of the LTO linking process. Doing so is a simple recursive walk over GCC's type representation (which is sufficiently close to the parse tree of the declarations), comparing the two. This step is not completely trivial because one needs to get past some differences that are considered valid. In particular, complete and incomplete types of the same name are equivalent, and there are a few cases that we want to tolerate - such as use of attributes and perhaps differences caused by command line options, such as -fsigned-char/-funsigned-char, even if those are ABI breaking and could lead to wrong code.
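
The comparison can be pictured as follows (a minimal sketch over an invented type representation, not GCC's actual trees; the real code also guards against recursive types and checks kinds, sizes and attributes):
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct type_node {
  bool complete; // incomplete types are considered equivalent to complete ones
  std::vector<std::pair<std::string, type_node *>> fields; // (name, type) pairs
};

bool odr_types_equivalent (const type_node *t1, const type_node *t2)
{
  if (!t1->complete || !t2->complete)
    return true; // tolerate a complete vs. incomplete declaration
  if (t1->fields.size () != t2->fields.size ())
    return false;
  for (size_t i = 0; i < t1->fields.size (); i++)
    if (t1->fields[i].first != t2->fields[i].first
        || !odr_types_equivalent (t1->fields[i].second, t2->fields[i].second))
      return false;
  return true;
}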

Making the warning useful

My first attempt resulted in plenty of warnings looking as follows:
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:38:16: warning: type ‘struct ZipCentral_’ violates one definition rule [-Wodr]
 typedef struct ZipCentral_
                ^
../../dist/include/zipstruct.h:38:0: note: a different type is defined in another translation unit
 typedef struct ZipCentral_
 ^
My best guess of the average Joe's reaction to the message is "those types are the same! Stupid GCC!". It is a common case that types declared the same way and in the same header are actually different because the difference is carried in from earlier includes. Without being able to report why the types are different, the warning would probably be ignored and declared bogus by most developers. For this reason I ended up adding about 200 lines of code just doing a kind of structural diff and trying to output useful information about the difference.

In this particular case it comes down to:
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:38:16: warning: type ‘struct ZipCentral_’ violates one definition rule [-Wodr]
 typedef struct ZipCentral_
                ^
../../dist/include/zipstruct.h:38:0: note: a different type is defined in another translation unit
 typedef struct ZipCentral_
 ^
/aux/hubicka/firefox4/modules/libjar/zipstruct.h:47:25: note: the first difference of corresponding definitions is field ‘MOZ_Z_crc32’
   unsigned char crc32 [4];
                         ^
../../dist/include/zipstruct.h:47:0: note: a field with different name is defined in another translation unit
   unsigned char crc32 [4];
 ^
A fair try, but still not perfect :). Luckily, the problem is now not hard to guess. There are two zlib implementations linked into Firefox and one of them redefines crc32 to MOZ_Z_crc32 (apparently to avoid ODR conflicts identified earlier by gold). I thus decided that such warnings are verbose enough and did not add further information in this particular case. I expect to improve the diagnostics as we get more experience with the individual errors reported.

Limitations of ODR detection

Even my ODR verifier is not a complete ODR nazi - it verifies violations only for types that survive into the actual program and only for types where the name matters (i.e. is part of type mangling). For example, if one unit defines
typedef int mytype;
and the other:
typedef char mytype;
the violation will not be reported. This is because typedefs are ignored for type mangling purposes, and thus I have no way to figure out that those two types are intended to be equivalent.

Another problem is, of course, that detection is limited to one binary. More conflicts can be introduced when linking with a non-LTO library, either statically, dynamically, or at runtime via dlopen.

Detecting virtual table ODR violations

The ODR relates not only to types, but also to functions and methods. It is very hard to compare two methods coming from different units, because the bodies may differ in representation (each is compiled with different flags or optimized in a different context) and yet have the same semantics. What I can easily do, however, is compare virtual tables. Mismatches in those are especially deadly because they may result in dispatch to a wrong virtual function.
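
Conceptually the check is just a parallel walk over the two tables; here is a sketch over an invented representation (method names standing in for GCC's virtual table entries):
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Return the index of the first virtual method differing between two
// virtual tables, or -1 when the tables are equivalent.
long first_vtable_mismatch (const std::vector<std::string> &vt1,
                            const std::vector<std::string> &vt2)
{
  for (size_t i = 0; i < std::min (vt1.size (), vt2.size ()); i++)
    if (vt1[i] != vt2[i])
      return (long) i; // first conflicting entry
  if (vt1.size () != vt2.size ())
    return (long) std::min (vt1.size (), vt2.size ()); // length mismatch
  return -1;
}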

Here is an example of ODR violation in Firefox:
../../dist/include/mozilla/a11y/DocAccessible.h:40:0: warning: type ‘struct DocAccessible’ violates one definition rule [enabled by default]
 class DocAccessible : public HyperTextAccessibleWrap,
 ^
/aux/hubicka/firefox/accessible/src/generic/DocAccessible.h:40:0: note: a type with the same name but different virtual table is defined in another translation unit
 class DocAccessible : public HyperTextAccessibleWrap,
 ^
/aux/hubicka/firefox/accessible/src/generic/DocAccessible.cpp:1232:0: note: the first different method is ‘HandleAccEvent’
 DocAccessible::HandleAccEvent(AccEvent* aEvent)
 ^
/aux/hubicka/firefox/accessible/src/atk/AccessibleWrap.cpp:956:0: note: a conflicting method is ‘HandleAccEvent’
 AccessibleWrap::HandleAccEvent(AccEvent* aEvent)
 ^

How common are ODR violations?

Fairly common :) Out of the 3 programs (GCC, Firefox and LibreOffice) I tried to build with ODR checking, each of them had type ODR violations, and one of them (Firefox) had a violation in a virtual table. Today I learned that the LLVM team has fixed ODR violations based on the new warning, too. What are the most common causes?

Random type name clashes

Each time you create a type with a short name in the global namespace (say hash_t), you risk a clash with other units doing the same for a completely unrelated reason. This is a good motivation to use namespaces and anonymous namespaces (though anonymous namespaces cause problems with GDB debugging, see PR16874). A Firefox example looks as follows:
/aux/hubicka/firefox4/rdf/base/nsInMemoryDataSource.cpp:149:0: warning: type ‘struct Entry’ violates one definition rule [-Wodr]
 struct Entry {
 ^
/aux/hubicka/firefox4/gfx/skia/trunk/src/core/SkFlattenable.cpp:66:0: note: a different type is defined in another translation unit
 struct Entry {
 ^
/aux/hubicka/firefox4/rdf/base/nsInMemoryDataSource.cpp:150:0: note: the first difference of corresponding definitions is field ‘mHdr’
     PLDHashEntryHdr mHdr;
 ^
/aux/hubicka/firefox4/gfx/skia/trunk/src/core/SkFlattenable.cpp:67:0: note: a field with different name is defined in another translation unit
     const char*             fName;
 ^

Some examples come in this thread: http://nabble.documentfoundation.org/Re-Performance-samples-for-LibreOffice-td4119856.html.

Cut&paste

Copying code from one unit to another and adjusting the data structures for a new purpose is a common way to introduce name clashes. For example, in GCC we have plenty of structures expr and occr that all originate from gcse.c, which implemented the first dataflow-based global optimizer.

Again, a Firefox example:
/aux/hubicka/firefox4/netwerk/sctp/datachannel/DataChannel.h:57:0: warning: type ‘struct BufferedMsg’ violates one definition rule [-Wodr]
 class BufferedMsg
 ^
../../../../dist/include/mozilla/net/DataChannel.h:57:0: note: a different type is defined in another translation unit
/aux/hubicka/firefox4/netwerk/sctp/datachannel/DataChannel.h:64:26: note: the first difference of corresponding definitions is field ‘mSpa’
   struct sctp_sendv_spa *mSpa;
                          ^
../../../../dist/include/mozilla/net/DataChannel.h:64:0: note: a field of same name but different type is defined in another translation unit
/aux/hubicka/firefox4/netwerk/sctp/datachannel/../src/usrsctp.h:198:8: note: type ‘struct sctp_sendv_spa’ should match type ‘struct sctp_sendv_spa’ but is defined in different namespace  
 struct sctp_sendv_spa {
        ^
../../../../dist/include/mozilla/net/DataChannel.h:60:0: note: the incompatible type is defined here
Here the type declaration is unchanged, but one of the types it is built from has changed. This case also shows a namespace clash.

"Upgrading" codebase from C to C++

Because C does not have the ODR, code that was originally written in C and later upgraded to C++ needs an audit for ODR violations. This is what the warning found for GCC: https://gcc.gnu.org/ml/gcc-patches/2014-09/msg01402.html

Linking multiple versions of given library into the program

This is a particularly common case for large programs. For example:
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/../FFmpegDecoderModule.h:18:0: note: a type with different memory representation is defined in another translation unit
 class FFmpegDecoderModule : public PlatformDecoderModule
 ^
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/include/libavcodec/avcodec.h:985:0: warning: type ‘struct AVFrame’ violates one definition rule [-Wodr]
 typedef struct AVFrame {
 ^
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav54/include/libavcodec/avcodec.h:989:0: note: a different type is defined in another translation unit
 typedef struct AVFrame {
 ^
/aux/hubicka/firefox4/content/media/fmp4/ffmpeg/libav53/include/libavcodec/avcodec.h:997:0: note: the first difference of corresponding definitions is field ‘data’
     uint8_t *data[AV_NUM_DATA_POINTERS];
 ^
is one of many warnings issued during Firefox build.

Conditional compilation

In some cases, which looked particularly dangerous to me, there are fields that are compiled conditionally (such as under #ifdef DEBUG) and the condition is not set consistently across the whole program. This is also quite a common case in Firefox. It usually shows itself as a warning at the same position but complaining about a mismatch of field names.
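
A hypothetical illustration of the pattern (type and field names invented):
struct State {
#ifdef DEBUG
  const char *last_op; // present only in units compiled with -DDEBUG
#endif
  int value;
};
If only some units are built with -DDEBUG, the two variants of State violate the ODR and every field after the #ifdef shifts its offset.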

Non-LTO and whole program ODR checker

It seems that ODR violations are very common bugs in large C++ programs. The majority of the ODR problems found are "safe" in the sense that they will not result in wrong code: the types in question are never used in overloaded functions or do not have methods attached. There were, however, a few real bugs, and many of those random clashes have the potential to become real bugs later.

Maintainers of the programs usually welcomed the reported issues and fixed them quite promptly. An especially welcome response came from Michael Meeks:
Jan - wow - that is a nice error =) are there any other ODR issues ?
they habitually bite us hard so ... great to get libmerged debugged even
more. CC'ing the list too.
It seems to me it would make sense to implement ODR checking that is not dependent on LTO and does not suffer from the limitations of gold, in the following way:
  1. Extend DWARF info by mangled names of types.
  2. Implement ODR based debug info merging in linker and warn on mismatches.

    Merging of ODR types would likely reduce debug info size considerably (judging from GCC and LLVM memory footprint issues).

    The linker also has a chance to examine shared libraries linked into the program, as long as they have the annotated debug info available. This is not possible in my LTO implementation.
  3. Implement a dynamic library that will optionally verify types at dynamic linking time. This way one can report issues caused by dlopen.
I am quite sure more interesting cases would be found. Sadly I do not think I will have time to implement it :(

Tuesday, September 9, 2014

Linktime optimization in GCC, part 3 - LibreOffice

After Firefox, I decided to look into LibreOffice performance with link time optimization (LTO) and profile feedback (FDO). This post has been in the works since April, so my setup is not exactly bleeding edge - a GCC 4.9.1 prerelease, LLVM 3.5 and a LibreOffice checkout from April. Some real world issues (like a wedding) interrupted my work; however, the data presented should still be relevant and interesting.

Update: I fixed the permissions for google drive, so you can see the charts.

Prerequisites

In addition to all the dependencies described in Building LibreOffice on Linux: Tips and Tricks, the following is needed to get an LTO build (copied from the Firefox report):

  1. Binutils with plugin enabled. This can be checked by
    $ ld --help | grep plugin
      -plugin PLUGIN              Load named plugin
      -plugin-opt ARG             Send arg to last-loaded plugin
    I personally use gold from GNU Binutils 2.24.51.20140405 but any reasonably recent binutils with either GNU ld or gold should do the job. One however needs to enable the plugin while configuring, since it is off by default.
  2. GCC 4.7+ that is configured with plugin support (i.e. built with the plugin enabled LD). For sane memory use, GCC 4.9 is recommended, it requires about 5 times less RAM.

    To test if GCC runs with plugin support, try to link simple hello world application and do:
    $ gcc -O2 t.c -flto --verbose 2>&1 | grep collect2.*plugin
    If the output is non-empty you have the plugin set up correctly.
  3. Clang needs to be built with gold llvm plugin enabled, too. Follow the official instructions.
  4. ar/nm/ranlib wrappers on the path that dispatch to the real ar/nm/ranlib with plugin argument.

    This is a somewhat annoying issue. GCC ships with gcc-ar/gcc-nm/gcc-ranlib, but it is not very easy to convince the build script to use them. Adding to the path a simple script calling the gcc-* equivalent will lead to infinite recursion. I use the following hack:
    $ cat ar
    #! /bin/sh
    command=$1
    shift
    /aux/hubicka/binutils-install/bin/ar $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat nm
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/nm --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*

    $ cat ranlib
    #! /bin/sh
    /aux/hubicka/binutils-install/bin/ranlib --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so $*
    In the LLVM case I use the same wrappers, just the plugin path is /aux/hubicka/llvm-install/lib/LLVMgold.so.

    Alternatively the plugin can be installed into the default plugin search path (/usr/lib/bfd-plugins). This solution has some limitations however. One needs not-yet-released binutils to make this work with GCC, and one cannot easily install multiple plugins. I hope this situation will change in the not so distant future.


    Forgetting these steps will give you undefined symbols as soon as you start to link a static library built with -flto and GCC 4.9+, or with -flto -fno-fat-lto-objects and earlier compilers.

GCC 4.9.0 -Os -flto bug

GCC 4.9.0 will segfault building LibreOffice with -Os -flto. I fixed it by this patch, which only got into GCC 4.9.1.

    LibreOffice configuration

    I used a trunk checkout from Apr 28 2014. I had to use --without-doxygen --enable-python=no for the builds to pass. I built with --disable-symbols to get faster builds; especially LLVM LTO needs a lot of extra memory and compile time for debug info.

    In some builds I enabled LTO by --enable-lto and merged the libraries by --enable-mergelibs.

    I had to disable some unit tests because they were failing in some configurations (the writer tests failed in LTO builds only, while the rest failed in non-LTO, too): CppunitTest_cppcanvas_emfplus, CppunitTest_dbaccess_firebird_test, CppunitTest_dbaccess_embeddeddb_performancetest, CppunitTest_services, CppunitTest_writerperfect_impress, CppunitTest_writerperfect_writer.

    Building with profile feedback

    It seems that no one has tried it before. After some experimentation I got LibreOffice building and working with profile feedback. This is not completely trivial:
    1. First configure with profile collection enabled:
      CFLAGS="-O2 -fprofile-generate -fprofile-correction" CXXFLAGS="-O2 -fprofile-generate -fprofile-correction" LDFLAGS="-O2 -fprofile-generate -fprofile-correction" <srcdir>autogen.sh --enable-mergelibs --disable-symbols --without-doxygen --enable-python=no --enable-lto
    2. Unfortunately the tree fails to build; this is because the dynamic linker fails with "cannot load any more object with static TLS". While profiling uses TLS, it uses the global-dynamic model, and I failed to figure out what introduces static TLS into the libraries. In my limited understanding, this problem ought to happen only when objects built without -fPIC are linked into a dynamic library and the dynamic library is loaded by dlopen. I build on x86-64, and in this case I would get warnings about wrong relocations being used, and I am not getting these.

      Anyway, there is an ugly workaround: preload the largest library. This will get the dynamic linker to allocate enough space for the tables before the application starts:

      CPPUNITTRACE="LD_PRELOAD=<builddir>/instdir/program/libmergedlo.so" make
      I still had to disable a couple more unit tests to let the build finish. Something to debug later ;)

      It is a good idea to have the DISPLAY environment variable set to a working X server; that way, more testing of interactive code is done. I use vncserver to get things trained.
    3. To train the application, I started LibreOffice, went to the writer and quit. I assume that the unit testing has pretty good coverage and thus the application is reasonably trained once the build finishes. It is still good to be sure that the functions executed at startup are triggered. It might also be a good idea to do make check (with the LD_PRELOAD hack) for some additional testing, if it were not failing with even more unit tests.

      In fact there is an interesting problem with this - the code layout produced by the LTO build is likely more optimized for unit testing than for the actual application. It may make sense to start LibreOffice enough times so that the normal startup prevails, or to avoid the unit testing and do some more extensive training like Firefox does.
    4. Archive profile files (this step is needed, since make clean will remove them):
      tar czvf /tmp/profile.tgz `find . -name "*.gcda"`
    5. Build with profile feedback:
      make clean
      sed s/profile-generate/profile-use/g -i config_host.mk
      tar xzvf /tmp/profile.tgz
      make

    Compile time


    LibreOffice can build in two modes - default and mergedlibs. In mergedlibs most of the shared libraries are built into one library (libmerged) to reduce dynamic linking overhead. Of course mergedlibs is very interesting for LTO, and thus in most cases I compiled LibreOffice in both settings and did most of the testing only in mergedlibs mode.  Since a build takes about an hour, I did not complete the full matrix and chose only the combinations I consider interesting.

    There seem to be no major surprises: LTO got significantly faster since GCC 4.8 times (by 32%). This is due to disabling fat LTO files and thus not compiling everything twice. This time I did not enable LTO parallelism (-flto=16), because it is a bit hard to glue into the Makefile and, as shown later, it does not make that major a difference. Clearly there is thus low-hanging fruit to make LTO builds faster by getting the parallelism right. The good news is that even with this simple setup LTO is not significantly slower than normal compilation.

    GCC 4.7 cannot build LibreOffice with LTO.

    LLVM at -O2 is about 32% faster than GCC 4.9. Sadly GCC has noticeably slowed down since GCC 4.7 (by 11%). This was not so visible for the Firefox compilation, and I believe it is caused, in large part, by more expensive initialization at startup for IRA (the new register allocator introduced in 4.8) and by the bazillions of new AVX/SSE builtins introduced. GCC 4.8 was a busy release. I did some work on this problem for GCC 5. I will publish an update on the build times once I get LibreOffice building reliably with that compiler.


    This graph shows system and user times. Clearly GCC consumes more system time - by requiring a separate assembler and by making more pages dirty. Again the cost of LTO is not really extreme. I did not investigate why the GCC 4.9 merged build got better than the others. Seems a bit odd.

    A good improvement to LTO builds would probably be to avoid the completely useless assembler stage at compile time - at the moment we produce LTO bytecode and dump it in ASCII format for the assembler to produce the slim LTO object. GCC has all the code needed to avoid this and write object files directly, as is done during the linking stage. All that needs to be done is to enable it for slim LTO compilation.

    Memory use during compilation

    (Charts: memory use during compilation - GCC 4.9, GCC 4.9 mergedlibs, GCC 4.9 LTO, GCC 4.9 LTO mergedlibs.)
    In the LTO configurations some extra memory (about 2GB) is needed, but that can be tweaked by reducing parallelism, as in the case of the Firefox builds. In my setup up to 180 parallel LTO optimization processes were run, something that is definitely not needed or optimal.

    The largest LibreOffice library is just a fraction of the size of the largest Firefox library, so memory usage does not change as significantly as I reported last time. Obviously linking occupies the time after 2900s, where CPU utilization drops; that can be tuned on the Makefile side. While the LTO builds are not faster than normal builds, I think this milestone can be reached with a little bit of extra tuning.

    LLVM

    LLVM LTO, mergedlibs

    As seen in the second graph, GCC 4.9 needs about 1000 seconds for the final link stage with LTO. (Linking is easily visible as a drop in CPU usage due to lost parallelism.) This is about 50% faster than LLVM's.

    LLVM uses a lot less memory while compiling and its compilation stage is also faster. For GCC 5 I reduced the startup cost of LTO compilation, so hopefully these differences will get smaller.

    GCC 4.8 LTO, mergedlibs

    While GCC 4.9 got faster at link time, it got slower at compile time. Again something I hope to track down in GCC 5.

    Binary size



    LTO with the merged library saves about 30MB from the installation directory. Not bad! FDO saves 14MB. Oddly enough these two are not additive, and the LTO+FDO build is about the same size as the LTO build. Perhaps there are better ways to train LibreOffice than by the default unit testing.

    GCC clearly needs some tuning for code size - LLVM builds smaller binaries with the default setting and bigger ones with -Os. Also observe that the code has grown in size for GCC 4.9. This may be partly my fault - for 4.9 I did not re-tune the inliner heuristics (because I left SUSE, where I had the setup). It is a good idea to re-tune each release - GCC measures function sizes after early optimizations, and as early optimizations get more aggressive each release, the functions get smaller (in GCC's perception) and thus become better inline candidates. With each release it is usually possible to trim down the limits somewhat without losing performance on the benchmarks. Doing so is however a tedious process, because a large body of benchmarking needs to be done (SPEC is not enough) and it usually takes me weeks. The difference is however big enough that there may be some other bug that slipped unnoticed into the release :(.


    This graph breaks down the size of libmergedlo (the largest library). LTO leads to a 17% code size reduction and a 15% reduction of the resulting binary. This is definitely nice. The reduction is 7MB overall. This code size difference is a lot better than the one I reported for Firefox (however smaller than the one reported here). I suspect this is because release builds of Firefox are done with identical code folding and section garbage collection in the linker, which achieves a lot of unreachable code removal. Perhaps something to try with LibreOffice, too.

    FDO works even better, with a 23% overall binary reduction and a 36% code size reduction.

    Finally, LTO+FDO has a slightly bigger code size, while the overall binary size shrinks thanks to reductions in the data segment, relocations and EH.

    LLVM behaves somewhat differently - by default it produces smaller code, and LTO makes the code bigger. Also LLVM's -Os needs tuning - the binary is bigger than GCC's FDO-optimized one and definitely slower.

    Memory footprint after startup




    For this I simply opened LibreOffice writer and measured memory in top. The memory usage corresponds roughly to the libmergedlo size, with the exception of the -Os builds, which consume more memory than the FDO builds. This is expected because FDO enables optimized code layout, requiring a smaller portion of the library to be read. Also -Os is not good for code locality, as it skips a lot of inlining and alignment.

    GCC 4.9 mergedlibs saves 5MB, LTO an additional 2MB and FDO+LTO an additional 6MB. Overall 13MB (18%). Not bad. FDO alone gets close to the FDO+LTO build. I also plan to work on static code layout improvements for GCC, hopefully getting LTO closer to the FDO+LTO scores.

    LLVM with mergedlibs is 3MB smaller than GCC 4.9 with mergedlibs and 3MB bigger than GCC 4.9 with -Os and mergedlibs (again the code size is probably to blame). LLVM with LTO and mergedlibs beats all GCC builds except the one with FDO.

    Runtime

    I am not an expert on LibreOffice benchmarking and I plan to update this section with more reliable data. I did a very simple test of converting the Open Document specification into PDF:
    time program/soffice --headless --convert-to pdf --outdir /tmp ~/OpenDocument-v1.2-os.odt

    Judging from the stdout output, this test consists of about 50% startup time and 50% conversion time. I measured both hot runs (after the first run) and cold runs (with the disk cache flushed).


    (Note that the graph does not start at 0.)

    Overall I found LibreOffice less sensitive to compiler optimization than Firefox, probably because I found no way to measure any interesting internal loops.

    The mergedlibs build shows a small improvement from GCC 4.7 to 4.8. LTO seems to make no difference and FDO leads to just a small improvement. LTO+FDO together seem to do the job, improving the performance by 5%. The graphs also show that -Os is not good for overall performance.

    LLVM seems to be slightly slower than GCC 4.7, while its LTO build is the same as GCC 4.8+ non-LTO.


    I also measured cold startup times. This graph has enough noise that it can only tell us that -Os is not making the startup faster.

    Beyond GCC 4.9

    As usual, when looking at a given problem for the first time, there is a lot of low hanging fruit. Here are a few things implemented for GCC 5.

    Removing write only global/static variables

    I never considered this important - having write-only variables in your program is stupid and no developer would do that, right? Well, it is not the case in large C++ projects. Both Firefox and LibreOffice seem to have quite a few write-only variables (in the case of Firefox they were accessed only in debugging mode) that occupy BSS segment space.  Implementing this simple transformation saved about 12% of the BSS space.

    Reducing alignment of virtual tables

    The x86-64 ABI requires increasing the alignment of all arrays larger than 16 bytes to a 16-byte boundary (as opposed to the natural alignment of the data). This decision was made in the early 2000s in the anticipation that autovectorization would become more important in the future. As an extension, GCC now also aligns every array greater than 32 bytes to a 32-byte boundary and handles structures the same way as arrays.

    This turns out to be extremely wasteful for C++ virtual tables (which are arrays of pointers, but accesses to them will never be vectorized; rather they are a kind of random-access structure) and for RTTI.

    LLVM aligns to 16 bytes in both cases, which seems to cause an over 10% difference in data segment size. I updated GCC 5 to align virtual tables to pointer size only.

    Localizing symbols into comdat groups

    C++ inline functions are often duplicated into every translation unit that uses them. These are later merged by the linker to avoid duplicates in binaries.  What is however not handled by the linker is the case where an inline function accesses another function that is used purely by it.  Such functions may end up duplicated. For this reason I added a simple pass that makes such functions shared, too. This saves an additional 0.5%.
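
    A hypothetical illustration of the pattern (names invented; imagine a header included into many units):
    static int helper (int x) { return x * 2; } // internal linkage: one copy per unit
    inline int api (int x) { return helper (x); } // merged into a single comdat copy
    The linker keeps one copy of api, but without the new pass every translation unit retains its own copy of helper; placing helper into api's comdat group shares it as well.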

    Reducing default data alignment

    Data alignment was bumped up to 16 bytes in the early 2000s to make it possible to safely use SSE accesses. This was done for all data structures above 32 bytes. With the arrival of larger vector types (AVX), this was bumped up further and in turn caused noticeable bloat. As was benchmarked by the Intel guys, this however does not cause any performance improvement - even a decade and a half later, accessing data through vector registers is quite a rare event, not worth optimizing for by default. Moreover the vectorizer now knows how to increase data alignment on demand. For this reason this alignment was trimmed down.

    Code and data placement optimizations

    By default GCC puts little effort into code and data placement. The situation is better with FDO, where code is split into hot and cold sections, and moreover with FDO+LTO the code gets ordered by approximate execution order. I believe that LTO builds ought to start considerably faster and consume less memory than non-LTO builds (but they do not at the moment) and that we should look into code placement optimizations.

    Identical code folding

    Use of templates in C++ tends to produce a lot of identical functions. Martin Liška works on an identical code folding pass in GCC that should be more effective than the linker implementation. I hope it will land in mainline soon.

    Summary

    Both LTO and FDO show good improvements on LibreOffice: a 36% reduction of code size, 23% reduction in the overall binary, 18% in memory use after startup and 5% in my rather lame runtime benchmark. Build times and memory requirements seem to be within reasonable bounds, but some unit testing issues need to be analyzed before this can be declared production ready. Also FDO needs an ugly hack because of the static TLS issues, which definitely need to be fixed. It may be interesting to invest some effort into finding a decent train run for LibreOffice, even though the current unit testing seems to do quite well.

    I hope the situation will get better with GCC 5 and I also hope the LibreOffice build system will be updated to make FDO builds easy. LibreOffice has definitely turned out to be an interesting target to optimize for, and I will try to keep an eye on it during the GCC 5 development cycle.

    GCC LTO alone is a good tool to reduce code size, but runtime performance and memory use do not change much. At least I saw no above-noise changes in the very limited benchmarking I did. With LLVM the situation is the opposite: LTO leads to a small code size increase but also to measurable performance improvements (so the performance gets similar to GCC without LTO).

    A more direct comparison of GCC- and LLVM-produced code would be interesting - LLVM seems to do better on code size at -O2 (by about 3%) and subsequently needs less memory for the application to load. Even if the resulting code runs a bit slower, there may be more stupid and easy-to-fix issues that make GCC code bigger. A quick look at the produced binaries identified a few issues I fixed for GCC 5. We also may want to revisit our tuning tradeoffs, because most of them were set over a decade ago and tested on much smaller benchmarks. While the GCC 5-produced binary is now smaller than the GCC 4.8 one (resolving the regression), it is still bigger than LLVM's.

    LLVM also does traditionally better at compile time (by about 35%) and memory usage. Sadly there is a very noticeable regression in GCC 4.8 (increasing compile time by 11%). I am in the process of analyzing and fixing the compile time issues for GCC 5. On the other hand, GCC gets faster LTO link times when parallelism is enabled (or with debug symbols, which I did not analyze in detail in this post), making it more useful for day to day development.

    I will try to update on GCC 5 performance once I get all trees updated and working again.

    Wednesday, August 20, 2014

    Devirtualization in C++, part 6 - asking user for help

    In the previous posts of the series I described most of the new GCC devirtualization machinery. To cut the story short, GCC implements two kinds of devirtualization: full and speculative.

    Full devirtualization replaces a given polymorphic call by a direct call (that can get inlined later).  Speculative devirtualization is a weaker form, turning a polymorphic call into a conditional testing whether the target is the predicted one, going by the direct call path if it happens to be so and doing the indirect call otherwise.

    While performing speculative devirtualization the compiler can play somewhat unsafe and make assumptions that are not valid in 100% of cases. The main assumption made is that there are no new derived types introduced by code not visible to the compiler. With link time optimization this is usually true, but not necessarily so - new derived types can be introduced by libraries or plugins. While the compiler can generally attempt to track where instances may not be introduced by external code, this analysis is difficult, and it is a question whether it can be implemented with reasonable precision within a statically optimizing compiler.

    The following example shows speculative devirtualization:
    struct A {virtual void foo() {}};
    void test (struct A *a)
    {
      a->foo();
    }
     Here GCC 4.9 produces:
    test(A*):
            movq    (%rdi), %rax
            movq    (%rax), %rax
            cmpq    $a::foo(), %rax
            jne     .L5
            ret
    .L5:
            jmp     *%rax
    Instead of:
    test(A*):
            .cfi_startproc
            movq    (%rdi), %rax
            movq    (%rax), %rax
            jmp     *%rax
    If the target of the polymorphic call happens to be a::foo(), the speculatively devirtualized code is significantly faster, saving the call overhead (and enabling further optimizations).

    As I measured in part 5, speculative devirtualization can capture a significant percentage of calls that full devirtualization can not, and it is correct about its prediction in the vast majority of cases.

    C++ provides four ways that may be used to inform the compiler that it has complete information on derived types:
    1. declaring a type in an anonymous namespace,
    2. declaring a type in local scope (within function body),
    3. using the final specifier on a type, or,
    4. using the final specifier on a method.
    Among these hints the first one is the most powerful. Declaring a type in an anonymous namespace not only ensures that no new derived types will appear in other translation units, but also makes it impossible for code in the other compilation units to access the data stored in a given type. This enables other optimizations, such as changing the memory layout. That is however not implemented yet in GCC.
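
    For example (a sketch):
    namespace {
      // No other translation unit can even name this type, so the compiler
      // sees every possible derivation of A while compiling this unit.
      struct A {virtual void foo () {}};
    }
    void test (A *a)
    {
      a->foo (); // devirtualizable: no visible derived type overrides foo
    }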

    In this post I will focus on the C++11 final specifier and the way the compiler can help the user annotate the code to improve code quality.

    New warnings

    The example given can be "improved" either by annotating struct A as final:
    struct A final {virtual void foo() {}};
    void test (struct A *a)
    {
      a->foo();
    }
    Or by annotating the method foo as final:
    struct A {virtual void foo() final {}};
    void test (struct A *a)
    {
      a->foo();
    }
    Both allow full devirtualization. Compiling either example with GCC 4.9 leads to an empty function test (a result of the full devirtualization and of inlining the empty function A::foo):
    test(A*):
           ret
    For this reason I added two warnings: -Wsuggest-final-types and -Wsuggest-final-methods.

    As the names suggest, the warnings tell users about types or methods where the final specifier would improve code quality. Both warnings are selective and do not buzz about all types with no derived types, nor about all virtual methods with no override. They also do not warn in the cases where the compiler can easily determine the optimization itself (thus they are necessarily sensitive to compiler version and optimization settings, and therefore not good candidates for -Werror).
    $ gcc -O2 -Wsuggest-final-types -Wsuggest-final-methods t.C
    t.C:1:8: warning: Declaring type ‘struct A’ final would enable devirtualization [-Wsuggest-final-types]
     struct A {virtual void foo() {}};
            ^
    t.C:1:24: warning: Declaring method ‘virtual void A::foo()’ final would enable devirtualization [-Wsuggest-final-methods]
     struct A {virtual void foo() {}};
                            ^
    The implementation on top of the existing devirtualization machinery is quite straightforward. (Speculative) devirtualization is implemented by filling in a polymorphic call context for a given call.  This context consists of the outer type, which is the type of the instance the virtual table belongs to, the offset within the outer type where the virtual table pointer is located, and two flags.  The first (maybe_in_construction) specifies that the type may be in construction and thus virtual functions of base types need to be considered, while the second (maybe_derived_type) specifies that the instance actually may be of a type derived from the outer type. In most cases maybe_derived_type is set, because the type is determined from references in the source code.
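
    As a rough sketch (member types invented here; the actual GCC representation differs), the context carries:
    struct polymorphic_call_context {
      const void *outer_type;     // type of the instance the virtual table belongs to
      long offset;                // offset of the virtual table pointer within outer_type
      bool maybe_in_construction; // virtual functions of base types must be considered
      bool maybe_derived_type;    // instance may be of a type derived from outer_type
    };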

    Once filled in, the polymorphic call context is then turned into a list of methods that may be called by a given polymorphic call. This is done by a recursive walk of the type inheritance tree, as described in part 4 of the series. The resulting list is either complete or incomplete. Incomplete lists contain all possible targets within the current compilation unit, but the call may lead to virtual functions defined outside. Complete lists contain all possible targets. Incomplete lists are useful for speculation and profile prediction. Complete lists can drive full devirtualization as well as optimizations propagating information across the callgraph.

    When the new warnings are requested, the speculative devirtualization pass looks for contexts that have maybe_derived_type set and that contain precisely one method but are incomplete. If such a context is found then there are two options:
    1. If the outer type has no derivations, -Wsuggest-final-types will output a warning that the type may be declared final.
    2. If the method has no override, -Wsuggest-final-methods will suggest declaring it final.

    How useful are the warnings?

    The main purpose of the final keyword is not to assist code optimization. It is meant to be syntactic sugar making APIs more robust. Unlike in Java codebases, the use of the final keyword in C++ is not very widespread because it was introduced to C++ only recently. There are large code bases that were developed in the pre-C++11 era, and programmers are not used to the new feature.

    I do not expect users to blindly annotate every type or method suggested. Instead I would advise to first try -Wsuggest-final-types and inspect the individual types found.  Once suitable ones are annotated, -Wsuggest-final-methods can be used to annotate suitable methods: annotating a type final renders final specifiers on its methods pointless.

    The warnings can be used both with normal compilation and with link time optimization.  The second makes them a lot more precise: during link time optimization the compiler has more information about the derivations of a given type. Moreover, more devirtualization and inlining happens in link time optimization builds.

    Prioritizing by importance

    As an experiment I produced warnings from the compilation of Firefox's libxul
    library.  -Wsuggest-final-types identified 1772 types and -Wsuggest-final-methods 7215 methods (because of a non-linearity in GCC's line information lookup it takes about 5 minutes just to output them from the compiler, and it would take ages to inspect all of them by hand!).  It is not a surprise that the Firefox developers were unimpressed and asked me whether I could prioritize the output somehow.

    For this reason I implemented simple accounting in the devirtualization pass and made it output the warnings in priority order, where priority is given by the number of calls that would be fully devirtualized by the annotation, counted either statically or dynamically when profile feedback is available.  The Firefox warnings now look as follows:
    /aux/hubicka/firefox/editor/libeditor/html/nsHTMLEditor.h:66:0: warning: Declaring type ‘struct nsHTMLEditor’ final would enable devirtualization of 575 calls [-Wsuggest-final-types]
     class nsHTMLEditor : public nsPlaintextEditor,
     ^
    /aux/hubicka/firefox/docshell/base/nsDocShell.h:124:0: warning: Declaring type ‘struct nsDocShell’ final would enable devirtualization of 216 calls [-Wsuggest-final-types]
     class nsDocShell : public nsDocLoader,
     ^
    /aux/hubicka/firefox/intl/icu/source/common/unicode/uniset.h:276:0: warning: Declaring type ‘struct UnicodeSet’ final would enable devirtualization of 146 calls [-Wsuggest-final-types]
     class U_COMMON_API UnicodeSet : public UnicodeFilter {
     ^
    ../../dist/include/mozilla/Selection.h:48:0: warning: Declaring type ‘struct Selection’ final would enable devirtualization of 131 calls [-Wsuggest-final-types]
    /aux/hubicka/firefox/intl/icu/source/common/unicode/unistr.h:245:0: warning: Declaring type ‘struct UnicodeString’ final would enable devirtualization of 115 calls [-Wsuggest-final-types]
     class U_COMMON_API UnicodeString : public Replaceable
     ^
    /aux/hubicka/firefox/intl/icu/source/i18n/unicode/decimfmt.h:663:0: warning: Declaring type ‘struct DecimalFormat’ final would enable devirtualization of 92 calls [-Wsuggest-final-types]
     class U_I18N_API DecimalFormat: public NumberFormat {
     ^
    /aux/hubicka/firefox/netwerk/protocol/http/nsHttpChannel.h:40:7: warning: Declaring type ‘struct nsHttpChannel’ final would enable devirtualization of 65 calls [-Wsuggest-final-types]
     class nsHttpChannel : public HttpBaseChannel
     ^
    ...
    
     Followed by method warnings:
    /aux/hubicka/firefox/dom/bindings/CallbackObject.cpp:32:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 148 calls [-Wsuggest-final-methods]
     NS_IMPL_CYCLE_COLLECTING_ADDREF(CallbackObject)
     ^
    /aux/hubicka/firefox/content/media/MediaDecoder.cpp:1440:0: warning: Declaring method ‘GetReentrantMonitor’ final would enable devirtualization of 108 calls [-Wsuggest-final-methods]
     ReentrantMonitor& MediaDecoder::GetReentrantMonitor() {
     ^
    /aux/hubicka/firefox/dom/base/nsGlobalWindow.cpp:9294:0: warning: Declaring method ‘GetExistingListenerManager’ final would enable devirtualization of 103 calls [-Wsuggest-final-methods]
     nsGlobalWindow::GetExistingListenerManager() const
     ^
    /aux/hubicka/firefox/dom/base/nsGlobalWindow.cpp:9281:0: warning: Declaring method ‘GetOrCreateListenerManager’ final would enable devirtualization of 101 calls [-Wsuggest-final-methods]
     nsGlobalWindow::GetOrCreateListenerManager()
     ^
    /aux/hubicka/firefox/dom/bindings/CallbackObject.cpp:33:0: warning: Declaring method ‘Release’ final would enable devirtualization of 76 calls [-Wsuggest-final-methods]
     NS_IMPL_CYCLE_COLLECTING_RELEASE(CallbackObject)
     ^
    /aux/hubicka/firefox/intl/icu/source/common/unistr.cpp:398:0: warning: Declaring virtual destructor of ‘struct UnicodeString’ final would enable devirtualization of 68 calls [-Wsuggest-final-methods]
     UnicodeString::~UnicodeString()
     ^
    /aux/hubicka/firefox/intl/icu/source/common/uvector.cpp:86:0: warning: Declaring virtual destructor of ‘struct UVector’ final would enable devirtualization of 63 calls [-Wsuggest-final-methods]
     UVector::~UVector() {
     ^
    /aux/hubicka/firefox/editor/libeditor/html/nsHTMLEditor.cpp:3189:0: warning: Declaring method ‘DeleteNode’ final would enable devirtualization of 49 calls [-Wsuggest-final-methods]
     nsHTMLEditor::DeleteNode(nsIDOMNode* aNode)
     ^
    
    Overall 9270 polymorphic calls (out of 68162 in the libxul binary) can be devirtualized by declaring types final, while declaring just the top 100 types final would handle 3876 virtual calls.

    13952 calls (23% of all polymorphic calls in the libxul binary) can be devirtualized by declaring methods final. The distribution is however significantly more flat - declaring the first 500 methods final would devirtualize only 4976 calls.  As the quoted warnings suggest, however, many cases arise from macros and template declarations, so there is some hope these can be handled more centrally.

    Firefox experiment

    Will developers follow the warnings? Fortunately I did not need to wait too long to find out :)

    Based on the warnings, which I posted to GCC IRC, Trevor Saunders prepared a patch adding some 145 finals to the Firefox sources and filed PR1047696. I found it interesting to read the reactions of other developers, who wondered why the annotation is a good idea. Trevor found that roughly 3600 indirect calls are saved by his patch - not far from my estimate above. Reportedly the patch is on its way to upstream.

    Using profile feedback

    Can we do better? Well, with profile feedback we can. I implemented this in this patch. Firefox has a profiled build mode where the browser trains itself on rendering various pages and rebuilds with precise execution count information on every basic block.  With this, GCC can prioritize the warnings by execution counts rather than by static counts. Now only 10 types are identified as hot candidates:
    /aux/hubicka/trunk-install/bin/ld: warning: hidden symbol 'pixman_add_triangles' in /aux/hubicka/build-firefox5-inst7-marxin-profile-nolt/toolkit/library/build/../../../gfx/cairo/libpixman/src/pixman-trap.o is referenced by DSO /usr/lib/x86_64-linux-gnu/libcairo.so
    /aux/hubicka/firefox-martin/gecko-dev/docshell/base/nsDocShell.h:125:0: warning: Declaring type ‘struct nsDocShell’ final would enable devirtualization of 576 calls executed 8840 times [-Wsuggest-final-types]
     class nsDocShell : public nsDocLoader,
     ^
    /aux/hubicka/firefox-martin/gecko-dev/content/base/src/nsScriptLoader.h:38:0: warning: Declaring type ‘struct nsScriptLoader’ final would enable devirtualization of 64 calls executed 608 times [-Wsuggest-final-types]
     class nsScriptLoader : public nsIStreamLoaderObserver
     ^
    /aux/hubicka/firefox-martin/gecko-dev/xpcom/base/nsCycleCollector.cpp:2019:7: warning: Declaring type ‘struct GCGraphBuilder’ final would enable devirtualization of 20 calls executed 262 times [-Wsuggest-final-types]
     class GCGraphBuilder : public nsCycleCollectionTraversalCallback,
           ^
    /aux/hubicka/firefox-martin/gecko-dev/layout/base/nsCaret.h:23:0: warning: Declaring type ‘struct nsCaret’ final would enable devirtualization of 130 calls executed 228 times [-Wsuggest-final-types]
     class nsCaret : public nsISelectionListener
     ^
    /aux/hubicka/firefox-martin/gecko-dev/netwerk/protocol/http/nsHttpConnectionInfo.h:32:7: warning: Declaring type ‘struct nsHttpConnectionInfo’ final would enable devirtualization of 2 calls executed 100 times [-Wsuggest-final-types]
     class nsHttpConnectionInfo
           ^
    /aux/hubicka/firefox-martin/gecko-dev/layout/base/../../content/base/src/nsXMLHttpRequest.h:183:0: warning: Declaring type ‘struct nsXMLHttpRequest’ final would enable devirtualization of 124 calls executed 78 times [-Wsuggest-final-types]
     class nsXMLHttpRequest : public nsXHREventTarget,
     ^
    /aux/hubicka/firefox-martin/gecko-dev/parser/html/nsHtml5StreamListener.h:31:0: warning: Declaring type ‘struct nsHtml5StreamListener’ final would enable devirtualization of 8 calls executed 70 times [-Wsuggest-final-types]
     class nsHtml5StreamListener : public nsIStreamListener,
     ^
    /aux/hubicka/firefox-martin/gecko-dev/xpfe/appshell/src/nsWebShellWindow.h:26:0: warning: Declaring type ‘struct nsWebShellWindow’ final would enable devirtualization of 62 calls executed 12 times [-Wsuggest-final-types]
     class nsWebShellWindow : public nsXULWindow,
     ^
    /aux/hubicka/firefox-martin/gecko-dev/dom/base/nsScreen.h:20:0: warning: Declaring type ‘struct nsScreen’ final would enable devirtualization of 16 calls executed 2 times [-Wsuggest-final-types]
     class nsScreen : public mozilla::DOMEventTargetHelper
     ^
    
    and 31 methods, of which only a few are executed over 100 times:
    /aux/hubicka/firefox-martin/gecko-dev/docshell/base/nsDocShell.h:157:0: warning: Declaring method ‘_ZThn424_N10nsDocShell17GetIsBrowserOrAppEPb’ final would enable devirtualization of 4 calls executed 7878 times [-Wsuggest-final-methods]
         NS_DECL_NSIDOCSHELL
     ^
    /aux/hubicka/firefox-martin/gecko-dev/image/src/imgRequestProxy.cpp:93:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 70 calls executed 4924 times [-Wsuggest-final-methods]
     NS_IMPL_ADDREF(imgRequestProxy)
     ^
    /aux/hubicka/firefox-martin/gecko-dev/dom/base/nsGlobalWindow.h:386:0: warning: Declaring method ‘_ZThn32_N14nsGlobalWindow10GetRealTopEPP12nsIDOMWindow’ final would enable devirtualization of 6 calls executed 1590 times [-Wsuggest-final-methods]
       NS_DECL_NSIDOMWINDOW
     ^
    /aux/hubicka/firefox-martin/gecko-dev/layout/base/nsPresContext.cpp:332:0: warning: Declaring method ‘Release’ final would enable devirtualization of 302 calls executed 1112 times [-Wsuggest-final-methods]
     NS_IMPL_CYCLE_COLLECTING_RELEASE_WITH_LAST_RELEASE(nsPresContext, LastRelease())
     ^
    /aux/hubicka/firefox-martin/gecko-dev/dom/base/nsGlobalWindow.h:386:0: warning: Declaring method ‘_ZThn32_N14nsGlobalWindow13GetRealParentEPP12nsIDOMWindow’ final would enable devirtualization of 8 calls executed 826 times [-Wsuggest-final-methods]
       NS_DECL_NSIDOMWINDOW
     ^
    /aux/hubicka/firefox-martin/gecko-dev/dom/bindings/CallbackObject.cpp:33:0: warning: Declaring method ‘Release’ final would enable devirtualization of 3262 calls executed 670 times [-Wsuggest-final-methods]
     NS_IMPL_CYCLE_COLLECTING_RELEASE(CallbackObject)
     ^
    /aux/hubicka/firefox-martin/gecko-dev/content/base/src/nsScriptLoader.cpp:179:0: warning: Declaring method ‘AddRef’ final would enable devirtualization of 14 calls executed 608 times [-Wsuggest-final-methods]
     NS_IMPL_ISUPPORTS(nsScriptLoader, nsIStreamLoaderObserver)
     ^
    
    The execution counts are actually surprisingly small. Partly because developers still tend to keep polymorphic calls out of the hot paths, and partly because the train run is actually not very computationally intensive. I would be very interested to know whether the benefits can be measured in some practical benchmarks.

    Next time I plan to write on determining the dynamic types of heap allocated memory.

    Monday, April 21, 2014

    Linktime optimization in GCC, part 2 - Firefox

    This time I will write about building Firefox with LTO. Firefox is one of the largest binaries on my hard drive (beaten only by Chromium), so it is an excellent stress test. Taras Glek and I started to work on getting Firefox to build well in 2009 (see the original paper from 2010); however, for all those years I was mostly chasing correctness and compile time/memory usage issues. Markus Trippelsdorf was instrumental in keeping the project going. In parallel Rafael Espindola made the same effort of getting Firefox to work with LLVM and LTO (Update: see also his recent slides), which got into useful shape about a year later.

    Update: reports in GCC bugzilla:
    just a short Update: Firefox since version 30 as well as Thunderbird since version 31 both compile fine with LTO enabled without the need of any additional patches. The package size was reduced by 51% (firefox ~420MB -> ~207MB) and 59% (thunderbird ~480MB -> ~200MB). Both programs work as intended, no crashes or unexpected behaviour so far.
    Update: I solved the LLVM LTO crash by commenting out all SSE3 code from SkBitmapProcState_opts_SSSE3.cpp. It seems to be a reincarnation of Firefox's PR759683 and LLVM's PR13000. I also added GCC 4.8 runtime and code size tests.

    Update: With help from Firefox developers I also measured the tp5o benchmark. It shows rendering times of popular pages from 2011.

    Looking into Firefox led me to understand that the needs of large applications are different from what one sees in standard benchmarks (SPEC2006). It motivated several changes in GCC, including making -Os cope well with C++, adding code layout logic, optimizations to better handle static construction and destruction, and improvements in dealing with PIC interposition rules.

    It may be surprising, but despite my work in the area, this is the first time I have actually run some real benchmarks (except for the Dromaeo runs I did two weeks ago). Benchmarking large applications is not easy and I never had time to play with the setup. As I learned recently, Firefox actually has a pretty thorough benchmarking suite called talos.

    I ran quite a few benchmarks and builds, but it would be interesting to do more. Due to unforeseen changes in my plans this weekend, I however decided to publish what I have now and follow up on individual topics.

    This post should also be a useful guide for those who want to try other bigger projects with LTO. Firefox contains a lot of third-party libraries and has a very diverse code base. The problems hit here are probably a good sample of the problems you will hit on other bigger applications.

    For LTO history, see part 1 of this series.

    Update: See also Martin Liška's post on building Gentoo with LTO.

    Prerequisites

    I will not duplicate the build instructions from the Mozilla site. However, for a successful LTO build one needs the following additional things:
    1. Binutils with plugin enabled. This can be checked by
      $ ld --help | grep plugin
        -plugin PLUGIN              Load named plugin
        -plugin-opt ARG             Send arg to last-loaded plugin
      I personally use gold from GNU Binutils 2.24.51.20140405, but any reasonably recent binutils with either GNU ld or gold should do the job. One however needs to enable plugin support while configuring binutils, since it is off by default (see the sketch after this list).
    2. GCC 4.7+ that is configured with plugin support (i.e. built with the plugin-enabled linker). For sane memory use, GCC 4.9 is recommended; it requires about 5 times less RAM.

      To test if GCC runs with plugin support, try to link simple hello world application and do:
      $ gcc -O2 t.c -flto --verbose 2>&1 | grep collect2.*plugin
      If the output is non-empty, you have the plugin set up correctly.
    3. Clang needs to be built with the LLVM gold plugin enabled, too. Follow the official instructions.
    4. ar/nm/ranlib wrappers on the path that dispatch to the real ar/nm/ranlib with the plugin argument.

      This is a somewhat annoying issue. GCC ships with gcc-ar/gcc-nm/gcc-ranlib, but it is not very easy to convince the build scripts to use them. Adding a simple script calling the gcc-* equivalent to the path will lead to infinite recursion. I use the following hack:
      $ cat ar
      #! /bin/sh
      # take the ar subcommand (e.g. cru) first, then pass the plugin and the rest
      command=$1
      shift
      /aux/hubicka/binutils-install/bin/ar $command --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so "$@"

      $ cat nm
      #! /bin/sh
      /aux/hubicka/binutils-install/bin/nm --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so "$@"

      $ cat ranlib
      #! /bin/sh
      /aux/hubicka/binutils-install/bin/ranlib --plugin /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so "$@"
      In the LLVM case I use the same wrappers, just the plugin path is /aux/hubicka/llvm-install/lib/LLVMgold.so.

      Alternatively the plugin can be installed into the default plugin search path (/usr/lib/bfd-plugins). This solution however has some limitations: one needs not-yet-released binutils to make this work with GCC, and one cannot easily install multiple plugins. I hope this situation will change in the not so distant future.

      Forgetting these steps will give you undefined symbols as soon as you start to link a static library built with -flto and GCC 4.9+, or with -flto -fno-fat-lto-objects and earlier compilers.
    5. Disk space for the checkout, about 3GB
    6. Disk space for build directory: 0.9GB for normal build (3GB with debug info), 1.3GB for LTO build with GCC (1.6GB with debug info) and 0.72GB for LTO build with LLVM (2.2GB with debug info).
    7. At least 4GB of RAM for GCC LTO (if your machine has only 4GB, some extra swap is needed to page out the linker while the LTO optimizer is working) and 8GB for LLVM LTO.
    8. At least 3GB of /tmp space for GCC's WHOPR streaming. If you use tmpfs, you may want to redirect GCC's temporary files to a different directory by setting TMPDIR.
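    For illustration, here is a minimal sketch of the two plugin setups mentioned above; the configure flag is the standard binutils one, while the plugin path is from my installation and will differ on your machine:
      $ # build binutils with plugin support, which is off by default
      $ ./configure --enable-plugins && make && make install
      $ # alternatively, drop GCC's LTO plugin into the default search path
      $ mkdir -p /usr/lib/bfd-plugins
      $ cp /aux/hubicka/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/liblto_plugin.so /usr/lib/bfd-plugins/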

    Be sure you actually use LTO and have the plugin configured right

    One of the common mistakes is to use GCC's -flto without the linker plugin. This has three important consequences.
    1. GCC will always build fat LTO files and your builds will take twice as long as needed.
    2. GCC will not have visibility information from the linker and will not optimize very well across symbols exported from the binary. You may change this with -fwhole-program, but this solution is hard to use and still has some negative impact on code quality. It is a lot better to use the linker plugin, plus -fvisibility=hidden if you build a PIC library.
    3. If you produce a static library of LTO files and you link it without the plugin, the link-time optimization on those will not happen; you will get the same code as without -flto. I believe many users who report no benefits and/or the ability to build very complex apps actually fall into this trap.
    I hope binutils will be patched to diagnose these cases more gracefully.
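    A cheap sanity check, if you are unsure whether LTO is really happening, is to look for LTO bytecode sections in the objects going into the link (a sketch; foo.o stands for any object file from your build):
      $ readelf -S foo.o | grep gnu.lto
    Slim LTO objects consist almost entirely of .gnu.lto_* sections, while fat objects also contain regular code sections, which is what the linker silently falls back to when the plugin is missing.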

      Be sure you get link-time optimization flags right.

      Both GCC and LLVM perform some optimizations at compile time and some optimizations at link time. Ideally the compiler should pass the flags down from compile time to link time, but GCC (and I believe LLVM, too) is not yet fully reorganized to support per-function flags (I plan to improve this for the next release).

      Worse yet, GCC does some magic, like trying to get the -fPIC flag right, that does not really do what it should, but hides some of the easy-to-detect mistakes. Richard Biener improved it a bit for 4.9, but the solution is still not very good.

      It is important to pass the proper optimization flags to the final compiler invocation that is used to link the binary. Often this can be done by adding your optimization flags to LDFLAGS. Failing to do so will probably result in a poorly optimized binary.
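      A minimal sketch of what this means for a hand-driven build (the file names are placeholders):
        $ gcc -O3 -flto -c a.c
        $ gcc -O3 -flto -c b.c
        $ gcc -O3 -flto -o app a.o b.o
      The flags on the last line are the ones seen by the link-time optimizer, so they must repeat the optimization options used at compile time.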

        Compilers

        I used unpatched GCC 4.9 release candidate 1. I configured my compiler with:
        --prefix=/aux/hubicka/49-install --disable-plugin --enable-checking=release --disable-multilib --with-build-config=bootstrap-lto
        And built with make profiledbootstrap. If you want to use LTO and FDO together, you may want to get this patch for PR47233 that did not make it into 4.9.0 and will only land in 4.9.1. It reduces LTO+FDO memory use by about 4GB.

        I also used a relatively fresh LLVM checkout, r206678 (Apr 19). Eric Christopher recommended that I try a recent LLVM, since the memory usage with LTO was reduced. I also briefly tried 3.4, which seemed to behave similarly, modulo needing more memory. The wrong-code issues I mention here, as well as the crash on LTO without -march=native, happen in both cases. I configured with
        --enable-optimized --disable-assertions --prefix=/aux/hubicka/llvm-install 
        and rebuilt LLVM with itself. My understanding is that this should give me the fastest binary. Release builds seem to be done with assertions enabled, which seems to cause noticeable slowdowns.

        Into the paths of both compilers I installed the shell wrappers for ar/nm/ranlib as mentioned above.

        Firefox changes needed

        The actual changes to Firefox are tracked by our tracking PR45375. Here are the changes that actually landed upstream; I write about them mostly to give an idea what source code changes are needed:
        1. Patches to mark symbols used by top-level asm statements with the __used__ attribute. This change is needed because, unlike statement-level asm, the toplevel asm statements have no way of annotating the symbols they use. For this reason the symbols referenced by them are not visible to the linker plugin and to GCC before the final code is generated. If you do not use an explicit __used__ attribute, GCC will freely optimize out the symbols, leading to link errors (see the sketch after this list).

          This solution somewhat sucks. LLVM has a built-in ASM parser that makes this transparent in some cases, but not in others.

          The patch is linked from Firefox's PR826481
        2. Patch to libvpx to avoid grepping bytecode assembly files.

          The problem here is that libvpx produces an assembly file and then greps it for specific code. With -flto the assembly file consists of compressed LTO bytecode and thus the grepping fails. One needs to pass -fno-lto to the specific unit.

          More in Firefox's PR763495
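          To illustrate the first item, here is a minimal C sketch; my_func and the asm alias are made up for the example:
            /* my_func is referenced only from the toplevel asm statement below.
               The compiler cannot see that reference, so without the attribute
               LTO removes the function and the link fails on an undefined symbol.  */
            __attribute__ ((__used__))
            static int my_func (int x)
            {
              return x + 1;
            }

            /* Toplevel asm statement exporting an alias to my_func.  */
            asm (".globl my_func_alias\n\t.set my_func_alias, my_func");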
        Patches not pushed upstream
        1. A hack to prevent the configure script from being confused by LTO and disabling support for visibilities. It is linked here. Without this patch one gets a working Firefox, but the code quality is lousy.

          Confusion of configure scripts is unfortunately a common problem for LTO. Many of the scripts were written with the traditional compilation model in mind.
        Martin Liška and Markus Trippelsdorf put together the bare bones of a GCC LTO FAQ.

          Firefox configuration

          My mozconfig is as follows:
          ac_add_options --enable-application=browser
          ac_add_options --enable-update-channel=${MOZ_UPDATE_CHANNEL}

          ac_add_options --enable-update-packaging
          ac_add_options --disable-debug-symbols
          ac_add_options --disable-optimize
          ac_add_options --disable-tests
          export MOZ_TELEMETRY_REPORTING=1
          export MOZILLA_OFFICIAL=1
          mk_add_options MOZ_MAKE_FLAGS="-j16"
          export MOZ_PACKAGE_JSSHELL=1
          MYFLAGS="-O3 -march=native -flto"
          export CFLAGS=$MYFLAGS
          export CXXFLAGS=$MYFLAGS    
          export LDFLAGS="-O3 $MYFLAGS"              
          mk_add_options MOZ_OBJDIR=/aux/hubicka/final2-gcc-firefox-49-lto-O3-native-debug
          I use --disable-optimize to be able to consistently set flags for all files via MYFLAGS. For the LLVM LTO build one in some cases also needs --disable-elfhack (or one gets segfaults). I will briefly discuss the effect of enabling/disabling debug symbols in the next section.

          LLVM LTO builds crash in instruction selection. I suppose it is because some units are built with a custom -march flag to use SSE intrinsics, but this information gets lost with LTO. To get the binary built I link with -march=native. I am in the process of looking for a testcase to report the issue. The resulting binary does not work for me, so all benchmarked binaries use generic tuning.

          To get a build with profile feedback, one additionally needs to enable tests.

          Getting parallelism right

          If you want a parallel build with LTO, it is better to parallelize the link-time optimization, too. This is currently supported by GCC only. In GCC you can pass the parallelism as a parameter of -flto; I use -flto=16, which gets me 16 processes. There is no need to pass higher values. Parallelism increases memory use, so you may need to use less parallelism than your CPU count if you are running into swap storms.

          Another important thing to consider is the interaction with make's parallelism. If you build with -j16, make may execute 16 linking commands in parallel, each of which may execute 16 worker processes, killing your machine with 256 parallel compilations. For Firefox this is not an issue, as there is only one dominating binary; for other projects you may want to be cautious.

          GCC supports -flto=jobserver (implemented by Andi Kleen), which lets make drive the whole process. There are a few downsides to this:
          1. You need to adjust your Makefile to add + in front of the rule executing the link command (see the sketch below), otherwise GNU make will not pass down the necessary jobserver information. A lot of users seem not to read the docs far enough to notice this trap and get serial links.
          2. GCC 4.9 newly parallelizes the streaming stage and, because I do not know how to communicate with Make's jobserver, this stage won't get parallelized, yet. GNU Make's jobserver seems easy to deal with and patches are welcome :)
          3. If you cut&paste the command from Make's output, the linking will be serial because the jobserver is not running.
          We will need some cooperation with the GNU make maintainers to get these problems hammered out. In general I think -flto should default to an "auto" mode that detects the number of CPUs available unless it knows about Make's jobserver, and Make's jobserver should be available without the extra Makefile change.
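          A minimal Makefile sketch of the + rule mentioned above (app and the object list are placeholders):
            OBJS = a.o b.o
            app: $(OBJS)
            	+gcc -O2 -flto=jobserver -o $@ $(OBJS)
          The + prefix tells GNU make to treat the recipe as a recursive make invocation, so the jobserver file descriptors stay available to GCC's worker processes.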

          The parallelism is achieved by breaking the program into a fixed set of partitions after the inter-procedural analysis is completed. The number of partitions is not controlled by -flto; it is 32 by default. If you want higher parallelism or a smaller memory footprint, you may increase the number of partitions with --param lto-partitions=n. This will affect the resulting binary, but it should not decrease code quality by any considerable fraction. The number is fixed to ensure that builds on different machines get precisely the same output.
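          For example, to keep 16 worker processes but split the program into 64 smaller (and thus less memory-hungry) partitions, the final link would look something like this sketch, with the objects elided:
            $ gcc -O3 -flto=16 --param lto-partitions=64 -o app *.o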

            Building

            To build Firefox normally, one uses
            make -f client.mk 
            For the profiled build one needs an X server running (I use tightvnc) and uses
            make -f client.mk profiledbuild
            If X is not running, the build passes but the resulting binary is crap, because all it is optimized for is outputting the error message about missing X.

            My build machine

            My machine (courtesy of IUUK) has plenty of RAM (64GB) and an AMD Opteron(TM) 6272 processor with 16 threads running at 2100MHz. I use tmpfs for the /tmp directory and run a modified Debian Wheezy.

            Compile time

            Wall time


            One observation is that GCC's LTO no longer comes with extreme compile time costs. The slim LTO files avoid double optimization and the build gets parallelized pretty well. The extra wall time caused by -flto=16 is comparable to switching from -O1 to -O2.

            LLVM without LTO clearly builds faster. Its -O3 compilation level is about as fast as -O0 in GCC, while GCC's -O1 is just a bit slower and the other optimization levels are significantly slower. With LTO, LLVM builds a lot slower due to the lack of any parallelism in the link stage. Also, LLVM's optimization queue is organized in a way where virtually all optimizations are run twice: once at compile time and then again at link time. While this may account for the extra cost, I think it is not too bad given the speed and relative simplicity of LLVM's optimization queue. It is more problematic in that optimizations done at compile time may inhibit optimizations at link time, and I expect this will be revisited in the future.

            Will debug info kill my builds?

            Another interesting topic is debug info. While GCC basically carries all the expenses of debug info even with -g0, LLVM doesn't. LLVM's LTO memory usage increases many times with -g, and the build time with -O3 -flto -march=native -g is 37m46s (2266s), about 46% slower. GCC needs 15m20s (an extra 11%). I left these out of the graph to avoid rescaling.

            For non-LTO builds, GCC -O3 -g compile time is 12m41s (761s), about 18% slower than -g0; LLVM -O3 -g compile time is 9m35s (575s), about 13% slower.

            Previous releases

            To put this into further perspective, GCC 4.7 needs 35m1s (2101s) to complete the job with -O3 -flto=16 -g, and GCC 4.8 needs 30m10s (1810s). A Firefox built with GCC 4.7 and LTO crashes on startup for me (earlier checkouts worked). I have not investigated the problem, yet.

            CPU time

            Parallelism plays an important role: while compilations can easily be done in parallel via make -j16, linking often gets stuck on one binary. Here is how system+user time compares on the individual compilations:


            Here it is visible that LTO compilation does a bit more work for both compilers, but WHOPR is able to keep the overall build time in bounds.

            Compiler's memory use

            Here I present some data collected from vmstat and plotted by a script by Martin Liška (who originally got the idea from Andi Kleen). The red line in the CPU utilization graph shows the case where only one CPU is used.

            GCC 4.9

            GCC memory/cpu use at -O0


            GCC memory/cpu use at -O3


            GCC memory/cpu use at -O3 -g

            In the graphs one can actually see that a relatively long time (100s) is spent before the actual compilation begins. It is make parsing through the Makefiles and executing a bunch of python scripts. The same happens from 600s to 700s, where this is repeated for a different make target. The little peak at the end, around 700s, is gold linking the final libxul.so library. We will soon see this little peak grow up :)
            GCC memory/cpu use at -O3 -flto
            Here the memory usage during the actual compilation is lowered to about 4GB and the compilations are finished in about 500s (faster than with -O3 in the previous graph).

            The Mount Everest climb in the memory usage graph starting just after 500s is the LTO optimization of libxul.so, the largest of Firefox's libraries. One can see the serial link stage running from 500s to 600s:
            1. WPA (whole program analysis) type merging: The steeper climb up to 550s is the compiler reading summaries of all the compilation units and merging types and symbol tables.
            2. IPA optimization: The slow walk up is inter-procedural optimization, dominated by inliner decisions.
            3. WPA streaming: Just after 600s the serial stage is finished and the steep peak is the parallel streaming and shipping of new object files to compilation. You can see the CPU usage go up, too.
            4. Local transformation: Shortly after the steep valley the actual compilation starts, executed in parallel for almost another 200s.
            While the memory usage in my graph peaks at 16GB, you can see it all happens in the parallel stage. If you link Firefox on an 8GB machine, you can either increase --param lto-partitions or decrease the -flto parallelism. Both will make the Mount Everest sink.

            The little hill just before the Mount Everest (around 500s) has about the same shape for a good reason: it is the linking of the javascript interpreter shell, the second largest binary in Firefox. It is a good example of how a bit of build machinery reorganization would probably help to hide most of the LTO build time expenses: all one would need is to overlap the javascript shell link with the libxul.so link, something that is not needed in the normal build, since the link time of the javascript shell is almost immediate.

            An unfortunate debug info surprise

            GCC -O3 -flto=16 -g
            This graph was supposed to show that debug info is almost free. This is true up to about 700s; then it shows a quite large increase in peak memory use (16GB to 25GB) with debugging enabled in the local transformation stage. Sadly this is a new problem with the recent Firefox checkout that I had not noticed until now. I am looking into a solution.

            What if I have 8GB or 4GB of RAM?

            The following are the worst case (debug info + -O3) builds with the parallelism reduced to 8 and 4. I regularly build Firefox on my notebook with 8GB of RAM and 4 threads. On a 4GB machine the build is possible, but one would need to either go to -flto=1 or play carefully with the partitioning. Again, Firefox is quite a monster.
            GCC -O3 -flto=8 -g

            GCC -O3 -flto=4 -g
            GCC -O3 -flto=16 --param lto-partitions=16 -g

            Older GCC releases


            GCC 4.7 with -O3 -flto=16 -g
            To get a bit of perspective, this shows the improvements since GCC 4.7. Fat object files account for the compilation stage increasing to almost 1000s. The type merging is the foothill of the mountain, followed by the slow inliner decisions and streaming (the shaky part of the graph, around 1500s). The WPA stage needs 8GB instead of the 4GB in GCC 4.9.

            Again this graph shows the new problem with debug info on the recent Firefox checkout; again, it is fixable by reducing parallelism.

            GCC 4.8 -O3 -flto=16 -g
            While GCC 4.8 links almost twice as fast as 4.7, the memory usage is still bad (the serial stage has about the same memory use and the local compilation needs 5GB more).

            LLVM

            LLVM memory/CPU use at -O0

            LLVM memory/CPU use at -O3
            LLVM memory/CPU use at -O3 -flto -march=native
            LLVM is done with the compilation in about 350 seconds, which is quite surprisingly similar to GCC's time. I would expect clang to beat GCC's C++ front end hands down, but apparently a lot of the compile time difference nowadays is in the back end and in the integrated assembler. Or perhaps it is because more optimizations are performed at LLVM's compile time. It is then followed by the serial link stage.

            Without LTO, LLVM uses about 1/3 to 1/4 of the memory, while with LTO it needs about twice as much (given that the link stage is not parallel at all). This is only the case for compilation with debug symbols disabled.
            LLVM with -O3 -g

            LLVM with -O3 -flto -g
            This graph shows the issues with debug info memory use: LLVM goes up to 35GB. LLVM developers are also working on debug info merging improvements (the equivalent of GCC's type merging) and the situation has improved over the last two releases into the current shape. Older LLVM checkouts happily ran out of 60GB memory & 60GB swap on my machine.

            File Size

            I generated the file size comparison by splitting the binary into text, relocations, EH and data, where data also accounts for the other small sections. I made the EH data appear gray, since most of the time they do not need to be loaded into memory and thus their size is less critical than the size of the other sections.



            One thing that LTO is good for is code size optimization. This is not as clearly visible on libxul.so, because the build uses gold to do some of the dead code removal, but it will become clearer in other applications.

            GCC's LTO at -Os brings about a 5% reduction of both the code and data segments (hover over the graph to get precise numbers) and a 20% reduction of the EH data. For -O2/-O3 the data segment reduction stays, but the code is expanded by inlining. This is just the out-of-the-box configuration; the inliner is by default not really well behaved on large applications. Some of this I have already fixed for the next GCC release here. For now I recommend reducing the inliner expansion with --param inline-unit-growth=5 (see the sketch below). This brings binaries close to non-LTO -Os, but without the runtime expenses. I wanted to make this the default for GCC 4.9 but did not have a chance to complete the necessary benchmarking, so it is something I want to address this development cycle (in early tests done in mid 2013 there were issues with the SPEC2k6 xalancbmk benchmark). I will follow up on this and show some tests.
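            In the mozconfig above, that recommendation would amount to something like the following sketch (not the configuration benchmarked here):
              MYFLAGS="-O2 -flto=16 --param inline-unit-growth=5"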

            UPDATE: I rebuilt Firefox without -ffunction-sections -fdata-sections and gold's section removal + identical code merging (which require the sections). The overall binary reduction with the default configuration is 43% (in the LTO build).
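            For reference, a sketch of that flag combination when it is enabled (gold is needed for the identical code folding):
              MYFLAGS="-Os -flto=16 -ffunction-sections -fdata-sections"
              LDFLAGS="$MYFLAGS -Wl,--gc-sections -Wl,--icf=all"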

            For LLVM, LTO usually expands the binary, including the -Os compilation; at -O3 the expansion is about 30%. I believe it is mostly caused by the inliner seeing too many cross-module inlining opportunities, too. LLVM has no global growth limit on the inliner. What happens with LTO is that almost all functions become inlinable, and a simple-minded inliner is just too happy and inlines them all. I believe this is partly reduced by the fact that LLVM does code-expanding inlining at compile time, which consequently disables some of the cross-module inlining at link time. I did not double-check this theory, yet.

            The data segment is reduced by about 3% and it is smaller than the data segment of GCC-produced binaries by about 6%. I tracked down part of the problem to GCC aligning virtual tables as if they were arrays, for faster access by vector instructions. This is quite pointless and I will make a patch to avoid it. Another problem is the lack of an optimization pass to remove write-only variables; I made a patch for the next release cycle. The last difference seems to be LLVM being a bit more aggressive about optimizing out C++ type infos that GCC believes are forced to stay by the C++ ABI. I looked into this and I lean towards the conclusion that it is an LLVM bug, but I am not an expert here and further analysis is needed. Some of the symbols have appeared forced in a recent LLVM checkout.

            One thing to observe is that GCC seems to produce a considerably smaller code segment with -Os than LLVM (by 14%). As will be shown later, LLVM's -Os code is however faster than GCC's -Os; it is the nature of the switch to trade all performance for code size, though. The -O2 code segment is smaller in LLVM (by 5%). I have not looked into this in detail yet, but it is my long-time feeling that -O2 may need a bit of revisiting from the large application developer's point of view. A lot of code generation decisions are tuned based on benchmarks of medium-sized apps (SPEC2k6), which makes code-expanding optimizations look cool. I did a bit of work on this for GCC 4.9 and its generic tuning, but more needs to be done. The size difference further increases at -O3, which I consider OK, since -O3 is about producing a fast binary at any cost. As shown in the benchmarks, -O3 in GCC does bring benefits, while for LLVM it seems to be close to -O2 performance.

            GCC also produces bigger EH tables (by 27%) at -O2. Partly this is a conscious change in GCC 4.9, where push/pop instructions are used to pass on-stack arguments. This produces shorter code and works well with modern CPUs having stack engines, but it bloats the unwind info. LLVM also seems to have a bug where it optimizes away the EH region in:
            #include <stdio.h>
            void
            t()
            {
            }

            void
            q()
            {
              try {t();} catch(int) {printf ("catch\n");}
            }
            Even with -fPIC -O2. This is not valid, because t can be interposed by the dynamic linker with a different implementation that does throw. I filed a PR for this and will double check whether those two are the sole reasons for the difference.

            Runtime benchmarks

            I used talos to collect some runtime performance data. Sadly I do not have LLVM LTO data, because the binary crashes (and moreover it is a native build rather than a generic one). I plan to follow up on this and do the comparison once I hammer out the problems.

            Startup time tests

            The first benchmark is ts_paint, which starts the browser and waits for the window to be displayed (in my understanding). I also tried a benchmark called ts, but it does not produce results.

            Before each talos run I flushed the caches. Talos runs the benchmark several times, so it gave me one cold startup run and several hot startups (I hope). The graph below shows the cold startup and the average of the hot startups. The cold startup has more noise in it, because it is just one run, and I suppose it also depends on how well the binary landed on the hdd.

            I am not entirely happy with this benchmark and will write more later. The main thing is that Firefox uses an madvise call to fully load every library, disabling the kernel's demand paging. Martin Liška implemented a function reordering pass that, with profile feedback, should allow loading just a small portion of the binary. I will follow up on this later. See also Taras's blog post on this topic.

            The startup tests seem to be insensitive to LTO, except for the case when profile feedback is used (FDO saves 7%, LTO an additional 4%). Cold startup shows some sensitivity to file size, but apparently my HDD is fast.

            Opening window


            Opening a window is a nice test to show compiler flags, since it should fit in the cache. Here LTO wins by 5% at -O2, 2.3% at -O3 and 9.8% with FDO (which by itself brings 6.4%). This is one of two (Update: three) benchmarks where LLVM performs better than GCC -O2; I am not sure whether it is within the noise factor.

            Scrolling


            Scrolling has similar properties to window opening: just a bunch of C++ code that fits in RAM and thus is sensitive to compiler optimization.

            SVG rendering




            Here I used the tsvgx test (which is said to be better than tsvg) and, since it tests multiple images, I took the geometric average.


            Canvas mark


            Again a geometric average; I have not investigated yet why some runs failed. I will update on it.

              Dromaeo CSS


              A geometric average of individual benchmarks.

              Dromaeo DOM


              This is the only benchmark where LLVM is not approximately in the GCC -O1 to -O2 range; in fact LLVM -O2 outperforms GCC -O2 by 7%. I profiled it and again the difference was caused by GCC not inlining due to the PIC interposition rules, so it should disappear if the problem is fixed on the LLVM side.

              Allowing the inline by adding an explicit inline keyword makes GCC -O3 match the score of GCC -O3 -flto=16, where the inline then happens based on linker plugin feedback. Again something I plan to follow up on. There is a bit more going on that I need to investigate more deeply: for GCC, the critical inline happens only at -O3 -flto, while LLVM gets it right at -O2 despite the fact that its -O2 binary is smaller than GCC's. The inlining decisions are significantly different, making this a bit harder to track down.

              Update: By mistake, I originally reported a large improvement for -flto -O3 with LLVM. This was actually caused by the fact that I pulled some changes into the Firefox tree while looking for a workaround for the compiler crash; one of them was a patch improving DOM performance. I reran the tests on the original tree.

              Update: Loading common pages


              This seems to be the only test showing that -O3 is not always a win, due to code size bloat.

              So what can we improve?

              LTO in GCC still has quite a lot of room for improvement. I have patches for the next release to improve the inliner's behaviour and reduce code size considerably without giving up speed. There is also still low hanging fruit in improving memory usage and compile time. Re-tuning the back end and the applications themselves for the new environment will need more work.

              Compared to LLVM, it seems we should try to reduce the amount of information stored in LTO object files, which is almost twice what LLVM streams. Part of the extra information is used for optimization and debug info, but the streaming was never very carefully analyzed for what really is important where: it is more or less a pickled in-memory representation. There is an ongoing effort to clean up GCC and modernize the source base that will hopefully enable some deeper cleanups, like simplifying the type representation, which will in turn improve the memory footprint. Clearly, speeding up the C++ front end and putting our data structures on a diet can do wonders. (I would be curious to see clang retargeted to the GCC back end.)

              I also have quite a few plans for profile feedback. I plan to reorganize the instrumentation so that with LTO one will not need to recompile everything twice, only relink, and hopefully AutoFDO and LIPO will be merged in.

              Next

              I hope you found the tests informative; I will try to follow up on the unresolved questions in the near future. I also plan to write a post on LibreOffice, and for that I would welcome suggestions on the benchmarking procedure.