Building openSUSE with link-time optimizations



slide-1
SLIDE 1

Jan Hubička and Martin Liška

SUSElabs jh@suse.cz, mliska@suse.cz

Building openSUSE with link-time optimizations

slide-2
SLIDE 2

Outline

  • What is link-time optimization?
  • Link-time optimization and GCC
  • Benchmarks
  • Can we build openSUSE with link-time optimization by default?

slide-3
SLIDE 3

What is link-time optimization?
slide-4
SLIDE 4

Per-file compilation

File1.c, File2.c, File3.c, File4.c → GCC (run once per file) → File1.o, File2.o, File3.o, File4.o → ld / gold → a.out

slide-5
SLIDE 5

Link-time optimization

File1.c, File2.c, File3.c, File4.c → GCC (run once per file) → File1.o, File2.o, File3.o, File4.o (each also carrying IL) → ld / gold with LTO plug-in, which invokes the link-time compiler → a.out

slide-6
SLIDE 6

Benefits of LTO

  • Symbol promotion (from the linker’s resolution data, most symbols become “static”)

  • Cross-module inlining, constant propagation
  • Aggressive unreachable code removal
  • Profile propagation
  • EH optimization (propagation of “nothrow”)
  • Identical code folding
  • Optimized code layout
  • and more :)
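
As a minimal sketch of cross-module inlining, constant propagation and identical code folding, the listing below shows two translation units in one file so that it compiles on its own. The file names and the functions `scale`, `triple` and `compute` are hypothetical and do not come from the slides.

```c
/* ---- util.c (hypothetical first translation unit) ---- */

/* Externally visible, so a per-file compile must keep an out-of-line copy. */
int scale(int x)
{
    return 3 * x;
}

/* Same body as scale(): a candidate for identical code folding. */
int triple(int x)
{
    return 3 * x;
}

/* ---- app.c (hypothetical second translation unit) ---- */

int scale(int x);   /* in a real build these come from a shared header */
int triple(int x);

int compute(void)
{
    /* Compiled per file, both calls stay real calls into util.o.
       With -flto the link-time compiler sees both bodies, so it may
       inline them and propagate the constant argument. */
    return scale(14) + triple(14);
}
```

To try it, split the listing at the markers, add a `main` that calls `compute`, and pass -flto at both the compile and the link step. If nothing outside the program references `scale` and `triple`, the linker's resolution data also lets the compiler treat them as static.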
slide-7
SLIDE 7

Problems of LTO

  • Whole toolchain has to be restructured
  • Slow compile-edit cycle
  • Harder bug reports (often the whole program is needed to reproduce an issue)
  • Not 100% transparent to the user, but in most cases all one needs to do is add -flto

slide-8
SLIDE 8

Link-time optimization and GCC

slide-9
SLIDE 9

Modernizing GCC (LTO perspective)

  • 1999 (GCC 2.95): Function-at-a-time
  • 2001 (GCC 3.0): New inliner (first high-level optimization in GCC)
  • 2004 (GCC 3.4): Unit-at-a-time; intermodule compilation for C; inter-procedural optimization framework
  • 2005 (GCC 4.0): New SSA optimization framework
  • 2006 (GCC 4.1): Inter-procedural optimizations: profile-guided inlining, pure/const discovery, mod/ref, inter-procedural constant propagation, -fwhole-program -combine
  • 2008 (GCC 4.4): Inter-procedural optimization on SSA; early optimization and inlining
  • 2010 (GCC 4.5): Basic LTO framework (5 years in development)
  • 2011 (GCC 4.6): WHOPR (parallel link-time optimization); Firefox builds

slide-10
SLIDE 10

Link-time optimization

File1.c, File2.c, File3.c, File4.c → GCC (run once per file) → File1.o, File2.o, File3.o, File4.o (each also carrying IL) → ld / gold with LTO plug-in, which invokes the link-time compiler → a.out

slide-11
SLIDE 11

Parallelized Link-time optimization (WHOPR)

File1.c, File2.c, File3.c, File4.c → GCC (run once per file) → File1.o, File2.o, File3.o, File4.o (each also carrying IL) → ld / gold with LTO plugin → whole program analysis → three local optimization jobs in parallel → a.out

slide-12
SLIDE 12

Modernizing GCC (LTO perspective)

  • 2012 (GCC 4.7.0): Memory use optimizations, new inliner heuristics, new inter-procedural constant propagation with cloning
  • 2013 (GCC 4.8.0): Symbol table; propagation of values passed through aggregates
  • 2014 (GCC 4.9.0): Slim LTO objects by default; on-demand loading of functions; devirtualization pass; feedback-directed code layout
  • 2015 (GCC 5): Identical Code Folding; COMDAT optimization; One Definition Rule for C++; alignment propagation; correct command line options handling with LTO
  • 2016 (GCC 6): Linker plugin now detects the type of output binary; C & Fortran type merging; better alias analysis
  • 2017 (GCC 7): Inter-procedural value range propagation; bitwise propagation
  • 2018 (GCC 8): Early debug info; profile representation rewrite; function splitting now by default; reworked runtime estimation; malloc attribute propagation

slide-13
SLIDE 13

GCC optimization pipeline

Compile time:

  • Parser
  • IL generation
  • Early opts: Early inliner, Constant prop., Forward prop., Jump threading, Scalar repl. of aggr., Alias analysis, Redundancy elim., Dead store elim., Dead code elim., Tail recursion, Switch conversion, pure/const/nothrow, EH optimization, Profile guessing
  • IP analysis
  • Streaming out

Link-time serial:

  • Symbol & type streaming in + merging
  • Inter-procedural (whole program) opts: Dead symbol elim., Symbol promotion, Profile analysis, Identical code folding, Devirtualization, Constant propagation, const/destr merging, Inlining, pure/const/nothrow, mod/ref, comdat
  • Partitioning
  • Streaming out

Link-time parallel:

  • Streaming in symbols, types and declarations & link
  • Stream in and apply transformations
  • High level opts: Constant prop., Complete unroll, Forward prop., Alias analysis, Return slot opt., Redundancy elim., Jump threading, Dead code elim., Conditional store elim., Copy prop., If combine, Tail recursion, Copy loop headers, Scalar repl. of aggr., Dead store elim., Dead code elim., Reassociation, Sincos, bswap opt., Loop invariant motion, Partial redundancy elim., Loop splitting, Unroll and jam, Loop distribution, Loop interchange, ...
  • Low level opts: Common subexpression elim., Forward propagation, Copy propagation, Partial redundancy elim., Code hoisting, Copy propagation, Store motion, If conversion, Loop invariant motion, Loop unrolling, Doloop optimization, Web construction, Copy propagation, Common subexpression elim., Dead store elim., Instruction combine, Function partitioning, Instruction splitting, Live range shrinking, Scheduling, Register allocation, Global common subexpr. elim., Shrink wrapping, Stack adjustment opt., Register renaming, Constant prop., Code reordering, Scheduling, X87 register stack, Code/data alignment, Machine dependent reorg., Code output

slide-14
SLIDE 14

Benchmarks

slide-15
SLIDE 15

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2; series: generic, native]

slide-16
SLIDE 16

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast; series: generic, native]

slide-17
SLIDE 17

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast; series: generic, native]

[Chart: GCC -Ofast relative to GCC 6 -O2, per benchmark (429.mcf, 458.sjeng, 445.gobmk, 403.gcc, 462.libquantum, 473.astar, 464.h264ref, 471.omnetpp, 401.bzip2, 483.xalancbmk, 400.perlbench, 456.hmmer, Geomean); series: GCC 6, GCC 7, GCC 8]

slide-18
SLIDE 18

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast; series: generic, native]

slide-19
SLIDE 19

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast; series: generic, native]

slide-20
SLIDE 20

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast; series: generic, native]

[Chart: Clang & ICC -Ofast relative to GCC 8, per benchmark (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, Geomean); series: clang/flang 6, ICC 18]

slide-21
SLIDE 21

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast; series: generic, native]

slide-22
SLIDE 22

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast, GCC 8 -O2 -flto, GCC 8 -Ofast -flto; series: generic, native]

slide-23
SLIDE 23

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast, GCC 8 -O2 -flto, GCC 8 -Ofast -flto; series: generic, native]

[Chart: GCC -Ofast -flto relative to -Ofast, per benchmark (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, Geomean)]

slide-24
SLIDE 24

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast, GCC 8 -O2 -flto, GCC 8 -Ofast -flto; series: generic, native]

slide-25
SLIDE 25

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast, GCC 8 -O2 -flto, GCC 8 -Ofast -flto, clang/flang 6 -Ofast -flto, ICC 18 -Ofast -flto; series: generic, native]

slide-26
SLIDE 26

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast, GCC 8 -O2 -flto, GCC 8 -Ofast -flto, clang/flang 6 -Ofast -flto, ICC 18 -Ofast -flto; series: generic, native]

[Chart: Clang/flang 6 and ICC 18 relative to GCC 8 (-O2 -flto), per benchmark (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, Geomean); series: clang/flang 6, ICC 18]

slide-27
SLIDE 27

SPECint 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, clang/flang 6 -Ofast, ICC 18 -Ofast, GCC 8 -O2 -flto, GCC 8 -Ofast -flto, clang/flang 6 -Ofast -flto, ICC 18 -Ofast -flto; series: generic, native]

slide-28
SLIDE 28

Profile feedback & LTO work well together

The hmmer benchmark has problems and was excluded. To build with profile feedback, you can often use:

./configure
CFLAGS="-O2 -fprofile-generate" make
make check
make clean
CFLAGS="-O2 -fprofile-use" make
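
A small sketch of the kind of code where profile feedback tells the compiler something it cannot derive statically. The function `count_byte` is hypothetical and not taken from the benchmarks. How often the branch is taken depends entirely on the input data: a binary built with -fprofile-generate records such counts during the training run, and a rebuild with -fprofile-use reads them back.

```c
#include <stddef.h>

/* Hypothetical hot loop: counts the bytes in buf that equal needle.
   Whether the comparison is usually true or usually false depends on
   the data, which only a training run can reveal. */
size_t count_byte(const unsigned char *buf, size_t len, unsigned char needle)
{
    size_t hits = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == needle)   /* branch frequency is recorded in the profile */
            hits++;
    }
    return hits;
}
```

The training run (here `make check`) should exercise the program in a way that resembles real use, since the recorded counts guide later inlining and code layout decisions.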

[Chart: performance relative to GCC 8 -Ofast, per benchmark (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, Geomean); series: LTO, FDO, FDO+LTO]

slide-29
SLIDE 29

SPECint 2006 code size

[Chart: code size for GCC 7 -O2, GCC 8 -O2, GCC 7 -Ofast, GCC 8 -Ofast, GCC 8 -Ofast + FDO, Clang 6 -O2, Clang 6 -Ofast, ICC 18 -Ofast; series: non-LTO, LTO]

slide-30
SLIDE 30

SPECfp 2006 performance (relative to GCC 6)

[Chart: GCC 7 -O2, GCC 8 -O2, GCC 6 -Ofast, GCC 7 -Ofast, GCC 8 -Ofast, ICC 18 -Ofast, GCC 8 -O2 -flto, GCC 8 -Ofast -flto, ICC 18 -Ofast -flto; series: generic, native]

slide-31
SLIDE 31

Firefox performance summary

[Chart: Firefox performance relative to non-LTO build, per benchmark (responsiveness, tp5o, dromaeo dom, Displaylist mutate, tap paint, speedometer, a11yr, svgr_opacity, tp5o, ARES6, stylebench, tart, dromaeo css, tsvgx, startup time); series: static, profile feedback]

slide-32
SLIDE 32

Firefox performance – dromaeo DOM

slide-33
SLIDE 33

Firefox performance – tp5o page responsiveness

slide-34
SLIDE 34

Firefox binary size

[Stacked bar chart: Firefox binary size split into EH data, relocations and text; configurations: clang 6 (-Oz, -Os, -O2, -O3; with -flto, -flto=thin and FDO), GCC 8 (-Os, -O2, -O3; with -flto and FDO), GCC 6 and GCC 7 (-O3 -flto + FDO)]

slide-35
SLIDE 35

Firefox binary size

[Stacked bar chart: Firefox binary size split into EH data, relocations and text; configurations: clang 6 (-Oz, -Os, -O2, -O3; with -flto, -flto=thin and FDO), GCC 8 (-Os, -O2, -O3; with -flto and FDO), GCC 6 and GCC 7 (-O3 -flto + FDO)]

slide-36
SLIDE 36

Firefox binary size

[Stacked bar chart: Firefox binary size split into EH data, relocations and text; configurations: clang 6 (-Oz, -Os, -O2, -O3; with -flto, -flto=thin and FDO), GCC 8 (-Os, -O2, -O3; with -flto and FDO), GCC 6 and GCC 7 (-O3 -flto + FDO)]

slide-37
SLIDE 37

Firefox binary size

[Stacked bar chart: Firefox binary size split into EH data, relocations and text; configurations: clang 6 (-Oz, -Os, -O2, -O3; with -flto, -flto=thin and FDO), GCC 8 (-Os, -O2, -O3; with -flto and FDO), GCC 6 and GCC 7 (-O3 -flto + FDO)]

slide-38
SLIDE 38

Firefox binary size

[Stacked bar chart: Firefox binary size split into EH data, relocations and text; configurations: clang 6 (-Oz, -Os, -O2, -O3; with -flto, -flto=thin and FDO), GCC 8 (-Os, -O2, -O3; with -flto and FDO), GCC 6 and GCC 7 (-O3 -flto + FDO)]

slide-39
SLIDE 39

Summary

  • LTO now works and scales for large applications (Firefox, LibreOffice, …)
  • Important size optimization, especially for large C++ programs
  • Often an important performance optimization, especially when combined with profile feedback
  • Space for future improvements (both on the GCC side and in re-optimizing programs for the LTO model)

slide-40
SLIDE 40

LTO in openSUSE Factory

slide-41
SLIDE 41

LTO in openSUSE Factory

  • A new staging project (openSUSE:Factory:Staging:N)
  • Setting: Optflags: * -flto
  • ~80 packages failed (out of all ~2300)
  • Branch all failing packages and disable LTO
  • LTO Factory (with some packages disabled) can run the openQA test suite (with a small fallout)

slide-42
SLIDE 42

Package statistics

  • Total ELF files in distribution: ~6700
  • Reduction: 1839 MB to 1736 MB (-5.6%)
  • libmergedlo.so: reduction 10 MB (-16%)
  • Biggest reductions:

– mysql_upgrade (-92%)

– innochecksum (-94%)

– mbstream (-94%)
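
The percentages above are relative size changes. A minimal sketch of that arithmetic follows; the helper name `size_change_percent` is an assumption made for illustration. For the distribution-wide totals, 1839 MB to 1736 MB is a drop of 103 MB, which rounds to the -5.6% quoted on the slide.

```c
/* Relative size change in percent; a negative result means the
   binary (or set of binaries) shrank. */
double size_change_percent(double before_mb, double after_mb)
{
    return (after_mb - before_mb) / before_mb * 100.0;
}
```

Calling `size_change_percent(1839.0, 1736.0)` reproduces the overall figure; the per-binary figures would need the individual before and after sizes, which the slide does not list.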

slide-43
SLIDE 43

Issues

  • Argyllcms: lto1: fatal error: multiple prevailing defs for 'xcal_read_icc':

LD PR: https://sourceware.org/PR23079

  • GCC LTO miscompilation: http://gcc.gnu.org/PR85248
  • .symver in shared libraries: https://gcc.gnu.org/PR48200:

__asm__(".symver old_foo,foo@VERS_1.1")

– New symver function attribute must be added (GCC 9.1.0)

slide-44
SLIDE 44

Issues (cont.)

  • static libraries (*.la):

– e2fsprogs, btrfsprogs, …
– Fat LTO objects must be used (-ffat-lto-objects)
– LTO ELF sections should be stripped
– OBS sanitizer should be extended
– LTO mode can combine both LTO objects and assembly objects

slide-45
SLIDE 45

Issues (cont. 2)

  • LTO warnings (-Wodr, ...):

– ltrace: error: type of 'filter_matches_symbol' does not match original declaration [-Werror=lto-type-mismatch]

– gdb: error: type 'struct ipa_sym_addresses' violates the C++ One Definition Rule [-Werror=odr]

  • mpir: weak configure script that scans *.o files as a blob
  • weak support for LTO debug info in the dwz tool
  • Maybe higher memory constraints for selected packages
slide-46
SLIDE 46

Issues (cont. 3)

  • Usage of top-level assembler (https://gcc.gnu.org/PR57703):

– Example of syscall.cc in Chromium project:

asm volatile(".text\n"
             ".align 16, 0x90\n"
             ".type SyscallAsm, @function\n"
             "SyscallAsm:.cfi_startproc\n"

Error: nacl_helper.ltrans1.ltrans.o: In function `playground2::SandboxSyscall(int, long, long, long, long, long, long)':
nacl_helper.ltrans1.o:(.text+0x4503): undefined reference to `SyscallAsm'

slide-47
SLIDE 47

Histogram of text segment sizes

[Histogram; axis ticks from 5,000,000 to 40,000,000 and from 20 to 140]

slide-48
SLIDE 48

Conclusion

  • LTO looks mature enough to be used by default
  • Do we want it for openSUSE Factory?
  • For the future, can we offer it also for SLE?
slide-49
SLIDE 49

License

This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any derivative work is distributed under the same license. Details can be found at https://creativecommons.org/licenses/by-sa/4.0/

General Disclaimer

This document is not to be construed as a promise by any participating organisation to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for openSUSE products remains at the sole discretion of openSUSE. Further, openSUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All openSUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States and other countries. All third-party trademarks are the property of their respective owners.

Credits

Template: Richard Brown, rbrown@opensuse.org

Design & Inspiration: openSUSE Design Team, http://opensuse.github.io/branding-guidelines/