SLIDE 1

Basic overview of LTO Compiling large applications Problems specific for large applications

Optimizing real-world applications with GCC Link Time Optimization

Taras Glek (Mozilla Corporation), Honza Hubička (SuSE ČR s.r.o.), GCC Summit, 2010

T. Glek, J. Hubička: Optimizing real-world applications with GCC LTO

SLIDE 2

Outline

1. Basic overview of LTO
2. Compiling large applications
3. Problems specific for large applications

SLIDE 6

Link Time Optimization and Inter Procedural Analysis

  • Link time optimization (LTO) extends the scope of interprocedural analysis from a single source file to the whole program visible at link time.
  • Implemented by calling back to the optimizer backend from the linker. Development started in 2005, merged to mainline in 2009. First released in GCC 4.5.
  • Interprocedural analysis (IPA) and optimization is about optimizing across function boundaries.
  • GCC callgraph module, in GCC mainline since 2003.

SLIDE 13

Basic components

1. Infrastructure for streaming an intermediate language to disk
2. A new compiler front-end (lto1)
3. A linker plugin integrated into the Gold linker
4. Modifications to the GCC driver (collect2) to support linking of LTO object files using either the linker plugin or direct invocation of the LTO front-end
5. Various middle-end infrastructure updates (symbol table representation, support for merging of declarations and types, etc.)
6. Support for using the linker plugin in the tool-chain (ar and nm)
7. Libtool update to handle LTO

SLIDE 16

On disk representation

  • Program is represented in GIMPLE IL in SSA form.
  • Intermediate language is streamed into target object files.
      • Allows integration with the rest of the toolchain (producing archives etc.).
      • Supports "fat" object files with both the IL and assembly.
  • LTO information is structured into several sections of the object file:
      • Command line options (.gnu.lto_.opts)
      • The symbol table (.gnu.lto_.symtab)
      • Global declarations and types (.gnu.lto_.decls)
      • The callgraph (.gnu.lto_.cgraph)
      • IPA references (.gnu.lto_.refs)
      • Function bodies
      • Static variable initializers (.gnu.lto_.vars)
      • Summaries and optimization summaries

SLIDE 19

LTO versus WHOPR

  • LTO reads the whole program into memory at link time and optimizes it as a single compilation unit.
  • WHOPR mode allows parallelization of the local optimization stage.

[Figure: src1, src2 and src3 are each compiled to a .o file; the three .o files pass through a single IPA optimization step; each resulting .o is then optimized locally, and the final .o files are linked by ld.]

  • 3 stage compilation process.
  • Only the IPA propagation stage sees the whole program, and it is not executed in parallel.
  • WHOPR does not work in GCC 4.5. In GCC 4.6 it will replace LTO by default.

SLIDE 22

3 stages of WHOPR

  • LGEN (compile time, parallel via make)
      • Parsing
      • Early optimization
      • Function summaries production
      • Streaming
      • Late compilation for "fat objects"
  • WPA (link time, serial)
      • Merge declarations and types
      • Produce combined callgraph
      • Interprocedural optimizations
      • Streaming
  • LTRANS (link time, parallel via temporary Makefile)
      • Apply results of interprocedural optimizations
      • Late optimization; production of assembly

SLIDE 25

WHOPR Interprocedural optimization pass

To make WHOPR possible, inter-procedural optimization passes are split into the following stages:

LGEN time:

  • 1. Generate summary
  • 2. Write summary

WPA time:

  • 3. Read summary
  • 4. Execute
  • 5. Write optimization summary

LTRANS time:

  • 6. Read optimization summary
  • 7. Transform
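
The seven stages above can be sketched as a toy pass. This is only an illustration in Python, not GCC code: the class name `ConstantArgPass`, the use of JSON as a stand-in for streaming, and the `(caller, callee, constant_argument)` call-site tuples are all assumptions made for the example.

```python
import json


class ConstantArgPass:
    """Toy IPA pass: find callees that always receive the same constant."""

    # --- LGEN time (once per translation unit, can run in parallel) ---
    def generate_summary(self, unit):
        # unit: list of (caller, callee, constant_argument) call sites
        return [{"callee": callee, "arg": arg} for _, callee, arg in unit]

    def write_summary(self, summary):
        return json.dumps(summary)  # stands in for streaming into the object file

    # --- WPA time (serial, sees summaries of the whole program) ---
    def read_summary(self, blobs):
        return [site for blob in blobs for site in json.loads(blob)]

    def execute(self, sites):
        seen = {}
        for site in sites:
            seen.setdefault(site["callee"], set()).add(site["arg"])
        # Keep only callees where every call site passes the same value.
        return {c: args.pop() for c, args in seen.items() if len(args) == 1}

    def write_optimization_summary(self, decisions):
        return json.dumps(decisions)

    # --- LTRANS time (once per partition, can run in parallel) ---
    def read_optimization_summary(self, blob):
        return json.loads(blob)

    def transform(self, function_name, decisions):
        if function_name in decisions:
            return f"{function_name} specialized for arg={decisions[function_name]}"
        return f"{function_name} unchanged"


def run_whopr(units):
    p = ConstantArgPass()
    blobs = [p.write_summary(p.generate_summary(u)) for u in units]             # LGEN
    opt = p.write_optimization_summary(p.execute(p.read_summary(blobs)))        # WPA
    decisions = p.read_optimization_summary(opt)                                # LTRANS
    functions = sorted({callee for u in units for _, callee, _ in u})
    return [p.transform(f, decisions) for f in functions]


units = [
    [("main", "scale", 2), ("main", "log", 1)],
    [("init", "scale", 2), ("init", "log", 7)],
]
result = run_whopr(units)
```

Only `execute` needs to see every unit's summary at once; the other hooks work on a single unit or partition, which is what lets LGEN and LTRANS run in parallel.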

SLIDE 26

Inter-procedural optimization infrastructure

  • Callgraph: multi-graph where functions are nodes and call sites are edges
  • Varpool: list of static variables and initializers
  • IPA references: multi-graph across functions and variables representing references (reads, writes and addresses taken)
  • Jump functions summarizing inter-procedural dataflow
  • Pass manager for inter-procedural passes
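
A rough picture of the first three structures, as a Python sketch. The `ProgramGraph` class and its method names are hypothetical and chosen for the example; they do not mirror GCC's internal data structures.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProgramGraph:
    # Callgraph: caller -> list of callees. A list (not a set) keeps one
    # edge per call site, which is what makes it a multi-graph.
    calls: dict = field(default_factory=lambda: defaultdict(list))
    # Varpool: static variable name -> initializer.
    varpool: dict = field(default_factory=dict)
    # IPA references: (referrer, symbol, kind), kind in {"read", "write", "addr"}.
    refs: list = field(default_factory=list)

    def add_call(self, caller, callee):
        self.calls[caller].append(callee)

    def add_ref(self, referrer, symbol, kind):
        assert kind in ("read", "write", "addr")
        self.refs.append((referrer, symbol, kind))

    def call_sites(self, caller, callee):
        return self.calls[caller].count(callee)

    def read_only_vars(self):
        # A variable that is never written and never has its address taken
        # can be treated as read-only by later passes.
        touched = {sym for _, sym, kind in self.refs if kind in ("write", "addr")}
        return sorted(v for v in self.varpool if v not in touched)


g = ProgramGraph()
g.add_call("main", "helper")
g.add_call("main", "helper")   # second call site, second edge
g.add_call("helper", "leaf")
g.varpool["table"] = [1, 2, 3]
g.varpool["counter"] = 0
g.add_ref("helper", "table", "read")
g.add_ref("leaf", "counter", "write")
```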

SLIDE 30

Pass ordering issues

  • In classical LTO mode passes execute in sequence. In WHOPR passes execute "in parallel":
      • All passes perform analysis
      • All passes perform IP propagation
      • All passes apply changes
  • "Parallel" execution of IP passes leads to ordering issues.
  • Virtual clones were introduced to avoid the need for pass-specific transforms. A virtual clone is:
      • A node in the callgraph like a normal function
      • Unlike a normal function, it has no body
      • Has a pointer to its master and a description of how to create the clone from it
  • Callgraph hooks are available to keep pass-specific info consistent.
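
The virtual clone idea can be illustrated with a small Python sketch. Everything here is an assumption for the example: the `CgraphNode` class, the token-list "body", and the `replace_params` dictionary standing in for the description of how to build the clone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CgraphNode:
    name: str
    body: Optional[list] = None             # None while the node is a virtual clone
    master: Optional["CgraphNode"] = None   # node this clone is derived from
    replace_params: Optional[dict] = None   # how to create the clone from its master

    @property
    def is_virtual_clone(self):
        return self.body is None and self.master is not None


def create_virtual_clone(master, suffix, replace_params):
    # Cheap: no body is copied, only the recipe is recorded.
    return CgraphNode(f"{master.name}.{suffix}", None, master, dict(replace_params))


def materialize(node):
    # Build the body only when it is actually needed.
    if not node.is_virtual_clone:
        return node
    source = materialize(node.master)       # the master may itself be a clone
    node.body = [node.replace_params.get(tok, tok) for tok in source.body]
    return node


scale = CgraphNode("scale", body=["return", "x", "*", "factor"])
clone = create_virtual_clone(scale, "constprop", {"factor": "2"})
was_virtual = clone.is_virtual_clone
materialize(clone)
```

Because the clone is just a graph node plus a recipe, several passes can record decisions on it during WPA without any of them having to rewrite function bodies.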

SLIDE 37

The Linker plugin

The linker plugin integrates into the Gold linker and:

1. Recognizes objects containing GCC LTO sections and claims them
2. Reads the LTO symbol table and passes it to the linker for linking
3. After linking is decided, obtains resolution info
      • Information about resolved, prevailed and prevailing symbols
      • Information whether a given symbol is used outside LTO code
4. Saves resolution info into a file and executes GCC
5. Adds GCC-produced object files into the linker

The linker plugin is independent of the GCC LTO infrastructure and Gold. LLVM uses the plugin; GNU LD is being updated, too.
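
Step 3 can be pictured with a simplified model of symbol resolution. This Python sketch is not the plugin API: the object dictionaries, the "first definition in link order wins" rule, and the three resolution labels are assumptions chosen to keep the example small.

```python
def resolve(objects):
    """objects: list of dicts in link order with keys
    'name', 'lto' (bool), 'defs' (set of symbols), 'uses' (set of symbols).
    Returns {(object name, symbol): resolution} for symbols defined in LTO objects."""
    used_outside_lto = set()
    prevailing = {}  # symbol -> name of the object whose definition wins
    for obj in objects:
        if not obj["lto"]:
            used_outside_lto |= obj["uses"]
        for sym in obj["defs"]:
            prevailing.setdefault(sym, obj["name"])  # toy rule: first definition wins

    resolution = {}
    for obj in objects:
        if not obj["lto"]:
            continue
        for sym in obj["defs"]:
            key = (obj["name"], sym)
            if prevailing[sym] != obj["name"]:
                resolution[key] = "PREEMPTED"              # another copy prevailed
            elif sym in used_outside_lto:
                resolution[key] = "PREVAILING_DEF"         # must stay visible
            else:
                resolution[key] = "PREVAILING_DEF_IRONLY"  # only LTO code uses it
    return resolution


objects = [
    {"name": "crt1.o", "lto": False, "defs": {"_start"}, "uses": {"main"}},
    {"name": "a.o", "lto": True, "defs": {"main", "helper"}, "uses": {"helper", "log_msg"}},
    # b.o carries a second copy of helper, as could happen with a COMDAT function.
    {"name": "b.o", "lto": True, "defs": {"helper", "log_msg"}, "uses": set()},
    {"name": "legacy.o", "lto": False, "defs": set(), "uses": {"log_msg"}},
]
res = resolve(objects)
```

A symbol that resolves to the "only LTO code uses it" case is the one the compiler is free to treat as local, which is what the whole-program assumptions rely on.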

SLIDE 41

Linker plugin and whole program assumptions

  • Functions and variables not declared static prevent inter-procedural optimization.
  • Optimization works a lot better when the compiler can make whole program assumptions.
  • -fwhole-program makes GCC treat every function and variable as static to the link-time unit. The externally_visible attribute overrides the effect (main() is implicitly externally visible).
  • -fwhole-program does not fit shared libraries (users would need to annotate all of the interface).
  • To speed up dynamic linking, a lot of libraries use visibility ("hidden") or -fvisibility=hidden.
  • GCC uses the linker plugin to see what non-LTO objects use. Without the linker plugin, hidden symbols are implicitly brought local.
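
The rules on this slide can be restated as a small decision function. This is a Python model of the slide's bullets only, under the assumption that nothing else influences the decision; the function `can_localize` and the symbol dictionaries are hypothetical and do not describe GCC's actual implementation.

```python
def can_localize(sym, *, whole_program=False, linker_plugin=False,
                 used_by_non_lto=frozenset()):
    """Return True if the compiler may treat `sym` as local to the link-time unit.

    sym: dict with keys 'name', 'static', 'visibility', 'externally_visible'.
    """
    if sym["static"]:
        return True                      # already local
    if sym["externally_visible"] or sym["name"] == "main":
        return False                     # attribute (or main) keeps it exported
    if whole_program:
        return True                      # -fwhole-program: everything else is local
    if sym["visibility"] == "hidden":
        if linker_plugin:
            # The plugin reports whether non-LTO objects use the symbol.
            return sym["name"] not in used_by_non_lto
        return True                      # no plugin: implicitly brought local
    return False                         # default visibility stays exported


api = {"name": "api", "static": False, "visibility": "default", "externally_visible": False}
impl = {"name": "impl", "static": False, "visibility": "hidden", "externally_visible": False}
entry = {"name": "main", "static": False, "visibility": "default", "externally_visible": False}
```

The last hidden-symbol case shows why the plugin matters: without it, a hidden symbol that a non-LTO object still references would be localized anyway.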

SLIDE 50

Optimizations performed

Early optimization (at compile time)

  • Into-SSA conversion
  • Early inlining
  • Constant propagation, copy propagation, dead code elimination, and scalar replacement
  • Inter-procedural scalar replacement
  • Tail recursion elimination
  • Exception handling optimizations
  • Static profile estimation
  • Attributes discovery
  • Function splitting

SLIDE 59

Optimizations performed

  • Early optimization (at compile time)
  • Whole program visibility
  • IPA profile propagation
  • Constant propagation
  • Constructor and destructor merging
  • Inlining
  • Function attributes
  • MOD/REF analysis (ipa-reference)
  • New leaf function attribute
  • Experimental passes

SLIDE 60

Outline

1. Basic overview of LTO
2. Compiling large applications
3. Problems specific for large applications

SLIDE 62

Firefox and GCC

Firefox

  • Main shared lib libxul.so is about 6 000 000 lines of code.
  • Statically links many libraries built in tree (libffi, cairo, gtk, etc.).
  • Developed and tested with LTO on other compilers (e.g. MSVC).
  • Mostly portable C++ code, but uses many performance-related features (visibility aliases, asm hunks, etc.).

GCC

  • 800 000 lines of hand-written C code, 500 000 lines of auto-generated code.

SLIDE 64

Serial GCC Build times

  • GCC non-LTO build time: 8m12s
  • GCC LTO link time: 6m31s
      • Reading the IL: 3%
      • Merging of declarations: 1%
      • Outputting of the assembly file: 2%
      • Debug information generation (var-tracking and symout): 8%
      • Garbage collection: 2%
      • Local optimizations: the rest, including partial redundancy elimination (5%), GIMPLE to RTL expansion (8%), RTL level dataflow analysis (11%), instruction combining (3%), register allocation (6%), scheduling (5%)

SLIDE 66

Parallel GCC Build times (24 cores)

  • GCC non-LTO build time: 56s
  • GCC LTO link time: 48s
  • Serial WPA stage: 19s
      • Reading global declarations and types: 28% of the overall time taken by the WPA stage
      • Merging declarations: 6%
      • Inter-procedural optimization: 9%
      • Streaming of object files to be passed to LTRANS: 42%
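
To put the percentages in absolute terms, the arithmetic can be worked through as below. This Python sketch uses only the figures quoted on the slide and assumes that the 19 s WPA stage is part of the 48 s link time; the phase labels are shortened for the example.

```python
wpa_seconds = 19
link_seconds = 48

# Share of the WPA stage spent in each listed phase, in percent.
shares = {
    "read declarations and types": 28,
    "merge declarations": 6,
    "inter-procedural optimization": 9,
    "stream objects for LTRANS": 42,
}

# Convert each share to seconds of wall-clock time.
seconds = {phase: round(wpa_seconds * pct / 100, 2) for phase, pct in shares.items()}

# The listed phases do not add up to 100%; the remainder is not itemized on the slide.
unaccounted_pct = 100 - sum(shares.values())

# Fraction of the parallel link that is spent in the serial WPA stage.
serial_fraction = wpa_seconds / link_seconds
```

Reading and streaming together account for 70% of the WPA stage, about 13 s of the 19 s, while the inter-procedural optimization itself takes under 2 s.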

SLIDE 68

Basic overview of LTO Compiling large applications Problems specific for large applications

Serial Firefox Build times

Firefox non-LTO build time: 39m
Firefox LTO link time: 19m29s

Reading of the IL: 7%
Merging of declarations: 4%
Output of the assembly file: 3%
Debug information generation: disabled in our builds
Garbage collection: 2%
Local optimizations: the rest, including operand scan (5%), partial redundancy elimination (5%), GIMPLE to RTL expansion (13%), RTL-level dataflow analysis (5%), instruction combining (3%), register allocation (9%), and scheduling (3%)

slide-69
SLIDE 69

Parallel Firefox Build times (24 cores)

Firefox non-LTO build time: 9m38s
Firefox LTO link time: 5m30s
Serial WPA stage: 4m24s

Reading global declarations and types: 24%
Merging declarations: 20%
Inter-procedural optimization: 8%
Streaming of object files to be passed to LTRANS: 28%
Callgraph and WPA overhead (callgraph merging and partitioning): 12%
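As a back-of-the-envelope check on the 24-core figures, the serial WPA stage can be expressed as a share of the whole LTO link time. The sketch below only redoes that arithmetic on the timings quoted in the slides; the helper names `to_seconds` and `serial_share_percent` are illustrative and are not part of GCC or of the talk.

```cpp
#include <cassert>

// Convert an "XmYs" timing from the slides into seconds.
inline int to_seconds(int minutes, int seconds) {
    assert(minutes >= 0 && seconds >= 0 && seconds < 60);
    return minutes * 60 + seconds;
}

// Share of the LTO link time spent in the serial WPA stage,
// as an integer percentage rounded down.
inline int serial_share_percent(int wpa_seconds, int link_seconds) {
    assert(link_seconds > 0 && wpa_seconds >= 0 && wpa_seconds <= link_seconds);
    return wpa_seconds * 100 / link_seconds;
}
```

With the numbers above, `serial_share_percent(19, 48)` gives 39 for GCC and `serial_share_percent(to_seconds(4, 24), to_seconds(5, 30))` gives 80 for Firefox, so on Firefox most of the parallel link time is spent in the stage the slides label as serial.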

slide-71
SLIDE 71

Memory usage

LTO GCC: 2GB
LTO Firefox: 8.5GB
WHOPR GCC: 415MB for WPA, LTRANS compilations all less than 400MB
    Memory-mapped object files: 170MB (not all of it is paged in)
    Types and declarations: 260MB
    Callgraph, varpool and other IPA stuff: 52MB
WHOPR Firefox: 4GB for WPA
    Declarations and types: 3.7GB

slide-74
SLIDE 74

Code quality

GCC gets faster only with -O3 (by about 4% on non-optimizing compilation of combine.c)
Firefox gets a bit faster overall

benchmark      speedup
Dromaeo CSS    1.83%
tdhtml         -0.54%
tp_dist        0.50%
tsvg           0.07%

Both projects were tuned for file-by-file compilation, which limits what cross-module optimizations can gain.

slide-75
SLIDE 75

Code quality II

GCC gets 6% smaller (-O2)
Firefox gets 6% smaller (-O3)

Reducing inline unit growth saves an additional 12%. The resulting -O3 --param inline-unit-growth=5 -flto binary is the same size as the -Os binary!

At -Os, GCC produces an 11% smaller Firefox; LLVM is reported to save 13%.

slide-76
SLIDE 76

Code quality III

benchmark     speedup    size
perlbench     +1.4%      +4%
bzip2         +2.6%      -45%
gcc           -0.3%      +1.2%
mcf           +1.9%      -33%
gobmk         +3.4%      +1.8%
hmmer         +0.8%      -55%
sjeng         +1.2%      -11%
libquantum    -0.5%      -61%
h264ref       +7.0%      -9%
omnetpp       -0.8%      -11%
astar         -1.3%      -20%

slide-77
SLIDE 77

Code quality IV

benchmark     speedup        size
bwaves        0% (+15%)      -27%
gamess        -0.7%          -50%
milc          +2.2%          -26%
zeusmp        +0.4%          -27%
gromacs       0%             -18%
cactusADM     -0.8%          -42%
leslie3d      -2.1% (0%)     +0.6%
namd          0%             -40%
soplex        +1.5%          -50%
povray        +5%            -2.3%
calculix      1.1%           -38%
GemsFDTD      0%             -70%
tonto         -0.2%          -25%
lbm           +3.2%          0%
wrf           0%             -36%
sphinx3       +2.9%          -32%

slide-78
SLIDE 78

SPEC2000 is easier

11% on EON
5% on Perl
2.5% on GCC
17% on Vortex
7% on bzip
33% on wupwise, but it is gone now, probably because of profile issues
4% on Applu
2% on ART

slide-79
SLIDE 79

Outline

1. Basic overview of LTO
2. Compiling large applications
3. Problems specific for large applications

slide-80
SLIDE 80

Code size and speed tradeoffs

For common benchmarks, code size is rarely an issue
Tuning GCC on benchmarks leads to code size growth (-O2 and -O3)
Firefox builds with -Os by default; they will switch to -O3 because it is faster
The Linux kernel is often built with -Os too
We need to take more care with -O2 file size tradeoffs.
LTO can help here:
    ipa-profile pass (not terribly effective)
    Global inliner and cloning decisions
    Whole-program assumptions lead to large code size improvements
Profile feedback helps even more. Firefox is going to use it.

slide-84
SLIDE 84

Startup times

C++ applications tend to load slowly.

Time
15%    relocations
46%    static initializers
6%     misc runtime linker
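To make the "static initializers" row concrete, here is a minimal sketch of what such an initializer looks like in C++. The names `Registry`, `g_eager`, `lazy_registry` and `init_counter` are hypothetical and exist only for this illustration; the function-local static is shown as a common source-level contrast, not as something the slides propose.

```cpp
#include <cassert>
#include <string>

// Hypothetical instrumentation: counts how many Registry objects
// have been constructed so far.
inline int& init_counter() {
    static int count = 0;
    return count;
}

struct Registry {
    std::string name;
    explicit Registry(const char* n) : name(n) { ++init_counter(); }
};

// Namespace-scope object with a non-constexpr constructor: the compiler
// emits a static initializer that runs at load time, before main().
Registry g_eager("eager");

// Function-local static: constructed on the first call instead,
// so it adds no work to program startup.
Registry& lazy_registry() {
    static Registry instance("lazy");
    return instance;
}
```

A large C++ code base can contain many objects like `g_eager` spread over many translation units, and each one adds code that has to run (and be paged in) before `main()` starts.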

slide-85
SLIDE 85

Code locality improvements

Constructor merging pass (LTO only)
Placing static ctors/dtors in special subsections
Function reordering
    Graph clustering techniques seem to work just slightly better than simple DFS order
    The callgraph lacks virtual function calls, so the pass is confused on Firefox
Profile feedback driven function reordering
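The "simple DFS order" baseline mentioned above can be sketched as a preorder walk over the callgraph that lays out each function the first time it is reached, so callers end up near their callees. This is an illustrative sketch only, not GCC's implementation; the `CallGraph` alias and the `dfs_layout` function are assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Caller name -> list of direct callees (hypothetical representation).
using CallGraph = std::unordered_map<std::string, std::vector<std::string>>;

// Append fn and everything reachable from it in DFS preorder.
static void visit(const CallGraph& graph, const std::string& fn,
                  std::unordered_set<std::string>& seen,
                  std::vector<std::string>& order) {
    if (!seen.insert(fn).second) return;  // already placed
    order.push_back(fn);
    auto it = graph.find(fn);
    if (it == graph.end()) return;        // leaf: no known callees
    for (const auto& callee : it->second) visit(graph, callee, seen, order);
}

// Produce a function layout by walking the callgraph from the given roots.
std::vector<std::string> dfs_layout(const CallGraph& graph,
                                    const std::vector<std::string>& roots) {
    assert(!roots.empty());
    std::unordered_set<std::string> seen;
    std::vector<std::string> order;
    for (const auto& root : roots) visit(graph, root, seen, order);
    return order;
}
```

A function that is only ever reached through a virtual call has no incoming edge in such a graph, so the walk never places it next to its real callers. That is the limitation the slide notes for Firefox.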

slide-86
SLIDE 86

Other improvements

Reducing the number of relocations
    Privatizing comdat functions
    Optimizing out some vtables (30% reduction of .data.rel.ro.local)
    About 5% smaller dynamic linker table, 0.6% fewer relocations.
    The C++ API is not too PIC friendly; perhaps we can do local conventions

Data section locality improvements
    Order data sections to match references from code
    Group structures used by static constructors
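For readers unfamiliar with what "privatizing" a comdat function buys, the closest source-level analogue is GCC's symbol visibility attribute. The sketch below assumes an ELF target and a shared-library build; the macro `LOCAL_TO_DSO` and both helper functions are hypothetical names for illustration, and the slides describe the compiler doing this automatically under LTO rather than by annotation.

```cpp
#include <cassert>

// visibility("hidden") is a GCC/Clang attribute; other compilers get a no-op.
#if defined(__GNUC__)
#define LOCAL_TO_DSO __attribute__((visibility("hidden")))
#else
#define LOCAL_TO_DSO
#endif

// An inline function emitted out of line lands in a COMDAT group. With
// default visibility it is also listed in a shared library's dynamic
// symbol table.
inline int exported_helper(int x) { return x + 1; }

// Hidden visibility keeps the symbol out of the dynamic symbol table,
// so the dynamic linker has less to resolve at load time.
LOCAL_TO_DSO inline int private_helper(int x) { return x * 2; }
```

Both functions behave identically when called; the difference shows up only in the symbol table of the resulting shared object.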

slide-91
SLIDE 91

Thank you!

Questions?
