PGO and LLVM: Status and Current Work (Bob Wilson, Diego Novillo, Chandler Carruth)


slide-1
SLIDE 1

PGO and LLVM

Status and Current Work

Bob Wilson Diego Novillo Chandler Carruth

slide-5
SLIDE 5

PGO: What Is It?

  • PGO = Profile Guided Optimization
  • More information -> better optimization
  • Profile data
  • Control flow: e.g., execution counts
  • Future extensions: object types, etc.
slide-11
SLIDE 11

What Is It Good For?

  • Some examples:
  • Block layout
  • Spill placement
  • Inlining heuristics
  • Hot/cold partitioning
  • Can significantly improve performance
slide-12
SLIDE 12

What’s the Catch?

  • Assumes program behavior is always the same
  • PGO may hurt performance if behavior changes
  • May require some extra build steps
slide-16
SLIDE 16

History of PGO in LLVM

  • Instrumentation, profile info and block placement

(2004, Chris Lattner)

  • Branch weights and block frequencies

(2011, Jakub Staszak)

  • Setting branch weights from execution counts

(2012, Alastair Murray)

slide-17
SLIDE 17

Outline

  • Front-end instrumentation
  • Profiles from sampling
  • Using profile info in the optimizer and back-end
slide-20
SLIDE 20

Profiling with Instrumentation

  • Pros:
  • Detailed information
  • Predictability
  • Resilient against changes
  • Cons:
  • Need to build instrumented version
  • Running with instrumentation is slower
slide-25
SLIDE 25

Design Goals

  • Degrade gracefully when code changes
  • Profile data not tied to specific compiler version
  • Minimize instrumentation overhead
  • Execution counts accurately mapped to source
slide-28
SLIDE 28

Dealing with Change

  • Project source code changes
  • Detect functions that have changed
  • Ignore profile data for those functions only
  • Some changes are OK
  • Minimum requirement: same control-flow structure
slide-29
SLIDE 29

Compiler Changes

  • Compiler updates should not invalidate profiles
  • LLVM IR generated by front-end often changes
  • Associating profiles with IR can be a problem
slide-30
SLIDE 30

Source-level Accuracy

  • PGO vs. code coverage testing
  • Should only have one profile format for both
  • Profile data for PGO should be viewable
  • Requires profiles to map accurately to source
slide-31
SLIDE 31

Use the Source

  • Solution: associate profile data with clang ASTs
  • Compiler changes are (almost) irrelevant
  • Provides info to detect source changes
  • Independent of optimization and debug info
slide-32
SLIDE 32

Counters on ASTs

  • Walk through ASTs in program order
  • Assign counters to control-flow constructs
  • Compare number of counters to detect changes
  • Can add a hash of ASTs to be more sensitive
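The counter-assignment walk described above can be sketched as follows. The `Stmt` type, the kind strings, and the `CounterAssigner` hashing scheme here are hypothetical stand-ins for clang's real AST and its PGO instrumentation logic, shown only to illustrate the idea of assigning counters in program order and hashing the structure to detect changes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mini-AST: just enough structure to show the walk.
struct Stmt {
  std::string Kind;            // e.g. "WhileStmt", "IfStmt"
  std::vector<Stmt> Children;  // sub-statements in program order
};

struct CounterAssigner {
  unsigned NextCounter = 0;
  uint64_t Hash = 0;

  // Control-flow constructs get a counter; other statements are skipped.
  static bool needsCounter(const std::string &K) {
    return K == "WhileStmt" || K == "IfStmt" || K == "ForStmt";
  }

  void walk(const Stmt &S) {
    // Fold every node kind into the hash so changed control-flow
    // structure yields a different function hash.
    for (char C : S.Kind)
      Hash = Hash * 31 + (uint8_t)C;
    if (needsCounter(S.Kind))
      ++NextCounter;  // counter index NextCounter-1 belongs to S
    for (const Stmt &Child : S.Children)
      walk(Child);
  }
};
```

Comparing `NextCounter` (and optionally `Hash`) against the values stored with the profile is what lets stale per-function data be detected and ignored.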
slide-36
SLIDE 36

Example

(AST diagram: a CompoundStmt whose children include a WhileStmt (Cond: Expr, Body) and an IfStmt (Cond, Then); counters C0 through C4 are assigned as the AST is walked)

slide-37
SLIDE 37

Minimizing Overhead

  • Not every block needs a counter
  • CFG-based approach: compute a spanning tree
  • Can often do as well by following AST structure
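A minimal illustration of why not every block needs a counter: on a simple if/else diamond, measuring the entry count and one arm determines the other arm and the join by flow conservation. The `DiamondCounts` type below is a hypothetical sketch of that arithmetic, not LLVM's spanning-tree implementation.

```cpp
#include <cassert>
#include <cstdint>

// Diamond CFG:
//   entry -> then -> exit
//   entry -> else -> exit
// Only the entry and the 'then' edge carry counters; the rest follow
// from flow conservation (in-flow == out-flow at every block).
struct DiamondCounts {
  uint64_t EntryCount;  // measured: function entry counter
  uint64_t ThenCount;   // measured: counter on the 'then' edge
  // Inferred, never instrumented:
  uint64_t elseCount() const { return EntryCount - ThenCount; }
  uint64_t exitCount() const { return EntryCount; }
};
```

The same principle generalizes: counters are only needed on edges off a spanning tree of the CFG, and following the AST structure often picks an equally small set.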
slide-40
SLIDE 40

Example

(AST diagram: a CompoundStmt containing a Stmt and an IfStmt with Then and Else branches; only counters C0 and C1 are assigned, following the AST structure)
slide-41
SLIDE 41

No-Return Calls

  • Important for code coverage
  • Not an issue for PGO (we don’t have a “likely no-return” attribute)

  • A counter after every call would be expensive
  • Can we get away with ignoring this?
slide-42
SLIDE 42

Instrumentation Overhead: Compile Time

(Bar chart: percent compile-time slowdown, PGO vs. GCOV instrumentation, across SPEC2006 benchmarks from 400.perlbench to 483.xalancbmk; peak value 68%)

slide-43
SLIDE 43

Instrumentation Overhead: Execution Time

(Bar chart: percent execution-time slowdown, PGO vs. GCOV instrumentation, across SPEC2006 benchmarks from 400.perlbench to 483.xalancbmk; peak value 239%)

slide-44
SLIDE 44

PGO with External Profiling

Diego Novillo

slide-45
SLIDE 45

External Profilers

  • No changes needed to user application
  • Binary runs under control of profiler
  • binary instrumentation (valgrind, cachegrind)
  • hardware counters (perf, oprofile)
  • Profilers using HW counters → low overhead
  • Profiler saves profile results in a file
  • Used as input to analysis tools
  • Why not use it as input to the compiler?

slide-46
SLIDE 46

$ perf annotate -l
[ … ]
         :        for (int i = 0; i < N; i++) {
         :          A *= i / 32;
 /home/dnovillo/prog.cc:5
    9.18% :  400520: mov      %eax,%ecx
    0.00% :  400522: sar      $0x1f,%ecx
    0.00% :  400525: shr      $0x1b,%ecx
    0.00% :  400528: add      %eax,%ecx
    7.89% :  40052a: sar      $0x5,%ecx
    0.00% :  40052d: xorps    %xmm0,%xmm0
    0.00% :  400530: cvtsi2sd %ecx,%xmm0
    8.23% :  400534: mulsd    0x200aec(%rip),%xmm0   # 601028 <A>
   66.10% :  40053c: movsd    %xmm0,0x200ae4(%rip)   # 601028 <A>
[ … ]

GOAL: Use all the collected runtime knowledge as input to the optimizers

slide-47
SLIDE 47

Why External Profiler?

  • No need for instrumented builds
  • Simplifies build rules for user application
  • No build time overhead
slide-48
SLIDE 48

Why External Profiler?

  • Very low runtime overhead (< 1%)
  • Profiles can be collected in production environments
  • Profile data is more representative
  • Training is done on actual production loads
slide-49
SLIDE 49

Why External Profiler?

  • Allows application-specific profilers
  • e.g., game engines
  • Anything that can be converted into hints to the compiler
slide-50
SLIDE 50

User Model

(Workflow diagram)

  • Source Code → Base Optimized Binary: -O2 -gline-tables-only
  • Execute Base Optimized Binary under profiler (low overhead) → Profile
  • Source Code + Profile → Peak Optimized Binary: -O2 -fprofile-sample-use -gline-tables-only

slide-51
SLIDE 51

Design

  • Profile data often needs conversion
  • Samples are associated with processor instructions
  • External tool converts into mapping to source LOCs
  • Bad/stale/missing profiles
  • Never affect correctness
  • Only affect performance
  • Scalar pass incorporates profile into IR
  • Source locations mapped to IR instructions
  • Profile kind dictates representation
  • Optimizers query via standard analysis pass API
  • Analysis routines fall back on static heuristics

slide-52
SLIDE 52

Current Implementation

  1. Conversion tool for Linux Perf (sample-based profiles)
  2. Samples converted to branch weights
  3. Profile pass simply annotates the IR
  4. Analysis uses IR metadata for estimates
  5. Optimizers automatically adjust cost models (provided they use the Analysis API properly; work is needed in this area)
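The samples-to-branch-weights step can be sketched as below. `toBranchWeights` is a hypothetical helper, not the actual conversion code in LLVM's sample-profile pass; the only property that matters is that the weights preserve the observed ratio while fitting into 32-bit metadata values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Turn raw per-successor sample counts into small integer weights
// suitable for !"branch_weights" metadata. Weights only need to
// preserve the *ratio*, so very large counts are scaled down to fit
// in 32 bits, and zero samples become weight 1.
std::vector<uint32_t> toBranchWeights(const std::vector<uint64_t> &Samples) {
  uint64_t Max = 1;
  for (uint64_t S : Samples)
    if (S > Max) Max = S;
  const uint64_t Limit = UINT32_MAX;
  std::vector<uint32_t> Weights;
  for (uint64_t S : Samples) {
    uint64_t W = (Max > Limit) ? S / ((Max / Limit) + 1) : S;
    Weights.push_back((uint32_t)(W ? W : 1));  // weights must be >= 1
  }
  return Weights;
}
```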

slide-53
SLIDE 53

Limitations & Restrictions

  • Program behaviour must coincide with profile
  • Stale profiles degrade performance (significantly)
  • Non-representative runs mislead optimizers
  • Who do we listen to?
  • Warn the user?
  • Silently override?
  • Is the profile representative?

foo(int x) {
  if (__builtin_expect(x > 100, 1))
    hot();
  else
    cold();
}

main() {
  while (true)
    foo(rand() % 100);
}

Profile says “LIAR!”

slide-54
SLIDE 54

Limitations & Restrictions

  • HW counters → IR mapping is lossy
  • Requires good line table information
  • Many instructions on the same line of code

1 foo(int x) {
2   if (x < 100) hot(); else cold();
3 }
4
5 main() {
6   while (true) foo(rand() % 100);
7 }

Line 2 is HOT according to profile. Need to know where in the line:

  • Column numbers
  • DWARF discriminators
slide-55
SLIDE 55

Limitations & Restrictions

  • The optimizer must use profiles!
  • Notably, the inliner
slide-56
SLIDE 56

Early Results

(Performance comparison charts omitted; note: y-axis is NOT 0-based)


slide-58
SLIDE 58

Status

  • Profile conversion tool for Linux Perf Events
  • Writes flat profiles to text file
  • Working on release
  • Scalar pass works with SPEC2006
  • Produces branch weights
  • Trunk patches under review
  • In the works
  • Other function attributes (e.g. cold)
  • More efficient profile encoding (bitcode)
  • Context aware profiles
  • Other profile types
  • value profiles to disambiguate indirect calls

slide-59
SLIDE 59

So, we have some profile data... Now what?

slide-62
SLIDE 62

All profile info ends up in a common IR annotation

(Pipeline diagram: Source Code → Instrumentation / Sample Profile → IR annotation → Analysis → Code Layout, Spill Placement, Inliner?)

slide-63
SLIDE 63

(Pipeline diagram repeated: Source Code → Instrumentation / Sample Profile → IR annotation → Analysis → Code Layout, Spill Placement, Inliner?)

Passes access it through a common analysis API

slide-64
SLIDE 64

BranchProbabilityInfo

(CFG diagram: a block with predecessor pred and successors succ1, succ2, succ3)

slide-65
SLIDE 65

define void @f(i1 %a) {
entry:
  ...
  br i1 %a, label %t, label %f, !prof !0
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}
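BranchProbabilityInfo turns the `!0` weights above into probabilities by simple normalization: 64/(64+4) ≈ 94% for `%t` and 4/68 ≈ 6% for `%f`. A sketch of that normalization (`weightsToProbabilities` is an illustrative helper, not LLVM's actual API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Normalize !"branch_weights" values into per-successor probabilities.
std::vector<double> weightsToProbabilities(const std::vector<unsigned> &W) {
  double Sum = 0;
  for (unsigned X : W)
    Sum += X;
  std::vector<double> P;
  for (unsigned X : W)
    P.push_back(X / Sum);  // probability of taking this successor
  return P;
}
```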

slide-66
SLIDE 66

BranchProbabilityInfo

(CFG diagram: entry block branching to succ1, succ2, succ3, with a latch block branching back)

slide-67
SLIDE 67

define void @f(i1 %a) {
entry:
  ...
  br i1 %a, label %t, label %f, !prof !0
t:
  ...
  unreachable
f:
  ...
  br label %exit
exit:
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}

slide-68
SLIDE 68

define void @f(i1 %a) {
entry:
  ...
  br i1 %a, label %t, label %f
t:
  ...
  call coldcc void @g()
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  ret void
}
declare coldcc void @g()

slide-69
SLIDE 69

define void @f(i32 %i) {
entry:
  %a = icmp eq i32 %i, 0
  br i1 %a, label %t, label %f
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  ret void
}

slide-70
SLIDE 70

define void @f(i32 %i) {
entry:
  %a = icmp ne i32 %i, 0
  br i1 %a, label %t, label %f
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  ret void
}

slide-71
SLIDE 71

define void @f(i32 %i) {
entry:
  %a = icmp slt i32 %i, 0
  br i1 %a, label %t, label %f
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  ret void
}

slide-72
SLIDE 72

define void @f(i8* %p) {
entry:
  %a = icmp eq i8* %p, null
  br i1 %a, label %t, label %f
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  ret void
}

slide-73
SLIDE 73

BranchProbabilityInfo

(CFG diagram: entry block ending in a switch to succ1, succ2, succ3, with a latch block branching back)

slide-74
SLIDE 74

BlockFrequencyInfo

(CFG diagram: entry block ending in a switch to succ1, succ2, succ3, with a latch block branching back)
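BlockFrequencyInfo composes branch probabilities into per-block frequencies relative to the function entry. For a single loop whose latch takes the back edge with probability P, the body's frequency is the geometric series 1 + P + P² + … = 1/(1 - P). A sketch under that simplified single-loop assumption (`loopBodyFrequency` is illustrative, not LLVM's actual API):

```cpp
#include <cassert>
#include <cmath>

// Frequency of a loop body relative to function entry, given the
// probability that the latch branches back into the loop.
double loopBodyFrequency(double EntryFreq, double BackEdgeProb) {
  // Sum of the geometric series of repeated loop iterations.
  return EntryFreq / (1.0 - BackEdgeProb);
}
```

So a latch that is taken 90% of the time makes the body ten times hotter than the entry block.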

slide-75
SLIDE 75

What about MI? Everything is there too.

slide-76
SLIDE 76

Resolving Conflicts

  • Sometimes the profile will directly conflict with other information:
  • Static heuristics may be contradicted
  • Other profiles may be incompatible
  • Need to be extremely cautious when disregarding profile information, but may be necessary
  • When we have bad profiles, bounding the bad impact is both hard and important

slide-77
SLIDE 77

The hard part: cache invalidation!

  • What happens when an optimization pass transforms the CFG in a way that invalidates annotations on the IR?
  • The analyses are easy: we re-run them
  • Annotations are hard
slide-78
SLIDE 78

define void @f(i1 %a) {
entry:
  ...
  br i1 %a, label %t, label %f, !prof !0
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  %phi = phi i32 [ ..., %t ], [ ..., %f ]
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}

Before...

slide-79
SLIDE 79

define void @f(i1 %a) {
entry:
  ...
  br i1 %a, label %f, label %t, !prof !0
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  %phi = phi i32 [ ..., %t ], [ ..., %f ]
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 4, i32 64}

After...

slide-80
SLIDE 80

define void @f(i1 %a) {
entry:
  ...
  br i1 %a, label %t, label %f, !prof !0
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  %phi = phi i32 [ ..., %t ], [ ..., %f ]
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}

Before...

slide-81
SLIDE 81

define void @f(i1 %a) {
entry:
  ...
  ...
  ...
  %phi = select i1 %a, i32 ..., ...
  br i1 %a, label %t, label %f, !prof !0
t:
  br label %exit
f:
  br label %exit
exit:
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}

After...

slide-82
SLIDE 82

define void @f(i32 %a, i32 %b, i32 %c, i32 %d) {
entry:
  ...
  %x = icmp eq i32 %a, %b
  %y = icmp eq i32 %c, %d
  %xy = and i1 %x, %y
  br i1 %xy, label %t, label %f, !prof !0
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  %phi = phi i32 [ ..., %t ], [ ..., %f ]
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}

Before...

slide-83
SLIDE 83

define void @f(i32 %a, i32 %b, i32 %c, i32 %d) {
entry:
  ...
  %x = icmp eq i32 %a, %b
  br i1 %x, label %entry2, label %f, !prof !0
entry2:
  %y = icmp eq i32 %c, %d
  br i1 %y, label %t, label %f, !prof !0
t:
  ...
  br label %exit
f:
  ...
  br label %exit
exit:
  %phi = phi i32 [ ..., %t ], [ ..., %f ]
  ret void
}
!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}

After...
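One way to see why this transformation needs care: if both split branches simply reuse the original weights, the implied probability of reaching %t becomes the product of two branch probabilities rather than the original single probability of 64/68. A hypothetical illustration of that arithmetic (not LLVM's actual metadata-update logic):

```cpp
#include <cassert>
#include <cmath>

// Probability of reaching the 'taken' block after splitting one
// conditional branch into two, when both new branches carry the same
// reused weights: both must be taken, so the probabilities multiply.
double reachProbWithReusedWeights(double TakenWeight, double NotTakenWeight) {
  double P = TakenWeight / (TakenWeight + NotTakenWeight);
  return P * P;
}
```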

slide-84
SLIDE 84

Need other annotations?

  • While we believe that block frequency can and should be derived from branch weight, there are other things being profiled
  • May need module-wide call site or function definition annotation
  • May need value-based annotation for value profiling
slide-85
SLIDE 85

Profile Guided Transforms

slide-86
SLIDE 86

Spill Placement

  • RA has a collection of potential values to spill from registers onto the stack to satisfy the allocation problem
  • Which spill is chosen will cause a spill inside of different blocks
  • Can use profile information to prioritize the hot path’s in-register values

slide-87
SLIDE 87

Code Layout

  • Called MachineBlockPlacement
  • Runs at the very end of MI to lay out the code of a single function
  • Primarily layout is driven based on the topological structure of the CFG and loop nest structure
  • Ties are broken using profile information
  • Cold regions of code are extracted out-of-line
slide-88
SLIDE 88

Hot/Cold Partitioning?

  • GCC picks a partition point in the layout of the function and emits the two halves under different sections
  • The linker can then group the hot regions together, fully isolating the cold code from the hot code even at an IP level

slide-89
SLIDE 89

The Inliner

  • Today, the inliner doesn’t even know profile information exists. Oops.
  • LLVM’s inliner is also unusual: mostly focused on enabling simplifications: constant propagation, combining, etc.
  • Consequently the primary expected change is to avoid inlining into cold regions unhelpfully.

slide-90
SLIDE 90

Outlining & Merging

  • The more radical change we would like is to do function outlining for cold regions
  • This will in turn allow a significantly larger set of non-cold paths to be considered for simplifying inlining
  • Forms in essence a partial inliner by splitting it into two steps
  • Outlining in the middle-end allows merging of common cold regions (perhaps expanded via macros) by outlining them to functions and then running merge functions.

slide-91
SLIDE 91

PGO Summary

  • Strong analysis support from annotations down
  • Two parallel and complementary efforts to annotate with profile information; this is going on right now!
  • Most basic profile guided transformations in place
  • Still a lot of work to do on other transforms (inlining, etc.)
slide-92
SLIDE 92

Questions?