Compiler-assisted Performance Analysis (Adam Nemet, Apple)

slide-1
SLIDE 1

Compiler-assisted Performance Analysis

Adam Nemet Apple anemet@apple.com

slide-2
SLIDE 2

Compiler Optimization X, Y

User: Hotspot, Bottleneck

slide-3
SLIDE 3

Some Optimizations? Compiler Optimization X, Y

User: Hotspot, Bottleneck

Compiler: Hotspot, Legality, Cost Model

slide-5
SLIDE 5

Some Optimizations? Compiler Optimization X, Y

Disassemble

User: Hotspot, Bottleneck

Compiler: Hotspot, Legality, Cost Model

slide-6
SLIDE 6

Some Optimizations? Compiler Optimization X, Y

-debug-only

User: Hotspot, Bottleneck

Compiler: Hotspot, Legality, Cost Model

slide-7
SLIDE 7

Optimization Diagnostics

Some Optimizations? Compiler Optimization X, Y

User: Hotspot, Bottleneck

Compiler: Hotspot, Legality, Cost Model

slide-8
SLIDE 8

Optimization Diagnostics in LLVM

  • Supported in LLVM
  • Only a small number of passes emit them
  • -Rpass options to enable them in the compiler output

    foo.c:8:5: remark: accumulate inlined into compute_sum [-Rpass=inline]
        accumulate(arr[i], sum);
        ^

slide-9
SLIDE 9

Optimization Diagnostics in LLVM

  • Supported in LLVM
  • Only a small number of passes emit them
  • -Rpass options to enable them in the compiler output
  • For large programs, the output of -Rpass is noisy and unstructured
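
As a rough illustration of what "unstructured" means here, the sketch below turns raw -Rpass text into records that can be grouped by source location. It assumes every remark header has the shape shown on the slide (file:line:col: remark: message [-Rpass...=pass]); the regular expression, the function names and the sample text are illustrative assumptions, not part of any LLVM tool.

```python
import re
from collections import defaultdict

# Assumed shape of a remark header line, based on the example on the slide:
#   file:line:col: remark: message [-Rpass...=pass]
REMARK_RE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?P<col>\d+): remark: "
    r"(?P<message>.*?)\s*\[-(?P<flag>Rpass[\w-]*)=(?P<pass_name>[\w-]+)\]\s*$"
)


def parse_remarks(text):
    """Return one dict per remark header line; snippet and caret lines are skipped."""
    remarks = []
    for raw in text.splitlines():
        match = REMARK_RE.match(raw)
        if match:
            record = match.groupdict()
            record["line"] = int(record["line"])
            record["col"] = int(record["col"])
            remarks.append(record)
    return remarks


def group_by_location(remarks):
    """Collect remarks that refer to the same (file, line) pair."""
    grouped = defaultdict(list)
    for remark in remarks:
        grouped[(remark["file"], remark["line"])].append(remark)
    return dict(grouped)


# Sample text assembled from the two remarks quoted in this talk.
SAMPLE = """\
foo.c:8:5: remark: accumulate inlined into compute_sum[-Rpass=inline]
    accumulate(arr[i], sum);
    ^
foo.c:8:5: remark: accumulate can be inlined into compute_sum with cost=-5 (threshold=487) [-Rpass-analysis=inline]
    accumulate(arr[i], sum);
    ^
"""

if __name__ == "__main__":
    for location, items in group_by_location(parse_remarks(SAMPLE)).items():
        print(location, [item["pass_name"] for item in items])
```

Scraping text like this is fragile, which is one motivation for writing remarks to a structured data file instead.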

3

slide-10
SLIDE 10

4

slide-11
SLIDE 11

  • Remarks for hot and cold code are intermixed
  • Messages appear in no particular order
  • Messages from successful and failed optimizations are dumped together

How can we make this information accessible and actionable?

slide-12
SLIDE 12

Wish List

  • All in one place: Optimizations Dashboard
  • At a glance: See high-level interaction between optimizations for targeted low-level debugging
  • Filtering: Noise level should be minimized by focusing on the hot code
  • Integration: Display hot code and the optimizations side-by-side

slide-13
SLIDE 13

opt-viewer

6

slide-14
SLIDE 14

Approach

  • Extend existing optimization remark infrastructure
  • Add the new optimizations
  • Add ability to output remarks to a data file
  • Visualize data in HTML
  • Targeting compiler developers initially

7

slide-15
SLIDE 15

Example

9

slide-16
SLIDE 16

Work Flow

    $ clang -O3 -fsave-optimization-record -c foo.c
    $ utils/opt-viewer/opt-viewer.py foo.opt.yaml html
    $ open html/foo.c.html

11

slide-17
SLIDE 17

Successful Optimizations

  • Remarks appear inline under the referenced line
  • Name of the pass
  • Green for successful optimization
  • Further details about the optimization

slide-18
SLIDE 18

Successful Optimizations

  • Column aligned with the expression
  • HTML link to facilitate further analysis

slide-19
SLIDE 19

Successful Optimizations

  • Remarks in white are Analysis remarks
  • Optimizations can expose interesting analyses

slide-20
SLIDE 20

Missed Optimizations

15

slide-21
SLIDE 21

Missed Optimizations

Red means failed optimization

slide-22
SLIDE 22

LLVM Changes

Diagram labels: Pass pipeline (IR, Inliner, LoopVectorizer, IR); OptimizationRemarkEmitter; -Rpass-analysis=inline. Legend: old, new.

    ORE.emit(OptimizationRemarkAnalysis("inline", "CanBeInlined", Call)
             << NV("Callee", Callee) << " can be inlined into "
             << NV("Caller", Caller) << " with cost=" << NV("Cost", IC.getCost())
             << " (threshold=" << NV("Threshold", Threshold) << ")");

    foo.c:8:5: remark: accumulate can be inlined into compute_sum with cost=-5 (threshold=487) [-Rpass-analysis=inline]
        accumulate(arr[i], sum);
        ^

slide-23
SLIDE 23

LLVM Changes

Diagram labels: Pass pipeline (IR, Inliner, LoopVectorizer, IR); OptimizationRemarkEmitter; YAML. Legend: old, new.

    ORE.emit(OptimizationRemarkAnalysis("inline", "CanBeInlined", Call)
             << NV("Callee", Callee) << " can be inlined into "
             << NV("Caller", Caller) << " with cost=" << NV("Cost", IC.getCost())
             << " (threshold=" << NV("Threshold", Threshold) << ")");

-fsave-optimization-record enables source line debug info (-gline-tables-only)

slide-24
SLIDE 24

    --- !Analysis
    Pass:     inline
    Name:     CanBeInlined
    DebugLoc: { File: s.cc, Line: 8, Column: 5 }
    Function: compute_sum
    Args:
      - Callee: accumulate
        DebugLoc: { File: s.cc, Line: 1, Column: 0 }
      - String: ' can be inlined into '
      - Caller: compute_sum
        DebugLoc: { File: s.cc, Line: 5, Column: 0 }
      - String: ' with cost='
      - Cost: '-5'
      - String: ' (threshold='
      - Threshold: '487'
      - String: ')'
    ...
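
To show how the Args list relates to the text remark, here is a minimal sketch that rebuilds the message by concatenating the argument values in order. The Args are transcribed by hand from the record above rather than loaded from a YAML file, and the helper names render_message and named_values are hypothetical.

```python
# Args of the record above, transcribed by hand as (key, value) pairs.
# Assumption: a real tool would load these from the YAML file instead.
ARGS = [
    ("Callee", "accumulate"),
    ("String", " can be inlined into "),
    ("Caller", "compute_sum"),
    ("String", " with cost="),
    ("Cost", "-5"),
    ("String", " (threshold="),
    ("Threshold", "487"),
    ("String", ")"),
]


def render_message(args):
    """Concatenate all argument values in order to rebuild the remark text."""
    return "".join(value for _key, value in args)


def named_values(args):
    """Keep only the named arguments, dropping the literal String pieces."""
    return {key: value for key, value in args if key != "String"}


if __name__ == "__main__":
    print(render_message(ARGS))
    print(named_values(ARGS))
```

Keeping the named values separate from the literal strings is what lets a viewer treat Callee or Cost as data instead of re-parsing the message.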

slide-25
SLIDE 25

opt-viewer

Diagram labels: YAML; utils/opt-viewer/opt-viewer.py; index.html, foo.o.html. Legend: old, new.

slide-26
SLIDE 26

Index

24

slide-27
SLIDE 27

Index

  • Noisy: most of this code is not hot
  • Sort by hotness

slide-28
SLIDE 28

Use PGO for Hotness

Diagram labels: Pass pipeline (IR, Inliner, LoopVectorizer, IR); BlockFrequencyInfo; LazyBlockFrequencyInfo; OptimizationRemarkEmitter; YAML. Legend: old, new.

    --- !Analysis
    Pass:     inline
    Name:     CanBeInlined
    DebugLoc: { File: s.cc, Line: 8, Column: 5 }
    Function: compute_sum
    Hotness:  3
    Args:
      - Callee: accumulate
        DebugLoc: { File: s.cc, Line: 1, Column: 0 }
      - String: ' can be inlined into '
      - Caller: compute_sum
        DebugLoc: { File: s.cc, Line: 5, Column: 0 }
      - String: ' with cost='
      - Cost: '-5'
      - String: ' (threshold='
      - Threshold: '487'
      - String: ')'
    ...

slide-29
SLIDE 29

Hotness

Relative to maximum hotness, NOT total time %
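
One possible reading of "relative to maximum hotness" is sketched below: each record's Hotness is divided by the largest Hotness seen, so the hottest remark shows 100% and the percentages need not add up to 100. The sample records and the RelativeHotness field are hypothetical; opt-viewer's actual computation may differ.

```python
def relative_hotness(records):
    """Return records sorted hottest first, each annotated with its hotness
    as a percentage of the maximum hotness (not of total execution time)."""
    max_hotness = max((r.get("Hotness", 0) for r in records), default=0)
    annotated = []
    for record in records:
        hotness = record.get("Hotness", 0)
        percent = 100.0 * hotness / max_hotness if max_hotness else 0.0
        annotated.append({**record, "RelativeHotness": percent})
    return sorted(annotated, key=lambda r: r["RelativeHotness"], reverse=True)


# Hypothetical records; only the first mirrors the Hotness: 3 example in the talk.
RECORDS = [
    {"Function": "compute_sum", "Pass": "inline", "Hotness": 3},
    {"Function": "cold_init", "Pass": "inline"},  # no profile data
    {"Function": "hot_loop", "Pass": "licm", "Hotness": 12},
]

if __name__ == "__main__":
    for record in relative_hotness(RECORDS):
        print(f'{record["RelativeHotness"]:6.1f}%  {record["Function"]}')
```

Sorting on this value is enough to push cold code to the bottom of the index.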

27

slide-30
SLIDE 30

Optimizations Recorded

  • Function Inliner
  • Loop Vectorizer
  • Loop Unroller
  • LoopDataPrefetch
  • LICM
  • GVN
  • Loop Idiom
  • Loop Deletion
  • SLP Vectorizer
  • … more to follow

slide-31
SLIDE 31

Test Drive on LLVM test suite

slide-32
SLIDE 32

Improve & Evaluate

  • 1. Does the information presented in this high-level view contain sufficient detail to reconstruct what happened?
  • 2. Can we discover the interactions between optimizations?
  • 3. With the improved visibility, can we quickly find real performance opportunities?

slide-33
SLIDE 33

DhryStone (SingleSource/Benchmark) Interaction of Optimizations

31

slide-34
SLIDE 34

DhryStone

33

Inlining Context

slide-35
SLIDE 35

DhryStone

36

slide-36
SLIDE 36

DhryStone

38

slide-37
SLIDE 37

DhryStone

40

slide-38
SLIDE 38

DhryStone

42

slide-39
SLIDE 39

DhryStone

45

slide-40
SLIDE 40

DhryStone

46

slide-41
SLIDE 41

DhryStone

48

slide-42
SLIDE 42

DhryStone

50

slide-43
SLIDE 43

DhryStone: Summary

  • Quickly reconstructed what happened without low-level debugging
  • Even though it involved interaction between multiple optimizations
  • Inlining and Alias Analysis/GVN
  • Missed optimizations: extra analysis is needed to deal with false positives
  • 1. Filter trivially false positives
  • 2. Expose enough information for quick detection by the user

slide-44
SLIDE 44

Freebench/distray (MultiSource/Benchmarks) Finding Performance Opportunity

52

slide-45
SLIDE 45
slide-46
SLIDE 46
slide-47
SLIDE 47
slide-48
SLIDE 48

Not modified via LinP, maybe writes through other pointers

slide-49
SLIDE 49

Not modified via LinD, maybe writes through other pointers

slide-50
SLIDE 50
slide-51
SLIDE 51

Reads and writes don’t alias

slide-52
SLIDE 52

Reads and writes don’t alias

Loop versioning with array overlap checks?

slide-53
SLIDE 53

55

LICM-based LoopVersioning

(-enable-loop-versioning-licm)

slide-54
SLIDE 54

55

LICM-based LoopVersioning

(-enable-loop-versioning-licm)

Performance opportunity if we can improve this pass

slide-55
SLIDE 55

55

LICM-based LoopVersioning

(-enable-loop-versioning-licm)

Performance opportunity if we can improve this pass

Approximate the opportunity by manually modifying the source

slide-56
SLIDE 56
slide-57
SLIDE 57
slide-58
SLIDE 58

Dynamic Instruction Count Reduced by 11%

slide-59
SLIDE 59

Dynamic Instruction Count Reduced by 11%

Performance headroom 11%

slide-60
SLIDE 60

Freebench/distray: Summary

  • Found optimization opportunity while staying in the high-level view
  • Reconstructed the reason for missed optimization
  • High-level view exposed that the gain may be substantial
  • Got immediate feedback of the desired effect on the prototype
  • Identified the pass for low-level debugging

58

slide-61
SLIDE 61

Check Out More Examples

http://lab.llvm.org:8080/artifacts/opt-view_test-suite

59

slide-62
SLIDE 62

Development Timeline

Timeline labels: Code Author Tool; Compiler Developer Tool; Initial version on LLVM trunk; Now; New tools using Optimization Records

slide-63
SLIDE 63

Compiler Developer Tool: Status

  • Written in Python
  • Hook up new passes
  • Improve diagnostics quality for existing passes
  • Perform extra analysis for insightful messages
  • Improve UI

61

slide-64
SLIDE 64

Compiler Developer Tool: Status

  • Written in Python
  • Hook up new passes
  • Improve diagnostics quality for existing passes
  • Perform extra analysis for insightful messages
  • Improve UI

61

Request for Help

slide-65
SLIDE 65

Code Author Tool: Wishlist

  • Suggest specific actions
  • E.g. for the LICM case: if the two pointers can never point to the same object, consider using ‘restrict’
  • Add new “recommendation” analysis passes to detect opportunity and suggest:
  • Source annotation to enable off-by-default passes (aggressive loop transformations, non-temporal stores)
  • Refactoring: data transformations

slide-66
SLIDE 66

Code Author Tool: Wishlist

  • Suggest specific actions
  • E.g. for the LICM case: if the two pointers can never point to the same object, consider using ‘restrict’
  • Add new “recommendation” analysis passes to detect opportunity and suggest:
  • Source annotation to enable off-by-default passes (aggressive loop transformations, non-temporal stores)
  • Refactoring: data transformations

Request for Help

slide-67
SLIDE 67

Optimization Records: New Tools

  • llvm-opt-report
  • Performance regression analysis
  • Optimization statistics with the ability to zoom into the particular optimization
  • Bottom-up search for performance opportunities
  • See all the LICM opportunities like in Freebench/distray

63

slide-68
SLIDE 68

Optimization Records: New Tools

  • llvm-opt-report
  • Performance regression analysis
  • Optimization statistics with the ability to zoom into the particular optimization
  • Bottom-up search for performance opportunities
  • See all the LICM opportunities like in Freebench/distray

63

    SELECT benchmark, hotspot, hotness
      FROM optimizations
     WHERE pass = 'licm' AND
           type = 'missed' AND
           name = 'LoadWithLoopInvariantAddressInvalidated'
     ORDER BY hotness
slide-69
SLIDE 69

Optimization Records: New Tools

  • llvm-opt-report
  • Performance regression analysis
  • Optimization statistics with the ability to zoom into the particular optimization
  • Bottom-up search for performance opportunities
  • See all the LICM opportunities like in Freebench/distray
  • Allows finding opportunities that occur with high frequency but not in the hottest code

63

slide-70
SLIDE 70

Acknowledgement

  • Tyler Nowicki
  • John McCall
  • Hal Finkel

64

slide-71
SLIDE 71

Q&A

65

slide-72
SLIDE 72

SIBsim4 (MultiSource/Applications) Finding Performance Opportunity

66

slide-73
SLIDE 73

SIBsim4

67

slide-74
SLIDE 74

SIBsim4

68

slide-75
SLIDE 75

SIBsim4

69

slide-76
SLIDE 76

SIBsim4

70

slide-77
SLIDE 77

SIBsim4

71

slide-78
SLIDE 78

SIBsim4

72

slide-79
SLIDE 79

SIBsim4

Look at the loads

73

slide-80
SLIDE 80

SIBsim4

Look at the loads

74

slide-81
SLIDE 81

SIBsim4

75

slide-82
SLIDE 82

SIBsim4

Look at the stores

76

slide-83
SLIDE 83

SIBsim4

Look at the stores

77

slide-84
SLIDE 84

SIBsim4

Can ‘m’ and ‘n’ really alias?

78

slide-85
SLIDE 85

SIBsim4

Probably not!

exon_p_t m = mCol->e.exon[i];

79

slide-86
SLIDE 86

SIBsim4

We need to use ‘restrict’ or hoist manually

exon_p_t m = mCol->e.exon[i];

80

slide-87
SLIDE 87

SIBsim4

81

slide-88
SLIDE 88

SIBsim4

82

slide-89
SLIDE 89

SIBsim4

83

slide-90
SLIDE 90

SIBsim4

84