An example of a research compiler - Simone Campanoni (PowerPoint presentation)



SLIDE 1

An example of a research compiler

Simone Campanoni simonec@eecs.northwestern.edu

SLIDE 2

Sequential programs are not accelerating like they used to

[Chart: performance (log scale) of a sequential program running on a platform, 1992-2004; core frequency scaling ends, and a performance gap opens in the multicore era]

SLIDE 3

Single application: Not enough explicit parallelism

  • Developing parallel code is hard
  • Sequentially-designed code is still ubiquitous

Multiple applications: Only a few CPU-intensive applications running concurrently in client devices

Multicores are underutilized

SLIDE 4

Parallelizing compiler: Exploit unused cores to accelerate sequential programs

SLIDE 5

[Chart: speedups for numerical vs. non-numerical programs; non-numerical programs still need to be parallelized]

SLIDE 6

99% of time is spent in loops

Parallelize loops to parallelize a program

[Diagram: program execution time is concentrated in outermost loops]

SLIDE 7

DOALL parallelism

[Diagram: iterations 0, 1, and 2 each run work() on a different core at the same time; no cross-iteration dependences]
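The DOALL pattern can be sketched in a few lines. This is an illustrative sketch, not HELIX's code generation; `work` is a hypothetical loop body that depends only on its own iteration index.

```python
from concurrent.futures import ThreadPoolExecutor

def work(i):
    # hypothetical loop body: no cross-iteration dependences
    return i * i

def doall(n_iters, n_workers=3):
    # DOALL: every iteration is independent, so workers can execute
    # iterations 0, 1, 2, ... concurrently, in any order
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, range(n_iters)))

print(doall(6))  # [0, 1, 4, 9, 16, 25]
```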

SLIDE 8

DOACROSS parallelism

[Diagram: each iteration's sequential segments c=f(c) and d=f(d) execute in iteration order, while the work() parallel segments overlap in time across cores]
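A minimal sketch of DOACROSS-style synchronization, assuming a toy f(x) = x + 1 and the two sequential segments from the diagram. The Wait i / Signal i pairs are modeled with one condition variable per segment; this is an illustration, not the HELIX runtime.

```python
import threading

class SeqSegment:
    """One sequential segment: iteration i may enter only after
    iteration i-1 has signaled (the Wait i / Signal i pairs)."""
    def __init__(self):
        self.turn = 0
        self.cv = threading.Condition()

    def wait(self, i):
        with self.cv:
            while self.turn != i:
                self.cv.wait()

    def signal(self, i):
        with self.cv:
            self.turn = i + 1
            self.cv.notify_all()

def f(x):
    return x + 1  # toy stand-in for the slides' f

state = {"c": 0, "d": 0}       # loop-carried shared data
seg_c, seg_d = SeqSegment(), SeqSegment()

def iteration(i):
    # sequential segments execute in iteration order...
    seg_c.wait(i); state["c"] = f(state["c"]); seg_c.signal(i)
    seg_d.wait(i); state["d"] = f(state["d"]); seg_d.signal(i)
    # ...while work(), the parallel segment, may overlap across cores

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state)  # {'c': 4, 'd': 4}
```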

SLIDE 9

HELIX: DOACROSS for multicore

[Campanoni et al., CGO 2012; Campanoni et al., DAC 2012; Campanoni et al., IEEE Micro 2012]

[Diagram: iterations are distributed round-robin across cores, each running c=f(c), d=f(d), work()]

SLIDE 10

HELIX: DOACROSS for multicore

[Diagram: animation step of the round-robin schedule]

SLIDE 11

HELIX: DOACROSS for multicore

[Diagram: each sequential segment i is bracketed by Wait i and Signal i (Seq. Segment 0, Seq. Segment 1), so c=f(c) and d=f(d) run in iteration order while work() overlaps]

SLIDE 12

HELIX: DOACROSS for multicore

[Diagram: animation step]

SLIDE 13

HELIX: DOACROSS for multicore

[Diagram: animation step]

SLIDE 14

99% of time is spent in loops

Parallelize loops to parallelize a program

[Diagram: execution time split between innermost and outermost loops]

SLIDE 15

Parallelize loops to parallelize a program

[Table: innermost vs. outermost loops compared on coverage, communication, and ease of analysis; HELIX targets a middle ground]

SLIDE 16

HELIX: DOACROSS for multicore

[Chart: SPEC INT speedups on a 4-core Intel Nehalem; HELIX vs. the ICC, Microsoft Visual Studio, and conventional DOACROSS baselines]

Small Loop Parallelism: between innermost loops (ease of analysis, little coverage) and outermost loops (high coverage, heavy communication), HELIX targets the loops in between; HELIX-RC and HELIX-UP build on it

SLIDE 17

Outline

  • Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
  • HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
  • HELIX-UP: Unleash Parallelization [CGO 2015]

SLIDE 18

SLP challenge: short loop iterations

[Histogram: duration of loop iterations in clock cycles, SPEC CPU Int benchmarks]

SLIDE 19

SLP challenge: short loop iterations

[Histogram: duration of loop iterations in clock cycles, SPEC CPU Int benchmarks; 90 cycles highlighted]

SLIDE 20

SLP challenge: short loop iterations

[Histogram: iteration durations compared against the adjacent-core communication latency]

SLIDE 21

A compiler-architecture co-design to efficiently execute short iterations

Compiler

  • Identify latency-critical code in each small loop: the code that generates shared data
  • Expose that information to the architecture

Architecture: Ring Cache

  • Reduce the communication latency on the critical path

slide-22
SLIDE 22

… Load Y …

  • Iter. 1

22

Light-weight enhancement of today’s multicore architecture

Core 0 Core 1 Core 3 Core 2 DL1 DL1 DL1 DL1 Last level cache Ring node Ring node Ring node Ring node

Store X, 1 Store Y, 1

  • Iter. 0

Store Y, 1

  • Iter. 2

Store Y, 1

  • Iter. 3

Store X, 1 Load X

75 – 260 cycles!

SLIDE 23

Light-weight enhancement of today's multicore architecture

[Diagram: with the ring nodes in place, iteration 0 on Core 0 runs Store X, 1; Wait 0; Store Y, 1; Signal 0, while iteration 1 on Core 1 runs Wait 0; Load Y, receiving the shared value from the adjacent ring node]
SLIDE 24

[Chart: the ring cache achieves a 98% hit rate]

SLIDE 25

The importance of HELIX-RC

[Chart: speedups for non-numerical and numerical programs]

SLIDE 26

The importance of HELIX-RC

[Chart: animation step of the same comparison]

SLIDE 27

Outline

  • Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
  • HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
  • HELIX-UP: Unleash Parallelization [CGO 2015]

SLIDE 28

HELIX and its limitations

[Diagram: threads 0-3 forward loop-carried data from iteration to iteration]

Performance:

  • Lower than you would like
  • Inconsistent across architectures
  • Sensitive to dependence analysis accuracy

What can we do to improve it?

[Chart: 4-core speedups of 1.68, 2.77, and 2.31 on Nehalem, Bulldozer, and Haswell; speedups of 1.61 and 1.19 at 79% and 78% dependence-analysis accuracy]

SLIDE 29

Opportunity: relax program semantics

  • Some workloads tolerate output distortion
  • Output distortion is workload-dependent

SLIDE 30

Relaxing transformations remove performance bottlenecks

  • Sequential bottleneck

[Diagram: threads 1-3 each execute Inst 1-4; removing the dependence (Dep) that forms the sequential segment yields speedup]

SLIDE 31

Relaxing transformations remove performance bottlenecks

  • Sequential bottleneck
  • Communication bottleneck
  • Data locality bottleneck

SLIDE 32

Relaxing transformations remove performance bottlenecks

[Diagram: a spectrum from no relaxing transformation (no output distortion, baseline performance) through relaxing transformations 1, 2, …, k (maximum output distortion, maximum performance)]

SLIDE 33

Design space of HELIX-UP

[Diagram: a configuration assigns a relaxing transformation to each code region, e.g. transformation 3 to code region 1 and transformation 5 to code region 2, and is evaluated on performance, energy saved, and output distortion]

1) User provides output distortion limits
2) System finds the best configuration
3) Run parallelized code with that configuration
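Step 2 above can be pictured as a search over per-region transformation choices. The sketch below uses made-up profile numbers (speedup, distortion) per transformation and a simple exhaustive search; it illustrates the idea, not HELIX-UP's actual algorithm.

```python
from itertools import product

# hypothetical profile data: (speedup, distortion) per transformation,
# per code region; index 0 is "no relaxing transformation"
profile = {
    "region1": [(1.0, 0.0), (1.4, 0.02), (1.9, 0.08)],
    "region2": [(1.0, 0.0), (1.2, 0.01), (1.6, 0.05)],
}

def best_configuration(profile, max_distortion):
    regions = list(profile)
    best, best_speedup = None, 0.0
    for choice in product(*(range(len(profile[r])) for r in regions)):
        speedup, distortion = 1.0, 0.0
        for r, t in zip(regions, choice):
            s, d = profile[r][t]
            speedup *= s       # assume speedups compose multiplicatively
            distortion += d    # assume distortions add up
        if distortion <= max_distortion and speedup > best_speedup:
            best, best_speedup = dict(zip(regions, choice)), speedup
    return best, best_speedup

config, speedup = best_configuration(profile, max_distortion=0.05)
print(config)  # {'region1': 1, 'region2': 1}
```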

SLIDE 34

Pruning the design space

Empirical observation: transforming a code region affects only the loop it belongs to.
With 50 loops, 2 code regions per loop, and 2 transformations per code region:
complete space = 2^100 configurations; pruned space = 50 * 2^2 = 200.
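The pruning arithmetic, spelled out:

```python
# 50 loops x 2 code regions = 100 regions, 2 transformations each:
# the complete space is 2**100 configurations. Since a transformation
# only affects its own loop, each loop can be searched independently:
# 50 loops x 2**2 per-loop configurations = 200.
loops, regions_per_loop, transforms_per_region = 50, 2, 2

complete = transforms_per_region ** (loops * regions_per_loop)
pruned = loops * transforms_per_region ** regions_per_loop

print(pruned)  # 200
```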

How well does HELIX-UP perform?

SLIDE 35

HELIX: no relaxing transformations

[Chart: baseline HELIX speedups on a Nehalem with 6 cores, 2 threads per core]

SLIDE 36

HELIX-UP unblocks extra parallelism with small output distortions

[Chart: HELIX vs. HELIX-UP speedups on a Nehalem with 6 cores, 2 threads per core]

SLIDE 37

Performance/distortion tradeoff

[Chart: performance vs. output distortion (%) for 256.bzip2, with HELIX as the zero-distortion reference]

SLIDE 38

Run-time code tuning

  • Static HELIX-UP decides how to transform the code based on profile data averaged over inputs
  • The runtime reacts to transient bottlenecks by adjusting the code accordingly
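One way to picture the runtime's reaction to transient bottlenecks (a hypothetical policy, not the HELIX-UP implementation): escalate to a more aggressive relaxation level while iteration latency exceeds a threshold, and back off when it recovers.

```python
def tune(latencies, threshold=100, levels=3):
    """Return the relaxation level chosen for each observed iteration."""
    level = 0           # 0 = the static HELIX-UP choice
    trace = []
    for lat in latencies:
        if lat > threshold and level < levels - 1:
            level += 1  # transient bottleneck: relax more aggressively
        elif lat <= threshold and level > 0:
            level -= 1  # bottleneck passed: restore semantics
        trace.append(level)
    return trace

print(tune([50, 150, 160, 80, 60]))  # [0, 1, 2, 1, 0]
```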

SLIDE 39

Adapting code at run time unlocks more parallelism

[Chart: performance vs. output distortion (%) for 256.bzip2 with run-time tuning, HELIX shown as reference]

SLIDE 40

HELIX-UP improves more than just performance

  • Robustness to DDG inaccuracies
  • Consistent performance across platforms

SLIDE 41

Relaxing transformations make HELIX-UP robust to DDG inaccuracies

[Chart: 256.bzip2, HELIX vs. HELIX-UP; increasing DDG inaccuracies lowers HELIX performance but has no impact on HELIX-UP]

SLIDE 42

Relaxed transformations for consistent performance

[Chart: performance under increasing communication latency]

SLIDE 43

Small Loop Parallelism and HELIX

  • Parallelism hides in small loops

HELIX-RC: Architecture/Compiler Co-Design

  • Irregular programs require low latency

HELIX-UP: Unleash Parallelization

  • Tolerating distortions boosts parallelization

SLIDE 44

Thank you!
