Compiler Construction of Idempotent Regions and Applications in Architecture Design - PowerPoint PPT Presentation




SLIDE 1

Compiler Construction of Idempotent Regions and Applications in Architecture Design

Marc de Kruijf

Advisor: Karthikeyan Sankaralingam

PhD Defense 07/20/2012

SLIDE 2

Example

2

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

source code

SLIDE 3

Example

3

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

assembly code, annotated on the slide with faults, exceptions, and mis-speculations (e.g. a bad load) striking mid-execution

SLIDE 4

Example

4

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

assembly code

SLIDE 5

R0 and R1 are unmodified:

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

Example

5

assembly code

just re-execute!

convention: use checkpoints/buffers

SLIDE 6

It’s Idempotent!

6

idempoh… what…?

int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}


SLIDE 7

Thesis

7

idempotent regions ALL THE TIME

specifically…

– using compiler analysis (intra-procedural)
– transparent; no programmer intervention
– hardware/software co-design: software analysis, hardware execution

the thing that I am defending

SLIDE 8

Thesis

8

preliminary exam (11/2010)
– idempotence: concept and simple empirical analysis
– compiler: preliminary design & partial implementation
– architecture: some area and power savings…?

defense (07/2012)
– idempotence: formalization and detailed empirical analysis
– compiler: complete design, source code release*
– architecture: compelling benefits (various)

* http://research.cs.wisc.edu/vertical/iCompiler

SLIDE 9

Contributions & Findings

9

a summary

contribution areas
– idempotence: models and analysis framework
– compiler: design, implementation, and evaluation
– architecture: design and evaluation

findings
– potentially large idempotent regions exist in applications
– for compilation, larger is better
  – small regions (5-15 instructions): 10-15% overheads
  – large regions (50+ instructions): 0-2% overheads
– enables efficient exception and hardware fault recovery

SLIDE 10

10

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 11

Idempotence Models

11

idempotence: what does it mean?

DEFINITION:
(1) a region is idempotent iff its re-execution has no side-effects
(2) a region is idempotent iff it preserves its inputs

OK, but what does it mean to preserve an input?
MODEL A: An input is a variable that is live-in to the region. A region preserves an input if the input is not overwritten.
four models (next slides): A, B, C, & D

SLIDE 12

Idempotence Model A

12

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

a starting point

Live-ins: {all registers}, {all memory} \ {R1}

?? = mem[R4] … ?

SLIDE 13

Idempotence Model A

13

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

a starting point

Live-ins: {all registers}, {all memory}

SLIDE 14

?? = mem[R4] … ?

Idempotence Model A

14

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

a starting point

Live-ins: {all registers}, {all memory} \ {mem[R4]}

SLIDE 15

?? = mem[R4] … ?

Idempotence Model B

15

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

live-in but dynamically dead at time of write – OK to overwrite if control flow invariable

varying control flow assumptions

SLIDE 16

Idempotence Model C

16

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

allow final instruction to overwrite input (to include otherwise ineligible instructions)

varying sequencing assumptions

SLIDE 17

Idempotence Model D

17

R1 = R2 + R3
if R1 > 0 (true/false branch)
    mem[R4] = R1
SP = SP - 16

may be concurrently read in another thread – consider as input

varying isolation assumptions

SLIDE 18

Idempotence Models

18

an idempotence taxonomy
– sequencing axis (Model C)
– control axis (Model B)
– isolation axis (Model D)

SLIDE 19

what are the implications?

19

SLIDE 20

Empirical Analysis

20

methodology

measurement
– dynamic region size (path length), subject to axis constraints
– x86 dynamic instruction count (using PIN)

benchmarks

– SPEC 2006, PARSEC, and Parboil suites

experimental configurations

– unconstrained: ideal upper bound (Model C)
– oblivious: actual in normal compiled code (Model C)
– X-constrained: ideal upper bound constrained by axis X

SLIDE 21

Empirical Analysis

21

oblivious vs. unconstrained: average region size*
[bar chart, log scale 1-1000: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
oblivious: 5.2 / unconstrained: 160.2
*geometrically averaged across suites

SLIDE 22

Empirical Analysis

22

control axis sensitivity: average region size*
[bar chart, log scale 1-1000: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
oblivious: 5.2 / unconstrained: 160.2 / control-constrained: 40.1
*geometrically averaged across suites

SLIDE 23

Empirical Analysis

23

isolation axis sensitivity: average region size*
[bar chart, log scale 1-1000: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
oblivious: 5.2 / unconstrained: 160.2 / control-constrained: 40.1 / isolation-constrained: 27.4
*geometrically averaged across suites

SLIDE 24

Empirical Analysis

24

sequencing axis sensitivity: non-idempotent instructions*
[bar chart, 0%-2%: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
overall: 0.19%
*geometrically averaged across suites

SLIDE 25

Idempotence Models

25

a summary

a spectrum of idempotence models
– significant opportunity: region sizes of 100+ instructions possible
– 4x reduction constraining the control axis
– 1.5x reduction constraining the isolation axis

two models going forward
– architectural idempotence & contextual idempotence
– both are effectively the ideal case (Model C)
– architectural idempotence: invariable control always
– contextual idempotence: variable control w.r.t. locals

SLIDE 26

26

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 27

Compiler Design

27

choose your own adventure

ANALYZE: identify semantic clobber antidependences
PARTITION: cut semantic clobber antidependences
CODE GEN: preserve semantic idempotence

SLIDE 28

Compiler Evaluation

28

preamble

WHAT DO YOU MEAN:

PERFORMANCE OVERHEADS?

ANALYZE: identify semantic clobber antidependences
PARTITION: cut semantic clobber antidependences
CODE GEN: preserve semantic idempotence

SLIDE 29

Compiler Evaluation

29

preamble

region size vs. overhead: register pressure
  • preserve input values in registers; spill other values (if needed), or
  • spill input values to stack; allocate other values to registers
SLIDE 30

Compiler Evaluation

30

compiler implementation

– LLVM, support for both x86 and ARM

methodology

measurements
– performance overhead: dynamic instruction count (for x86, using PIN; for ARM, using gem5, just for the ISA comparison at the end)
– region size: instructions between boundaries (path length); x86 only, using PIN

benchmarks

– SPEC 2006, PARSEC, and Parboil suites

SLIDE 31

Results, Take 1/3

31

initial results – overhead
performance overhead: percentage increase in x86 dynamic instruction count, geometrically averaged across suites
[bar chart, 0%-20%: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
overall: 13.1%

SLIDE 32

Results, Take 1/3

32

analysis of trade-offs: region size vs. overhead
[curve: overhead (0%-50%) vs. region size (1 to 1,000,000 instructions); register pressure dominates at small sizes]
YOU ARE HERE: 10+ instructions (typically 10-30)

SLIDE 33

Results, Take 1/3

33

analysis of trade-offs: region size vs. overhead
[curve: register pressure falls as regions grow; detection latency costs rise]

SLIDE 34

Results, Take 2/3

34

minimizing register pressure
performance overhead
[bar chart, 0%-20%: SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]
before: 13.1% / after: 11.1%

SLIDE 35

Results, Take 2/3

35

analysis of trade-offs: region size vs. overhead
[curve: register pressure vs. detection latency and re-execution time as region size grows]

SLIDE 36

Big Regions

36

how do we get there?

Problem #1: aliasing analysis
– no flow-sensitive analysis in LLVM; really hurts loops

Problem #2: loop optimizations
– boundaries in loops are bad for everyone
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help

Problem #3: large array structures
– awareness of array access patterns can help

Problem #4: intra-procedural scope
– limited scope aggravates all effects listed above

SLIDE 37

Big Regions

37

how do we get there?

solutions can be automated
– a lot of work… what would be the gain?

ad hoc for now
– consider PARSEC and Parboil suites as a case study
– aliasing annotations
– manual loop refactoring, scalarization, etc.
– partitioning algorithm refinements (application-specific)
– inlining annotations

SLIDE 38

Results, Take 3/3

38

big regions
performance overhead
[bar chart, 0%-14%: PARSEC, Parboil, OVERALL]
before: 13.1% / after: 0.06%

SLIDE 39

Results, Take 3/3

39

50+ instructions is good enough
[curve: overhead (0%-50%) vs. region size (1 to 1,000,000 instructions); register pressure shrinks with larger regions; 50+ instructions suffices even when mis-optimized]

SLIDE 40

ISA Sensitivity

40

you might be curious: does the ISA matter?

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
(3) number of available registers

the short version

– impact of (1) & (2) not significant (+/- 2% overall), and even less significant as regions grow larger
– impact of (3): to get the same performance with idempotence, increase registers by 0% (large regions) to ~60% (small regions)

SLIDE 41

Compiler Design & Evaluation

41

a summary

design and implementation
– static analysis algorithms: modular and perform well
– code-gen algorithms: modular and perform well
– LLVM implementation, source code available*

findings
– pressure-related performance overheads range from 0% (large regions) to ~15% (small regions)
– greatest opportunity: loop-intensive applications
– ISA effects are insignificant

* http://research.cs.wisc.edu/vertical/iCompiler

SLIDE 42

42

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 43

Architecture Recovery: It’s Real

43

safety first / speed first (safety second)

SLIDE 44

44

[pipeline diagram: stages 1 Fetch, 2 Decode, 3 Execute, 4 Write-back, drawn with many sharp turns]

Architecture Recovery: It’s Real

lots of sharp turns

[pipeline diagram: 1 Fetch → 2 Decode → 3 Execute → 4 Write-back]

closer to the truth

SLIDE 45

45

[pipeline diagrams overlaid; an exception surfaces at write-back – !!!]

Architecture Recovery: It’s Real

lots of interaction

too late!

SLIDE 46

46

Architecture Recovery: It’s Real

bad stuff can happen

mis-speculation

(a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.

hardware faults

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.

exceptions

(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

SLIDE 47

47

Architecture Recovery: It’s Real

bad stuff can happen

mis-speculation

(a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.

hardware faults

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.

exceptions

(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

register pressure, detection latency, re-execution time

SLIDE 48

48

Architecture Recovery: It’s Real

bad stuff can happen

mis-speculation

(a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc.

hardware faults

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.

exceptions

(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

SLIDE 49

49

Architecture Recovery: It’s Real

bad stuff can happen

hardware faults exceptions

(d) wear-out fault, (e) particle strike, (f) voltage spike, etc.
(g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

integrated GPU, low-power CPU, high-reliability systems

SLIDE 50

GPU Exception Support

50

SLIDE 51

51

GPU Exception Support

why would we want it?

GPU/CPU integration
– unified address space: support for demand paging
– numerous secondary benefits as well…

SLIDE 52

52

GPU Exception Support

why is it hard?

the CPU solution: pipeline registers, buffers

SLIDE 53

53

GPU Exception Support

why is it hard?

CPU: 10s of registers/core
GPU: 10s of registers/thread × 32 threads/warp × 48 warps per “core” = 10,000s of registers/core

SLIDE 54

54

GPU Exception Support

idempotence on GPUs

GPUs hit the sweet spot
(1) extractably large regions (low compiler overheads)
(2) detection latencies long or hard to bound (large is good)
(3) exceptions are infrequent (low re-execution overheads)

(trade-offs: register pressure, detection latency, re-execution time)

SLIDE 55

55

GPU Exception Support

idempotence on GPUs

GPU design topics
– compiler flow
– hardware support
– exception live-lock
– bonus: fast context switching

GPUs hit the sweet spot
(1) extractably large regions (low compiler overheads)
(2) detection latencies long or hard to bound (large is good)
(3) exceptions are infrequent (low re-execution overheads)

SLIDE 56

GPU Exception Support

56

evaluation methodology

compiler
– LLVM targeting ARM

benchmarks
– Parboil GPU benchmarks for CPUs, modified

simulation
– gem5 for ARM: simple dual-issue in-order (e.g. Fermi)
– 10-cycle page fault detection latency

measurement
– performance overhead in execution cycles

SLIDE 57

GPU Exception Support

57

evaluation results
performance overhead
[bar chart, 0.0%-1.5%: cutcp, fft, histo, mri-q, sad, tpacf, gmean]
gmean: 0.54%

SLIDE 58

CPU Exception Support

58

SLIDE 59

59

CPU Exception Support

why is it a problem?

the CPU solution: pipeline registers, buffers

SLIDE 60

60

CPU Exception Support

why is it a problem?

[pipeline diagram, before: Fetch → Decode, Rename & Issue → functional units (Integer ×2, Multiply, Load/Store, Branch, IEEE FP), register file, bypass, replay queue, flush & replay control]

Before

SLIDE 61

61

CPU Exception Support

why is it a problem?

[pipeline diagram, after: Fetch → Decode & Issue → functional units (Integer ×2, Branch, Multiply, Load/Store, FP), register file]

After

SLIDE 62

62

CPU Exception Support

idempotence on CPUs

CPU design simplification
– in ARM Cortex-A8 (dual-issue in-order) can remove:
  – bypass / staging register file, replay queue
  – rename pipeline stage
  – IEEE-compliant floating point unit
  – pipeline flush for exceptions and replays
  – all associated control logic

leaner hardware
– bonus: cheap (but modest) OoO issue

SLIDE 63

CPU Exception Support

63

evaluation methodology

compiler
– LLVM targeting ARM, minimize pressure (take 2/3)

benchmarks
– SPEC 2006 & PARSEC suites (unmodified)

simulation
– gem5 for ARM: aggressive dual-issue in-order (e.g. A8)
– stall on potential in-flight exception

measurement
– performance overhead in execution cycles

SLIDE 64

CPU Exception Support

64

evaluation results
performance overhead
[bar chart, 0%-14%: SPEC INT, SPEC FP, PARSEC, OVERALL]
overall: 9.1%

SLIDE 65

Hardware Fault Tolerance

65

SLIDE 66

66

Hardware Fault Tolerance

what is the opportunity?

reliability trends
– CMOS reliability is a growing problem
– future CMOS alternatives are no better

architecture trends
– hardware power and complexity are at a premium
– desire for simple hardware + efficient recovery

application trends
– emerging workloads consist of large idempotent regions
– increasing levels of software abstraction

SLIDE 67

67

Hardware Fault Tolerance

design topics

hardware organizations
– homogeneous: idempotence everywhere
– statically heterogeneous: e.g. accelerators
– dynamically heterogeneous: adaptive cores

fault detection capability
– fine-grained in hardware (e.g. Argus, MICRO ’07), or
– fine-grained in software (e.g. instruction/region DMR)

fault model (aka ISA semantics)
– similar to pipeline-based (e.g. ROB) recovery

SLIDE 68

Hardware Fault Tolerance

68

evaluation methodology

compiler
– LLVM targeting ARM (compiled to minimize pressure)

benchmarks
– SPEC 2006, PARSEC, and Parboil suites (unmodified)

simulation
– gem5 for ARM: simple dual-issue in-order
– DMR detection; compare against checkpoint/log and TMR

measurement
– performance overhead in execution cycles

SLIDE 69

Hardware Fault Tolerance

69

evaluation results

performance overhead
[bar chart, 0%-35%: idempotence 9.1%, checkpoint/log 22.2%, TMR 29.3%]

SLIDE 70

70

Overview

❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

SLIDE 71

71

RELATED WORK CONCLUSIONS

SLIDE 72

Conclusions

72

idempotence: not good for everything
– small regions are expensive: preserving register state is difficult with limited flexibility
– large regions are cheap: preserving register state is easy with the amortization effect; preserving memory state is mostly “for free”

idempotence: synergistic with modern trends
– programmability (for GPUs)
– low power (for everyone)
– high-level software, efficient recovery (for everyone)

SLIDE 73

73

The End

SLIDE 74

74

Back-Up: Chronology

MapReduce for CELL

Time

SELSE ’09: Synergy
ISCA ’10: Relax
MICRO ’11: Idempotent Processors
PLDI ’12: Static Analysis and Compiler Design
ISCA ’12: iGPU
DSN ’10: TS model
CGO ??: Code Gen
TACO ??: Models
prelim → defense

SLIDE 75

Choose Your Own Adventure Slides

75

SLIDE 76

Idempotence Analysis

76

[operation-sequence diagram omitted from transcript]

is this idempotent? Yes

SLIDE 77

Idempotence Analysis

77

[operation-sequence diagram omitted from transcript]

how about this? No

SLIDE 78

Idempotence Analysis

78

[operation-sequence diagram omitted from transcript]

maybe this? Yes

SLIDE 79

Idempotence Analysis

79

operation sequence

dependence chain → idempotent?
– write → Yes
– read, write → No
– write, read, write → Yes

it’s all about the data dependences

SLIDE 80

Idempotence Analysis

80

operation sequence

dependence chain → idempotent?
– write, read → Yes
– read, write → No (CLOBBER ANTIDEPENDENCE)
– write, read, write → Yes

it’s all about the data dependences

clobber antidependence: an antidependence with an exposed read

SLIDE 81

Semantic Idempotence

81

two types of program state

(1) local (“pseudoregister”) state: can be renamed to remove clobber antidependences*; does not semantically constrain idempotence

(2) non-local (“memory”) state: cannot “rename” to avoid clobber antidependences; semantically constrains idempotence

semantic idempotence = no non-local clobber antidependences
preserve local state by renaming and careful allocation

SLIDE 82

Region Partitioning Algorithm

82

steps one, two, and three

Step 1: transform function – remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences – cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance – account for loops, optimize for dynamic behavior

SLIDE 83

Step 1: Transform

83

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

But we still have a problem: clobber antidependences depend on region boundaries, which depend on region identification – a circular dependence

SLIDE 84

Step 1: Transform

84

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

before: [x] = a; b = [x]; [x] = c;
after:  [x] = a; b = a; [x] = c;

non-clobber antidependences… GONE!

SLIDE 85

Step 1: Transform

85

Transformation 1: SSA for pseudoregister antidependences

not one, but two transformations

Transformation 2: Scalar replacement of memory variables

clobber antidependences depend on region boundaries, which depend on region identification

SLIDE 86

Region Partitioning Algorithm

86

steps one, two, and three

Step 1: transform function – remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences – cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance – account for loops, optimize for dynamic behavior

SLIDE 87

Step 2: Cut the CFG

87

cut, cut, cut…

construct regions by “cutting” non-local antidependences

SLIDE 88

Step 2: Cut the CFG

88

rough sketch

[curve: region size vs. overhead; sources of overhead; optimal region size?]
larger is (generally) better: large regions amortize the cost of input preservation

but where to cut…?

SLIDE 89

Step 2: Cut the CFG

89

but where to cut…?

goal: the minimum set of cuts that cuts all antidependence paths
intuition: minimum cuts → fewest regions → largest regions
approach: a series of reductions:
  minimum vertex multi-cut (NP-complete)
  → minimum hitting set among paths
  → minimum hitting set among “dominating nodes”
details omitted

SLIDE 90

Region Partitioning Algorithm

90

steps one, two, and three

Step 1: transform function – remove artificial dependences, remove non-clobbers
Step 2: construct regions around antidependences – cut all non-local antidependences in the CFG
Step 3: refine for correctness & performance – account for loops, optimize for dynamic behavior

SLIDE 91

Step 3: Loop-Related Refinements

91

loops affect correctness and performance

correctness: not all local antidependences are removed by SSA…
– loop-carried antidependences may clobber
– depends on boundary placement; handled as a post-pass

performance: loops tend to execute multiple times…
– to maximize region size, place cuts outside of the loop
– algorithm modified to prefer cuts outside of loops

details omitted

SLIDE 92

Code Generation Algorithms

92

idempotence preservation

background & concepts: live intervals, region intervals, and shadow intervals
compiling for architectural idempotence: invariable control flow upon re-execution
compiling for contextual idempotence: potentially variable control flow upon re-execution

SLIDE 93

Code Generation Algorithms

live intervals and region intervals

x = ...
... = f(x)
y = ...

93

[interval diagram: region boundaries, the region interval, and x’s live interval]

SLIDE 94

Code Generation Algorithms

shadow intervals

94

shadow interval: the interval over which a variable must not be overwritten specifically to preserve idempotence; different for architectural and contextual idempotence

SLIDE 95

Code Generation Algorithms

for contextual idempotence

x = ...
... = f(x)
y = ...

95

[interval diagram: region boundaries, x’s shadow interval, x’s live interval]

SLIDE 96

Code Generation Algorithms

for architectural idempotence

x = ...
... = f(x)
y = ...

96

[interval diagram: region boundaries, x’s shadow interval, x’s live interval]

SLIDE 97

Code Generation Algorithms

for architectural idempotence

x = ...
... = f(x)
y = ...

97

[interval diagram: region boundaries, x’s shadow interval, x’s live interval, y’s live interval]

SLIDE 98

Big Regions

98

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

CFG + SSA:
i0 = φ(0, i1)
i1 = i0 + 1
if (i1 < X)

SLIDE 99

Big Regions

99

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

machine code:
R0 = 0
R0 = R0 + 1
if (R0 < X)

SLIDE 100

Big Regions

100

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

machine code:
R0 = 0
R0 = R0 + 1
if (R0 < X)

SLIDE 101

Big Regions

101

Re: Problem #2 (cuts in loops are bad)

C code: for (i = 0; i < X; i++) { ... }

machine code:
R1 = 0
R0 = R1
R1 = R0 + 1
if (R1 < X)

– “redundant” copy
– extra boundary (pressure)

SLIDE 102

Big Regions

102

Re: Problem #3 (array access patterns)

before: [x] = a; b = [x]; [x] = c;
after:  [x] = a; b = a; [x] = c;

non-clobber antidependences… GONE!

algorithm makes this simplifying assumption: cheap for scalars, expensive for arrays

SLIDE 103

Big Regions

103

Re: Problem #3 (array access patterns)

not really practical for large arrays
– but if we don’t do it, non-clobber antidependences remain
– solution: handle potential non-clobbers in a post-pass (the same way we deal with loop clobbers in static analysis)

// initialize:
int array[100];
memset(&array, 0, 100 * 4);
// accumulate:
for (...)
    array[i] += foo(i);

SLIDE 104

Big Regions

104

Benchmark     | Problems               | Size Before | Size After
blackscholes  | ALIASING, SCOPE        | 78.9        | >10,000,000
canneal       | SCOPE                  | 35.3        | 187.3
fluidanimate  | ARRAYS, LOOPS, SCOPE   | 9.4         | >10,000,000
streamcluster | ALIASING               | 120.7       | 4,928
swaptions     | ALIASING, ARRAYS       | 10.8        | 211,000
cutcp         | LOOPS                  | 21.9        | 612.4
fft           | ALIASING               | 24.7        | 2,450
histo         | ARRAYS, SCOPE          | 4.4         | 4,640,000
mri-q         | –                      | 22,100      | 22,100
sad           | ALIASING               | 51.3        | 90,000
tpacf         | ARRAYS, SCOPE          | 30.2        | 107,000

results: sizes

SLIDE 105

Big Regions

105

Benchmark     | Problems               | Overhead Before | Overhead After
blackscholes  | ALIASING, SCOPE        | 2.93%           | 0.05%
canneal       | SCOPE                  | 5.31%           | 1.33%
fluidanimate  | ARRAYS, LOOPS, SCOPE   | 26.67%          | 0.62%
streamcluster | ALIASING               | 13.62%          | 0.00%
swaptions     | ALIASING, ARRAYS       | 17.67%          | 0.00%
cutcp         | LOOPS                  | 6.344%          | 0.01%
fft           | ALIASING               | 11.12%          | 0.00%
histo         | ARRAYS, SCOPE          | 23.53%          | 0.00%
mri-q         | –                      | 0.00%           | 0.00%
sad           | ALIASING               | 4.17%           | 0.00%
tpacf         | ARRAYS, SCOPE          | 12.36%          | 0.02%

results: overheads

SLIDE 106

Big Regions

106

problem labels

Problem #1: aliasing analysis (ALIASING)
– no flow-sensitive analysis in LLVM; really hurts loops

Problem #2: loop optimizations (LOOPS)
– boundaries in loops are bad for everyone
– loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help

Problem #3: large array structures (ARRAYS)
– awareness of array access patterns can help

Problem #4: intra-procedural scope (SCOPE)
– limited scope aggravates all effects listed above

SLIDE 107

ISA Sensitivity

107

x86-64 vs. ARMv7

percentage overhead, same configuration as take 1/3
[bar chart, 5%-20%: x86-64 vs. ARMv7 across SPEC INT, SPEC FP, PARSEC, Parboil, OVERALL]

SLIDE 108

ISA Sensitivity

108

general purpose register (GPR) sensitivity

[bar chart, 2%-14%: take 2/3 configuration with 14, 12, and 10 GPRs]
percentage overhead; ARMv7, 16 GPR baseline; data as geometric mean across SPEC INT

SLIDE 109

ISA Sensitivity

109

more registers isn’t always enough

C code:
x = 0;
if (y > 0) x = 1;
z = x + y;

machine code:
R0 = 0
if (R1 > 0) R0 = 1
R2 = R0 + R1

SLIDE 110

ISA Sensitivity

110

more registers isn’t always enough

C code:
x = 0;
if (y > 0) x = 1;
z = x + y;

machine code:
R0 = 0
if (R1 > 0)
  R3 = 1
else
  R3 = R0
R2 = R3 + R1

SLIDE 111

111

GPU Exception Support

compiler flow & hardware support

compiler: kernel source → source code compiler → IR → device code generator (partitioning, preservation) → idempotent device code

hardware: core (fetch, decode, functional units, general purpose registers, RPCs, L1, TLB) + L2 cache

SLIDE 112

112

GPU Exception Support

exception live-lock and fast context switching

bonus: fast context switching
– boundary locations are configurable at compile time
– observation 1: save/restore only live state
– observation 2: place boundaries to minimize liveness

exception live-lock
– multiple recurring exceptions can cause live-lock
– detection: save PC and compare
– recovery: single-stepped re-execution or re-compilation

SLIDE 113

113

CPU Exception Support

design simplification

idempotence enables OoO retirement
– simplifies result bypassing
– simplifies exception support for long-latency instructions
– simplifies scheduling of variable-latency instructions

OoO issue?

SLIDE 114

114

CPU Exception Support

design simplification

what about branch prediction, etc.?

high re-execution costs; live-lock issues
(trade-offs: register pressure, detection latency, re-execution time)

!!!

region placement to minimize re-execution...?

SLIDE 115

CPU Exception Support

115

minimizing branch re-execution cost
percentage overhead
[bar chart, 5%-25%: SPEC INT, SPEC FP, PARSEC, OVERALL]
take 2/3: 9.1% / cut at branch: 18.1%

SLIDE 116

116

Hardware Fault Tolerance

fault semantics

hardware fault model (fault semantics)
– side-effects are temporally contained to region execution
– side-effects are spatially contained to target resources
– control flow is legal (follows static CFG edges)

SLIDE 117

Related Work

117

on idempotence

Very Related               | Year | Domain
Sentinel Scheduling        | 1992 | Speculative memory re-ordering
Reference Idempotency      | 2006 | Reducing speculative storage
Restart Markers            | 2006 | Virtual memory in vector machines
Encore                     | 2011 | Hardware fault recovery

Somewhat Related           | Year | Domain
Multi-Instruction Retry    | 1995 | Branch and hardware fault recovery
Atomic Heap Transactions   | 1999 | Atomic memory allocation

SLIDE 118

118

Related Work

on idempotence

what’s new?

– idempotence model classification and analysis
– first work to decompose entire programs
– static analysis in terms of clobber (anti-)dependences
– static analysis and code generation algorithms
– overhead analysis: detection, pressure, re-execution
– comprehensive (and general) compiler implementation
– comprehensive compiler evaluation
– a spectrum of architecture designs & applications