Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, Christopher Batten
Computer Systems Laboratory
School of Electrical and Computer Engineering
Cornell University

47th Int’l Symp. on Microarchitecture, Dec 2014

Outline: Motivation • XLOOPS ISA • XLOOPS Compiler • XLOOPS Microarchitecture • Evaluation

Motivation (slide 2 / 31)

[Figure: Energy Efficiency (Tasks per Joule) plotted against Performance (Tasks per Second), with a diagonal "Flexibility vs. Specialization" axis. Labels on the plot: Golden Triangle, General Purpose Processor, More Flexible Accelerator, Less Flexible Accelerator, Custom ASIC.]

Loop Dependence Pattern Specialization (slide 3 / 31)

[Figure: iterations 0, 1, 2, 3, ..., n-1 of a loop, each executing inst0, inst1, inst2, inst3, ..., branch.]

Specialization approaches by the kind of loop dependence pattern they target:
Intra-Iteration: Micro-op Fusion, ASIPs, CCA
Inter-Iteration: Vector, GPU, HELIX-RC
Both: DySER, Qs-Cores, BERET

Key Challenge: Creating HW/SW abstractions that are flexible and enable performance-portable execution

Explicit Loop Specialization (XLOOPS) (slide 4 / 31)

Key Idea 1: Expose fine-grained parallelism by elegantly encoding inter-iteration loop dependence patterns in the ISA

Key Idea 2: Single-ISA heterogeneous architecture with a new execution paradigm supporting traditional, specialized, and adaptive execution
• Traditional Execution
• Specialized Execution
• Adaptive Execution

[Figure: block diagram with a GPP, a Lane Manager, Lanes, a Mem XBar, and an L1 Data Cache.]

XLOOPS Overview (slides 5-6 / 31)

1. XLOOPS ISA

    loop:
      lw        r2, 0(rA)
      lw        r3, 0(rB)
      ...
      addiu.xi  rA, 4
      addiu.xi  rB, 4
      addiu     r1, r1, 1
      xloop.uc  r1, rN, loop

2. XLOOPS Compiler

    #pragma xloops ordered
    for(i = 0; i < N; i++)
      A[i] = A[i] * A[i-K];
    ...
    #pragma xloops atomic
    for(i = 0; i < N; i++) {
      B[ A[i] ]++;
      D[ C[i] ]++;
    }

3. XLOOPS Microarchitecture

[Figure: block diagram with an OoO GPP, a Lane Manager, Lanes, a Mem XBar, and an L1 Data Cache.]

4. Evaluation

[Figure: results plot with axis ticks from 0 to 2.5.]
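The two annotated loops in the compiler panel can be written out as plain C to make their dependence patterns concrete. This is only a sequential sketch: the function names are hypothetical, the pragmas appear as comments so that an ordinary compiler accepts the code, and starting the ordered loop at i = K (rather than 0 as on the slide) is an assumption made to keep A[i-K] in bounds.

```c
#include <stddef.h>

/* Ordered pattern: iteration i reads A[i-K], which iteration i-K wrote,
 * so iterations that are K apart must take effect in program order.
 * Assumption: the loop starts at i = K so that A[i-K] stays in bounds. */
void ordered_scale(int *A, size_t N, size_t K) {
    /* #pragma xloops ordered   (annotation from the slide) */
    for (size_t i = K; i < N; i++) {
        A[i] = A[i] * A[i - K];
    }
}

/* Atomic pattern: different iterations may hit the same bin, but the
 * increments commute, so the final counts do not depend on iteration
 * order as long as each update is applied atomically. */
void atomic_histograms(const int *A, const int *C, size_t N, int *B, int *D) {
    /* #pragma xloops atomic    (annotation from the slide) */
    for (size_t i = 0; i < N; i++) {
        B[A[i]]++;
        D[C[i]]++;
    }
}
```

Calling ordered_scale on {2, 3, 5, 7, 1, 1} with K = 2 shows the chain: A[4] is computed from the already-updated A[2], which is why this loop cannot simply be treated as unordered.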

XLOOPS Instruction Set Extensions (slide 7 / 31)

XLOOP Instruction

    xloop.{d}.{c} rI, rN, L

    {d}  Data Dependence
    {c}  Control Dependence
    rI   Induction Variable
    rN   Loop Bound
    L    Loop Label

Example: xloop.uc.fb r2, r3, 0x8000 (uc: Unordered Concurrent, fb: Fixed Bound)

Cross-Iteration Instructions

    addiu.xi rX, imm
    addu.xi  rX, rT

Variables that can be computed as linear functions of the induction variable
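The phrase "linear functions of the induction variable" can be illustrated in C. The sketch below compares bumping a register once per iteration, as a sequential reading of addiu.xi rX, imm would, with the closed form rX0 + i * imm that lets any iteration derive its own value without waiting for earlier ones. Both function names are hypothetical, and unsigned 32-bit arithmetic is an assumption chosen so that wraparound is well defined.

```c
#include <stdint.h>

/* Sequential view: rX is incremented by imm once per completed iteration. */
uint32_t xi_sequential(uint32_t rx0, uint32_t imm, uint32_t iters) {
    uint32_t rx = rx0;
    for (uint32_t i = 0; i < iters; i++) {
        rx += imm;
    }
    return rx;
}

/* Closed form: the value rX holds at the start of iteration i,
 * computed directly from the induction variable. */
uint32_t xi_closed_form(uint32_t rx0, uint32_t imm, uint32_t i) {
    return rx0 + i * imm;
}
```

For a pointer that starts at 0x1000 and advances by 4 each iteration, both functions give 0x1028 for iteration 10, so the cross-iteration update does not have to serialize the loop.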

XLOOPS ISA: Unordered Concurrent (slide 8 / 31)

Element-wise Vector Multiplication

    for ( i=0; i<N; i++ )
      C[i] = A[i] * B[i];

RISC ISA

    loop:
      lw     r2, 0(rA)
      lw     r3, 0(rB)
      mul    r4, r2, r3
      sw     r4, 0(rC)
      addiu  rA, rA, 4
      addiu  rB, rB, 4
      addiu  rC, rC, 4
      addiu  r1, r1, 1
      bne    r1, rN, loop

[Figure: iterations 0, 1, 2, and 3, each executing inst0, inst1, inst2, inst3, ..., xloop.uc.]
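Because no iteration of the element-wise multiply reads a value written by another, the iterations can run in any order. The C sketch below checks this by comparing program order against a lane-by-lane order. The function names, the lane count, and the round-robin assignment of iterations to lanes are illustrative assumptions, not a description of how the XLOOPS hardware schedules work.

```c
#include <stddef.h>

/* Reference: iterations in program order, as in the RISC loop. */
void vmul_in_order(const int *A, const int *B, int *C, size_t N) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] * B[i];
    }
}

/* Same iterations, visited lane by lane: lane l handles i = l, l + lanes, ...
 * Assumption: lanes >= 1; with lanes == 0 no iteration would run. */
void vmul_by_lane(const int *A, const int *B, int *C, size_t N, size_t lanes) {
    for (size_t lane = 0; lane < lanes; lane++) {
        for (size_t i = lane; i < N; i += lanes) {
            C[i] = A[i] * B[i];
        }
    }
}
```

With four lanes and seven elements, vmul_by_lane visits the iterations as 0, 4, 1, 5, 2, 6, 3 and still produces the same C array as vmul_in_order.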
