Architectural Specialization for Inter-Iteration Loop Dependence Patterns



slide-1
SLIDE 1

Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, Christopher Batten

Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University

47th Int’l Symp. on Microarchitecture, Dec 2014

slide-2
SLIDE 2
  • Motivation •

XLOOPS ISA XLOOPS Compiler XLOOPS Microarchitecture Evaluation

[Figure: design space with axes Performance (Tasks per Second) and Energy Efficiency (Tasks per Joule); labeled point: General Purpose Processor]

Cornell University Shreesha Srinath 2 / 31

slide-3
SLIDE 3
[Figure: design space with axes Performance (Tasks per Second) and Energy Efficiency (Tasks per Joule); labels: General Purpose Processor, Golden Triangle]

slide-4
SLIDE 4
Flexibility vs. Specialization

[Figure: design space with axes Performance (Tasks per Second) and Energy Efficiency (Tasks per Joule); labeled points: General Purpose Processor, More Flexible Accelerator, Less Flexible Accelerator, Custom ASIC]

slide-8
SLIDE 8
Loop Dependence Pattern Specialization

[Figure: a loop drawn as Iteration 0, Iteration 1, Iteration 2, Iteration 3, ..., Iteration n-1, each executing inst0 inst1 inst2 inst3 ... branch]

Examples of existing specialization:

  • Intra-Iteration: Micro-op Fusion, ASIPs, CCA
  • Inter-Iteration: Vector, GPU, HELIX-RC
  • Both: DySER, Qs-Cores, BERET

Key Challenge: Creating HW/SW abstractions that are flexible and enable performance-portable execution

slide-13
SLIDE 13
Explicit Loop Specialization (XLOOPS)

Key Idea 1: Expose fine-grained parallelism by elegantly encoding inter-iteration loop dependence patterns in the ISA

Key Idea 2: Single-ISA heterogeneous architecture with a new execution paradigm supporting traditional, specialized, and adaptive execution

[Figure: block diagram with GPP, L1 Data Cache, Lanes, Lane Manager, and Mem XBar]

  • Traditional Execution
  • Specialized Execution
  • Adaptive Execution

slide-14
SLIDE 14
Talk outline:

  • 1. XLOOPS ISA

        loop:
          lw       r2, 0(rA)
          lw       r3, 0(rB)
          ...
          ...
          addiu.xi rA, 4
          addiu.xi rB, 4
          addiu    r1, r1, 1
          xloop.uc r1, rN, loop

  • 2. XLOOPS Compiler

        #pragma xloops ordered
        for(i = 0; i < N; i++)
          A[i] = A[i] * A[i-K];

        #pragma xloops atomic
        for(i = 0; i < N; i++) {
          B[ A[i] ]++;
          D[ C[i] ]++;
        }

  • 3. XLOOPS Microarchitecture

        [Figure: block diagram with OoO GPP, L1 Data Cache, Lanes, Lane Manager, and Mem XBar]

  • 4. Evaluation

        [Figure: results plot; only the axis ticks 0.5 to 2.5 survive in this transcript]

slide-15
SLIDE 15

Outline, next section: 1. XLOOPS ISA

slide-18
SLIDE 18

XLOOPS Instruction Set Extensions

XLOOP instruction:

    xloop.{d}.{c} rI, rN, L

  • {d}: data dependence
  • {c}: control dependence
  • rI: induction variable
  • rN: loop bound
  • L: loop label

Example (unordered concurrent, fixed bound):

    xloop.uc.fb r2, r3, 0x8000

Cross-iteration instructions:

    addiu.xi rX, imm
    addu.xi  rX, rT

These are used for variables that can be computed as linear functions of the induction variable.

slide-19
SLIDE 19

XLOOPS ISA: Unordered Concurrent

Element-wise vector multiplication:

    for ( i=0; i<N; i++ )
      C[i] = A[i] * B[i];

RISC ISA:

    loop:
      lw    r2, 0(rA)
      lw    r3, 0(rB)
      mul   r4, r2, r3
      sw    r4, 0(rC)
      addiu rA, rA, 4
      addiu rB, rB, 4
      addiu rC, rC, 4
      addiu r1, r1, 1
      bne   r1, rN, loop

slide-21
SLIDE 21

XLOOPS ISA: Unordered Concurrent

[Figure: Iterations 0 to 3, each executing inst0 inst1 inst2 inst3 ... xloop.uc]

XLOOPS ISA (the RISC loop with the closing bne replaced by xloop.uc):

    loop:
      lw       r2, 0(rA)
      lw       r3, 0(rB)
      mul      r4, r2, r3
      sw       r4, 0(rC)
      addiu    rA, rA, 4
      addiu    rB, rB, 4
      addiu    rC, rC, 4
      addiu    r1, r1, 1
      xloop.uc r1, rN, loop

slide-23
SLIDE 23

XLOOPS ISA: Unordered Concurrent

[Figure: Iterations 0 to 3, each executing inst0 inst1 inst2 inst3 ... xloop.uc]

Element-wise vector multiplication:

    for ( i=0; i<N; i++ )
      C[i] = A[i] * B[i];

XLOOPS ISA, with the pointer increments written as cross-iteration instructions:

    loop:
      lw       r2, 0(rA)
      lw       r3, 0(rB)
      mul      r4, r2, r3
      sw       r4, 0(rC)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu.xi rC, 4
      addiu    r1, r1, 1
      xloop.uc r1, rN, loop

  • Instructions in the loop cannot write live-in registers
  • Live-out values are stored in memory
  • Data races are possible
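
A rough sketch of why the cross-iteration instructions matter for unordered execution: since rA, rB, and rC advance by a constant each iteration, their value in iteration i is simply base + imm * i, so an iteration can be started from its index alone. The C++ below models only that property. The names `xi_value`, `vmul_iteration`, and `vmul_in_order` are hypothetical, and element indices stand in for byte addresses.

```cpp
#include <cstddef>
#include <vector>

// Value held by a register updated with `addiu.xi rX, imm` at the start
// of iteration i: a linear function of the induction variable.
// (Hypothetical helper for illustration; not part of any XLOOPS tool.)
long xi_value(long base, long imm, long i) { return base + imm * i; }

// One iteration of C[i] = A[i] * B[i], computed from the index alone.
// Indices replace byte addresses here, so the per-iteration step is 1.
void vmul_iteration(const std::vector<int>& A, const std::vector<int>& B,
                    std::vector<int>& C, long i) {
    const std::size_t idx = static_cast<std::size_t>(xi_value(0, 1, i));
    C[idx] = A[idx] * B[idx];
}

// Run all iterations in a caller-chosen order (a permutation of 0..N-1).
std::vector<int> vmul_in_order(const std::vector<int>& A,
                               const std::vector<int>& B,
                               const std::vector<long>& order) {
    std::vector<int> C(A.size(), 0);
    for (long i : order) vmul_iteration(A, B, C, i);
    return C;
}
```

Calling `vmul_in_order` with two different permutations of the same indices gives the same C, which is the property an unordered-concurrent loop relies on.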

slide-24
SLIDE 24

XLOOPS ISA: Unordered Atomic

Histogram updates:

    for ( i=0; i<N; i++ ) {
      B[A[i]]++;
      D[C[i]]++;
    }

    loop:
      lw       r6, 0(rA)
      lw       r7, 0(rB)
      addiu    r7, r7, 1
      sw       r7, 0(r6)
      addiu.xi rA, 4
      ...
      addiu    r1, r1, 1
      xloop.ua r1, rN, loop

[Figure: Iterations 0 to 3, each executing inst0 inst1 inst2 inst3 ... xloop.ua]

  • Iterations execute atomically
  • No race conditions
  • Results can be non-deterministic
  • Inspired by Transactional Memory
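
To make "no race conditions, but possibly non-deterministic" concrete, here is a small model in which each iteration is one indivisible step and only the order of the steps is free. It is a sketch of the guarantee, not of the hardware; `Result` and `run_atomic` are made-up names.

```cpp
#include <cstddef>
#include <vector>

struct Result {
    std::vector<int> hist;  // hist[b] = number of iterations that hit bin b
    int last_writer;        // index of the iteration that happened to run last
};

// Execute every iteration as a single indivisible step, in `order`
// (any permutation of 0..bins.size()-1).
Result run_atomic(const std::vector<std::size_t>& bins, std::size_t num_bins,
                  const std::vector<std::size_t>& order) {
    Result r{std::vector<int>(num_bins, 0), -1};
    for (std::size_t i : order) {
        r.hist[bins[i]] += 1;                 // read-modify-write, never interleaved
        r.last_writer = static_cast<int>(i);  // an order-dependent value
    }
    return r;
}
```

The histogram comes out the same for every order because increments commute, while `last_writer` depends on which iteration ran last.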

slide-25
SLIDE 25

XLOOPS ISA: Ordered-Through-Registers

Parallel-prefix summation:

    for ( i=0; i<N; i++ ) {
      X += A[i];
      B[i] = X;
    }

    loop:
      lw       r2, 0(rA)
      addu     rX, r2, rX
      sw       rX, 0(rB)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu    r1, r1, 1
      xloop.or r1, rN, loop

[Figure: Iterations 0 to 3, each executing inst0 inst1 inst2 inst3 ... xloop.or]

  • rX is a cross-iteration register (CIR)
  • CIRs are guaranteed to have the same value as in a serial execution
  • Inspired by Multiscalar
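
One way to picture the cross-iteration register: everything in an iteration that does not touch rX can run ahead, and only the hand-off of rX from one iteration to the next is serialized. The sketch below models that split for the prefix sum. The function name and the use of a `std::deque` as the hand-off channel are illustrative assumptions.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Prefix sum with rX modeled as a cross-iteration register (CIR).
std::vector<int> prefix_sum_cir(const std::vector<int>& A, int x0) {
    const std::size_t n = A.size();
    std::vector<int> loaded(n), B(n);

    // Part 1: the loads of A[i] need no CIR, so their order is free.
    // They are done in reverse here to stress that point.
    for (std::size_t k = n; k-- > 0;) loaded[k] = A[k];

    // Part 2: each iteration takes the CIR value left by its predecessor,
    // updates it, and passes it on, exactly as in serial order.
    std::deque<int> channel{x0};
    for (std::size_t i = 0; i < n; ++i) {
        int x = channel.front();
        channel.pop_front();
        x += loaded[i];
        B[i] = x;
        channel.push_back(x);
    }
    return B;
}
```

For A = {1, 2, 3, 4} and an initial rX of 0, the result is the serial prefix sum.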

slide-26
SLIDE 26

XLOOPS ISA: Ordered-Through-Memory

    for ( i=0; i<N; i++ )
      A[i] = A[i] * A[i-K];

    # r1 = rK
    # r3 = rA + 4*rK
    loop:
      lw       r4, 0(r3)
      lw       r5, 0(rA)
      mul      r6, r4, r5
      sw       r6, 0(r3)
      addiu.xi r3, 4
      addiu.xi rA, 4
      addiu    r1, r1, 1
      xloop.om r1, rN, loop

[Figure: Iterations 0 to 3, each executing inst0 inst1 inst2 inst3 ... xloop.om]

  • Updates to memory are defined by serial iteration order
  • No race conditions
  • Inspired by Multiscalar, TLS
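
The dependence in this loop has distance K: iteration i reads what iteration i-K wrote, so any K consecutive iterations are independent of one another. The sketch below checks that against a serial reference. It assumes K >= 1 and starts the loop at i = K so that A[i-K] stays in bounds, which matches the `# r1 = rK` setup comment in the assembly. `serial` and `windowed` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Serial reference: A[i] = A[i] * A[i-K] for i = K .. N-1.
std::vector<int> serial(std::vector<int> A, std::size_t K) {
    for (std::size_t i = K; i < A.size(); ++i) A[i] = A[i] * A[i - K];
    return A;
}

// Same loop, split into windows of K consecutive iterations, with each
// window executed backwards. Iteration i reads A[i-K], which was written
// by an earlier window, so the order inside a window cannot matter.
// Assumes K >= 1.
std::vector<int> windowed(std::vector<int> A, std::size_t K) {
    for (std::size_t lo = K; lo < A.size(); lo += K) {
        const std::size_t hi = std::min(lo + K, A.size());
        for (std::size_t i = hi; i-- > lo;) A[i] = A[i] * A[i - K];
    }
    return A;
}
```

Hardware that runs iterations speculatively has to preserve the ordering between windows; inside a window there is nothing to violate.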

slide-28
SLIDE 28

XLOOPS ISA: Dynamic Bound

[Figure: structure with nodes numbered 1 to 7]

  • Recursive traversal
  • Parallelize across frontier using xloop.uc

slide-29
SLIDE 29

XLOOPS ISA: Dynamic Bound

[Figure: structure with nodes numbered 1 to 7; Iterations 0 to 7, each executing inst0 inst1 inst2 inst3 ... xloop.uc.db]

Parallelize using xloop.uc.db:

    for ( i=0; i<N; i++ ) {
      ...
      if ( cond )
        N++;
    }
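
A concrete instance of a loop whose bound grows while it runs is a worklist traversal: visiting a node may append more nodes, which raises N. The sketch below writes the slide's `if ( cond ) N++` shape for a small tree. The tree encoding (`children[v]`) and the function name are assumptions made for the example, and the loop is shown serially.

```cpp
#include <cstddef>
#include <vector>

// Visit every node reachable from `root`, using a worklist whose length
// is the dynamic loop bound. children[v] lists the children of node v.
std::vector<int> traverse(const std::vector<std::vector<int>>& children,
                          int root) {
    std::vector<int> work{root};
    std::size_t N = 1;  // loop bound, updated inside the loop body
    for (std::size_t i = 0; i < N; i++) {
        const int v = work[i];  // copy before the worklist may reallocate
        for (int c : children[static_cast<std::size_t>(v)]) {
            work.push_back(c);  // new work was discovered ...
            N++;                // ... so the bound grows
        }
    }
    return work;
}
```

For a complete binary tree of seven nodes numbered 0 to 6, the worklist ends up holding all seven nodes in breadth-first order.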

slide-30
SLIDE 30

Outline, next section: 2. XLOOPS Compiler

slide-31
SLIDE 31

XLOOPS Compiler

Kernel implementing the Floyd-Warshall shortest path algorithm:

    for ( int k = 0; k < n; k++ )
      #pragma xloops ordered
      for ( int i = 0; i < n; i++ )
        #pragma xloops unordered
        for ( int j = 0; j < n; j++ )
          path[i][j] = min( path[i][j], path[i][k] + path[k][j] );
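
For readers who want to run the kernel, here is a self-contained version that builds with an ordinary C++ compiler. The `xloops` annotations are kept as comments, since the pragma is only meaningful to the XLOOPS compiler. The `Matrix` alias, the `INF` sentinel, and the function name are choices made for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// "No edge" marker, small enough that INF + INF does not overflow int.
constexpr int INF = 1000000;

// All-pairs shortest paths, updated in place.
void floyd_warshall(Matrix& path) {
    const std::size_t n = path.size();
    for (std::size_t k = 0; k < n; k++)
        // slide annotation: #pragma xloops ordered
        for (std::size_t i = 0; i < n; i++)
            // slide annotation: #pragma xloops unordered
            for (std::size_t j = 0; j < n; j++)
                path[i][j] = std::min(path[i][j], path[i][k] + path[k][j]);
}
```

On a three-node cycle with edges 0→1 (3), 1→2 (1), and 2→0 (2), the call fills in the indirect distances, for example 4 for 0→2.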

slide-36
SLIDE 36

XLOOPS Compiler

[Figure: compilation flow from C++ through mid-level optimization passes and code generation to an xloops binary; passes shown: modified LSR pass, XLOOPS data-dependence analysis pass, XLOOPS control-dependence analysis pass]

  • Programmer annotations
      • unordered: no data-dependences
      • ordered: preserve data-dependences
      • atomic: atomic memory updates
  • Loop strength reduction (LSR) pass encodes MIVs as xi instructions
  • XLOOPS data-dependence analysis pass
      • Register-dependence: analysing use-definition chains through PHI nodes
      • Memory-dependence: well-known dependence analysis techniques
  • Detect updates to the loop bound to encode the dynamic-bound control-dependence pattern
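
The bullets above suggest a decision procedure: the annotation and the dependence analyses pick the data-dependence suffix, and a detected bound update adds the dynamic-bound suffix. The sketch below is one plausible reading of that procedure, not the compiler's actual logic. `LoopFacts` and `select_xloop` are invented names, the fixed-bound suffix is left implicit as in most of the assembly examples, and the fallback for an ordered loop with no detected dependence is an assumption.

```cpp
#include <string>

enum class Annotation { Unordered, Ordered, Atomic };

// Facts a compiler would need about one annotated loop (hypothetical).
struct LoopFacts {
    Annotation annotation;
    bool register_dep;   // cross-iteration dependence through a register
    bool memory_dep;     // cross-iteration dependence through memory
    bool bound_updated;  // loop body may change the loop bound
};

// Pick an xloop mnemonic using the suffixes that appear in the slides:
// uc, ua, or, om, orm, and db.
std::string select_xloop(const LoopFacts& f) {
    std::string data;
    switch (f.annotation) {
        case Annotation::Unordered: data = "uc"; break;
        case Annotation::Atomic:    data = "ua"; break;
        case Annotation::Ordered:
            if (f.register_dep && f.memory_dep) data = "orm";
            else if (f.register_dep)            data = "or";
            else if (f.memory_dep)              data = "om";
            else data = "uc";  // assumption: nothing needs ordering
            break;
    }
    return "xloop." + data + (f.bound_updated ? ".db" : "");
}
```

For example, an ordered loop with only a register dependence maps to `xloop.or`, and the same loop with a memory dependence and a growing bound maps to `xloop.orm.db`.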

slide-37
SLIDE 37

Outline, next section: 3. XLOOPS Microarchitecture

slide-39
SLIDE 39

Traditional Execution

[Figure: GPP with GPR RF (32 × 32b, 2r2w), SLFU, LLFU, L1 I$ (16 KB), L1 D$ (16 KB), D$ request/response crossbar, and L2 request and response crossbars]

Minimal changes to a general-purpose processor (GPP):

  • xloop → bne
  • addiu.xi → addiu
  • addu.xi → addu

Efficient traditional execution:

  • Enables gradual adoption
  • Enables adaptive execution to migrate an xloop instruction
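
The three-line mapping above is small enough to write down directly. The function below is an illustrative sketch with a made-up name. It treats every `xloop` variant as a `bne`, on the assumption that the slide's `xloop → bne` covers all suffixes. It maps opcodes only; the operand rewrite (for example `addiu.xi rA, 4` becoming `addiu rA, rA, 4`, as in the RISC version of the loop) is left out.

```cpp
#include <string>

// Opcode a traditional GPP executes for an XLOOPS opcode, following the
// mapping on the slide. Ordinary instructions pass through unchanged.
std::string traditional_opcode(const std::string& op) {
    if (op.rfind("xloop", 0) == 0) return "bne";  // prefix match: any xloop.{d}.{c}
    if (op == "addiu.xi") return "addiu";
    if (op == "addu.xi") return "addu";
    return op;
}
```

Because the fallback is an ordinary branch and two ordinary adds, an XLOOPS binary still runs on a core with no specialized lanes.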

slide-44
SLIDE 44

Specialized Execution – xloop.uc

[Figure: the GPP datapath (GPR RF 32 × 32b 2r2w, SLFU, LLFU, L1 I$ 16 KB, L1 D$ 16 KB, crossbars) extended with lanes; each lane shows a Lane RF (24 × 32b, 2r2w), an Inst Buf (128×), an SLFU, and an IDQ, together with the Lane Management Unit]

Loop Pattern Specialization Unit:

  • Lane Management Unit (LMU)
  • Four decoupled in-order lanes
  • Lanes contain instruction buffers and index queues
  • Lanes and the GPP arbitrate for the data-memory port and the long-latency functional unit

Specialized execution:

  • Scan phase
  • Specialized execution phase
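
To give a feel for what decoupled lanes buy, the sketch below hands out iteration indices to whichever lane frees up first. The greedy policy, the per-iteration latencies, and the name `dispatch` are all assumptions for illustration. The slides only state that an LMU feeds four decoupled in-order lanes through index queues.

```cpp
#include <cstddef>
#include <vector>

// Assign iterations 0..latency.size()-1 to lanes. latency[i] is a
// made-up cycle count for iteration i. The next iteration goes to the
// lane that becomes idle first; ties go to the lower lane number.
// Requires num_lanes >= 1. Returns lane_of[i] for every iteration i.
std::vector<std::size_t> dispatch(const std::vector<int>& latency,
                                  std::size_t num_lanes) {
    std::vector<long> free_at(num_lanes, 0);  // cycle at which each lane is idle
    std::vector<std::size_t> lane_of(latency.size());
    for (std::size_t i = 0; i < latency.size(); ++i) {
        std::size_t best = 0;
        for (std::size_t l = 1; l < num_lanes; ++l)
            if (free_at[l] < free_at[best]) best = l;
        lane_of[i] = best;
        free_at[best] += latency[i];
    }
    return lane_of;
}
```

With equal latencies the assignment is round-robin. With one slow iteration, the other lane keeps taking new iterations instead of waiting, which is the point of decoupling.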

slide-48
SLIDE 48

[Figure: xloop.uc execution timeline with columns GPP, LMU, Lane0, Lane1, and LLFU, for the loop below]

    loop:
      lw       r2, 0(rA)
      lw       r3, 0(rB)
      mul      r4, r2, r3
      sw       r4, 0(rC)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu.xi rC, 4
      addiu    r1, r1, 1
      xloop.uc r1, rN, loop

  • Scan Phase: starting from the xloop, each instruction of the loop body (lw, lw, mul, sw, addiu.xi, addiu.xi, addiu.xi, addiu, xloop) has a rename step followed by a write step for each lane
  • Specialized Execution Phase: Iteration 0 and Iteration 1 are dispatched, and each lane executes lw, lw, mul
  • Sharing LLFU: the mul instructions are each followed by an X mark

slide-49
SLIDE 49

[Figure: the same xloop.uc timeline (columns GPP, LMU, Lane0, Lane1, LLFU), continued: after the mul, each lane executes sw, addiu.xi, addiu.xi, addiu.xi, addiu, xloop; callout: Specialized logic]

slide-50
SLIDE 50

[Figure: the same xloop.uc timeline (columns GPP, LMU, Lane0, Lane1, LLFU), continued: Iteration 2 and Iteration 3 are dispatched next, each starting with lw]

slide-52
SLIDE 52

Specialized Execution – xloop.or

[Figure: the lane datapath (Lane RF 24 × 32b 2r2w, Inst Buf 128×, SLFU, IDQ per lane, Lane Management Unit) with a CIB (8×) added to each lane]

  • Cross-iteration buffers (CIBs) forward register-dependences
  • More details in the paper!

slide-54
SLIDE 54

Specialized Execution – xloop.om

[Figure: the lane datapath (Lane RF 24 × 32b 2r2w, Inst Buf 128×, SLFU, IDQ, CIB 8× per lane, Lane Management Unit) with an LSQ (16×) added to each lane]

  • LSQ to support hardware memory disambiguation
  • LMU control logic
      • Track non-speculative vs. speculative lanes
      • Promote lanes to be non-speculative
  • Lane control logic
      • Handle structural hazards
      • Handle dependence violations
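
The bookkeeping implied by these bullets can be sketched in software: a speculative lane remembers which addresses it has loaded and keeps its stores buffered; when an older iteration's store is broadcast, a match against the loaded set is a dependence violation and the lane starts over. The `SpecLane` type below is a hypothetical model of that idea, not the LSQ design from the paper. Forwarding from the lane's own buffered stores is an added assumption.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// State a speculative lane keeps for one in-flight iteration (model only).
struct SpecLane {
    std::set<std::size_t> loaded;                     // addresses read from memory
    std::vector<std::pair<std::size_t, int>> stores;  // buffered (address, value)

    int load(const std::vector<int>& mem, std::size_t addr) {
        // Assumption: prefer the lane's own newest buffered store.
        for (auto it = stores.rbegin(); it != stores.rend(); ++it)
            if (it->first == addr) return it->second;
        loaded.insert(addr);
        return mem[addr];
    }

    void store(std::size_t addr, int value) { stores.emplace_back(addr, value); }

    // An older iteration stored to `addr`: did this lane read it too early?
    bool violates(std::size_t addr) const { return loaded.count(addr) != 0; }

    // Drop all speculative state so the iteration can be re-executed.
    void squash() { loaded.clear(); stores.clear(); }

    // The lane became non-speculative: make its stores visible, in order.
    void commit(std::vector<int>& mem) {
        for (const auto& s : stores) mem[s.first] = s.second;
        squash();
    }
};
```

A typical sequence is load, buffered store, a broadcast that hits the loaded set, squash, re-execute with the updated memory, and finally commit.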

slide-57
SLIDE 57

[Figure: xloop.om execution timeline with columns GPP, LMU, Lane0, Lane1, and LLFU, for the loop below]

    loop:
      lw       r4, 0(r3)
      lw       r5, 0(rA)
      ...
      ...
      sw       r6, 0(r7)
      addiu    r1, r1, 1
      xloop.om r1, rN, loop

  • Scan Phase: the loop instructions (lw, lw, ..., sw, xloop) have rename and write steps
  • Specialized Execution Phase: Iteration 0 and Iteration 1 are dispatched and begin with lw, lw; labels: Non-Speculative Lane, Speculative Lane

slide-58
SLIDE 58

[Figure: the same xloop.om timeline (columns GPP, LMU, Lane0, Lane1, LLFU), continued: Iteration 0 runs lw, lw, ..., sw, xloop while Iteration 1 runs lw, lw and then a check step; callout: Broadcast Store Dependence Check]

slide-59
SLIDE 59

[Figure: the same xloop.om timeline (columns GPP, LMU, Lane0, Lane1, LLFU), continued: after the check, Iteration 1 executes lw, lw again and Iteration 2 is dispatched; labels: Non-Speculative Lane, Speculative Lane, Wasted work, X]

slide-60
SLIDE 60

[Figure: xloop.om execution timeline, final build. Columns: GPP, LMU, Lane0, Lane1, LLFU; vertical axis: Time. After the failed check (X), iteration 1 re-executes through lw, lw, ..., sw, xloop. Iteration 2 executes lw, lw, ..., sw, followed by further check and dispatch events, and iteration 3 is dispatched. Labels on this build: "Useful work" and "Buffered Stores". The listing beside the timeline is the same xloop.om loop as on the previous slide.]
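The broadcast store dependence check in the timeline can be illustrated with a small software model. This is a minimal sketch, not the LPSU hardware: it assumes the speculative lane records the address of every load it issues, and that a broadcast store which matches one of those addresses forces the speculative iteration to be squashed and restarted. The names `spec_lane_t`, `record_load`, `store_conflicts`, and `squash` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOADS 16  /* illustrative capacity, not a figure from the slides */

/* Hypothetical record of the addresses a speculative iteration has loaded. */
typedef struct {
    uint32_t load_addr[MAX_LOADS];
    size_t   n_loads;
} spec_lane_t;

/* Remember a load issued by the speculative iteration. */
static void record_load(spec_lane_t *lane, uint32_t addr)
{
    assert(lane->n_loads < MAX_LOADS);
    lane->load_addr[lane->n_loads++] = addr;
}

/* Compare a broadcast store address against the recorded loads.
 * Returns true when the speculative iteration may have read stale data. */
static bool store_conflicts(const spec_lane_t *lane, uint32_t store_addr)
{
    for (size_t i = 0; i < lane->n_loads; i++) {
        if (lane->load_addr[i] == store_addr)
            return true;
    }
    return false;
}

/* Discard the recorded loads so the iteration can restart from its first load. */
static void squash(spec_lane_t *lane)
{
    lane->n_loads = 0;
}
```

In this model, a `true` result from `store_conflicts` corresponds to the X in the figure: the loads issued so far are the wasted work, and the iteration runs again after `squash`.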

slide-61
SLIDE 61

Supporting other patterns

[Figure: microarchitecture block diagram. GPP with GPR RF (32 × 32b, 2r2w), L1 I$ (16 KB), L1 D$ (16 KB), LLFU, D$ request/response crossbar, and L2 request and response crossbars. Lane Management Unit with IDQs and DBN. Each lane has a lane RF (24 × 32b, 2r2w), an instruction buffer (128×), an SLFU, a CIB (8×), and an LSQ (16×).]

  • xloop.ua – Using xloop.om mechanisms
  • xloop.orm – Combine xloop.or and xloop.om mechanisms
  • xloop.*.db
      • Lanes communicate updates to loop bound
      • LMU tracks maximum bound and generates additional work
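A worklist loop is a typical case where the loop bound changes while the loop is running, which is the situation the xloop.*.db bullets describe. The sketch below is a plain sequential C breadth-first search (one of the kernels the talk lists under xloop.uc.db). It only illustrates the source-level pattern; the function name `bfs`, the adjacency-matrix representation, and `MAX_NODES` are assumptions made for the example, and nothing here models how lanes or the LMU exchange bound updates.

```c
#include <assert.h>

#define MAX_NODES 8  /* illustrative limit for the example */

/* Breadth-first search over an adjacency matrix.
 * Fills dist[] with hop counts from src (-1 if unreachable)
 * and returns the number of vertices visited. */
static int bfs(int n, int adj[MAX_NODES][MAX_NODES], int src, int dist[MAX_NODES])
{
    int worklist[MAX_NODES];
    int bound = 0;

    assert(n > 0 && n <= MAX_NODES && src >= 0 && src < n);
    for (int v = 0; v < n; v++)
        dist[v] = -1;

    dist[src] = 0;
    worklist[bound++] = src;

    /* The loop bound grows inside the loop body: each newly discovered
     * vertex appends one more iteration of work. */
    for (int i = 0; i < bound; i++) {
        int u = worklist[i];
        for (int v = 0; v < n; v++) {
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                worklist[bound++] = v;
            }
        }
    }
    return bound;
}
```

Each vertex is appended at most once, so `bound` never exceeds `n`. In the slide's terms, the appends are the updates to the loop bound, and the extra iterations they create are the additional work.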

slide-62
SLIDE 62

Adaptive Execution

[Figure: loop iterations 0 through n-1, each drawn as the instruction sequence inst0, inst1, inst2, inst3, ..., branch.]

  • Significant intra-iteration and limited inter-iteration parallelism
  • Specialized execution not beneficial using simple in-order lanes

slide-63
SLIDE 63

Adaptive Execution

[Figure: loop iterations 0 through n-1, each drawn as the instruction sequence inst0, inst1, inst2, inst3, ..., branch, next to a block diagram with OoO GPP, L1 data cache, lanes, lane manager, and memory crossbar.]

  • Significant intra-iteration and limited inter-iteration parallelism
  • Specialized execution not beneficial using simple in-order lanes
  • Adaptively migrate to complex OoO cores

slide-64
SLIDE 64

[Figure: adaptive execution timeline. Columns: GPP, LMU, Lane0, Lane1, LLFU; vertical axis: Time. Block diagram alongside: OoO GPP, L1 data cache, lanes, lane manager, memory crossbar. Phase shown: GPP Profiling.]

slide-65
SLIDE 65

[Figure: adaptive execution timeline, next build. Columns: GPP, LMU, Lane0, Lane1, LLFU; vertical axis: Time. Phases shown: GPP Profiling, then LPSU Profiling.]

slide-66
SLIDE 66

[Figure: adaptive execution timeline, final build. Columns: GPP, LMU, Lane0, Lane1, LLFU; vertical axis: Time. Phases shown: GPP Profiling, then LPSU Profiling, then Traditional Execution.]
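The slides show the order of the phases (GPP profiling, LPSU profiling, then a choice of execution mode) but not the decision rule itself. The sketch below assumes one plausible rule: measure cycles and completed iterations in each profiling phase and keep whichever engine needs fewer cycles per iteration. The types `profile_t` and `engine_t` and the function `choose_engine` are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { RUN_ON_GPP, RUN_ON_LPSU } engine_t;

/* Hypothetical measurements gathered during one profiling phase. */
typedef struct {
    uint64_t cycles;      /* cycles spent in the phase */
    uint64_t iterations;  /* loop iterations completed in the phase */
} profile_t;

/* Pick the engine with fewer cycles per iteration.
 * The comparison lpsu.cycles / lpsu.iterations < gpp.cycles / gpp.iterations
 * is cross-multiplied to stay in integer arithmetic. */
static engine_t choose_engine(profile_t gpp, profile_t lpsu)
{
    assert(gpp.iterations > 0 && lpsu.iterations > 0);
    if (lpsu.cycles * gpp.iterations < gpp.cycles * lpsu.iterations)
        return RUN_ON_LPSU;
    return RUN_ON_GPP;  /* ties stay on the GPP (traditional execution) */
}
```

With this rule, the timeline in the figure corresponds to the case where the LPSU profile is no better than the GPP profile, so the loop finishes with traditional execution on the GPP.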

slide-67
SLIDE 67

Motivation XLOOPS ISA XLOOPS Compiler XLOOPS Microarchitecture • Evaluation •

  • 1. XLOOPS ISA

    loop:
        lw    r2, 0(rA)
        lw    r3, 0(rB)
        ...
        ...
        addiu.xi rA, 4
        addiu.xi rB, 4
        addiu r1, r1, 1
        xloop.uc r1, rN, loop

  • 2. XLOOPS Compiler

    #pragma xloops ordered
    for (i = 0; i < N; i++)
        A[i] = A[i] * A[i-K];

    #pragma xloops atomic
    for (i = 0; i < N; i++) {
        B[ A[i] ]++;
        D[ C[i] ]++;
    }

  • 3. XLOOPS Microarchitecture

    [Figure: block diagram with OoO GPP, L1 data cache, lanes, lane manager, and memory crossbar.]

  • 4. Evaluation

    [Figure: results plot thumbnail.]
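The two annotated loops from the compiler summary can be written as ordinary C functions to make their dependence patterns concrete. This is a sequential reference sketch: a standard C compiler does not know the `xloops` pragmas and ignores them (possibly with a warning), so the functions simply compute what the annotated loops compute. The function names and parameters are hypothetical, and starting the ordered loop at `i = k` is an assumption made here so that `A[i - k]` stays in bounds.

```c
#include <assert.h>

/* Ordered pattern: iteration i reads the value that iteration i-k wrote,
 * so iterations must observe each other's results in loop order. */
static void scale_by_lagged(int *A, int n, int k)
{
    assert(k > 0);
    #pragma xloops ordered
    for (int i = k; i < n; i++)
        A[i] = A[i] * A[i - k];
}

/* Atomic pattern: different iterations may increment the same bin,
 * so each update has to be applied atomically, in any order. */
static void count_bins(const int *A, const int *C, int n, int *B, int *D)
{
    #pragma xloops atomic
    for (int i = 0; i < n; i++) {
        B[A[i]]++;
        D[C[i]]++;
    }
}
```

In `scale_by_lagged`, the distance `k` determines how far apart dependent iterations are. In `count_bins`, conflicts depend entirely on the index values in `A` and `C`, which are only known at run time.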

slide-68
SLIDE 68

Application Kernels

xloop.uc
  • Color space conversion
  • Dense matrix-multiply
  • String search algorithm
  • Symmetric matrix-multiply
  • Viterbi decoding algorithm
  • Floyd-Warshall shortest path

xloop.or
  • ADPCM decoder
  • Covariance computation
  • Floyd-Steinberg dithering
  • K-Means clustering
  • SHA-1 encryption kernel
  • Symmetric matrix-multiply

xloop.om
  • Dynamic-programming
  • K-Nearest neighbors
  • Knapsack kernel
  • Floyd-Warshall shortest path

xloop.orm, xloop.ua
  • Greedy maximal-matching
  • 2D Stencil computation
  • Binary tree construction
  • Heap-sort computation
  • Huffman entropy coding
  • Radix sort algorithm

xloop.uc.db
  • Breadth-first search
  • Quick-sort algorithm

25 Kernels from MiBench, PolyBench, PBBS, and Custom

slide-69
SLIDE 69

Cycle-Level Methodology

  • LLVM-3.1 based compiler framework
  • gem5 – in-order and out-of-order processors
  • PyMTL – LPSU models
  • McPAT-1.0 – 45nm energy models

slide-70
SLIDE 70

Energy-Efficiency vs. Performance Results

[Figure: three panels plotting Normalized Energy Efficiency against Normalized Performance. Panels: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]

  • Competitive energy efficiency
  • Higher dynamic power
  • Always higher performance

slide-71
SLIDE 71

Energy-Efficiency vs. Performance Results

[Figure: three panels plotting Normalized Energy Efficiency against Normalized Performance. Panels: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]

  • Always more energy efficient
  • Mixed dynamic power
  • Competitive or higher performance (uc/or/om/ua/db)

slide-72
SLIDE 72

Energy-Efficiency vs. Performance Results

[Figure: three panels plotting Normalized Energy Efficiency against Normalized Performance. Panels: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]

  • Always more energy efficient
  • Always lower dynamic power
  • Mixed performance (uc/om/ua/db)
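The power bullets on these result slides follow from the two plotted axes. The talk defines performance as tasks per second and energy efficiency as tasks per joule, so for a fixed task the normalized values are ratios of baseline to new time and energy. Assuming average power is energy divided by time, the power ratio is the performance ratio divided by the efficiency ratio. The sketch below works through that arithmetic; the function names are hypothetical, and it treats total energy as a stand-in for the dynamic energy the slides refer to.

```c
#include <assert.h>

/* Performance is tasks per second, so for one task the
 * normalized performance is baseline time over new time. */
static double norm_performance(double t_base, double t_new)
{
    return t_base / t_new;
}

/* Energy efficiency is tasks per joule, so for one task the
 * normalized efficiency is baseline energy over new energy. */
static double norm_energy_efficiency(double e_base, double e_new)
{
    return e_base / e_new;
}

/* Average power is E / T, so
 * P_new / P_base = (E_new / E_base) * (T_base / T_new)
 *                = norm_perf / norm_eff. */
static double norm_power(double norm_perf, double norm_eff)
{
    return norm_perf / norm_eff;
}
```

A point with equal normalized performance and normalized efficiency draws the same power as the baseline. A point whose speedup exceeds its efficiency gain draws more power, and a point whose efficiency gain exceeds its speedup draws less.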

slide-73
SLIDE 73

Energy-Efficiency vs. Performance Results

[Figure: three panels plotting Normalized Energy Efficiency against Normalized Performance. Panels: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]

  • Trade energy efficiency for performance for slower kernels
  • Profiling and migration cause minimal performance degradation

slide-74
SLIDE 74

Energy-Efficiency vs. Performance Results

[Figure: three panels plotting Normalized Energy Efficiency against Normalized Performance. Panels: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]

More results in the paper!

slide-75
SLIDE 75

VLSI Implementation

[Figure: annotated chip layout. Labeled blocks: DCache 16KB SRAM for cache lines, DCache tags, ICache 16KB SRAM for cache lines, ICache tags, four L0 instruction buffers, Loop Pattern Specialization Unit, scalar processor, 32b IEEE floating-point unit, 32b integer mul/div unit.]

  • TSMC 40 nm standard-cell-based implementation
  • RISC scalar processor with 4-lane LPSU
  • Supports xloop.uc
  • ≈40% extra area compared to simple RISC processor

slide-76
SLIDE 76

[Figure: recap of the xloop.uc assembly loop, the #pragma xloops ordered and #pragma xloops atomic source loops, and the block diagram with OoO GPP, L1 data cache, lanes, lane manager, and memory crossbar.]

Take-Away Points

  • Elegant new abstraction that enables performance-portable execution of loops
  • A single-ISA heterogeneous architecture with a new execution paradigm
      • Traditional Execution
      • Specialized Execution
      • Adaptive Execution

This work was supported in part by the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), and donations from Intel Corporation, Synopsys, Inc., and Xilinx, Inc.
