Architectural Specialization for Inter-Iteration Loop Dependence Patterns
Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, Christopher Batten
Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University
- Motivation •
XLOOPS ISA XLOOPS Compiler XLOOPS Microarchitecture Evaluation
Flexibility vs. Specialization
[Figure: design-space plot with axes Performance (Tasks per Second) and Energy Efficiency (Tasks per Joule). Labeled points: General Purpose Processor, More Flexible Accelerator, Less Flexible Accelerator, Custom ASIC. One build also carries the label "Golden Triangle".]
Cornell University Shreesha Srinath 2 / 31
Loop Dependence Pattern Specialization
[Figure: iterations 0 through n-1 of a loop, each drawn as inst0 inst1 inst2 inst3 ... branch]
- Intra-iteration: micro-op fusion, ASIPs, CCA
- Inter-iteration: vector, GPU, HELIX-RC
- Both: DySER, Qs-Cores, BERET
Key Challenge: Creating HW/SW abstractions that are flexible and enable performance-portable execution
Explicit Loop Specialization (XLOOPS)
Key Idea 1: Expose fine-grained parallelism by elegantly encoding inter-iteration loop dependence patterns in the ISA
Key Idea 2: Single-ISA heterogeneous architecture with a new execution paradigm supporting traditional, specialized, and adaptive execution
[Figure: GPP and L1 data cache, joined by lanes, a lane manager, and a memory crossbar]
- Traditional execution
- Specialized execution
- Adaptive execution
Talk outline:
1. XLOOPS ISA
    loop:
      lw r2, 0(rA)
      lw r3, 0(rB)
      ...
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu r1, r1, 1
      xloop.uc r1, rN, loop
2. XLOOPS Compiler
    #pragma xloops ordered
    for(i = 0; i < N; i++)
      A[i] = A[i] * A[i-K];
    #pragma xloops atomic
    for(i = 0; i < N; i++) {
      B[ A[i] ]++;
      D[ C[i] ]++;
    }
3. XLOOPS Microarchitecture
    [Figure: OoO GPP and L1 data cache with lanes, a lane manager, and a memory crossbar]
4. Evaluation
    [Figure: chart with axis ticks from 0.5 to 2.5]
1. XLOOPS ISA
XLOOPS Instruction Set Extensions
XLOOP instruction: xloop.{d}.{c} rI, rN, L
- {d}: data dependence
- {c}: control dependence
- rI: induction variable
- rN: loop bound
- L: loop label
Example: xloop.uc.fb r2, r3, 0x8000 (unordered concurrent, fixed bound)
Cross-iteration instructions: addiu.xi rX, imm and addu.xi rX, rT
- Variables that can be computed as linear functions of the induction variable
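The cross-iteration instructions matter because a register updated only by addiu.xi is a linear function of the iteration number, so any iteration can work out its own value without waiting for earlier iterations. The sketch below illustrates that property in C. The function names xi_value and xi_serial are hypothetical helpers for illustration, not part of XLOOPS.

```c
#include <stdint.h>

/* Value held by a register that starts at `base` and is updated once per
   iteration by "addiu.xi rX, step", as seen at the start of iteration i.
   Hypothetical helper: the closed form is linear in i. */
uint32_t xi_value(uint32_t base, uint32_t step, uint32_t i)
{
    return base + step * i;
}

/* Reference: the same value obtained by applying the update serially,
   one iteration after another. */
uint32_t xi_serial(uint32_t base, uint32_t step, uint32_t i)
{
    uint32_t rx = base;
    for (uint32_t k = 0; k < i; k++)
        rx += step;
    return rx;
}
```

With a base of 0x1000 and a step of 4 (one word per iteration), iteration 3 sees 0x100c under either function.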
XLOOPS ISA: Unordered Concurrent
Element-wise vector multiplication:
    for ( i=0; i<N; i++ )
      C[i] = A[i] * B[i];
RISC ISA:
    loop:
      lw r2, 0(rA)
      lw r3, 0(rB)
      mul r4, r2, r3
      sw r4, 0(rC)
      addiu rA, rA, 4
      addiu rB, rB, 4
      addiu rC, rC, 4
      addiu r1, r1, 1
      bne r1, rN, loop
XLOOPS ISA:
    loop:
      lw r2, 0(rA)
      lw r3, 0(rB)
      mul r4, r2, r3
      sw r4, 0(rC)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu.xi rC, 4
      addiu r1, r1, 1
      xloop.uc r1, rN, loop
[Figure: iterations 0 through 3, each drawn as inst0 inst1 inst2 inst3 ... xloop.uc]
- Instructions in loop cannot write live-in registers
- Live-out values are stored in memory
- Data races are possible
XLOOPS ISA: Unordered Atomic
Histogram updates:
    for ( i=0; i<N; i++ ) {
      B[A[i]]++;
      D[C[i]]++;
    }
XLOOPS ISA:
    loop:
      lw r6, 0(rA)
      lw r7, 0(rB)
      addiu r7, r7, 1
      sw r7, 0(r6)
      addiu.xi rA, 4
      ...
      addiu r1, r1, 1
      xloop.ua r1, rN, loop
[Figure: iterations 0 through 3, each drawn as inst0 inst1 inst2 inst3 ... xloop.ua]
- Iterations execute atomically
- No race conditions
- Results can be non-deterministic
- Inspired by Transactional Memory
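A rough way to picture the unordered-atomic pattern is a loop whose iterations may run in any order, as long as each runs as one indivisible step. The C sketch below models that by running whole iterations in a caller-chosen order. The names histogram_iteration and run_in_order are hypothetical. For pure increments like the histogram kernel, every order gives the same counts; the non-determinism noted on the slide would show up only when the per-iteration update does not commute.

```c
#include <stddef.h>

/* One iteration of the histogram kernel, treated as a single atomic step. */
void histogram_iteration(int *B, const int *A, int *D, const int *C, size_t i)
{
    B[A[i]]++;
    D[C[i]]++;
}

/* Run n iterations in the order given by order[], which is assumed to be
   a permutation of 0..n-1. Iterations never interleave with each other. */
void run_in_order(int *B, const int *A, int *D, const int *C,
                  const size_t *order, size_t n)
{
    for (size_t k = 0; k < n; k++)
        histogram_iteration(B, A, D, C, order[k]);
}
```

Running the same inputs with a forward and a reversed order array leaves identical contents in B and D.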
XLOOPS ISA: Ordered-Through-Registers
Parallel-prefix summation:
    for ( i=0; i<N; i++ ) {
      X += A[i];
      B[i] = X;
    }
XLOOPS ISA:
    loop:
      lw r2, 0(rA)
      addu rX, r2, rX
      sw rX, 0(rB)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu r1, r1, 1
      xloop.or r1, rN, loop
[Figure: iterations 0 through 3, each drawn as inst0 inst1 inst2 inst3 ... xloop.or]
- rX: cross-iteration register (CIR)
- CIRs are guaranteed to have the same value as a serial execution
- Inspired by Multiscalar
XLOOPS ISA: Ordered-Through-Memory
    for ( i=0; i<N; i++ )
      A[i] = A[i] * A[i-K];
XLOOPS ISA:
    # r1 = rK
    # r3 = rA + 4*rK
    loop:
      lw r4, 0(r3)
      lw r5, 0(rA)
      mul r6, r4, r5
      sw r6, 0(r3)
      addiu.xi r3, 4
      addiu.xi rA, 4
      addiu r1, r1, 1
      xloop.om r1, rN, loop
[Figure: iterations 0 through 3, each drawn as inst0 inst1 inst2 inst3 ... xloop.om]
- Updates to memory defined by serial iteration order
- No race conditions
- Inspired by Multiscalar, TLS
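To see why this loop still has parallelism even though memory updates must follow serial order, note that iteration i reads A[i-K], which is written by iteration i-K. Iterations fewer than K apart therefore never touch each other's data. The C sketch below is an illustration under that observation, not the hardware mechanism. It starts at i = K, as the assembly comments on the slide suggest, assumes K >= 1, and uses hypothetical function names. It compares a serial run with one that deliberately reverses the order inside each group of K iterations.

```c
#include <stddef.h>

/* Serial reference: A[i] = A[i] * A[i-K] for i in [K, n). */
void ordered_serial(int *A, size_t n, size_t K)
{
    for (size_t i = K; i < n; i++)
        A[i] = A[i] * A[i - K];
}

/* Same loop, but each group of K consecutive iterations runs in reverse.
   Within a group, no iteration reads a location written by another
   iteration of the same group, so the result matches the serial run.
   Assumes K >= 1. */
void ordered_blocked(int *A, size_t n, size_t K)
{
    for (size_t start = K; start < n; start += K) {
        size_t end = (start + K < n) ? start + K : n;
        for (size_t i = end; i-- > start; )
            A[i] = A[i] * A[i - K];
    }
}
```

When iterations are closer than the dependence distance allows, the order does matter, which is the case the ordered-through-memory pattern has to detect at run time.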
XLOOPS ISA: Dynamic Bound
[Figure: nodes numbered 1 through 7, labeled "Recursive traversal"]
- Parallelize across frontier using xloop.uc
- Parallelize using xloop.uc.db
    for ( i=0; i<N; i++ ) {
      ...
      if ( cond ) N++;
    }
[Figure: iterations 0 through 7, each drawn as inst0 inst1 inst2 inst3 ... xloop.uc.db]
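The dynamic-bound pattern corresponds to a worklist loop whose bound grows while the loop runs. As a minimal sketch, assume the seven nodes on the slide form a complete binary tree in which node v has children 2v and 2v+1; that layout is an assumption made here for illustration, and traverse_worklist is a hypothetical name. Each iteration may append new work and raise N.

```c
#include <stddef.h>

/* Visit nodes 1..num_nodes of a complete binary tree using a worklist.
   work[] must have room for num_nodes entries. Returns the final value
   of the loop bound N, which equals the number of iterations executed. */
size_t traverse_worklist(int *work, size_t num_nodes)
{
    size_t N = 0;
    if (num_nodes == 0)
        return 0;
    work[N++] = 1;                        /* seed the worklist with the root */
    for (size_t i = 0; i < N; i++) {      /* N is the dynamic loop bound */
        size_t v = (size_t)work[i];
        if (2 * v <= num_nodes)
            work[N++] = (int)(2 * v);     /* left child adds an iteration */
        if (2 * v + 1 <= num_nodes)
            work[N++] = (int)(2 * v + 1); /* right child adds an iteration */
    }
    return N;
}
```

For seven nodes the loop starts with N equal to 1 and finishes with N equal to 7, having visited the nodes in breadth-first order.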
2. XLOOPS Compiler
XLOOPS Compiler
Kernel implementing Floyd-Warshall shortest path algorithm:
    for ( int k = 0; k < n; k++ )
      #pragma xloops ordered
      for ( int i = 0; i < n; i++ )
        #pragma xloops unordered
        for ( int j = 0; j < n; j++ )
          path[i][j] = min( path[i][j], path[i][k] + path[k][j] );
[Figure: compiler flow from C++ through mid-level optimization passes and code generation to an xloops binary; XLOOPS-specific passes shown are the modified LSR pass, the XLOOPS control-dependence analysis pass, and the XLOOPS data-dependence analysis pass]
- Programmer annotations
  - unordered: no data-dependences
  - ordered: preserve data-dependences
  - atomic: atomic memory updates
- Loop strength reduction pass encodes MIVs as xi instructions
- XLOOPS data-dependence analysis pass
  - Register-dependence: analysing use-definition chains through PHI nodes
  - Memory-dependence: well-known dependence analysis techniques
- Detect updates to the loop bound to encode dynamic-bound control-dependence pattern
3. XLOOPS Microarchitecture
Traditional Execution
[Figure: GPP block diagram with GPR RF (32 × 32b, 2r2w), SLFU, LLFU, L1 I$ (16 KB), L1 D$ (16 KB), D$ request/response crossbar, and L2 request and response crossbars]
Minimal changes to a general-purpose processor (GPP)
- xloop → bne
- addiu.xi → addiu
- addu.xi → addu
Efficient traditional execution
- Enables gradual adoption
- Enables adaptive execution to migrate an xloop instruction
Specialized Execution: xloop.uc
[Figure: GPP block diagram (GPR RF 32 × 32b 2r2w, SLFU, LLFU, L1 I$ 16 KB, L1 D$ 16 KB, D$ request/response crossbar, L2 request and response crossbars) extended with a Lane Management Unit and lanes; each lane has a lane RF (24 × 32b, 2r2w), an instruction buffer (128×), an SLFU, and an IDQ]
Loop Pattern Specialization Unit
- Lane Management Unit (LMU)
- Four decoupled in-order lanes
- Lanes contain instruction buffers and index queues
- Lanes and the GPP arbitrate for data-memory port and long-latency functional unit
Specialized execution
- Scan phase
- Specialized execution phase
    loop:
      lw r2, 0(rA)
      lw r3, 0(rB)
      mul r4, r2, r3
      sw r4, 0(rC)
      addiu.xi rA, 4
      addiu.xi rB, 4
      addiu.xi rC, 4
      addiu r1, r1, 1
      xloop.uc r1, rN, loop
[Figure: execution timeline for the loop above, with a Time axis and columns GPP, LMU, Lane0, Lane1, and LLFU. Scan Phase: starting from xloop, each loop instruction (lw, lw, mul, sw, addiu.xi, addiu.xi, addiu.xi, addiu, xloop) is marked with rename and write events. Specialized Execution Phase: iterations 0 and 1 are dispatched, followed by iterations 2 and 3. Stall marks (X) appear near the mul instructions, with the callouts "Sharing LLFU" and "Specialized logic".]
Specialized Execution: xloop.or
[Figure: GPP block diagram with the Lane Management Unit and lanes (lane RF 24 × 32b 2r2w, Inst Buf 128×, SLFU, IDQ), now with a cross-iteration buffer (CIB 8×) per lane]
- Cross-iteration buffers (CIBs) forward register-dependences
- More details in the paper!
Specialized Execution: xloop.om
[Figure: GPP block diagram with the Lane Management Unit and lanes (lane RF 24 × 32b 2r2w, Inst Buf 128×, SLFU, IDQ, CIB 8×), now with a load-store queue (LSQ 16×) per lane]
- LSQ to support hardware memory disambiguation
- LMU control logic
  - Track non-speculative vs. speculative lanes
  - Promote lanes to be non-speculative
- Lane control logic
  - Handle structural hazards
  - Handle dependence violations
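The slides do not spell out the disambiguation logic, so the following C sketch only shows the general shape such a check could take: a speculative lane remembers the addresses it has loaded, and when a store from an earlier iteration is broadcast, the lane compares the store address against that record. The type spec_lane_t and both functions are hypothetical. The 16-entry capacity mirrors the "LSQ 16×" label, and treating a full queue as a reason to stall is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LSQ_ENTRIES 16  /* mirrors the "LSQ 16x" label; otherwise arbitrary */

/* Hypothetical record of the addresses a speculative lane has loaded. */
typedef struct {
    uint32_t load_addr[LSQ_ENTRIES];
    size_t   num_loads;
} spec_lane_t;

/* Record a speculative load. Returns false when the queue is full, which
   this sketch treats as a structural hazard the lane must wait out. */
bool lane_record_load(spec_lane_t *lane, uint32_t addr)
{
    if (lane->num_loads == LSQ_ENTRIES)
        return false;
    lane->load_addr[lane->num_loads++] = addr;
    return true;
}

/* Check a broadcast store from an earlier iteration. Returns true when
   the lane already loaded from that address, meaning it read a stale
   value and its iteration has to be re-executed. */
bool lane_store_violates(const spec_lane_t *lane, uint32_t store_addr)
{
    for (size_t k = 0; k < lane->num_loads; k++)
        if (lane->load_addr[k] == store_addr)
            return true;
    return false;
}
```

A real design would also need to account for access sizes and for clearing the record when a lane is promoted or restarted; those details are left out here.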
    loop:
      lw r4, 0(r3)
      lw r5, 0(rA)
      ...
      ...
      sw r6, 0(r7)
      addiu r1, r1, 1
      xloop.om r1, rN, loop
[Figure: execution timeline for the loop above, with a Time axis and columns GPP, LMU, Lane0, Lane1, and LLFU. Scan Phase: loop instructions are marked with rename and write events. Specialized Execution Phase: iteration 0 is dispatched to a non-speculative lane and iteration 1 to a speculative lane. Callouts across the builds: "Broadcast Store", "Dependence Check" (check), "Wasted work" with a squash mark (X) after which iteration 1 runs again, "Buffered Stores", and "Useful work"; iterations 2 and 3 are dispatched afterwards.]
Supporting other patterns
[Figure: GPP block diagram with the Lane Management Unit and lanes (lane RF 24 × 32b 2r2w, Inst Buf 128×, SLFU, IDQ, CIB 8×, LSQ 16×), now with a block labeled DBN next to the Lane Management Unit]
- xloop.ua: using xloop.om mechanisms
- xloop.orm: combine xloop.or and xloop.om mechanisms
- xloop.*.db
  - Lanes communicate updates to loop bound
  - LMU tracks maximum bound and generates additional work
Motivation XLOOPS ISA XLOOPS Compiler
- XLOOPS Microarchitecture •
Evaluation
Adaptive Execution
Iteration
inst0 inst1 inst2 inst3 ... branch
Iteration 1
inst0 inst1 inst2 inst3 ... branch inst0 inst1 inst2 inst3 ... branch
Iteration 2
inst0 inst1 inst2 inst3 ... branch
Iteration 3
inst0 inst1 inst2 inst3 ... branch
Iteration n-1
I Significant intra-iteration and
limited inter-iteration parallelism
I Specialized execution not
beneficial using simple in-order lanes
Cornell University Shreesha Srinath 24 / 31
Motivation XLOOPS ISA XLOOPS Compiler
- XLOOPS Microarchitecture •
Evaluation
Adaptive Execution
[Figure: a column of loop iterations (through Iteration n-1), each drawn as inst0, inst1, inst2, inst3, ..., branch. System diagram: OoO GPP, L1 Data Cache, Lanes, Lane Manager, Mem XBar.]
- Significant intra-iteration and limited inter-iteration parallelism
- Specialized execution not beneficial using simple in-order lanes
- Adaptively migrate to complex OoO cores
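[Figure: adaptive execution timeline. Columns: GPP, LMU, Lane0, Lane1, LLFU; vertical axis: Time. Phases marked: GPP Profiling, LPSU Profiling, Traditional Execution. Inset system diagram: OoO GPP, L1 Data Cache, Lanes, Lane Manager, Mem XBar.]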
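The profile-then-choose flow named on the timeline can be sketched as follows. The sketch assumes a fixed profiling window, a constant per-iteration cost on each engine, and a simple "pick the faster one" rule; none of these details come from the slides, and `adaptive_execute` and its parameters are hypothetical.

```python
def adaptive_execute(n_iters, gpp_cycles_per_iter, lpsu_cycles_per_iter, window=8):
    """Profile a window of iterations on the GPP, then a window on the
    LPSU, then run the remaining iterations on whichever was faster.
    Returns (chosen mode, total cycles)."""
    gpp_window = min(window, n_iters)
    gpp_cycles = gpp_window * gpp_cycles_per_iter          # GPP profiling
    remaining = n_iters - gpp_window

    lpsu_window = min(window, remaining)
    lpsu_cycles = lpsu_window * lpsu_cycles_per_iter       # LPSU profiling
    remaining -= lpsu_window

    # Compare cycles per iteration by cross-multiplying, which avoids
    # dividing by zero when a window is empty.
    lpsu_faster = lpsu_window > 0 and (
        lpsu_cycles * gpp_window < gpp_cycles * lpsu_window
    )
    if lpsu_faster:
        mode, rest = "specialized", remaining * lpsu_cycles_per_iter
    else:
        mode, rest = "traditional", remaining * gpp_cycles_per_iter
    return mode, gpp_cycles + lpsu_cycles + rest
```

For a loop that runs slower on the lanes, the only overhead in this model is the one LPSU profiling window before execution returns to the GPP.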
- 1. XLOOPS ISA
loop:
  lw r2, 0(rA)
  lw r3, 0(rB)
  ...
  addiu.xi rA, 4
  addiu.xi rB, 4
  addiu r1, r1, 1
  xloop.uc r1, rN, loop
- 2. XLOOPS Compiler
#pragma xloops ordered
for(i = 0; i < N; i++)
  A[i] = A[i] * A[i-K];
#pragma xloops atomic
for(i = 0; i < N; i++) {
  B[ A[i] ]++; D[ C[i] ]++;
}
- 3. XLOOPS Microarchitecture
[Figure: system diagram with OoO GPP, L1 Data Cache, Lanes, Lane Manager, Mem XBar.]
- 4. Evaluation
[Figure: results plot thumbnail.]
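A small model of the two annotated loops shows why each tolerates some reordering. It assumes the ordered loop starts at i = K so that A[i-K] stays in bounds, and it only illustrates the dependence structure; it says nothing about how the XLOOPS compiler lowers the pragmas. All function names are hypothetical.

```python
import random


def ordered_loop(a, k):
    """A[i] = A[i] * A[i-k], executed strictly in order."""
    a = list(a)
    for i in range(k, len(a)):
        a[i] = a[i] * a[i - k]
    return a


def ordered_loop_chunked(a, k, rng):
    """Same loop, but each run of k consecutive iterations is executed in
    a shuffled order. Iteration i only depends on iteration i-k, so
    iterations inside one chunk are independent of each other."""
    a = list(a)
    for start in range(k, len(a), k):
        chunk = list(range(start, min(start + k, len(a))))
        rng.shuffle(chunk)
        for i in chunk:
            a[i] = a[i] * a[i - k]
    return a


def atomic_loop(indices, size, order):
    """B[indices[i]]++ with iterations executed in the given order. Each
    increment is an indivisible update, so any order gives the same counts."""
    b = [0] * size
    for i in order:
        b[indices[i]] += 1
    return b
```

The dependence distance K therefore bounds how many consecutive iterations of the ordered loop can overlap without communicating, while the atomic loop only needs each update to be indivisible.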
Application Kernels
xloop.uc
- Color space conversion
- Dense matrix-multiply
- String search algorithm
- Symmetric matrix-multiply
- Viterbi decoding algorithm
- Floyd-Warshall shortest path
xloop.or
- ADPCM decoder
- Covariance computation
- Floyd-Steinberg dithering
- K-Means clustering
- SHA-1 encryption kernel
- Symmetric matrix-multiply
xloop.om
- Dynamic-programming
- K-Nearest neighbors
- Knapsack kernel
- Floyd-Warshall shortest path
xloop.orm, xloop.ua
- Greedy maximal-matching
- 2D Stencil computation
- Binary tree construction
- Heap-sort computation
- Huffman entropy coding
- Radix sort algorithm
xloop.uc.db
- Breadth-first search
- Quick-sort algorithm
25 Kernels from MiBench, PolyBench, PBBS, and Custom
Cycle-Level Methodology
- LLVM-3.1 based compiler framework
- gem5 – in-order and out-of-order processors
- PyMTL – LPSU models
- McPAT-1.0 – 45nm energy models
Energy-Efficiency vs. Performance Results
[Figure: three scatter plots of Normalized Energy Efficiency (y-axis) against Normalized Performance (x-axis). Panels: In-order+LPSU vs. In-order Core; OOO 2-way+LPSU vs. OOO 2-Way; OOO 4-way+LPSU vs. OOO 4-Way.]
In-order+LPSU vs. In-order Core:
- Competitive energy efficiency
- Higher dynamic power
- Always higher performance
OOO 2-way+LPSU vs. OOO 2-Way:
- Always more energy efficient
- Mixed dynamic power
- Competitive or higher performance (uc/or/om/ua/db)
OOO 4-way+LPSU vs. OOO 4-Way:
- Always more energy efficient
- Always lower dynamic power
- Mixed performance (uc/om/ua/db)
Adaptive execution:
- Trade energy efficiency for performance for slower kernels
- Profiling and migration cause minimal performance degradation
More results in the paper!
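The power statements on this slide follow from the two plotted quantities. As a sketch, assume both axes are normalized to the baseline named in each panel, and that power means average power over the run (the slides speak of dynamic power specifically; the identity holds for whichever energy is counted in the efficiency figure). With $T$ the execution time and $E$ the energy for a fixed task:

```latex
\[
S = \frac{T_{\text{base}}}{T_{\text{new}}}, \qquad
\eta = \frac{E_{\text{base}}}{E_{\text{new}}}, \qquad
\frac{P_{\text{new}}}{P_{\text{base}}}
  = \frac{E_{\text{new}} / T_{\text{new}}}{E_{\text{base}} / T_{\text{base}}}
  = \frac{S}{\eta}
\]
```

Here $S$ is normalized performance and $\eta$ is normalized energy efficiency. A point with $S > \eta$ draws more power than the baseline, and a point with $S < \eta$ draws less. This is consistent with "competitive energy efficiency" plus "always higher performance" implying higher power, and with "always lower dynamic power" coexisting with mixed performance when $\eta > 1$.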
VLSI Implementation
[Figure: chip layout. Labelled blocks: DCache 16KB SRAM for cache lines, DCache tags, ICache 16KB SRAM for cache lines, ICache tags, four L0 instruction buffers, Loop Pattern Specialization Unit, scalar processor, 32b IEEE floating-point unit, 32b integer mul/div unit.]
- TSMC 40 nm standard-cell-based implementation
- RISC scalar processor with 4-lane LPSU
- Supports xloop.uc
- ≈40% extra area compared to simple RISC processor
Take-Away Points
[Figure: recap thumbnails of the xloop.uc assembly example, the #pragma xloops ordered and atomic loops, and the system diagram (OoO GPP, L1 Data Cache, Lanes, Lane Manager, Mem XBar).]
- Elegant new abstraction that enables performance-portable execution of loops
- A single-ISA heterogeneous architecture with a new execution paradigm
  - Traditional Execution
  - Specialized Execution
  - Adaptive Execution
This work was supported in part by the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), and donations from Intel Corporation, Synopsys, Inc., and Xilinx, Inc.