Stream-based Memory Specialization for General Purpose Processors
Zhengrong Wang
Prof. Tony Nowatzki
Computation & Memory Specialization

SIMD and dataflow accelerators introduce new ISA abstractions for certain computation patterns. Can we introduce a new ISA abstraction for memory access patterns?

[Figure: a core with a dataflow accelerator; a memory access accelerator produces the stream b[0], b[1], b[2], ..., feeding the indirect accesses a[b[0]], a[b[1]], a[b[2]], ...]
[Figure: a stream engine between the core and memory fetches b[0], b[1], b[2], ... and the dependent accesses a[b[0]], a[b[1]], a[b[2]], ...]

The stream abstraction can:
– Decouple memory access.
– Enable efficient prefetching.
– Leverage stream information in cache policies.
[Figure: an O3 core repeatedly issues addr → load → if → add → br; the loads miss in the L1 cache, go to the L2 cache, and the core waits for the value.]

while (i < N) {
  if (cond) v += a[i];
  i++;
}

Overhead 1: Hard to prefetch with control flow.
Overhead 2: Repeated address computation/loads.
Overhead 3: Assumption of reuse in the cache.
[Figure: a stream engine (SE), configured before the loop, prefetches a[i] ahead of the core; the core's loads now hit in the L1 cache.]

Opportunity 1: Prefetch with control flow.

cfg(a[i]);
while (i < N) {
  if (cond) v += a[i];
  i++;
}
[Figure: the stream engine now delivers prefetched elements into a FIFO, so the addr/load instructions disappear from the core; only if/add/br remain.]

Opportunity 1: Prefetch with control flow.
Overhead 2: Repeated address computation/loads.
Opportunity 2: Semi-binding prefetch.

s_a = cfg();
while (i < N) {
  if (cond) v += s_a;
  i++;
}
[Figure: a stream with no locality is fetched into the FIFO directly from the L2 cache, bypassing the L1 cache.]

Opportunity 1: Prefetch with control flow.
Overhead 2: Repeated address computation/loads.
Opportunity 2: Semi-binding prefetch.
Overhead 3: Assumption of reuse in the cache.
Opportunity 3: Better policies, e.g. bypass a cache level if no locality.

s_a = cfg();
while (i < N) {
  if (cond) v += s_a;
  i++;
}
– Decoupled access/execute: Outrider [ISCA’11], DeSC [MICRO’15], etc. Ours: a new ISA abstraction for the access engine.
– Prefetching: stride prefetchers, IMP [MICRO’15], etc. Ours: the access pattern is explicit in the ISA.
– Cache management: counter-based bypassing [ICCD’05], LLC bypassing [ISCA’11], etc. Ours: incorporate static stream information.
[Chart: breakdown of dynamic memory accesses per benchmark (0–100%) into Affine, Indirect, Pointer-Chase (PC), Unqualified, and Outside-loop.]
Takeaway: support indirect streams.
[Chart: distribution of stream lengths (>1k, >100, >50, >0 iterations) per benchmark: pca, rbm, disparity, lbm_s, sphinx, srr, svm, xz_s, avg.]
Takeaway: support long streams to capture long-term behavior, with low overhead for short streams.
[Chart: number of execution paths within the loop (1, 2, 3, >3) per benchmark.]
Decouple from control flow.
Original C Code:

int i = 0;
while (i < N) {
  sum += a[i];
  i++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a);

Stream Dependence Graph:

[Figure: the induction stream i (stepped by i++) feeds the affine stream a[i]; each iteration, the pseudo-register s_a holds the element at the next address (0x400, 0x404, 0x408, ...), advanced by the user step.]
Original C Code:

int i = 0, j = 0;
while (cond) {
  if (a[i] < b[j]) i++;
  else j++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a, s_j, s_b);
while (cond) {
  if (s_a < s_b) stream_step(s_i);
  else stream_step(s_j);
}
stream_end(s_i, s_a, s_j, s_b);

Stream Dependence Graph:

[Figure: two induction streams i and j feed the affine streams a[i] and b[j]; each user step advances only the stream chosen by the branch.]
Original C Code:

int i = 0;
while (i < N) {
  sum += a[b[i]];
  i++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a, s_b);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a, s_b);

Stream Dependence Graph:

[Figure: the induction stream i feeds the affine stream b[i] (addresses 0x400, 0x404, 0x408, ...), which in turn feeds the indirect stream a[b[i]] (addresses 0x86c, 0x888, 0x668, ...); one user step advances the whole chain.]
– stream_cfg communicates the stream configuration; the pseudo-register holds the current iteration’s data.
– Implicit assumptions: stream elements will be used, and streams are long.
– Load → first use of the pseudo-register after it is configured/stepped. Store → every write to the pseudo-register.
Rich Information → Better Policies

– The compiler conveys stream information through the ISA to hardware: memory footprint, reuse distance, modified?, conditionally used?, indirect, …
– Hardware leverages it in better policies: prefetch throttling, cache replacement, cache bypassing, sub-line transfer, …
[Figure: streams s_a and s_b flow through the core, L1$, and L2$. The stream over a[N][N] has reuse distance O(N); the stream over b[N][N] has reuse distance O(N × N).]
[Figure: the stream FIFO holds prefetched elements (addresses 0x400, 0x404, 0x408, 0x40c, 0x410) ahead of the core.]

Misspeculation recovery:
– On a misspeculated stream step: decrement the iteration map. No need to flush the FIFO and re-fetch data (decoupled)!
– On more severe misspeculation: revert the stream states, including the stream FIFO.
– Compiler: identify stream candidates, generate stream configuration, transform the program.
– Benchmarks: SPEC2017 C/C++ benchmarks and CortexSuite.
– Simulation: 10-million-instruction SimPoints, ~10 SimPoints per benchmark.
– Pf-Stride: table-based stride prefetcher.
– Pf-Helper: SMT-based ideal helper thread; requires no HW resources (ROB, etc.); runs exactly 1k instructions ahead of the main thread.
– SSP-Non-Bind: prefetch only.
– SSP-Semi-Bind: + semi-binding prefetch.
– SSP-Cache-Aware: + stream-aware cache bypassing.
[Chart: speedup of Pf-Stride, SSP-Non-Bind, SSP-Semi-Bind, SSP-Cache-Aware, and Pf-Helper, per benchmark.]
[Chart: dynamic instructions after the stream transformation, split into remaining and added instructions.]
[Chart: speedup of semi-binding prefetch vs. non-binding prefetch.]
[Charts: energy vs. speedup on SPEC CPU 2017 and CortexSuite for OOO[2,6,8], Pf-Stride[2,6,8], Pf-Helper[2,6,8], and SSP-Cache-Aware[2,6,8].]
– ISA/Microarchitecture extension. – Stream-aware cache bypassing.
– New direction for improving cache architectures. – Combine memory and computation specialization.