In-Memory Data Parallel Processor - PowerPoint PPT Presentation
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group
2
Data-Parallel Applications: arithmetic vs. data communication
CPU: many core, SIMD, OoO | GPU: many thread, SIMT, SIMD
[Figure: relative cost of data movement vs. arithmetic, 1000x / 40x]
"Data movement is what matters, not arithmetic"
- Bill Dally
In-Memory Computing exposes parallelism while minimizing data movement cost
3
CPU / GPU / In-Memory: in-situ computing, massive parallelism
SIMD slots over dense memory arrays; high bandwidth / low data movement
In-Memory Computing – Reduces Data Movement
4
CPU / GPU / In-Memory
[Figure: 2x2 ReRAM crossbar. DACs drive rows at V1, V2; each cell draws I_ij = V_i * C_ij (I11 = V1*C11, I12 = V1*C12, I21 = V2*C21, I22 = V2*C22); bitlines sum I1 = I11 + I21, I2 = I12 + I22.]
In-situ computing, massive parallelism
5
                      CPU (2 sockets)       GPU               ReRAM
                      Intel Xeon E5-2697    NVIDIA TITAN Xp   Scaled from ISAAC*
Area (mm^2)           912.24                471               494
TDP (W)               290                   250               416
On-chip memory (MB)   78.96                 9.14              8,590
SIMD slots            448                   3,840             2,097,152
Freq (GHz)            3.6                   1.585             0.02
SIMD x Freq product   3,227                 6,086             41,953
In-Memory Computing – Exposes Parallelism
IN-MEMORY In-situ computing Massive parallelism
In-Memory Computing Today
6
[Figure: 2x2 ReRAM crossbar. DACs apply V1 and V2; Ohm's law gives per-cell currents I11 = V1*C11, I12 = V1*C12, I21 = V2*C21, I22 = V2*C22; Kirchhoff's law sums each bitline: I1 = I11 + I21, I2 = I12 + I22.]
Multiplication + Summation
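The crossbar figure reduces to a small numeric model: Ohm's law makes each cell a multiplier and Kirchhoff's law makes each bitline an adder. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
# Hypothetical model of an analog crossbar: cell conductances form a
# matrix C, input voltages a vector V. Ohm's law gives per-cell currents
# I_ij = V_i * C_ij; Kirchhoff's current law sums each column into I_j.

def crossbar_dot_product(voltages, conductances):
    """Return bitline currents I_j = sum_i V_i * C_ij."""
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            currents[j] += voltages[i] * conductances[i][j]  # Ohm + Kirchhoff
    return currents

# 2x2 example from the slide: I1 = V1*C11 + V2*C21, I2 = V1*C12 + V2*C22
print(crossbar_dot_product([1.0, 2.0], [[0.5, 0.25], [0.125, 0.5]]))  # [0.75, 1.25]
```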
ReRAM Dot-product Accelerator
- PRIME [Chi 2016, ISCA]
- ISAAC [Shafiee 2016, ISCA]
- Dot-Product Engine [Hu 2016, DAC]
- PipeLayer [Song 2017, HPCA]
In-Memory Computing – No Demonstration of General Purpose Computing
- No established programming model / execution model
- Limited computation primitives
How to program?
7
In-Memory Data Parallel Processor Overview
Microarchitecture ISA Execution Model Programming Model Compiler
HW SW
8
Memory ISA: ADD, MOVI, DOT, MOVG, MUL, SHIFT{L/R}, SUB, MASK, MOV, LUT, MOVS, REDUCE_SUM
[Figure: Data Flow Graph -> IMP Compiler -> ILP (Instruction Blocks IB1, IB2) and DLP (Modules) -> computation primitives on ReRAM crossbars]
Processor Architecture (9)
Information is stored in analog form (cell conductance C = 1/resistance)
A <-> CA (Read / Write)
Computation Primitives
10
[Figure: (a) Addition and (b) Subtraction* on a crossbar. Rows driven by DACs; IA = (Vdd/2)*CA by Ohm's law [mult]; I = IA + IB by Kirchhoff's law [add].]
* New primitive
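The addition and subtraction primitives can be sketched as signed current sums. The constants and function names below are illustrative, and modeling the subtraction primitive's complementary drive as a negated current is a simplification of mine:

```python
# Hedged sketch of the in-situ ADD/SUB primitives: driving both wordlines
# makes the bitline current the sum of the two Ohm's-law cell currents,
# so the stored values add (or subtract) without a separate read-out.

VDD = 1.0

def in_situ_add(c_a, c_b, vdd=VDD):
    """Bitline current for two cells on one bitline: I = vdd*c_a + vdd*c_b."""
    return vdd * c_a + vdd * c_b  # Kirchhoff sums the two currents

def in_situ_sub(c_a, c_b, vdd=VDD):
    """Subtraction primitive, modeled here as a signed current difference."""
    return vdd * c_a - vdd * c_b

print(in_situ_add(0.25, 0.5))   # 0.75
print(in_situ_sub(0.75, 0.25))  # 0.5
```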
Computation Primitives
11
(c) Dot-product: DACs apply VX and VY; Ohm's law gives IAX = VX*CA, IBY = VY*CB, ICX = VX*CC, IDY = VY*CD; Kirchhoff's law gives I1 = IAX + IBY, I2 = ICX + IDY
(d) Element-wise multiplication*: I11 = (Vdd - V1)*C11, I12 = (Vdd - V2)*C12
Dot-product:  (x y) [a b; c d] = (ax + cy, bx + dy)
Element-wise: (x y) . (a c) = (ax, cy)
* New primitive
Microarchitecture
12
[Figure: Cluster = LUT + many ReRAM PUs + Reg. File; Processing Unit = RRAM crossbar (XB), S+H, 2x DAC, 2x ADC, S+A, Reg; Router = Row Decoder + Shift & Add Unit]
Microarchitecture parameters (13):
Array size        128 x 128
R/W latency       50 ns
Multi-level cell  2 bits
ADC resolution    5 bits
ADC frequency     1.2 GSps
DAC resolution    2 bits
LUT size          256 x 8
8 PUs/array; 128 4B regs/PU; 512B x 8 register file; 8 ALUs
ISA (14)
Opcode        Format                      Cycles
ADD           <MASK> <DST>                3
DOT           <MASK> <REG_MASK> <DST>     18
MUL           <SRC> <SRC> <DST>           18
SUB           <SRC> <SRC> <DST>           3
MOV           <SRC> <DST>                 3
MOVS          <SRC> <DST> <MASK>          3
MOVI          <SRC> <IMM>                 1
MOVG          <GADDR> <GADDR>             Variable
SHIFT{L/R}    <SRC> <SRC> <IMM>           3
MASK          <SRC> <SRC> <IMM>           3
LUT           <SRC> <SRC>                 4
REDUCE_SUM    <SRC> <GADDR>               Variable
Categories: in-situ computation, moves, R/W, misc.
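One simple use of the cycle counts in the ISA table is a per-block latency estimate. A sketch (the dictionary mirrors the table; skipping the variable-latency ops MOVG and REDUCE_SUM is a simplification of mine):

```python
# Illustrative latency model for the memory ISA; cycle counts are copied
# from the table above, with variable-latency ops marked as None.
ISA_CYCLES = {
    "ADD": 3, "DOT": 18, "MUL": 18, "SUB": 3,
    "MOV": 3, "MOVS": 3, "MOVI": 1, "MOVG": None,
    "SHIFTL": 3, "SHIFTR": 3, "MASK": 3, "LUT": 4,
    "REDUCE_SUM": None,
}

def block_latency(ops):
    """Sum fixed cycle counts for an instruction block; variable-latency
    ops are skipped here as a simplification."""
    return sum(ISA_CYCLES[op] for op in ops if ISA_CYCLES[op] is not None)

print(block_latency(["MOVI", "DOT", "ADD"]))  # 1 + 18 + 3 = 22
```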
15
Programming Model (16)
Need a programming language that merges concepts of Data-Flow and SIMD to maximize parallelism.
KEY OBSERVATION
- Data-Flow: explicit dataflow exposes Instruction-Level Parallelism
- SIMD: exposes Data-Level Parallelism
- Side-effect free: no dependence on shared-memory primitives
Execution Model (17)
[Figure: Data Flow Graph over Input Matrix A and Input Matrix B]
Execution Model (18)
DLP: the Data Flow Graph is decomposed by unrolling the innermost dimension into Modules.
Module = modularized execution flow, applied to the innermost dimension.
Execution Model (19)
ILP: each Module's decomposed DFG is split into Instruction Blocks (IB1, IB2).
Instruction Block (IB) = partial execution sequence of a Module, mapped to a single array.
Execution Model (20)
The IBs of each Module are mapped onto ReRAM arrays.
Execution Model (21)
[Figure: Data Flow Graphs -> Modules -> IBs, mapped onto ReRAM arrays]
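The DLP/ILP decomposition can be sketched as pure bookkeeping: one Module per innermost-dimension element, each holding a fixed number of IBs. Names and the flat structure are illustrative, not the compiler's actual representation:

```python
# Sketch of the execution model: unroll the innermost dimension of a
# data-parallel op into Modules (DLP), then split each Module's DFG into
# Instruction Blocks (ILP) that each map to one ReRAM array.

def decompose(n_elements, ibs_per_module):
    """Return a list of modules; each module is a list of IB labels."""
    modules = []
    for m in range(n_elements):          # one Module per innermost element
        modules.append([f"M{m}-IB{i + 1}" for i in range(ibs_per_module)])
    return modules

mods = decompose(3, 2)
print(mods)  # [['M0-IB1', 'M0-IB2'], ['M1-IB1', 'M1-IB2'], ['M2-IB1', 'M2-IB2']]
```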
22
Compilation Flow (23)
Frontend: TensorFlow (Python / C++ / Java) -> DFG (Protocol Buffer)
IMP Compiler: Semantic Analysis -> Optimization (Node Merging, IB Expansion, Pipelining) -> Backend (Instruction Lowering, IB Scheduling, CodeGen), guided by Target Machine Modeling
Optimization 1: Node Merging (25)
[Figure: Placeholder inputs feed Add nodes (8, 8) and a Reduce (16); the Add + Reduce chain is merged into a single multi-operand add]
- Exploit multi-operand ADD/SUB
- Reduce redundant writebacks
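A hedged sketch of what a Node Merging pass might look like on a toy DFG representation. The tuple encoding and the ADD_MULTI opcode are my own illustration, not the compiler's actual IR:

```python
# Hypothetical DFG node-merging pass: an Add feeding a Reduce collapses
# into one multi-operand add, avoiding the intermediate writeback.

def merge_add_reduce(dfg):
    """dfg: list of (op, inputs, output) tuples. Fuse Add -> Reduce chains."""
    producers = {out: (op, ins) for op, ins, out in dfg}
    merged, consumed = [], set()
    for op, ins, out in dfg:
        if op == "Reduce" and len(ins) == 1:
            prod = producers.get(ins[0])
            if prod and prod[0] == "Add":
                # Fold the Add's operands directly into a multi-operand add.
                merged.append(("ADD_MULTI", list(prod[1]), out))
                consumed.add(ins[0])
                continue
        merged.append((op, ins, out))
    # Drop producers whose outputs were absorbed by the fusion.
    return [n for n in merged if n[2] not in consumed]

dfg = [("Add", ["a", "b"], "t0"), ("Reduce", ["t0"], "r")]
print(merge_add_reduce(dfg))  # [('ADD_MULTI', ['a', 'b'], 'r')]
```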
Optimization 2: IB Expansion (26)
[Figure: packed operands (3 2 ..., 5 6 ...) are unpacked, added in parallel, and re-packed (8 8 ...)]
IB Expansion exposes more parallelism in a module to the architecture.
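IB Expansion's unpack / parallel-add / pack flow, using the slide's operand values, might look like the following sketch (the tuple packing model is illustrative):

```python
# Sketch of IB Expansion: a module that adds packed pairs is "unpacked"
# into independent scalar adds that could run in parallel, then re-packed.

def ib_expand(packed_a, packed_b):
    """Unpack two packed operand tuples, add element-wise in independent
    lanes (modeled here as a comprehension), and pack the results."""
    lanes = [a + b for a, b in zip(packed_a, packed_b)]  # independent IBs
    return tuple(lanes)                                   # pack

# Slide example: (3, 5) + (2, 6) unpacks into 3+2 and 5+6 in parallel.
print(ib_expand((3, 5), (2, 6)))  # (5, 11)
```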
Compilation Flow (28)
Compiler Backend - Instruction Lowering (29)
Instruction Lowering: transform high-level TensorFlow instructions into the memory ISA (e.g., Add, Mul, LUT, ...).
Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Conv2D, Less, ...
Division Algorithm (Newton-Raphson / Maclaurin): q = a / b
1. t0 = LUT(b)        (approximate 1/b)
2. q0 = a * t0
3. e0 = 1 - b * t0
4. q1 = q0 + e0 * q0
5. e1 = e0^2
6. q2 = q1 + e1 * q1
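The lowered division can be checked numerically: the LUT seeds t0 ≈ 1/b, and the correction steps apply the Maclaurin expansion 1/b = t0(1 + e0)(1 + e0²)··· with e0 = 1 − b·t0. A sketch, where the rounding-based LUT model is my own stand-in:

```python
# Sketch of the lowered division q = a/b via Newton-Raphson / Maclaurin.

def lut_reciprocal(b):
    """Stand-in for the hardware LUT: a coarse, low-precision 1/b seed."""
    return round(1.0 / b, 1)   # illustrative precision, not the real LUT

def imp_divide(a, b):
    t0 = lut_reciprocal(b)     # 1. t0 = LUT(b) ~ 1/b
    q0 = a * t0                # 2. q0 = a * t0
    e0 = 1.0 - b * t0          # 3. e0 = 1 - b*t0
    q1 = q0 + e0 * q0          # 4. q1 = q0 * (1 + e0)
    e1 = e0 * e0               # 5. e1 = e0^2
    q2 = q1 + e1 * q1          # 6. q2 = q1 * (1 + e1)
    return q2

print(imp_divide(1.0, 3.0))    # close to 0.3333...
```

Two correction steps already recover far more precision than the seed: the error shrinks roughly as e0⁴.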
Compiler Backend - IB Scheduling (30)
With a target of 1 IB, a single IB1 yields large execution time.
Compiler Backend - IB Scheduling (31)
With a target of 2 IBs, the split into IB1 and IB2 can be good (more parallelism) or bad (network delay).
Compiler Backend - IB Scheduling (32)
Bottom-Up Greedy [Ellis 1986]
- Collect candidate assignments
- Make final assignments
- Minimize data transfer latency by taking both operand & successor locations into consideration
IB scheduling examples (33-36): the scheduler assigns IBs to arrays over time.
- IB1 is chosen when it is closer to the operand locations.
- IB2 is chosen when earlier slots are available.
- IB1 is chosen when it gives better overlap of communication and computation.
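A toy version of the Bottom-Up Greedy trade-off between operand locality and slot availability; the cost model, weights, and names are invented for illustration only:

```python
# Sketch of the Bottom-Up Greedy (BUG) flavor of IB scheduling [Ellis 1986]:
# score each candidate array by operand-transfer distance plus its earliest
# free slot, and greedily pick the minimum.

def schedule_ibs(ibs, arrays):
    """ibs: list of (name, operand_home). arrays: dict name -> free_at.
    Cost = network hops (0 if operands are local, else 1) + wait time."""
    placement = {}
    for name, operand_home in ibs:
        best = min(
            arrays,
            key=lambda a: (0 if a == operand_home else 1) + arrays[a],
        )
        placement[name] = best
        arrays[best] += 2          # each scheduled IB delays that array
    return placement

# IB1 lands on A0 (operand locality); IB2 prefers A1 (earlier free slot).
print(schedule_ibs([("IB1", "A0"), ("IB2", "A0")], {"A0": 0, "A1": 0}))
```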
Evaluation Methodology
Benchmarks
- PARSEC 3.0
− Blackscholes, Canneal, Fluidanimate
- Rodinia
− Backprop, Hotspot, Kmeans, Streamcluster
Methodology (37)
                    CPU (2 sockets)          GPU (1 card)             IMP Processor
Hardware            Intel Xeon E5-2697 v3,   NVIDIA Titan Xp,         20 MHz ReRAM, 4,096 tiles,
                    3.6 GHz, 28 cores,       1.6 GHz, 3,840           64 ReRAM PUs / tile
                    56 threads               CUDA cores
On-chip memory      78.96 MB                 9.14 MB                  8,590 MB
Off-chip memory     64 GB DRAM               12 GB DRAM               -
Profiler/Simulator  Intel VTune Amplifier    NVPROF                   Cycle-accurate simulator
(Performance)                                                         (BookSim integrated)
Profiler/Simulator  Intel RAPL interface     NVIDIA System            Trace-based simulation
(Power)                                      Management Interface
Offloaded Kernel / Application Speedup (CPU)
- The capacity limitation of IMP sets the upper bound on performance improvement.
38
[Charts: application speedup (normalized execution time) up to 7.5x; offloaded kernel speedup up to 41x over CPU]
Kernel Speedup (GPU) (39)
[Chart: kernel speedup up to 763x over GPU]
- GPU benchmarks are able to exploit higher DLP, dot-product operations, and multi-row addition.
Summary
Microarchitecture ISA Execution Model Programming Model Compiler
HW SW
Contributions: an in-memory computing stack for general-purpose programming
- Used TensorFlow as the programming frontend
- Developed a compiler for in-memory computing on ReRAM
- Developed the ISA and computation primitives
Results
- 763x speedup and 440x energy efficiency over a server-class GPGPU
40