In-Memory Data Parallel Processor

  1. In-Memory Data Parallel Processor. Daichi Fujiki, Scott Mahlke, Reetuparna Das. M-Bits Research Group.

  2. “Data movement is what matters, not arithmetic” (Bill Dally)
     Figure labels: data-parallel applications; GPU, CPU; many-thread, many-core; SIMT, SIMD, OoO; arithmetic vs. data communication (40x, 1000x).

  3. In-Memory Computing exposes parallelism while minimizing data movement cost
     • In-situ computing
     • Massive parallelism
     Figure labels: in-memory vs. GPU and CPU; SIMD slots over dense memory arrays; high bandwidth / low data movement.

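  4. In-Memory Computing: Reduces Data Movement
     • In-situ computing
     • Massive parallelism
     Figure (ReRAM crossbar): DACs drive row voltages $V_1$, $V_2$ across cells with conductances $C_{11}$, $C_{12}$, $C_{21}$, $C_{22}$.
     Cell currents: $I_{11} = V_1 C_{11}$, $I_{12} = V_1 C_{12}$, $I_{21} = V_2 C_{21}$, $I_{22} = V_2 C_{22}$.
     Column currents: $I_1 = I_{11} + I_{21}$, $I_2 = I_{12} + I_{22}$.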

  5. In-Memory Computing: Exposes Parallelism
     • In-situ computing
     • Massive parallelism

     |                     | CPU (2 sockets)    | GPU             | ReRAM              |
     |---------------------|--------------------|-----------------|--------------------|
     |                     | Intel Xeon E5-2597 | NVIDIA TITAN Xp | Scaled from ISAAC* |
     | Area (mm²)          | 912.24             | 471             | 494                |
     | TDP (W)             | 290                | 250             | 416                |
     | On-chip memory (MB) | 78.96              | 9.14            | 8,590              |
     | SIMD slots          | 448                | 3,840           | 2,097,152          |
     | Freq (GHz)          | 3.6                | 1.585           | 0.02               |
     | SIMD Freq Product   | 3,227              | 6,086           | 41,953             |

  6. In-Memory Computing Today: ReRAM Dot-product Accelerator
     • PRIME [Chi 2016, ISCA]
     • ISAAC [Shafiee 2016, ISCA]
     • Dot-Product Engine [Hu 2016, DAC]
     • PipeLayer [Song 2017, HPCA]
     Figure (ReRAM crossbar): DACs drive row voltages $V_1$, $V_2$ across cells with conductances $C_{11}$, $C_{12}$, $C_{21}$, $C_{22}$.
     Cell currents: $I_{11} = V_1 C_{11}$, $I_{12} = V_1 C_{12}$, $I_{21} = V_2 C_{21}$, $I_{22} = V_2 C_{22}$.
     Column currents: $I_1 = I_{11} + I_{21}$, $I_2 = I_{12} + I_{22}$.
     Multiplication + Summation
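
The multiplication-plus-summation behavior of the crossbar on slide 6 can be illustrated with a small numerical sketch. This is only a model of the equations $I_j = \sum_i V_i C_{ij}$ shown on the slide, not the authors' simulator; the function name `crossbar_currents` and the example values are assumptions chosen for illustration, and effects such as DAC/ADC quantization are ignored.

```python
def crossbar_currents(voltages, conductances):
    """Column currents of an idealized crossbar: I_j = sum_i V_i * C_ij.

    voltages[i] is the voltage on row i (hypothetical example units);
    conductances[i][j] is the cell conductance at row i, column j.
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need one voltage per crossbar row")

    currents = [0.0] * n_cols
    for i, v in enumerate(voltages):
        for j in range(n_cols):
            # Ohm's law gives each cell current (multiplication);
            # Kirchhoff's law sums them along the column (summation).
            currents[j] += v * conductances[i][j]
    return currents


# Illustrative values only: V_1 = 1, V_2 = 2; C = [[C_11, C_12], [C_21, C_22]].
V = [1.0, 2.0]
C = [[3.0, 4.0],
     [5.0, 6.0]]
print(crossbar_currents(V, C))  # [13.0, 16.0]
```

Each column produces one dot product, so a crossbar with many columns evaluates many dot products from a single application of the row voltages.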

  7. In-Memory Computing: No Demonstration of General Purpose Computing
     How to program?
     • No established programming model / execution model
     • Limited computation primitives

  8. In-Memory Data Parallel Processor Overview
     Figure (layers from HW to SW):
     • Microarchitecture: memory
     • ISA: ADD, DOT, MUL, SUB, MOV, MOVS, MOVI, MOVG, SHIFT{L/R}, MASK, LUT, REDUCE_SUM
     • Execution Model: Modules composed of instruction blocks (IB1, IB2); ILP
     • Compiler: IMP Compiler, Data Flow Graph
     • Programming Model: DLP

  9. Outline (HW to SW): Processor Architecture - ISA - Execution Model - Programming Model - Compiler
     Computation Primitives
     Information is stored in analog form (cell conductance C = 1/resistance).
     Figure labels: values A and B are written to and read from cells with conductances $C_A$ and $C_B$ through DACs.

  10. Computation Primitives
     • Ohm's law [mult]: $I_A = (V_{dd}/2)\,C_A$
     • Kirchhoff's law [add]: $I = I_A + I_B$
     Figure panels: (a) Addition, (b) Subtraction* (* new primitive). Figure labels: $V_{dd}$, $V_{dd}/2$, DACs, cells $C_A$ and $C_B$ with currents $I_A$ and $I_B$.

  11. Computation Primitives
     (c) Dot-product: inputs X and Y are applied as voltages $V_X$, $V_Y$ to cells storing A, B, C, D.
     $I_{AX} = V_X C_A$, $I_{BY} = V_Y C_B$, $I_{CX} = V_X C_C$, $I_{DY} = V_Y C_D$
     $I_1 = I_{AX} + I_{BY}$, $I_2 = I_{CX} + I_{DY}$
     (d) Element-wise multiplication* (* new primitive): multiplier applied as voltages $V_1$, $V_2$; multiplicand stored as conductances $C_{11}$, $C_{12}$.
     $I_{11} = (V_{dd} - V_1)\,C_{11}$, $I_{12} = (V_{dd} - V_2)\,C_{12}$

  12. Microarchitecture
     Figure labels:
     • Cluster: ReRAM PUs, register file, LUT, router
     • Processing Unit (PU): row decoder, DACs, RRAM crossbar (XB), sample-and-hold (S+H), ADCs, shift-and-add unit (S+A), registers

  13. Microarchitecture
     Figure labels: cluster of ReRAM PUs with register file, LUT, and ALUs; PU with DACs, ADCs, RRAM XB, S+A, S+H (Sample and Hold), Reg.

     | Parameter        | Value     |
     |------------------|-----------|
     | Array size       | 128 x 128 |
     | R/W latency      | 50 ns     |
     | Multi-level cell | 2         |
     | ADC resolution   | 5         |
     | ADC frequency    | 1.2 GSps  |
     | DAC resolution   | 2         |
     | LUT size         | 256 x 8   |
     | Register file    | 512B x 8  |
     | PUs/array        | 8         |
     | 4B regs/PU       | 128       |

  14. ISA

     | Category            | Opcode     | Format                  | Cycles   |
     |---------------------|------------|-------------------------|----------|
     | In-situ Computation | ADD        | <MASK> <DST>            | 3        |
     |                     | DOT        | <MASK> <REG_MASK> <DST> | 18       |
     |                     | MUL        | <SRC> <SRC> <DST>       | 18       |
     |                     | SUB        | <SRC> <SRC> <DST>       | 3        |
     | Moves (R/W)         | MOV        | <SRC> <DST>             | 3        |
     |                     | MOVS       | <SRC> <DST> <MASK>      | 3        |
     |                     | MOVI       | <SRC> <IMM>             | 1        |
     |                     | MOVG       | <GADDR> <GADDR>         | Variable |
     | Misc                | SHIFT{L/R} | <SRC> <SRC> <IMM>       | 3        |
     |                     | MASK       | <SRC> <SRC> <IMM>       | 3        |
     |                     | LUT        | <SRC> <SRC>             | 4        |
     |                     | REDUCE_SUM | <SRC> <GADDR>           | Variable |
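
To make the cycle column of the ISA slide concrete, here is a minimal sketch that adds up the cycles of a straight-line instruction sequence. It assumes the per-opcode costs as read from the slide (3 for ADD, 18 for DOT and MUL, and so on), assumes instructions execute strictly one after another, and uses a made-up assembly syntax (`r0`, `m0`, `SHIFTL`/`SHIFTR` as expansions of SHIFT{L/R}); none of these conventions come from the presentation. Costs for the two "Variable" opcodes must be supplied by the caller.

```python
# Fixed per-opcode cycle counts, as read from the ISA slide.
# SHIFTL/SHIFTR are assumed spellings of SHIFT{L/R}.
FIXED_CYCLES = {
    "ADD": 3, "DOT": 18, "MUL": 18, "SUB": 3,
    "MOV": 3, "MOVS": 3, "MOVI": 1,
    "SHIFTL": 3, "SHIFTR": 3, "MASK": 3, "LUT": 4,
}
# Opcodes whose latency the slide lists as "Variable".
VARIABLE_OPS = {"MOVG", "REDUCE_SUM"}


def estimate_cycles(program, variable_cycles=None):
    """Sum cycles for a list of instruction strings (hypothetical syntax).

    variable_cycles maps a variable-latency opcode to the cost the caller
    wants to assume for it, e.g. {"MOVG": 10}.
    """
    variable_cycles = variable_cycles or {}
    total = 0
    for line in program:
        opcode = line.split()[0].upper()
        if opcode in FIXED_CYCLES:
            total += FIXED_CYCLES[opcode]
        elif opcode in VARIABLE_OPS:
            if opcode not in variable_cycles:
                raise ValueError(f"no assumed cost given for {opcode}")
            total += variable_cycles[opcode]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return total


# Operand names below are placeholders for illustration only.
block = ["MOVI r0 5", "MUL r0 r1 r2", "ADD m0 r3"]
print(estimate_cycles(block))  # 22
```

A sequential sum like this is an upper bound for a single instruction block; it says nothing about overlap between blocks placed on different arrays.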

  16. Programming Model
     KEY OBSERVATION: Need a programming language that merges concepts of Data-Flow and SIMD for maximizing parallelism.
     • Data-Flow: explicit dataflow exposes Instruction Level Parallelism
     • SIMD: Data Level Parallelism
     • Side-effect free: no dependence on shared memory primitives

  17. Execution Model
     Figure: a Data Flow Graph operating on Input Matrix A and Input Matrix B.

  18. Execution Model
     Figure: the Data Flow Graph over Input Matrix A and Input Matrix B is decomposed by unrolling the innermost dimension.
     • Module: modularized execution flow, applied to the innermost dimension (DLP)

  19. Execution Model
     Figure: the decomposed Data Flow Graph of a Module is split into instruction blocks IB1 and IB2 (ILP).
     • Instruction Block (IB): partial execution sequence of a Module, mapped to a single array

  20. Execution Model
     • Module: modularized execution flow, applied to the innermost dimension
     • Instruction Block (IB): component of a Module, mapped to a single array
     Figure: each Module consists of IB1 and IB2; the IB1s and IB2s of the Modules are placed on ReRAM arrays.

  21. Execution Model
     • Data Flow Graphs: operate on Input Matrix A and Input Matrix B
     • Modules: modularized execution flow, applied to the innermost dimension
     • Instruction Block (IB): component of a Module, mapped to a single array
     Figure: Data Flow Graph, then Modules (each with IB1 and IB2), then IB1s and IB2s placed on ReRAM arrays.

  23. Compilation Flow
     • Frontends: Python, C++, Java
     • Input to the IMP Compiler: TensorFlow DFG (Protocol Buffer)
     • IMP Compiler stages: Semantic Analysis; Optimization (NodeMerging, IB Expansion, Pipelining); Instruction Lowering; IB Scheduling; CodeGen
     • Other labels: Target Machine Modeling, Backend

  25. Optimization 1: NodeMerging
     • Exploit multi-operand ADD/SUB
     • Reduce redundant writebacks
     Figure: example DFG (Placeholder, Add, and Reduce nodes) before and after NodeMerging, shown next to the compiler stages: Semantic Analysis; Optimization (NodeMerging, IB Expansion, Pipelining); Instruction Lowering; IB Scheduling; CodeGen.
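
The idea behind NodeMerging on slide 25, collapsing chained two-operand additions into one multi-operand ADD so that intermediate sums need not be written back, can be sketched on a toy expression tree. This is an illustrative assumption about how such a pass might look, not the IMP compiler's implementation; the tuple-based node encoding and the name `merge_adds` are hypothetical.

```python
def merge_adds(node):
    """Flatten nested binary 'add' nodes into one multi-operand 'add'.

    A node is either a leaf (a string naming an input) or a tuple
    (op, operand, operand, ...). Because the graph is a tree here, each
    intermediate sum has exactly one consumer, so merging is safe.
    """
    if not isinstance(node, tuple):
        return node  # leaf: nothing to merge

    op, *operands = node
    operands = [merge_adds(child) for child in operands]
    if op != "add":
        return (op, *operands)

    merged = []
    for child in operands:
        if isinstance(child, tuple) and child[0] == "add":
            merged.extend(child[1:])  # absorb the child's operands
        else:
            merged.append(child)
    return ("add", *merged)


# (a + b) + (c + d) becomes a single four-operand add.
tree = ("add", ("add", "a", "b"), ("add", "c", "d"))
print(merge_adds(tree))  # ('add', 'a', 'b', 'c', 'd')
```

In this toy form, three two-operand additions and two intermediate results become one multi-operand addition, which is the effect the slide attributes to exploiting multi-operand ADD/SUB.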

  26. Optimization 2: IB Expansion
     Expose more parallelism in a module to the architecture.
     Figure: example DFG (Placeholder, Unpack, Add, and Pack nodes) before and after IB Expansion, shown next to the compiler stages: Semantic Analysis; Optimization (NodeMerging, IB Expansion, Pipelining); Instruction Lowering; IB Scheduling; CodeGen.
