In-Memory Data Parallel Processor - PowerPoint PPT Presentation
Daichi Fujiki, Scott Mahlke, Reetuparna Das
M-Bits Research Group
2
Data-Parallel Applications: arithmetic vs. data communication
CPU: many core, SIMD, OoO | GPU: many thread, SIMT, SIMD
[Figure: relative cost of data movement vs. arithmetic, 1000x / 40x]
"Data movement is what matters, not arithmetic"
- Bill Dally
In-Memory Computing exposes parallelism while minimizing data movement cost
3
CPU / GPU / In-Memory: in-situ computing, massive parallelism
SIMD slots over dense memory arrays; high bandwidth / low data movement
In-Memory Computing – Reduces Data Movement
4
CPU / GPU / In-Memory
[Figure: 2x2 ReRAM crossbar. DACs drive rows at V1, V2; each cell draws I_ij = V_i * C_ij (I11 = V1*C11, I12 = V1*C12, I21 = V2*C21, I22 = V2*C22); bitlines sum I1 = I11 + I21, I2 = I12 + I22.]
In-situ computing, massive parallelism
5
                      CPU (2 sockets)       GPU               ReRAM
                      Intel Xeon E5-2697    NVIDIA TITAN Xp   Scaled from ISAAC*
Area (mm^2)           912.24                471               494
TDP (W)               290                   250               416
On-chip memory (MB)   78.96                 9.14              8,590
SIMD slots            448                   3,840             2,097,152
Freq (GHz)            3.6                   1.585             0.02
SIMD x Freq product   3,227                 6,086             41,953
In-Memory Computing – Exposes Parallelism
IN-MEMORY In-situ computing Massive parallelism
In-Memory Computing Today
6
[Figure: 2x2 ReRAM crossbar. DACs apply V1 and V2; Ohm's law gives per-cell currents I11 = V1*C11, I12 = V1*C12, I21 = V2*C21, I22 = V2*C22; Kirchhoff's law sums each bitline: I1 = I11 + I21, I2 = I12 + I22.]
Multiplication + Summation
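The crossbar figure reduces to a small numeric model: Ohm's law makes each cell a multiplier and Kirchhoff's law makes each bitline an adder. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
# Hypothetical model of an analog crossbar: cell conductances form a
# matrix C, input voltages a vector V. Ohm's law gives per-cell currents
# I_ij = V_i * C_ij; Kirchhoff's current law sums each column into I_j.

def crossbar_dot_product(voltages, conductances):
    """Return bitline currents I_j = sum_i V_i * C_ij."""
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            currents[j] += voltages[i] * conductances[i][j]  # Ohm + Kirchhoff
    return currents

# 2x2 example from the slide: I1 = V1*C11 + V2*C21, I2 = V1*C12 + V2*C22
print(crossbar_dot_product([1.0, 2.0], [[0.5, 0.25], [0.125, 0.5]]))  # [0.75, 1.25]
```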
ReRAM Dot-product Accelerator
- PRIME [Chi 2016, ISCA]
- ISAAC [Shafiee 2016, ISCA]
- Dot-Product Engine [Hu 2016, DAC]
- PipeLayer [Song 2017, HPCA]
In-Memory Computing – No Demonstration of General Purpose Computing
- No established programming model / execution model
- Limited computation primitives
How to program?
7
In-Memory Data Parallel Processor Overview
Microarchitecture ISA Execution Model Programming Model Compiler
HW SW
8
Memory ISA: ADD, MOVI, DOT, MOVG, MUL, SHIFT{L/R}, SUB, MASK, MOV, LUT, MOVS, REDUCE_SUM
[Figure: Data Flow Graph -> IMP Compiler -> ILP (Instruction Blocks IB1, IB2) and DLP (Modules) -> computation primitives on ReRAM crossbars]
Processor Architecture (9)
Information is stored in analog form (cell conductance C = 1/resistance)
A <-> CA (Read / Write)
Computation Primitives
10
[Figure: (a) Addition and (b) Subtraction* on a crossbar. Rows driven by DACs; IA = (Vdd/2)*CA by Ohm's law [mult]; I = IA + IB by Kirchhoff's law [add].]
* New primitive
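The addition and subtraction primitives can be sketched as signed current sums. The constants and function names below are illustrative, and modeling the subtraction primitive's complementary drive as a negated current is a simplification of mine:

```python
# Hedged sketch of the in-situ ADD/SUB primitives: driving both wordlines
# makes the bitline current the sum of the two Ohm's-law cell currents,
# so the stored values add (or subtract) without a separate read-out.

VDD = 1.0

def in_situ_add(c_a, c_b, vdd=VDD):
    """Bitline current for two cells on one bitline: I = vdd*c_a + vdd*c_b."""
    return vdd * c_a + vdd * c_b  # Kirchhoff sums the two currents

def in_situ_sub(c_a, c_b, vdd=VDD):
    """Subtraction primitive, modeled here as a signed current difference."""
    return vdd * c_a - vdd * c_b

print(in_situ_add(0.25, 0.5))   # 0.75
print(in_situ_sub(0.75, 0.25))  # 0.5
```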
Computation Primitives
11
(c) Dot-product: DACs apply VX and VY; Ohm's law gives IAX = VX*CA, IBY = VY*CB, ICX = VX*CC, IDY = VY*CD; Kirchhoff's law gives I1 = IAX + IBY, I2 = ICX + IDY
(d) Element-wise multiplication*: I11 = (Vdd - V1)*C11, I12 = (Vdd - V2)*C12
Dot-product:  (x y) [a b; c d] = (ax + cy, bx + dy)
Element-wise: (x y) . (a c) = (ax, cy)
* New primitive
Microarchitecture
12
[Figure: Cluster = LUT + many ReRAM PUs + Reg. File; Processing Unit = RRAM crossbar (XB), S+H, 2x DAC, 2x ADC, S+A, Reg; Router = Row Decoder + Shift & Add Unit]
Microarchitecture parameters (13):
Array size        128 x 128
R/W latency       50 ns
Multi-level cell  2 bits
ADC resolution    5 bits
ADC frequency     1.2 GSps
DAC resolution    2 bits
LUT size          256 x 8
8 PUs/array; 128 4B regs/PU; 512B x 8 register file; 8 ALUs
ISA (14)
Opcode        Format                      Cycles
ADD           <MASK> <DST>                3
DOT           <MASK> <REG_MASK> <DST>     18
MUL           <SRC> <SRC> <DST>           18
SUB           <SRC> <SRC> <DST>           3
MOV           <SRC> <DST>                 3
MOVS          <SRC> <DST> <MASK>          3
MOVI          <SRC> <IMM>                 1
MOVG          <GADDR> <GADDR>             Variable
SHIFT{L/R}    <SRC> <SRC> <IMM>           3
MASK          <SRC> <SRC> <IMM>           3
LUT           <SRC> <SRC>                 4
REDUCE_SUM    <SRC> <GADDR>               Variable
Categories: in-situ computation, moves, R/W, misc.
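One simple use of the cycle counts in the ISA table is a per-block latency estimate. A sketch (the dictionary mirrors the table; skipping the variable-latency ops MOVG and REDUCE_SUM is a simplification of mine):

```python
# Illustrative latency model for the memory ISA; cycle counts are copied
# from the table above, with variable-latency ops marked as None.
ISA_CYCLES = {
    "ADD": 3, "DOT": 18, "MUL": 18, "SUB": 3,
    "MOV": 3, "MOVS": 3, "MOVI": 1, "MOVG": None,
    "SHIFTL": 3, "SHIFTR": 3, "MASK": 3, "LUT": 4,
    "REDUCE_SUM": None,
}

def block_latency(ops):
    """Sum fixed cycle counts for an instruction block; variable-latency
    ops are skipped here as a simplification."""
    return sum(ISA_CYCLES[op] for op in ops if ISA_CYCLES[op] is not None)

print(block_latency(["MOVI", "DOT", "ADD"]))  # 1 + 18 + 3 = 22
```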
15
Programming Model (16)
Need a programming language that merges concepts of Data-Flow and SIMD to maximize parallelism.
KEY OBSERVATION
- Data-Flow: explicit dataflow exposes Instruction-Level Parallelism
- SIMD: exposes Data-Level Parallelism
- Side-effect free: no dependence on shared-memory primitives
Execution Model (17)
[Figure: Data Flow Graph over Input Matrix A and Input Matrix B]
Execution Model (18)
DLP: the Data Flow Graph is decomposed by unrolling the innermost dimension into Modules.
Module = modularized execution flow, applied to the innermost dimension.
Execution Model (19)
ILP: each Module's decomposed DFG is split into Instruction Blocks (IB1, IB2).
Instruction Block (IB) = partial execution sequence of a Module, mapped to a single array.
Execution Model (20)
The IBs of each Module are mapped onto ReRAM arrays.
Execution Model (21)
[Figure: Data Flow Graphs -> Modules -> IBs, mapped onto ReRAM arrays]
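The DLP/ILP decomposition can be sketched as pure bookkeeping: one Module per innermost-dimension element, each holding a fixed number of IBs. Names and the flat structure are illustrative, not the compiler's actual representation:

```python
# Sketch of the execution model: unroll the innermost dimension of a
# data-parallel op into Modules (DLP), then split each Module's DFG into
# Instruction Blocks (ILP) that each map to one ReRAM array.

def decompose(n_elements, ibs_per_module):
    """Return a list of modules; each module is a list of IB labels."""
    modules = []
    for m in range(n_elements):          # one Module per innermost element
        modules.append([f"M{m}-IB{i + 1}" for i in range(ibs_per_module)])
    return modules

mods = decompose(3, 2)
print(mods)  # [['M0-IB1', 'M0-IB2'], ['M1-IB1', 'M1-IB2'], ['M2-IB1', 'M2-IB2']]
```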
22
Compilation Flow (23)
Frontend: TensorFlow (Python / C++ / Java) -> DFG (Protocol Buffer)
IMP Compiler: Semantic Analysis -> Optimization (Node Merging, IB Expansion, Pipelining) -> Backend (Instruction Lowering, IB Scheduling, CodeGen), guided by Target Machine Modeling
Optimization 1: Node Merging (25)
[Figure: Placeholder inputs feed Add nodes (8, 8) and a Reduce (16); the Add + Reduce chain is merged into a single multi-operand add]
- Exploit multi-operand ADD/SUB
- Reduce redundant writebacks
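A hedged sketch of what a Node Merging pass might look like on a toy DFG representation. The tuple encoding and the ADD_MULTI opcode are my own illustration, not the compiler's actual IR:

```python
# Hypothetical DFG node-merging pass: an Add feeding a Reduce collapses
# into one multi-operand add, avoiding the intermediate writeback.

def merge_add_reduce(dfg):
    """dfg: list of (op, inputs, output) tuples. Fuse Add -> Reduce chains."""
    producers = {out: (op, ins) for op, ins, out in dfg}
    merged, consumed = [], set()
    for op, ins, out in dfg:
        if op == "Reduce" and len(ins) == 1:
            prod = producers.get(ins[0])
            if prod and prod[0] == "Add":
                # Fold the Add's operands directly into a multi-operand add.
                merged.append(("ADD_MULTI", list(prod[1]), out))
                consumed.add(ins[0])
                continue
        merged.append((op, ins, out))
    # Drop producers whose outputs were absorbed by the fusion.
    return [n for n in merged if n[2] not in consumed]

dfg = [("Add", ["a", "b"], "t0"), ("Reduce", ["t0"], "r")]
print(merge_add_reduce(dfg))  # [('ADD_MULTI', ['a', 'b'], 'r')]
```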
Optimization 2: IB Expansion (26)
[Figure: packed operands (3 2 ..., 5 6 ...) are unpacked, added in parallel, and re-packed (8 8 ...)]
IB Expansion exposes more parallelism in a module to the architecture.
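IB Expansion's unpack / parallel-add / pack flow, using the slide's operand values, might look like the following sketch (the tuple packing model is illustrative):

```python
# Sketch of IB Expansion: a module that adds packed pairs is "unpacked"
# into independent scalar adds that could run in parallel, then re-packed.

def ib_expand(packed_a, packed_b):
    """Unpack two packed operand tuples, add element-wise in independent
    lanes (modeled here as a comprehension), and pack the results."""
    lanes = [a + b for a, b in zip(packed_a, packed_b)]  # independent IBs
    return tuple(lanes)                                   # pack

# Slide example: (3, 5) + (2, 6) unpacks into 3+2 and 5+6 in parallel.
print(ib_expand((3, 5), (2, 6)))  # (5, 11)
```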
Compilation Flow (28)
Compiler Backend - Instruction Lowering (29)
Instruction Lowering: transform high-level TensorFlow instructions into the memory ISA (e.g., Add, Mul, LUT, ...).
Supported TF operation nodes: Add, Sub, Mul, Div, Sqrt, Exp, Sum, Conv2D, Less, ...
Division Algorithm (Newton-Raphson / Maclaurin): q = a / b
1. t0 = LUT(b)        (approximate 1/b)
2. q0 = a * t0
3. e0 = 1 - b * t0
4. q1 = q0 + e0 * q0
5. e1 = e0^2
6. q2 = q1 + e1 * q1
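The lowered division can be checked numerically: the LUT seeds t0 ≈ 1/b, and the correction steps apply the Maclaurin expansion 1/b = t0(1 + e0)(1 + e0²)··· with e0 = 1 − b·t0. A sketch, where the rounding-based LUT model is my own stand-in:

```python
# Sketch of the lowered division q = a/b via Newton-Raphson / Maclaurin.

def lut_reciprocal(b):
    """Stand-in for the hardware LUT: a coarse, low-precision 1/b seed."""
    return round(1.0 / b, 1)   # illustrative precision, not the real LUT

def imp_divide(a, b):
    t0 = lut_reciprocal(b)     # 1. t0 = LUT(b) ~ 1/b
    q0 = a * t0                # 2. q0 = a * t0
    e0 = 1.0 - b * t0          # 3. e0 = 1 - b*t0
    q1 = q0 + e0 * q0          # 4. q1 = q0 * (1 + e0)
    e1 = e0 * e0               # 5. e1 = e0^2
    q2 = q1 + e1 * q1          # 6. q2 = q1 * (1 + e1)
    return q2

print(imp_divide(1.0, 3.0))    # close to 0.3333...
```

Two correction steps already recover far more precision than the seed: the error shrinks roughly as e0⁴.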
Compiler Backend - IB Scheduling (30)
With a target of 1 IB, a single IB1 yields large execution time.
Compiler Backend - IB Scheduling (31)
With a target of 2 IBs, the split into IB1 and IB2 can be good (more parallelism) or bad (network delay).
Compiler Backend - IB Scheduling (32)
Bottom-Up Greedy [Ellis 1986]
- Collect candidate assignments
- Make final assignments
- Minimize data transfer latency by taking both operand & successor locations into consideration
IB scheduling examples (33-36): the scheduler assigns IBs to arrays over time.
- IB1 is chosen when it is closer to the operand locations.
- IB2 is chosen when earlier slots are available.
- IB1 is chosen when it gives better overlap of communication and computation.
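A toy version of the Bottom-Up Greedy trade-off between operand locality and slot availability; the cost model, weights, and names are invented for illustration only:

```python
# Sketch of the Bottom-Up Greedy (BUG) flavor of IB scheduling [Ellis 1986]:
# score each candidate array by operand-transfer distance plus its earliest
# free slot, and greedily pick the minimum.

def schedule_ibs(ibs, arrays):
    """ibs: list of (name, operand_home). arrays: dict name -> free_at.
    Cost = network hops (0 if operands are local, else 1) + wait time."""
    placement = {}
    for name, operand_home in ibs:
        best = min(
            arrays,
            key=lambda a: (0 if a == operand_home else 1) + arrays[a],
        )
        placement[name] = best
        arrays[best] += 2          # each scheduled IB delays that array
    return placement

# IB1 lands on A0 (operand locality); IB2 prefers A1 (earlier free slot).
print(schedule_ibs([("IB1", "A0"), ("IB2", "A0")], {"A0": 0, "A1": 0}))
```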
Evaluation Methodology
Benchmarks
- PARSEC 3.0
− Blackscholes, Canneal, Fluidanimate
- Rodinia
− Backprop, Hotspot, Kmeans, Streamcluster
Methodology (37)
                    CPU (2 sockets)          GPU (1 card)             IMP Processor
Hardware            Intel Xeon E5-2697 v3,   NVIDIA Titan Xp,         20 MHz ReRAM, 4,096 tiles,
                    3.6 GHz, 28 cores,       1.6 GHz, 3,840           64 ReRAM PUs / tile
                    56 threads               CUDA cores
On-chip memory      78.96 MB                 9.14 MB                  8,590 MB
Off-chip memory     64 GB DRAM               12 GB DRAM               -
Profiler/Simulator  Intel VTune Amplifier    NVPROF                   Cycle-accurate simulator
(Performance)                                                         (BookSim integrated)
Profiler/Simulator  Intel RAPL interface     NVIDIA System            Trace-based simulation
(Power)                                      Management Interface
Offloaded Kernel / Application Speedup (CPU)
- The capacity limitation of IMP sets the upper bound on performance improvement.
38
[Charts: application speedup (normalized execution time) up to 7.5x; offloaded kernel speedup up to 41x over CPU]
Kernel Speedup (GPU) (39)
[Chart: kernel speedup up to 763x over GPU]
- GPU benchmarks are able to exploit higher DLP, dot-product operations, and multi-row addition.
Summary
Microarchitecture ISA Execution Model Programming Model Compiler
HW SW
Contributions: an in-memory computing stack for general-purpose programming
- Used TensorFlow as the programming frontend
- Developed a compiler for in-memory computing on ReRAM
- Developed the ISA and computation primitives
Results
- 763x speedup and 440x energy efficiency over a server-class GPGPU
40