What can in-memory computing deliver, and what are the barriers? - PowerPoint PPT Presentation

What can in-memory computing deliver, and what are the barriers? Naveen Verma (nverma@princeton.edu), L.-Y. Chen, H. Jia, M. Ozatay, Y. Tang, H. Valavi, B. Zhang, J. Zhang March 20 th , 2019

The memory wall • Separating memory from compute fundamentally raises a communication cost Energy per Access 64b Word (pJ) MULT (FP32): 5pJ MULT (INT32): 3pJ MULT (INT8): 0.3pJ MULT (INT4): 0.1pJ Memory Size ( ! ) More data → bigger array → larger comm. distance → more comm. energy

So, we should amortize data movement • Reuse accessed data for compute • Specialized (memory-compute integrated) operations architectures ! = . × , ⃗ & "," … & ",) ! " , " Memory Compute ⋮ ⋮ ⋱ ⋮ ⋮ = Bound Bound ! $ & $," ⋯ & $,) , ) Processing Comp. Intensity E MEM = Element (PE) OPS/W COMP

In-memory computing (IMC) ( $,$ … ( $,+ # $ . $ # = 0. ⃗ ⇒ ⋮ ⋮ ⋱ ⋮ ⋮ = # & ( &,$ ⋯ ( &,+ . + • In SRAM mode, matrix A stored in bit cells row-by-row • In IMC mode, many WLs driven simultaneously ! → amortize comm. cost inside array • Can apply to diff. mem. Technologies → enhanced scalability → embedded non-volatility IMC Mode SRAM Mode " [J. Zhang, VLSI ’16][J. Zhang, JSSC ’17]

The basic tradeoffs CONSIDER: Accessing ! bits of data associated with computation, from array with ! columns ⨉ ! rows. Metric Traditional In-memory Traditional IMC 1/D 1/ 1/2 Bandwidth 1/ 1 D 1 Latency Memory & Memory Computation ( D 1/2 × D 1/2 array) D 3/ 3/2 2 Energy ~D ~D ( D 1/2 × D 1/2 array) 1 ~1/D 1/ ~1 1/2 SNR D 1/2 • IMC benefits energy/delay at cost of SNR Computation • SNR-focused systems design is critical (circuits, architectures, algorithms)

IMC as a spatial architecture ! " = $! Data Movement: 1. % &,( ′* broadcast min. distance 2. (Many) + ,,& ′* stationary 3. High-dynamic-range analog - ,,( ′* due to high-density bit cells in high-density bit cells computed in distributed manner

IMC as a spatial architecture PRE PRE PRE PRE b11 Assume: [3] a11 a11 a11 a11 • 1k dimensionality [3] [2] [1] [0] b21 • 4-b multiplies [3] a12 a12 a12 a12 [3] [2] [1] [0] • 45nm CMOS c11(2 3 ) c11(2 2 ) c11(2 1 ) c11(2 0 ) Operation Digital-PE Energy (fJ) Bit-cell Energy (fJ) Storage 250 Multiplication 100 50 Accumulation 200 Communication 40 5 Total 590 55

Where does IMC stand today? • Potential for 10× higher efficiency & • Limited scale, robustness, configurability throughput 10e4 10e3 Valavi, Khwa, ISSCC’18, 65nm Lee, ISSCC’18, VLSI’18, 65nm 65nm On-chip Memory Size (kB) Yin, VLSI’17, 65nm Zhang, VLSI’16, 130nm IMC Normalized Throughput Bankman, Yuan, VLSI’18, 65nm Not IMC Moons, 10e2 ISSCC’18, 28nm Valavi, VLSI’18, 65nm Ando, VLSI’17, 65nm 10e3 ISSCC’17, Chen, ISSCC’16, (GOPS/mm 2 ) 28nm Jiang, 65nm Lee, ISSCC’18, 65nm VLSI’18, 65nm Gonug, ISSCC’18, 65nm Moons, ISSCC’17, 28nm 10 10e2 Ando, VLSI’17, 65nm Jiang, VLSI’18, Biswas, 65nm Shin, ISSCC’17, 65nm ISSCC’18, 65nm Biswas, ISSCC’18, 65nm 1 Zhang, Bankman, Yin, VLSI’17, 65nm VLSI’16, ISSCC’18, 28nm Khwa, ISSCC’18, 65nm 10 Chen, ISSCC’16, 65nm 130nm Gonug, ISSCC’18, 65nm Yuan, VLSI’18, 65nm 10e-2 10e-1 1 10 10e2 10e3 10e-2 10e3 10e-1 1 10 10e2 Energy Efficiency (TOPS/W) Energy Efficiency (TOPS/W)

Challenge 1: analog computation • Use analog circuits to ‘fit’ compute in bit cells ⟶ SNR limited by analog-circuit non-idealities Ideal transfer curve ⟶ Must be feasible/competitive @ 16/12/7nm 0.06 Δ V BL (V) 0.04 I-DAC 0.02 V BIAS V BIAS,O 1x 2x 16x Nominal transfer curve 0 X Offset X[0] X[1] X[4] 5 10 15 20 25 30 35 WL WLDAC Code M A,R M A CLASS_EN WL_RESET M D,R M D BLB BL Bit-cell replica Bit-cell [J. Zhang, VLSI ’16][J. Zhang, JSSC ’17]

Algorithmic co-design(?) • Chip-specific weight tuning • Chip-generalized weight tuning Trainer Parameters Training Inference $(&, (, ℒ) • • •• G g i • • WEAK classifier 1 • • • • • • • • • • • • • • • • • • WEAK classifier 2 • • • • • Feature 2 • • • • • • • • • Weighted • • • • • • Voter • • • • • • • • • • • • Feature 1 • • • • • • • • WEAK classifier K • [Z. Wang, TVLSI’ 15] • • • E.g.: BNN Model (applied to CIFAR-10) Classifier [Z. Wang, TCAS-I ’15] 100 90 -(&, $, ()| 0 L = |- − / 80 Accuracy 70 60 50 40 -(&, $)| 0 L = |- − / 30 20 10 1 2 3 4 5 6 7 8 9 10 Normalized MRAM cell standard dev. [B. Zhang, ICASSP 2019] [S. Gonu., ISSCC’ 18]

Challenge 2: programmability • Matrix-vector multiply is only 70-90% of operations ⟶ IMC must integrate in programmable, heterogenous architectures [B. Fleischer, VLSI ’18] General Matrix Multiply Single/few-word operands (~256 ⨉ 2300=590k elements) (traditional, near-mem. acceleration)

Challenge 3: efficient application mappings • IMC engines must be ‘virtualized’ ⟶ IMC amortizes MVM costs, not weight loading. But… ⟶ Need new mapping algorithms (physical tradeoffs very diff. than digital engines) Activation Accessing Weight Accessing ' " #,%,& • E DRAM→IMC /4-bit: 40pJ • E DRAM→IMC /4-bit: 40pJ ( N - I ⨉ J ⨉ K filters) • Reuse: .×0×1 (10-20 lyrs) • Reuse: 2×3 • E MAC,4-b : 50fJ • E MAC,4-b :50fJ )* +,,,- Memory Compute ( X ⨉ Y ⨉ Z input Bound Bound activations) (output activations) Reuse ≈ 1k

Path forward: charge-domain analog computing ~1.2fF metal capacitor (on top of bit cell) 1. Digital multiplication 2. Analog accumulation [H. Valavi, VLSI ’18]

2.4Mb, 64-tile IMC Moons, Bang, Ando, Bankman, Valavi, ISSCC’17 ISSCC’17 VLSI’17 ISSCC’18 VLSI’18 Technology 28nm 40nm 65nm 28nm 65nm Area ( !! " ) 1.87 7.1 12 6 17.6 0.8/0.8 Operating VDD 1 0.63-0.9 0.55-1 0.94/0.68/1.2 (0.6/0.5) Bit precision 4-16b 6-32b 1b 1b 1b on-chip Mem. 128kB 270kB 100kB 328kB 295kB Throughput 400 108 1264 400 (60) 18,876 (GOPS) TOPS/W 10 0.384 6 532 (772) 866 • 10-layer CNN demos for MNIST/CIFAR-10/SVHN at energies of 0.8/3.55/3.55 μJ/image • Equivalent performance to software implementation [H. Valavi, VLSI ’18]

Programmable IMC A To E 2 PROM To DRAM Controller 32b (data) (addr.) Data Mask (data/addr.) Memory Read/Write I/F 8 13 32 x 0 xb 0 Ext. Boot- <0> Bit Mem. I/F Sparsity/AND-logic Controller loader Cell Compute-In-Memory Row Decoder/ WL Drivers w2b Reshaping Buffer Unit (CIMU) Compute-In- • 590 kb Memory Array • 16 bank Program Data x (CIMA) Memory Memory 32b (128 kB) (128 kB) Config. Regs. APB Bus 32 <767> x 2303 xb 2303 <63> <255> & ABN & ABN ADC ADC 32 <192> <0> 8b AXI Bus Near-Mem. Near-Mem. Data Path Data Path <0> <31> CPU Config 32b Timers GPIO UART (RISC-V) DMA f(y = A x) 32 Tx Rx [H. Jia, arXiv :1811.04047]

Bit-scalable mixed-signal compute • SQNR different that standard integer compute 40 B x =8 30 20 N=2304, 2000, 1500, 10 1000, 500, 255 18 B x =4 SQNR (dB) 14 N=2304, 2000, 1500, 10 1000, 500, 255 6 6 B x =2 N=2304, 2000, 1500, 4 1000, 500, 255 2 2 3 4 5 6 7 8 B A [H. Jia, arXiv :1811.04047]

Development board To Host Processor

Design flow 2. Deep-learning Inference Libraries 1. Deep-learning Training Libraries (Python, MATLAB, C) (Keras) High-level network build (Python): Standard Keras libs: chip_mode = True Dense(units, ...) outputs = QuantizedConv2D(inputs, Conv2D(filters, kernel_size, ...) weights, biases, layer_params) ... outputs = BatchNormalization(inputs, layer_params) ... Custom libs: Function calls to chip (Python): (INT/CHIP quant.) chip.load_config(num_tiles, nb_input=4, QuantizedDense(units, nb_input=4, nb_weight=4, nb_weight=4) chip_quant=False, ...) chip.load_weights(weights2load) QuantizedConv2D(filters, kernel_size, nb_input=4, chip.load_image(image2load) nb_weight=4, chip_quant=False, ...) outputs = chip.image_filter() ... Embedded C: chip_command = get_uart_word(); QuantizedDense(units, nb_input=4, nb_weight=4, chip_config(); chip_quant=True, ...) load_weights(); load_image(); QuantizedConv2D(filters, kernel_size, nb_input=4, image_filter(chip_command); nb_weight=4, chip_quant=True, ...) read_dotprod_result(image_filter_command); ...

What can in-memory computing deliver, and what are the barriers? - PowerPoint PPT Presentation

What can in-memory computing deliver, and what are the barriers? Naveen Verma (nverma@princeton.edu), L.-Y. Chen, H. Jia, M. Ozatay, Y. Tang, H. Valavi, B. Zhang, J. Zhang March 20 th , 2019 The memory wall Separating memory from compute

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

SMART GOVERNMENT INVOICING: INVOICE PROCESSING PLATFORM LEAD. TRANSFORM. DELIVER LEAD. TRANSFORM.

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Personal SE Computer Memory Addresses C Pointers Computer Memory Organization Memory is a

Memory Memory processing is the ability to: Acquire (Short term memory) Manipulate

Memory Management Memory Manager Requirements Minimize primary memory access time

Virtual Memory and Virtual Memory and Demand Paging Demand Paging Virtual Memory Illustrated

Dynamic Memory Management 333 Dynamic Memory Management Process Memory Layout Process Memory

Memory Management Ideally programmers want memory that is large fast non

Memory Systems Design & Programming CMPE 310 Memory Address Decoding The processor can

UNIFIED MEMORY IN CUDA 6 MARK HARRIS NVIDIA CONFIDENTIAL Unified Memory Dramatically Lower

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

28.05.04 09:50 Memory Management The computer memory is a limited resource so the Memory

Improving Key Recovery to 784 and 799 rounds of Trivium using Optimized Cube Attacks Pierre-Alain

Unit 18 Field Programmable Gate Arrays (FPGAs) HARDWARE IMPLEMENTATION Implementing Logic

Memories and SRAM 1 Silicon Memories Why store things in silicon? Its fast!!!

Graphics and Visualization Yuriy Tymchuk (almost) Alain Plantec Guillaume Larcheveque What are

The Memory Hierarchy Today Storage technologies and trends Locality of reference

Instruction Set 2 Architecting a vocabulary for the HW INSTRUCTION SET OVERVIEW 3 Instruction

Lecture 28 of 41 Collision Handling Part 2 of 2: Dynamic Collision Response, Particle Systems

MLC/TLC NAND support: (new ?) challenges for the MTD/NAND subsystem Free Electrons - Embedded

What can in-memory computing deliver, and what are the barriers? - PowerPoint PPT Presentation

What can in-memory computing deliver, and what are the barriers? Naveen Verma (nverma@princeton.edu), L.-Y. Chen, H. Jia, M. Ozatay, Y. Tang, H. Valavi, B. Zhang, J. Zhang March 20 th , 2019 The memory wall Separating memory from compute

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

SMART GOVERNMENT INVOICING: INVOICE PROCESSING PLATFORM LEAD. TRANSFORM. DELIVER LEAD. TRANSFORM.

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Personal SE Computer Memory Addresses C Pointers Computer Memory Organization Memory is a

Memory Memory processing is the ability to: Acquire (Short term memory) Manipulate

Memory Management Memory Manager Requirements Minimize primary memory access time

Virtual Memory and Virtual Memory and Demand Paging Demand Paging Virtual Memory Illustrated

Dynamic Memory Management 333 Dynamic Memory Management Process Memory Layout Process Memory

Memory Management Ideally programmers want memory that is large fast non

Memory Systems Design &amp; Programming CMPE 310 Memory Address Decoding The processor can

UNIFIED MEMORY IN CUDA 6 MARK HARRIS NVIDIA CONFIDENTIAL Unified Memory Dramatically Lower

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

28.05.04 09:50 Memory Management The computer memory is a limited resource so the Memory

Improving Key Recovery to 784 and 799 rounds of Trivium using Optimized Cube Attacks Pierre-Alain

Unit 18 Field Programmable Gate Arrays (FPGAs) HARDWARE IMPLEMENTATION Implementing Logic

Memories and SRAM 1 Silicon Memories Why store things in silicon? Its fast!!!

Graphics and Visualization Yuriy Tymchuk (almost) Alain Plantec Guillaume Larcheveque What are

The Memory Hierarchy Today Storage technologies and trends Locality of reference

Instruction Set 2 Architecting a vocabulary for the HW INSTRUCTION SET OVERVIEW 3 Instruction

Lecture 28 of 41 Collision Handling Part 2 of 2: Dynamic Collision Response, Particle Systems

MLC/TLC NAND support: (new ?) challenges for the MTD/NAND subsystem Free Electrons - Embedded

Memory Systems Design & Programming CMPE 310 Memory Address Decoding The processor can