Deep In-memory Architectures for NAND Flash Memory Sujan K - - PowerPoint PPT Presentation



SLIDE 1

Deep In-memory Architectures for NAND Flash Memory

Sujan K Gonugondla, Mingu Kang, Yongjune Kim, Naresh Shanbhag University of Illinois at Urbana-Champaign Mark Helm, Sean Eilert Micron Technology, Inc.


SLIDE 2

Machines are Beating Humans at Complex Inference Tasks

  • The game of Go is complex → huge search space: ~250^150 (Go) vs. ~35^80 (Chess)
  • AlphaGo machine: 1202 CPUs + 176 GPUs
  • HUGE energy (and latency) cost: ~10,000× more than the human brain
  • Critical issue at the Edge – IoT, wearables, autonomous systems

The Economist, March 2016

SLIDE 3

Energy Cost - Memory Access vs. Computation

[Horowitz, ISSCC'14]

Computation energy (45nm):

  Integer ADD:  0.03 pJ (8-b), 0.1 pJ (32-b)
  Integer MULT: 0.2 pJ (8-b), 3 pJ (32-b)

Memory energy (45nm), per 64-b access:

  Cache 8 KB: 10 pJ   Cache 32 KB: 20 pJ   Cache 1 MB: 100 pJ   DRAM: 1.2 – 2.6 nJ

*Post-layout simulations with SRAM + synthesized logic in 65nm CMOS

ML kernel-level energy breakdown for a dot product in TM (8-b operands) [Kang, Shanbhag]:

  E_mem / E_comp ≈ 10× to 100× (in SRAMs)
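For concreteness, a back-of-the-envelope comparison using the numbers above; pro-rating the 64-b cache-access energy down to 8-b operands is an assumption made here for illustration:

```python
# Back-of-the-envelope energy budget for one 8-b multiply-accumulate (MAC)
# whose operands come from an 8 KB cache, using the 45nm numbers quoted above.
E_ADD_8B = 0.03e-12       # J, 8-bit integer add
E_MULT_8B = 0.2e-12       # J, 8-bit integer multiply
E_CACHE_8KB_64B = 10e-12  # J, one 64-bit access to an 8 KB cache

e_compute = E_MULT_8B + E_ADD_8B           # one MAC
e_memory = 2 * (8 / 64) * E_CACHE_8KB_64B  # two 8-b operands, pro-rated per bit

print(f"compute: {e_compute*1e12:.2f} pJ, memory: {e_memory*1e12:.2f} pJ")
print(f"memory/compute ratio: {e_memory/e_compute:.1f}x")
```

Even with the smallest cache, fetching the operands costs about 10× the arithmetic, consistent with the 10×–100× range quoted for SRAMs.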

SLIDE 4

Energy Costs in the Memory Hierarchy

[Figure: memory hierarchy – CPU ↔ DRAM (DDR) ↔ NAND flash SATA SSD (PCIe/SATA, via PCH) ↔ HDD – comparing cost per bit ($), density (MB/mm²), energy per bit (pJ), access time (ns), and bandwidth for SRAM, DRAM, and flash]

  • SRAM vs. compute: ≈ 10×–100×
  • DRAM vs. compute: ≈ 200× [Sze, ISSCC'16]
  • Flash vs. compute: ≈ 1000×

[Yang, J. Joshua, Nature Nano, 2013]

Relative Energy Costs

Key Question: How to reduce memory access costs?

SLIDE 5


https://spectrum.ieee.org/computing/hardware/to-speed-up-ai-mix-memory-and-processing

Proposed Solution

Deep In-memory Architecture (DIMA)

“breaching the memory wall”

SLIDE 6

The Deep In-memory Architecture (DIMA)


[Figure: DIMA organization – X-DEC, precharge/column mux/Y-DEC, per-column bitline processors (BLPs), cross bitline processor (CBLP), and residual digital unit]

An analog, mixed-signal, low-SNR fabric with four stages:

  • multi-row functional READ: reads multiple bits/column/precharge
  • bitline processing (BLP): SIMD analog processing
  • cross bitline processing (CBLP): analog averaging enhances SNR
  • residual digital unit: low-complexity, low (decision) rate digital output of inferences/decisions

[Kang, et al., ICASSP 2014] [Kang, et al., JSSC 2018]

SLIDE 7

Functional Read (FR) – Voltage Mode (for SRAM)

[Figure: 6T SRAM column storing a B-bit word d3 d2 d1 d0 column-major (column-major word); wordlines WL0–WL3 pulsed through access transistors T0–T3 onto BL/BLB during a single precharge (Precharge_b)]

Functional read: the bitline discharge is a binary-weighted sum of the stored bits (reconstructed relation; ΔV_lsb is the per-LSB discharge):

  ΔV_BL = ΔV_lsb · Σ_{i=0}^{B−1} 2^i · d_i

  • binary-weighted WL access-pulse widths: T_i ∝ 2^i
  • single FR → vector inner product
  • multi-row access per precharge
  • PWM, PAM, PWAM access pulses

Per-column dot-product: ΔV_BL ∝ the multi-bit word stored in a column (multiple bits per column).
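A behavioral sketch (illustrative names and values, not the papers' code) of how a functional read maps per-column bit-planes to analog word values, and how one FR then yields a vector inner product:

```python
import numpy as np

def functional_read(bits, dv_lsb=1e-3):
    """Model a multi-row functional read: each column discharges its
    bitline by a binary-weighted sum of the bits stored in it.
    bits: (B, N) array, row i = bit-plane i (LSB first) of N columns."""
    B, N = bits.shape
    weights = 2.0 ** np.arange(B)  # access-pulse widths T_i ∝ 2^i
    return dv_lsb * weights @ bits  # ΔV_BL per column ∝ stored word

rng = np.random.default_rng(0)
words = rng.integers(0, 16, size=8)                   # 4-b words, one per column
bits = (words[None, :] >> np.arange(4)[:, None]) & 1  # LSB-first bit-planes
dv = functional_read(bits)                            # one FR reads all 8 words

x = rng.random(8)
print("inner product via FR voltages:", x @ (dv / 1e-3))  # equals x · words
```

The point of the model: a single precharge cycle produces N analog word values in parallel, so the dot product with a broadcast input x needs only per-column analog multiplies afterward.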

SLIDE 8

Bit-line Analog Processors

[Figure: BLP column circuit on the BL/BLB pair with replica cell and MAX block (∆VBL ∝ D+P, ∆VBLB ∝ D+P), and the charge-sharing CBLP summing BL0…BL255 under two-phase clocks ø1/ø2]

Example (Manhattan distance): MD(x, w) = Σ_i |x_i − w_i|

  Kernel               BLP               CBLP
  Manhattan distance   subtract-compare  aggregation
  Euclidean distance   subtract-square   aggregation
  Dot product          multiply          weighted aggregation
  Hamming distance     XOR               aggregation

BLP: column pitch-matched bitline processors (e.g., subtractor) laid out on the 6T bit-cell pitch (0.915 µm; BLP ≈ 15.4 µm × 2.11 µm)
CBLP: charge-redistribution-based cross-bitline aggregation – these kernels dominate machine-learning workloads
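The kernel table maps directly onto array operations; a digital reference (illustrative only, ignoring the analog circuit details) of the per-element BLP op followed by the CBLP aggregation:

```python
import numpy as np

# Digital reference for the BLP (per-element op) + CBLP (aggregation) kernels.
def manhattan(x, w):  return np.sum(np.abs(x - w))   # subtract-compare, aggregate
def euclidean2(x, w): return np.sum((x - w) ** 2)    # subtract-square, aggregate
def dot(x, w):        return np.sum(x * w)           # multiply, weighted aggregate
def hamming(x, w):    return int(np.sum(x != w))     # XOR, aggregate (binary data)

x = np.array([3, 1, 4, 1])
w = np.array([2, 7, 1, 8])
print(manhattan(x, w), euclidean2(x, w), dot(x, w))  # 17 95 25

b1 = np.array([1, 0, 1, 1]); b2 = np.array([1, 1, 0, 1])
print(hamming(b1, b2))  # 2
```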

SLIDE 9

SRAM DIMA IC Prototypes

SVM, TM, k-NN, and MF kernels; MIT-CBCL, MNIST, ..; energy savings = 10×; EDP reduction = 50× [JSSC January 2018]

[Figure: test block – TRAINER, multi-functional SRAM bitcell array, normal R/W interface, BLP-CBLP, CTRL, ADC, 64b bus]

[ESSCIRC 2017, JSSC (special issue) May 2018]: FIRST random forest IC – RF with 64 trees; KUL traffic-sign dataset; energy savings = 3×; EDP reduction = 7×; 16kB standard 6T SRAM in 65nm CMOS

[ISSCC 2018, JSSC special issue (Invited)]: on-chip learning – 8-b, 128-dim SVM; MIT-CBCL dataset; SGD-based learning; energy savings = 21×; EDP reduction = 100×

iso-accuracy comparisons w.r.t. a post-layout 8-b digital processor & measured SRAM read energy

SLIDE 10

ISCAS 2018 - Migrate DIMA into Flash

Challenges in flash (relative to SRAM) → DIMA solutions:

  • NAND flash bit cell is far smaller than an SRAM cell → multi-column processing
  • Much larger BL capacitances → use current sensing
  • High V_T variations → use high-dimensional vector processing
  • Slower logic devices → use mixed-signal analog circuits

SLIDE 11

NAND Flash-based DIMA

  • Multi-column functional read: converts a stored word into an output voltage
  • Multi-BL Processor (MBLP): performs scalar mixed-signal multiplication
  • Cross BL Processing (CBLP): aggregation via charge sharing

[Figure: 128k × 192k SLC NAND flash array with X-Dec. & pulse gen., WL drivers, a 16kB input/weights register feeding 16k multipliers (BL processors) via WBL[0]…WBL[255], a cross BL processor (CBLP), and an A/D at the output of the analog processor]

SLIDE 12

MC Functional Read – Current Mode

[Figure: NAND strings on BL0–BL3 between BSL and SSL; WL<63:0> driven with Vread on the selected wordline and Vpass on the rest; string currents steered through SEL0–SEL3 switches onto a shared output node (OUT, capacitance C_OUT) precharged via PCH]

  • Current sensing – the BL is not discharged
  • Binary word stored horizontally (one bit per BL)
  • Time-modulated SEL signals

Output voltage (reconstructed relation; I_cell is the on-cell string current, T_sel the unit SEL pulse width):

  ΔV_OUT = (I_cell / C_OUT) · Σ_{i=0}^{3} 2^i · T_sel · d_i  ∝ the 4-b word (d3 d2 d1 d0)

[Timing diagram: precharge phase (PCH), then an evaluation phase with BSL/SSL asserted and binary-weighted SEL0–SEL3 pulses]
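A behavioral sketch of the current-mode read above; the parameter values are assumptions for illustration (ideal current sources, no V_T variation):

```python
# Current-mode functional read: each selected cell sources a fixed current
# for a binary-weighted time, integrating charge onto a shared capacitor.
I_CELL = 1e-6   # A, on-cell string current (assumed)
C_OUT = 1e-12   # F, output capacitance (assumed)
T_SEL = 1e-9    # s, unit SEL pulse width (assumed)

def delta_v_out(word_bits):
    """word_bits: [d0, d1, d2, d3], LSB first; returns ΔV_OUT in volts."""
    charge = sum(I_CELL * (2 ** i) * T_SEL * d for i, d in enumerate(word_bits))
    return charge / C_OUT

# ΔV_OUT is proportional to the stored 4-b value (1 mV/LSB with these numbers):
for value in (0, 5, 15):
    bits = [(value >> i) & 1 for i in range(4)]
    print(value, delta_v_out(bits))
```

Because the BL is sensed in current mode rather than discharged, the same precharge supports many binary-weighted SEL pulses, which is what makes the horizontal (per-BL) word layout workable.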

SLIDE 13

Simulation Methodology

[Figure: simulation methodology – SPICE simulations (BSIM Level-49 device model) are compared against system simulations (behavioral model) for model verification; extracted model parameters feed NAND flash energy models, BLP energy, and delay into system-level energy and throughput estimation]

  • behavioral models capture V_T variations, inter-cell interference (ICI), pattern dependency, and read/program disturb
  • energy and delay models: estimated from circuit simulations & analysis

Challenge: need to reflect device/process non-idealities at the system level.
Solution: use device models + array parameters to estimate energy, delay, and behavior.
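The behavioral level of the methodology can be sketched as a toy Monte-Carlo; the multiplicative current-variation model and sigma values here are assumptions for illustration, not the paper's calibrated models:

```python
import numpy as np

# Toy Monte-Carlo: inject cell-current spread (a stand-in for V_T variation)
# into a 4-b current-mode functional read and measure decode accuracy.
rng = np.random.default_rng(0)

def noisy_read(value, sigma, trials=10_000):
    bits = np.array([(value >> i) & 1 for i in range(4)], dtype=float)
    weights = 2.0 ** np.arange(4)
    # each cell's current deviates multiplicatively: I = I0 * (1 + N(0, sigma))
    gains = 1.0 + sigma * rng.standard_normal((trials, 4))
    readout = (gains * bits) @ weights         # analog word value per trial
    return np.mean(np.rint(readout) == value)  # fraction decoded correctly

for sigma in (0.01, 0.05, 0.2):
    print(sigma, noisy_read(10, sigma))
```

Running this shows why high-dimensional vector processing helps: individual reads degrade smoothly with variation, and aggregation across many columns averages the error out.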

SLIDE 14

Architectural Set-up


  • 32nm node; 16kB/page; 64 pages/block; 3000 blocks/plane; 4 planes/IC
  • 200×320 8-b images; one image stored in 4 pages across the 4 planes
  • I/O limited to 800MB/s (ONFI 4)

[Figure: four NAND flash planes, each computing a dot product over its own 16kB page, sharing one IO; the input vector is broadcast to all planes]
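A quick sanity check of the image-to-page mapping above; the per-image I/O time is derived here from the stated ONFI 4 limit, not quoted from the slide:

```python
# One 200x320 8-b image spread across 4 planes, one 16 kB page per plane.
rows, cols, bytes_per_pixel = 200, 320, 1
page_bytes = 16 * 1024  # 16 kB/page
planes = 4

image_bytes = rows * cols * bytes_per_pixel  # 64,000 bytes
pages_needed = image_bytes / page_bytes      # ~3.9 -> fits in 4 pages
print(image_bytes, pages_needed)

# At the ONFI 4 I/O limit, merely moving one image off-chip takes:
io_bandwidth = 800e6  # bytes/s
print(f"I/O time per image: {image_bytes / io_bandwidth * 1e6:.0f} us")
```

This is the motivation for computing the dot product inside each plane: the 800MB/s interface, not the array, bounds a conventional read-then-compute flow.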

SLIDE 15

Machine Learning Applications

Face detection via linear SVM:

  • Caltech 101 dataset
  • input buffer stores the weights
  • decision rule: wᵀx + b ≥ 0 → face; else → no face

Face recognition via k-NN:

  • extended Yale B dataset (2336 test images; 28 classes)
  • input buffer stores the reference (query) image
  • “Based on the 3 closest images, pick person 1”

[Figure: 320×200 query image (x) matched against the face database (W) of persons 1–28 stored in NAND flash]
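Both decision rules are thin layers on top of the in-memory kernels; a digital sketch with synthetic data (dimensions, names, and data are illustrative, not the datasets above):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1000  # feature dimension (illustrative; a real image is 200*320)

# Linear-SVM face detection: sign of a dot product plus bias.
w = rng.standard_normal(D)  # trained weights (stored in flash)
b = -0.5
def is_face(x):
    return float(w @ x) + b >= 0

# k-NN face recognition over a 28-person database, 10 images per person.
means = rng.standard_normal((28, D))  # per-person cluster centers
db = np.repeat(means, 10, axis=0) + 0.1 * rng.standard_normal((280, D))
labels = np.repeat(np.arange(28), 10)

def knn_person(x, k=3):
    d = np.sum((db - x) ** 2, axis=1)  # Euclidean-distance kernel (CBLP aggregate)
    top = labels[np.argsort(d)[:k]]    # labels of the k closest images
    return int(np.bincount(top).argmax())  # majority vote among top-k

query = means[7] + 0.1 * rng.standard_normal(D)
print(knn_person(query))  # 7
```

The distance computation against every stored image is exactly what the per-plane kernels accelerate; only the tiny argsort/vote remains digital.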
SLIDE 16

[Plots: detection accuracy (P_det) and PSNR vs. normalized threshold-voltage variation σ_VT/ΔV_T, for SVM and k-NN]

Simulation Results - Accuracy

  • Detection accuracy is robust to V_T variations in the typical range
  • SVM accuracy: 92%; k-NN accuracy (top-3): 95%

SLIDE 17

Energy & Throughput Benefits

Throughput enhancement: 15× (SSD level)

[Figure: normalized energy (breakdown into compute, I/O, string current, word-line, and bit-line energy) for conventional read-out vs. DIMA – 8.3× reduction for a single NAND IC (4 planes/IC: planes + MBLP/CBLP + IO) and 23× for an SSD (16 ICs/package: NAND ICs + controller + host I/O)]

SLIDE 18

Summary & Future Work

  • Deep In-memory Architecture – energy (23×) and throughput (15×) enhancement for SSDs
  • DIMA in other technologies – FD-SOI, MRAM, eDRAM, DRAM, emerging devices (e.g., RRAM)
  • Scaled-up DIMA – multi-bank architectures for DNNs
  • Robustifying DIMA – using Shannon-inspired statistical error compensation, on-chip learning [ISSCC 2018]
  • Programmable DIMA – programming models, compilers (with Adve, Kim (UIUC)) [ISCA 2018]
  • Inference algorithms for DIMA – analog data flow
  • DIMA physical compilers – automated synthesis of DIMA cores

SLIDE 19

Acknowledgements

This work was supported in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.


http://shanbhag.ece.illinois.edu