Deep In-memory Architectures for NAND Flash Memory Sujan K - - PowerPoint PPT Presentation



SLIDE 1

Deep In-memory Architectures for NAND Flash Memory

Sujan K Gonugondla, Mingu Kang, Yongjune Kim, Naresh Shanbhag University of Illinois at Urbana-Champaign Mark Helm, Sean Eilert Micron Technology, Inc.


SLIDE 2

Machines are Beating Humans at Complex Inference Tasks

  • The game of Go is complex → huge search space: ~250^150 (Go) vs. ~35^80 (Chess)
  • AlphaGo machine: 1202 CPUs + 176 GPUs
  • HUGE energy (and latency) cost: ~10,000× more than the human brain
  • Critical issue at the Edge – IoT, wearables, autonomous systems

The Economist, March 2016

SLIDE 3

Energy Cost - Memory Access vs. Computation

[Horowitz, ISSCC'14]

Computation energy (45nm):

  Integer ADD:  0.03 pJ (8-b), 0.1 pJ (32-b)
  Integer MULT: 0.2 pJ (8-b), 3 pJ (32-b)

Memory energy (45nm), per 64-b access:

  Cache 8 KB: 10 pJ   Cache 32 KB: 20 pJ   Cache 1 MB: 100 pJ   DRAM: 1.2 – 2.6 nJ

*Post-layout simulations with SRAM + synthesized logic in 65nm CMOS

ML kernel-level energy breakdown for a dot product in TM (8-b operands) [Kang, Shanbhag]:

  E_mem / E_comp ≈ 10× to 100× (in SRAMs)
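For concreteness, a back-of-the-envelope comparison using the numbers above; pro-rating the 64-b cache-access energy down to 8-b operands is an assumption made here for illustration:

```python
# Back-of-the-envelope energy budget for one 8-b multiply-accumulate (MAC)
# whose operands come from an 8 KB cache, using the 45nm numbers quoted above.
E_ADD_8B = 0.03e-12       # J, 8-bit integer add
E_MULT_8B = 0.2e-12       # J, 8-bit integer multiply
E_CACHE_8KB_64B = 10e-12  # J, one 64-bit access to an 8 KB cache

e_compute = E_MULT_8B + E_ADD_8B           # one MAC
e_memory = 2 * (8 / 64) * E_CACHE_8KB_64B  # two 8-b operands, pro-rated per bit

print(f"compute: {e_compute*1e12:.2f} pJ, memory: {e_memory*1e12:.2f} pJ")
print(f"memory/compute ratio: {e_memory/e_compute:.1f}x")
```

Even with the smallest cache, fetching the operands costs about 10× the arithmetic, consistent with the 10×–100× range quoted for SRAMs.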

SLIDE 4

Energy Costs in the Memory Hierarchy

[Figure: memory hierarchy – CPU ↔ DRAM (DDR) ↔ NAND flash SATA SSD (PCIe/SATA, via PCH) ↔ HDD – comparing cost per bit ($), density (MB/mm²), energy per bit (pJ), access time (ns), and bandwidth for SRAM, DRAM, and flash]

  • SRAM vs. compute: ≈ 10×–100×
  • DRAM vs. compute: ≈ 200× [Sze, ISSCC'16]
  • Flash vs. compute: ≈ 1000×

[Yang, J. Joshua, Nature Nano, 2013]

Relative Energy Costs

Key Question: How to reduce memory access costs?

SLIDE 5


https://spectrum.ieee.org/computing/hardware/to-speed-up-ai-mix-memory-and-processing

Proposed Solution

Deep In-memory Architecture (DIMA)

“breaching the memory wall”

SLIDE 6

The Deep In-memory Architecture (DIMA)


[Figure: DIMA organization – X-DEC, precharge/column mux/Y-DEC, per-column bitline processors (BLPs), cross bitline processor (CBLP), and residual digital unit]

An analog, mixed-signal, low-SNR fabric with four stages:

  • multi-row functional READ: reads multiple bits/column/precharge
  • bitline processing (BLP): SIMD analog processing
  • cross bitline processing (CBLP): analog averaging enhances SNR
  • residual digital unit: low-complexity, low (decision) rate digital output of inferences/decisions

[Kang, et al., ICASSP 2014] [Kang, et al., JSSC 2018]

SLIDE 7

Functional Read (FR) – Voltage Mode (for SRAM)

[Figure: 6T SRAM column storing a B-bit word d3 d2 d1 d0 column-major (column-major word); wordlines WL0–WL3 pulsed through access transistors T0–T3 onto BL/BLB during a single precharge (Precharge_b)]

Functional read: the bitline discharge is a binary-weighted sum of the stored bits (reconstructed relation; ΔV_lsb is the per-LSB discharge):

  ΔV_BL = ΔV_lsb · Σ_{i=0}^{B−1} 2^i · d_i

  • binary-weighted WL access-pulse widths: T_i ∝ 2^i
  • single FR → vector inner product
  • multi-row access per precharge
  • PWM, PAM, PWAM access pulses

Per-column dot-product: ΔV_BL ∝ the multi-bit word stored in a column (multiple bits per column).
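A behavioral sketch (illustrative names and values, not the papers' code) of how a functional read maps per-column bit-planes to analog word values, and how one FR then yields a vector inner product:

```python
import numpy as np

def functional_read(bits, dv_lsb=1e-3):
    """Model a multi-row functional read: each column discharges its
    bitline by a binary-weighted sum of the bits stored in it.
    bits: (B, N) array, row i = bit-plane i (LSB first) of N columns."""
    B, N = bits.shape
    weights = 2.0 ** np.arange(B)  # access-pulse widths T_i ∝ 2^i
    return dv_lsb * weights @ bits  # ΔV_BL per column ∝ stored word

rng = np.random.default_rng(0)
words = rng.integers(0, 16, size=8)                   # 4-b words, one per column
bits = (words[None, :] >> np.arange(4)[:, None]) & 1  # LSB-first bit-planes
dv = functional_read(bits)                            # one FR reads all 8 words

x = rng.random(8)
print("inner product via FR voltages:", x @ (dv / 1e-3))  # equals x · words
```

The point of the model: a single precharge cycle produces N analog word values in parallel, so the dot product with a broadcast input x needs only per-column analog multiplies afterward.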

SLIDE 8

Bit-line Analog Processors

[Figure: BLP column circuit on the BL/BLB pair with replica cell and MAX block (∆VBL ∝ D+P, ∆VBLB ∝ D+P), and the charge-sharing CBLP summing BL0…BL255 under two-phase clocks ø1/ø2]

Example (Manhattan distance): MD(x, w) = Σ_i |x_i − w_i|

  Kernel               BLP               CBLP
  Manhattan distance   subtract-compare  aggregation
  Euclidean distance   subtract-square   aggregation
  Dot product          multiply          weighted aggregation
  Hamming distance     XOR               aggregation

BLP: column pitch-matched bitline processors (e.g., subtractor) laid out on the 6T bit-cell pitch (0.915 µm; BLP ≈ 15.4 µm × 2.11 µm)
CBLP: charge-redistribution-based cross-bitline aggregation – these kernels dominate machine-learning workloads
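The kernel table maps directly onto array operations; a digital reference (illustrative only, ignoring the analog circuit details) of the per-element BLP op followed by the CBLP aggregation:

```python
import numpy as np

# Digital reference for the BLP (per-element op) + CBLP (aggregation) kernels.
def manhattan(x, w):  return np.sum(np.abs(x - w))   # subtract-compare, aggregate
def euclidean2(x, w): return np.sum((x - w) ** 2)    # subtract-square, aggregate
def dot(x, w):        return np.sum(x * w)           # multiply, weighted aggregate
def hamming(x, w):    return int(np.sum(x != w))     # XOR, aggregate (binary data)

x = np.array([3, 1, 4, 1])
w = np.array([2, 7, 1, 8])
print(manhattan(x, w), euclidean2(x, w), dot(x, w))  # 17 95 25

b1 = np.array([1, 0, 1, 1]); b2 = np.array([1, 1, 0, 1])
print(hamming(b1, b2))  # 2
```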

SLIDE 9

SRAM DIMA IC Prototypes

SVM, TM, k-NN, and MF kernels; MIT-CBCL, MNIST, ..; energy savings = 10×; EDP reduction = 50× [JSSC January 2018]

[Figure: test block – TRAINER, multi-functional SRAM bitcell array, normal R/W interface, BLP-CBLP, CTRL, ADC, 64b bus]

[ESSCIRC 2017, JSSC (special issue) May 2018]: FIRST random forest IC – RF with 64 trees; KUL traffic-sign dataset; energy savings = 3×; EDP reduction = 7×; 16kB standard 6T SRAM in 65nm CMOS

[ISSCC 2018, JSSC special issue (Invited)]: on-chip learning – 8-b, 128-dim SVM; MIT-CBCL dataset; SGD-based learning; energy savings = 21×; EDP reduction = 100×

iso-accuracy comparisons w.r.t. a post-layout 8-b digital processor & measured SRAM read energy

SLIDE 10

ISCAS 2018 - Migrate DIMA into Flash

Challenges in flash (relative to SRAM) → DIMA solutions:

  • NAND flash bit cell is far smaller than an SRAM cell → multi-column processing
  • Much larger BL capacitances → use current sensing
  • High V_T variations → use high-dimensional vector processing
  • Slower logic devices → use mixed-signal analog circuits

SLIDE 11

NAND Flash-based DIMA

  • Multi-column functional read: converts a stored word into an output voltage
  • Multi-BL Processor (MBLP): performs scalar mixed-signal multiplication
  • Cross BL Processing (CBLP): aggregation via charge sharing

[Figure: 128k × 192k SLC NAND flash array with X-Dec. & pulse gen., WL drivers, a 16kB input/weights register feeding 16k multipliers (BL processors) via WBL[0]…WBL[255], a cross BL processor (CBLP), and an A/D at the output of the analog processor]

SLIDE 12

MC Functional Read – Current Mode

[Figure: NAND strings on BL0–BL3 between BSL and SSL; WL<63:0> driven with Vread on the selected wordline and Vpass on the rest; string currents steered through SEL0–SEL3 switches onto a shared output node (OUT, capacitance C_OUT) precharged via PCH]

  • Current sensing – the BL is not discharged
  • Binary word stored horizontally (one bit per BL)
  • Time-modulated SEL signals

Output voltage (reconstructed relation; I_cell is the on-cell string current, T_sel the unit SEL pulse width):

  ΔV_OUT = (I_cell / C_OUT) · Σ_{i=0}^{3} 2^i · T_sel · d_i  ∝ the 4-b word (d3 d2 d1 d0)

[Timing diagram: precharge phase (PCH), then an evaluation phase with BSL/SSL asserted and binary-weighted SEL0–SEL3 pulses]
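A behavioral sketch of the current-mode read above; the parameter values are assumptions for illustration (ideal current sources, no V_T variation):

```python
# Current-mode functional read: each selected cell sources a fixed current
# for a binary-weighted time, integrating charge onto a shared capacitor.
I_CELL = 1e-6   # A, on-cell string current (assumed)
C_OUT = 1e-12   # F, output capacitance (assumed)
T_SEL = 1e-9    # s, unit SEL pulse width (assumed)

def delta_v_out(word_bits):
    """word_bits: [d0, d1, d2, d3], LSB first; returns ΔV_OUT in volts."""
    charge = sum(I_CELL * (2 ** i) * T_SEL * d for i, d in enumerate(word_bits))
    return charge / C_OUT

# ΔV_OUT is proportional to the stored 4-b value (1 mV/LSB with these numbers):
for value in (0, 5, 15):
    bits = [(value >> i) & 1 for i in range(4)]
    print(value, delta_v_out(bits))
```

Because the BL is sensed in current mode rather than discharged, the same precharge supports many binary-weighted SEL pulses, which is what makes the horizontal (per-BL) word layout workable.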

SLIDE 13

Simulation Methodology

[Figure: simulation methodology – SPICE simulations (BSIM Level-49 device model) are compared against system simulations (behavioral model) for model verification; extracted model parameters feed NAND flash energy models, BLP energy, and delay into system-level energy and throughput estimation]

  • behavioral models capture V_T variations, inter-cell interference (ICI), pattern dependency, and read/program disturb
  • energy and delay models: estimated from circuit simulations & analysis

Challenge: need to reflect device/process non-idealities at the system level.
Solution: use device models + array parameters to estimate energy, delay, and behavior.
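The behavioral level of the methodology can be sketched as a toy Monte-Carlo; the multiplicative current-variation model and sigma values here are assumptions for illustration, not the paper's calibrated models:

```python
import numpy as np

# Toy Monte-Carlo: inject cell-current spread (a stand-in for V_T variation)
# into a 4-b current-mode functional read and measure decode accuracy.
rng = np.random.default_rng(0)

def noisy_read(value, sigma, trials=10_000):
    bits = np.array([(value >> i) & 1 for i in range(4)], dtype=float)
    weights = 2.0 ** np.arange(4)
    # each cell's current deviates multiplicatively: I = I0 * (1 + N(0, sigma))
    gains = 1.0 + sigma * rng.standard_normal((trials, 4))
    readout = (gains * bits) @ weights         # analog word value per trial
    return np.mean(np.rint(readout) == value)  # fraction decoded correctly

for sigma in (0.01, 0.05, 0.2):
    print(sigma, noisy_read(10, sigma))
```

Running this shows why high-dimensional vector processing helps: individual reads degrade smoothly with variation, and aggregation across many columns averages the error out.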

SLIDE 14

Architectural Set-up


  • 32nm node; 16kB/page; 64 pages/block; 3000 blocks/plane; 4 planes/IC
  • 200×320 8-b images; one image stored in 4 pages across the 4 planes
  • I/O limited to 800MB/s (ONFI 4)

[Figure: four NAND flash planes, each computing a dot product over its own 16kB page, sharing one IO; the input vector is broadcast to all planes]
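A quick sanity check of the image-to-page mapping above; the per-image I/O time is derived here from the stated ONFI 4 limit, not quoted from the slide:

```python
# One 200x320 8-b image spread across 4 planes, one 16 kB page per plane.
rows, cols, bytes_per_pixel = 200, 320, 1
page_bytes = 16 * 1024  # 16 kB/page
planes = 4

image_bytes = rows * cols * bytes_per_pixel  # 64,000 bytes
pages_needed = image_bytes / page_bytes      # ~3.9 -> fits in 4 pages
print(image_bytes, pages_needed)

# At the ONFI 4 I/O limit, merely moving one image off-chip takes:
io_bandwidth = 800e6  # bytes/s
print(f"I/O time per image: {image_bytes / io_bandwidth * 1e6:.0f} us")
```

This is the motivation for computing the dot product inside each plane: the 800MB/s interface, not the array, bounds a conventional read-then-compute flow.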

SLIDE 15

Machine Learning Applications

Face detection via linear SVM:

  • Caltech 101 dataset
  • input buffer stores the weights
  • decision rule: wᵀx + b ≥ 0 → face; else → no face

Face recognition via k-NN:

  • extended Yale B dataset (2336 test images; 28 classes)
  • input buffer stores the reference (query) image
  • “Based on the 3 closest images, pick person 1”

[Figure: 320×200 query image (x) matched against the face database (W) of persons 1–28 stored in NAND flash]
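Both decision rules are thin layers on top of the in-memory kernels; a digital sketch with synthetic data (dimensions, names, and data are illustrative, not the datasets above):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1000  # feature dimension (illustrative; a real image is 200*320)

# Linear-SVM face detection: sign of a dot product plus bias.
w = rng.standard_normal(D)  # trained weights (stored in flash)
b = -0.5
def is_face(x):
    return float(w @ x) + b >= 0

# k-NN face recognition over a 28-person database, 10 images per person.
means = rng.standard_normal((28, D))  # per-person cluster centers
db = np.repeat(means, 10, axis=0) + 0.1 * rng.standard_normal((280, D))
labels = np.repeat(np.arange(28), 10)

def knn_person(x, k=3):
    d = np.sum((db - x) ** 2, axis=1)  # Euclidean-distance kernel (CBLP aggregate)
    top = labels[np.argsort(d)[:k]]    # labels of the k closest images
    return int(np.bincount(top).argmax())  # majority vote among top-k

query = means[7] + 0.1 * rng.standard_normal(D)
print(knn_person(query))  # 7
```

The distance computation against every stored image is exactly what the per-plane kernels accelerate; only the tiny argsort/vote remains digital.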
SLIDE 16

[Plots: detection accuracy (P_det) and PSNR vs. normalized threshold-voltage variation σ_VT/ΔV_T, for SVM and k-NN]

Simulation Results - Accuracy

  • Detection accuracy is robust to V_T variations in the typical range
  • SVM accuracy: 92%; k-NN accuracy (top-3): 95%

SLIDE 17

Energy & Throughput Benefits

Throughput enhancement: 15× (SSD level)

[Figure: normalized energy (breakdown into compute, I/O, string current, word-line, and bit-line energy) for conventional read-out vs. DIMA – 8.3× reduction for a single NAND IC (4 planes/IC: planes + MBLP/CBLP + IO) and 23× for an SSD (16 ICs/package: NAND ICs + controller + host I/O)]

SLIDE 18

Summary & Future Work

  • Deep In-memory Architecture – energy (23×) and throughput (15×) enhancement for SSDs
  • DIMA in other technologies – FD-SOI, MRAM, eDRAM, DRAM, emerging devices (e.g., RRAM)
  • Scaled-up DIMA – multi-bank architectures for DNNs
  • Robustifying DIMA – using Shannon-inspired statistical error compensation, on-chip learning [ISSCC 2018]
  • Programmable DIMA – programming models, compilers (with Adve, Kim (UIUC)) [ISCA 2018]
  • Inference algorithms for DIMA – analog data flow
  • DIMA physical compilers – automated synthesis of DIMA cores

SLIDE 19

Acknowledgements

This work was supported in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.


http://shanbhag.ece.illinois.edu