Deep In-memory Architectures for NAND Flash Memory
Sujan K Gonugondla, Mingu Kang, Yongjune Kim, Naresh Shanbhag University of Illinois at Urbana-Champaign Mark Helm, Sean Eilert Micron Technology, Inc.
Machines are Beating Humans at
The Economist, March 2016
[Horowitz, ISSCC'14]
Computation energy (45nm)
Integer    ADD        Mult
8 bits     0.03 pJ    0.2 pJ
32 bits    0.1 pJ     3 pJ
Memory energy (45nm)
Memory (64 bits)    Energy
Cache 8 KB          10 pJ
Cache 32 KB         20 pJ
Cache 1 MB          100 pJ
DRAM                1.2 – 2.6 nJ
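To make the gap between the computation and memory energy figures above concrete, here is a rough Python estimate. It is only a sketch: it assumes (the slide does not say this) that one 64-bit access delivers eight 8-bit operands and that a multiply-accumulate costs one 8-bit multiply plus one 8-bit add. The names `ACCESS_64B_PJ` and `fetch_to_compute_ratio` are illustrative.

```python
# Energy figures copied from the slide (45nm, in pJ).
ADD_8B_PJ = 0.03
MULT_8B_PJ = 0.2
ACCESS_64B_PJ = {
    "cache_8KB": 10.0,
    "cache_32KB": 20.0,
    "cache_1MB": 100.0,
    "dram_low": 1200.0,   # 1.2 nJ
    "dram_high": 2600.0,  # 2.6 nJ
}

OPERANDS_PER_ACCESS = 8  # assumption: eight 8-bit operands per 64-bit access


def fetch_to_compute_ratio(source: str) -> float:
    """Energy to fetch one 8-bit operand divided by the energy of one 8-bit MAC."""
    fetch_pj = ACCESS_64B_PJ[source] / OPERANDS_PER_ACCESS
    mac_pj = MULT_8B_PJ + ADD_8B_PJ  # assumption: MAC = one multiply + one add
    return fetch_pj / mac_pj


for name in ACCESS_64B_PJ:
    print(f"{name}: fetch/compute = {fetch_to_compute_ratio(name):.1f}x")
```

Under these assumptions, even the smallest cache costs several times more energy per operand than the arithmetic itself, and DRAM costs hundreds of times more.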
*Post-layout simulations with SRAM + synthesized logic in 65nm CMOS
Dot product
ML kernel-level energy breakdown for TM (8-b operands) [Kang, Shanbhag]
[Figure: memory and storage hierarchy. Labels: CPU; DRAM (DDR); NAND Flash (PCIe); SATA SSD (SATA); HDD (SATA); PCH. Axes: cost per bit ($), latency. Bandwidth labels: 16Gb/s, 8-12Gb/s, 25.6Gb/s. Latency labels include 20-60 clock cycles.]
[Charts: SRAM vs. DRAM vs. FLASH on log scales: energy per bit (pJ), density (MB/mm2), access time (ns).]
[Yang, J. Joshua, Nature Nano, 2013]
https://spectrum.ieee.org/computing/hardware/to-speed-up-ai-mix-memory-and-processing
“breaching the memory wall”
[Figure: DIMA block diagram. Memory array with X-DEC, Precharge/Column Mux/Y-DEC, a row of bitline processors (BLP), a Cross Bitline Processor, and a Residual Digital Unit.]
multi-row functional READ: reads multiple bits/col/precharge
bitline processing: SIMD analog processing
cross bitline processing: analog averaging enhances SNR
inference/decisions: low complexity, low (decision) rate digital output
analog, mixed-signal low-SNR fabric
[Kang et al., ICASSP 2014] [Kang et al., JSSC 2018]
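[Figure: multi-row functional READ. A column-major word (d3 d2 d1 d0) is stored in one column between BL and BLB. After Precharge_b, word lines WL3-WL0 are driven together; pulse labels T3-T0 and V3-V0.]
∆V_BL ∝ D (the stored word): multiple bits read per column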
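A behavioral sketch of the multi-row functional READ follows. It assumes binary-weighted word-line pulse widths (T_i = 2^i T_0) and a bit-line discharge that adds linearly across rows. The linear model, the `dv_per_unit` constant, and the function name are illustrative assumptions, and complement storage on BL/BLB is ignored.

```python
def functional_read(bits_lsb_first, t0=1.0, dv_per_unit=1.0):
    """Idealized bit-line swing after pulsing all word lines at once.

    bits_lsb_first: stored bits [d0, d1, d2, d3] of one column-major word.
    Row i is pulsed for t0 * 2**i, and each conducting cell is assumed to
    discharge the bit line by dv_per_unit per unit of pulse time.
    """
    swing = 0.0
    for i, bit in enumerate(bits_lsb_first):
        pulse_width = t0 * (2 ** i)  # binary-weighted pulse (assumption)
        swing += dv_per_unit * pulse_width * bit
    return swing


# Word 0b1011 = 11, stored as d0=1, d1=1, d2=0, d3=1.
print(functional_read([1, 1, 0, 1]))  # 11.0
```

In this idealized model one precharge cycle yields an analog value proportional to the whole 4-bit word, rather than one bit per access.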
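[Figure: functional READ with a replica cell. Word lines WL3-WL0 and replica word lines RWL3~0 act on the same BL/BLB pair after Precharge_b; stored bits d3 d2 d1 d0; operands D and P; a MAX stage at the output.]
∆V_BL ∝ D+P, ∆V_BLB ∝ D+P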
[Figure: bit lines BL0, BL1, ..., BL254, BL255, each feeding a BLP; switches clocked by ø1 and ø2 connect the BLP outputs to the cross BL processor (CBLP), which forms the Sum.]
Kernel                BLP                 CBLP
Manhattan distance    subtract-compare    aggregation
Euclidean distance    subtract-square     aggregation
Dot product           multiply            weighted aggregation
Hamming distance      XOR                 aggregation
charge-redistribution based cross bitline aggregation
computations dominating machine learning
BL processor (BLP): column pitch-matched bitline processors (subtractor)
Layout dimensions: 15.4 µm; 6T bit cell 2.11 µm, 0.915 µm
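The split between a per-column BLP operation and a CBLP aggregation can be sketched in Python as below. This is an idealized digital model, not the analog circuit: `charge_share` assumes equal capacitors (so shorting them yields the mean voltage), the dot product's weighted aggregation is simplified to unit weights, and the Hamming kernel assumes bit-valued inputs. All names are illustrative.

```python
def charge_share(values):
    """Equal capacitors shorted together settle at the mean of their voltages."""
    return sum(values) / len(values)


# Per-column operation performed by each bitline processor (BLP).
BLP_OPS = {
    "manhattan": lambda d, p: abs(d - p),    # subtract-compare
    "euclidean": lambda d, p: (d - p) ** 2,  # subtract-square
    "dot_product": lambda d, p: d * p,       # multiply
    "hamming": lambda d, p: d ^ p,           # XOR (bit-valued inputs assumed)
}


def kernel(name, stored, query):
    """BLP op per column, then CBLP aggregation across columns."""
    per_column = [BLP_OPS[name](d, p) for d, p in zip(stored, query)]
    # The mean from charge sharing is rescaled to a sum for readability.
    return charge_share(per_column) * len(per_column)


stored = [3, 5, 2, 7]
query = [1, 5, 4, 4]
for name in ("manhattan", "euclidean", "dot_product"):
    print(name, kernel(name, stored, query))
print("hamming", kernel("hamming", [1, 0, 1, 1], [1, 1, 0, 1]))
```

Note that the "euclidean" entry returns the squared distance, matching the subtract-square operation in the table.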
SVM, TM, k-NN, MF; MIT-CBCL, MNIST, ..; energy savings = 10X; EDP reduction = 50X;
[JSSC January 2018]
[Figure: prototype chip die photos. Labels: test block, TRAINER, SRAM bitcell array, normal R/W interface, BLP-CBLP, CTRL, ADC, 64b bus, multi-functional.]
[ESSCIRC 2017, JSSC (special issue) May 2018] [ISSCC 2018, JSSC special issue (Invited)]
FIRST random forest IC
iso-accuracy comparisons wrt. post-layout 8-b digital processor & measured SRAM read energy
RF with 64 trees; KUL traffic sign; energy savings = 3X; EDP reduction = 7X
with 16kB standard 6T SRAM in 65nm CMOS
8b, 128-dim SVM; MIT-CBCL dataset; SGD-based learning; energy savings = 21X; EDP reduction = 100X
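Because the energy-delay product (EDP) is energy times delay, each pair of reported factors implies a delay reduction. The short Python check below uses only the factors quoted on these slides; the reported factors are rounded, so the implied values are approximate. The function name and the labels are illustrative.

```python
def implied_delay_reduction(energy_savings, edp_reduction):
    """EDP = energy x delay, so delay factor = EDP factor / energy factor."""
    return edp_reduction / energy_savings


reported = [
    ("SVM, TM, k-NN, MF", 10, 50),
    ("RF with 64 trees", 3, 7),
    ("SVM with SGD-based learning", 21, 100),
]
for label, energy, edp in reported:
    print(f"{label}: ~{implied_delay_reduction(energy, edp):.2f}X delay reduction")
```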
[Figure: DIMA block diagrams. SRAM version: X-Dec. & Pulse Gen., WL Driver, Functional Read (D/A), 16kB Input/Weights Reg., BL Processor (BLP) (multiplier) on WBL[0] ... WBL[255], Cross BL processor (CBLP), A/D. NAND flash version: X-Dec. & Pulse Gen., WL Driver, 128k X 192k SLC NAND Flash Array holding the weights, Analog Processor with 16k Multiplier, Out.]
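[Figure: NAND string schematic. Bit lines BL0-BL3 with select signals SEL0-SEL3; string select lines BSL and SSL; word lines WL<63>, WL<62>, ..., WL<0> biased at Vread (selected) and Vpass (unselected); PCH precharge; output node OUT with capacitor COUT.]
[Equation illegible in source: a ∆V expression with a proportionality (∝) term.]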
[Figure: timing diagram for PCH, SSL, BSL, SEL0-SEL3, BL1,2,3,4, and OUT across the Precharge Phase and the Evaluation Phase.]
[Figure: modeling and evaluation flow. Blocks: SPICE Simulations (BSIM Level 49 Model), Behavioral Model, Compare, Model Verification, Model Parameters, NAND flash Energy Models, BLP Energy, Delay, System Simulations, System-level Energy and Throughput Estimation, System-level Performance.]
[Figure: four NAND flash planes, each with a 16kB page and its own dot product unit, sharing one IO.]
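One way to read the multi-plane figure is as an I/O traffic argument: if each plane reduces its page to a dot-product result locally, only the results cross the shared IO. The Python sketch below compares the two cases. The 16kB page and the four planes come from the figure; the 2-byte result size and the idea that each plane returns a single value per page are assumptions made for illustration, as are all names.

```python
PAGE_BYTES = 16 * 1024  # 16kB page per plane (from the figure)
NUM_PLANES = 4          # four planes drawn in the figure
RESULT_BYTES = 2        # assumption: one 16-bit result per plane per page


def io_bytes(in_memory: bool) -> int:
    """Bytes crossing the shared IO for one page read on every plane."""
    per_plane = RESULT_BYTES if in_memory else PAGE_BYTES
    return NUM_PLANES * per_plane


conventional = io_bytes(in_memory=False)
dima = io_bytes(in_memory=True)
print(conventional, dima, conventional // dima)
```

The ratio depends directly on the assumed result width, so it should be read as an order-of-magnitude illustration rather than a measured figure.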
decision rule
Query image
Face database stored in NAND flash: Person 1, Person 2, ..., Person 28
“Based on 3 closest images pick person 1”
w^T x + b ≥ 0 → face; else → no face
[Figure labels: 320, 200]
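The two decision rules on this slide, a vote among the 3 closest database images and a thresholded linear score, can be sketched in Python as follows. The slide does not name the distance metric; Manhattan distance is assumed here. The linear score is written as w·x + b, and the toy feature vectors, labels, and function names are illustrative, not data from the talk.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_identify(query, database, k=3):
    """database: list of (label, feature_vector); majority vote among the k nearest."""
    nearest = sorted(database, key=lambda item: manhattan(query, item[1]))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


def is_face(w, x, b):
    """Linear decision rule: face if w.x + b >= 0, otherwise no face."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b >= 0


# Toy data (hypothetical): 2-dimensional features instead of full images.
database = [
    ("person 1", [10, 10]),
    ("person 1", [11, 9]),
    ("person 2", [30, 30]),
    ("person 1", [9, 12]),
    ("person 28", [50, 5]),
]
print(knn_identify([10, 11], database))  # person 1
print(is_face([1, -1], [3, 1], -1))      # True
```

Both rules reduce to distance or dot-product computations over stored data followed by a cheap decision.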
[Plots: detection accuracy (P_det) versus σ_VTH/ΔV_TH, and PSNR versus σ_VTH/ΔV_TH, two panels each.]
V_TH variations in the typical range
[Figure: two bar charts of normalized energy, conventional vs. DIMA, broken down into compute, I/O, string current, word-line, and bit-line energy. Energy reductions: 8.3× and 23×.]
[Figure: system integration. Four planes, each with its own BLP-CBLP, share one IO; multiple NAND devices connect to a controller with host I/O. Labels: Planes, MBLP/CBLP.]
http://shanbhag.ece.illinois.edu