A 19.4 nJ/Decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array
Mingu Kang, Sujan Gonugondla, Naresh Shanbhag
University of Illinois at Urbana-Champaign
Machine Learning under Resource Constraints
Embedded statistical inference: IoT, sensor-rich platforms
Decision making under resource constraints
Limited form factor, battery-powered, real-time
The Random Forest (RF) Algorithm
Random Forest [1]
- Ensemble of many (a few hundred) decision trees
- High accuracy
- Simple computation (only comparisons)
- Suitable for multi-class classification
- Inherent error resiliency (from ensemble nature)
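The "only comparisons" point can be illustrated with a minimal sketch of forest inference. The tuple encoding of the trees, the function names, and the toy forest are hypothetical and chosen only for illustration. Majority voting is assumed as the aggregation rule, as in standard random forests.

```python
from collections import Counter

# Hypothetical encoding, for illustration only:
#   leaf:          ("leaf", class_label)
#   internal node: ("node", pixel_index, threshold, left_subtree, right_subtree)

def classify_tree(tree, pixels):
    """Walk one tree; each step is a single comparison pixel < threshold."""
    while tree[0] == "node":
        _, index, threshold, left, right = tree
        tree = left if pixels[index] < threshold else right
    return tree[1]

def classify_forest(forest, pixels):
    """Aggregate the per-tree decisions by majority vote (assumed rule)."""
    votes = Counter(classify_tree(tree, pixels) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy forest of three depth-1 trees over a 4-pixel input.
forest = [
    ("node", 0, 128, ("leaf", "A"), ("leaf", "B")),
    ("node", 2, 64, ("leaf", "A"), ("leaf", "B")),
    ("node", 3, 200, ("leaf", "B"), ("leaf", "A")),
]
pixels = [10, 0, 90, 250]
print(classify_forest(forest, pixels))
```

In this toy case the second tree disagrees with the other two, and the vote still returns the majority class, which is the ensemble effect the slide refers to.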
RF algorithm
[1] L. Breiman, Machine Learning, 2001
Implementation Challenges
RF algorithm
Implementation challenges
Non-uniform tree structure
- Variations in depth, # of nodes, symmetry
Frequent memory access
- Memory dominates the system efficiency
Irregular data access pattern
Prior art:
- Software and FPGA implementations; no ASIC
- Fails to take advantage of inherent error resiliency
Proposed Solution: Deep In-memory Architecture (DIMA) with DSS
DIMA [2-4]:
- Embedded analog processing
- Storage density, normal read & write function preserved
- FR: functional read
- BLP: bitline processor (subtraction, comparison)
- CBLP: cross BLP (aggregation)
- RDL: ADC & residual digital logic
Deterministic sub-sampling (DSS)
Regularizes memory access pattern
[2] M. Kang, et al., ICASSP14
[3] M. Kang, et al., Arxiv16
[4] M. Kang, et al., US Patent no. 9,697,877
RF Chip Architecture
SRAM bitcell array
- Stores up to 42 groups
- Each group has 4 sub-groups (1 sub-group = 1 tree)
Input buffer
- Stores 4:1 sub-sampled pixels in 4 sections for DSS
Cross bar (CB)
31 CB units per sub-group enabled in parallel
Comparator (COMP)
128 analog comparators (∆ . ∆)
Proposed architecture
- IREG: pixel index register, RSREG: RSS register
Functional READ (FR)
- Fetches and computes a linear combination of stored data in the analog domain
- LB times more data accessed per read & precharge
- Savings in energy & delay at the cost of reduced SNR
[Figure: conventional read vs. functional read (FR), annotated with the bitline voltage drops Δ]
- B: bit precision, L: column mux ratio
[Plot: VBL (V) vs. TMSB - XMSB over the range -15 to 15; the spread at each point is the variation due to possible combinations of (TMSB, XMSB) at that TMSB - XMSB value]
In-memory Bitline Processing
Subtraction
1 → @ 2
Store and in the same column
∆ ∝ , ∆ ∝
Comparison: ∆
∆
[Figure: measured subtraction in 65 nm CMOS; a column of the SRAM array]
Deterministic Sub-sampling (DSS)
Random sub-sampling (RSS)
Requires complex cross bar (e.g., 256:1 for a 256-pixel image)
Deterministic sub-sampling (DSS) before RSS
- Sub-samples to generate four sub-images
- Reduces cross bar complexity (e.g., 256:1 → 64:1)
- More than 3× energy savings and 4× layout area savings
- 4:1 chosen due to accuracy vs. sub-sampling ratio trade-off
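A rough sketch of how DSS before RSS narrows the selection range is given below. The 2x2 interleaving pattern, the function names, and the fixed seed are assumptions made for illustration; the slides state only the 4:1 ratio, the four sub-images, and the 256:1 to 64:1 cross bar reduction. The 16×16 image size and the count of 31 picks (one per cross bar unit in a sub-group) are taken from figures reported elsewhere in the deck.

```python
import random

SIDE = 16  # 16x16 input, i.e. 256 pixels

def deterministic_subsample(image):
    """Split a flat 256-pixel image into four 64-pixel sub-images.

    Assumption: sub-image k collects the pixel at offset (row % 2, col % 2)
    of every 2x2 block. The source does not specify the pattern.
    """
    subs = [[] for _ in range(4)]
    for row in range(SIDE):
        for col in range(SIDE):
            k = 2 * (row % 2) + (col % 2)
            subs[k].append(image[row * SIDE + col])
    return subs

def random_subsample(sub_image, count, rng):
    """RSS inside one sub-image: every pick is a 64:1 selection, not 256:1."""
    indices = [rng.randrange(len(sub_image)) for _ in range(count)]
    return indices, [sub_image[i] for i in indices]

image = list(range(SIDE * SIDE))          # pixel value = original position
subs = deterministic_subsample(image)
indices, picked = random_subsample(subs[0], 31, random.Random(0))
print(len(subs), len(subs[0]), max(indices) < 64)
```

Because each tree only ever selects from one 64-pixel sub-image, the index it needs is 6 bits wide instead of 8, which is where the simpler cross bar comes from in this reading.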
Proposed RF algorithm
Application & Measured Results
Training (off-chip)
- 200 images per class employed for training
- Bit precision: 8, tree depth: 6, 64 trees
Testing
Randomly chosen 200 testing images from test data set
| Platform (65 nm CMOS) | # of trees | Max tree depth | Classification rate (decisions/ms) | Energy per decision (nJ) | Energy delay product (fJ·s) | Accuracy (%) |
|---|---|---|---|---|---|---|
| Conv. Arch. | 64 | 6 | 167/bank | 60.4 | 361.6 | 93.5 |
| Proposed Arch. | 64 | 6 | 364/bank | 19.4 | 53.2 | 94 |

EDP reduction by 6.8×
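The reported energy-delay products can be cross-checked from the other columns. The sketch below assumes that the delay per decision is the reciprocal of the per-bank classification rate; the function name is hypothetical. The results agree with the reported 361.6 fJ·s, 53.2 fJ·s, and 6.8× to within rounding of the inputs.

```python
def edp_fj_s(energy_nj, rate_decisions_per_ms):
    """Energy-delay product in fJ*s: energy per decision times time per decision.

    Assumption: delay per decision = 1 / classification rate.
    """
    energy_j = energy_nj * 1e-9
    delay_s = 1.0 / (rate_decisions_per_ms * 1e3)
    return energy_j * delay_s * 1e15

conv = edp_fj_s(60.4, 167)      # conventional architecture
proposed = edp_fj_s(19.4, 364)  # proposed architecture

print("EDP conv (fJ*s):    ", round(conv, 1))
print("EDP proposed (fJ*s):", round(proposed, 1))
print("EDP ratio:          ", round(conv / proposed, 1))
print("delay ratio:        ", round(364 / 167, 1))
print("energy ratio:       ", round(60.4 / 19.4, 1))
```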
KUL Belgium traffic sign dataset
Measured Energy vs. Accuracy Trade-off
Accuracy vs. BL swing vs. energy
More trees → higher error resiliency → allows lower BL swing → higher energy efficiency
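The link between the number of trees and error resiliency can be sketched with a simple voting model. This is an idealization, not the chip's measured behavior: it assumes a two-class problem, trees that err independently, and a fixed per-tree accuracy `p_tree` (a hypothetical parameter standing in for the accuracy left after low-swing, low-SNR reads).

```python
from math import comb

def majority_vote_accuracy(num_trees, p_tree):
    """P(more than half of num_trees independent votes are correct).

    Two-class simplification; a tie (possible for even num_trees)
    is counted as an error.
    """
    need = num_trees // 2 + 1
    return sum(
        comb(num_trees, k) * p_tree**k * (1 - p_tree) ** (num_trees - k)
        for k in range(need, num_trees + 1)
    )

for m in (1, 9, 33, 65):
    print(m, round(majority_vote_accuracy(m, 0.7), 4))
```

Under these assumptions the ensemble accuracy rises with the tree count even when each tree is individually unreliable, which is the headroom that a lower BL swing spends. Real trees are correlated, so the actual gain is smaller than this model suggests.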
[Plots: accuracy vs. # of trees vs. Δ; accuracy vs. energy w.r.t. BL swing (Δ)*]
* Δ for conv. is 10× "Δ per LSB"
Chip Summary & Comparison
[Figure: chip micrograph]

Comparison with state-of-the-art

| Prior art | Process | Algorithm | Dataset | Input size (8b) | Throughput (decision/s) | Energy (nJ/decision) | EDP (fJs/decision) | Accuracy |
|---|---|---|---|---|---|---|---|---|
| [5] | 130nm CMOS | Support vector machine | Traffic sign video | 320×240 | 33 [40K]* | 1.5M [1250]* | 45G [31250]* | 90% |
| [6] | 14nm tri-gate | K-nearest neighbor | Not reported | 128 | 21.5M [498.8K]* | 3.4 [145.3]* | 0.2 [292.3]* | Not reported |
| Ours (M=64) | 65nm CMOS | Random forest | KUL traffic signs | 16×16 | 364.4K | 19.4 (w/ CTRL) | 52.4 | 94% |

[5]: J. Park, JSSC12, [6]: H. Kaul, ISSCC16, * scaled to 65 nm CMOS

Chip summary

| Parameter | Value |
|---|---|
| Technology | 65 nm CMOS |
| Die size | 1.2 × 1.2 mm |
| SRAM capacity | 16 KB (512×256 bit-cells) |
| Bit-cell size | 2.11 × 0.92 µm² |
| CTRL CLK freq. | 1 GHz |
| Supply voltage (V) | CORE 1.0, CTRL 0.75 |
Conclusions
First ASIC implementation of RF algorithm
low-SNR processing via DIMA and DSS
Energy & speed benefits
2.2× smaller delay and 3.1× smaller energy → 6.8× smaller EDP compared to digital ASIC
Higher potential in large-scale applications
# of trees up to a few hundred in real-life applications → higher error resiliency → more room to scale ∆ for energy efficiency
Future work
- On-chip training to compensate for process variations
- Different algorithms (e.g., boosted ensemble classifier)
Acknowledgment
This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by SRC and DARPA.