A 19.4 nJ/Decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array




SLIDE 1

A 19.4 nJ/Decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array
Mingu Kang, Sujan Gonugondla, Naresh Shanbhag
University of Illinois at Urbana-Champaign

SLIDE 2

Machine Learning under Resource Constraints

• Embedded statistical inference: IoT, sensor-rich platforms
• Decision making under resource constraints
  • Limited form factor, battery-powered, real-time

SLIDE 3

The Random Forest (RF) Algorithm

• Random Forest [1]
  • Ensemble of many (a few hundred) decision trees
  • High accuracy
  • Simple computation (only comparisons)
  • Suitable for multi-class classification
  • Inherent error resiliency (from ensemble nature)

Figure: RF algorithm

[1] L. Breiman, Machine Learning, 2001
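The "only comparisons" and "ensemble" points above can be illustrated with a minimal Python sketch. The tuple-based tree layout, the function names, and the toy forest are hypothetical and chosen only for illustration; they are not the data structures used on the chip.

```python
from collections import Counter

# ASSUMPTION: a tree is a nested tuple, either ("leaf", class_label) or
# ("node", feature_index, threshold, left_subtree, right_subtree).
def classify_tree(tree, x):
    """Walk one decision tree using only threshold comparisons."""
    while tree[0] == "node":
        _, feature, threshold, left, right = tree
        tree = left if x[feature] < threshold else right
    return tree[1]

def classify_forest(forest, x):
    """Return the majority vote over the labels produced by all trees."""
    votes = Counter(classify_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy forest: three depth-1 trees over a 2-feature input.
toy_forest = [
    ("node", 0, 128, ("leaf", "A"), ("leaf", "B")),
    ("node", 1, 64, ("leaf", "A"), ("leaf", "B")),
    ("node", 0, 200, ("leaf", "A"), ("leaf", "B")),
]
prediction = classify_forest(toy_forest, [150, 10])
```

Because the final label is a vote, a single tree that takes a wrong branch does not necessarily change the result, which is the error resiliency the slide refers to.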

SLIDE 4

Implementation Challenges

• Implementation challenges
  • Non-uniform tree structure
    • Variations in depth, number of nodes, symmetry
  • Frequent memory access
    • Memory dominates the system efficiency
  • Irregular data access pattern
• Prior art
  • Software and FPGA implementations; no ASIC
  • Fails to take advantage of inherent error resiliency

Figure: RF algorithm

SLIDE 5

Proposed Solution: Deep In-memory Architecture (DIMA) with DSS

• DIMA [2-4]
  • Embedded analog processing
  • Storage density, normal read & write function preserved
  • FR: functional read
  • BLP: bitline processor (subtraction, comparison)
  • CBLP: cross BLP (aggregation)
  • RDL: ADC & residual digital logic
• Deterministic sub-sampling (DSS)
  • Regularizes memory access pattern

[2] M. Kang, et al., ICASSP 2014
[3] M. Kang, et al., arXiv 2016
[4] M. Kang, et al., US Patent No. 9,697,877

SLIDE 6

RF Chip Architecture

• SRAM bitcell array
  • Stores up to 42 groups
  • Each group has 4 sub-groups (1 sub-group = 1 tree)
• Input buffer
  • Stores 4:1 sub-sampled pixels in 4 sections for DSS
• Cross bar (CB)
  • 31 CB units per sub-group enabled in parallel
• Comparator (COMP)
  • 128 analog comparators

Figure: Proposed architecture (IREG: pixel index register, RSREG: RSS register)

SLIDE 7

Functional READ (FR)

• Fetches stored data and computes its linear combination in the analog domain
• L·B times more data accessed per read & precharge cycle (B: bit precision, L: column mux ratio)
• Savings in energy & delay at the cost of reduced SNR

Figure: Conventional read vs. functional read (FR)
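As a rough behavioral illustration of the functional read, the sketch below contrasts a digital word read with an idealized analog read. The binary weights, the linear noise-free bitline, the `volts_per_lsb` value, and the example B and L values are all assumptions made for illustration; the slide only states that FR yields a linear combination of the stored data and accesses L·B times more data per read & precharge.

```python
def conventional_read(column_bits):
    """Digital read: assemble the stored bits (MSB first) into an integer."""
    value = 0
    for bit in column_bits:
        value = (value << 1) | bit
    return value

def functional_read(column_bits, volts_per_lsb=0.01):
    """Idealized analog read: bitline drop as a weighted sum of the bits.

    ASSUMPTION: binary weights 2^i and a linear, noise-free bitline;
    volts_per_lsb is a hypothetical scale factor.
    """
    bits = len(column_bits)
    weighted_sum = sum(bit << (bits - 1 - i) for i, bit in enumerate(column_bits))
    return volts_per_lsb * weighted_sum

def access_gain(bit_precision, column_mux_ratio):
    """L*B: data accessed per read & precharge relative to a conventional read."""
    return column_mux_ratio * bit_precision

word = [1, 0, 1, 1]                 # hypothetical 4-bit stored word, MSB first
digital = conventional_read(word)   # integer value of the word
analog = functional_read(word)      # idealized bitline voltage drop in volts
gain = access_gain(bit_precision=8, column_mux_ratio=4)  # hypothetical B and L
```

In this idealized model the analog value carries the same information as the digital word; on silicon the reduced bitline swing is what trades SNR for energy and delay.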
SLIDE 8

In-memory Bitline Processing

• Subtraction
  • Store the two operands in the same column
• Comparison

Figure: Measured subtraction in 65 nm CMOS. Axes: VBL (V) vs. TMSB − XMSB (−15 to 15). Curves: XMSB = 0 (TMSB swept to 15) and TMSB = 0 (XMSB swept to 15). Legend: variation due to possible combinations of (TMSB, XMSB) at each TMSB − XMSB value.

Figure: A column of the SRAM array
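The subtraction-then-comparison flow can be sketched behaviorally as follows. The linear voltage model, the sign convention (VBL rising with TMSB − XMSB), and the `volts_per_lsb`, `v_mid`, and `noise_sigma` parameters are hypothetical; they are not fitted to the measured curve.

```python
import random

def bitline_subtract(t_msb, x_msb, volts_per_lsb=0.03, v_mid=0.55,
                     noise_sigma=0.0, rng=None):
    """Idealized bitline voltage that tracks T_MSB - X_MSB.

    ASSUMPTION: a linear transfer curve centred on v_mid with optional
    Gaussian noise standing in for the low-SNR analog computation.
    """
    rng = rng or random
    noise = rng.gauss(0.0, noise_sigma) if noise_sigma > 0 else 0.0
    return v_mid + volts_per_lsb * (t_msb - x_msb) + noise

def compare(v_bl, v_ref=0.55):
    """Analog comparator: True when the bitline sits above the reference."""
    return v_bl > v_ref

# Hypothetical 4-bit MSB operands (range 0..15, as on the plot's x-axis).
threshold_exceeds_pixel = compare(bitline_subtract(t_msb=9, x_msb=3))
pixel_exceeds_threshold = compare(bitline_subtract(t_msb=3, x_msb=9))
```

Passing a non-zero `noise_sigma` shows how a small operand difference can flip the comparator output, which is the kind of error the ensemble is expected to absorb.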

SLIDE 9

Deterministic Sub-sampling (DSS)

• Random sub-sampling (RSS)
  • Requires complex cross bar (e.g., 256:1 for a 256-pixel image)
• Deterministic sub-sampling (DSS) before RSS
  • Sub-samples to generate four sub-images
  • Reduces cross bar complexity (e.g., 256:1 → 64:1)
  • More than 3× energy and 4× layout area savings
  • 4:1 chosen due to accuracy vs. sub-sampling ratio trade-off

Figure: Proposed RF algorithm
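A minimal sketch of DSS followed by RSS, assuming a 16×16 image stored as a flat list. The 2×2 polyphase split used here is an assumption (the slide does not say which deterministic pattern is used), and the function names and pick count are hypothetical.

```python
import random

def deterministic_subsample(image, width=16):
    """Split a flattened width x width image into four sub-images.

    ASSUMPTION: a 2x2 polyphase pattern, so each sub-image keeps every
    second pixel in both directions (4:1 sub-sampling).
    """
    sub_images = [[], [], [], []]
    for index, pixel in enumerate(image):
        row, col = divmod(index, width)
        sub_images[2 * (row % 2) + (col % 2)].append(pixel)
    return sub_images

def random_subsample(sub_image, count, rng):
    """RSS inside one sub-image: selection is len(sub_image):1, not 256:1."""
    return [sub_image[i] for i in rng.sample(range(len(sub_image)), count)]

image = list(range(256))                      # placeholder pixel values
subs = deterministic_subsample(image)         # four 64-pixel sub-images
picked = random_subsample(subs[0], 8, random.Random(0))
```

Each random pick now selects among 64 candidates instead of 256, which corresponds to the 256:1 → 64:1 cross bar reduction listed above.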

SLIDE 10

Application & Measured Results

Dataset: KUL Belgium traffic sign dataset

• Training (off-chip)
  • 200 images per class employed for training
  • Bit precision: 8, tree depth: 6, 64 trees
• Testing
  • 200 testing images randomly chosen from the test data set

| Platform (65 nm CMOS) | # of trees | Max tree depth | Classification rate (decisions/ms) | Energy per decision (nJ) | Energy delay product (fJ·s) | Accuracy (%) |
|---|---|---|---|---|---|---|
| Conv. Arch. | 64 | 6 | 167/bank | 60.4 | 361.6 | 93.5 |
| Proposed Arch. | 64 | 6 | 364/bank | 19.4 | 53.2 | 94 |

EDP reduction by 6.8×
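The EDP column can be cross-checked from the rate and energy columns, taking the delay per decision as the reciprocal of the classification rate. The helper name below is hypothetical; the input numbers are the ones in the results table.

```python
def energy_delay_product_fjs(energy_nj, decisions_per_ms):
    """EDP in fJ*s from energy per decision (nJ) and rate (decisions/ms)."""
    delay_s = 1.0 / (decisions_per_ms * 1e3)    # seconds per decision
    return energy_nj * 1e-9 * delay_s * 1e15    # J*s converted to fJ*s

conv_edp = energy_delay_product_fjs(60.4, 167)  # conventional architecture
prop_edp = energy_delay_product_fjs(19.4, 364)  # proposed architecture

delay_ratio = 364 / 167        # how much shorter the proposed delay is
energy_ratio = 60.4 / 19.4     # how much lower the proposed energy is
edp_ratio = conv_edp / prop_edp
```

The three ratios round to 2.2×, 3.1×, and 6.8×, and the two EDP values agree with the tabulated 361.6 and 53.2 fJ·s to within rounding.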

SLIDE 11

Measured Energy vs. Accuracy Trade-off

• Accuracy vs. BL swing vs. energy
• More trees → higher error resiliency → allows lower BL swing → higher energy efficiency

Figure: Accuracy vs. # of trees vs. Δ
Figure: Accuracy vs. energy w.r.t. BL swing (Δ)*
* Δ for conv. is 10× "Δ per LSB"
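Why more trees tolerate a lower BL swing can be illustrated with a simple voting model. This is a textbook majority-vote calculation under strong assumptions (binary classification, trees that err independently with equal probability, ties broken by a coin flip); it is not the measured behavior of the chip, and the accuracy values used are hypothetical.

```python
from math import comb

def ensemble_accuracy(num_trees, tree_accuracy):
    """Probability that a majority vote of independent trees is correct.

    ASSUMPTION: two classes, independent and identically accurate trees,
    ties counted as correct half of the time.
    """
    total = 0.0
    for correct in range(num_trees + 1):
        probability = (comb(num_trees, correct)
                       * tree_accuracy ** correct
                       * (1 - tree_accuracy) ** (num_trees - correct))
        if 2 * correct > num_trees:
            total += probability
        elif 2 * correct == num_trees:
            total += 0.5 * probability
    return total

# Hypothetical per-tree accuracies: a lower BL swing is modelled as a
# noisier, less accurate individual tree.
few_accurate_trees = ensemble_accuracy(8, 0.7)
many_noisy_trees = ensemble_accuracy(64, 0.6)
```

Under these assumptions a larger ensemble of noisier trees can still outvote a smaller ensemble of cleaner ones, which is the qualitative argument behind trading BL swing for energy.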

SLIDE 12

Chip Summary & Comparison

Figure: Chip micrograph

Chip summary

| Parameter | Value |
|---|---|
| Technology | 65 nm CMOS |
| Die size | 1.2 × 1.2 mm |
| SRAM capacity | 16 KB (512×256 bit-cells) |
| Bit-cell size | 2.11 × 0.92 µm² |
| CTRL CLK freq. | 1 GHz |
| Supply voltage (V) | CORE 1.0, CTRL 0.75 |

Comparison with state-of-the-art

| Prior art | Process | Algorithm | Dataset | Input size (8b) | Throughput (decisions/s) | Energy (nJ/decision) | EDP (fJ·s/decision) | Accuracy |
|---|---|---|---|---|---|---|---|---|
| [5] | 130nm CMOS | Support vector machine | Traffic sign video | 320×240 | 33 [40K]* | 1.5M [1250]* | 45G [31250]* | 90% |
| [6] | 14nm tri-gate | K-nearest neighbor | Not reported | 128 | 21.5M [498.8K]* | 3.4 [145.3]* | 0.2 [292.3]* | Not reported |
| Ours (M=64) | 65nm CMOS | Random forest | KUL traffic signs | 16×16 | 364.4K | 19.4 (w/ CTRL) | 52.4 | 94% |

[5] J. Park, JSSC 2012
[6] H. Kaul, ISSCC 2016
* scaled to 65 nm CMOS

SLIDE 13

Conclusions

• First ASIC implementation of RF algorithm
  • Low-SNR processing via DIMA and DSS
• Energy & speed benefits
  • 2.2× smaller delay and 3.1× smaller energy → 6.8× smaller EDP compared to digital ASIC
• Higher potential in large-scale applications
  • # of trees up to a few hundred in real-life applications → higher error resiliency → more room to scale ∆ for energy efficiency
• Future work
  • On-chip training to compensate for process variations
  • Different algorithms (e.g., boosted ensemble classifier)

SLIDE 14

Acknowledgment

• This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by SRC and DARPA.