PROMISE: An End-To-End Design of a PROgrammable MIxed-Signal AccElerator for Machine Learning Algorithms

Prakalp Srivastava*, Mingu Kang*, Sujan K. Gonugondla, Sungmin Lim, Jungwook Choi, Vikram Adve, Nam S. Kim, Naresh Shanbhag


slide-1
SLIDE 1

PROMISE | Srivastava, Kang et al. | University of Illinois at Urbana-Champaign

PROMISE

An End-To-End Design of a PROgrammable MIxed-Signal AccElerator for Machine Learning Algorithms

Prakalp Srivastava*, Mingu Kang*, Sujan K. Gonugondla, Sungmin Lim, Jungwook Choi, Vikram Adve, Nam S. Kim, Naresh Shanbhag

(psrivas2@illinois.edu, mingu.kang@ibm.com)

* Equal Contribution

Supported by NSF, C-FAR, and SONIC

slide-2
SLIDE 2

Machine Learning under Resource Constraints

  • Embedded statistical inference: IoT, sensor-rich platforms
  • Decision making under resource constraints

slide-3
SLIDE 3

Energy Trend of Memory vs. Processing

Component-level energy trend in modern processor [Horowitz, ISSCC'14]

Computation energy (45nm)

Integer | ADD     | Mult
8 bits  | 0.03 pJ | 0.2 pJ
32 bits | 0.1 pJ  | 3 pJ

Memory access energy (45nm)

Memory      | 64 bits
Cache 8 KB  | 10 pJ
Cache 32 KB | 20 pJ
Cache 1 MB  | 100 pJ
DRAM        | 1.2 – 2.6 nJ

Accuracy vs. amount of operations, and number of parameters [Canziani, arXiv'16]

[Figure: Top-1 accuracy [%] vs. operations [G-Ops], with marker size showing # of parameters, for BN-AlexNet, AlexNet, BN-NIN, ENet, Inception-v3, Inception-v4, ResNet-152, VGG-16, and VGG-19]
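To make the gap between the 45nm computation and memory access energies on this slide concrete, the short Python sketch below expresses one 64-bit memory access as an equivalent number of 8-bit multiply-accumulates. The energy values are copied from the slide. The function names and the choice of the low end of the DRAM range are assumptions made only for illustration.

```python
# Per-operation integer energies at 45nm in picojoules, copied from the slide.
COMPUTE_PJ = {
    ("add", 8): 0.03,
    ("mult", 8): 0.2,
    ("add", 32): 0.1,
    ("mult", 32): 3.0,
}

# 64-bit memory access energies in picojoules, copied from the slide.
# DRAM is given as a 1.2 - 2.6 nJ range; the low end is used here (assumption).
MEMORY_PJ = {
    "cache_8KB": 10.0,
    "cache_32KB": 20.0,
    "cache_1MB": 100.0,
    "dram_low": 1200.0,
}


def mac_energy_pj(bits: int) -> float:
    """Energy of one multiply-accumulate: one multiply plus one add."""
    return COMPUTE_PJ[("mult", bits)] + COMPUTE_PJ[("add", bits)]


def access_to_mac_ratio(memory: str, bits: int = 8) -> float:
    """Number of MACs that cost as much energy as one 64-bit access."""
    return MEMORY_PJ[memory] / mac_energy_pj(bits)


if __name__ == "__main__":
    for level in MEMORY_PJ:
        ratio = access_to_mac_ratio(level)
        print(f"{level:>10}: one access costs about {ratio:.0f} 8-bit MACs")
```

Under these numbers even the smallest cache access costs tens of 8-bit MACs, which is the motivation for moving computation into the memory array.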

slide-4
SLIDE 4

Deep In-memory Architecture (DIMA)

[Figure: DIMA block diagram: bitcell array with X-decoders and precharge/Y-decoder, bitline processors (BLP), cross bitline processor, residual digital logic (RDL), and decision output]

  • Deeply embeds analog computing at the periphery of bitcell array
  • Low-swing / Low-SNR operations for aggressive energy efficiency

[M. Kang, JSSC18; J. Zhang, VLSI16; S. Gonugondla, ISSCC18; A. Biswas, ISSCC18]

slide-6
SLIDE 6

DIMA Prototypes

  • Multi-functional inference processor (65nm CMOS): 53× EDP ↓ [Mingu Kang, JSSC18]
  • Random forest processor (65nm CMOS): 7× EDP ↓ [Mingu Kang, JSSC18; Mingu Kang, ESSCIRC17]
  • On-chip training processor (65nm CMOS): 100× EDP ↓ [Sujan Gonugondla, ISSCC18]

Lack of Programmability

slide-8
SLIDE 8

Goals & Challenges of PROMISE

  • 1. Analog programmable hardware and ISA design

− Analog noise management
− Intrinsic sequentiality of operations
− High variations in delay across different analog operations

  • 2. End-to-End application mapping to PROMISE
  • 3. Optimal energy with accuracy guarantee

slide-9
SLIDE 9

Goals & Challenges of PROMISE

  • 1. Analog programmable hardware and ISA design
  • 2. End-to-End application mapping to PROMISE
  • 3. Optimal energy with accuracy guarantee

High-level language: e.g. fully-connected layer, $Y = W \cdot X$

Analog circuit (DIMA):

  • Voltage swing
  • ADC precision
  • Analog noise
  • Leakage
slide-10
SLIDE 10

Goals & Challenges of PROMISE

  • 1. Analog programmable hardware and ISA design
  • 2. End-to-End application mapping to PROMISE
  • 3. Optimal energy with accuracy guarantee

− Energy vs. accuracy trade-off in analog circuit
− Maximize energy savings
− Accuracy guarantees across long chain of analog processing

slide-11
SLIDE 11

Our Contributions

  • High-level Program: DNN, Matched Filter, SVM, PCA, …
  • Programmability Challenge – 1
  • PROMISE Hardware

[Figure: PROMISE hardware block diagram with bitline processors (BLP), cross bit-line processor, precharge/Y-decoder, X-decoders, and RDL]

slide-12
SLIDE 12

Our Contributions

  • High-level Program: DNN, Matched Filter, SVM, PCA, …
  • PROMISE Compiler (Programmability Challenge – 2)
  • PROMISE ISA (Programmability Challenge – 1)
  • PROMISE Hardware

[Figure: PROMISE hardware block diagram with bitline processors (BLP), cross bit-line processor, precharge/Y-decoder, X-decoders, and RDL]

slide-13
SLIDE 13

Our Contributions

  • High-level Program: DNN, Matched Filter, SVM, PCA, …
  • PROMISE Compiler (Programmability Challenge – 2)
  • PROMISE ISA (Programmability Challenge – 1)
  • Energy Optimization (Programmability Challenge – 3)
  • Optimized PROMISE ISA
  • PROMISE Hardware

[Figure: PROMISE hardware block diagram with bitline processors (BLP), cross bit-line processor, precharge/Y-decoder, X-decoders, and RDL]

slide-14
SLIDE 14

Prior Art

  • PuDianNao [D. Liu, ASPLOS15]: instruction set architecture, various ML algorithms, digital implementation
  • PRIME [P. Chi, ISCA16]: ReRAM in-memory processor
  • RedEye [R.L. Wa, ISCA15]: processor in image sensor
  • Limited programmability
  • Limited error management

slide-15
SLIDE 15

Processing Stages in DIMA

[Figure: DIMA block diagram with bitline processors (BLP), cross bitline processor, precharge/Y-decoder, and X-decoders]

  • 1. Analog READ (aRead)
  • 2. Bitline processing (BLP)
  • 3. Cross BLP (CBLP)
  • 4. ADC & Residual digital logic (RDL)

slide-16
SLIDE 16

Energy vs. Accuracy Trade-off

[Figure: probability of detection* [%] vs. bitline voltage swing [mV], with arrows marking the directions of higher energy efficiency and higher accuracy]

* Silicon measured results of template matching from [Kang JSSC18]

slide-17
SLIDE 17

Energy vs. Accuracy Trade-off

[Figure: probability of detection* [%] vs. bitline voltage swing [mV], with arrows marking the directions of higher energy efficiency and higher accuracy]

PROMISE Instruction: SWING = 000 (min) to SWING = 111 (max)

* Silicon measured results of template matching from [Kang JSSC18]

slide-18
SLIDE 18

Energy vs. Accuracy Trade-off

[Figure: chain of layers, each with an undetermined setting (SWING = ???), evaluated against an accuracy goal]

> 4096 possible combinations for 4 layers
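The size of this search space can be checked with a few lines of Python. The sketch below assumes that SWING is a 3-bit field (000 to 111, as on the previous slide) and that it is the only per-layer knob, which gives 8^4 = 4096 assignments for 4 layers. The `energy` and `meets_goal` callables are hypothetical stand-ins for a real energy model and accuracy check, not part of PROMISE.

```python
from itertools import product
from typing import Callable, Optional, Tuple

# Assumption: SWING is a 3-bit field, so each layer has 8 possible settings.
SWING_CODES = range(8)

Assignment = Tuple[int, ...]


def count_assignments(num_layers: int) -> int:
    """Number of distinct per-layer SWING assignments."""
    return len(SWING_CODES) ** num_layers


def brute_force_search(
    num_layers: int,
    energy: Callable[[Assignment], float],
    meets_goal: Callable[[Assignment], bool],
) -> Optional[Assignment]:
    """Exhaustively pick the lowest-energy assignment that meets the goal.

    Both callables are hypothetical placeholders supplied by the caller.
    Returns None when no assignment satisfies the accuracy goal.
    """
    best: Optional[Assignment] = None
    for assignment in product(SWING_CODES, repeat=num_layers):
        if not meets_goal(assignment):
            continue
        if best is None or energy(assignment) < energy(best):
            best = assignment
    return best


if __name__ == "__main__":
    print(count_assignments(4))
    # Toy stand-ins: energy grows with swing, accuracy needs every swing >= 3.
    print(brute_force_search(2, energy=sum, meets_goal=lambda a: min(a) >= 3))
```

Exhaustive search is workable for a handful of layers, but the count grows by a factor of 8 per added layer, and each candidate needs an accuracy evaluation.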

slide-19
SLIDE 19

End-to-End Application to Architecture Mapping

  • Julia Program: DNN, Matched Filter, SVM, PCA, …
  • PROMISE Compiler (Programmability Challenge – 2)
  • PROMISE ISA (Programmability Challenge – 1)
  • Energy Optimization (Programmability Challenge – 3)
  • Optimized PROMISE ISA
  • PROMISE Hardware

[Figure: PROMISE hardware block diagram with bitline processors (BLP), cross bit-line processor, precharge/Y-decoder, X-decoders, and RDL]

slide-20
SLIDE 20

Machine Learning Algorithms

General form: $f(\mathrm{dist}(W, X))$

Scalar distance (SD) → Aggregation: Vector distance (VD) → Threshold ($f(\cdot)$)

Algorithm        | Scalar distance (SD)          | Vector distance (VD) | Threshold (TH), $f(\cdot)$
SVM              | $w[i]\,x[i]$                  | $\sum_i$             | sign
Template Match 1 | $\lvert w[i] - x[i] \rvert$   | $\sum_i$             | min
Template Match 2 | $(w[i] - x[i])^2$             | $\sum_i$             | min
DNN              | $w[i]\,x[i]$                  | $\sum_i$             | tanh, ReLU
PCA              | $w[i]\,x[i]$                  | $\sum_i$             | -
K-NN 1           | $\lvert w[i] - x[i] \rvert$   | $\sum_i$             | majority vote
K-NN 2           | $(w[i] - x[i])^2$             | $\sum_i$             | majority vote
Matched Filter   | $w[i]\,x[i]$                  | $\sum_i$             | min
…                | …                             | …                    | …
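The three-stage pattern on this slide (scalar distance, aggregation into a vector distance, then a threshold function) can be written as a small Python sketch. The names `SD`, `vector_distance`, `svm_decision`, `template_match`, and `dnn_neuron` are illustrative only. Here `w` stands for the stored operand and `x` for the input. Bias terms and the sign convention at exactly zero are assumptions.

```python
import math
from typing import Sequence

# Scalar distance (SD): an element-wise operation on one stored value w
# and one input value x.
SD = {
    "mult": lambda w, x: w * x,
    "abs_diff": lambda w, x: abs(w - x),
    "sq_diff": lambda w, x: (w - x) ** 2,
}


def vector_distance(sd: str, w: Sequence[float], x: Sequence[float]) -> float:
    """Vector distance (VD): sum the scalar distances over all elements."""
    return sum(SD[sd](wi, xi) for wi, xi in zip(w, x))


def svm_decision(w: Sequence[float], x: Sequence[float]) -> int:
    """SVM row: multiply, sum, then sign (zero mapped to +1 by assumption)."""
    return 1 if vector_distance("mult", w, x) >= 0 else -1


def template_match(
    candidates: Sequence[Sequence[float]],
    x: Sequence[float],
    sd: str = "abs_diff",
) -> int:
    """Template Match rows: distance to every candidate, then min (as index)."""
    distances = [vector_distance(sd, w, x) for w in candidates]
    return min(range(len(distances)), key=distances.__getitem__)


def dnn_neuron(w: Sequence[float], x: Sequence[float]) -> float:
    """DNN row with tanh as the threshold function."""
    return math.tanh(vector_distance("mult", w, x))


if __name__ == "__main__":
    print(svm_decision([1, -1], [2, 3]))
    print(template_match([[0, 0], [5, 5]], [4, 6]))
```

Only the scalar distance and the threshold function change between algorithms, which is what lets a small set of instruction classes cover all the rows.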

slide-21
SLIDE 21

PROMISE ISA

[Figure: DIMA block diagram (bitline processors, cross bitline processor, RDL, precharge/Y-decoder, X-decoders, decision) annotated with the instruction classes and the analog/digital boundary]

  • Class 1 (aREAD): aREAD, aADD, aSUBT
  • Class 2 (aSD, aVD): signed multiply, unsigned multiply, sum-abs, sum-abs2, compare
  • Class 3 (ADC): ADC bit precision
  • Class 4 (TH): max/min, mean, sum, sigmoid/ReLU/tanh, threshold

slide-22
SLIDE 22

PROMISE ISA: Task

Task: PROMISE macro instruction (51 bits)

Fields: Class 1 (aREAD) | Class 2 (aSD, aVD) | Class 3 (ADC) | Class 4 (TH) | Class 0 (Set Parameters) | Rep Count (Loop Iterations)

Example: SAD-based template matching

  • Class 1: aSUBT, $X - Y$
  • Class 2: absolute, $|d_i|$
  • Class 3: ADC, 6 bit
  • Class 4: min
  • Class 0 (Set Parameters): SWING; $X$, $W$ address
  • Rep Count: # of candidates $M$
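As a rough software reference for what this example Task computes, the Python sketch below walks the same stages: subtract, absolute value and sum, 6-bit quantization, then min over the candidates. It is an idealized emulation under stated assumptions. The ADC is modelled as an ideal uniform quantizer, the data are assumed to be 8-bit, and analog noise, SWING, and the actual 51-bit encoding are not modelled. All function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def quantize(value: float, bits: int, full_scale: float) -> int:
    """Ideal ADC model: clip to [0, full_scale], map to 2**bits levels."""
    levels = (1 << bits) - 1
    clipped = min(max(value, 0.0), full_scale)
    return round(clipped / full_scale * levels)


def sad_template_match(
    query: Sequence[int],
    candidates: Sequence[Sequence[int]],
    adc_bits: int = 6,
    full_scale: Optional[float] = None,
) -> Tuple[int, int]:
    """Return (index, ADC code) of the candidate with the smallest SAD."""
    if full_scale is None:
        # Assumption: 8-bit data, so the largest possible SAD is 255 per element.
        full_scale = 255.0 * len(query)

    codes = []
    for candidate in candidates:  # Rep Count: one iteration per candidate
        diffs = [q - c for q, c in zip(query, candidate)]   # Class 1: aSUBT
        sad = sum(abs(d) for d in diffs)                    # Class 2: absolute, summed
        codes.append(quantize(sad, adc_bits, full_scale))   # Class 3: ADC, 6 bit

    best = min(range(len(codes)), key=codes.__getitem__)    # Class 4: min
    return best, codes[best]


if __name__ == "__main__":
    query = [10, 20, 30, 40]
    candidates = [[200, 200, 200, 200], [12, 18, 33, 40], [0, 0, 0, 0]]
    print(sad_template_match(query, candidates))
```

With only 64 ADC levels, candidates whose SADs are close can collapse onto the same code, which is one place where precision settings affect end-to-end accuracy.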

slide-24
SLIDE 24

PROMISE Energy Optimization

Challenge

Map end-to-end accuracy specification to low level hardware parameters

PROMISE Context

Programmer provides only end-to-end accuracy spec: mismatch probability $p_m$

  • $p_{FL}$: accuracy on floating point in software
  • $p_{PROMISE}$: accuracy when run on PROMISE

$p_{FL} - p_{PROMISE} \le p_m$

slide-25
SLIDE 25

PROMISE Energy Optimization

Solution: Two Step Approach

1. Sakr et al. [ICML’17] map floating-point neural networks to fixed-point networks with bounded accuracy degradation ($p_m$)
2. Map bit precision to bit line voltage swing

[Figure: (1) Sakr et al. [ICML’17] takes the network, the training dataset, and $p_m$ and yields the bit precision; (2) the PROMISE hardware error model maps the bit precision to SWING for each instruction]

slide-26
SLIDE 26

Evaluation

  • Benchmarks

− DNN, Matched Filtering, Template Matching (Manhattan and Euclidean), Linear SVM, k-NN (Manhattan and Euclidean), PCA, Linear Regression

  • Baselines

− CONV-8b baseline: computational logic synthesized for the specific algorithm + conventional SRAM
− CONV-OPT baseline: same as CONV-8b but with minimum bit precision required per benchmark

  • PROMISE

− Analog: Silicon measurement
− Digital Controller: Post layout simulations

slide-27
SLIDE 27

Results: Throughput and Energy

2.2x higher throughput, 4.2x lower energy

Key benefit coming from Class 1 and Class 2 analog operations

[Figure: throughput ratio (PROMISE/CONV-OPT) and energy ratio (CONV-OPT/PROMISE), higher is better, for Match. Filt., Temp. Match. L1, Temp. Match. L2, Linear SVM, k-NN L1, k-NN L2, PCA, Linear Reg., DNN, and Geometric Mean]

slide-28
SLIDE 28

Results: Energy Optimization

Optimized: Swing set by energy optimization analysis ($p_m \le 1\%$)

4% – 20% energy reduction, Geometric Mean: 15%

[Figure: normalized energy per benchmark, SWING = 111 vs. Optimized]

slide-29
SLIDE 29

Results: Programmability Overhead

[Figure: energy and throughput of DIMA vs. PROMISE for SVM and Template Matching]

Negligible Overhead

slide-30
SLIDE 30

Summary

  • Analog programmable hardware and ISA design.

− Negligible programmability overhead

  • End-to-End application mapping to PROMISE

− Programmable with high-level language Julia
− 2.2x higher throughput and 4.2x lower energy compared with application specific digital hardware

  • Optimal energy with accuracy goal

− 15% energy reduction compared to maximum SWING in PROMISE
