From Small to Tiny: How to Co-design ML Models, Computational Precision and Circuits in the Energy-Accuracy Trade-off Space (PowerPoint presentation)


SLIDE 1

From Small to Tiny: How to Co-design ML Models, Computational Precision and Circuits in the Energy-Accuracy Trade-off Space

Marian Verhelst (Marian.Verhelst@kuleuven.be)


SLIDE 2

Embedded Deep Neural Networks

[Figure: raw data streams to a cloud GPU, which distills it into information]

Applications: augmented reality, face and owner recognition, keyword and speaker recognition

SLIDE 3

Embedded Deep Neural Networks

Local processing: run the networks on the device rather than in the cloud

Applications: augmented reality, face and owner recognition, keyword and speaker recognition

SLIDE 4

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

SLIDE 5

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

TOPs/Watt!?

SLIDE 6

Circuit level choices

MAC = multiply-accumulate


Options:
  • analog MAC
  • 1-bit digital MAC
  • multi-precision digital MAC (2-16 bit)

[Bankman, ISSCC18] [Moons, CICC18] [Moons, ISSCC17]
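The multi-precision option can be made concrete with a small sketch: a digital MAC whose operands are clipped to a configurable signed bit width. Function names and values here are illustrative, not taken from the silicon.

```python
def quantize(x, bits):
    """Clip a value into the signed integer range of the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, int(round(x))))

def mac(weights, activations, bits, acc=0):
    """Multiply-accumulate a vector pair at a configurable operand precision,
    as a multi-precision digital MAC (2-16 bit) would."""
    for w, a in zip(weights, activations):
        acc += quantize(w, bits) * quantize(a, bits)
    return acc

# 4-bit operands: 3*1 + (-2)*5 + 7*2 = 7
print(mac([3, -2, 7], [1, 5, 2], bits=4))  # -> 7
```

Lowering `bits` shrinks the multiplier and the memory words, which is exactly the knob the energy figures on the next slide quantify.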

SLIDE 7


Circuit level implications

                 analog MAC    1-bit digital MAC    multi-precision digital MAC
Area             large         small                medium
Energy           500 TOPs/W    200 TOPs/W           0.5 (16b) / 1 (8b) / 5 (4b) / 10 (2b) TOPs/W
Flexibility      low           medium               high

Accuracy? Best one?

[Bankman, ISSCC18] [Moons, CICC18] [Moons, ISSCC17]

With Stanford (Murmann)
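The efficiency figures quoted on this slide translate directly into energy per operation: 1 TOPs/W is 10^12 ops per joule, i.e. 1 pJ per op. A small sketch, treating the slide's numbers as rough figures:

```python
def energy_per_op_pj(tops_per_watt):
    """1 TOPs/W = 1e12 ops per joule, so the energy per operation,
    expressed in picojoules, is simply 1 / (TOPs/W)."""
    return 1.0 / tops_per_watt

# Efficiency figures quoted on this slide, treated as approximate
for name, tops_w in [("analog MAC", 500), ("1-bit digital MAC", 200),
                     ("8-bit digital MAC", 1), ("16-bit digital MAC", 0.5)]:
    print(f"{name}: {energy_per_op_pj(tops_w):g} pJ/op")
```

So a 16-bit digital MAC costs about 2 pJ, while an analog binary MAC costs about 0.002 pJ: three orders of magnitude, paid for in accuracy and flexibility.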

SLIDE 8

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

Circuit level: analog or digital? Optimal precision?

SLIDE 9

Architecture level choices

Options:
  • configurable systolic accelerator (MAC array + activation memories + weight memories + control)
  • programmable processor (ASIP)

                       systolic accelerator    programmable processor (ASIP)
Area                   small                   large(r)
Energy efficiency      high                    lower
Flexibility (util.)    low                     high

[Moons, CICC18/ISSCC18] [Moons, ISSCC17]

SLIDE 10

Architecture level choices (2)

Spend area on:
  – more MACs in parallel?
  – larger memory?
  – local or global memories?
  – or a programmable processor (ASIP)?

Best one?

SLIDE 11

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

SLIDE 12

Algorithm level choices

Same task can be implemented with many network topologies

Vary network depth, network width, and layer topology (Layer1, Layer2, Layer3, … LayerN)

SLIDE 13

Algorithm level choices: implications

Every trained parameter combination is one dot in the energy-accuracy plane; the best dots form a Pareto-optimal curve.

[Graph for CIFAR-10]

[Moons, Asilomar17]
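The Pareto curve can be extracted mechanically from the cloud of (energy, error) dots. A minimal sketch; the dot values below are made up for illustration:

```python
def pareto_front(points):
    """Keep the (energy, error) points not dominated by any other point:
    a point is dominated if another point has both lower energy and lower error."""
    front = []
    for energy, error in sorted(points):      # scan in order of increasing energy
        if not front or error < front[-1][1]: # keep only strict accuracy improvements
            front.append((energy, error))
    return front

# Each dot = one trained network configuration (energy, classification error)
dots = [(1.0, 0.30), (2.0, 0.20), (3.0, 0.25), (4.0, 0.10), (2.5, 0.18)]
print(pareto_front(dots))  # -> [(1.0, 0.3), (2.0, 0.2), (2.5, 0.18), (4.0, 0.1)]
```

Only the points on this front are worth deploying: every other configuration is beaten on both axes by some front point.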

SLIDE 14

Algorithm level choices: precision

Same task can be implemented with many network topologies

Vary network depth, network width, layer topology, and computational precision (Layer1, Layer2, Layer3, … LayerN)

SLIDE 15

Algorithm level choices: implications

Most energy efficient network?

Int1, Int2, … nets need more operations, but simpler operations, and more (yet smaller) memory accesses!

[Graph for CIFAR-10]

[Moons, Asilomar17]

Impact on parallelism and data reuse? Impact on compute vs memory cost?

SLIDE 16

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

Algorithmic level: network depth & width? Layer topology? Bit resolution?
Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

Optimize ACROSS all levels

SLIDE 17

Parametrized HW energy/latency/area model

Energy model parametrized across circuit & architecture options; the same approach yields latency/delay/throughput and area models.

Cost components: DRAM access, SRAM access, multiply-accumulate.

[Moons, Asilomar17]
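A minimal sketch of such a parametrized energy model, assuming illustrative per-access costs and the common assumptions that MAC energy scales roughly quadratically, and memory-access energy roughly linearly, with bit width. All constants are placeholders, not the calibrated values of [Moons, Asilomar17]:

```python
def network_energy(n_mac, n_sram, n_dram, bits,
                   e_mac16=3.2, e_sram16=10.0, e_dram16=640.0):
    """Total inference energy (pJ), parametrized by operation/access counts
    and operand precision. The 16-bit reference costs are placeholders."""
    scale = bits / 16.0
    e_mac = e_mac16 * scale ** 2   # multiplier energy ~ (bit width)^2
    e_sram = e_sram16 * scale      # narrower words -> cheaper accesses
    e_dram = e_dram16 * scale
    return n_mac * e_mac + n_sram * e_sram + n_dram * e_dram

# Compare a 16-bit and a 4-bit version of the same workload
full = network_energy(n_mac=1e9, n_sram=1e8, n_dram=1e6, bits=16)
quant = network_energy(n_mac=1e9, n_sram=1e8, n_dram=1e6, bits=4)
print(f"16-bit: {full/1e6:.0f} uJ, 4-bit: {quant/1e6:.0f} uJ")
```

With these placeholder constants the 4-bit version wins by roughly 8x; in the real framework the same comparison is rerun per network topology and per target accuracy, which is what drives the cross-layer optimization on the next slides.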

SLIDE 18

[Graphs for CIFAR-10]

Energy-based cross-layer optimization

Jointly determine the most energy-efficient network and circuit parameters: 4-bit! But … the optimum varies over accuracies and applications → flexible hardware!

[Moons, Asilomar17]

– A similar study [Moons, Asilomar17] covers the optimum memory vs. datapath size, optimum layer topology, …

SLIDE 19

Needs for flexible systems with cross-layer framework

Requirements: structural and precision scalability, low power consumption.

Silicon prototypes:
  • Envision: precision-scalable CNN processor, gen2 (10-100 mW) [VLSI’16, ISSCC’17]
  • Binareye: machine-learned wake-up image processor (~1 mW) [ISSCC’18, CICC’18]
  • LSTMacc: machine-learned wake-up audio processor (~10 uW) [ESSCIRC’18]

Cross-layer optimization flow: HW models → cross-layer optimization → HW configuration + NN topology

SLIDE 20

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application (here: face recognition), algorithmic, architecture, and circuit levels
… without giving up flexibility!

Algorithmic level: network depth & width? Layer topology? Bit resolution?
Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

Optimize ACROSS all levels
Adapt dynamically (data dependent)

SLIDE 21

Cascaded networks for efficient face recognition

Cascade stages (each "y?" decision gates the next stage):
  • Face detection: binary, 125 MMACs/frame, 17 kB; runs on the Binareye accelerator
  • Face detection (refined): 6-bit, 15 GMACs/frame, 15 MB; runs on the Envision processor
  • Owner detection: binary, 2 GMACs/frame, 260 kB

<1 mWatt average, co-designed across the algorithmic, architecture, and circuit levels
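The control flow of such a cascade is simple: each cheap stage gates the next, more expensive one, so the big networks only run on the rare frames that pass. A sketch with stand-in stage functions; the thresholds and names are illustrative, not the actual detectors:

```python
def cascaded_recognition(frame, detect_face, recognize_face, verify_owner):
    """Hierarchical cascade: a tiny always-on detector gates a mid-size
    recognizer, which gates the final owner-verification stage."""
    if not detect_face(frame):        # tiny binary net, runs on every frame
        return "no face"
    if not recognize_face(frame):     # larger net, runs only on face frames
        return "unknown face"
    return "owner" if verify_owner(frame) else "not owner"

# Toy usage: threshold "networks" on a scalar frame score
result = cascaded_recognition(
    frame=0.9,
    detect_face=lambda f: f > 0.2,
    recognize_face=lambda f: f > 0.5,
    verify_owner=lambda f: f > 0.8,
)
print(result)  # -> owner
```

Because most frames contain no face, the average cost is dominated by the first, cheapest stage.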

SLIDE 22

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application (here: keyword & speaker recognition), algorithmic, architecture, and circuit levels
… without giving up flexibility!

Algorithmic level: network depth & width? Layer topology? Bit resolution?
Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

Optimize ACROSS all levels
Adapt dynamically (data dependent)

SLIDE 23

Cascaded ML models for efficient keyword & speaker recognition

Cascade stages (each "y?" decision gates the next stage), run on a cascade of embedded accelerators:
  • Voice detection: 1-4 bit, 40 kMACs/sec, ~2 kB
  • Keyword detection: 4-8 bit LSTM, 2 MMACs/sec, 64 kB
  • Speaker identification: 8-bit GMM, 70 MMACs/sec, 500 kB
  • → speech recognition

<20 uWatt average [VLSI2019], co-designed across the algorithmic, architecture, and circuit levels
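The sub-20 uW average follows from duty cycling: each stage's active power is weighted by how often it actually runs. A sketch with illustrative, not measured, stage powers and duty cycles:

```python
def average_power_uw(stages):
    """Average system power of a duty-cycled cascade: each stage's active
    power weighted by the fraction of time that stage is actually running."""
    return sum(power * duty for power, duty in stages)

# (power in uW when active, fraction of time active) -- illustrative values
cascade = [
    (5.0, 1.00),    # voice activity detector: always on
    (100.0, 0.10),  # keyword spotter: runs only on detected voice
    (500.0, 0.01),  # speaker identification: runs only on detected keywords
]
print(f"{average_power_uw(cascade):.0f} uW average")  # -> 20 uW average
```

The always-on stage therefore sets the power floor, which is why it must be the tiniest, lowest-precision model in the cascade.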

SLIDE 24

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the algorithmic, architecture, and circuit levels

The system matters, not TOPs/W!
Optimize ACROSS all levels
Adapt dynamically (data dependent)

SLIDE 25

Contact: Marian.Verhelst@kuleuven.be