From Small to Tiny: How to Co-design ML Models, Computational Precision and Circuits in the Energy-Accuracy Trade-off Space (PowerPoint presentation)


SLIDE 1

From Small to Tiny: How to Co-design ML Models, Computational Precision and Circuits in the Energy-Accuracy Trade-off Space

Marian Verhelst (Marian.Verhelst@kuleuven.be)


SLIDE 2

Embedded Deep Neural Networks

[Figure: raw data streams to a cloud GPU, which distills it into information]

Applications: augmented reality, face and owner recognition, keyword and speaker recognition

SLIDE 3

Embedded Deep Neural Networks

Local processing: run the networks on the device rather than in the cloud

Applications: augmented reality, face and owner recognition, keyword and speaker recognition

SLIDE 4

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

SLIDE 5

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

TOPs/Watt!?

SLIDE 6

Circuit level choices

MAC = multiply-accumulate


Options:
  • analog MAC
  • 1-bit digital MAC
  • multi-precision digital MAC (2-16 bit)

[Bankman, ISSCC18] [Moons, CICC18] [Moons, ISSCC17]
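The multi-precision option can be made concrete with a small sketch: a digital MAC whose operands are clipped to a configurable signed bit width. Function names and values here are illustrative, not taken from the silicon.

```python
def quantize(x, bits):
    """Clip a value into the signed integer range of the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, int(round(x))))

def mac(weights, activations, bits, acc=0):
    """Multiply-accumulate a vector pair at a configurable operand precision,
    as a multi-precision digital MAC (2-16 bit) would."""
    for w, a in zip(weights, activations):
        acc += quantize(w, bits) * quantize(a, bits)
    return acc

# 4-bit operands: 3*1 + (-2)*5 + 7*2 = 7
print(mac([3, -2, 7], [1, 5, 2], bits=4))  # -> 7
```

Lowering `bits` shrinks the multiplier and the memory words, which is exactly the knob the energy figures on the next slide quantify.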

SLIDE 7


Circuit level implications

                 analog MAC    1-bit digital MAC    multi-precision digital MAC
Area             large         small                medium
Energy           500 TOPs/W    200 TOPs/W           0.5 (16b) / 1 (8b) / 5 (4b) / 10 (2b) TOPs/W
Flexibility      low           medium               high

Accuracy? Best one?

[Bankman, ISSCC18] [Moons, CICC18] [Moons, ISSCC17]

With Stanford (Murmann)
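The efficiency figures quoted on this slide translate directly into energy per operation: 1 TOPs/W is 10^12 ops per joule, i.e. 1 pJ per op. A small sketch, treating the slide's numbers as rough figures:

```python
def energy_per_op_pj(tops_per_watt):
    """1 TOPs/W = 1e12 ops per joule, so the energy per operation,
    expressed in picojoules, is simply 1 / (TOPs/W)."""
    return 1.0 / tops_per_watt

# Efficiency figures quoted on this slide, treated as approximate
for name, tops_w in [("analog MAC", 500), ("1-bit digital MAC", 200),
                     ("8-bit digital MAC", 1), ("16-bit digital MAC", 0.5)]:
    print(f"{name}: {energy_per_op_pj(tops_w):g} pJ/op")
```

So a 16-bit digital MAC costs about 2 pJ, while an analog binary MAC costs about 0.002 pJ: three orders of magnitude, paid for in accuracy and flexibility.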

SLIDE 8

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

Circuit level: analog or digital? Optimal precision?

SLIDE 9

Architecture level choices

Options:
  • configurable systolic accelerator (MAC array + activation memories + weight memories + control)
  • programmable processor (ASIP)

                       systolic accelerator    programmable processor (ASIP)
Area                   small                   large(r)
Energy efficiency      high                    lower
Flexibility (util.)    low                     high

[Moons, CICC18/ISSCC18] [Moons, ISSCC17]

SLIDE 10

Architecture level choices (2)

Spend area on:
  – more MACs in parallel?
  – larger memory?
  – local or global memories?
  – or a programmable processor (ASIP)?

Best one?

SLIDE 11

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

SLIDE 12

Algorithm level choices

Same task can be implemented with many network topologies

Vary network depth, network width, and layer topology (Layer1, Layer2, Layer3, … LayerN)

SLIDE 13

Algorithm level choices: implications

Every trained parameter combination is one dot in the energy-accuracy plane; the best dots form a Pareto-optimal curve.

[Graph for CIFAR-10]

[Moons, Asilomar17]
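The Pareto curve can be extracted mechanically from the cloud of (energy, error) dots. A minimal sketch; the dot values below are made up for illustration:

```python
def pareto_front(points):
    """Keep the (energy, error) points not dominated by any other point:
    a point is dominated if another point has both lower energy and lower error."""
    front = []
    for energy, error in sorted(points):      # scan in order of increasing energy
        if not front or error < front[-1][1]: # keep only strict accuracy improvements
            front.append((energy, error))
    return front

# Each dot = one trained network configuration (energy, classification error)
dots = [(1.0, 0.30), (2.0, 0.20), (3.0, 0.25), (4.0, 0.10), (2.5, 0.18)]
print(pareto_front(dots))  # -> [(1.0, 0.3), (2.0, 0.2), (2.5, 0.18), (4.0, 0.1)]
```

Only the points on this front are worth deploying: every other configuration is beaten on both axes by some front point.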

SLIDE 14

Algorithm level choices: precision

Same task can be implemented with many network topologies

Vary network depth, network width, layer topology, and computational precision (Layer1, Layer2, Layer3, … LayerN)

SLIDE 15

Algorithm level choices: implications

Most energy efficient network?

Int1, Int2, … nets need more operations, but simpler operations, and more (yet smaller) memory accesses!

[Graph for CIFAR-10]

[Moons, Asilomar17]

Impact on parallelism and data reuse? Impact on compute vs memory cost?

SLIDE 16

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application, algorithmic, architecture, and circuit levels
… without giving up flexibility!

Algorithmic level: network depth & width? Layer topology? Bit resolution?
Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

Optimize ACROSS all levels

SLIDE 17

Parametrized HW energy/latency/area model

Energy model parametrized across circuit & architecture options; the same approach yields latency/delay/throughput and area models.

Cost components: DRAM access, SRAM access, multiply-accumulate.

[Moons, Asilomar17]
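A minimal sketch of such a parametrized energy model, assuming illustrative per-access costs and the common assumptions that MAC energy scales roughly quadratically, and memory-access energy roughly linearly, with bit width. All constants are placeholders, not the calibrated values of [Moons, Asilomar17]:

```python
def network_energy(n_mac, n_sram, n_dram, bits,
                   e_mac16=3.2, e_sram16=10.0, e_dram16=640.0):
    """Total inference energy (pJ), parametrized by operation/access counts
    and operand precision. The 16-bit reference costs are placeholders."""
    scale = bits / 16.0
    e_mac = e_mac16 * scale ** 2   # multiplier energy ~ (bit width)^2
    e_sram = e_sram16 * scale      # narrower words -> cheaper accesses
    e_dram = e_dram16 * scale
    return n_mac * e_mac + n_sram * e_sram + n_dram * e_dram

# Compare a 16-bit and a 4-bit version of the same workload
full = network_energy(n_mac=1e9, n_sram=1e8, n_dram=1e6, bits=16)
quant = network_energy(n_mac=1e9, n_sram=1e8, n_dram=1e6, bits=4)
print(f"16-bit: {full/1e6:.0f} uJ, 4-bit: {quant/1e6:.0f} uJ")
```

With these placeholder constants the 4-bit version wins by roughly 8x; in the real framework the same comparison is rerun per network topology and per target accuracy, which is what drives the cross-layer optimization on the next slides.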

SLIDE 18

[Graphs for CIFAR-10]

Energy-based cross-layer optimization

Jointly determine the most energy-efficient network and circuit parameters: 4-bit! But … the optimum varies over accuracies and applications → flexible hardware!

[Moons, Asilomar17]

– A similar study [Moons, Asilomar17] covers the optimum memory vs. datapath size, optimum layer topology, …

SLIDE 19

Needs for flexible systems with cross-layer framework

Requirements: structural and precision scalability, low power consumption.

Silicon prototypes:
  • Envision: precision-scalable CNN processor, gen2 (10-100 mW) [VLSI’16, ISSCC’17]
  • Binareye: machine-learned wake-up image processor (~1 mW) [ISSCC’18, CICC’18]
  • LSTMacc: machine-learned wake-up audio processor (~10 uW) [ESSCIRC’18]

Cross-layer optimization flow: HW models → cross-layer optimization → HW configuration + NN topology

SLIDE 20

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application (here: face recognition), algorithmic, architecture, and circuit levels
… without giving up flexibility!

Algorithmic level: network depth & width? Layer topology? Bit resolution?
Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

Optimize ACROSS all levels
Adapt dynamically (data dependent)

SLIDE 21

Cascaded networks for efficient face recognition

Cascade stages (each "y?" decision gates the next stage):
  • Face detection: binary, 125 MMACs/frame, 17 kB; runs on the Binareye accelerator
  • Face detection (refined): 6-bit, 15 GMACs/frame, 15 MB; runs on the Envision processor
  • Owner detection: binary, 2 GMACs/frame, 260 kB

<1 mWatt average, co-designed across the algorithmic, architecture, and circuit levels
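The control flow of such a cascade is simple: each cheap stage gates the next, more expensive one, so the big networks only run on the rare frames that pass. A sketch with stand-in stage functions; the thresholds and names are illustrative, not the actual detectors:

```python
def cascaded_recognition(frame, detect_face, recognize_face, verify_owner):
    """Hierarchical cascade: a tiny always-on detector gates a mid-size
    recognizer, which gates the final owner-verification stage."""
    if not detect_face(frame):        # tiny binary net, runs on every frame
        return "no face"
    if not recognize_face(frame):     # larger net, runs only on face frames
        return "unknown face"
    return "owner" if verify_owner(frame) else "not owner"

# Toy usage: threshold "networks" on a scalar frame score
result = cascaded_recognition(
    frame=0.9,
    detect_face=lambda f: f > 0.2,
    recognize_face=lambda f: f > 0.5,
    verify_owner=lambda f: f > 0.8,
)
print(result)  # -> owner
```

Because most frames contain no face, the average cost is dominated by the first, cheapest stage.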

SLIDE 22

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the application (here: keyword & speaker recognition), algorithmic, architecture, and circuit levels
… without giving up flexibility!

Algorithmic level: network depth & width? Layer topology? Bit resolution?
Architecture level: data parallelism? Memory hierarchy?
Circuit level: analog or digital? Optimal precision?

Optimize ACROSS all levels
Adapt dynamically (data dependent)

SLIDE 23

Cascaded ML models for efficient keyword & speaker recognition

Cascade stages (each "y?" decision gates the next stage), run on a cascade of embedded accelerators:
  • Voice detection: 1-4 bit, 40 kMACs/sec, ~2 kB
  • Keyword detection: 4-8 bit LSTM, 2 MMACs/sec, 64 kB
  • Speaker identification: 8-bit GMM, 70 MMACs/sec, 500 kB
  • → speech recognition

<20 uWatt average [VLSI2019], co-designed across the algorithmic, architecture, and circuit levels
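The sub-20 uW average follows from duty cycling: each stage's active power is weighted by how often it actually runs. A sketch with illustrative, not measured, stage powers and duty cycles:

```python
def average_power_uw(stages):
    """Average system power of a duty-cycled cascade: each stage's active
    power weighted by the fraction of time that stage is actually running."""
    return sum(power * duty for power, duty in stages)

# (power in uW when active, fraction of time active) -- illustrative values
cascade = [
    (5.0, 1.00),    # voice activity detector: always on
    (100.0, 0.10),  # keyword spotter: runs only on detected voice
    (500.0, 0.01),  # speaker identification: runs only on detected keywords
]
print(f"{average_power_uw(cascade):.0f} uW average")  # -> 20 uW average
```

The always-on stage therefore sets the power floor, which is why it must be the tiniest, lowest-precision model in the cascade.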

SLIDE 24

Towards embedded Deep Neural Networks

Minimize TOTAL energy @ target performance
… by innovating at the algorithmic, architecture, and circuit levels

The system matters, not TOPs/W!
Optimize ACROSS all levels
Adapt dynamically (data dependent)

SLIDE 25

Contact: Marian.Verhelst@kuleuven.be