Ultra L Low Power er Infer eren ence a e at the v very e edge - PowerPoint PPT Presentation

Ultra L Low Power er Infer eren ence a e at the v very e edge o of the n netw twork Tiny ML Summit March 20-21 2019 Sunnyvale Eric Flamand, CTO & CoFounder of Greenwaves Technologies

Wh Who a o are w we? • French based startup created in 2015 • First product, GAP8, launched in Feb 2018 21/3/2019 Tiny ML Summit. March 2019 2

Ou Our M Market Vi Vision Market Demand Rich sensor data The IoT pipe NB-IoT, LTE-M, Sigfox, Linear PCM = Keyword Spotting 1.4 Mbit/s LoRa, etc. Beam forming Speech pre-processing 24-bit @ Vibration analysis 50kHz = 1.2 Fault detection Mbit/s 8-bit, 160x120 Face detection B/day to kB/day @ 10 fps = Presence detection 4.6 Mbit/s Battery operated Counting sensors Emotion detection 21/3/2019 Tiny ML Summit. March 2019 3

Ou Our M Market Vi Vision mW class radio with duty cycling capability Market Demand Rich sensor data The IoT pipe Market demand NB-IoT, LTE-M, Sigfox, Linear PCM = CNN + 1.4 Mbit/s LoRa, etc. SVM Low operation cost Bayesian Issue: way more MIPS than + Boosting an MCU can deliver but still Low deployment cost 24-bit @ Cepstral need to be 50kHz = 1.2 + analysis Mbit/s within an MCU power Low installation cost envelope ? = 8-bit, 160x120 Massive deployment of B/day to kB/day @ 10 fps = B/day to kB/day intelligent rich data sensors 4.6 Mbit/s Battery operated sensors mW class sensors available for sound, image, radar, … 21/3/2019 Tiny ML Summit. March 2019 4

GAP8 An I IOT A Applic licatio ion P Proc ocessor or Two independent clock and voltage domains, from 0-133MHz/1V up to 0-250MHz/1.2V MCU Function Cluster clock & voltage domain FC clock & voltage domain Extended RISC-V core LVDS Extensive I/O set Cluster Shared L1 Memory Serial I/Q DMA Micro DMA L2 Embedded DC/DC converters UART Memory Micro DMA Secured execution / e-fuses Logarithmic Interconnect HW SPI Sync Computation engine function I2C I $ 8 extended RISC-V cores I2S HWCE Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Fully programmable CPI Fabric Efficient parallelization Controller HyperBus Shared instruction cache Multi channel DMA L1 GPIO / PWM HW synchronization Shared Instruction Cache PMU RTC Debug ROM Debug HW convolution Engine An integrated, hierarchical architecture Inference TSMC 55LP Deep sleep Retentive Pre-analysis 1.0V to 1.2V few 10mWs 1uA 1µA+x*8µA 1mWs Max Freq: 133 MHz to 250 MHz 21/3/2019 Tiny ML Summit. March 2019 5 Up to 12.8 Gops-1

Gap8 T The o e open en s source h her eritage GreenWaves - Best in class Instruction Set - Open Source Computing Platform created by -Innovating on Risc-V and PULP Architecture (ISA) ETHZ and UniBo -Proprietary balanced system solution (SOC) - UC Berkeley originated - Permissive license (solderpad) based on PULP open source elements plus - Multiple tape outs GWT proprietary elements both on HW and - GWT Member of RiscV - GWT contributes to PULP SW/Tools side Foundation 21/3/2019 Tiny ML Summit. March 2019 6

How to w to o optimize e e ener ergy gy e efficien ency • Being strictly energy proportional to the demand • Light weight ISA specialization • Going parallel • Hardwired accelerator • Explicit memory management 21/3/2019 Tiny ML Summit. March 2019 7

Being p g proporti tional to to t the e dem emand MCU sleep mode 1 to 50 uW Low quiescent LDO  Duty Cycling Real Time Clock 32KHz only  L2 Memory partially retentive  MCU active mode 0.5 to 5 mW Embedded DC/DC, high current  Voltage can dynamically change Coarse Grain  One clock gen active, frequency can Classification  dynamically change Systematic clock gating  MCU + Parallel processor active mode 5 to 50 mW Embedded DC/DC, high current  Full Blown Voltage can dynamically change  Analysis Two clock gen active, frequencies can  dynamically change Systematic Clock Gating  Ultra fast switching time from one mode to another Highly optimized system Ultra fast voltage and frequency change time level power consumption 21/3/2019 Tiny ML Summit. March 2019 8

Light W Weig ight ISA s specializ ializatio ion • We started from RiscV ISA (IMC) and boosted core performances for • DSP kernels • Linear Algebra • SIMD type vectorization • Datapath gate count increased by approx. 30% 21/3/2019 Tiny ML Summit. March 2019 9

Light W Weig ight ISA s specializ ializatio ion Extended ISA - Cycle Count Speedup Extended ISA - Energy Improvement 8.0 8.0 7.1 6.8 6.8 7.0 7.0 6.1 6.0 6.0 5.0 5.0 3.9 3.9 3.9 3.9 3.6 3.8 3.8 3.8 3.8 [VALEUR] 4.0 4.0 3.4 3.4 3.2 3.1 2.8 2.8 2.8 2.8 2.6 2.6 2.6 2.6 2.6 3.0 3.0 2.2 2.1 2.2 2.3 2.0 2.0 2.0 1.8 1.8 1.7 1.7 1.6 1.6 2.0 2.0 1.3 1.4 1.3 1.3 1.0 1.0 0.0 0.0 Rv to Gap8 wo Vect Rv To Gap w Vect Rv to Gap8 wo Vect Rv To Gap w Vect 21/3/2019 Tiny ML Summit. March 2019 10

Going P g Parallel el • Goal: quasi perfect performance scaling as a function of the number of cores involved • Obvious gain: • Power = V 2 * Freq => Scale down voltage • Less Obvious: • Make sure synchronization is not visible since it is serial by nature (Ahmdal) • Maximize instruction cache reuse in a context where we perform lot of partial evaluation => Shared Instruction Cache with broadcast capability 21/3/2019 Tiny ML Summit. March 2019 11

Goin ing P Paralle llel l – Syn ynchron oniz ization ion • Master core wants to dispatch function Foo with arguments on a group of cores • All cores blocked on a synchronization barrier are instantly clock gated 21/3/2019 Tiny ML Summit. March 2019 12

Goin ing P Paralle llel l – Perfor orman ance S Scalin aling Quasi Perfect Scaling 21/3/2019 Tiny ML Summit. March 2019 13

Goin ing P Paralle llel l – Energy S y Scalin aling Convolution: 80% of CNN workload Average Extension’s Energy Gain: 3.4 Amplified by Parallelism: 7.4 21/3/2019 Tiny ML Summit. March 2019 14

Putting E g Ever eryth thing g Together v vs MCU Running CIFAR10, same network, same precision What Freq MHz Exec Time ms Cycles Power mW 40nm Dual Issue MCU 216 99.1 21 400 000 60 16 X GAP8 @1.0V 15.4 99.1 11 X 1 500 000 3.7 GAP8 @1.2V 17.5 8.7 1 500 000 70 GAP8 @1.0V w HWCE 4.7 99.1 460 000 0.8 21/3/2019 Tiny ML Summit. March 2019 15

Explicit M Memory M Managem emen ent • Gap8 is not equipped with data caches External L3 • Silicon area (Ram/Flash) • More important: energy efficiency mostly due to hit ratio • We can turn this weakness into an (energy) benefit if we can automate data transfers L2 uDMA • In practice a vast majority of traffic is predictable => We have a way to optimize memory allocation and bandwidth Automatic data tiling and pipelined memory Shared L1 DMA transfer interleaved with parallel call to compute kernel is solved by our “Autotiler” tool 1 8 Exec L2 to L1 L3 to L2 21/3/2019 Tiny ML Summit. March 2019 16

Explicit M Memory M Managem emen ent: t: AutoT oTile iler How to handle a parametric tile Basic Kernels • Vectorization + Parallelization • No assumption on where actual data are located Usually seen as libraries Passing actual data to basic kernels and having data circulating between them • A multi dimensional iteration space (2D; 3D; 4D; 5D. ..) and a traversal order • Each argument is a sub space of the iteration space and has actual dimensions, location (L2, external) and properties. Order may differ from User Kernels the one of the iteration space • Given a memory budget the auto tiler “tiles” each argument and generates Can be grouped and a fully pipelined implementation interleaving processing and data transfers organized as generators • Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilog, …) • Generated tiles are passed to Basic Kernels Connected User Kernels, constants, in and out features Graph • Optimal static memory allocation for all dynamic objects CNN + Pre/Post Processing 21/3/2019 Tiny ML Summit. March 2019 17

Explicit M Memory M Managem emen ent: t: AutoT oTile iler User Kernels Group of User Kernels Autotiler Library Generators BasicKernels Graph C Libraries (Constraints Solver, C Code Generator) C Programs, calls to Autotiler’s Model API #include "AutoTilerLib.h" #include "CNN_Generator.h" void Mnist() { Compile & Run on PC CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1); CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1); CNN_TiledLinearLayer ("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0); } C code for the target handling data transfers and Basic Kernels dispatch on cluster’s cores. Working set is tiled in a way that maximize reuse at minimum distance from data path 21/3/2019 Tiny ML Summit. March 2019 18

Ultra L Low Power er Infer eren ence a e at the v very e edge - PowerPoint PPT Presentation

Ultra L Low Power er Infer eren ence a e at the v very e edge o of the n netw twork Tiny ML Summit March 20-21 2019 Sunnyvale Eric Flamand, CTO & CoFounder of Greenwaves Technologies Wh Who a o are w we? French based startup

Innovative Power Control for Ultra Low-Power and High- Ultra Low Power and High Performance

A/D Conversion and A/D Conversion Filtering for Ultra Low Filtering for Ultra Low A/D

Customer Presentation 16-bit Ultra Low Power Microcontroller The eCOG1, 16 Bit Ultra Low Power

ChemBioDraw Today & Tomorrow Mark L. Olson, PhD Vice-President, Software Development

Low Power Microprocessors Low Power Microprocessors Low Power Technology Gao Wei & Tian

Strategic Integration of Ultra Low Strategic Integration of Ultra Low Power Technologies g

Outline Introduction. Paper: System Design for Ultra-Low Power. Bernier, C. Hameau, F.,

EREN Workshop Agenda 10:00AM 11:00AM Introductory Presentations EREN Overview Laurie

Finding Inter-procedural Bugs at Scale with Infer Jules Villard <jul@fb.com> Facebook London

In Investo estor Con r Confer eren ence ce 20 2013 13 Agenda Introduction and Overview

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs Jinkyu Jeong

34th thGlobal GS GS1 H Health thcare C Confer eren ence e 2018 BANGKOK UDI current

Importance of Tack Coat th Annual 39 39 th al Asphalt lt Paving ing Confer eren ence ce

Cona naway Confer eren ence 2 ce 2017 Brad Jones, P.E., Deputy Director Division of

Pareto O Oil & & Offshore Co e Confer eren ence September 2019 Leg egal D Disc

Metrop opolis C Con onfer eren ence 2019 e 2019 Introdu duction n to the he T Trans

Concepts of programming languages PureScript Christian Stuart, Douwe van Gijn, Martijn Fleuren

HiGHS High-performance open-source software for linear optimization Julian Hall School of

PYTHON FOR OPTIMIZATION BEN MORAN @BENM HTTP://BENMORAN.WORDPRESS.COM/ 2014-02-22

Ara Design and implementation of a 1GHz+ 64-bit RISC-V Vector Processor in 22 nm FD-SOI Matheus

IMPACT OF 45Q TAX CREDITS Shannon Angielski, Executive Director CCS 101 Briefing Series, Briefing

Future Directions An Ansible Developer and Ecosystem Story Timothy Appnel Senior Product

Our Value Creation Journey Corporate Presentation 2019 26 April 2019 The material presented is

The Roadmap for JuMP 1.0 Miles Lubin Google 1 JuMP-dev 2019 1 JuMP is not a Google product. Talk

Sambuz

Useful Links

Newsletter

Mail Us

Ultra L Low Power er Infer eren ence a e at the v very e edge - PowerPoint PPT Presentation

Ultra L Low Power er Infer eren ence a e at the v very e edge o of the n netw twork Tiny ML Summit March 20-21 2019 Sunnyvale Eric Flamand, CTO & CoFounder of Greenwaves Technologies Wh Who a o are w we? French based startup

Innovative Power Control for Ultra Low-Power and High- Ultra Low Power and High Performance

A/D Conversion and A/D Conversion Filtering for Ultra Low Filtering for Ultra Low A/D

Customer Presentation 16-bit Ultra Low Power Microcontroller The eCOG1, 16 Bit Ultra Low Power

ChemBioDraw Today &amp; Tomorrow Mark L. Olson, PhD Vice-President, Software Development

Low Power Microprocessors Low Power Microprocessors Low Power Technology Gao Wei &amp; Tian

Strategic Integration of Ultra Low Strategic Integration of Ultra Low Power Technologies g

Outline Introduction. Paper: System Design for Ultra-Low Power. Bernier, C. Hameau, F.,

EREN Workshop Agenda 10:00AM 11:00AM Introductory Presentations EREN Overview Laurie

Finding Inter-procedural Bugs at Scale with Infer Jules Villard &lt;jul@fb.com&gt; Facebook London

In Investo estor Con r Confer eren ence ce 20 2013 13 Agenda Introduction and Overview

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs Jinkyu Jeong

34th thGlobal GS GS1 H Health thcare C Confer eren ence e 2018 BANGKOK UDI current

Importance of Tack Coat th Annual 39 39 th al Asphalt lt Paving ing Confer eren ence ce

Cona naway Confer eren ence 2 ce 2017 Brad Jones, P.E., Deputy Director Division of

Pareto O Oil &amp; &amp; Offshore Co e Confer eren ence September 2019 Leg egal D Disc

Metrop opolis C Con onfer eren ence 2019 e 2019 Introdu duction n to the he T Trans

Concepts of programming languages PureScript Christian Stuart, Douwe van Gijn, Martijn Fleuren

HiGHS High-performance open-source software for linear optimization Julian Hall School of

PYTHON FOR OPTIMIZATION BEN MORAN @BENM HTTP://BENMORAN.WORDPRESS.COM/ 2014-02-22

Ara Design and implementation of a 1GHz+ 64-bit RISC-V Vector Processor in 22 nm FD-SOI Matheus

IMPACT OF 45Q TAX CREDITS Shannon Angielski, Executive Director CCS 101 Briefing Series, Briefing

Future Directions An Ansible Developer and Ecosystem Story Timothy Appnel Senior Product

Our Value Creation Journey Corporate Presentation 2019 26 April 2019 The material presented is

The Roadmap for JuMP 1.0 Miles Lubin Google 1 JuMP-dev 2019 1 JuMP is not a Google product. Talk

Sambuz

Useful Links

Newsletter

Mail Us

ChemBioDraw Today & Tomorrow Mark L. Olson, PhD Vice-President, Software Development

Low Power Microprocessors Low Power Microprocessors Low Power Technology Gao Wei & Tian

Finding Inter-procedural Bugs at Scale with Infer Jules Villard <jul@fb.com> Facebook London

Pareto O Oil & & Offshore Co e Confer eren ence September 2019 Leg egal D Disc