SLIDE 1

Ultra Low Power Inference at the Very Edge of the Network

Tiny ML Summit, March 20-21 2019, Sunnyvale
Eric Flamand, CTO & Co-Founder of GreenWaves Technologies

SLIDE 2

Who are we?

  • French-based startup created in 2015
  • First product, GAP8, launched in Feb 2018


SLIDE 3

Our Market Vision


The IoT pipe (NB-IoT, LTE-M, Sigfox, LoRa, etc.) carries only bytes to kilobytes per day from battery-operated sensors.

Rich sensors produce far more: 8-bit 160x120 video @ 10 fps = 4.6 Mbit/s; 24-bit audio @ 50 kHz = 1.2 Mbit/s; linear PCM = 1.4 Mbit/s.

Market demand for rich sensor data: keyword spotting, beam forming, speech pre-processing, vibration analysis, fault detection, face detection, presence detection, counting, emotion detection.

SLIDE 4

Our Market Vision


The same picture: bytes to kilobytes per day through the IoT pipe (NB-IoT, LTE-M, Sigfox, LoRa, etc.), megabits per second out of the sensors (8-bit 160x120 @ 10 fps = 4.6 Mbit/s; 24-bit @ 50 kHz = 1.2 Mbit/s; linear PCM = 1.4 Mbit/s).

Market demand for rich sensor data is served by algorithms such as CNNs, SVMs, Bayesian methods, boosting and cepstral analysis.

Market demand + low operation cost + low deployment cost + low installation cost = massive deployment of intelligent rich-data sensors.

Issue: this takes far more MIPS than an MCU can deliver, yet it must stay within an MCU power envelope. A 4.6 Mbit/s camera stream amounts to roughly 50 GB/day, six to seven orders of magnitude more than the pipe can carry, so raw data must be reduced to classifications at the sensor itself.

mW-class sensors are available for sound, image, radar, etc., as are mW-class radios with duty-cycling capability.

SLIDE 5

GAP8: An IoT Application Processor

[Block diagram. FC clock & voltage domain: Fabric Controller with PMU, RTC, L2 memory, L1, ROM, instruction cache and debug; peripherals (LVDS, serial I/O, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO/PWM) served by the micro DMA. Cluster clock & voltage domain: cores 0 to 7 plus the HWCE on a logarithmic interconnect, with shared L1 memory, shared instruction cache, cluster DMA and HW sync.]

Two independent clock and voltage domains, from 0-133 MHz at 1.0 V up to 0-250 MHz at 1.2 V.

MCU function: extended RISC-V core, extensive I/O set, micro DMA, embedded DC/DC converters, secured execution / e-fuses.

Compute engine function: 8 extended RISC-V cores, fully programmable, efficient parallelization, shared instruction cache, multi-channel DMA, HW synchronization, HW convolution engine.

An integrated, hierarchical architecture with matching power states: deep sleep at 1 uA; retentive sleep at 1 uA + x * 8 uA depending on how much L2 is retained; pre-analysis at about 1 mW; inference at a few tens of mW.

TSMC 55LP, 1.0 V to 1.2 V, max frequency 133 MHz to 250 MHz, up to 12.8 GOPS.

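To make the hierarchy concrete, here is a minimal sketch of how an application might ride these power states. Every function name below is a hypothetical placeholder, stubbed so the file compiles; none of them is the real GAP8 SDK API.

/* Hypothetical hierarchical duty-cycle loop (stubs, NOT the GAP8 SDK). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FRAME_LEN 16000                          /* 1 s of audio @ 16 kHz */

static void gap_deep_sleep_ms(uint32_t ms) { (void)ms; }  /* ~1 uA state  */
static void gap_acquire(int16_t *f) { f[0] = 0; }         /* uDMA capture */

/* Coarse trigger run on the Fabric Controller (~1 mW): a cheap energy
 * threshold decides whether waking the cluster is worthwhile. */
static bool pre_analysis(const int16_t *f)
{
    int64_t energy = 0;
    for (int i = 0; i < FRAME_LEN; i++)
        energy += (int64_t)f[i] * f[i];
    return energy > ((int64_t)1 << 20);
}

/* Full inference on the 8-core cluster (tens of mW for tens of ms). */
static int run_inference(const int16_t *f) { (void)f; return -1; }

int main(void)
{
    static int16_t frame[FRAME_LEN];             /* lives in retentive L2 */
    for (;;) {
        gap_deep_sleep_ms(900);                  /* deep sleep dominates  */
        gap_acquire(frame);                      /* capture next frame    */
        if (pre_analysis(frame))                 /* FC-only coarse check  */
            printf("class=%d\n", run_inference(frame));
    }
}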

SLIDE 6

GAP8: The Open Source Heritage


RISC-V

  • Best in class Instruction Set Architecture (ISA)
  • UC Berkeley originated
  • GWT member of the RISC-V Foundation

PULP

  • Open source computing platform created by ETHZ and UniBo
  • Permissive license (Solderpad)
  • Multiple tape-outs
  • GWT contributes to PULP

GreenWaves

  • Innovating on RISC-V and PULP
  • Proprietary balanced system solution (SoC) based on PULP open source elements plus GWT proprietary elements, on both the HW and SW/tools side

SLIDE 7

How to optimize energy efficiency

  • Being strictly energy-proportional to the demand
  • Lightweight ISA specialization
  • Going parallel
  • Hardwired accelerator
  • Explicit memory management


SLIDE 8

Being proportional to the demand


Ultra fast switching from one mode to another; ultra fast voltage and frequency change; highly optimized system-level power consumption.

Duty cycling (MCU sleep mode, 1 to 50 uW):
  • Low quiescent LDO
  • Real-time clock, 32 kHz only
  • L2 memory partially retentive

Coarse grain classification (MCU active mode, 0.5 to 5 mW):
  • Embedded DC/DC, high current
  • Voltage can dynamically change
  • One clock generator active, frequency can dynamically change
  • Systematic clock gating

Full blown analysis (MCU + parallel processor active mode, 5 to 50 mW):
  • Embedded DC/DC, high current
  • Voltage can dynamically change
  • Two clock generators active, frequencies can dynamically change
  • Systematic clock gating
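What these modes buy is a low time-weighted average power. A back-of-the-envelope sketch, with illustrative duty cycles and per-mode powers picked from the ranges above (nothing here is measured data):

/* Illustrative average-power and battery-life arithmetic for a
 * duty-cycled sensor; all numbers are assumptions from the mode ranges. */
#include <stdio.h>

int main(void)
{
    const double t_sleep  = 0.90, p_sleep  = 0.000020;  /* 20 uW */
    const double t_coarse = 0.09, p_coarse = 0.002;     /* 2 mW  */
    const double t_full   = 0.01, p_full   = 0.020;     /* 20 mW */

    double p_avg = t_sleep * p_sleep + t_coarse * p_coarse + t_full * p_full;
    printf("average power: %.0f uW\n", p_avg * 1e6);    /* ~420 uW */

    double e_batt = 2.0 * 3.0 * 3600.0;  /* 2000 mAh @ 3 V = 21.6 kJ */
    printf("battery life: %.0f days\n", e_batt / p_avg / 86400.0); /* ~600 */
    return 0;
}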

SLIDE 9

Lightweight ISA specialization

  • We started from the RISC-V ISA (IMC) and boosted core performance for:
  • DSP kernels
  • Linear algebra
  • SIMD-type vectorization
  • Datapath gate count increased by approx. 30%

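To illustrate what SIMD-type vectorization buys on a DSP kernel, here is a 16-bit dot product written with GCC's generic vector extensions as a portable stand-in; the actual GAP8 extensions are exposed differently in the vendor toolchain, so this is the idea, not the real syntax.

/* Scalar vs. SIMD 16-bit dot product; GCC vector extensions stand in
 * for the extended ISA's packed operations (NOT GAP8 builtins). */
#include <stdint.h>
#include <stdio.h>

typedef int16_t v2s __attribute__((vector_size(4)));  /* 2 x int16 lanes */

int32_t dot_scalar(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];                    /* 1 MAC per iteration */
    return acc;
}

int32_t dot_simd(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    const v2s *va = (const v2s *)a, *vb = (const v2s *)b;
    for (int i = 0; i < n / 2; i++) {
        v2s p = va[i] * vb[i];                 /* 2 multiplies at once */
        acc += p[0] + p[1];                    /* plus a reduction     */
    }
    return acc;
}

int main(void)
{
    int16_t a[8] __attribute__((aligned(4))) = {1, 2, 3, 4, 5, 6, 7, 8};
    int16_t b[8] __attribute__((aligned(4))) = {8, 7, 6, 5, 4, 3, 2, 1};
    printf("%d %d\n", dot_scalar(a, b, 8), dot_simd(a, b, 8)); /* 120 120 */
    return 0;
}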

SLIDE 10

Lightweight ISA specialization


[Bar chart: Extended ISA cycle-count speedup per kernel, RISC-V to GAP8 without and with vectorization; speedups range from about 1.3x to 6.8x.]

[Bar chart: Extended ISA energy improvement per kernel, RISC-V to GAP8 without and with vectorization; gains range from about 1.3x to 7.1x.]

SLIDE 11

Going Parallel

  • Goal: quasi-perfect performance scaling as a function of the number of cores involved
  • Obvious gain: dynamic power ∝ V² * Freq => spread the work over more cores at lower frequency and scale down the voltage
  • Less obvious:
  • Make sure synchronization is not visible, since it is serial by nature (Amdahl)
  • Maximize instruction cache reuse in a context where we perform a lot of partial evaluation => shared instruction cache with broadcast capability

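A rough illustration of the voltage-scaling lever above, with made-up numbers (not GAP8 characterization data): the same throughput delivered by 8 cores at one-eighth the frequency and a lower supply costs measurably less power.

/* Dynamic-power arithmetic for the parallelism argument; C, V and f
 * are normalized, illustrative values. */
#include <stdio.h>

int main(void)
{
    const double C  = 1.0;                   /* switched capacitance   */
    double p1 = C * 1.2 * 1.2 * 8.0;         /* 1 core,  V=1.2, f=8    */
    double p8 = 8.0 * C * 1.0 * 1.0 * 1.0;   /* 8 cores, V=1.0, f=1    */
    printf("1 core @1.2V vs 8 cores @1.0V: %.2fx power\n", p1 / p8); /* 1.44 */
    return 0;
}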

SLIDE 12

Going Parallel – Synchronization

  • A master core wants to dispatch a function Foo with its arguments on a group of cores
  • All cores blocked on a synchronization barrier are instantly clock gated

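A minimal sketch of the resulting fork-join pattern. cluster_fork and barrier_wait are hypothetical names standing in for the hardware-assisted dispatch and barrier (PULP-style runtimes expose something similar); this is not the actual GAP8 API.

/* Fork-join dispatch sketch: the master broadcasts Foo and its
 * arguments; workers idle in the HW barrier are clock gated. */
#include <stdint.h>

typedef struct { const int16_t *in; int16_t *out; int n; } foo_args_t;

/* Each core processes an interleaved slice of the data. */
static void Foo(void *vargs, int core_id, int n_cores)
{
    foo_args_t *a = (foo_args_t *)vargs;
    for (int i = core_id; i < a->n; i += n_cores)
        a->out[i] = a->in[i] >> 1;
}

/* Hypothetical primitives backed by the HW synchronizer. */
extern void cluster_fork(void (*f)(void *, int, int), void *args);
extern void barrier_wait(void);

void run_on_cluster(const int16_t *in, int16_t *out, int n)
{
    foo_args_t args = { in, out, n };
    cluster_fork(Foo, &args);   /* master wakes cores 1..7, runs core 0 */
    barrier_wait();             /* blocked cores clock gated until done */
}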

SLIDE 13

Going Parallel – Performance Scaling

[Chart: speedup vs. number of cores, showing quasi-perfect scaling.]

SLIDE 14

Going Parallel – Energy Scaling


Average energy gain from the ISA extensions: 3.4x, amplified by parallelism to 7.4x.

Convolution accounts for about 80% of a CNN workload.

SLIDE 15

Putting Everything Together vs. an MCU


Running CIFAR10, same network, same precision:

What                 | Freq (MHz) | Exec time (ms) | Cycles     | Power (mW)
40nm dual-issue MCU  | 216        | 99.1           | 21,400,000 | 60
GAP8 @ 1.0V          | 15.4       | 99.1           | 1,500,000  | 3.7
GAP8 @ 1.2V          | 175        | 8.7            | 1,500,000  | 70
GAP8 @ 1.0V w/ HWCE  | 4.7        | 99.1           | 460,000    | 0.8

At equal execution time, GAP8 needs 16x less power than the MCU; at full speed it is 11x faster.

SLIDE 16

Explicit Memory Management


[Diagram: memory hierarchy with shared L1 (8 cores), L2 and external L3 (RAM/Flash), linked by the cluster DMA and the uDMA.]

  • GAP8 is not equipped with data caches
  • Saves silicon area
  • More important: energy efficiency, mostly lost through the hit-ratio mechanics of caches
  • We can turn this weakness into an (energy) benefit if we can automate data transfers
  • In practice the vast majority of traffic is predictable => we have a way to optimize memory allocation and bandwidth across the Exec, L2-to-L1 and L3-to-L2 streams

Automatic data tiling and pipelined memory transfers, interleaved with parallel calls to the compute kernels, are handled by our "Autotiler" tool.
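The traffic pattern being automated is classic double-buffered tiling: while the cores compute on tile i in L1, the DMA fetches tile i+1 from L2. A hand-written sketch of that pattern, with hypothetical dma_copy_async/dma_wait stand-ins rather than the real cluster DMA API:

/* Double-buffered tiling sketch: overlap DMA-in of tile i+1 with
 * compute on tile i. The dma_* calls are hypothetical stand-ins. */
#include <stdint.h>

#define TILE 1024

extern void dma_copy_async(void *dst, const void *src, int bytes, int *id);
extern void dma_wait(int id);
extern void compute_tile(int16_t *tile, int n);   /* parallel kernel */

void process(const int16_t *l2_src, int n_tiles)
{
    static int16_t l1_buf[2][TILE];               /* ping-pong in L1 */
    int id[2];

    dma_copy_async(l1_buf[0], l2_src, sizeof(l1_buf[0]), &id[0]);
    for (int i = 0; i < n_tiles; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < n_tiles)                      /* prefetch next tile */
            dma_copy_async(l1_buf[nxt], l2_src + (i + 1) * TILE,
                           sizeof(l1_buf[nxt]), &id[nxt]);
        dma_wait(id[cur]);                        /* current tile ready */
        compute_tile(l1_buf[cur], TILE);          /* overlaps with DMA  */
    }
}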

SLIDE 17

Explicit Memory Management: AutoTiler


Basic Kernels

How to handle a parametric tile:
  • Vectorization + parallelization
  • No assumption on where the actual data are located
  • Usually seen as libraries; can be grouped and organized as generators

User Kernels

Passing actual data to basic kernels and having data circulate between them:
  • A multi-dimensional iteration space (2D, 3D, 4D, 5D, ...) and a traversal order
  • Each argument is a sub-space of the iteration space and has actual dimensions, location (L2, external) and properties; its order may differ from the one of the iteration space
  • Given a memory budget, the AutoTiler "tiles" each argument and generates a fully pipelined implementation interleaving processing and data transfers
  • Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilogue, ...)
  • Generated tiles are passed to Basic Kernels

Graph

Connected User Kernels, constants, and input/output features:
  • Optimal static memory allocation for all dynamic objects
  • CNN + pre/post-processing

SLIDE 18

Explicit Memory Management: AutoTiler


The model flow: Basic Kernels -> User Kernels -> groups of User Kernels -> Generators -> Graph, expressed as C programs calling the AutoTiler's Model API and linked against C libraries and the AutoTiler library (constraint solver, C code generator).

Compile and run the model on a PC: it produces C code for the target that handles data transfers and dispatches Basic Kernels on the cluster's cores. The working set is tiled in a way that maximizes reuse at minimum distance from the datapath.

#include "AutoTilerLib.h" #include "CNN_Generator.h" void Mnist() { CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1); CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1); CNN_TiledLinearLayer ("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0); }

SLIDE 19

CNN HW Accelerator


  • A set of 112 multipliers performing 4x16 products and several reduction stages enable, in one cycle:
  • A single 5x5 or 4x7 convolution with 16b weights and 16b pixels
  • 3 3x3 convolutions with 16b weights and 16b pixels
  • 2 5x5 or 4x7 convolutions with 8b weights and 16b pixels
  • 4 5x5 or 4x7 convolutions with 4b weights and 16b pixels
  • In all cases weights can be reduced to 8b or 4b, and pixels to 8b, in order to reduce bandwidth and power

3x performance speedup and 4x energy gain versus a pure-SW 8-core implementation. The energy gain decreases for non-unit stride or dilation.
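The mode list is self-consistent if one reads the array as 112 multipliers of 4b x 16b, ganged 4/2/1 per product for 16b/8b/4b weights. That reading is our inference from the slide's numbers, not an official microarchitecture description; the arithmetic checks out:

/* Multiplier-budget check for the HWCE modes, assuming 112 4b x 16b
 * multipliers ganged by weight width (an inference, not a spec). */
#include <stdio.h>

static int cost(int taps, int convs, int mults_per_product)
{
    return taps * convs * mults_per_product;
}

int main(void)
{
    printf("1x 4x7 @16b: %d\n", cost(4 * 7, 1, 4));  /* 112 of 112 */
    printf("3x 3x3 @16b: %d\n", cost(3 * 3, 3, 4));  /* 108 of 112 */
    printf("2x 4x7 @8b:  %d\n", cost(4 * 7, 2, 2));  /* 112 of 112 */
    printf("4x 4x7 @4b:  %d\n", cost(4 * 7, 4, 1));  /* 112 of 112 */
    return 0;
}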

SLIDE 20


On real-life networks

SLIDE 21

Keyword Spotting


Processing 1 second of voice data at 1.0 V:

  • MFCC on the FC: 170 ms at 3.3 mW -> 560 uW average
  • CNN (cluster), SW version: 155 ms at 11.8 mW -> 1.8 mW average
  • CNN on the HWCE: 58 ms at 8.8 mW -> 509 uW average
  • Total: 1.07 mW with the HWCE, 2.36 mW in SW

Google KWS CNN: Conv 8x20 + MaxPool 2x2/2, 1 input feature, 32 output features, W:95, H:40; Conv 4x10 + ReLU, 32 input features, 32 output features; Linear: 10 outputs.
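These averages are just duty-cycle arithmetic over the one-second frame and can be checked directly (small rounding differences against the slide's figures are expected):

/* Check of the average-power figures: average over a 1 s frame
 * = active power x (active time / 1 s). */
#include <stdio.h>

int main(void)
{
    double mfcc     = 3.3  * 0.170;   /* ~0.56 mW */
    double cnn_sw   = 11.8 * 0.155;   /* ~1.8 mW  */
    double cnn_hwce = 8.8  * 0.058;   /* ~0.51 mW */
    printf("SW total:   %.2f mW\n", mfcc + cnn_sw);    /* ~2.4 mW  */
    printf("HWCE total: %.2f mW\n", mfcc + cnn_hwce);  /* ~1.07 mW */
    return 0;
}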

SLIDE 22

CNN-Based Text Recognition


Trainable parameters: 421,263

33 ms per image

SLIDE 23

DRONET: ResNet-based Autonomous Drone


  • Developed by UZH and ETH Zurich
  • Autonomously follows a road/corridor and avoids collisions
  • Up to 18 frames per second at maximum frequency
  • @ 1.0 V, FC: 50 MHz, cluster: 100 MHz -> 6.5 fps at 40 mW

SLIDE 24

Conclusion

  • Properly using the following levers:
  • Agile power management
  • Core ISA extensions
  • Efficient support for parallelization
  • Tool-managed explicit optimal memory management
  • We have shown that, while remaining within the power envelope of an ultra low power MCU, we can boost performance by more than an order of magnitude while staying fully SW-programmable
  • This enables support of mid-complexity CNNs within an MCU-class power budget, for inference at the very edge, on battery, for years


SLIDE 25


Thank You!