Ultra Low Power Inference at the very edge of the network
Tiny ML Summit March 20-21 2019 Sunnyvale Eric Flamand, CTO & CoFounder of Greenwaves Technologies
Who are we? A French-based startup.
21/3/2019 Tiny ML Summit. March 2019 2
The IoT pipe (NB-IoT, LTE-M, Sigfox, LoRa, etc.): bytes/day to kB/day, for battery-operated sensors.
Rich sensor data rates, by contrast:
- Image, 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
- Audio, 24-bit @ 50 kHz = 1.2 Mbit/s
- Linear PCM = 1.4 Mbit/s
Market demand, rich sensor data: keyword spotting, beam forming, speech pre-processing, vibration analysis, fault detection, face detection, presence detection, counting, emotion detection.
Market demand, rich sensor data: CNNs, SVMs, Bayesian methods, boosting, cepstral analysis.
Market demand + Low operation cost + Low deployment cost + Low installation cost = Massive deployment of intelligent rich data sensors
The issue: this takes far more MIPS than an MCU can deliver, yet must stay within an MCU power envelope.
mW-class sensors are available for sound, image, radar, etc.; mW-class radios with duty-cycling capability exist as well.
[Block diagram: GAP8 SoC. FC clock & voltage domain: Fabric Controller with L2 memory, L1, ROM, I$, PMU, RTC, Micro DMA, and peripherals (Debug, LVDS, Serial I/Q, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO/PWM). Cluster clock & voltage domain: 8 cores (Core 0 to Core 7) with shared L1 memory, shared instruction cache, logarithmic interconnect, cluster DMA, HW sync, and the HW Convolution Engine (HWCE).]
Two independent clock and voltage domains, from 0-133 MHz at 1.0 V up to 0-250 MHz at 1.2 V.
MCU function: extended RISC-V core, extensive I/O set, micro DMA, embedded DC/DC converters, secured execution / e-fuses.
Computation engine function: 8 extended RISC-V cores, fully programmable, efficient parallelization, shared instruction cache, multi-channel DMA, HW synchronization, HW Convolution Engine.
An integrated, hierarchical architecture:
- Deep sleep: 1 µA
- Retentive sleep: 1 µA + x * 8 µA
- Pre-analysis: 1 mWs
- Inference: a few 10 mWs
TSMC 55LP, 1.0 V to 1.2 V; max frequency: 133 MHz to 250 MHz; up to 12.8 GOPS.
Architecture: the ISA foundation comes from ETHZ and UniBo; GAP8 is based on PULP open-source elements plus GWT proprietary elements, on both the HW and SW/tools side.
Ultra-fast switching time from one mode to another; ultra-fast voltage and frequency change time; highly optimized system-level power consumption.

MCU sleep mode, 1 to 50 µW (duty cycling):
- Low-quiescent LDO
- Real-time clock, 32 kHz only
- L2 memory partially retentive

MCU active mode, 0.5 to 5 mW (coarse-grain classification):
- Embedded DC/DC, high current
- Voltage can change dynamically
- One clock generator active, frequency can change dynamically
- Systematic clock gating

MCU + parallel processor active mode, 5 to 50 mW (full-blown analysis):
- Embedded DC/DC, high current
- Voltage can change dynamically
- Two clock generators active, frequencies can change dynamically
- Systematic clock gating
[Bar charts: "Extended ISA - Cycle Count Speedup" and "Extended ISA - Energy Improvement", RISC-V baseline vs GAP8 without and with vector extensions, per benchmark kernel. Cycle-count speedups range from about 1.3x to 6.8x; energy improvements range from about 1.4x to 7.1x (average around 3.4x).]
Evaluation result: a shared instruction cache with broadcast capability.
A master core dispatches function Foo with its arguments to the cluster cores; cores waiting on a synchronization barrier are instantly clock gated.
Quasi Perfect Scaling
Average energy gain from the ISA extensions: 3.4x, amplified by parallelism to 7.4x.
Convolution: 80% of CNN workload
Running CIFAR10, same network, same precision:

  What                   Freq (MHz)  Exec time (ms)  Cycles      Power (mW)
  40nm dual-issue MCU    216         99.1            21,400,000  60
  GAP8 @ 1.0 V           15.4        99.1            1,500,000   3.7
  GAP8 @ 1.2 V           175         8.7             1,500,000   70
  GAP8 @ 1.0 V w/ HWCE   4.7         99.1            460,000     0.8

At equal execution time, GAP8 consumes 16x less power than the dual-issue MCU; at a comparable power budget, it runs 11x faster (8.7 ms vs 99.1 ms).
Memory hierarchy: the 8 cores share L1; the cluster DMA moves data between L2 and shared L1, and the uDMA moves data between external L3 (RAM/Flash) and L2.
Cache performance is hard to predict, mostly due to hit ratio; explicitly managed transfers are predictable, so we have a way to optimize memory allocation and bandwidth, and we benefit if we can automate the data transfers.
Automatic data tiling and pipelined memory transfers (L3 to L2, L2 to L1) interleaved with parallel calls to the compute kernel: this is solved by our "Autotiler" tool.
Basic Kernels: how to handle a parametric tile; usually seen as libraries.
User Kernels: passing actual data to basic kernels and having data circulate between them. Arguments have dimensions, location (L2, external) and properties; their order may differ from that of the iteration space. The result is a fully pipelined implementation interleaving processing and data transfers (prologue, body, epilogue, …). User Kernels can be grouped.
Graph: connected User Kernels, constants, in and out features; e.g. a CNN plus pre/post processing.
Flow: Basic Kernels, User Kernels, and Groups of User Kernels are expressed through Generators and a Graph, i.e. C programs making calls to the Autotiler's Model API, plus C libraries. The Autotiler library (constraint solver, C code generator) is compiled and run on a PC; it emits C code for the target that handles the data transfers and dispatches Basic Kernels onto the cluster's cores. The working set is tiled in a way that maximizes reuse at minimum distance from the data path.
#include "AutoTilerLib.h"
#include "CNN_Generator.h"

void Mnist() {
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5,  1, 32, 28, 28, 1);
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
    CNN_TiledLinearLayer("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
}
HW Convolution Engine: several reduction stages enable, in one cycle, convolution operations on 8b and 16b pixels; pixels can be reduced to 8b in order to cut bandwidth and power. Result: 3x performance speedup and 4x energy gain versus a pure-SW 8-core implementation. The energy gain decreases for non-unit stride or dilation.
Processing of 1 second of voice data at 1.0 V:
- CNN (cluster), SW version: 155 ms at 11.8 mW = 1.8 mW average
- CNN (cluster), HWCE version: 58 ms at 8.8 mW = 509 µW average
- MFCC (FC): 170 ms at 3.3 mW = 560 µW average
- Total: 1.07 mW with HWCE, 2.36 mW in SW

Google KWS CNN: Conv 8x20, MaxPool 2x2/2, 1 InFeat, 32 OutFeat, W:95, H:40; Conv 4x10, ReLU, InFeat 32, OutFeat 32; Linear: 10 outs.
Trainable parameters: 421,263; 33 ms per image.
Starting from a low-power MCU, performance can be boosted by more than an order of magnitude while staying within a power budget that allows inference at the very edge, on battery, for years.