SLIDE 1

Deep-Learning Oriented Smart Sensing for the Next Generation of Embedded Applications

Manuele Rusci, Francesco Conti, Alessandro Capotondi, Luca Benini

Energy-Efficient Embedded Systems Laboratory

Dipartimento di Ingegneria dell’Energia Elettrica e dell’Informazione “Guglielmo Marconi”

IWES18 – Siena, 14 September 2018

SLIDE 2

From data collectors…

  • M. Rusci, F. Conti, A. Capotondi, L. Benini

[Figure: average power budget of a wireless sensor node, split across the Sensing Unit (sensing element, analog chain, A/D converter), the MCU, the external memory, and the TX/RX unit]

[Alioto, Massimo. "IoT: Bird's Eye View, Megatrends and Perspectives." Enabling the Internet of Things. Springer International Publishing, 2017. 1-45.]

SLIDES 3–6

...to always-ON smart sensors

Challenge: bringing intelligence into the node at mW cost

[Figure: Smart Sensing System, pairing a Sensing Unit (sensing element, analog chain, A/D converter) with a Processing Unit (core region, memory subsystem, peripheral subsystem), plus external memory and TX/RX unit]

  • 1. low-power "feature" / event extraction on the sensor
  • 2. event-based near-sensor processing
  • 3. "slim" and infrequent transmission of high-level features
SLIDE 7

Ultra-Low Power Imaging (GrainCam)

Focal Plane Processing: moving an early computation stage into the sensor die to reduce the power cost of the imaging task.

An imager performing spatial filtering and binarization on the sensor die through mixed-signal sensing!

  • 'Moving' pixel window
  • Spatial-contrast (gradient) extraction
  • Adaptive exposure
  • Per-pixel circuit for filtering and binarization

[Figure: contrast block schematic spanning pixels PO, PN, PE (outputs QO, QN, QE), with comparators comp1/comp2 against thresholds Vth and VEDGE and voltages Vres, VQ; example outputs of a traditional camera vs. GrainCam with motion detection]
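As a software analogy of what the GrainCam pixel array computes in mixed signal, the spatial-contrast extraction plus binarization can be sketched as below. This is a minimal model assuming a simple absolute-gradient threshold against East/North neighbours; the real sensor does this with per-pixel analog circuitry, and the function name and threshold value are illustrative.

```python
import numpy as np

def graincam_frame(img, threshold=16):
    """Software model of spatial-contrast extraction + binarization.

    Each pixel PO is compared against its East (PE) and North (PN)
    neighbours; the output bit is asserted when either local gradient
    exceeds the threshold. Illustrative only: on the sensor the
    comparison is analog and the threshold is a voltage.
    """
    img = img.astype(np.int32)
    grad_e = np.abs(img[:, :-1] - img[:, 1:])   # |PO - PE|
    grad_n = np.abs(img[:-1, :] - img[1:, :])   # |PO - PN|
    out = np.zeros(img.shape, dtype=bool)
    out[:, :-1] |= grad_e > threshold
    out[:-1, :] |= grad_n > threshold
    return out
```

On a flat scene the output is all zeros, so almost nothing needs to be read out; only contrast edges assert pixels.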

SLIDE 8

Event-Based Paradigm

Event-based sensing: the output frame data bandwidth depends on the activity of the external context.

Ultra-low power consumption: <100 µW, less than 1/10 the power of state-of-the-art imagers.

Readout modes:

  • IDLE: read out the counter of asserted pixels
  • ACTIVE: send the addresses {x0,y0} {x1,y1} {x2,y2} {x3,y3} … {xn-1,yn-1} of asserted pixels (Address-Event Representation, AER)

Event-based data processing: the node idles (~100 µW, event-based) in the absence of significant information from the sensor; data transfer and processing (~10 mW, frame-based cost) start only when the sensor detects relevant information.

  • M. Rusci et al., "A sub-mW IoT-endnode for always-on visual monitoring and smart triggering," IEEE Internet of Things Journal, 2017
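The two readout modes above can be sketched in a few lines. The frame representation and function interface here are hypothetical, purely to illustrate why IDLE readout costs one counter word while ACTIVE readout scales with the number of asserted pixels.

```python
def readout(frame, mode="IDLE"):
    """Sketch of the two event-based readout modes.

    `frame` is a 2-D iterable of booleans (asserted pixels).
    IDLE returns only the count of asserted pixels (a single word);
    ACTIVE returns their (x, y) addresses, i.e. an Address-Event
    Representation stream. Illustrative interface, not the real sensor.
    """
    events = [(x, y) for y, row in enumerate(frame)
                     for x, v in enumerate(row) if v]
    if mode == "IDLE":
        return len(events)   # one counter word, regardless of activity
    return events            # AER: bandwidth proportional to activity
```

A host MCU can poll in IDLE mode and switch to ACTIVE only when the counter crosses a trigger threshold, which is what keeps the idle power budget so low.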

SLIDE 9

Deep Learning at the Edge

Convolutional Neural Networks are state-of-the-art for visual recognition, detection, and classification tasks.

Pipeline: multi-dimensional imager data → inference engine → output class label (e.g. "bike").

Issues:

  • Large memory footprint to store the weights (the 'program') and intermediate results (up to hundreds of MBs), greater than the memory available on ultra-low-power engines (100's of kBs)
  • High-complexity CNN implementation, demanding floating-point precision
  • Imager power costs of tens to hundreds of mWs

How to exploit CNNs on always-on devices with a power envelope of a few mWs, or sub-mW?

SLIDE 10

Deep Learning at the Edge

"Extreme" example: ResNet-34

  • classifies 224x224 images into 1000 classes
  • trained to roughly human-level performance
  • ~21M parameters
  • ~3.6 GMAC per inference

Performance for 1 fps: ~3.6 GMAC/s. Energy efficiency for 1 fps @ 20 mW: ~180 GMAC/s/W, i.e. ~5 pJ/MAC.

Accuracy loss from quantization (VGG-16 @ CIFAR-10):

  • full precision / 8-bit: baseline (no loss)
  • 6-bit: 1.3%
  • 4-bit: 3.3%

Quantization + specialized HW: parallelism and HW acceleration are key paradigms to achieve low energy.
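The sizing numbers above follow from simple arithmetic; a short sketch using the slide's figures (the float32 weight-memory line assumes 4 bytes per parameter):

```python
# Arithmetic behind the ResNet-34 sizing on the slide.
macs_per_inference = 3.6e9   # ~3.6 GMAC per forward pass
params = 21e6                # ~21M parameters
fps = 1                      # target frame rate
power_w = 20e-3              # 20 mW power envelope

throughput = macs_per_inference * fps   # required MAC/s
efficiency = throughput / power_w       # MAC/s per watt
energy_per_mac = power_w / throughput   # joules per MAC

print(f"{throughput / 1e9:.1f} GMAC/s")        # 3.6 GMAC/s
print(f"{efficiency / 1e9:.0f} GMAC/s/W")      # 180 GMAC/s/W
print(f"{energy_per_mac * 1e12:.1f} pJ/MAC")   # 5.6 pJ/MAC
print(f"{params * 4 / 1e6:.0f} MB float32 weights")  # 84 MB
```

The ~5.6 pJ/MAC target is far below what software on a general-purpose MCU achieves, which is why the deck argues for quantization plus specialized hardware.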

SLIDE 11

Quantization: no free lunch

Running INT-Q convolutions on an ARM Cortex-M7 core:

  • Lower bandwidth from L2-SRAM impacts low-bitwidth precision
  • Overhead for casting INT-4/INT-2 to INT-16 for the 2x16-bit vectorized MAC instructions
  • Lower power consumption when fitting into L1, thanks to compression
  • The INT-1 kernel exploits bitwise operations and does not pay the casting overhead, because XNOR convolutions are supported by the ISA
  • → huge opportunity for HW/SW codesign

Open Source: https://github.com/EEESlab/CMSIS_NN-INTQ
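The INT-1 trick mentioned above can be illustrated compactly: with weights and activations encoded as ±1 and packed into bit words, a dot product reduces to XNOR plus popcount. This is a Python sketch of the identity, not the CMSIS_NN-INTQ kernels themselves.

```python
def xnor_dot(a_bits, w_bits, n):
    """Binary (INT-1) dot product via XNOR + popcount.

    Encoding each value as a bit (0 -> -1, 1 -> +1), the sum of n
    elementwise products equals 2 * popcount(XNOR(a, w)) - n:
    matching bits contribute +1, differing bits -1. This replaces
    every multiply with a bitwise operation.
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # bits where a and w match
    return 2 * bin(agree).count("1") - n
```

Because one machine word processes 32 binary products per instruction, the INT-1 kernel avoids both the multiplies and the unpack/cast overhead that the INT-4/INT-2 kernels pay.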

SLIDE 12

Quantization + Acceleration = ❤

More efficient than any ULP MCU…

[Chart: energy efficiency of quantized CNN engines vs. ultra-low-power MCUs; bubble size = pJ/op (smaller is better)]

  • F. Conti et al., https://arxiv.org/abs/1612.05974
SLIDE 13

Quantization + Acceleration = ❤

… and even more if compared to a commercial high-performance MCU.

[Chart: the same comparison extended with a commercial high-performance MCU at 865 pJ/op, against design points at 143, 50, 23, 11, and 6 pJ/op; bubble size = pJ/op (smaller is better)]

  • F. Conti et al., https://arxiv.org/abs/1612.05974
SLIDE 14

Flying a Drone with DL (in <10 mW)

DroNet: a ResNet-based CNN to drive a drone through the environment.

Original implementation: 20 fps on an external CPU, requiring a big drone (e.g. DJI, Parrot).

DroNet on GAP8/PULP:

  • Fixed-point 16-bit (Q3.13)
  • Removed batch normalization
  • 2x2 max-pooling layer
  • Striding support in HW
  • Support for the HWCE
  • Comparable accuracy w.r.t. the baseline

Frame rate:

  • GAP8, 8 cores @ 200 MHz: 32 fps
  • GAP8 with HWCE @ 200 MHz: 51 fps

  • F. Conti, M. Rusci, A. Capotondi, D. Rossi, L. Benini

Example nano-drone from D. Palossi et al., https://arxiv.org/abs/1805.01831
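The Q3.13 arithmetic used by the port can be sketched in a few lines: 16-bit values with 3 integer bits (including sign) and 13 fractional bits. This is an illustrative model of the format, not the GAP8 code; function names are made up for the example.

```python
FRAC_BITS = 13  # Q3.13: 3 integer bits (incl. sign) + 13 fractional bits

def to_q313(x):
    """Quantize a float to a Q3.13 value, saturating to int16 range."""
    v = int(round(x * (1 << FRAC_BITS)))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def q313_mul(a, b):
    """Multiply two Q3.13 values: wide product, shift back by 13."""
    return (a * b) >> FRAC_BITS

def to_float(q):
    """Convert a Q3.13 value back to float for inspection."""
    return q / (1 << FRAC_BITS)
```

The representable range is roughly [-4, 4) with ~1.2e-4 resolution, which is why the port also removes batch normalization: keeping intermediate values inside this narrow range matters.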

SLIDE 15

Thanks for your attention.

Questions?


Special acks to: Davide Rossi (UNIBO), Daniele Palossi (ETHZ), Eric Flamand (GreenWaves Technologies), and all the PULP team. GitHub: https://github.com/pulp-platform · Twitter: @pulp_platform