Deep-Learning Oriented Smart Sensing for the Next Generation of Embedded Applications
1. Deep-Learning Oriented Smart Sensing for the Next Generation of Embedded Applications
Manuele Rusci, Francesco Conti, Alessandro Capotondi, Luca Benini
Energy-Efficient Embedded Systems Laboratory, Dipartimento di Ingegneria dell’Energia Elettrica e dell’Informazione “Guglielmo Marconi”
IWES18 – Siena, 14 September 2018

2. From data collectors…
[Diagram: wireless sensor node — sensing unit (sensing element, analog chain, A/D conversion), MCU, TX/RX unit, external memory, wireless power — annotated with the node's average power budget]
[Alioto, Massimo. "IoT: Bird’s Eye View, Megatrends and Perspectives." Enabling the Internet of Things. Springer International Publishing, 2017. 1-45.]
M. Rusci, F. Conti, A. Capotondi, L. Benini

3–6. ...to always-ON smart sensors (progressive build)
Challenge: bringing intelligence in-the-node at mW cost
[Diagram: smart sensing system — sensing unit (sensing element, analog chain, A/D conversion), processing unit (core region, peripheral subsystem, memory subsystem), TX/RX unit, power system, external memory]
1. low-power “feature” / event extraction on the sensor
2. event-based near-sensor processing
3. “slim” and infrequent transmission of high-level features

7. Ultra-Low Power Imaging (GrainCam)
Focal-plane processing: moving an early computation stage into the sensor die to reduce the power cost of the imaging task.
• Per-pixel circuit for spatial filtering and gradient binarization
• The imager performs spatial-contrast filtering and binarization on the sensor die through mixed-signal sensing
• Adaptive exposure; ‘moving’ pixel window
[Diagram: per-pixel mixed-signal circuit — spatial-contrast block over neighbouring pixels PN/PE/PO, comparators against threshold V_th — and sample outputs: traditional camera vs. GrainCam with motion detection]
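The focal-plane operation above can be emulated in software. A minimal sketch of spatial-contrast binarization, where the neighbourhood choice (origin pixel against its "north" and "east" neighbours, echoing the PO/PN/PE labels) and the threshold value are illustrative assumptions — the real sensor does this per pixel in mixed-signal circuitry:

```python
def graincam_binarize(frame, v_th=16):
    """Software emulation of GrainCam-style focal-plane processing:
    compute each pixel's local spatial contrast against two of its
    neighbours and binarize it against a threshold v_th.
    Neighbourhood and threshold are illustrative assumptions."""
    rows, cols = len(frame), len(frame[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows - 1):            # skip border pixels with
        for x in range(cols - 1):        # no neighbour on that side
            po = frame[y][x]             # origin pixel (PO)
            pn = frame[y + 1][x]         # "north" neighbour (PN)
            pe = frame[y][x + 1]         # "east" neighbour (PE)
            contrast = max(abs(po - pn), abs(po - pe))
            out[y][x] = 1 if contrast > v_th else 0
    return out

# A frame with a vertical intensity edge asserts pixels along the edge,
# so only the sparse edge map (not the full frame) needs to leave the die.
frame = [[0, 0, 100, 100] for _ in range(4)]
edges = graincam_binarize(frame)
```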

8. Event-Based Paradigm
Event-based sensing: the output data bandwidth depends on the external context activity (<10x wrt state-of-the-art imagers); ultra-low power consumption, <100 µW.
Frame-based vs. event-based: instead of full frames, the sensor emits the coordinates of asserted pixels {x0,y0}, {x1,y1}, {x2,y2}, …, {xn-1,yn-1}.
Readout modes:
• IDLE: read out only the counter of asserted pixels — absence of significant information keeps the node at ~100 µW
• ACTIVE: send the addresses of the asserted pixels (Address-Event Representation, AER) — detection of relevant information by the sensor triggers data transfer and processing at ~10 mW
[M. Rusci et al., "A sub-mW IoT-endnode for always-on visual monitoring and smart triggering," IEEE Internet of Things Journal, 2017]
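The two readout modes can be sketched as a host-side decoder. This is a hypothetical illustration: the byte-pair packing of (x, y) addresses is an assumption, not the sensor's actual packet format:

```python
def decode_readout(mode, payload):
    """Decode a GrainCam-style event readout.
    IDLE mode:   payload carries only the count of asserted pixels,
                 so the node can stay in its ~100 uW state.
    ACTIVE mode: payload is an AER stream - a flat sequence of
                 (x, y) address bytes, one pair per asserted pixel.
    The byte-pair packing is an illustrative assumption."""
    if mode == "IDLE":
        return {"count": payload[0]}
    # ACTIVE: reconstruct the sparse event list from the addresses.
    events = [(payload[i], payload[i + 1])
              for i in range(0, len(payload), 2)]
    return {"count": len(events), "events": events}

# Low activity: only a counter crosses the interface.
idle = decode_readout("IDLE", [3])
# Relevant activity: three asserted pixels, sent as addresses.
active = decode_readout("ACTIVE", [5, 2, 5, 3, 6, 3])
```

The design point is that bandwidth (and therefore power) scales with scene activity rather than with frame size.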

9. Deep Learning at the Edge
Convolutional Neural Networks are state-of-the-art for visual recognition, detection and classification tasks.
[Diagram: imager → multi-dimensional data → inference engine → output class label ("bike")]
How can CNNs be exploited on always-on devices with a power envelope of a few mW, or sub-mW?
Issues:
• Large memory footprint to store weights (the ‘program’) and intermediate results (up to hundreds of MBs), far beyond the memory available on ultra-low-power engines (100s of kBs)
• High-complexity CNN implementations demanding floating-point precision
• Imager power costs of tens to hundreds of mW

10. Deep Learning at the Edge
“Extreme” example: ResNet-34
• classifies 224x224 images into 1000 classes
• trained to roughly human-level performance
• ~21M parameters
• ~3.6G MACs per inference
Performance for 1 fps: ~3.6 GMAC/s
Energy efficiency for 1 fps @ 20 mW: ~180 GMAC/s/W = ~5.6 pJ/MAC
Quantization, parallelism and specialized HW acceleration are key paradigms to achieve low energy.

Precision | Accuracy loss (VGG-16 @ CIFAR-10)
full precision / 8-bit | 0
6-bit | -1.3%
4-bit | -3.3%
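The energy budget above follows from simple arithmetic; a quick check of the slide's numbers:

```python
# Check the ResNet-34 energy budget quoted on the slide.
macs_per_inference = 3.6e9   # ~3.6 GMAC per 224x224 inference
fps = 1                      # target frame rate
power_w = 20e-3              # 20 mW always-on power envelope

throughput = macs_per_inference * fps            # required MAC/s
efficiency = throughput / power_w                # MAC/s per watt
energy_per_mac_pj = power_w / throughput * 1e12  # picojoules per MAC

# throughput  -> 3.6 GMAC/s
# efficiency  -> 180 GMAC/s/W
# energy/MAC  -> ~5.6 pJ
```

At ~5.6 pJ/MAC, the target sits orders of magnitude below what general-purpose MCUs deliver, which is what motivates the quantization and acceleration techniques on the following slides.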

11. Quantization: no free lunch
Running INT-Q convolutions on an ARM Cortex-M7 core → huge opportunity for HW/SW codesign:
• Lower power consumption when data fit into L1, thanks to the lower L2-SRAM bandwidth that compression enables
• Low-bitwidth precision pays an overhead for casting INT-4/INT-2 operands to INT-16 for the 2x16-bit vectorized MAC instructions
• The INT-1 kernel exploits bitwise operations and does not pay the casting overhead, because XNOR convolutions are supported by the ISA
Open source: https://github.com/EEESlab/CMSIS_NN-INTQ
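The INT-1 kernel's trick can be sketched in plain Python. Binary weights and activations in {-1, +1} are packed into machine words, and an n-element dot product collapses to `n - 2*popcount(a XOR b)` — no multiplications and no casting. The packing order here is an illustrative assumption; the Cortex-M kernel does the equivalent with word-wide bitwise instructions:

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer bit-mask
    (+1 -> bit set, -1 -> bit clear). Packing order is illustrative."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def xnor_dot(a_word, b_word, n):
    """Binary dot product of two n-element {-1,+1} vectors packed as
    bit-masks: matching bits contribute +1, differing bits -1, so
    dot = n - 2 * popcount(a XOR b)."""
    diff = (a_word ^ b_word) & ((1 << n) - 1)
    return n - 2 * bin(diff).count("1")

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, +1, -1, -1, +1, +1, -1]
ref = sum(x * y for x, y in zip(a, b))        # plain dot product
res = xnor_dot(pack_bits(a), pack_bits(b), len(a))
```

A convolution is just many such dot products, so an entire binary filter tap reduces to XOR + popcount per word of packed pixels.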

12. Quantization + Acceleration = ❤
More efficient than any ULP MCU…
[Plot: energy-efficiency comparison across platforms; bubble size = pJ/op (smaller is better)]
F. Conti et al., https://arxiv.org/abs/1612.05974

13. Quantization + Acceleration = ❤
…and even more so when compared to a commercial high-performance MCU.
[Plot: pJ/op comparison across platforms, ranging from 865 pJ/op down to 6 pJ/op; bubble size = pJ/op (smaller is better)]
F. Conti et al., https://arxiv.org/abs/1612.05974

14. Flying a Drone with DL (in <10 mW)
DroNet: a ResNet-based CNN to drive a drone in the environment
• Original implementation: 20 fps on an external CPU, requires a big drone (e.g. DJI, Parrot)
DroNet on GAP8/PULP:
• Fixed-point 16-bit (Q3.13)
• Removed batch normalization
• 2x2 max-pooling layers
• Striding support in HW
• Support for the HWCE
• Comparable accuracy w.r.t. the baseline
Frame rate: 32 fps on GAP8, 8 cores (200 MHz); 51 fps on GAP8 with HWCE (200 MHz)
Example nano-drone from D. Palossi et al., https://arxiv.org/abs/1805.01831
F. Conti, M. Rusci, A. Capotondi, D. Rossi, L. Benini
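The Q3.13 format mentioned above keeps 1 sign bit, 2 integer bits and 13 fractional bits in a 16-bit word. A minimal sketch of quantizing to and multiplying in this format — the rounding and saturation policy shown is an assumption, not necessarily what the GAP8 kernels do:

```python
FRAC_BITS = 13            # Q3.13: 1 sign + 2 integer + 13 fraction bits
SCALE = 1 << FRAC_BITS    # 8192: one unit in the last fractional place

def to_q3_13(x):
    """Quantize a float to Q3.13 with round-to-nearest and int16
    saturation (policy is an illustrative assumption)."""
    q = int(round(x * SCALE))
    return max(-32768, min(32767, q))

def q_mul(a, b):
    """Fixed-point multiply: the 32-bit product carries 26 fractional
    bits, so shift right by FRAC_BITS to return to Q3.13."""
    return (a * b) >> FRAC_BITS

a = to_q3_13(1.5)      # representable exactly: 1.5 * 8192
b = to_q3_13(-0.25)
prod = q_mul(a, b)     # Q3.13 encoding of -0.375
back = prod / SCALE    # convert back to float for inspection
```

With only 2 integer bits, values must stay within roughly (-4, 4), which is why removing batch normalization and keeping activations well-scaled matters for this format.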

15. Thanks for your attention. Questions?
Special acks to: Davide Rossi (UNIBO), Daniele Palossi (ETHZ), Eric Flamand (GreenWaves Technologies), and all the PULP team
https://github.com/pulp-platform — Twitter: @pulp_platform
