networks on microcontrollers Manuele Rusci*, Alessandro Capotondi, - PowerPoint PPT Presentation

Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers Manuele Rusci*, Alessandro Capotondi, Luca Benini *manuele.rusci@unibo.it Energy-Efficient Embedded Systems Laboratory Dipartimento di Ingegneria dell’Energia Elettrica e dell’Informazione “Guglielmo Marconi” – DEI – Università di Bologna

Microcontrollers for smart sensors 2 M.Rusci - MLSys2020 Austin

Microcontrollers for smart sensors ❑ Low-power (<10-100mW) & low- cost ❑ Smart device are battery- operated ❑ Highly-flexible (SW programmable) ❑ But limited resources(!) ❑ few MB of memories ❑ single RISC core up to few 100s MHZ (STM32H7: 400MHz) with DSP SIMD instructions and optional FPU ❑ Currently, tiny visual DL tasks on MCUs (visual wake words, CIFAR10) Source: STM32H7 datasheet Challenge : Run ‘complex’ and ‘big’ ( Imagenet-size) DL inference on MCU ? 3 M.Rusci - MLSys2020 Austin

Deep Learning for microcontrollers “Efficient” topologies: Accuracy vs MAC vs Memory But quantization is also essential… Source: https://towardsdatascience.com/neural- network-architectures-156e5bad51ba a 0 a 1 a 2 a 3 Reducing bit Accuracy Compute Memory FP32 : 4 instr + 32 bytes precision w 0 w 1 w 2 w 3 INT16 : 2 instr + 16 bytes INT8 : 1 instr + 8 bytes + (if ISA MAC SIMD available) Issue1 : Integer-only model needed for deployment on low-power MCUs Issue2 : 8-16 bit are not sufficient to bring ‘complex’ models on MCUs (memory!!) 4 M.Rusci - MLSys2020 Austin

Memory-Driven Mixed-Precision Quantization z Using less than Best Top1: 70.1% 8 bits… Best Mixed: 68% Best Top4 Fit 60.5% Best Top1 Fit: 48% y x still margin apply minimum tensor- wise quantization ≤8bit to fit the memory constraints with very-low accuracy drop ➢ Challenges : – How to define the quantization policy – Combine quantization flow this with integer only transformation 5 M.Rusci - MLSys2020 Austin

End-to-end Flow & Contributions Goal : Define a design flow to bring Imagenet-size models into an MCU device while paying a low accuracy drop. DNN Development Flow for microcontrollers Full- Fake- Deployment Microcontroller precision quantized Integer-only deployment C model model Device- model code f(x) Model g(x) g’(x) aware Graph Code Selection Fine- Optim Generator & Training Tuning Memory Constraints Device-aware Fine-Tuning Deployment on MCU We define a rule-based methodology to A latency-accuracy tradeoff on iso-memory determine the mixed-precision quantization mixed-precision networks belonging to the policy driven by a memory objective function. Imagenet MobilenetV1 family when running on a STM32H7 MCU. Graph Optimization We introduce the Integer Channel-Normalization (ICN) activation layer to generate an integer-only deployment graph when applying uniform sub-byte quantization . 6 M.Rusci - MLSys2020 Austin

DNN Development Flow for microcontrollers Full- Fake- Deployment Microcontroller precision quantized Integer-only deployment C model model Device- model code f(x) Model g(x) g’(x) aware Graph Code Selection Generator Fine- Optim & Training Tuning Memory Constraints Graph Optimization INTEGER-ONLY W/ SUB-BYTE QUANTIZATION 7 M.Rusci - MLSys2020 Austin

State of the Art ❑ Inference with Integer-only arithmetic (Jacob, 2018) ❑ Affine transformation between real value and (uniform) quantized parameters ❑ Quantization-aware retraining ❑ Folding of batch norm into conv weights + rounding of per-layer scaling parameters quantized tensor (INT-Q) real value tensor or sub- 𝑢 = 𝑇 𝑢 × (𝑈 𝑟 − 𝑎 𝑢 ) tensor ☺ Almost lossless with 8 bit on Image classification and detection problems. Used by TF Lite. Integer-Only MobilenetV1_224_1.0 Quantization Top1 Weights  4 bit MobilnetV1: Training collapse when folding Method (MB) batch norm into convolution weights Full-Precision 70.9 16.8  Does not support Per-Channel (PC) weight w8a8 70.1 4.06 quantization w4a4 0.1 2.05 (Jacob, 2018) Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." CVPR 2018 8 M.Rusci - MLSys2020 Austin

Integer-Channel Normalization (ICN) 𝜚 = ∑𝑥 ⋅ 𝑦 Fake- 𝜚 − 𝜈 Quantized 𝑍 𝑟 = 𝑟𝑣𝑏𝑜𝑢 𝑏𝑑𝑢 ⋅ 𝛿 + 𝛾 X q 𝜏 Sub-Graph 𝜈, 𝜏, 𝛿, 𝛾 are channel-wise batchnorm parameters Conv2D Φ 𝑢 = 𝑇 𝑢 × (𝑈 𝑟 − 𝑎 𝑢 ) Replacing BatchNorm Φ = ∑(𝑋 𝑟 − 𝑎 𝑥 ) ⋅ (𝑌 𝑟 − 𝑎 𝑦 ) 𝑇 𝑥 is scalar if PL, else array 𝑇 𝑗 , 𝑇 𝑝 are scalar Activation 𝑇 𝑗 𝑇 𝑥 𝛿 1 𝐶 − 𝜈 + 𝛾 𝜏 QuantAct 𝑍 𝑟 = 𝑎 𝑧 + 𝑟𝑣𝑏𝑜𝑢 𝑏𝑑𝑢 𝜏 ( Φ + ) 𝑇 𝑝 𝑇 𝑗 𝑇 𝑥 𝛿 Y q 𝑁 0 2 𝑂 0 ( Φ + 𝐶 𝑟 ) 𝑁 0 , 𝑂 0 , 𝐶 𝑟 are channel- wise integer params Integer-Only MobilenetV1_224_1.0 Integer Channel-Normalization (ICN) Quantization Top1 Weights activation function Method (MB) ➢ holds either for PL or PC quantization Full-Precision 70.9 16.8 of weights PL+ICN w4a4 61.75 2.10 9 PC+ICN w4a4 66.41 2.12 M.Rusci - MLSys2020 Austin

DNN Development Flow for microcontrollers Full- Fake- Deployment Microcontroller precision quantized Integer-only deployment C model model Device- model code f(x) Model g(x) g’(x) aware Graph Code Selection Generator Fine- Optim & Training Tuning Memory Constraints Device-aware Fine-Tuning MIXED-PRECISION QUANTIZATION POLICY 10 M.Rusci - MLSys2020 Austin

Deployment of an integer-only graph Problem weight 0 conv 0 Can this graph fit the memory constraints of our Input Data MCU device? Weight Parameters weight 1 conv 1 Output Data conv 2 weight 2 M ROM M RAM conv 3 weight 3 add 0 conv 4 weight 4 11 M.Rusci - MLSys2020 Austin

Deployment of an integer-only graph Problem weight 0 conv 0 Can this graph fit the memory constraints of our MCU device? weight 1 conv 1 Weight Parameters conv 2 weight 2 Read-only memory M ROM for static parameters Read-write conv 3 weight 3 Input Data memory M RAM for dynamic Output Data values add 0 conv 4 weight 4 12 M.Rusci - MLSys2020 Austin

Deployment of an integer-only graph [M1] weight 0 conv 0 𝑀−1 𝑗 ෍ 𝑛𝑓𝑛 𝑋 𝑗 , 𝑅 𝑥 + 𝑛𝑓𝑛 𝑁 0 , 𝑂0, 𝐶 𝑟 < 𝑁 𝑆𝑃𝑁 𝑗=0 [M1] weights [M2] must fit 𝑗 𝑗 m𝑓𝑛 𝑌 𝑗 , 𝑅 𝑦 + 𝑛𝑓𝑛 𝑍 𝑗 , 𝑅 𝑧 < 𝑁 𝑆𝐵𝑁 , ∀𝑗 M ROM weight 1 conv 1 [M2] I/O of a conv 2 weight 2 node must fit M RAM conv 3 weight 3 Problem Formulation 𝑗 , 𝑅 𝑧 𝑗 , 𝑅 𝑥 𝑗 Find the quantization policy 𝑅 𝑦 to satisfy [M1] and [M2] add 0 𝑗 , 𝑅 𝑧 𝑗 , 𝑅 𝑥 𝑗 ∈ 2,4,8 bits 𝑅 𝑦 conv 4 weight 4 13 M.Rusci - MLSys2020 Austin

Rule-Based Mixed-Precision [M1] : size(w0) + size(w1) + 𝑗 Set 𝑅 𝑥 = 8 size (w2) + size(w3) < 𝑁 𝑆𝑃𝑁 Goal 𝜀 = 5% Maximize memory [M1] yes satisfied utilization 0 = 8 ? 𝑅 𝑥 w0 conv 0 13% no 1 = 8 𝑅 𝑥 Compute mem occupation 𝑗 ) r i = 𝑛𝑓𝑛(𝑥 𝑗 , 𝑅 𝑥 15% conv 1 w1 𝑢𝑝𝑢 𝑁𝐹𝑁 𝑆 = max 𝑠 𝑗 2 = 8 𝑅 𝑥 w2 conv 2 22% 𝑗 of the lower layer Cut 𝑅 𝑥 3 = 8 with a mem occupation 𝑅 𝑥 𝑠 𝑗 > 𝑆 − 𝜀 conv 3 w3 50% Weights Quantization Policy 14 M.Rusci - MLSys2020 Austin

Rule-Based Mixed-Precision [M1] : size(w0) + size(w1) + 𝑗 Set 𝑅 𝑥 = 8 size (w2) + size(w3) < 𝑁 𝑆𝑃𝑁 Goal 𝜀 = 5% Maximize memory [M1] yes satisfied utilization 0 = 8 ? 𝑅 𝑥 w0 conv 0 13% no 17% 1 = 8 𝑅 𝑥 Compute mem occupation 𝑗 ) r i = 𝑛𝑓𝑛(𝑥 𝑗 , 𝑅 𝑥 15% 20% conv 1 w1 𝑢𝑝𝑢 𝑁𝐹𝑁 Any cut reduces 𝑆 = max 𝑠 𝑗 the bit precision by 2 = 8 𝑅 𝑥 one step: 8→4, 4→2 w2 conv 2 22% 30% 𝑗 of the lower layer Cut 𝑅 𝑥 3 = 4 with a mem occupation 𝑅 𝑥 𝑠 𝑗 > 𝑆 − 𝜀 conv 3 w3 50% 33% Weights Cut layer 3! Quantization Policy 15 M.Rusci - MLSys2020 Austin

Rule-Based Mixed-Precision [M1] : size(w0) + size(w1) + 𝑗 Set 𝑅 𝑥 = 8 size (w2) + size(w3) < 𝑁 𝑆𝑃𝑁 Goal 𝜀 = 5% Maximize memory [M1] yes satisfied utilization 0 = 8 ? 𝑅 𝑥 w0 conv 0 17% no 1 = 8 𝑅 𝑥 Compute mem occupation 𝑗 ) r i = 𝑛𝑓𝑛(𝑥 𝑗 , 𝑅 𝑥 20% conv 1 w1 𝑢𝑝𝑢 𝑁𝐹𝑁 Any cut reduces 𝑆 = max 𝑠 𝑗 the bit precision by 2 = 4 𝑅 𝑥 one step: 8→4, 4→2 w2 conv 2 30% 𝑗 of the lower layer Cut 𝑅 𝑥 3 = 4 with a mem occupation 𝑅 𝑥 𝑠 𝑗 > 𝑆 − 𝜀 conv 3 w3 33% Weights Cut layer 2! Quantization Policy 16 M.Rusci - MLSys2020 Austin

networks on microcontrollers Manuele Rusci*, Alessandro Capotondi, - PowerPoint PPT Presentation

Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers Manuele Rusci, Alessandro Capotondi, Luca Benini manuele.rusci@unibo.it Energy-Efficient Embedded Systems Laboratory Dipartimento di

AVR Microcontrollers- Introduction AVR Microcontrollers Widely-used microcontroller

to of Microcontrollers ECE Senior Design 9 February 2017 Popular Microcontrollers 8051

Building fault models for microcontrollers Albert Spruyt aspruyt@os3.nl University of Amsterdam

Microcontrollers for IOT Prototyping Part 2 V. Oree, EEE Dept, UoM 1 Introduction The

Using the Digital I/O interface of STMicroelectronics STM32 Microcontrollers Corrado Santoro

Performance Analysis of Contemporary Lightweight Block Ciphers on 8-bit Microcontrollers Sren

SPI Serial Port (in AVR Microcontrollers) Contents Serial communication with SPI Serial

MICROCONTROLLERS Nicolas Moro 1,3 , Amine Dehbaoui 2 , Karine Heydemann 3 , Bruno Robisson 1 ,

Optimizing C For Microcontrollers Khem Raj, Comcast Embedded Linux Conference & IOT summit -

Tock: A Secure Operating System for Microcontrollers Limitations of Microcontroller Sofware

Today Digital Signal Processors Digital signal processors Microcontrollers are optimized

Microcontroller Programming Beginning with Arduino Charlie Mooney Microcontrollers Tiny,

Interrupts in AVR Microcontrollers (Chapter 10 of the text book) Contents Interrupts

EFM32 ...the worlds most energy friendly microcontrollers How to use this presentation Click

SYSC3601 Microprocessor Systems Unit 14: Microcontrollers Topics/Reading 1. Microcontroller

Using the UART with STM32 Microcontrollers Corrado Santoro ARSLAB - Autonomous and Robotic

Running Deep Learning in less than 100KB on Microcontrollers Pete Warden Engineer, TensorFlow

microMedic 2013 National Contest Microcontrollers, rapid prototyping and open-source licensing

EE 109 Unit 4 Microcontrollers (Arduino) Overview Using software to perform logic on individual

Attacking Microcontrollers From a Software Perspective Don A. Bailey (donb@isecpartners.com) A

A free toolchain for 0.01 - computers The free toolchain for the Padauk 8-bit microcontrollers

Named Data Networking of Things: NDN for Microcontrollers (NDN-RIOT) Wentao Shang, Alex

PICOBIT: A Compact Scheme System for Microcontrollers Vincent St-Amour Universit e de Montr

ARM Microprocessor and ARM-Based Microcontrollers Nguatem William 24th May 2006 1 / 40 A

networks on microcontrollers Manuele Rusci*, Alessandro Capotondi, - PowerPoint PPT Presentation

Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers Manuele Rusci*, Alessandro Capotondi, Luca Benini *manuele.rusci@unibo.it Energy-Efficient Embedded Systems Laboratory Dipartimento di

AVR Microcontrollers- Introduction AVR Microcontrollers Widely-used microcontroller

to of Microcontrollers ECE Senior Design 9 February 2017 Popular Microcontrollers 8051

Building fault models for microcontrollers Albert Spruyt aspruyt@os3.nl University of Amsterdam

Microcontrollers for IOT Prototyping Part 2 V. Oree, EEE Dept, UoM 1 Introduction The

Using the Digital I/O interface of STMicroelectronics STM32 Microcontrollers Corrado Santoro

Performance Analysis of Contemporary Lightweight Block Ciphers on 8-bit Microcontrollers Sren

SPI Serial Port (in AVR Microcontrollers) Contents Serial communication with SPI Serial

MICROCONTROLLERS Nicolas Moro 1,3 , Amine Dehbaoui 2 , Karine Heydemann 3 , Bruno Robisson 1 ,

Optimizing C For Microcontrollers Khem Raj, Comcast Embedded Linux Conference &amp; IOT summit -

Tock: A Secure Operating System for Microcontrollers Limitations of Microcontroller Sofware

Today Digital Signal Processors Digital signal processors Microcontrollers are optimized

Microcontroller Programming Beginning with Arduino Charlie Mooney Microcontrollers Tiny,

Interrupts in AVR Microcontrollers (Chapter 10 of the text book) Contents Interrupts

EFM32 ...the worlds most energy friendly microcontrollers How to use this presentation Click

SYSC3601 Microprocessor Systems Unit 14: Microcontrollers Topics/Reading 1. Microcontroller

Using the UART with STM32 Microcontrollers Corrado Santoro ARSLAB - Autonomous and Robotic

Running Deep Learning in less than 100KB on Microcontrollers Pete Warden Engineer, TensorFlow

microMedic 2013 National Contest Microcontrollers, rapid prototyping and open-source licensing

EE 109 Unit 4 Microcontrollers (Arduino) Overview Using software to perform logic on individual

Attacking Microcontrollers From a Software Perspective Don A. Bailey (donb@isecpartners.com) A

A free toolchain for 0.01 - computers The free toolchain for the Padauk 8-bit microcontrollers

Named Data Networking of Things: NDN for Microcontrollers (NDN-RIOT) Wentao Shang, Alex

PICOBIT: A Compact Scheme System for Microcontrollers Vincent St-Amour Universit e de Montr

ARM Microprocessor and ARM-Based Microcontrollers Nguatem William 24th May 2006 1 / 40 A

Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers Manuele Rusci, Alessandro Capotondi, Luca Benini manuele.rusci@unibo.it Energy-Efficient Embedded Systems Laboratory Dipartimento di

Optimizing C For Microcontrollers Khem Raj, Comcast Embedded Linux Conference & IOT summit -