  1. Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers Manuele Rusci*, Alessandro Capotondi, Luca Benini *manuele.rusci@unibo.it Energy-Efficient Embedded Systems Laboratory Dipartimento di Ingegneria dell’Energia Elettrica e dell’Informazione “Guglielmo Marconi” – DEI – Università di Bologna

  2. Microcontrollers for smart sensors

  3. Microcontrollers for smart sensors
  ❑ Low-power (<10-100 mW) and low-cost
  ❑ Smart devices are battery-operated
  ❑ Highly flexible (SW programmable)
  ❑ But limited resources(!)
    ❑ a few MB of memory
    ❑ a single RISC core up to a few hundred MHz (STM32H7: 400 MHz), with DSP SIMD instructions and an optional FPU
  ❑ Currently, only tiny visual DL tasks run on MCUs (visual wake words, CIFAR10)
  Source: STM32H7 datasheet
  Challenge: can we run 'complex' and 'big' (ImageNet-size) DL inference on an MCU?

  4. Deep Learning for microcontrollers
  "Efficient" topologies trade accuracy vs MAC operations vs memory, but quantization is also essential.
  Source: https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
  [Figure: multiply-accumulate of activations a0..a3 with weights w0..w3 at decreasing precision]
  Reducing the bit precision trades accuracy for compute and memory:
  ❑ FP32: 4 instructions + 32 bytes
  ❑ INT16: 2 instructions + 16 bytes
  ❑ INT8: 1 instruction + 8 bytes (if SIMD MAC instructions are available in the ISA)
  Issue 1: an integer-only model is needed for deployment on low-power MCUs.
  Issue 2: 8-16 bits are not sufficient to bring 'complex' models onto MCUs (memory!!)
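To make the per-precision storage numbers concrete, here is a minimal Python sketch (not from the talk; the helper name tensor_bytes and the example layer are ours) that computes the packed size of a weight tensor at different bit widths:

```python
def tensor_bytes(num_params: int, bits: int) -> int:
    """Packed storage size in bytes of num_params values at `bits` precision."""
    return (num_params * bits + 7) // 8

# Example: the first 3x3x3x32 convolution of MobileNetV1 (864 weights).
for bits in (32, 16, 8, 4, 2):
    print(f"{bits:2d}-bit: {tensor_bytes(864, bits)} bytes")
```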

  5. Memory-Driven Mixed-Precision Quantization
  [3D plot over accuracy and memory: Best Top1: 70.1%, Best Mixed: 68%, Best Top4 Fit: 60.5%, Best Top1 Fit: 48%]
  Using less than 8 bits there is still margin: apply the minimum tensor-wise quantization (≤8 bit) needed to fit the memory constraints, with a very low accuracy drop.
  ➢ Challenges:
  – How to define the quantization policy
  – How to combine this quantization flow with the integer-only transformation

  6. End-to-end Flow & Contributions
  Goal: define a design flow to bring ImageNet-size models onto an MCU device while paying a low accuracy drop.
  DNN Development Flow for microcontrollers: full-precision model f(x) → Model Selection & Training → fake-quantized model g(x) → Device-aware Fine-Tuning (driven by the memory constraints) → Graph Optim → integer-only deployment model g'(x) → Code Generator → deployment C code → microcontroller
  Contributions:
  ❑ Device-aware Fine-Tuning: we define a rule-based methodology to determine the mixed-precision quantization policy, driven by a memory objective function.
  ❑ Graph Optimization: we introduce the Integer Channel-Normalization (ICN) activation layer to generate an integer-only deployment graph when applying uniform sub-byte quantization.
  ❑ Deployment on MCU: a latency-accuracy tradeoff on iso-memory mixed-precision networks belonging to the ImageNet MobileNetV1 family when running on an STM32H7 MCU.

  7. DNN Development Flow for microcontrollers
  [Flow diagram as in slide 6, with the Graph Optim step highlighted]
  Graph Optimization: INTEGER-ONLY WITH SUB-BYTE QUANTIZATION

  8. State of the Art
  ❑ Inference with integer-only arithmetic (Jacob, 2018)
  ❑ Affine transformation between the real-valued tensor (or sub-tensor) t and the quantized tensor T_q (INT-Q): t = S_t × (T_q − Z_t)
  ❑ Quantization-aware retraining
  ❑ Folding of batch norm into conv weights + rounding of per-layer scaling parameters
  ☺ Almost lossless with 8 bits on image classification and detection problems. Used by TF Lite.
  ✗ At 4 bits, MobileNetV1 training collapses when folding batch norm into the convolution weights.
  ✗ Does not support per-channel (PC) weight quantization.
  Integer-only MobilenetV1_224_1.0 (Jacob, 2018):
  Quantization Method | Top1 | Weights (MB)
  Full-Precision      | 70.9 | 16.8
  w8a8                | 70.1 | 4.06
  w4a4                |  0.1 | 2.05
  Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." CVPR 2018.
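For reference, a minimal Python sketch of the affine transformation above (asymmetric min-max calibration; the function names are ours, not the exact TF Lite procedure):

```python
import numpy as np

def quantize(t: np.ndarray, n_bits: int = 8):
    """Asymmetric min-max quantization: t ~= scale * (t_q - zero_point)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (t.max() - t.min()) / (qmax - qmin)
    zero_point = int(round(qmin - t.min() / scale))
    t_q = np.clip(np.round(t / scale) + zero_point, qmin, qmax).astype(np.int32)
    return t_q, scale, zero_point

def dequantize(t_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Affine mapping back to real values: t = S_t * (T_q - Z_t)."""
    return scale * (t_q.astype(np.float32) - zero_point)

t = np.random.randn(64).astype(np.float32)
t_q, S_t, Z_t = quantize(t, n_bits=8)
print("max reconstruction error:", np.abs(t - dequantize(t_q, S_t, Z_t)).max())
```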

  9. Integer Channel-Normalization (ICN)
  Fake-quantized sub-graph: X_q → Conv2D → BatchNorm → Activation → QuantAct → Y_q
  φ = Σ w · x
  Y_q = quant_act((φ − μ)/σ · γ + β), where μ, σ, γ, β are the channel-wise batch-norm parameters
  Replacing every tensor t with its affine quantization t = S_t × (T_q − Z_t):
  Φ = Σ (W_q − Z_w)(X_q − Z_x), where S_w is a scalar if PL (per-layer), else an array; S_i, S_o are scalars
  Y_q = Z_y + quant_act( (S_i S_w γ)/(S_o σ) · (Φ + (1/(S_i S_w)) · (B − μ + β σ/γ)) )
      ≈ Z_y + quant_act( M_0 · 2^{N_0} · (Φ + B_q) ), where M_0, N_0, B_q are channel-wise integer parameters
  The Integer Channel-Normalization (ICN) activation function holds either for PL or PC quantization of the weights.
  Integer-only MobilenetV1_224_1.0:
  Quantization Method | Top1  | Weights (MB)
  Full-Precision      | 70.9  | 16.8
  PL+ICN w4a4         | 61.75 | 2.10
  PC+ICN w4a4         | 66.41 | 2.12
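A minimal sketch of how the channel-wise ICN parameters could be computed offline from the batch-norm statistics and the quantization scales; the variable names, the 8-bit mantissa choice, and the conv bias B are our assumptions, not the authors' exact procedure:

```python
import numpy as np

def icn_params(mu, sigma, gamma, beta, B, S_i, S_w, S_o, mant_bits=8):
    """Fold batch norm and scales into integer ICN parameters (M0, N0, B_q)
    so that, at inference time, Y_q = Z_y + quant_act(M0 * 2.0**N0 * (Phi + B_q)),
    with Phi the integer accumulator of the convolution."""
    k = (S_i * S_w * gamma) / (S_o * sigma)            # real per-channel scale
    b = (B - mu + beta * sigma / gamma) / (S_i * S_w)  # real per-channel bias
    # Approximate k with an integer mantissa M0 and a power-of-two exponent N0.
    N0 = np.floor(np.log2(np.abs(k))).astype(np.int32) - (mant_bits - 1)
    M0 = np.round(k / 2.0 ** N0).astype(np.int32)
    B_q = np.round(b).astype(np.int32)
    return M0, N0, B_q
```

S_w may be a scalar (per-layer quantization) or a per-channel array (PC); NumPy broadcasting handles both cases in this sketch.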

  10. DNN Development Flow for microcontrollers
  [Flow diagram as in slide 6, with the Device-aware Fine-Tuning step highlighted]
  Device-aware Fine-Tuning: MIXED-PRECISION QUANTIZATION POLICY

  11. Deployment of an integer-only graph
  Problem: can this graph fit the memory constraints of our MCU device?
  [Example graph: conv 0..conv 4 with weights weight 0..weight 4 and a residual add 0; weight parameters map to M_ROM, input/output data map to M_RAM]

  12. Deployment of an integer-only graph
  Problem: can this graph fit the memory constraints of our MCU device?
  ❑ Weight parameters go to the read-only memory M_ROM for static parameters
  ❑ Input/output data go to the read-write memory M_RAM for dynamic values
  [Same example graph as slide 11]

  13. Deployment of an integer-only graph
  [M1] The weights must fit M_ROM: Σ_{i=0..L−1} mem(W^i, Q_w^i) + mem(M_0, N_0, B_q) < M_ROM
  [M2] The I/O of every node must fit M_RAM: mem(X^i, Q_x^i) + mem(Y^i, Q_y^i) < M_RAM, ∀i
  Problem formulation: find the quantization policy Q_w^i, Q_x^i, Q_y^i ∈ {2, 4, 8} bits that satisfies [M1] and [M2].
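A minimal Python sketch of checking [M1] and [M2] for a candidate policy (the layer descriptors and helper names are hypothetical):

```python
def mem(num_elems: int, bits: int) -> int:
    """Packed size in bytes of num_elems values at `bits` precision."""
    return (num_elems * bits + 7) // 8

def fits(layers, policy, icn_bytes, M_ROM, M_RAM):
    """layers[i] = (n_weights, n_in_act, n_out_act); policy[i] = (Qw, Qx, Qy)."""
    rom = sum(mem(n_w, qw) for (n_w, _, _), (qw, _, _) in zip(layers, policy))
    rom += icn_bytes  # channel-wise M_0, N_0, B_q parameters also live in ROM
    m1 = rom < M_ROM
    m2 = all(mem(n_in, qx) + mem(n_out, qy) < M_RAM
             for (_, n_in, n_out), (_, qx, qy) in zip(layers, policy))
    return m1 and m2
```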

  14. Rule-Based Mixed-Precision: Weights Quantization Policy
  Goal: maximize memory utilization under [M1]: size(w0) + size(w1) + size(w2) + size(w3) < M_ROM (with ε = 5%)
  1. Set Q_w^i = 8 for every layer
  2. Is [M1] satisfied? If yes, the policy is done.
  3. If no, compute the memory occupation of each layer, r_i = mem(w^i, Q_w^i), its share s_i = r_i / MEM_tot, and S = max s_i
  4. Cut Q_w^i of the layers whose share s_i > S − ε, then go back to step 2
  Example: with Q_w^0 = Q_w^1 = Q_w^2 = Q_w^3 = 8, the occupation shares are w0: 13%, w1: 15%, w2: 22%, w3: 50%

  15. Rule-Based Mixed-Precision: Weights Quantization Policy
  Any cut reduces the bit precision by one step: 8→4, 4→2.
  [M1] is not yet satisfied and w3 holds the largest share (50% > S − ε), so layer 3 is cut: Q_w^3 = 4.
  The occupation shares become w0: 17%, w1: 20%, w2: 30%, w3: 33%.

  16. Rule-Based Mixed-Precision: Weights Quantization Policy
  [M1] is still not satisfied: with shares w0: 17%, w1: 20%, w2: 30%, w3: 33% and S = 33%, layer 2 (30% > S − ε) is cut next: Q_w^2 = 4.
  A sketch of this iterative procedure is given below.
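A minimal Python sketch of the rule-based cut procedure for the weight policy, as we read it from slides 14-16 (the helper names, the stop condition at 2 bits, and the handling of ties near the maximum share are our assumptions):

```python
def weight_policy(n_weights, M_ROM, eps=0.05):
    """n_weights[i] = number of weights of layer i. Returns Q_w[i] in {8, 4, 2} bits."""
    Q = [8] * len(n_weights)                          # start from full 8-bit weights
    def size(i): return n_weights[i] * Q[i] // 8      # packed size of layer i in bytes
    while sum(size(i) for i in range(len(Q))) >= M_ROM:   # [M1] not satisfied yet
        reducible = [i for i in range(len(Q)) if Q[i] > 2]
        if not reducible:
            break                                     # every layer is already at 2 bits
        total = sum(size(i) for i in range(len(Q)))
        s = [size(i) / total for i in range(len(Q))]  # memory occupation shares
        S = max(s[i] for i in reducible)
        for i in reducible:
            if s[i] > S - eps:                        # cut the most memory-hungry layers
                Q[i] //= 2                            # one step: 8 -> 4 or 4 -> 2
    return Q
```

In the four-layer example above, the first pass cuts only layer 3 (50% share); later passes cut the layers still within ε of the current maximum share, until [M1] holds or every layer reaches 2 bits.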
