
SLIDE 1

Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers

Manuele Rusci*, Alessandro Capotondi, Luca Benini

*manuele.rusci@unibo.it
Energy-Efficient Embedded Systems Laboratory
Dipartimento di Ingegneria dell’Energia Elettrica e dell’Informazione “Guglielmo Marconi” – DEI – Università di Bologna

SLIDE 2

Microcontrollers for smart sensors


SLIDE 3

Microcontrollers for smart sensors

❑ Low-power (10-100 mW) & low-cost

❑ Smart devices are battery-operated

❑ Highly flexible (SW programmable)

❑ But limited resources(!)
  ❑ a few MB of memory
  ❑ a single RISC core up to a few hundred MHz (STM32H7: 400 MHz) with DSP SIMD instructions and an optional FPU

❑ Currently, tiny visual DL tasks on MCUs (visual wake words, CIFAR10)

Source: STM32H7 datasheet

Challenge: can we run ‘complex’ and ‘big’ (Imagenet-size) DL inference on an MCU?


SLIDE 4

Deep Learning for microcontrollers

“Efficient” topologies: Accuracy vs MAC vs Memory


Source: https://towardsdatascience.com/neural-network-architectures-156e5bad51ba

But quantization is also essential…

Reducing bit precision lowers both Memory and Compute (if SIMD MAC instructions are available in the ISA), at the cost of Accuracy.

Example: a 4-element dot product a0·w0 + a1·w1 + a2·w2 + a3·w3 costs
  FP32:  4 instructions + 32 bytes
  INT16: 2 instructions + 16 bytes
  INT8:  1 instruction  +  8 bytes

Issue 1: an integer-only model is needed for deployment on low-power MCUs.
Issue 2: 8-16 bits are not sufficient to bring ‘complex’ models on MCUs (memory!).
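To make the memory axis concrete, here is a quick back-of-the-envelope sketch in Python (not from the slides), using the 4.24 M parameter count of MobilenetV1_224_1.0 reported later in the deck:

```python
# Rough weight-memory footprint of MobilenetV1_224_1.0 (4.24 M parameters)
# at different bit precisions; compare with the ~2 MB FLASH of an STM32H7.
params = 4.24e6
for name, bits in [("FP32", 32), ("INT16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e6:.2f} MB")
# FP32: 16.96 MB, INT16: 8.48 MB, INT8: 4.24 MB, INT4: 2.12 MB
```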

SLIDE 5

Memory-Driven Mixed-Precision Quantization

Apply the minimum tensor-wise quantization (≤ 8 bit) needed to fit the memory constraints, with a very low accuracy drop.


[Chart annotations: Best Top1: 70.1%; Best Mixed: 68%; Best Top4 Fit: 60.5%; Best Top1 Fit: 48%]

➢ Challenges:

– How to define the quantization policy
– How to combine this quantization flow with the integer-only transformation

Using less than 8 bits… there is still margin.

SLIDE 6

End-to-end Flow & Contributions

DNN Development Flow for microcontrollers:
Model Selection & Training → full-precision model f(x) → Device-aware Fine-Tuning (driven by the Memory Constraints) → fake-quantized model g(x) → Graph Optim → integer-only model g’(x) → Code Generator → C code → Microcontroller deployment

Device-aware Fine-Tuning

We define a rule-based methodology to determine the mixed-precision quantization policy driven by a memory objective function.

Graph Optimization

We introduce the Integer Channel-Normalization (ICN) activation layer to generate an integer-only deployment graph when applying uniform sub-byte quantization.

Deployment on MCU

We report a latency-accuracy tradeoff of iso-memory mixed-precision networks belonging to the Imagenet MobilenetV1 family when running on an STM32H7 MCU.

Goal: Define a design flow to bring Imagenet-size models into an MCU device while paying a low accuracy drop.


SLIDE 7

INTEGER-ONLY W/ SUB-BYTE QUANTIZATION

Graph Optimization

[DNN Development Flow (same diagram as Slide 6), highlighting the Graph Optimization step.]

SLIDE 8

State of the Art

❑ Inference with Integer-only arithmetic (Jacob, 2018)

❑ Affine transformation between real values and (uniform) quantized parameters
❑ Quantization-aware retraining
❑ Folding of batch norm into conv weights + rounding of per-layer scaling parameters

t = S_t · (t_q − z_t)

where t is the real-valued tensor (or sub-tensor), t_q the quantized tensor (INT-Q), S_t the scale factor and z_t the zero-point.

(Jacob, 2018) Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." CVPR 2018

Integer-Only MobilenetV1_224_1.0
Quantization Method | Top1 | Weights (MB)
Full-Precision      | 70.9 | 16.8
w8a8                | 70.1 | 4.06
w4a4                | 0.1  | 2.05

☺ Almost lossless with 8 bits on image classification and detection problems. Used by TF Lite.
☹ 4-bit MobilenetV1: training collapses when folding batch norm into convolution weights.
☹ Does not support Per-Channel (PC) weight quantization.
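For illustration, a minimal NumPy sketch of the affine (fake) quantization above; the function names and the min/max range choice are mine, not the authors’ or TF Lite’s implementation:

```python
import numpy as np

def quantize(t, n_bits=8):
    """Asymmetric uniform quantization: t ~= S_t * (t_q - z_t)."""
    t_min, t_max = float(t.min()), float(t.max())
    n_levels = 2 ** n_bits - 1
    S_t = (t_max - t_min) / n_levels                # scale
    z_t = int(round(-t_min / S_t))                  # zero-point
    t_q = np.clip(np.round(t / S_t) + z_t, 0, n_levels).astype(np.int32)
    return t_q, S_t, z_t

def dequantize(t_q, S_t, z_t):
    return S_t * (t_q.astype(np.float32) - z_t)

# "Fake quantization" used during retraining: quantize then immediately de-quantize,
# so the network sees the quantization error while still training in float.
x = np.random.randn(4, 4).astype(np.float32)
x_q, S_x, z_x = quantize(x, n_bits=8)
x_fake = dequantize(x_q, S_x, z_x)                  # max error ~ S_x / 2 per element
```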

SLIDE 9

Integer-Channel Normalization (ICN)

Fake-quantized sub-graph: x_q → Conv2D → BatchNorm → Activation → QuantAct → y_q

In the fake-quantized graph, the convolution produces the real-valued accumulator φ = Σ w · x and the output is

    y_q = quantact( (φ − μ) / σ · γ + β )

where μ, σ, γ, β are channel-wise batch-norm parameters.

Replacing the affine mapping t = S_t · (t_q − z_t) and the integer accumulator

    Φ = Σ (w_q − z_w) · (x_q − z_x)

gives

    y_q = z_y + quantact( (S_x S_w γ) / (S_y σ) · ( Φ + (B − μ + β σ / γ) / (S_x S_w) ) )

which is rewritten in integer-only form as

    y_q = z_y + quantact( (Φ + B_q) · M_0 · 2^{N_0} )

M_0, N_0, B_q are channel-wise integer parameters (2^{N_0} is implemented as a shift); B is the convolution bias. S_w is a scalar for Per-Layer (PL) quantization and an array for Per-Channel (PC); S_x, S_y are scalars.

Integer Channel-Normalization (ICN) activation function ➢ holds either for PL or PC quantization of weights.

Integer-Only MobilenetV1_224_1.0
Quantization Method | Top1  | Weights (MB)
Full-Precision      | 70.9  | 16.8
PL+ICN w4a4         | 61.75 | 2.10
PC+ICN w4a4         | 66.41 | 2.12
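A hedged NumPy sketch of how per-channel ICN parameters could be derived and applied, following the formulas above. The fixed-point encoding (n_frac fractional bits, truncating right shift) and all variable names are assumptions of this sketch, not the CMix-NN implementation:

```python
import numpy as np

def icn_params(S_x, S_w, S_y, gamma, sigma, beta, mu, bias=0.0, n_frac=24):
    """Fold batch-norm and (de)quantization scales into per-channel integer
    parameters (M0, N0, Bq) such that
        y_q ~= z_y + quantact((Phi + Bq) * M0 * 2**N0).
    Assumes positive per-channel scales (gamma > 0)."""
    scale = (S_x * S_w * gamma) / (S_y * sigma)                 # real per-channel scale
    Bq = np.round((bias - mu + beta * sigma / gamma) / (S_x * S_w)).astype(np.int64)
    N0 = (np.floor(np.log2(scale)) - n_frac).astype(np.int64)   # here N0 < 0
    M0 = np.round(scale / 2.0 ** N0).astype(np.int64)           # ~n_frac-bit integer
    return M0, N0, Bq

def icn_requant(Phi, M0, N0, Bq, z_y, n_bits=4):
    """Integer-only channel normalization of the integer accumulator Phi.
    2**N0 becomes a (truncating) right shift; real kernels also add rounding."""
    y = ((Phi.astype(np.int64) + Bq) * M0) >> (-N0)
    return np.clip(y + z_y, 0, 2 ** n_bits - 1)

# sanity check against the floating-point formula on random per-channel params
C = 8
rng = np.random.default_rng(0)
gamma, beta = rng.uniform(0.5, 1.5, C), rng.uniform(-1, 1, C)
mu, sigma = rng.uniform(-1, 1, C), rng.uniform(0.5, 1.5, C)
S_x, S_w, S_y = 0.02, rng.uniform(0.001, 0.01, C), 0.1
Phi = rng.integers(-2000, 2000, C)
M0, N0, Bq = icn_params(S_x, S_w, S_y, gamma, sigma, beta, mu)
y_int = icn_requant(Phi, M0, N0, Bq, z_y=0, n_bits=8)
phi_real = S_x * S_w * Phi                            # de-quantized accumulator
y_real = np.clip(np.round(((phi_real - mu) / sigma * gamma + beta) / S_y), 0, 255)
# y_int and y_real typically match to within one quantization step
```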

SLIDE 10

MIXED-PRECISION QUANTIZATION POLICY

Device-aware Fine-Tuning

[DNN Development Flow (same diagram as Slide 6), highlighting the Device-aware Fine-Tuning step.]

SLIDE 11

Deployment of an integer-only graph

[Example integer-only graph: conv0 … conv4 plus an add node, each conv reading its own weight tensor (weight0 … weight4). Figure labels: Input Data, Output Data, Weight Parameters, MROM, MRAM.]

Problem: can this graph fit the memory constraints of our MCU device?

SLIDE 12

Deployment of an integer-only graph

[Same example graph as Slide 11.]

Problem Can this graph fit the memory constraints of our MCU device?

Read-only memory (MROM) for static parameters (weights); read-write memory (MRAM) for dynamic values (input/output data).

SLIDE 13

Deployment of an integer-only graph

[Same example graph as Slide 11.]

[M1] the weights must fit MROM:
    Σ_{i=0}^{L−1} mem(w^i, Q_w^i) + mem(M_0, N_0, B_q) < MROM

[M2] the input and output activations of every node must fit MRAM:
    mem(x^i, Q_x^i) + mem(y^i, Q_y^i) < MRAM,  ∀ i

Problem Formulation: find the quantization policy Q_w^i, Q_x^i, Q_y^i ∈ {2, 4, 8} bits that satisfies [M1] and [M2].
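A minimal Python sketch of the [M1]/[M2] feasibility check; the per-layer dictionary format and the byte rounding are assumptions of the sketch:

```python
def fits_memory(layers, M_ROM, M_RAM, icn_bytes=0):
    """Check constraints [M1] and [M2] for a candidate quantization policy.
    `layers` is a list of per-layer dicts with element counts and bit-widths, e.g.
    {"n_w": 36864, "Q_w": 4, "n_x": 200704, "Q_x": 8, "n_y": 100352, "Q_y": 8}."""
    mem = lambda n, q: (n * q + 7) // 8            # bytes, rounded up

    # [M1]: all weights (plus the ICN integer parameters) must fit MROM
    m1 = sum(mem(l["n_w"], l["Q_w"]) for l in layers) + icn_bytes < M_ROM
    # [M2]: the input and output activations of every node must fit MRAM
    m2 = all(mem(l["n_x"], l["Q_x"]) + mem(l["n_y"], l["Q_y"]) < M_RAM for l in layers)
    return m1 and m2
```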

SLIDE 14

Rule-Based Mixed-Precision

Weights Quantization Policy. Goal: maximize memory utilization.

Flow:
1. Set Q_w^i = 8 for every layer i.
2. Compute the memory occupation r_i = mem(w^i, Q_w^i) / totMEM and R = max_i r_i.
3. If [M1] is satisfied, stop. Otherwise cut Q_w^i of the layers with memory occupation r_i > R − ε and go back to step 2.

[M1]: size(w0) + size(w1) + size(w2) + size(w3) < MROM

Example (ε = 5%): four layers conv0…conv3 with weights w0…w3, all at Q_w = 8, with relative weight occupations of roughly 50% (conv3), 22%, 15% and 13%.

SLIDE 15

Rule-Based Mixed-Precision

Weights Quantization Policy (flow as on Slide 14). Any cut reduces the bit precision by one step: 8→4, 4→2.

Example (ε = 5%): conv3 holds the largest occupation (50% > R − ε), so “Cut layer 3!” and Q_w^3 drops from 8 to 4 bits. The relative occupations become roughly 33% (conv3), 30%, 20% and 17%.

[M1]: size(w0) + size(w1) + size(w2) + size(w3) < MROM

SLIDE 16

Rule-Based Mixed-Precision

Weights Quantization Policy (flow as on Slide 14). Any cut reduces the bit precision by one step: 8→4, 4→2.

Example (ε = 5%): [M1] is still violated, and conv2 is now within ε of the largest occupation, so “Cut layer 2!” and Q_w^2 drops from 8 to 4 bits (conv3 stays at 4 bits).

[M1]: size(w0) + size(w1) + size(w2) + size(w3) < MROM
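A Python sketch of the weight-policy loop described on Slides 14-16 (start at 8 bits everywhere, cut the layers whose relative occupation is within ε of the maximum, one precision step per cut). The stopping guard and the function names are mine:

```python
def weight_policy(n_w, M_ROM, icn_bytes=0, eps=0.05):
    """Rule-based weight quantization policy (Slides 14-16).
    n_w[i]: number of weights of layer i. Returns Q_w[i] in {8, 4, 2} bits."""
    mem = lambda n, q: (n * q + 7) // 8          # bytes, rounded up
    cut = {8: 4, 4: 2}                           # one precision step per cut
    Q_w = [8] * len(n_w)                         # start with 8 bits everywhere

    while True:
        occ = [mem(n, q) for n, q in zip(n_w, Q_w)]
        if sum(occ) + icn_bytes < M_ROM:         # [M1] satisfied -> done
            return Q_w
        r = [o / sum(occ) for o in occ]          # relative memory occupation
        R = max(r)
        changed = False
        for i, ri in enumerate(r):               # cut the most memory-hungry layers
            if ri > R - eps and Q_w[i] > 2:
                Q_w[i] = cut[Q_w[i]]
                changed = True
        if not changed:
            raise RuntimeError("cannot satisfy [M1] even at 2 bits")

# example: 4 layers, 2 MB read-only memory
print(weight_policy([3_000_000, 1_300_000, 900_000, 800_000], M_ROM=2 * 1024 ** 2))
# -> [2, 2, 4, 4]
```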

SLIDE 17

Rule-Based Mixed-Precision

Activation Quantization Policy. Goal: maximize memory utilization.

Flow:
1. Set Q_x^i = Q_y^{i−1} = 8 for every layer i (the output of layer i−1 is the input of layer i).
2. If [M2] is satisfied, stop. Otherwise:
   - forward pass: for any layer i, while mem(y^i, Q_y^i) > mem(x^i, Q_x^i) and [M2] is violated, cut Q_y^i;
   - backward pass: for any layer i, while mem(y^i, Q_y^i) < mem(x^i, Q_x^i) and [M2] is violated, cut Q_x^i.
Any cut reduces the bit precision by one step: 8→4, 4→2.

[M2] example on a graph with layers conv0…conv3 and activations y0…y3:
size(y0) + size(y1) < MRAM
size(y1) + size(y3) < MRAM
size(y2) + size(y3) < MRAM

(Figure callout, forward pass: “Cut this?”)

SLIDE 18

Rule-Based Mixed-Precision

(Flow chart and [M2] example as on Slide 17; forward pass: “Cut this?”)

SLIDE 19

Rule-Based Mixed-Precision

(Flow chart and [M2] example as on Slide 17; backward pass: “Cut this?”)
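A Python sketch of the activation-policy passes of Slides 17-19, simplified to a linear chain of layers (the real graphs also contain branches, as in the example above); the iteration cap is a safety guard added here:

```python
def activation_policy(n_act, M_RAM, max_iters=16):
    """Rule-based activation quantization policy (Slides 17-19), simplified to a
    linear chain: activation i is the output of layer i and the input of layer i+1.
    n_act[i]: element count of activation tensor i. Returns Q[i] in {8, 4, 2} bits."""
    mem = lambda n, q: (n * q + 7) // 8
    cut = {8: 4, 4: 2, 2: 2}
    Q = [8] * len(n_act)

    def m2(i):   # [M2] for layer i: its input (i-1) and output (i) must fit MRAM
        return mem(n_act[i - 1], Q[i - 1]) + mem(n_act[i], Q[i]) < M_RAM

    for _ in range(max_iters):
        if all(m2(i) for i in range(1, len(n_act))):
            return Q
        # forward pass: if the output is larger than the input, cut the output
        for i in range(1, len(n_act)):
            while not m2(i) and mem(n_act[i], Q[i]) > mem(n_act[i - 1], Q[i - 1]) and Q[i] > 2:
                Q[i] = cut[Q[i]]
        # backward pass: if the input is larger than the output, cut the input
        for i in reversed(range(1, len(n_act))):
            while not m2(i) and mem(n_act[i], Q[i]) < mem(n_act[i - 1], Q[i - 1]) and Q[i - 1] > 2:
                Q[i - 1] = cut[Q[i - 1]]
    raise RuntimeError("could not satisfy [M2]")

# example: 5 activation tensors, 512 kB read-write memory
print(activation_policy([150_528, 401_408, 200_704, 100_352, 50_176], M_RAM=512 * 1024))
# -> [8, 4, 8, 8, 8]
```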

SLIDE 20

Experimental Results on MobilenetV1

Iso-memory MobilenetV1 models with 2MB FLASH and 512kB RAM.

Open source: https://github.com/mrusci/training-mixed-precision-quantized-networks

Model    | Mparams | Full-Prec | Mix-PC | Mix-PL
224_1.0  | 4.24    | 70.9      | 64.3   | 59.6
192_1.0  | 4.24    | 70.0      | 65.9   | 61.9
224_0.75 | 2.59    | 68.4      | 68.0   | 67.0
192_0.75 | 2.59    | 67.2      | 67.2   | 64.8
224_0.5  | 1.34    | 63.3      | 63.5   | 63.1
192_0.5  | 1.34    | 61.7      | 62.0   | 59.5
(Top-1 accuracy; the Mix-PC and Mix-PL columns are integer-only models)

Quantization-aware fine-tuning recipe:
❑ Init with pre-trained parameters
❑ 8 hours on 4 NVIDIA Tesla P100
❑ ADAM, lr = 1e-4 (5e-5 at epoch 5, 1e-5 at epoch 8)
❑ Batch-norm statistics frozen after 1 epoch
❑ Asymmetric quantization of weights, either PC (min/max) or PL (PACT)
❑ Asymmetric activations (PACT)
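A hedged PyTorch sketch of the optimizer and learning-rate schedule from this recipe (Adam at 1e-4, then 5e-5 at epoch 5 and 1e-5 at epoch 8, batch-norm statistics frozen after the first epoch). The model below is a stand-in; the actual fake-quantized MobilenetV1 with PACT clipping comes from the authors’ repository:

```python
import torch
import torch.nn as nn

# Stand-in model: the real network is the fake-quantized MobilenetV1 (with PACT
# activation clipping) initialized from pre-trained floating-point parameters.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_at(epoch):                     # 1e-4, then 5e-5 at epoch 5 and 1e-5 at epoch 8
    return 1e-5 if epoch >= 8 else (5e-5 if epoch >= 5 else 1e-4)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda e: lr_at(e) / 1e-4)

for epoch in range(10):
    if epoch >= 1:                    # freeze batch-norm running statistics after 1 epoch
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()
    # ... one epoch of quantization-aware fine-tuning on the Imagenet loader goes here ...
    scheduler.step()
```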

SLIDE 21

Experimental Results on MobilenetV1

(Same setup, results table, and fine-tuning recipe as Slide 20.)

❑ 224_1.0 / 192_1.0: higher drop due to more aggressive cuts (4.24 M parameters cannot fit the 2 MB FLASH even at 8 bits).

SLIDE 22

Experimental Results on MobilenetV1

(Same setup, results table, and fine-tuning recipe as Slide 20.)

❑ 224_0.5 / 192_0.5: lossless, ~8 bit fits the memory constraints.
❑ 224_1.0 / 192_1.0: higher drop due to more aggressive cuts.

SLIDE 23

Experimental Results on MobilenetV1

(Same setup, results table, and fine-tuning recipe as Slide 20.)

❑ 224_0.5 / 192_0.5: lossless, ~8 bit fits the memory constraints.
❑ 224_1.0 / 192_1.0: higher drop due to more aggressive cuts.
❑ 224_0.75 / 192_0.75: nearly lossless, few but ‘significant’ cuts ☺

Overall: an integer-only network running on an MCU with a 2.9% accuracy drop with respect to the most precise model.

SLIDE 24

LATENCY-ACCURACY TRADE-OFF ON A STM32H7 MCU

Deployments on MCUs

[DNN Development Flow (same diagram as Slide 6), highlighting the Microcontroller deployment step.]

SLIDE 25

Latency-Accuracy Trade-Off

[Scatter plot: Top-1 Accuracy vs Latency (CPU MCycles) for the 128/160/192/224 MixQ-PL and MixQ-PC-ICN models. Highlighted points: Mobilenet_192_0.5 INT8, Top1 59.5%; Mobilenet_224_0.75 MixQ-PC-ICN, Top1 68%; Mobilenet_224_0.75 MixQ-PL, Top1 67%.]

Experiments run on an STM32H743 (400 MHz clock).

➢ The implementation is based on the software library for mixed-precision inference (built on CMSIS-NN):
  ❑ CMix-NN: https://github.com/EEESlab/CMix-NN
  ❑ UINT2 and UINT4 software emulated
  ❑ MAC on 2x16 bits
➢ PC models are on the Pareto frontier
➢ But PC is slower than PL by 20-30%

Φ = Σ (x_q − z_x) · (w_q − z_w)
  = Σ x_im2col · (w_q − z_w)
  = Σ x_im2col · w_q − Σ x_im2col · z_w

With PL quantization the weight offset z_w is a single per-layer scalar, while with PC quantization it changes per output channel.

Overall: +8% accuracy with respect to the best 8-bit integer-only MobilenetV1 fitting the device (Jacob et al., 2018).
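A small NumPy check of the accumulator decomposition above, on illustrative values (the real kernels operate on packed sub-byte data):

```python
import numpy as np

# Verify the accumulator decomposition used by the mixed-precision kernels:
#   Phi = sum((x_q - z_x) * (w_q - z_w))
#       = sum(x_im2col * w_q) - z_w * sum(x_im2col)
# where x_im2col = x_q - z_x is the zero-point-corrected input patch.
rng = np.random.default_rng(0)
x_q = rng.integers(0, 256, 64).astype(np.int32)   # one 8-bit input patch (im2col column)
w_q = rng.integers(0, 16, 64).astype(np.int32)    # one 4-bit weight filter
z_x, z_w = 128, 7                                 # zero-points

x_im2col = x_q - z_x
phi_direct = np.sum((x_q - z_x) * (w_q - z_w))
phi_split = np.sum(x_im2col * w_q) - z_w * np.sum(x_im2col)
assert phi_direct == phi_split
# With PL quantization z_w (and the z_w * sum term) is one scalar per layer; with PC
# quantization it changes per output channel, which contributes to the 20-30% overhead
# reported above.
```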

SLIDE 26

Wrap-up

• We proposed an end-to-end methodology to train and deploy ‘complex’ DL models on tiny MCUs:
  – sub-byte uniform quantization
  – mixed-precision settings
  – a memory-driven rule-based method to determine the quantization policy
  – integer-only transformation with ICN activation layers
  – a mixed-precision software library for MCUs

• Deployment of a 68% Top-1 Imagenet MobilenetV1 into an MCU with 2 MB FLASH and 512 kB RAM.
