

SLIDE 1

MCUNet: Tiny Deep Learning on IoT Devices

Ji Lin 1, Wei-Ming Chen 1,2, Yujun Lin 1, John Cohn 3, Chuang Gan 3, Song Han 1
1 MIT   2 National Taiwan University   3 MIT-IBM Watson AI Lab

NeurIPS 2020 (spotlight)

SLIDES 2–4

Background: The Era of AIoT on Microcontrollers (MCUs)

  • Low-cost, low-power
  • Rapid growth
  • Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home, …

[Chart: MCU shipments (#units, billions) rise steadily over 2012–2019 (F = forecast), on a 10–50 billion scale.]

SLIDES 5–10

Challenge: Memory Too Small to Hold DNN

                        Cloud AI    Mobile AI    Tiny AI (MCU)
  Memory (Activation)   16 GB       4 GB         320 kB
  Storage (Weights)     ~TB/PB      256 GB       1 MB

MCU memory is roughly 13,000× smaller than a mobile phone's and 50,000× smaller than a cloud GPU's.

We need to reduce the peak activation size AND the model size to fit a DNN into MCUs.

SLIDE 11

Existing efficient networks only reduce model size, NOT activation size!

[Chart: parameters (MB) vs. peak activation (MB) at ~70% ImageNet top-1 for ResNet-18, MobileNetV2-0.75, and MCUNet. MobileNetV2-0.75 cuts parameters by 4.6× relative to ResNet-18, but peak activation by only 1.8×.]

SLIDES 12–14

Challenge: Memory Too Small to Hold DNN

[Chart: peak memory (kB, axis up to 8000) against the 320 kB constraint. ResNet-50, MobileNetV2, and int8-quantized MobileNetV2 all exceed the budget (over-budget factors of 22×, 23×, and 5× are labeled), while MCUNet fits under 320 kB.]

SLIDES 15–19

MCUNet: System-Algorithm Co-design

(a) Search an NN model on top of an existing library (e.g., ProxylessNAS, MnasNet): NAS against a fixed library.
(b) Tune the deep learning library for a given NN model (e.g., TVM): library optimization for a fixed model.
(c) MCUNet: system-algorithm co-design. TinyNAS (efficient neural architecture) and TinyEngine (efficient compiler/runtime) are optimized jointly.

SLIDES 20–22

TinyNAS: Two-Stage NAS for Tiny Memory Constraints

  • Search space design is crucial for NAS performance, yet there is no prior expertise on MCU model design.
  • Stage 1: shrink the full network space into an optimized search space that satisfies the memory/storage constraints.
  • Stage 2: perform model specialization within the optimized search space.

SLIDES 23–27

TinyNAS: (1) Automated search space optimization

Revisit the ProxylessNAS search space: S = kernel size × expansion ratio × depth
  • kernel size of the depth-wise convolution k ∈ {3, 5, 7} (inverted bottleneck block: pw1 → dw → pw2)
  • expansion ratio e ∈ {2, 4, 6}
  • depth d ∈ {2, 3, 4}
Models drawn from this mobile-scale space are out of memory on MCUs.

SLIDES 28–31

TinyNAS: (1) Automated search space optimization

Extend the search space to cover a wide range of hardware capacities:
S' = kernel size × expansion ratio × depth × input resolution R × width multiplier W

Different R and W for different hardware capacities (i.e., different optimized sub-spaces):
  • Mobile-scale hardware: R = 224, W = 1.0; larger budgets: R = 260, W = 1.4 *
  • MCUs (STM32 F412 / F746 / H743 / … with 256 kB / 320 kB / 512 kB / … SRAM): R = ?, W = ?, to be chosen automatically per device.

* Cai et al., Once-for-All: Train One Network and Specialize It for Efficient Deployment, ICLR'20

SLIDES 32–37

TinyNAS: (1) Automated search space optimization

Analyze the FLOPs distribution of the satisfying models (those that fit the memory constraint, e.g., 320 kB) in each candidate search space:
larger FLOPs → larger model capacity → more likely to give higher accuracy.

[Chart: cumulative probability vs. FLOPs (M) for candidate sub-spaces, labeled width-resolution | mean FLOPs: w0.3-r160 | 32.5, w0.4-r112 | 32.4, w0.4-r128 | 39.3, w0.4-r144 | 46.9, w0.5-r112 | 38.3, w0.5-r128 | 46.9, w0.5-r144 | 52.0, w0.6-r112 | 41.3, w0.7-r96 | 31.4, w0.7-r112 | 38.4. Reading at p = 80%: a bad design space reaches only 32.3M FLOPs (best searched accuracy 74.2%), an intermediate space reaches 45.4M FLOPs (best accuracy 76.4%), and a good design space reaches 50.3M FLOPs (best accuracy 78.7%). A good design space is more likely to achieve high FLOPs under the memory constraint.]
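
The space-selection heuristic above is straightforward to prototype. Below is a minimal, hypothetical Python sketch (not the released TinyNAS code): it randomly samples architectures from each candidate (width multiplier, resolution) sub-space, keeps only the samples whose estimated peak activation fits the SRAM budget, and picks the sub-space whose satisfying models have the largest mean FLOPs. The estimate_flops and estimate_peak_memory helpers are rough stand-ins for real analytic profilers.

```python
import random
import statistics

def estimate_flops(arch, resolution):
    # Hypothetical proxy: FLOPs grow with resolution^2, width^2, expansion, and depth.
    return arch["expand"] * arch["width"] ** 2 * arch["depth"] * resolution ** 2 * 1e-3

def estimate_peak_memory(arch, resolution):
    # Hypothetical proxy: peak int8 activation of the widest layer, in bytes.
    return int(resolution * resolution * 16 * arch["width"] * arch["expand"])

def sample_arch(width_mult):
    return {
        "width": width_mult,
        "depth": random.choice([2, 3, 4]),
        "kernel": random.choice([3, 5, 7]),
        "expand": random.choice([2, 4, 6]),
    }

def score_subspace(width_mult, resolution, sram_limit=320 * 1024, n_samples=1000):
    """Mean FLOPs of sampled models that satisfy the memory constraint."""
    flops = [
        estimate_flops(a, resolution)
        for a in (sample_arch(width_mult) for _ in range(n_samples))
        if estimate_peak_memory(a, resolution) <= sram_limit
    ]
    return statistics.mean(flops) if flops else 0.0

# Pick the (W, R) sub-space whose satisfying models have the largest mean FLOPs.
candidates = [(w, r) for w in (0.3, 0.4, 0.5, 0.6, 0.7) for r in (96, 112, 128, 144, 160)]
best = max(candidates, key=lambda wr: score_subspace(*wr))
print("selected sub-space: width=%.1f, resolution=%d" % best)
```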

SLIDES 38–39

TinyNAS: (2) Resource-constrained model specialization

  • One-shot NAS through weight sharing*: train a super network, randomly sampling a sub-network (kernel size, expansion ratio, depth) at each step and jointly fine-tuning multiple sub-networks (a minimal sampling sketch follows).
  • Small sub-networks are nested inside large sub-networks, so the accuracy of sub-nets can be evaluated directly, without retraining.

* Cai et al., Once-for-All: Train One Network and Specialize It for Efficient Deployment, ICLR'20

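
A minimal sketch of the sampling step, assuming a hypothetical supernet, train_step, and data_loader (shown only as comments); the runnable part simply draws a random sub-network configuration from the shared per-block choices.

```python
import random

def sample_subnet_config(num_blocks=12):
    """Randomly sample one sub-network from the super network's per-block choices."""
    return [
        {"kernel": random.choice([3, 5, 7]),     # elastic kernel size
         "expand": random.choice([2, 4, 6]),     # elastic expansion ratio
         "skip": random.random() < 0.25}         # elastic depth: block may be skipped
        for _ in range(num_blocks)
    ]

# One-shot training loop (hypothetical supernet / train_step / data_loader):
# for step in range(num_steps):
#     cfg = sample_subnet_config()
#     loss = train_step(supernet, cfg, next(data_loader))
print(sample_subnet_config(3))
```
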
SLIDES 40–43

TinyNAS: (2) Resource-constrained model specialization

The super network is elastic along three dimensions (a minimal illustration follows this list):
  • Elastic kernel size: start training with the full kernel (7×7); a smaller kernel (5×5, 3×3) reuses the centered weights of the larger one.
  • Elastic depth: train with the full depth first, then allow later layers in each unit to be skipped to shrink the depth.
  • Elastic width: train with the full width, rank channels by importance, reorganize so the most important channels come first, and keep only those when shrinking the width.
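
A minimal NumPy sketch of the three shrinking operations, under assumed weight shapes (an illustration of the weight-sharing idea, not the actual TinyNAS / Once-for-All implementation): a smaller kernel slices the center of the full kernel, a smaller depth runs only the first blocks of a unit, and a smaller width keeps the channels with the largest L1 importance.

```python
import numpy as np

def center_crop_kernel(weight, k):
    """Elastic kernel size: a k x k kernel reuses the centered weights of the full kernel.
    weight: (out_ch, in_ch, K, K) with K >= k."""
    K = weight.shape[-1]
    start = (K - k) // 2
    return weight[:, :, start:start + k, start:start + k]

def run_unit(blocks, x, depth):
    """Elastic depth: execute only the first `depth` blocks of a unit."""
    for block in blocks[:depth]:
        x = block(x)
    return x

def shrink_width(weight, keep):
    """Elastic width: rank output channels by L1 importance and keep the top `keep`."""
    importance = np.abs(weight).sum(axis=(1, 2, 3))   # one score per output channel
    order = np.argsort(-importance)                   # most important channels first
    return weight[order[:keep]]

# Tiny usage example with random weights.
w = np.random.randn(16, 8, 7, 7)
print(center_crop_kernel(w, 3).shape)   # (16, 8, 3, 3)
print(shrink_width(w, 4).shape)         # (4, 8, 7, 7)
x = np.zeros(8)
print(run_unit([lambda t: t + 1, lambda t: t * 2, lambda t: t - 3], x, depth=2)[0])  # 2.0
```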

SLIDES 44–45

TinyNAS Better Utilizes the Memory

[Chart: block-wise peak memory (kB, up to 300) for the first two stages of MobileNetV2 vs. TinyNAS, with the max and average levels marked (labels: 2.2×, 1.6×).]

  • TinyNAS designs networks with a more uniform peak memory across blocks, allowing us to fit a larger model at the same amount of memory.

SLIDES 46–51

TinyEngine: A Memory-Efficient Inference Library

1. Reducing overhead with separated compilation & runtime

(a) Existing libraries are based on runtime interpretation (e.g., TF-Lite Micro, CMSIS-NN): every supported op is shipped to the device, and the model's meta information and memory allocation are interpreted at runtime. This brings computation overhead, storage overhead, and memory overhead.

(b) TinyEngine: model-adaptive code generation. The NN model is processed at compile time (offline) into specialized ops and a memory schedule, and only the generated inference code runs on the device. Code generation alone reduces peak memory by 1.7× relative to the ARM CMSIS-NN baseline.

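To make the compile-time vs. runtime split concrete, here is a small, hypothetical Python sketch of model-adaptive code generation (not TinyEngine's actual generator): given a model description, it emits a C-like call sequence containing only the ops the model uses and statically sized activation buffers, so the device-side runtime needs no interpreter and no dynamic allocator. The emitted op names and buffer scheme are illustrative assumptions.

```python
# Hypothetical model description: (op_name, output_bytes) per layer.
model = [
    ("conv3x3_s2", 36 * 1024),
    ("dw_conv3x3", 36 * 1024),
    ("pw_conv1x1", 18 * 1024),
    ("avgpool",     1 * 1024),
]

def generate_inference_c(model):
    """Emit specialized C source: one call per layer, static buffers, no interpretation."""
    peak = max(out for _, out in model)
    lines = [
        "// Auto-generated for one specific model: only the ops it uses are linked in.",
        f"static int8_t act_buf0[{peak}];   // ping-pong activation buffers,",
        f"static int8_t act_buf1[{peak}];   // sized at compile time (no malloc).",
        "void invoke(const int8_t *input, int8_t *output) {",
        "    const int8_t *src = input;",
    ]
    for i, (op, out_bytes) in enumerate(model):
        dst = f"act_buf{i % 2}" if i < len(model) - 1 else "output"
        lines.append(f"    {op}(src, {dst});   // {out_bytes} B output")
        lines.append(f"    src = {dst};")
    lines.append("}")
    return "\n".join(lines)

print(generate_inference_c(model))
```
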
SLIDES 52–55

TinyEngine: A Memory-Efficient Inference Library

2. In-place depth-wise convolution

The dilemma of the inverted bottleneck: channels expand 1× → 6× → 1×, so peak memory is dominated by the expanded depth-wise stage.

(a) Conventional depth-wise convolution keeps a full input activation and a full output activation alive (channels 1 … N each), so peak memory is 2N channel buffers.
(b) In-place depth-wise convolution: because each output channel depends only on the same input channel, channel i can be computed into a small temp buffer and then written back over input channel i. Peak memory drops to N + 1 channel buffers.

[Chart: peak memory (kB). Together, code generation and in-place depth-wise convolution reduce peak memory by 3.4× relative to the ARM CMSIS-NN baseline (chart values 160, 76, and 48 kB).]
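
A minimal NumPy sketch of the in-place trick (illustrative only; TinyEngine implements this in C on int8 data): each channel is convolved into a single temporary channel buffer and then written back over the corresponding input channel, so only one extra channel of memory is ever alive.

```python
import numpy as np

def depthwise3x3_inplace(act, weights):
    """In-place 3x3 depth-wise convolution (stride 1, zero padding).
    act:     (C, H, W) activation, overwritten channel by channel (peak mem ~ C+1 channels)
    weights: (C, 3, 3), one 3x3 filter per channel."""
    C, H, W = act.shape
    temp = np.zeros((H, W), dtype=act.dtype)           # the single extra channel buffer
    padded = np.zeros((H + 2, W + 2), dtype=act.dtype)
    for c in range(C):
        padded[1:-1, 1:-1] = act[c]                    # pad the current input channel
        temp[:] = 0
        for ky in range(3):
            for kx in range(3):
                temp += weights[c, ky, kx] * padded[ky:ky + H, kx:kx + W]
        act[c] = temp                                  # write back: output reuses input storage
    return act

x = np.random.randn(8, 16, 16).astype(np.float32)
w = np.random.randn(8, 3, 3).astype(np.float32)
print(depthwise3x3_inplace(x.copy(), w).shape)  # (8, 16, 16)
```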

SLIDES 56–57

TinyEngine: Faster Inference Speed

Analyzing the million MAC/s (higher is better) improved by each technique:
  • Baseline, ARM CMSIS-NN: 52 MMAC/s.
  • (1) Code generator-based compilation eliminates the overhead of runtime interpretation: 52 → 64 MMAC/s.

SLIDE 58

TinyEngine: Faster Inference Speed

  • (2) Model-adaptive memory scheduling increases data reuse for each layer: (a) model-level memory scheduling and (b) tile-size configuration for im2col. With the specialized im2col: 64 → 70 MMAC/s.
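
For reference, a compact NumPy sketch of im2col-based convolution (a generic illustration, not TinyEngine's int8 kernels): the input patches are unpacked into a matrix so the convolution becomes a single matrix multiply, and this im2col buffer is the object whose tile size TinyEngine configures per layer.

```python
import numpy as np

def im2col_conv(x, w, stride=1):
    """Convolution via im2col + matmul.
    x: (C_in, H, W), w: (C_out, C_in, K, K); 'valid' padding."""
    C_in, H, W = x.shape
    C_out, _, K, _ = w.shape
    H_out, W_out = (H - K) // stride + 1, (W - K) // stride + 1

    # Unpack every K x K receptive field into one column of the im2col buffer.
    cols = np.empty((C_in * K * K, H_out * W_out), dtype=x.dtype)
    idx = 0
    for i in range(0, H - K + 1, stride):
        for j in range(0, W - K + 1, stride):
            cols[:, idx] = x[:, i:i + K, j:j + K].reshape(-1)
            idx += 1

    out = w.reshape(C_out, -1) @ cols            # one GEMM computes the whole layer
    return out.reshape(C_out, H_out, W_out)

x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
print(im2col_conv(x, w).shape)  # (4, 6, 6)
```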

SLIDE 59

TinyEngine: Faster Inference Speed

  • (3) Computation kernel specialization, operator fusion: fuse adjacent ops into one specialized kernel (e.g., Pad + Conv + ReLU + BN) to minimize the memory footprint and optimize the overall computation. With op fusion: 70 → 75 MMAC/s.
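
One common instance of such fusion is folding batch normalization (and the following ReLU) into the convolution itself, so no intermediate tensor is ever written out. A small NumPy sketch with hypothetical parameter names, showing how the folded weights and bias are derived:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into conv weights/bias:
    BN(conv(x, w) + b) == conv(x, w_folded) + b_folded."""
    scale = gamma / np.sqrt(var + eps)            # one scale per output channel
    w_folded = w * scale[:, None, None, None]     # w: (C_out, C_in, K, K)
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded

def fused_conv_bn_relu(x_col, w, b, gamma, beta, mean, var):
    """Fused kernel on an im2col buffer: conv (as GEMM) + folded BN + ReLU in one pass."""
    w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)
    out = w_f.reshape(w.shape[0], -1) @ x_col + b_f[:, None]
    return np.maximum(out, 0)                     # ReLU applied in place of a separate op

C_out, C_in, K = 4, 3, 3
w = np.random.randn(C_out, C_in, K, K); b = np.zeros(C_out)
gamma, beta = np.ones(C_out), np.zeros(C_out)
mean, var = np.zeros(C_out), np.ones(C_out)
x_col = np.random.randn(C_in * K * K, 36)
print(fused_conv_bn_relu(x_col, w, b, gamma, beta, mean, var).shape)  # (4, 36)
```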

SLIDE 60

TinyEngine: Faster Inference Speed

  • (3) Computation kernel specialization, loop unrolling: fully unroll the inner loops (e.g., the 9 multiply-accumulates of a 3×3 convolution) to eliminate the branch-instruction overhead of loops. With loop unrolling: 75 → 79 MMAC/s.
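
Since TinyEngine generates its kernels at compile time, unrolling amounts to emitting the multiply-accumulate statements explicitly instead of a runtime loop. A hypothetical Python snippet that prints a fully unrolled inner body for a 3×3 kernel (the emitted C identifiers are illustrative):

```python
def emit_unrolled_3x3(kernel_name="dw3x3"):
    """Emit a C snippet with the 9 MACs of a 3x3 kernel written out explicitly,
    so the generated code has no inner-loop branches."""
    macs = [
        f"    acc += w[{ky * 3 + kx}] * in[(y + {ky}) * row_stride + (x + {kx})];"
        for ky in range(3) for kx in range(3)
    ]
    return f"// unrolled inner body of {kernel_name}\n" + "\n".join(macs)

print(emit_unrolled_3x3())
```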

SLIDE 61

TinyEngine: Faster Inference Speed

  • (3) Computation kernel specialization, loop tiling: tile the loops over the im2col buffer, weights, and output based on the kernel size and the available memory, with unrolling/tiling specialized per network architecture. With tiling: 79 → 82 MMAC/s.
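
The tile size itself can be derived from the memory budget. A minimal, hypothetical helper (not TinyEngine's planner), which assumes for simplicity that the weights, the im2col slice, and the output tile all share one working buffer, and picks the largest number of output pixels per tile that still fits:

```python
def pick_tile_size(c_in, c_out, kernel, sram_budget_bytes, bytes_per_elem=1):
    """Largest tile (in output pixels) such that weights + im2col slice + output tile
    fit in the budget (int8 elements by default)."""
    weight_bytes = c_out * c_in * kernel * kernel * bytes_per_elem
    per_pixel = (c_in * kernel * kernel + c_out) * bytes_per_elem  # im2col column + output column
    free = sram_budget_bytes - weight_bytes
    return max(free // per_pixel, 0)

# e.g., a 3x3 conv with 64 -> 64 channels and a 160 kB working budget (illustrative numbers)
print(pick_tile_size(64, 64, 3, 160 * 1024))
```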

SLIDES 62–64

TinyEngine: A Memory-Efficient Inference Library

[Charts: peak memory (kB): 3.4× smaller than the ARM CMSIS-NN baseline with code generation + in-place depth-wise convolution (values 160, 76, 48 kB). Million MAC/s: 52 (CMSIS-NN baseline) → 64 (code generation) → 70 (specialized im2col) → 75 (op fusion) → 79 (loop unrolling) → 82 (tiling), 1.6× faster overall.]

  • Consistent improvement on different networks (SmallCifar, MobileNetV2, ProxylessNAS, MnasNet) against TF-Lite Micro, MicroTVM, and tuned CMSIS-NN.

[Charts: normalized speed and peak memory (kB) per network. TinyEngine is roughly 3× faster than TF-Lite Micro and 1.5–1.6× faster than MicroTVM, and uses 2.7–4.8× less peak memory than the baselines, several of which run out of memory (OOM) on the larger models.]

SLIDE 65

Experimental Results

We focus on large-scale datasets to reflect real-life use cases.
Datasets: (1) ImageNet-1000; (2) wake words:
  • Visual: Visual Wake Words
  • Audio: Google Speech Commands

SLIDES 66–68

System-Algorithm Co-design Gives the Best Results

  • ImageNet classification on an STM32F746 MCU (320 kB SRAM, 1 MB Flash), top-1 accuracy:
      Baseline (MbV2* + CMSIS-NN):        40%
      System-only (MbV2* + TinyEngine):   44%
      Model-only (TinyNAS + CMSIS-NN):    56%
      Co-design (TinyNAS + TinyEngine):   62%

* scaled-down version: width multiplier 0.3, input resolution 80

SLIDES 69–71

Handling Diverse Hardware

  • Specializing models (int4) for different MCUs (SRAM/Flash), ImageNet top-1 accuracy:
      STM32F412 (256 kB / 1 MB): 62.0%
      STM32F746 (320 kB / 1 MB): 63.5%
      STM32F765 (512 kB / 1 MB): 65.9%
      STM32H743 (512 kB / 2 MB): 70.7%

  • The first to achieve >70% ImageNet accuracy on commercial MCUs: +17% over the MobileNetV2 + CMSIS-NN baseline (53.8%).

SLIDES 72–73

Reduce Both Model Size and Activation Size

[Chart: parameters (MB) vs. peak activation (MB) at ~70% ImageNet top-1. Compared with ResNet-18, MobileNetV2-0.75 reduces parameters by 4.6× but peak activation by only 1.8×; MCUNet reduces them by 24.6× and 13.8×, shrinking both the model size and the activation size.]

SLIDES 74–76

Visual Wake Words (VWW)

[Charts: (a) VWW accuracy vs. measured latency (ms), with 5 FPS and 10 FPS reference lines; (b) VWW accuracy vs. peak SRAM (kB), with the 256 kB MCU constraint marked. Curves: MCUNet, MobileNetV2, ProxylessNAS, Han et al.; the baselines go out of memory (OOM) toward the high-accuracy end. At comparable accuracy, MCUNet is 2.4–3.4× faster and 3.7× smaller in peak memory.]

SLIDE 77

Audio Wake Words (Google Speech Commands)

[Charts: (a) GSC accuracy vs. measured latency (ms), with 5 FPS and 10 FPS reference lines; (b) GSC accuracy vs. peak SRAM (kB), with the 256 kB constraint marked. Curves: MCUNet, MobileNetV2, ProxylessNAS; the baselines go OOM toward the high-accuracy end. MCUNet is 2.8× faster and 4.1× smaller in peak memory at 2% higher accuracy.]

SLIDE 78

Visual Wake Word Detection

  • Detecting whether a person is present in the frame
SLIDE 79

MCUNet: Tiny Deep Learning on IoT Devices

Project Page: http://tinyml.mit.edu

  • Our study suggests that the era of tiny machine learning on IoT devices has arrived.

Cloud AI (ResNet) → Mobile AI (MobileNet) → Tiny AI (MCUNet)