AutoML for TinyML with Once-for-All Network
Song Han, Massachusetts Institute of Technology
Once-for-All, ICLR’20
Less Computational Resources: TinyML

A large model needs a lot of computation; a small model needs less — and with AutoML, fewer engineers. The same spectrum runs from Cloud AI to Mobile AI to Tiny AI (AIoT).
Challenge: Efficient Inference on Diverse Hardware Platforms

Efficient inference is hard, especially on resource-constrained edge devices.

Training a single model:

    for training iterations:
        forward-backward();

Design cost: 200 GPU hours (calculated assuming MobileNet-v2).
Adding architecture search multiplies the cost:

    (1) for search episodes:
            for training iterations: forward-backward();
            if good_model: break;
    (2) for post-search training iterations:
            forward-backward();

Design cost: 40K GPU hours per deployment scenario (calculated assuming MnasNet [1]).

Hardware platforms are diverse (e.g., phone generations from 2013, 2015, 2017, 2019), and conventional NAS repeats steps (1) and (2) for each one:

    for devices:
        (1) search; (2) retrain;

Design cost: 160K GPU hours for 4 devices.

The diversity spans Cloud AI (~10^12 FLOPS), Mobile AI (~10^9 FLOPS), and Tiny AI (~10^6 FLOPS):

    for many devices:
        (1) search; (2) retrain;

Design cost: 1600K GPU hours.

Since 1 GPU hour translates to 0.284 lbs of CO2 emission (Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019), 40K, 160K, and 1600K GPU hours correspond to 11.4k, 45.4k, and 454.4k lbs of CO2, respectively.

[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
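The GPU-hour-to-CO2 conversion above is a one-line calculation; a minimal sketch (the constant is the figure from Strubell et al. quoted above):

```python
# Convert NAS design cost (GPU hours) to CO2 emission, using the
# 0.284 lbs-CO2-per-GPU-hour figure from Strubell et al. (ACL 2019).
LBS_CO2_PER_GPU_HOUR = 0.284

def co2_lbs(gpu_hours: float) -> float:
    return gpu_hours * LBS_CO2_PER_GPU_HOUR

for hours in (40_000, 160_000, 1_600_000):
    # ~11.4k, ~45.4k, and ~454.4k lbs, matching the slide
    print(f"{hours:>9,} GPU hours -> {co2_lbs(hours):>9,.0f} lbs CO2")
```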
Problem: TinyML (cheap inference) comes at the cost of BigML (expensive training/search).

We need Green AI: solve the environmental problem of NAS. The Evolved Transformer (ICML’19, ACL’19) is an extreme example of search cost; our Hardware-Aware Transformer (ACL’20) cuts the emission to 52 lbs of CO2 — about 4 orders of magnitude less.
OFA: Decouple Training and Search

Conventional NAS couples them:

    for devices:
        for search episodes:
            for training iterations: forward-backward();
            if good_model: break;
        for post-search training iterations: forward-backward();

Once-for-All decouples training from search:

    training (paid once):
        for OFA training iterations: forward-backward();
    search (cheap, per device):
        for devices:
            for search episodes:
                sample from OFA;
                if good_model: break;
            direct deploy without training;
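The decoupling is what makes the cost amortize. A toy sketch of the cost structure (all names and iteration counts are stand-ins, not the real OFA API):

```python
# Conventional NAS repeats training inside the search loop for every device;
# OFA trains one weight-sharing network once, then search only *evaluates*
# sampled sub-networks -- no training in the loop.
TRAIN_ITERS, SEARCH_EPISODES, DEVICES = 1000, 50, 4

def conventional_nas_cost():
    cost = 0
    for _device in range(DEVICES):
        for _episode in range(SEARCH_EPISODES):
            cost += TRAIN_ITERS          # train every candidate from scratch
        cost += TRAIN_ITERS              # post-search retraining
    return cost

def ofa_cost():
    cost = TRAIN_ITERS                   # train the once-for-all network once
    for _device in range(DEVICES):
        for _episode in range(SEARCH_EPISODES):
            cost += 1                    # just evaluate a sampled sub-network
    return cost

# Search cost no longer scales with training cost or device count.
print(conventional_nas_cost(), ofa_cost())
```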
A single once-for-all network serves diverse hardware platforms — Cloud AI (~10^12 FLOPS), Mobile AI (~10^9 FLOPS), Tiny AI (~10^6 FLOPS) — so the train-once cost is amortized across all of them.
Once-for-All Network: Decouple Model Training and Architecture Design
Challenge: how to prevent different sub-networks from interfering with each other?

A single once-for-all network contains about 10^19 different sub-networks, covering 4 dimensions: resolution, kernel size, depth, and width. Supporting so many sub-networks makes training much harder than training a normal neural network.

Solution: Progressive Shrinking

1. Train the full network.
2. Shrink the network along the 4 dimensions.
3. Jointly fine-tune both large and small sub-networks.
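The ~10^19 figure can be sanity-checked by counting, assuming the paper's mobile design space (5 units, per-unit depth in {2, 3, 4}, per-layer kernel size in {3, 5, 7} and width expand ratio in {3, 4, 6}); the exact choice lists here are an assumption from the OFA paper, not from the slide:

```python
# Rough count of sub-networks in the once-for-all network.
choices_per_layer = 3 * 3                      # kernel size x width options
per_unit = sum(choices_per_layer ** d for d in (2, 3, 4))
total = per_unit ** 5                          # 5 independent units
print(f"{total:.1e}")                          # on the order of 10^19
```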
Connection to Network Pruning

Network pruning: train the full model → shrink the model (width only) → fine-tune the single pruned network.

Progressive shrinking: train the full model → shrink the model along 4 dimensions → fine-tune both large and small sub-networks, for higher flexibility across the 4 dimensions.
Progressive Shrinking

The dimensions become elastic in stages. Elastic resolution is on from the start and stays on throughout: randomly sample the input image size for each batch. Kernel size, depth, and width begin at their full values and are then made elastic (partial) one after another, while previously shrunk dimensions keep being sampled.
Progressive Shrinking: Elastic Kernel Size

Start with the full 7x7 kernel. A smaller kernel takes the centered weights of the larger one through a learned transformation matrix: a 25x25 matrix maps the 7x7 kernel's 5x5 center to the 5x5 kernel, and a 9x9 matrix maps the 5x5 kernel's 3x3 center to the 3x3 kernel.
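A minimal numpy sketch of the center-crop-plus-transform step (the identity initialization and function names are illustrative assumptions; in training the matrix is a learned per-layer parameter):

```python
import numpy as np

# Elastic kernel size: a 5x5 kernel reuses the centered 5x5 window of the
# full 7x7 kernel, passed through a 25x25 transformation matrix.
rng = np.random.default_rng(0)
full_kernel = rng.standard_normal((7, 7))
transform_5x5 = np.eye(25)                     # learnable in practice

def shrink_kernel(kernel, small, transform):
    k = kernel.shape[0]
    start = (k - small) // 2
    center = kernel[start:start + small, start:start + small]
    # Flatten the centered weights, transform, reshape to the small kernel.
    return (transform @ center.reshape(-1)).reshape(small, small)

kernel_5x5 = shrink_kernel(full_kernel, 5, transform_5x5)
# With the identity transform, this is exactly the center crop.
assert np.allclose(kernel_5x5, full_kernel[1:6, 1:6])
```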
Progressive Shrinking: Elastic Depth

Each unit is first trained with full depth. Then later layers in each unit are gradually allowed to be skipped to reduce the depth (taking the unit's output after fewer layers), so shallower sub-networks share the leading layers of deeper ones.
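The depth-shrinking rule is just "run a prefix of each unit's layers." A tiny illustrative sketch (the toy layers are stand-ins):

```python
# Elastic depth: each unit keeps only its first `depth` layers, so a shallow
# sub-network shares a prefix of the deep network's weights.
def apply_unit(x, layers, depth):
    """Run the first `depth` layers of a unit; later layers are skipped."""
    for layer in layers[:depth]:
        x = layer(x)
    return x

# Toy layers: each appends its index so we can see which layers ran.
unit = [lambda x, i=i: x + [i] for i in range(4)]
assert apply_unit([], unit, 4) == [0, 1, 2, 3]   # full depth
assert apply_unit([], unit, 2) == [0, 1]         # shrunk depth: shared prefix
```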
Progressive Shrinking: Elastic Width

Train with full width first. Compute each channel's importance (e.g., scores like 0.85, 0.63, 0.15, 0.02), sort the channels, and reorganize the layer so the most important channels come first. Then progressively shrink the width: a narrow sub-network keeps only the most important channels.
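The channel-sorting step can be sketched as follows (using the L1 norm of each output channel's weights as the importance score, which is an assumption about the scoring rule):

```python
import numpy as np

# Elastic width via channel sorting: rank output channels by importance,
# reorganize so important channels come first, then a narrow sub-network
# keeps a prefix of the channels.
rng = np.random.default_rng(0)
weight = rng.standard_normal((6, 3, 3, 3))       # (out_ch, in_ch, kH, kW)

importance = np.abs(weight).reshape(6, -1).sum(axis=1)  # L1 norm per channel
order = np.argsort(-importance)                  # most important first
weight_sorted = weight[order]                    # the "reorg." step

width = 4                                        # shrunk width
narrow_weight = weight_sorted[:width]            # keep the top-4 channels
# The least important kept channel is still the 4th most important overall.
assert np.isclose(np.abs(narrow_weight).reshape(width, -1).sum(1).min(),
                  np.sort(importance)[-width])
```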
Progressive Shrinking: Putting It Together

1. Train the full once-for-all network: K = 7, D = 4, W = 6.
2. Elastic resolution: sample R from [128, 132, …, 224] for every batch, throughout training.
3. Elastic kernel size: K in [7, 5, 3] (D = 4, W = 6). Generate small-kernel weights via the transformation matrix (Fig. 3); fine-tune weights and transformation matrix.
4. Elastic depth: D in [4, 3], then [4, 3, 2] (K in [7, 5, 3], W = 6). Keep the first D layers at each unit, skipping the top (4 - D); sample K; fine-tune weights.
5. Elastic width: W in [6, 4], then [6, 4, 3] (K in [7, 5, 3], D in [4, 3, 2]). Sample W at each layer (Fig. 4); sample K and D; fine-tune weights.
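The schedule above can be sketched as a staged sampler: each stage only enlarges the set of choices, so earlier sub-networks stay in the training distribution (the stage list mirrors the schedule above; the sampler itself is illustrative, not the OFA implementation):

```python
import random

# Progressive-shrinking schedule: each stage widens the sampling space.
STAGES = [
    {"K": [7],       "D": [4],       "W": [6]},        # full network
    {"K": [7, 5, 3], "D": [4],       "W": [6]},        # elastic kernel size
    {"K": [7, 5, 3], "D": [4, 3],    "W": [6]},        # elastic depth, phase 1
    {"K": [7, 5, 3], "D": [4, 3, 2], "W": [6]},        # elastic depth, phase 2
    {"K": [7, 5, 3], "D": [4, 3, 2], "W": [6, 4]},     # elastic width, phase 1
    {"K": [7, 5, 3], "D": [4, 3, 2], "W": [6, 4, 3]},  # elastic width, phase 2
]
RESOLUTIONS = list(range(128, 225, 4))                 # elastic, every stage

def sample_config(stage):
    space = STAGES[stage]
    return {"R": random.choice(RESOLUTIONS),
            **{dim: random.choice(vals) for dim, vals in space.items()}}

cfg = sample_config(stage=5)                           # final stage: full space
assert cfg["K"] in (7, 5, 3) and cfg["D"] in (4, 3, 2) and cfg["W"] in (6, 4, 3)
```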
69
Performances of Sub-networks on ImageNet
ImageNet Top-1 Acc (%) 67 70 73 75 78
w/o PS w/ PS
D=2 W=3 K=3 D=2 W=3 K=7 D=2 W=6 K=3 D=2 W=6 K=7 D=4 W=3 K=3 D=4 W=3 K=7 D=4 W=6 K=3 D=4 W=6 K=7
2.5% 2.8% 3.5% 3.4% 3.3% 3.4% 3.7% 3.5%
Sub-networks under various architecture configurations D: depth, W: width, K: kernel size
Once-for-All, ICLR’20
How about search?

Search is decoupled from training: sample sub-networks from the trained OFA network (with an evolutionary algorithm) and directly deploy them without training.

    for OFA training iterations: forward-backward();
    for devices:
        for search episodes:
            sample from OFA;  // with evolution
            if good_model: break;
        direct deploy without training;
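A minimal evolutionary-search sketch over sub-network configurations. The accuracy and latency functions here are stand-ins (in OFA they are a trained accuracy predictor and per-device latency lookup tables), so nothing in the loop trains a network:

```python
import random

random.seed(0)
KS, DS, WS = (3, 5, 7), (2, 3, 4), (3, 4, 6)   # kernel, depth, width choices

def sample():
    return (random.choice(KS), random.choice(DS), random.choice(WS))

def latency(cfg):               # stand-in: bigger nets are slower
    k, d, w = cfg
    return k * d * w

def accuracy(cfg):              # stand-in: bigger nets are more accurate
    k, d, w = cfg
    return 60 + k + 2 * d + w

def evolve(budget, population=20, generations=10, mutate_p=0.3):
    # Seed the population with random configs that meet the latency budget.
    pop = [c for c in (sample() for _ in range(200))
           if latency(c) <= budget][:population]
    for _ in range(generations):
        parent = max(pop, key=accuracy)
        child = tuple(random.choice(space) if random.random() < mutate_p else g
                      for g, space in zip(parent, (KS, DS, WS)))
        if latency(child) <= budget:           # hard latency constraint
            pop.append(child)
    return max(pop, key=accuracy)

best = evolve(budget=60)
assert latency(best) <= 60      # the returned config meets the device budget
```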
2.6x faster than EfficientNet, 1.5x faster than MobileNetV3

On Google Pixel 1, OFA matches EfficientNet's accuracy with 2.6x lower latency (e.g., 80.1% vs. 79.8% top-1), or delivers 3.8% higher accuracy at the same latency. Against MobileNetV3, OFA is 1.5x faster at the same accuracy, or up to 4% more accurate at the same latency (e.g., 71.4% vs. 67.4%).
More accurate than training from scratch

Sub-networks taken directly from the OFA network (with inherited weights) are more accurate than the same architectures trained from scratch, across both the EfficientNet-range and MobileNetV3-range latency regimes on Google Pixel 1.
OFA: 80% Top-1 Accuracy on ImageNet

OFA reaches 80.0% ImageNet top-1 accuracy at 595M MACs — a new state of the art in the mobile vision setting (< 600M MACs) — with up to 14x less computation than prior handcrafted and AutoML models (EfficientNet, ProxylessNAS, MobileNetV3, NASNet-A, AmoebaNet, PNASNet, DARTS, ResNet/ResNeXt, DenseNet, Inception, Xception, and others).
OFA Enables Fast Specialization on Diverse Hardware Platforms

Specialized OFA models dominate MobileNetV2/V3 across mobile phones (Samsung S7 Edge, Google Pixel 2, LG G8) and outperform the baselines on NVIDIA 1080Ti GPU (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized) at matched latency — e.g., 76.4% vs. 75.2% top-1 on LG G8, and 73.7% vs. 71.5% on the FPGA.
Diverse Hardware Platforms: 50+ Pretrained Models Released

OFA for FPGA accelerators: measured on the Xilinx ZU3EG FPGA, specialization improves arithmetic intensity (OPS/byte) and throughput (GOPS/s) by 40-57% over the MnasNet and MobileNetV2 baselines, purely via neural network specialization.
We need Green AI: Solve the Environmental Problem of NAS

Rather than "just search it" per scenario, amortize the cost across many sub-networks and deployment scenarios (Once-for-All, ICLR’20; Lite Transformer, ICLR’20). OFA has broad applications.

OFA's Application: Hardware-Aware Transformer (HAT, ACL’20)

- 3.7x smaller model size with the same performance on WMT'14 En-De
- 3x, 1.6x, and 1.5x faster than the Transformer baseline on Raspberry Pi, CPU, and GPU
- 12,000x less CO2 than the Evolved Transformer

CO2 emission (lbs): human life, avg. 1 year: 11,023; American life, avg. 1 year: 36,156; US car w/ fuel, avg. 1 lifetime: 126,000; Evolved Transformer: 626,155; HAT (ours): 52 — a 12,041x reduction.

Efficient NLP on mobile devices enables real-time conversation between speakers of different languages (e.g., "Nice to meet you" rendered in Spanish, Korean, and German).
OFA's Application: Efficient Video Recognition (TSM, ICCV’19)

On Kinetics, OFA + TSM achieves the same top-1 accuracy as TSM + ResNet50 with 7x less computation (GFLOPs), and 3.0% higher accuracy than TSM + MobileNetV2 at the same computation.
OFA's Application: Efficient 3D Recognition (follow-up of PVCNN, NeurIPS’19 spotlight)

Today's 3D perception is power-hungry: self-driving cars carry a whole trunk of GPUs, and AR/VR rigs a whole backpack. On the accuracy-vs-latency tradeoff, OFA delivers a 4x FLOPs reduction and 2x speedup over MinkowskiNet, and 3.6% better accuracy under the same computation budget.
OFA's Application: GAN Compression (CVPR’20)

8-21x FLOPs reduction on CycleGAN, Pix2pix, and GauGAN; 1.7x-18.5x speedup on CPU/GPU and mobile CPU/GPU.
Summary: Once-for-All Network

Progressive shrinking — train the full model, shrink it in 4 dimensions, fine-tune both large and small sub-nets — decouples model training from architecture design, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs). Specialized models are one call away:

    net, image_size = ofa_specialized(net_id, pretrained=True)

Project Page: https://ofa.mit.edu
References

Model Compression & NAS · Efficient Vision · Efficient NLP · Hardware & EDA

Make AI Efficient: Tiny Computational Resources, Tiny Human Resources

Website: songhan.mit.edu · youtube.com/c/MITHANLab · github.com/mit-han-lab