Once for All: Train One Network and Specialize it for Efficient Deployment
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
Massachusetts Institute of Technology
Once-for-All, ICLR’20
Challenge: Efficient Inference on Diverse Hardware Platforms

Hardware platforms for deep learning are highly diverse, spanning six orders of magnitude in compute:
- Cloud AI: ~10^12 FLOPS
- Mobile AI: ~10^9 FLOPS
- Tiny AI (AIoT): ~10^6 FLOPS
Each step down the stack has far less resource, so efficient inference is needed everywhere, especially on resource-constrained edge devices.
Specializing a network for every platform is expensive. The design cost is calculated under the assumption of using MnasNet [1], roughly 40K GPU hours per deployment scenario, so it grows linearly with the number of scenarios (new devices, new hardware generations):

Deployment scenarios | Design cost (GPU hours)
1                    | 40K
4                    | 160K
40                   | 1,600K

[1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.
The corresponding carbon footprint (1 GPU hour translates to 0.284 lbs of CO2 emission, following Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019):

40K GPU hours    -> 11.4k lbs CO2
160K GPU hours   -> 45.4k lbs CO2
1,600K GPU hours -> 454.4k lbs CO2
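The numbers above are simple arithmetic; as a quick check, assuming the three cost points correspond to 1, 4, and 40 deployment scenarios at MnasNet's ~40K GPU hours each (the scenario counts are my reading of the slide build, not stated explicitly):

import math

GPU_HOURS_PER_SCENARIO = 40_000   # MnasNet design cost per deployment scenario
LBS_CO2_PER_GPU_HOUR = 0.284      # Strubell et al., ACL 2019

for n in (1, 4, 40):
    gpu_hours = n * GPU_HOURS_PER_SCENARIO
    co2 = gpu_hours * LBS_CO2_PER_GPU_HOUR
    print(f"{n:>2} scenarios: {gpu_hours // 1000}K GPU hours -> {co2 / 1000:.1f}k lbs CO2")
# 1 scenarios: 40K GPU hours -> 11.4k lbs CO2
# 4 scenarios: 160K GPU hours -> 45.4k lbs CO2
# 40 scenarios: 1600K GPU hours -> 454.4k lbs CO2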
Idea: Once-for-All Network

Instead of repeating the costly design process for each platform, train a single once-for-all network that can be specialized for all of them, from cloud AI (~10^12 FLOPS) to mobile AI (~10^9 FLOPS) to tiny AI (~10^6 FLOPS).
Once-for-All Network: Decouple Model Training and Architecture Design

Train the once-for-all network once; afterwards, a specialized sub-network for each deployment scenario is derived from it without retraining from scratch. The training cost is paid once and stays constant as the number of deployment scenarios grows.
Progressive Shrinking for Training OFA Networks

There are more than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, and width. (In the paper's space of 5 units with kernel sizes {3,5,7}, depths {2,3,4}, and width expansion ratios {3,4,6}, the count is about ((3x3)^2 + (3x3)^3 + (3x3)^4)^5 ≈ 2x10^19.) Training is therefore much harder than training a normal neural network, given so many sub-networks to support.

Progressive shrinking makes this tractable: train the full model, then shrink the model along the 4 dimensions, jointly fine-tuning both large and small sub-networks. Small sub-networks share weights with the large ones, so they are well trained through the joint fine-tuning process rather than from scratch, all within one once-for-all network.
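As a concrete sketch of this schedule (a hedged reconstruction, not the authors' code: the stage ordering follows the slides, the candidate values follow the paper's search space, and `set_active_subnet` stands in for whatever hook the supernet exposes for activating a sub-network):

import random
import torch.nn.functional as F

STAGES = [
    dict(ks=[7],       d=[4],       e=[6]),        # step 1: full network
    dict(ks=[3, 5, 7], d=[4],       e=[6]),        # step 2: + elastic kernel size
    dict(ks=[3, 5, 7], d=[2, 3, 4], e=[6]),        # step 3: + elastic depth
    dict(ks=[3, 5, 7], d=[2, 3, 4], e=[3, 4, 6]),  # step 4: + elastic width
]

def train_progressive_shrinking(supernet, loader, optimizer, epochs_per_stage):
    for stage in STAGES:
        for _ in range(epochs_per_stage):
            for images, labels in loader:
                # Elastic resolution: sample the input size anew for each batch.
                size = random.choice([128, 160, 192, 224])
                images = F.interpolate(images, size=size, mode='bilinear',
                                       align_corners=False)
                # Sample one sub-network per batch; it shares weights with
                # every larger sub-network, so fine-tuning is joint.
                supernet.set_active_subnet(**{k: random.choice(v)
                                              for k, v in stage.items()})
                loss = F.cross_entropy(supernet(images), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()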
Progressive Shrinking: Connection to Network Pruning

Network pruning: train the full model -> shrink the model (width only) -> fine-tune the small net. Result: a single pruned network.
Progressive shrinking: train the full model -> shrink the model (4 dimensions) -> fine-tune both large and small sub-nets. Result: a single once-for-all network with higher flexibility across the 4 dimensions.
Progressive Shrinking: Elastic Resolution

Randomly sample the input image size for each batch. At this stage, resolution becomes elastic (partial), while kernel size, depth, and width remain full.
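A minimal sketch of per-batch resolution sampling (the candidate size set is an assumption; OFA samples from a predefined list of resolutions):

import random
import torch
import torch.nn.functional as F

CANDIDATE_SIZES = [128, 160, 192, 224]   # assumed candidate resolutions

def random_resize(images):
    # Resize the whole batch to a randomly sampled resolution.
    size = random.choice(CANDIDATE_SIZES)
    return F.interpolate(images, size=size, mode='bilinear', align_corners=False)

batch = torch.randn(8, 3, 224, 224)      # dummy batch
print(random_resize(batch).shape)        # e.g. torch.Size([8, 3, 160, 160])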
Progressive Shrinking: Elastic Kernel Size

Start with the full kernel size (7x7). A smaller kernel takes the centered weights of the larger one, passed through a learned transformation matrix: a 25x25 matrix produces the 5x5 kernel from the center of the 7x7 kernel, and a 9x9 matrix produces the 3x3 kernel. At this stage, resolution and kernel size are elastic; depth and width remain full.
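A sketch of the kernel transformation (hedged: the shapes follow the slide, but the class and attribute names here are mine; initializing the matrices to identity makes the small kernels start as plain center crops):

import torch
import torch.nn as nn

class ElasticKernelConv(nn.Module):
    """Holds one 7x7 weight; derives 5x5 and 3x3 kernels from its center."""
    def __init__(self, out_ch, in_ch):
        super().__init__()
        self.weight_7x7 = nn.Parameter(torch.randn(out_ch, in_ch, 7, 7) * 0.01)
        self.transform_5x5 = nn.Parameter(torch.eye(25))  # learned 25x25 matrix
        self.transform_3x3 = nn.Parameter(torch.eye(9))   # learned 9x9 matrix

    def get_kernel(self, ks):
        w = self.weight_7x7
        if ks <= 5:
            c = w[:, :, 1:6, 1:6]                               # centered 5x5
            w = (c.flatten(2) @ self.transform_5x5).view_as(c)  # 25x25 transform
        if ks == 3:
            c = w[:, :, 1:4, 1:4]                               # centered 3x3
            w = (c.flatten(2) @ self.transform_3x3).view_as(c)  # 9x9 transform
        return w

conv = ElasticKernelConv(out_ch=8, in_ch=8)
print(conv.get_kernel(7).shape, conv.get_kernel(5).shape, conv.get_kernel(3).shape)
# torch.Size([8, 8, 7, 7]) torch.Size([8, 8, 5, 5]) torch.Size([8, 8, 3, 3])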
Progressive Shrinking: Elastic Depth

Within each unit, first train with the full depth; then gradually allow later layers in each unit to be skipped to reduce the depth (a unit producing outputs O1, O2, O3 at full depth may stop after O1 and O2, or after O1). At this stage, resolution, kernel size, and depth are elastic; width remains full.
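A minimal sketch of depth skipping under these assumptions (a unit keeps its maximum layer count and simply runs only a prefix of them):

import torch
import torch.nn as nn

class ElasticUnit(nn.Module):
    """A unit with up to `max_depth` layers; only the first `active_depth` run."""
    def __init__(self, channels, max_depth=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(max_depth))
        self.active_depth = max_depth

    def forward(self, x):
        # Later layers are simply skipped when the depth is shrunk.
        for layer in self.layers[:self.active_depth]:
            x = layer(x)
        return x

unit = ElasticUnit(channels=16)
unit.active_depth = 2                      # shrink: skip the last two layers
print(unit(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])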
Progressive Shrinking: Elastic Width

Gradually shrink the width. Channel importance (e.g., the L1 norm of each channel's weights) is computed, channels are sorted by importance, and the weights are reorganized accordingly, so that shrinking keeps the most important channels. At this stage, all 4 dimensions (resolution, kernel size, depth, and width) are elastic.
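A hedged sketch of the channel-sorting step (L1-norm importance as above; the helper name is mine, and the following layer's input channels must be permuted the same way to keep the network equivalent):

import torch
import torch.nn as nn

def sort_channels(conv, next_conv):
    # Channel importance = L1 norm of each output channel's weights.
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    order = torch.argsort(importance, descending=True)
    with torch.no_grad():
        conv.weight.copy_(conv.weight[order])               # reorder outputs
        if conv.bias is not None:
            conv.bias.copy_(conv.bias[order])
        next_conv.weight.copy_(next_conv.weight[:, order])  # keep next layer consistent

conv1, conv2 = nn.Conv2d(3, 8, 3), nn.Conv2d(8, 16, 3)
sort_channels(conv1, conv2)
# A width-shrunk sub-network now keeps the first k (most important) channels.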
Progressive Shrinking: Performance of Sub-networks on ImageNet

Sub-networks under various architecture configurations (D: depth, W: width, K: kernel size) are consistently more accurate when trained with progressive shrinking (w/ PS) than without it (w/o PS):

Configuration   | Top-1 gain from PS
D=2, W=3, K=3   | +2.5%
D=2, W=3, K=7   | +2.8%
D=2, W=6, K=3   | +3.5%
D=2, W=6, K=7   | +3.4%
D=4, W=3, K=3   | +3.3%
D=4, W=3, K=7   | +3.4%
D=4, W=6, K=3   | +3.7%
D=4, W=6, K=7   | +3.5%

[Figure: ImageNet top-1 accuracy (67-78%) of each configuration, w/o PS vs. w/ PS.]
OFA: 80% Top-1 Accuracy on ImageNet

[Figure: ImageNet top-1 accuracy (%) vs. MACs (billions), with model size indicated (2M-64M parameters), comparing Once-for-All (ours) against handcrafted models (MobileNetV1/V2, ShuffleNet, InceptionV2/V3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, Xception, DPN-92) and AutoML models (EfficientNet, ProxylessNAS, MobileNetV3, AmoebaNet, PNASNet, DARTS, IGCV3-D, NASNet-A). Higher accuracy is better; lower MACs is better.]

OFA reaches 80.0% top-1 accuracy at 595M MACs, with up to 14x less computation than models of comparable accuracy, a new state of the art under the mobile setting (< 600M MACs).
Comparison with EfficientNet and MobileNetV3

[Figure: ImageNet top-1 accuracy vs. measured Google Pixel 1 latency.]

- vs. EfficientNet: OFA is 2.6x faster at comparable accuracy, and reaches 3.8% higher accuracy (80.1% top-1) at comparable latency.
- vs. MobileNetV3: OFA is 1.5x faster at comparable accuracy, and up to 4% more accurate at comparable latency.
OFA for Fast Specialization on Diverse Hardware Platforms

[Figure: ImageNet top-1 accuracy vs. measured latency on six platforms: Samsung S7 Edge, Google Pixel 2, and LG G8 (vs. MobileNetV3 and MobileNetV2); NVIDIA 1080Ti GPU (batch size 64); Intel Xeon CPU (batch size 1); Xilinx ZU3EG FPGA (batch size 1, quantized).]

Specialized OFA sub-networks consistently dominate the accuracy-latency trade-off on every platform: mobile SoCs, server GPU, server CPU, and FPGA alike.
OFA Saves Orders of Magnitude Design Cost

Amortized over 40 deployment platforms, OFA reduces the design cost, and hence the carbon footprint, by 1,335x compared to MnasNet.
OFA for FPGA Accelerators

[Figure: measured results on the Xilinx ZU3EG FPGA, comparing MobileNetV2, MnasNet, and OFA (ours) on arithmetic intensity (OPS/byte) and throughput (GOPS/s).]

The specialized OFA model achieves 40% higher arithmetic intensity and 57% higher throughput than the baselines: improvement obtained via neural network specialization.
Summary

- Challenge: efficient inference on diverse hardware platforms, where per-platform design cost is prohibitive.
- The once-for-all network decouples model training from architecture design. Progressive shrinking (train the full model, shrink the model in 4 dimensions, fine-tune both large and small sub-nets) yields a single network whose sub-networks serve all platforms.
- OFA sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
- Pretrained specialized models can be loaded in one line:
  net, image_size = ofa_specialized(net_id, pretrained=True)
- Project page: https://ofa.mit.edu
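For reference, a slightly fuller usage sketch of that one-liner, assuming the `ofa` package from the project page is installed; the example `net_id` is my recollection of a released model ID (it should match the 595M MACs / 80.0% model above), so check the project page for the authoritative list:

import torch
from ofa.model_zoo import ofa_specialized   # from the once-for-all codebase

net_id = 'flops@595M_top1@80.0_finetune@75'  # example ID; see project page
net, image_size = ofa_specialized(net_id, pretrained=True)

# The returned image_size is the resolution the sub-network was specialized for.
net.eval()
with torch.no_grad():
    logits = net(torch.randn(1, 3, image_size, image_size))
print(logits.shape)   # torch.Size([1, 1000]), ImageNet logits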