Once for All: Train One Network and Specialize it for Efficient Deployment
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
Massachusetts Institute of Technology (Once-for-All, ICLR'20)


SLIDE 1

Once for All: Train One Network and Specialize it for Efficient Deployment

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
Massachusetts Institute of Technology

Once-for-All, ICLR'20

SLIDE 2

Challenge: Efficient Inference on Diverse Hardware Platforms

  • Cloud AI: Memory 32 GB, Computation ~10^12 FLOPS
  • Mobile AI: Memory 4 GB, Computation ~10^9 FLOPS
  • Tiny AI (AIoT): Memory 100 KB, Computation < 10^6 FLOPS

  • Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

SLIDE 3

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Design Cost 40K GPU hours

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

SLIDE 4

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms (growing over 2013, 2015, 2017, 2019): Design Cost 40K, 160K GPU hours

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

SLIDE 5

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)
Design Cost: 40K, 160K, 1600K GPU hours

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

SLIDE 6

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS)
Design Cost: 40K, 160K, 1600K GPU hours
CO2 Emission: 11.4k lbs, 45.4k lbs, 454.4k lbs

1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.
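The emission figures on this slide follow directly from that conversion factor. A quick arithmetic check (the platform counts are inferred from the 40K-GPU-hour per-platform MnasNet cost, which is an assumption):

```python
# Carbon-footprint arithmetic for the design costs on this slide.
# 1 GPU hour ~ 0.284 lbs CO2 (Strubell et al., ACL 2019).
LBS_CO2_PER_GPU_HOUR = 0.284

# Design cost in GPU hours; the per-platform mapping is an assumption
# based on the 40K-GPU-hour MnasNet cost per platform.
design_cost_gpu_hours = {
    "1 platform":   40_000,     # -> 11.4k lbs CO2
    "4 platforms":  160_000,    # -> 45.4k lbs CO2
    "40 platforms": 1_600_000,  # -> 454.4k lbs CO2
}

for name, hours in design_cost_gpu_hours.items():
    print(f"{name}: {hours * LBS_CO2_PER_GPU_HOUR / 1000:.1f}k lbs CO2")
```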

SLIDE 7

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), ... ?
Design Cost: 40K, 160K, 1600K GPU hours
CO2 Emission: 11.4k lbs, 45.4k lbs, 454.4k lbs

1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.

SLIDE 8

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse Hardware Platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), ...
Once-for-All Network
Design Cost: 40K, 160K, 1600K GPU hours
CO2 Emission: 11.4k lbs, 45.4k lbs, 454.4k lbs

1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.

SLIDE 9

Once-for-All Network: Decouple Model Training and Architecture Design

  • Once-for-all network
SLIDE 13

Progressive Shrinking for Training OFA Networks

  • More than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width.
  • Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
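The 10^19 count can be sanity-checked by enumerating the architecture space. The sketch below assumes the search space from the OFA paper (5 units, depth per unit in {2, 3, 4}, per-layer kernel size in {3, 5, 7} and width expansion ratio in {3, 4, 6}); these specific choices are not stated on the slide:

```python
# Count the sub-networks of one OFA network (resolution not counted).
# Assumed search space (OFA paper, not this slide): 5 units, each with
# depth in {2, 3, 4}; every layer independently chooses one of 3 kernel
# sizes and one of 3 width expansion ratios.
UNITS = 5
PER_LAYER = 3 * 3                                    # kernel x width choices
per_unit = sum(PER_LAYER ** d for d in (2, 3, 4))    # 81 + 729 + 6561 = 7371
total = per_unit ** UNITS

print(f"{total:.2e} sub-networks")  # about 2.2e19, i.e. "more than 10^19"
```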

SLIDE 14

Progressive Shrinking

  • More than 10^19 different sub-networks in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width.
  • Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
  • Small sub-networks are nested in large sub-networks.
  • Cast the training process of the once-for-all network as a progressive shrinking and joint fine-tuning process: train the full model, shrink the model (4 dimensions), then jointly fine-tune both large and small sub-networks.
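The progressive shrinking process can be sketched as sampling sub-network configurations from a progressively larger space, one stage per shrunk dimension. The concrete choice lists {3, 5, 7}, {2, 3, 4}, {3, 4, 6} follow the OFA paper and are an assumption here:

```python
import random

# Progressive shrinking schedule: each stage enlarges the space of
# sub-networks sampled during training (kernel size, then depth, then
# width; resolution is sampled throughout). Choice lists follow the
# OFA paper and are an assumption here.
STAGES = [
    {"ks": [7],       "depth": [4],       "width": [6]},        # full network
    {"ks": [3, 5, 7], "depth": [4],       "width": [6]},        # elastic kernel
    {"ks": [3, 5, 7], "depth": [2, 3, 4], "width": [6]},        # + elastic depth
    {"ks": [3, 5, 7], "depth": [2, 3, 4], "width": [3, 4, 6]},  # + elastic width
]

def sample_subnet(stage: dict, units: int, rng: random.Random) -> dict:
    """Sample one sub-network configuration from the current stage's space."""
    depths = [rng.choice(stage["depth"]) for _ in range(units)]
    layers = [[(rng.choice(stage["ks"]), rng.choice(stage["width"]))
               for _ in range(d)] for d in depths]
    return {"depths": depths, "layers": layers}

rng = random.Random(0)
cfg = sample_subnet(STAGES[-1], units=5, rng=rng)  # final stage: full space
assert len(cfg["layers"]) == 5
```

In actual training, one (or a few) such configurations are sampled per step and the shared weights of the once-for-all network are updated through them.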

SLIDE 15

Connection to Network Pruning

Network Pruning: train the full model, shrink the model (only width), fine-tune the single pruned network.
Progressive Shrinking (once-for-all network): train the full model, shrink the model (4 dimensions), fine-tune both large and small sub-nets.

  • Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.

SLIDE 16

Progressive Shrinking: Elastic Resolution

  • Randomly sample the input image size for each batch.
  • Stage status: Elastic Resolution partial; Elastic Kernel Size, Elastic Depth, and Elastic Width still full.
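A minimal sketch of per-batch resolution sampling. The candidate list (128 to 224 in steps of 4) is the paper's setting and an assumption here; the actual resize is done by the data loader, and resolution stays elastic throughout all stages:

```python
import random

# Candidate input resolutions (OFA paper setting, assumed here):
# 128, 132, ..., 224 in steps of 4.
RESOLUTIONS = list(range(128, 225, 4))

def sample_batch_resolution(rng: random.Random) -> int:
    """Pick one input image size for the whole batch; the data loader
    then resizes every image in the batch to this size."""
    return rng.choice(RESOLUTIONS)

rng = random.Random(0)
sizes = [sample_batch_resolution(rng) for _ in range(5)]
assert all(s in RESOLUTIONS for s in sizes)
```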

SLIDE 17

Progressive Shrinking: Elastic Kernel Size

  • Start with the full kernel size (7x7); a smaller kernel takes the centered weights of the larger one, adapted via a kernel transformation matrix (7x7 → 5x5 with a 25x25 matrix, 5x5 → 3x3 with a 9x9 matrix).
  • Stage status: Elastic Resolution and Elastic Kernel Size partial; Elastic Depth and Elastic Width still full.
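The center-crop-plus-transform step can be sketched in NumPy. In the real method the 25x25 and 9x9 transformation matrices are learned; identity placeholders stand in for them here, and the weight shapes are illustrative:

```python
import numpy as np

def center_crop_kernel(w: np.ndarray, k: int) -> np.ndarray:
    """Take the centered k x k patch of a larger square kernel."""
    start = (w.shape[-1] - k) // 2
    return w[..., start:start + k, start:start + k]

def elastic_kernel(w7: np.ndarray, k: int,
                   t5: np.ndarray, t3: np.ndarray) -> np.ndarray:
    """Derive a k x k kernel (k in {3, 5, 7}) from the full 7x7 weights.

    t5 (25x25) and t3 (9x9) stand in for the learned kernel
    transformation matrices; identity placeholders are used below."""
    lead = w7.shape[:-2]  # (out_channels, in_channels)
    if k == 7:
        return w7
    # 7x7 -> 5x5: crop center, flatten, multiply by 25x25 transform.
    w5 = (center_crop_kernel(w7, 5).reshape(*lead, 25) @ t5).reshape(*lead, 5, 5)
    if k == 5:
        return w5
    # 5x5 -> 3x3: crop center, flatten, multiply by 9x9 transform.
    return (center_crop_kernel(w5, 3).reshape(*lead, 9) @ t3).reshape(*lead, 3, 3)

w7 = np.random.default_rng(0).standard_normal((16, 16, 7, 7))
t5, t3 = np.eye(25), np.eye(9)  # identity stand-ins for learned matrices
assert elastic_kernel(w7, 5, t5, t3).shape == (16, 16, 5, 5)
assert elastic_kernel(w7, 3, t5, t3).shape == (16, 16, 3, 3)
```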

SLIDE 18

Progressive Shrinking: Elastic Depth

  • Train each unit with full depth first, then gradually allow later layers in each unit to be skipped to reduce the depth.
  • Stage status: Elastic Resolution, Elastic Kernel Size, and Elastic Depth partial; Elastic Width still full.
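Skipping later layers of a unit can be sketched with plain callables standing in for convolutional layers (names and shapes are illustrative):

```python
from typing import Callable, List

def run_unit(layers: List[Callable[[float], float]],
             depth: int, x: float) -> float:
    """Run only the first `depth` layers of a unit; later layers are skipped."""
    assert 1 <= depth <= len(layers)
    for layer in layers[:depth]:
        x = layer(x)
    return x

# Toy unit of 4 "layers" that each add 1, so the output counts how many ran.
unit = [lambda x: x + 1 for _ in range(4)]
assert run_unit(unit, 4, 0) == 4  # full depth
assert run_unit(unit, 2, 0) == 2  # shrunk depth: last two layers skipped
```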

SLIDE 19

Progressive Shrinking: Elastic Width

  • Gradually shrink the width; keep the most important channels when shrinking via channel sorting (compute each channel's importance, sort, reorganize the weights, then progressively drop the least important channels).
  • Stage status: all four dimensions (Elastic Resolution, Elastic Kernel Size, Elastic Depth, Elastic Width) partial.
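Channel sorting can be sketched as follows. The L1-norm importance criterion matches the paper; the shapes and names are illustrative:

```python
import numpy as np

def sort_channels(w_out: np.ndarray, w_next: np.ndarray):
    """Reorder output channels by importance (L1 norm of their weights),
    most important first. The next layer's input channels get the same
    permutation so the network's function is unchanged."""
    importance = np.abs(w_out).reshape(w_out.shape[0], -1).sum(axis=1)
    order = np.argsort(-importance)  # descending importance
    return w_out[order], w_next[:, order]

def shrink_width(w_out: np.ndarray, w_next: np.ndarray, keep: int):
    """Keep only the `keep` most important channels."""
    w_out, w_next = sort_channels(w_out, w_next)
    return w_out[:keep], w_next[:, :keep]

rng = np.random.default_rng(0)
w_out = rng.standard_normal((8, 4, 3, 3))    # layer with 8 output channels
w_next = rng.standard_normal((16, 8, 3, 3))  # next layer consumes those 8
w_small, w_next_small = shrink_width(w_out, w_next, keep=4)
assert w_small.shape == (4, 4, 3, 3)
assert w_next_small.shape == (16, 4, 3, 3)
```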

SLIDE 20

Performances of Sub-networks on ImageNet

ImageNet top-1 accuracy gain of training with progressive shrinking (w/ PS) over without (w/o PS), for sub-networks under various architecture configurations (D: depth, W: width, K: kernel size):

  D=2 W=3 K=3: +2.5%    D=4 W=3 K=3: +3.3%
  D=2 W=3 K=7: +2.8%    D=4 W=3 K=7: +3.4%
  D=2 W=6 K=3: +3.5%    D=4 W=6 K=3: +3.7%
  D=2 W=6 K=7: +3.4%    D=4 W=6 K=7: +3.5%

  • Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.
SLIDE 21

OFA: 80% Top-1 Accuracy on ImageNet

[Figure: ImageNet top-1 accuracy (%) vs. MACs (billions); marker size indicates model size (2M to 64M parameters), split into Handcrafted and AutoML models. Compared models: Once-for-All (ours), EfficientNet, ProxylessNAS, MobileNetV3, AmoebaNet, MobileNetV2, PNASNet, ShuffleNet, DARTS, IGCV3-D, MobileNetV1, NASNet-A, InceptionV2, InceptionV3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, DPN-92, Xception. OFA reaches 80.0% top-1 accuracy at 595M MACs, with up to 14x less computation than models of comparable accuracy.]

  • Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).

SLIDE 22

Comparison with EfficientNet and MobileNetV3

[Figure: ImageNet top-1 accuracy (%) vs. Google Pixel 1 latency (ms). Left: OFA vs. EfficientNet, with OFA 2.6x faster and up to 3.8% higher accuracy (OFA reaches 80.1% top-1). Right: OFA vs. MobileNetV3, with OFA 1.5x faster and up to 4% higher accuracy.]

  • Once-for-all is 2.6x faster than EfficientNet and 1.5x faster than MobileNetV3 on Google Pixel 1 without loss of accuracy.
SLIDE 23

OFA for Fast Specialization on Diverse Hardware Platforms

[Figure: ImageNet top-1 accuracy (%) vs. measured latency (ms) on six platforms: Samsung S7 Edge, Google Pixel 2, and LG G8 (OFA vs. MobileNetV3 and MobileNetV2); NVIDIA 1080Ti GPU (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized). OFA matches or exceeds the baselines' accuracy at lower latency on every platform.]

SLIDE 24

OFA Saves Orders of Magnitude Design Cost

  • Green AI is important. The computation cost of OFA stays constant with the number of hardware platforms, reducing the carbon footprint by 1,335x compared to MnasNet under 40 platforms.

SLIDE 25

OFA for FPGA Accelerators

[Figure: measured results on the Xilinx ZU3EG FPGA, throughput (GOPS/s) vs. arithmetic intensity (OPS/Byte), comparing MobileNetV2, MnasNet, and OFA (ours); OFA is 40% and 57% higher than the baselines.]

  • Non-specialized neural networks do not fully utilize the hardware resources. There is large room for improvement via neural network specialization.

SLIDE 26

Summary

  • We introduce the once-for-all network for efficient inference on diverse hardware platforms.
  • We present an effective progressive shrinking approach (train the full model, shrink the model in 4 dimensions, fine-tune both large and small sub-nets) for training once-for-all networks.
  • Once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
  • First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV'19.
  • First place in the 4th Low-Power Computer Vision Challenge @NeurIPS'19.
  • Released the training code & a pre-trained OFA network that provides diverse sub-networks without training:
    ofa_network = ofa_net(net_id, pretrained=True)
  • Released 50 different pre-trained OFA models specialized for diverse hardware platforms (CPU/GPU/FPGA/DSP):
    net, image_size = ofa_specialized(net_id, pretrained=True)

Project Page: https://ofa.mit.edu