SLIDE 1

AutoML for TinyML with Once-for-All Network

Song Han, Massachusetts Institute of Technology

Once-for-All, ICLR’20

SLIDE 2

AutoML for TinyML with Once-for-All Network

  • Fewer engineering resources: AutoML. Replace many engineers designing a large model by hand with automated design that needs fewer engineers.
  • Fewer computational resources: TinyML. Replace a large model and a lot of computation with a small model and less computation.

SLIDE 3

Challenge: Efficient Inference on Diverse Hardware Platforms

  • Cloud AI: Memory 32 GB; Computation: TFLOPS
  • Mobile AI: Memory 4 GB; Computation: GFLOPS
  • Tiny AI (AIoT): Memory <100 KB; Computation: <MFLOPS

  • Different hardware platforms have different resource constraints. We need to customize our models for each platform to achieve the best accuracy-efficiency trade-off, especially on resource-constrained edge devices.

SLIDE 4

Challenge: Efficient Inference on Diverse Hardware Platforms

Design cost: ~200 GPU hours to train one hand-designed model.

    for training iterations: forward-backward();

The design cost is calculated under the assumption of using MobileNet-v2.

SLIDE 5

Challenge: Efficient Inference on Diverse Hardware Platforms

Design cost: ~40K GPU hours once a neural architecture search is added.

    (1) for search episodes:
            for training iterations: forward-backward();
            if good_model: break;
        for post-search training iterations: forward-backward();

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

SLIDE 6

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse hardware platforms (device generations from 2013, 2015, 2017, 2019): repeating the search for each device raises the design cost from 40K to 160K GPU hours.

    (2) for devices:
    (1)     for search episodes:
                for training iterations: forward-backward();
                if good_model: break;
            for post-search training iterations: forward-backward();

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

SLIDE 7

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse hardware platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS). Searching for many devices raises the design cost from 40K and 160K to 1600K GPU hours.

    (2) for many devices:
    (1)     for search episodes:
                for training iterations: forward-backward();
                if good_model: break;
            for post-search training iterations: forward-backward();

The design cost is calculated under the assumption of using MnasNet. [1] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." CVPR 2019.

SLIDE 8

Challenge: Efficient Inference on Diverse Hardware Platforms

Diverse hardware platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS).

Design cost and carbon footprint:

  • 40K GPU hours -> 11.4k lbs CO2 emission
  • 160K GPU hours -> 45.4k lbs CO2 emission
  • 1600K GPU hours -> 454.4k lbs CO2 emission

1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019.

    (2) for many devices:
    (1)     for search episodes:
                for training iterations: forward-backward();
                if good_model: break;
            for post-search training iterations: forward-backward();
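
To make the emission numbers above concrete, here is a minimal arithmetic sketch in Python applying the 0.284 lbs CO2 per GPU hour factor; the mapping of GPU-hour figures to scenarios follows these slides.

LBS_CO2_PER_GPU_HOUR = 0.284   # Strubell et al., ACL 2019

def co2_lbs(gpu_hours: float) -> float:
    # design cost (GPU hours) -> estimated CO2 emission (lbs)
    return gpu_hours * LBS_CO2_PER_GPU_HOUR

for scenario, hours in [("search for one device", 40_000),
                        ("search for a few devices", 160_000),
                        ("search for many devices", 1_600_000)]:
    print(f"{scenario}: {hours:,} GPU hours -> {co2_lbs(hours):,.0f} lbs CO2")
# search for one device: 40,000 GPU hours -> 11,360 lbs CO2
# search for a few devices: 160,000 GPU hours -> 45,440 lbs CO2
# search for many devices: 1,600,000 GPU hours -> 454,400 lbs CO2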

SLIDE 9

We need Green AI: Solve the Environmental Problem of NAS

  • Problem: TinyML (inference) comes at the cost of BigML (training/search).
  • The Evolved Transformer (ICML’19) illustrates the cost; our Hardware-Aware Transformer (ACL’20) brings the search emission down to 52 lbs CO2, roughly 4 orders of magnitude less.

SLIDE 10

OFA: Decouple Training and Search

Conventional NAS:

    (2) for devices:
    (1)     for search episodes:
                for training iterations: forward-backward();
                if good_model: break;
            for post-search training iterations: forward-backward();

=> Once-for-All (decouple training and search):

    # training
    for OFA training iterations: forward-backward();
    # search
    for devices:
        for search episodes:
            sample from OFA;
            if good_model: break;
    direct deploy without training;

SLIDE 11

Challenge: Efficient Inference on Diverse Hardware Platforms

A single Once-for-All Network serves diverse hardware platforms: Cloud AI (10^12 FLOPS), Mobile AI (10^9 FLOPS), Tiny AI (10^6 FLOPS), instead of paying the 40K / 160K / 1600K GPU-hour design cost (11.4k / 45.4k / 454.4k lbs CO2 emission; 1 GPU hour translates to 0.284 lbs CO2 emission according to Strubell, Emma, et al. "Energy and policy considerations for deep learning in NLP." ACL 2019).

    # training
    for OFA training iterations: forward-backward();
    # search (decoupled)
    for devices:
        for search episodes:
            sample from OFA;
            if good_model: break;
    direct deploy without training;

SLIDE 12

Once-for-All Network: Decouple Model Training and Architecture Design

  • Train a single once-for-all network once; specialized sub-networks are then selected from it for each deployment scenario, with no additional training.
SLIDE 16

Challenge: how to prevent different sub-networks from interfering with each other?

SLIDE 17

Solution: Progressive Shrinking

  • More than 10^19 different sub-networks are contained in a single once-for-all network, covering 4 different dimensions: resolution, kernel size, depth, width.
  • Directly optimizing the once-for-all network from scratch is much more challenging than training a normal neural network, given so many sub-networks to support.
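
As a rough sanity check on the 10^19 figure, a back-of-the-envelope count in Python. The specific space (5 units; depth 2/3/4 per unit; per-layer kernel sizes 3/5/7 and width expansion ratios 3/4/6) is an assumption chosen for illustration and may not match the paper's exact configuration.

N_UNITS = 5
DEPTHS = [2, 3, 4]                 # layers per unit
CHOICES_PER_LAYER = 3 * 3          # 3 kernel sizes x 3 width expansion ratios

# sub-networks per unit: sum over depth choices of independent per-layer choices
per_unit = sum(CHOICES_PER_LAYER ** d for d in DEPTHS)
total = per_unit ** N_UNITS
print(f"about {total:.1e} architectures, before counting input resolutions")
# about 2.2e+19, consistent with "more than 10^19"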

SLIDE 18

Solution: Progressive Shrinking

  • Small sub-networks are nested in large sub-networks.
  • Cast the training process of the once-for-all network as a progressive shrinking and joint fine-tuning process: train the full model, shrink the model along 4 dimensions, and jointly fine-tune both large and small sub-networks.

SLIDE 19

Connection to Network Pruning

  • Network Pruning: train the full model -> shrink the model (only width) -> fine-tune the small net -> a single pruned network.
  • Progressive Shrinking: train the full model -> shrink the model (4 dimensions) -> fine-tune both large and small sub-nets -> a once-for-all network.
  • Progressive shrinking can be viewed as generalized network pruning with much higher flexibility across 4 dimensions.

SLIDE 20

Progressive Shrinking

The once-for-all network is trained in stages across four elastic dimensions; each dimension starts by supporting only the full setting and gradually adds partial (smaller) settings:

  • Elastic Resolution
  • Elastic Kernel Size
  • Elastic Depth
  • Elastic Width

SLIDE 44

Progressive Shrinking: Elastic Resolution

  • Randomly sample the input image size for each training batch.
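
A minimal PyTorch-style sketch of elastic resolution, assuming bilinear resizing of each batch to a randomly chosen size; the resolution list follows slide 68, while the function and variable names are illustrative rather than the authors' code.

import random
import torch
import torch.nn.functional as F

RESOLUTIONS = list(range(128, 225, 4))   # e.g. 128, 132, ..., 224

def random_resolution_batch(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, H, W) float tensor; resize the whole batch to a random resolution
    r = random.choice(RESOLUTIONS)
    return F.interpolate(images, size=(r, r), mode="bilinear", align_corners=False)

# usage inside the training loop (sketch):
# for images, labels in loader:
#     images = random_resolution_batch(images)
#     loss = criterion(model(images), labels); loss.backward(); ...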

SLIDE 50

Progressive Shrinking: Elastic Kernel Size

  • Start with the full kernel size (7x7). A smaller kernel (5x5, then 3x3) takes the centered weights of the larger kernel, passed through a learned transformation matrix: a 25x25 matrix for the 5x5 kernel and a 9x9 matrix for the 3x3 kernel.
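
A minimal PyTorch sketch of the idea, assuming a depthwise convolution whose smaller kernels are produced from the centered weights of the 7x7 kernel through per-kernel-size learned linear transforms. Class and method names are illustrative, and the real implementation applies the transforms somewhat differently.

import torch
import torch.nn as nn

class ElasticKernelConv(nn.Module):
    def __init__(self, channels: int, max_kernel: int = 7):
        super().__init__()
        self.max_kernel = max_kernel
        # depthwise conv weights at the largest kernel size
        self.weight = nn.Parameter(torch.randn(channels, 1, max_kernel, max_kernel) * 0.01)
        # one learned transform per smaller kernel size: 5x5 -> 25x25, 3x3 -> 9x9
        self.transforms = nn.ParameterDict({
            "5": nn.Parameter(torch.eye(25)),
            "3": nn.Parameter(torch.eye(9)),
        })

    def get_kernel(self, k: int) -> torch.Tensor:
        w = self.weight
        if k == self.max_kernel:
            return w
        start = (self.max_kernel - k) // 2
        center = w[:, :, start:start + k, start:start + k]        # centered weights
        flat = center.reshape(center.shape[0], center.shape[1], -1)
        flat = flat @ self.transforms[str(k)]                      # learned transform
        return flat.reshape(center.shape[0], center.shape[1], k, k)

    def forward(self, x: torch.Tensor, k: int = 7) -> torch.Tensor:
        kernel = self.get_kernel(k)
        return nn.functional.conv2d(x, kernel, padding=k // 2, groups=x.shape[1])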

SLIDE 56

Progressive Shrinking: Elastic Depth

  • Gradually allow later layers in each unit to be skipped to reduce the depth: each unit is first trained with full depth, then shrunk so that shallower sub-networks keep only the first layers and skip the rest.
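
A minimal PyTorch sketch of elastic depth, assuming each unit is a stack of residual layers of which only the first `depth` are executed for a shallower sub-network; the layer contents and names are illustrative.

import torch
import torch.nn as nn

class ElasticUnit(nn.Module):
    def __init__(self, channels: int, max_depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _ in range(max_depth)
        ])

    def forward(self, x: torch.Tensor, depth: int = 4) -> torch.Tensor:
        # keep the first `depth` layers, skip the rest (they still exist for larger sub-nets)
        for layer in self.layers[:depth]:
            x = x + layer(x)        # residual form, so skipping layers is well-behaved
        return x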

SLIDE 62

Progressive Shrinking: Elastic Width

  • Train with the full width, then progressively shrink the width.
  • Keep the most important channels when shrinking via channel sorting: compute each channel's importance, reorganize the channels in decreasing importance, and let narrower sub-networks use the top-ranked channels.
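
A minimal PyTorch sketch of channel sorting, assuming channel importance is measured by the L1 norm of each output channel's weights; the bookkeeping here (for example, how the following layer's input channels are handled) is simplified compared with a real implementation.

import torch
import torch.nn as nn

def sort_channels_by_importance(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> None:
    # importance of each output channel = L1 norm of its weights
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    order = torch.argsort(importance, descending=True)
    with torch.no_grad():
        conv.weight.copy_(conv.weight[order])
        if conv.bias is not None:
            conv.bias.copy_(conv.bias[order])
        # keep BatchNorm parameters and statistics aligned with the reordered channels
        for name in ("weight", "bias", "running_mean", "running_var"):
            tensor = getattr(bn, name)
            tensor.data.copy_(tensor.data[order])

def narrow_forward(x: torch.Tensor, conv: nn.Conv2d, bn: nn.BatchNorm2d, width: int) -> torch.Tensor:
    # a narrower sub-network uses only the first (most important) `width` channels
    weight = conv.weight[:width]
    bias = conv.bias[:width] if conv.bias is not None else None
    y = nn.functional.conv2d(x, weight, bias, stride=conv.stride, padding=conv.padding)
    return nn.functional.batch_norm(
        y, bn.running_mean[:width], bn.running_var[:width],
        weight=bn.weight[:width], bias=bn.bias[:width], training=False, eps=bn.eps)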

SLIDE 68

Progressive Shrinking: put it together

  • Train the full network: K = 7, D = 4, W = 6. Elastic resolution is active throughout, with R ∈ [128, 132, …, 224] sampled per batch.
  • Elastic Kernel Size (D = 4, W = 6, K ∈ [7, 5, 3]): sample K at each layer, generate the kernel weights via the transformation matrices (Fig. 3), fine-tune weights & transformation matrices.
  • Elastic Depth (W = 6, K ∈ [7, 5, 3]): first D ∈ [4, 3], then D ∈ [4, 3, 2]. Sample D at each unit (and K at each layer), keep the first D layers of each unit and skip the top (4 - D) layers (Fig. 3), fine-tune weights.
  • Elastic Width (D ∈ [4, 3, 2], K ∈ [7, 5, 3]): first W ∈ [6, 4], then W ∈ [6, 4, 3]. Perform channel sorting (Fig. 4), sample W at each layer (and K, D), fine-tune weights.
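
The same schedule written out as data, as a minimal sketch; the stage ordering follows the slide, while epoch counts and other hyper-parameters are intentionally left out because they are not shown here.

PROGRESSIVE_SHRINKING_SCHEDULE = [
    # stage name                    kernel sizes    depths          width expansions
    {"stage": "full network",        "K": [7],       "D": [4],       "W": [6]},
    {"stage": "elastic kernel size", "K": [7, 5, 3], "D": [4],       "W": [6]},
    {"stage": "elastic depth (1)",   "K": [7, 5, 3], "D": [4, 3],    "W": [6]},
    {"stage": "elastic depth (2)",   "K": [7, 5, 3], "D": [4, 3, 2], "W": [6]},
    {"stage": "elastic width (1)",   "K": [7, 5, 3], "D": [4, 3, 2], "W": [6, 4]},
    {"stage": "elastic width (2)",   "K": [7, 5, 3], "D": [4, 3, 2], "W": [6, 4, 3]},
]
RESOLUTIONS = list(range(128, 225, 4))   # elastic resolution is active in every stage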

SLIDE 69

Performance of Sub-networks on ImageNet

[Figure: ImageNet top-1 accuracy (67-78%) of sub-networks under various architecture configurations (D: depth, W: width, K: kernel size), from D=2, W=3, K=3 up to D=4, W=6, K=7, trained without vs. with progressive shrinking (PS). PS improves top-1 accuracy by 2.5%-3.7% depending on the configuration.]

  • Progressive shrinking consistently improves the accuracy of sub-networks on ImageNet.
SLIDE 70

How about search?

    # training
    for OFA training iterations: forward-backward();
    # search (decoupled; with evolution)
    for devices:
        for search episodes:
            sample from OFA;
            if good_model: break;
    direct deploy without training;
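
A minimal sketch of the evolutionary search loop (mutation-only, for brevity). `predict_accuracy` and `measure_latency` are placeholders for the accuracy predictor and per-device latency model used in practice; the population sizes and search-space constants are illustrative assumptions.

import random

KS, DS, WS = [3, 5, 7], [2, 3, 4], [3, 4, 6]
N_UNITS, MAX_DEPTH = 5, 4

def random_config():
    return {"d": [random.choice(DS) for _ in range(N_UNITS)],
            "k": [random.choice(KS) for _ in range(N_UNITS * MAX_DEPTH)],
            "w": [random.choice(WS) for _ in range(N_UNITS * MAX_DEPTH)]}

def mutate(cfg, prob=0.1):
    new = {key: list(vals) for key, vals in cfg.items()}
    choices = {"d": DS, "k": KS, "w": WS}
    for key, vals in new.items():
        for i in range(len(vals)):
            if random.random() < prob:
                vals[i] = random.choice(choices[key])
    return new

def evolutionary_search(predict_accuracy, measure_latency, latency_budget,
                        population=100, generations=30):
    pop = [random_config() for _ in range(population)]
    pop = [c for c in pop if measure_latency(c) <= latency_budget] or pop
    for _ in range(generations):
        pop.sort(key=predict_accuracy, reverse=True)      # keep the fittest configs
        parents = pop[:population // 4]
        children = [mutate(random.choice(parents)) for _ in range(population - len(parents))]
        children = [c for c in children if measure_latency(c) <= latency_budget]
        pop = parents + children
    return max(pop, key=predict_accuracy)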

SLIDE 71

2.6x faster than EfficientNet, 1.5x faster than MobileNetV3

[Figure: ImageNet top-1 accuracy vs. Google Pixel 1 latency. Left, OFA vs. EfficientNet: OFA reaches 80.1% top-1 and is 2.6x faster at the same accuracy, or 3.8% more accurate at the same latency. Right, OFA vs. MobileNetV3: OFA is 1.5x faster at the same accuracy, or 4% more accurate at the same latency.]

SLIDE 72

More accurate than training from scratch

[Figure: the same accuracy-latency plots as the previous slide, with added curves for the OFA architectures trained from scratch.]

  • Training from scratch cannot achieve the same level of accuracy as the sub-networks directly inherited from the once-for-all network.

SLIDE 73

OFA: 80% Top-1 Accuracy on ImageNet

[Figure: ImageNet top-1 accuracy (69-81%) vs. MACs (1-9 billion), with marker size indicating model size (2M-64M parameters), for handcrafted and AutoML-designed models: Once-for-All (ours), EfficientNet, ProxylessNAS, MobileNetV3, AmoebaNet, MobileNetV2, PNASNet, ShuffleNet, DARTS, IGCV3-D, MobileNetV1, NASNet-A, InceptionV2, InceptionV3, DenseNet-121/169/264, ResNet-50/101, ResNeXt-50/101, DPN-92, Xception. OFA reaches 80.0% top-1 at 595M MACs, up to 14x less computation than models of comparable accuracy.]

  • Once-for-all sets a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile vision setting (< 600M MACs).


SLIDE 75

OFA Enables Fast Specialization on Diverse Hardware Platforms

[Figure: ImageNet top-1 accuracy vs. measured latency on six platforms: Samsung S7 Edge, Google Pixel 2, and LG G8 (vs. MobileNetV2 and MobileNetV3), NVIDIA 1080Ti GPU (batch size 64), Intel Xeon CPU (batch size 1), and Xilinx ZU3EG FPGA (batch size 1, quantized). OFA-specialized sub-networks give a better accuracy-latency trade-off than the baselines on every platform.]

SLIDE 76

Diverse Hardware Platforms: 50+ Pretrained Models are Released

SLIDE 77

OFA for FPGA Accelerators

[Figure: measured results on a Xilinx ZU3EG FPGA: arithmetic intensity (OPs/Byte, 0-50) and throughput (GOPS, 0-80) for MobileNetV2, MnasNet, and OFA (ours). OFA improves on the baselines by 40% and 57%.]

  • Non-specialized neural networks do not fully utilize the hardware resources. There is large room for improvement via neural network specialization.

SLIDE 78

We need Green AI: Solve the Environmental Problem of NAS

SLIDE 79

How to save CO2 emission

  • 1. Once-for-All: amortize the search cost across many sub-networks and deployment scenarios. (Once-for-All, ICLR’20)
  • 2. Lite Transformer: human-in-the-loop design; apply human insights into hardware & ML rather than “just search it”. (Lite Transformer, ICLR’20)

SLIDE 80

OFA has broad applications

  • Efficient Transformer
  • Efficient Video Recognition
  • Efficient 3D Vision
  • Efficient GAN Compression

SLIDE 81

OFA’s Application: Hardware-Aware Transformer (HAT, ACL’20)

  • 3.7x smaller model size with the same performance on WMT’14 En-De; 3x, 1.6x, and 1.5x faster than the Transformer baseline on Raspberry Pi, CPU, and GPU; about 12,000x less CO2 than the Evolved Transformer.

[Figure: CO2 emission (lbs) for reference points and models: human life (avg. 1 year) 11,023; American life (avg. 1 year) 36,156; US car w/ fuel (avg. 1 lifetime) 126,000; Evolved Transformer search 626,155; HAT (ours) 52, a 12,041x reduction.]

  • Example: translating “Nice to meet you” into “Encantada de conocerte” (Spanish), “만나서 반갑습니다” (Korean), and “Freut mich, dich kennenzulernen” (German). Efficient NLP on mobile devices enables real-time conversation between speakers of different languages.

SLIDE 82

OFA’s Application: Efficient Video Recognition (TSM, ICCV’19)

[Figure: Kinetics top-1 accuracy (69-75%) vs. computation (10-40 GFLOPs) for OFA + TSM (large and small), MobileNetV2 + TSM, ResNet-50 + TSM, and ResNet-50 + I3D.]

  • 7x less computation with the same performance as TSM + ResNet-50; 3% higher accuracy than TSM + MobileNetV2 at the same computation.

SLIDE 83

OFA’s Application: Efficient 3D Recognition (follow-up of PVCNN, NeurIPS’19 spotlight)

  • Today’s 3D recognition is costly: self-driving needs a whole trunk of GPUs; AR/VR needs a whole backpack of computers.
  • Accuracy vs. latency trade-off: 4x FLOPs reduction and 2x speedup over MinkowskiNet, with 3.6% better accuracy under the same computation budget.

SLIDE 84

OFA’s Application: GAN Compression (GAN Compression, CVPR’20)

  • 8-21x FLOPs reduction on CycleGAN, Pix2pix, and GauGAN; 1.7-18.5x speedup on CPU/GPU and mobile CPU/GPU.

SLIDE 85

Summary: Once-for-All Network

  • We introduce the once-for-all network for efficient inference on diverse hardware platforms.
  • We present an effective progressive shrinking approach for training once-for-all networks: train the full model, shrink the model in 4 dimensions, and fine-tune both large and small sub-nets.
  • The once-for-all network surpasses MobileNetV3 and EfficientNet by a large margin under all scenarios, setting a new state-of-the-art 80% ImageNet top-1 accuracy under the mobile setting (< 600M MACs).
  • First place in the 3rd Low-Power Computer Vision Challenge, DSP track @ICCV’19.
  • First place in the 4th Low-Power Computer Vision Challenge @NeurIPS’19, both classification & detection.
  • Released the training code & the pre-trained OFA network that provides diverse sub-networks without training (see the usage sketch below):
    ofa_network = ofa_net(net_id, pretrained=True)
  • Released 50+ different pre-trained OFA models for diverse hardware platforms (CPU/GPU/FPGA/DSP):
    net, image_size = ofa_specialized(net_id, pretrained=True)

Project Page: https://ofa.mit.edu
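
A short usage sketch of the released API. The net_id strings and the sub-network selection methods below follow my reading of the mit-han-lab/once-for-all repository and may differ from the current release; only ofa_net(net_id, pretrained=True) and ofa_specialized(net_id, pretrained=True) appear on the slide.

from ofa.model_zoo import ofa_net, ofa_specialized

# Load the once-for-all network (the net_id below is an assumed example).
ofa_network = ofa_net("ofa_mbv3_d234_e346_k357_w1.0", pretrained=True)

# Select a sub-network (kernel size, expansion ratio, depth) and extract it with
# inherited weights; it can be deployed directly, without retraining.
ofa_network.set_active_subnet(ks=5, e=4, d=3)
subnet = ofa_network.get_active_subnet(preserve_weight=True)

# Or load one of the 50+ released specialized models (assumed example net_id).
net, image_size = ofa_specialized("pixel1_lat@20ms_top1@71.4_finetune@25", pretrained=True)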

SLIDE 86

References

Model Compression & NAS:

  • Once-For-All: Train One Network and Specialize It for Efficient Deployment, ICLR’20
  • ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware, ICLR’19
  • APQ: Joint Search for Network Architecture, Pruning and Quantization Policy, CVPR’20
  • HAQ: Hardware-Aware Automated Quantization with Mixed Precision, CVPR’19
  • Defensive Quantization: When Efficiency Meets Robustness, ICLR’19
  • AMC: AutoML for Model Compression and Acceleration on Mobile Devices, ECCV’18

Efficient Vision:

  • GAN Compression: Learning Efficient Architectures for Conditional GANs, CVPR’20
  • TSM: Temporal Shift Module for Efficient Video Understanding, ICCV’19
  • PVCNN: Point Voxel CNN for Efficient 3D Deep Learning, NeurIPS’19

Efficient NLP:

  • Lite Transformer with Long Short Term Attention, ICLR’20
  • HAT: Hardware-aware Transformer, ACL’20

Hardware & EDA:

  • SpArch: Efficient Architecture for Sparse Matrix Multiplication, HPCA’20
  • Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning, DAC’20
SLIDE 87

Make AI Efficient: Tiny Computational Resources, Tiny Human Resources

Website: songhan.mit.edu
YouTube: youtube.com/c/MITHANLab
GitHub: github.com/mit-han-lab