
SLIDE 1

MorphNet

Faster Neural Nets with Hardware-Aware Architecture Learning

Elad Eban

SLIDE 2

Where Do Deep-Nets Come From?

VGG: Chatfield et al. 2014

Image from: http://www.paddlepaddle.org/

SLIDE 3

How Do We Improve Deep Nets?

Inception - Szegedy et al. 2015

Image from: http://www.paddlepaddle.org/

SLIDE 4

How Do We Improve? Speed? Accuracy?

ResNet - K. He, et al. 2016.

Image from: http://www.paddlepaddle.org/

SLIDE 5

Classical Process of Architecture Design

  • Not scalable
  • Not optimal
  • Not customized to YOUR data or task
  • Not designed to YOUR resource constraints
SLIDE 6

Rise of the Machines: Network Architecture Search

  • Neural Architecture Search with Reinforcement Learning: 22,400 GPU days!
  • Learning Transferable Architectures for Scalable Image Recognition (RNN controller): 2,000 GPU days
  • Efficient Neural Architecture Search via Parameter Sharing: ~2,000 training runs

Huge search space

Figures from: Learning Transferable Architectures for Scalable Image Recognition

SLIDE 7

MorphNet: Architecture Learning

Efficient & scalable architecture learning for everyone

  • Trains on your data
  • Starts with your architecture
  • Works with your code
  • Resource constraints guide customization
  • Requires a handful of training runs

Simple & effective tool: weighted sparsifying regularization. Idea: a continuous relaxation of a combinatorial problem.
SLIDE 8

Learning the Size of Each Layer

We focus on learning the sizes of layers; learning the topology is the domain of architecture search.

SLIDE 9

[Figure: an Inception-style module: parallel Conv 1x1, Conv 5x5, Conv 3x3, and MaxPool 3x3 branches with Conv 1x1 projections, joined by Concat]

SLIDE 10

[Figure: the same Inception-style module, shown again]

SLIDE 11

[Figure: the module after shrinking; the Conv 5x5 branch and its preceding Conv 1x1 have been removed]

SLIDE 12

Main Tool: Weighted sparsifying regularization.

SLIDE 13

Sparsity Background

Sparsity simply means having few non-zero entries. Directly penalizing the count of non-zeros is hard to work with in neural nets, so we use a continuous relaxation (the L1 norm), which still induces sparsity.

SLIDE 14

(Group) LASSO: Sparsity in Optimization

[Figure: weight matrix illustrating group sparsity]
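For reference, the standard penalties the slide alludes to, written out (textbook definitions, not taken from the deck):

    \text{LASSO:}\quad \min_W \; \mathrm{loss}(W) + \lambda \sum_i |w_i|
    \qquad
    \text{group LASSO:}\quad \min_W \; \mathrm{loss}(W) + \lambda \sum_g \|W_g\|_2

When each group W_g collects the weights of one output filter (one row of the weight matrix), driving a group norm to zero removes that filter entirely.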

SLIDE 15

MorphNet Algorithm

Stage 1: Structure learning. Export the learned structure. Stage 2: Finetune or retrain the weights of the learned structure.

Main tool: good-old, simple sparsity.

Optional stage 1.1: Uniform expansion of the learned structure.
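A minimal sketch of that flow, assuming placeholder helpers (train_with_regularizer, read_learned_widths, build_network, and retrain are hypothetical names, not the MorphNet API):

    def morphnet(base_network, data, reg_strength, expansion=None):
        # Stage 1: train with a weighted sparsifying regularizer to learn widths.
        shrunk = train_with_regularizer(base_network, data, reg_strength)
        widths = read_learned_widths(shrunk)  # export the learned structure

        # Optional stage 1.1: uniformly expand the learned structure.
        if expansion is not None:
            widths = [max(1, round(w * expansion)) for w in widths]

        # Stage 2: finetune or retrain the weights of the learned structure.
        return retrain(build_network(widths), data)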

SLIDE 16

Shrinking CIFARNet

[Figure: shrinking CIFARNet; 40%, 20%, 50%]
SLIDE 17

Can This Work in Conv-nets?

What do Inception, ResNet, DenseNet, NASNet, and AmoebaNet have in common? They all use batch normalization. Problem: with batch norm, the weight matrix is scale invariant, so an L1 penalty on the weights can be driven down by rescaling them without changing the network's output.

SLIDE 18

L1-Gamma regularization

Batch norm actually has a learned scale parameter, gamma. Problem: the weights themselves are still scale invariant. Solution: the scale parameter is the perfect substitute for the sparsity penalty, and zeroing gamma effectively removes the filter!
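Concretely, batch norm scales each channel c by a learned gamma_c (standard batch-norm notation, not from the slides), and the L1-gamma regularizer penalizes the gammas directly:

    \mathrm{BN}(z_c) = \gamma_c\,\frac{z_c - \mu_c}{\sigma_c} + \beta_c,
    \qquad
    R(\gamma) = \sum_c |\gamma_c|

Because the normalization divides by sigma_c, rescaling the weights that produce channel c leaves the output unchanged; only gamma_c controls the channel's magnitude, so gamma_c = 0 switches the filter off.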

SLIDE 19

Main Tool: Weighted sparsifying regularization.

SLIDE 20

What Do We Actually Care About?

We can now control the number of filters. But what we actually care about is model size, FLOPs, and inference time. Notice: FLOPs and model size are simple functions of the number of filters. Solution: a per-layer coefficient that captures the cost of each filter.
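In symbols, a sketch of the weighted sparsifying regularizer following the description above (gamma denotes the per-filter regularized values, c_l the per-layer cost coefficient):

    R(\gamma) \;=\; \sum_{\ell} c_\ell \sum_{i \in \text{layer } \ell} |\gamma_{\ell,i}|

where c_l reflects how much one filter of layer l costs in the targeted resource (FLOPs, model size, or latency).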

SLIDE 21

What is the Cost of a Filter?

[Figure: the cost of one filter in a 3x3 convolution, with its FLOP coefficient and its model-size coefficient]
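Under the assumption of a plain k x k convolution producing an H x W output map over C_in active input channels (standard cost accounting, not taken from the slides), the per-filter coefficients are roughly:

    c^{\text{FLOP}} \;\approx\; 2\,H\,W\,k^{2}\,C_{\text{in}},
    \qquad
    c^{\text{size}} \;\approx\; k^{2}\,C_{\text{in}}

FLOP cost additionally scales with the spatial size of the output map while model size does not, which is why the FLOP and model-size regularizers end up pruning different layers, as the ResNet-101 results later show.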

SLIDE 22

Inception V2 Based Networks on ImageNet

Baseline: Uniform shrinkage of all layers (width multiplier). FLOP Regularizer: Structure learned with a FLOP penalty. Expanded structure: Uniform expansion of the learned structure.

Figure from: MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks

SLIDE 23

JFT: Google Scale Image Classification

Image classification with 300M+ images, >20K classes. Started with a ResNet-101 architecture.

Figure adapted from: MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks

The first model with algorithmically learned architecture serving in production.

SLIDE 24

ResNet101-Based Learned Structures

FLOP regularizer: 40% fewer FLOPs. Model-size regularizer: 43% fewer weights. All models have the same performance.

Figure adapted from: MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks

SLIDE 25

A Custom Architecture Just For You!

Partnered with the Google OCR team, which maintains models for dozens of scripts that differ in:

  • Number of characters,
  • Character complexity,
  • Word-length,
  • Size of data.

A single fixed architecture was used for all scripts!

SLIDE 26

A Custom Architecture Just For You!

Models with 50% of FLOPs (with same accuracy)

[Figure: learned structures; some filters are useful for Cyrillic, others for Arabic]

SLIDE 27

Zooming in On Latency

# Activations? Brute force? # FLOPs? Latency is device specific!

SLIDE 28

Latency Roofline Model

Each op needs to read its inputs, perform calculations, and write its outputs. The evaluation time of an op depends on both compute and memory costs: Compute time = FLOPs / compute_rate. Memory time = tensor_size / memory_bandwidth. Latency = max(compute time, memory time). The compute rate and memory bandwidth are device specific.
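A minimal sketch of this roofline estimate (illustrative only; the op's FLOPs and bytes are hypothetical, and the peak figures are the ones quoted on the next slide):

    # Roofline latency estimate for a single op: latency is the max of
    # compute time (FLOPs / peak compute) and memory time (bytes / bandwidth).
    def roofline_latency_s(flops, bytes_moved, peak_flops_per_s, bytes_per_s):
        compute_time = flops / peak_flops_per_s
        memory_time = bytes_moved / bytes_per_s
        return max(compute_time, memory_time)

    # The same op on two devices (peak numbers from the next slide).
    P100 = dict(peak_flops_per_s=9.3e12, bytes_per_s=732e9)
    V100 = dict(peak_flops_per_s=125e12, bytes_per_s=900e9)

    op = dict(flops=2e9, bytes_moved=50e6)   # hypothetical op
    print(roofline_latency_s(**op, **P100))  # compute bound on the P100
    print(roofline_latency_s(**op, **V100))  # memory bound on the V100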

SLIDE 29

Example Latency Costs

Inception V2 Layer Name             P100 Latency   V100 Latency   Ratio
Conv2d_2c_3x3                       74584          5549           7%
Mixed_3c/Branch_2/Conv2d_0a_1x1     2762           1187           43%
Mixed_5c/Branch_3/Conv2d_0b_1x1     1381           833            60%

Platform   Peak Compute       Memory Bandwidth
P100       9300 GFLOPs/s      732 GB/s
V100       125000 GFLOPs/s    900 GB/s

Different platforms have different cost profiles, which leads to different relative costs for the same op.

SLIDE 30

Tesla V100 Latency

SLIDE 31

Tesla P100 Latency

SLIDE 32

When Do FLOPs and Latency Differ?

  • Create 5000 sub-Inception V2 models with a random number of filters.
  • Compare FLOPs, V100 latency, and P100 latency.

V100: the gap between FLOPs and latency is looser. P100: compute bound, so latency tracks FLOPs “too” closely.

SLIDE 33

What Next

If you want to

  • Algorithmically speed up or shrink your model,
  • Easily improve your model

You are invited to use our open source library https://github.com/google-research/morph-net

SLIDE 34

Quick User Guide

Exact same API works for different costs and settings: GroupLassoFlops, GammaFlops, GammaModelSize, GammaLatency
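A usage sketch along the lines of the open-source library's README (TF1-style; module paths, the class name GammaFlopsRegularizer, and argument names are recalled from the repo and may differ slightly, so treat this as an approximation rather than the authoritative API):

    import tensorflow.compat.v1 as tf
    from morph_net.network_regularizers import flop_regularizer

    tf.disable_v2_behavior()  # graph-mode TF1 style

    inputs = tf.placeholder(tf.float32, [None, 224, 224, 3])
    labels = tf.placeholder(tf.int64, [None])
    logits = build_model(inputs)  # your existing model-building code (placeholder)

    # Attach a FLOP-targeting regularizer; the other costs named on this slide
    # (model size, latency, group-LASSO variants) follow the same pattern.
    network_regularizer = flop_regularizer.GammaFlopsRegularizer(
        output_boundary=[logits.op],
        input_boundary=[inputs.op, labels.op],
        gamma_threshold=1e-3)

    regularization_strength = 1e-9  # the knob swept on the next slides
    regularizer_loss = (network_regularizer.get_regularization_term()
                        * regularization_strength)

    model_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(
        model_loss + regularizer_loss)

    # network_regularizer.get_cost() reports the current (relaxed) resource cost.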

SLIDE 35

Structure Learning: Regularization Strength

Pick a few regularization strengths and compare the resulting cost (here, P100 latency). 1.5e-5: ~55% speedup. 1e-6: no effect, too weak.

SLIDE 36

Structure Learning: Accuracy Tradeoff

Of course there is a tradeoff in test accuracy: 1.5e-5 gives a ~55% speedup, while 1e-6 has no effect (too weak).

SLIDE 37

Structure Learning: Threshold

Values of gamma (or group LASSO norms) usually don't reach exactly 0.0, so a threshold is needed. It is usually easy to determine: plot the regularized value (abs(gamma) or the L2 norm); the distribution is often bimodal, with dead filters clustered near zero and alive filters well above, and any threshold in the gap between them should work.

[Figure: L2 norm of CIFARNet filters after structure learning, showing dead vs. alive filters]
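A small sketch of how one might eyeball that threshold (generic numpy/matplotlib, not part of the MorphNet API; gamma_values is assumed to be a 1-D array of abs(gamma) or per-filter L2 norms collected after structure learning):

    import numpy as np
    import matplotlib.pyplot as plt

    # Histogram of the regularized values; typically bimodal (dead vs. alive).
    plt.hist(gamma_values, bins=100)
    plt.xlabel('abs(gamma) / filter L2 norm')
    plt.ylabel('filter count')
    plt.show()

    threshold = 1e-2  # hypothetical value, read off the gap between the modes
    alive_filters = int((np.asarray(gamma_values) > threshold).sum())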

SLIDE 38

Structure Learning: Exporting
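Exporting might look roughly like the library's README example (an approximation from memory of github.com/google-research/morph-net, so class and method names may differ; train_op, network_regularizer, max_steps, and train_dir carry over from the earlier sketch):

    import tensorflow.compat.v1 as tf
    from morph_net.tools import structure_exporter

    exporter = structure_exporter.StructureExporter(
        network_regularizer.op_regularizer_manager)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(max_steps):
            _, exporter_tensors = sess.run([train_op, exporter.tensors])
            if step % 1000 == 0:
                # Writes a file mapping each op to its number of alive filters.
                exporter.populate_tensor_values(exporter_tensors)
                exporter.create_file_and_save_alive_counts(train_dir, step)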

SLIDE 39

Retraining/Fine Tuning

Problem

  • Extra regularization hurts performance.
  • Some filters are not completely dead.

Options

  • Zero dead filters and finetune.
  • Train learned structure from scratch.

Why

  • Ensures learned structure is stand-alone and not tied to learning procedure.
  • Stabilizes downstream pipeline.
SLIDE 40

Under the Hood: Shape Compatibility Constraints

The NetworkRegularizer figures out structural dependencies in the graph.

[Figure: conv1 and conv2 joined by an Add through a skip connection, so their output widths must match]

SLIDE 41

Under the Hood: Concatenation (as in Inception)

[Figure: conv1 and conv2 feed a Concat whose output is added (Add) to conv3]

Things can get complicated, but it is all handled by the MorphNet framework.

SLIDE 42

Team Effort

Contributors & collaborators: Ariel Gordon, Bo Chen, Ofir Nachum, Hao Wu, Tien-Ju Yang, Edward Choi, Hernan Moraldo, Jesse Dodge, Yonatan Geifman, Shraman Ray Chaudhuri.

Elad Eban, Max Moroz, Yair Movshovitz-Attias, Andrew Poon

SLIDE 43

Thank You

Elad Eban

Contact: morphnet@google.com

https://github.com/google-research/morph-net