
SLIDE 1

MorphNet

Faster Neural Nets with Hardware-Aware Architecture Learning

Elad Eban

SLIDE 2

Where Do Deep-Nets Come From?

VGG: Chatfield et al. 2014

Image from: http://www.paddlepaddle.org/

SLIDE 3

How Do We Improve Deep Nets?

Inception - Szegedy et al. 2015

Image from: http://www.paddlepaddle.org/

SLIDE 4

How Do We Improve? Speed? Accuracy?

ResNet - K. He, et al. 2016.

Image from: http://www.paddlepaddle.org/

SLIDE 5

Classical Process of Architecture Design

  • Not scalable
  • Not optimal
  • Not customized to YOUR data or task
  • Not designed to YOUR resource constraints
SLIDE 6

Rise of the Machines: Network Architecture Search

  • Neural Architecture Search with Reinforcement Learning: 22,400 GPU days!
  • Learning Transferable Architectures for Scalable Image Recognition (RNN controller): 2,000 GPU days
  • Efficient Neural Architecture Search via Parameter Sharing: ~2,000 training runs

Huge search space

Figures from: Learning Transferable Architectures for Scalable Image Recognition

SLIDE 7

MorphNet: Architecture Learning

Efficient & scalable architecture learning for everyone

  • Trains on your data
  • Starts with your architecture
  • Works with your code
  • Resource constraints guide customization
  • Requires a handful of training runs

Simple & effective tool: weighted sparsifying regularization. Idea: a continuous relaxation of a combinatorial problem.
SLIDE 8

Learning the Size of Each Layer

We focus on learning the sizes of layers; learning the topology is the domain of architecture search.

SLIDE 9

[Figure: an Inception-style module: parallel Conv 1x1, Conv 5x5, Conv 3x3, and MaxPool 3x3 branches with Conv 1x1 projections, joined by Concat]

SLIDE 10

[Figure: the same Inception-style module, shown again]

SLIDE 11

[Figure: the module after shrinking; the Conv 5x5 branch and its preceding Conv 1x1 have been removed]

SLIDE 12

Main Tool: Weighted sparsifying regularization.

SLIDE 13

Sparsity Background

Sparsity simply means having few non-zero entries. Directly penalizing the count of non-zeros is hard to work with in neural nets, so we use a continuous relaxation (the L1 norm), which still induces sparsity.

SLIDE 14

(Group) LASSO: Sparsity in Optimization

[Figure: weight matrix illustrating group sparsity]
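For reference, the standard penalties the slide alludes to, written out (textbook definitions, not taken from the deck):

    \text{LASSO:}\quad \min_W \; \mathrm{loss}(W) + \lambda \sum_i |w_i|
    \qquad
    \text{group LASSO:}\quad \min_W \; \mathrm{loss}(W) + \lambda \sum_g \|W_g\|_2

When each group W_g collects the weights of one output filter (one row of the weight matrix), driving a group norm to zero removes that filter entirely.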

SLIDE 15

MorphNet Algorithm

Stage 1: Structure learning. Export the learned structure. Stage 2: Finetune or retrain the weights of the learned structure.

Main tool: good-old, simple sparsity.

Optional stage 1.1: Uniform expansion of the learned structure.
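A minimal sketch of that flow, assuming placeholder helpers (train_with_regularizer, read_learned_widths, build_network, and retrain are hypothetical names, not the MorphNet API):

    def morphnet(base_network, data, reg_strength, expansion=None):
        # Stage 1: train with a weighted sparsifying regularizer to learn widths.
        shrunk = train_with_regularizer(base_network, data, reg_strength)
        widths = read_learned_widths(shrunk)  # export the learned structure

        # Optional stage 1.1: uniformly expand the learned structure.
        if expansion is not None:
            widths = [max(1, round(w * expansion)) for w in widths]

        # Stage 2: finetune or retrain the weights of the learned structure.
        return retrain(build_network(widths), data)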

SLIDE 16

Shrinking CIFARNet

[Figure: shrinking CIFARNet; 40%, 20%, 50%]
SLIDE 17

Can This Work in Conv-nets?

What do Inception, ResNet, DenseNet, NASNet, and AmoebaNet have in common? They all use batch normalization. Problem: with batch norm, the weight matrix is scale invariant, so an L1 penalty on the weights can be driven down by rescaling them without changing the network's output.

SLIDE 18

L1-Gamma regularization

Batch norm actually has a learned scale parameter, gamma. Problem: the weights themselves are still scale invariant. Solution: the scale parameter is the perfect substitute for the sparsity penalty, and zeroing gamma effectively removes the filter!
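Concretely, batch norm scales each channel c by a learned gamma_c (standard batch-norm notation, not from the slides), and the L1-gamma regularizer penalizes the gammas directly:

    \mathrm{BN}(z_c) = \gamma_c\,\frac{z_c - \mu_c}{\sigma_c} + \beta_c,
    \qquad
    R(\gamma) = \sum_c |\gamma_c|

Because the normalization divides by sigma_c, rescaling the weights that produce channel c leaves the output unchanged; only gamma_c controls the channel's magnitude, so gamma_c = 0 switches the filter off.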

SLIDE 19

Main Tool: Weighted sparsifying regularization.

SLIDE 20

What Do We Actually Care About?

We can now control the number of filters. But what we actually care about is model size, FLOPs, and inference time. Notice: FLOPs and model size are simple functions of the number of filters. Solution: a per-layer coefficient that captures the cost of each filter.
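In symbols, a sketch of the weighted sparsifying regularizer following the description above (gamma denotes the per-filter regularized values, c_l the per-layer cost coefficient):

    R(\gamma) \;=\; \sum_{\ell} c_\ell \sum_{i \in \text{layer } \ell} |\gamma_{\ell,i}|

where c_l reflects how much one filter of layer l costs in the targeted resource (FLOPs, model size, or latency).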

SLIDE 21

What is the Cost of a Filter?

[Figure: the cost of one filter in a 3x3 convolution, with its FLOP coefficient and its model-size coefficient]
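Under the assumption of a plain k x k convolution producing an H x W output map over C_in active input channels (standard cost accounting, not taken from the slides), the per-filter coefficients are roughly:

    c^{\text{FLOP}} \;\approx\; 2\,H\,W\,k^{2}\,C_{\text{in}},
    \qquad
    c^{\text{size}} \;\approx\; k^{2}\,C_{\text{in}}

FLOP cost additionally scales with the spatial size of the output map while model size does not, which is why the FLOP and model-size regularizers end up pruning different layers, as the ResNet-101 results later show.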

SLIDE 22

Inception V2 Based Networks on ImageNet

Baseline: Uniform shrinkage of all layers (width multiplier). FLOP Regularizer: Structure learned with a FLOP penalty. Expanded structure: Uniform expansion of the learned structure.

Figure from: MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks

SLIDE 23

JFT: Google Scale Image Classification

Image classification with 300M+ images, >20K classes. Started with a ResNet-101 architecture.

Figure adapted from: MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks

The first model with algorithmically learned architecture serving in production.

SLIDE 24

ResNet101-Based Learned Structures

FLOP regularizer: 40% fewer FLOPs. Model-size regularizer: 43% fewer weights. All models have the same performance.

Figure adapted from: MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks

SLIDE 25

A Custom Architecture Just For You!

Partnered with the Google OCR team, which maintains models for dozens of scripts that differ in:

  • Number of characters,
  • Character complexity,
  • Word-length,
  • Size of data.

A single fixed architecture was used for all scripts!

SLIDE 26

A Custom Architecture Just For You!

Models with 50% of FLOPs (with same accuracy)

[Figure: learned structures; some filters are useful for Cyrillic, others for Arabic]

SLIDE 27

Zooming in On Latency

# Activations? Brute force? # FLOPs? Latency is device specific!

SLIDE 28

Latency Roofline Model

Each op needs to read its inputs, perform calculations, and write its outputs. The evaluation time of an op depends on both compute and memory costs: Compute time = FLOPs / compute_rate. Memory time = tensor_size / memory_bandwidth. Latency = max(compute time, memory time). The compute rate and memory bandwidth are device specific.
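A minimal sketch of this roofline estimate (illustrative only; the op's FLOPs and bytes are hypothetical, and the peak figures are the ones quoted on the next slide):

    # Roofline latency estimate for a single op: latency is the max of
    # compute time (FLOPs / peak compute) and memory time (bytes / bandwidth).
    def roofline_latency_s(flops, bytes_moved, peak_flops_per_s, bytes_per_s):
        compute_time = flops / peak_flops_per_s
        memory_time = bytes_moved / bytes_per_s
        return max(compute_time, memory_time)

    # The same op on two devices (peak numbers from the next slide).
    P100 = dict(peak_flops_per_s=9.3e12, bytes_per_s=732e9)
    V100 = dict(peak_flops_per_s=125e12, bytes_per_s=900e9)

    op = dict(flops=2e9, bytes_moved=50e6)   # hypothetical op
    print(roofline_latency_s(**op, **P100))  # compute bound on the P100
    print(roofline_latency_s(**op, **V100))  # memory bound on the V100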

SLIDE 29

Example Latency Costs

Inception V2 Layer Name             P100 Latency   V100 Latency   Ratio
Conv2d_2c_3x3                       74584          5549           7%
Mixed_3c/Branch_2/Conv2d_0a_1x1     2762           1187           43%
Mixed_5c/Branch_3/Conv2d_0b_1x1     1381           833            60%

Platform   Peak Compute       Memory Bandwidth
P100       9300 GFLOPs/s      732 GB/s
V100       125000 GFLOPs/s    900 GB/s

Different platforms have different cost profiles, which leads to different relative costs for the same op.

SLIDE 30

Tesla V100 Latency

SLIDE 31

Tesla P100 Latency

SLIDE 32

When Do FLOPs and Latency Differ?

  • Create 5000 sub-Inception V2 models with a random number of filters.
  • Compare FLOPs, V100 latency, and P100 latency.

V100: the gap between FLOPs and latency is looser. P100: compute bound, so latency tracks FLOPs “too” closely.

SLIDE 33

What Next

If you want to

  • Algorithmically speed up or shrink your model,
  • Easily improve your model

You are invited to use our open source library https://github.com/google-research/morph-net

SLIDE 34

Quick User Guide

Exact same API works for different costs and settings: GroupLassoFlops, GammaFlops, GammaModelSize, GammaLatency
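A usage sketch along the lines of the open-source library's README (TF1-style; module paths, the class name GammaFlopsRegularizer, and argument names are recalled from the repo and may differ slightly, so treat this as an approximation rather than the authoritative API):

    import tensorflow.compat.v1 as tf
    from morph_net.network_regularizers import flop_regularizer

    tf.disable_v2_behavior()  # graph-mode TF1 style

    inputs = tf.placeholder(tf.float32, [None, 224, 224, 3])
    labels = tf.placeholder(tf.int64, [None])
    logits = build_model(inputs)  # your existing model-building code (placeholder)

    # Attach a FLOP-targeting regularizer; the other costs named on this slide
    # (model size, latency, group-LASSO variants) follow the same pattern.
    network_regularizer = flop_regularizer.GammaFlopsRegularizer(
        output_boundary=[logits.op],
        input_boundary=[inputs.op, labels.op],
        gamma_threshold=1e-3)

    regularization_strength = 1e-9  # the knob swept on the next slides
    regularizer_loss = (network_regularizer.get_regularization_term()
                        * regularization_strength)

    model_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(
        model_loss + regularizer_loss)

    # network_regularizer.get_cost() reports the current (relaxed) resource cost.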

SLIDE 35

Structure Learning: Regularization Strength

Pick a few regularization strengths and compare the resulting cost (here, P100 latency). 1.5e-5: ~55% speedup. 1e-6: no effect, too weak.

SLIDE 36

Structure Learning: Accuracy Tradeoff

Of course there is a tradeoff in test accuracy: 1.5e-5 gives a ~55% speedup, while 1e-6 has no effect (too weak).

SLIDE 37

Structure Learning: Threshold

Values of gamma (or group LASSO norms) usually don't reach exactly 0.0, so a threshold is needed. It is usually easy to determine: plot the regularized value (abs(gamma) or the L2 norm); the distribution is often bimodal, with dead filters clustered near zero and alive filters well above, and any threshold in the gap between them should work.

[Figure: L2 norm of CIFARNet filters after structure learning, showing dead vs. alive filters]
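A small sketch of how one might eyeball that threshold (generic numpy/matplotlib, not part of the MorphNet API; gamma_values is assumed to be a 1-D array of abs(gamma) or per-filter L2 norms collected after structure learning):

    import numpy as np
    import matplotlib.pyplot as plt

    # Histogram of the regularized values; typically bimodal (dead vs. alive).
    plt.hist(gamma_values, bins=100)
    plt.xlabel('abs(gamma) / filter L2 norm')
    plt.ylabel('filter count')
    plt.show()

    threshold = 1e-2  # hypothetical value, read off the gap between the modes
    alive_filters = int((np.asarray(gamma_values) > threshold).sum())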

SLIDE 38

Structure Learning: Exporting
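Exporting might look roughly like the library's README example (an approximation from memory of github.com/google-research/morph-net, so class and method names may differ; train_op, network_regularizer, max_steps, and train_dir carry over from the earlier sketch):

    import tensorflow.compat.v1 as tf
    from morph_net.tools import structure_exporter

    exporter = structure_exporter.StructureExporter(
        network_regularizer.op_regularizer_manager)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(max_steps):
            _, exporter_tensors = sess.run([train_op, exporter.tensors])
            if step % 1000 == 0:
                # Writes a file mapping each op to its number of alive filters.
                exporter.populate_tensor_values(exporter_tensors)
                exporter.create_file_and_save_alive_counts(train_dir, step)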

SLIDE 39

Retraining/Fine Tuning

Problem

  • Extra regularization hurts performance.
  • Some filters are not completely dead.

Options

  • Zero dead filters and finetune.
  • Train learned structure from scratch.

Why

  • Ensures learned structure is stand-alone and not tied to learning procedure.
  • Stabilizes downstream pipeline.
SLIDE 40

Under the Hood: Shape Compatibility Constraints

The NetworkRegularizer figures out structural dependencies in the graph.

[Figure: conv1 and conv2 joined by an Add through a skip connection, so their output widths must match]

SLIDE 41

Under the Hood: Concatenation (as in Inception)

[Figure: conv1 and conv2 feed a Concat whose output is added (Add) to conv3]

Things can get complicated, but it is all handled by the MorphNet framework.

SLIDE 42

Team Effort

Contributors & collaborators: Ariel Gordon, Bo Chen, Ofir Nachum, Hao Wu, Tien-Ju Yang, Edward Choi, Hernan Moraldo, Jesse Dodge, Yonatan Geifman, Shraman Ray Chaudhuri.

Elad Eban, Max Moroz, Yair Movshovitz-Attias, Andrew Poon

SLIDE 43

Thank You

Elad Eban

Contact: morphnet@google.com

https://github.com/google-research/morph-net