Global Optimality in Neural Network Training Benjamin D. Haeffele - PowerPoint PPT Presentation

Global Optimality in Neural Network Training Benjamin D. Haeffele and René Vidal Johns Hopkins University, Center for Imaging Science. Baltimore, USA

Questions in Deep Learning Architecture Design Optimization Generalization

Questions in Deep Learning Are there principled ways to design networks? • How many layers? • Size of layers? • Choice of layer types? • How does architecture impact expressiveness? [1] [1 ] Cohen, et al., “On the expressive power of deep learning: A tensor analysis.” COLT. (2016)

Questions in Deep Learning How to train neural networks?

Questions in Deep Learning How to train neural networks? • Problem is non-convex.

Questions in Deep Learning How to train neural networks? X • Problem is non-convex.

Questions in Deep Learning How to train neural networks? X • Problem is non-convex. • What does the loss surface look like? [1] • Any guarantees for network training? [2] • How to guarantee optimality? • When will local descent succeed? [1] Choromanska, et al., "The loss surfaces of multilayer networks." Artificial Intelligence and Statistics. (2015) [2] Janzamin, et al., "Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods." arXiv (2015).

Questions in Deep Learning Performance Guarantees? X Complex  Simple • How do networks generalize? • How should networks be regularized? • How to prevent overfitting?

Interrelated Problems Architecture • Optimization can impact generalization. [1] • Architecture has a strong effect on the Generalization/ generalization of networks. [2] Regularization Optimization • Some architectures could be easier to optimize than others. [1] Neyshabur, et al., “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning.” ICLR workshop. (2015). [2] Zhang, et al., “Understanding deep learning requires rethinking generalization.” ICLR. (2017).

Today’s Talk: The Questions Architecture • Are there properties of the network architecture that allow efficient optimization? Generalization/ Regularization Optimization

Today’s Talk: The Questions Architecture • Are there properties of the network architecture that allow efficient optimization? • Positive Homogeneity • Parallel Subnetwork Structure Generalization/ Regularization Optimization

Today’s Talk: The Questions Architecture • Are there properties of the network architecture that allow efficient optimization? • Positive Homogeneity • Parallel Subnetwork Structure Generalization/ • Are there properties of the Regularization Optimization regularization that allow efficient optimization?

Today’s Talk: The Questions Architecture • Are there properties of the network architecture that allow efficient optimization? • Positive Homogeneity • Parallel Subnetwork Structure Generalization/ • Are there properties of the Regularization Optimization regularization that allow efficient optimization? • Positive Homogeneity • Adapt network architecture to data [1] [1] Bengio, et al., “Convex neural networks .” NIPS. (2005)

Today’s Talk: The Results Optimization

Today’s Talk: The Results Optimization • A local minimum such that one subnetwork is all zero is a global minimum.

Today’s Talk: The Results Optimization • Once the size of the network becomes large enough... • Local descent can reach a global minimum from any initialization. Today’s Framework Non-Convex Function

Outline 1. Network properties that allow Architecture efficient optimization • Positive Homogeneity • Parallel Subnetwork Structure 2. Network size from regularization Generalization/ Regularization Optimization 3. Theoretical guarantees • Sufficient conditions for global optimality • Local descent can reach global minimizers

Key Property 1: Positive Homogeneity • Start with a network. Network Outputs Network Weights

Key Property 1: Positive Homogeneity • Scale the weights by a non-negative constant.

Key Property 1: Positive Homogeneity • The network output scales by the constant to some power.

Key Property 1: Positive Homogeneity • The network output scales by the constant to some power. Network Mapping

Key Property 1: Positive Homogeneity • The network output scales by the constant to some power. Network Mapping - Degree of positive homogeneity

Most Modern Networks Are Positively Homogeneous • Example: Rectified Linear Units (ReLUs)

Most Modern Networks Are Positively Homogeneous • Example: Rectified Linear Units (ReLUs) Doesn’t change rectification

Most Modern Networks Are Positively Homogeneous • Simple Network Conv Conv Max + Linear Out Input + Pool ReLU ReLU

Most Modern Networks Are Positively Homogeneous • Simple Network Conv Conv Max + Linear Out Input + Pool ReLU ReLU • Typically each weight layer increases degree of homogeneity by 1.

Most Modern Networks Are Positively Homogeneous Some Common Positively Homogeneous Layers  Fully Connected + ReLU  Convolution + ReLU  Max Pooling  Linear Layers  Mean Pooling  Max Out  Many possibilities…

Most Modern Networks Are Positively Homogeneous Some Common Positively Homogeneous Layers  Fully Connected + ReLU  Convolution + ReLU  Max Pooling X Not Sigmoids  Linear Layers  Mean Pooling  Max Out  Many possibilities…

Outline 1. Network properties that allow Architecture efficient optimization • Positive Homogeneity • Parallel Subnetwork Structure 2. Network regularization Generalization/ Regularization Optimization 3. Theoretical guarantees • Sufficient conditions for global optimality • Local descent can reach global minimizers

Key Property 2: Parallel Subnetworks • Subnetworks with identical architecture connected in parallel.

Key Property 2: Parallel Subnetworks • Subnetworks with identical architecture connected in parallel. • Simple Example: Single hidden layer network

Key Property 2: Parallel Subnetworks • Subnetworks with identical architecture connected in parallel. • Simple Example: Single hidden layer network • Subnetwork: One ReLU hidden unit

Key Property 2: Parallel Subnetworks • Any positively homogeneous subnetwork can be used • Subnetwork: Multiple ReLU layers

Key Property 2: Parallel Subnetworks • Example: Parallel AlexNets[1] • Subnetwork: AlexNet AlexNet AlexNet Input AlexNet Output AlexNet AlexNet [1] Krizhevsky, Sutskever, and Hinton. "Imagenet classification with deep convolutional neural networks." NIPS, 2012.

Outline 1. Network properties that allow efficient Architecture optimization • Positive Homogeneity • Parallel Subnetwork Structure 2. Network regularization Generalization/ Regularization Optimization 3. Theoretical guarantees • Sufficient conditions for global optimality • Local descent can reach global minimizers

Basic Regularization: Weight Decay Network Weights

Global Optimality in Neural Network Training Benjamin D. Haeffele - PowerPoint PPT Presentation

Global Optimality in Neural Network Training Benjamin D. Haeffele and Ren Vidal Johns Hopkins University, Center for Imaging Science. Baltimore, USA Questions in Deep Learning Architecture Design Optimization Generalization Questions in

Optimality Conditions Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Optimality

Neural Information Retrieval Wassila Lalouani 1 Plan Neural network architectures Neural

Neural Networks and Handwriting Recognition Background Neural Networks Neural Network Steven

Stability, Networks: Stability, Networks: Control, and Optimality Control, and Optimality

Optimality-based Domain Reductions for Global Optimization A. Caprara, M. Locatelli, M. Monaci

Neural Networks Neural networks arise from attempts to model Neural Networks human/animal

Global Neural CCG Parsing with Optimality Guarantees Kenton Lee Mike Lewis Luke Zettlemoyer

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex Neural

AM 205: lecture 18 Last time: optimization methods Today: conditions for optimality

Adaptive Caching Algorithms with Optimality Guarantees for NDN Networks Stratis Ioannidis and

Neural Networks and their Application to Go Neural Networks Learning Blackjack Theory Training

Neural Machine Translation Gongbo Tang 8 October 2018 Outline Neural Machine Translation 1

Neural Network II Neural Network II Week 8 1 Team Homework Assignment #10 Team Homework

GLOBAL RISKS GLOBAL RISKS GLOBAL RISKS - GLOBAL RISKS - - - GLOBAL RISKS GLOBAL RISKS

Artificial Neural Networks Oliver Schulte - CMPT 726 Feed-forward Networks Network Training

Deep Learning Primer Nishith Khandwala Neural Networks Overview Neural Network Basics

Deep Equilibrium Models Shaojie Bai Carnegie Mellon University joint work with J. Zico Kolter

Deep Argument Inspection Linux Plumbers Conference 2019 Kees Cook <keescook@chromium.org>

Deep Canonical Correlation Analysis Galen Andrew 1 Raman Arora 2 Jeff Bilmes 1 Karen Livescu 2 1

Deep Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond Luke Zettlemoyer *

Deep Fisher Networks and Class Saliency Maps for Object Classification and Localisation Karn

Deep Learning in Computer Vision Caner Hazrba Deep Learning in Action 24. June 15

Deep Learning on GPUs March 2016 What is Deep Learning? GPUs and DL AGENDA DL in practice

Deep Learning to Evaluate Secure RSA Implementations Mathieu Carbone, Vincent Conin, Marie-Angela

Global Optimality in Neural Network Training Benjamin D. Haeffele - PowerPoint PPT Presentation

Global Optimality in Neural Network Training Benjamin D. Haeffele and Ren Vidal Johns Hopkins University, Center for Imaging Science. Baltimore, USA Questions in Deep Learning Architecture Design Optimization Generalization Questions in

Optimality Conditions Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Optimality

Neural Information Retrieval Wassila Lalouani 1 Plan Neural network architectures Neural

Neural Networks and Handwriting Recognition Background Neural Networks Neural Network Steven

Stability, Networks: Stability, Networks: Control, and Optimality Control, and Optimality

Optimality-based Domain Reductions for Global Optimization A. Caprara, M. Locatelli, M. Monaci

Neural Networks Neural networks arise from attempts to model Neural Networks human/animal

Global Neural CCG Parsing with Optimality Guarantees Kenton Lee Mike Lewis Luke Zettlemoyer

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex Neural

AM 205: lecture 18 Last time: optimization methods Today: conditions for optimality

Adaptive Caching Algorithms with Optimality Guarantees for NDN Networks Stratis Ioannidis and

Neural Networks and their Application to Go Neural Networks Learning Blackjack Theory Training

Neural Machine Translation Gongbo Tang 8 October 2018 Outline Neural Machine Translation 1

Neural Network II Neural Network II Week 8 1 Team Homework Assignment #10 Team Homework

GLOBAL RISKS GLOBAL RISKS GLOBAL RISKS - GLOBAL RISKS - - - GLOBAL RISKS GLOBAL RISKS

Artificial Neural Networks Oliver Schulte - CMPT 726 Feed-forward Networks Network Training

Deep Learning Primer Nishith Khandwala Neural Networks Overview Neural Network Basics

Deep Equilibrium Models Shaojie Bai Carnegie Mellon University joint work with J. Zico Kolter

Deep Argument Inspection Linux Plumbers Conference 2019 Kees Cook &lt;keescook@chromium.org&gt;

Deep Canonical Correlation Analysis Galen Andrew 1 Raman Arora 2 Jeff Bilmes 1 Karen Livescu 2 1

Deep Learning for Broad Coverage Semantics: SRL, Coreference, and Beyond Luke Zettlemoyer *

Deep Fisher Networks and Class Saliency Maps for Object Classification and Localisation Karn

Deep Learning in Computer Vision Caner Hazrba Deep Learning in Action 24. June 15

Deep Learning on GPUs March 2016 What is Deep Learning? GPUs and DL AGENDA DL in practice

Deep Learning to Evaluate Secure RSA Implementations Mathieu Carbone, Vincent Conin, Marie-Angela

Deep Argument Inspection Linux Plumbers Conference 2019 Kees Cook <keescook@chromium.org>