SLIDE 1

Global Optimality in Neural Network Training

Benjamin D. Haeffele and René Vidal

Johns Hopkins University, Center for Imaging Science, Baltimore, USA

SLIDE 2

Questions in Deep Learning

Architecture Design · Optimization · Generalization

SLIDE 3

Questions in Deep Learning

Are there principled ways to design networks?

  • How many layers?
  • Size of layers?
  • Choice of layer types?
  • How does architecture impact expressiveness? [1]

[1] Cohen, et al., “On the expressive power of deep learning: A tensor analysis.” COLT. (2016)

SLIDES 4-7

Questions in Deep Learning

How to train neural networks?

  • Problem is non-convex.
  • What does the loss surface look like? [1]
  • Any guarantees for network training? [2]
  • How to guarantee optimality?
  • When will local descent succeed?

[1] Choromanska, et al., "The loss surfaces of multilayer networks." Artificial Intelligence and Statistics. (2015)
[2] Janzamin, et al., "Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods." arXiv. (2015)

SLIDE 8

Questions in Deep Learning

Performance Guarantees?

  • How do networks generalize?
  • How should networks be regularized?
  • How to prevent overfitting?

[Figure: generalization vs. model complexity, axis from Simple to Complex]

SLIDE 9

Interrelated Problems

  • Optimization can impact generalization. [1]
  • Architecture has a strong effect on the generalization of networks. [2]
  • Some architectures could be easier to optimize than others.

[1] Neyshabur, et al., "In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning." ICLR workshop. (2015)
[2] Zhang, et al., "Understanding deep learning requires rethinking generalization." ICLR. (2017)

[Diagram: Architecture, Optimization, Generalization/Regularization]

SLIDES 10-13

Today’s Talk: The Questions

  • Are there properties of the network architecture that allow efficient optimization?
    • Positive Homogeneity
    • Parallel Subnetwork Structure
  • Are there properties of the regularization that allow efficient optimization?
    • Positive Homogeneity
    • Adapt network architecture to data [1]

[1] Bengio, et al., "Convex neural networks." NIPS. (2005)

[Diagram: Architecture, Optimization, Generalization/Regularization]

SLIDES 14-16

Today’s Talk: The Results

Optimization

  • A local minimum such that one subnetwork is all zero is a global minimum.
  • Once the size of the network becomes large enough, local descent can reach a global minimum from any initialization.

[Figure: a generic non-convex function vs. today’s framework]

SLIDE 17

Outline

  1. Network properties that allow efficient optimization
     • Positive Homogeneity
     • Parallel Subnetwork Structure
  2. Network size from regularization
  3. Theoretical guarantees
     • Sufficient conditions for global optimality
     • Local descent can reach global minimizers

[Diagram: Architecture, Optimization, Generalization/Regularization]

SLIDES 18-23

Key Property 1: Positive Homogeneity

  • Start with a network, with its network weights and network outputs.
  • Scale the weights by a non-negative constant.
  • The network output scales by that constant raised to some power, the degree of positive homogeneity of the network mapping.
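Written out, this is the defining property; a hedged reconstruction of the slide's image-only formula, with Φ the network mapping, W¹, …, Wᴷ the weight layers, and p the degree:

    % Positive homogeneity of degree p: scaling every weight layer by
    % alpha >= 0 scales the network output by alpha^p.
    \Phi(x;\, \alpha W^1, \ldots, \alpha W^K) \;=\; \alpha^{p}\, \Phi(x;\, W^1, \ldots, W^K),
    \qquad \text{for all } \alpha \ge 0.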
SLIDES 24-29

Most Modern Networks Are Positively Homogeneous

  • Example: Rectified Linear Units (ReLUs).
  • Scaling the input by a non-negative constant doesn’t change the rectification: the same units stay active, and the output simply scales.
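Concretely, the rectifier itself is positively homogeneous of degree 1; this identity is standard and underlies the claim above:

    % For alpha >= 0, scaling the pre-activation commutes with the ReLU,
    % because multiplying by a non-negative constant preserves the sign.
    \max(0, \alpha z) \;=\; \alpha \max(0, z), \qquad \alpha \ge 0.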

SLIDES 30-41

Most Modern Networks Are Positively Homogeneous

  • Simple network: Input → Conv + ReLU → Max Pool → Conv + ReLU → Linear → Out.
  • Typically each weight layer increases the degree of homogeneity by 1. (A numerical check follows below.)
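A minimal numerical check of this rule of thumb, assuming a toy network with two weight layers (the sizes, seed, and helper names are illustrative, not from the talk):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def net(x, W1, W2):
        # Two weight layers (linear -> ReLU -> linear): expected degree p = 2.
        return W2 @ relu(W1 @ x)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    W1 = rng.standard_normal((4, 5))
    W2 = rng.standard_normal((3, 4))

    alpha = 2.5
    lhs = net(x, alpha * W1, alpha * W2)  # scale every weight layer by alpha
    rhs = alpha**2 * net(x, W1, W2)       # alpha^p with p = 2 weight layers
    print(np.allclose(lhs, rhs))          # True: output scales as alpha^2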
SLIDES 42-43

Most Modern Networks Are Positively Homogeneous

Some common positively homogeneous layers:

  • Fully Connected + ReLU
  • Convolution + ReLU
  • Max Pooling
  • Linear Layers
  • Mean Pooling
  • Max Out
  • Many possibilities…

Not sigmoids: the sigmoid is not positively homogeneous.

SLIDE 44

Outline

  1. Network properties that allow efficient optimization
     • Positive Homogeneity
     • Parallel Subnetwork Structure
  2. Network regularization
  3. Theoretical guarantees
     • Sufficient conditions for global optimality
     • Local descent can reach global minimizers

[Diagram: Architecture, Optimization, Generalization/Regularization]

SLIDES 45-48

Key Property 2: Parallel Subnetworks

  • Subnetworks with identical architecture connected in parallel.
  • Simple example: a single hidden layer network.
  • Subnetwork: one ReLU hidden unit. (A sketch of the decomposition follows below.)
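In this example each hidden unit is itself one parallel subnetwork; a hedged sketch of the decomposition (the notation is assumed, since the slide formulas were images): with first-layer weights w¹ᵢ and output weights w²ᵢ for the i-th unit,

    % A single-hidden-layer ReLU network as a sum of r parallel subnetworks,
    % one per hidden unit.
    \Phi(x) \;=\; \sum_{i=1}^{r} w_i^{2}\, \max\!\big(0,\, \langle w_i^{1}, x \rangle\big).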
SLIDE 49

Key Property 2: Parallel Subnetworks

  • Subnetwork: Multiple ReLU layers
  • Any positively homogeneous subnetwork can be used
SLIDE 50

Key Property 2: Parallel Subnetworks

  • Example: Parallel AlexNets [1]
  • Subnetwork: AlexNet

[Diagram: five AlexNet subnetworks in parallel between Input and Output]

[1] Krizhevsky, Sutskever, and Hinton. "Imagenet classification with deep convolutional neural networks." NIPS, 2012.

SLIDE 51

Outline

  1. Network properties that allow efficient optimization
     • Positive Homogeneity
     • Parallel Subnetwork Structure
  2. Network regularization
  3. Theoretical guarantees
     • Sufficient conditions for global optimality
     • Local descent can reach global minimizers

[Diagram: Architecture, Optimization, Generalization/Regularization]

SLIDES 52-58

Basic Regularization: Weight Decay

  • Weight decay penalizes the norms of the network weights.
  • When the degrees of positive homogeneity of the network and the regularizer don’t match, bad things happen. (A hedged reconstruction of the mismatch follows below.)
  • Proposition: There will always exist non-optimal local minima.
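As a hedged reconstruction of why the degrees mismatch (the slide's formulas were images): weight decay is positively homogeneous of degree 2 no matter how deep the network is, while a network with K weight layers is typically homogeneous of degree K, so for K > 2 the two terms scale differently:

    % Weight decay always scales quadratically in alpha ...
    \Theta(\alpha W) = \sum_k \|\alpha W^k\|_F^2 = \alpha^2\, \Theta(W),
    % ... while a K-weight-layer network scales as alpha^K.
    \Phi(x;\, \alpha W) = \alpha^K\, \Phi(x;\, W).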

SLIDES 59-65

Adapting the size of the network via regularization

  • Start with a positively homogeneous network with parallel structure.
  • Take the weights of one subnetwork.
  • Define a regularization function on those weights that is:
    • Non-negative.
    • Positively homogeneous with the same degree as the network mapping.
  • Example: product of norms. (A sketch follows below.)
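A hedged sketch of the product-of-norms example (notation assumed): for one subnetwork with weight layers w¹, …, wᴷ,

    % The product of per-layer norms is non-negative and positively
    % homogeneous of degree K, matching a K-weight-layer subnetwork.
    \theta(w^1, \ldots, w^K) \;=\; \prod_{k=1}^{K} \|w^k\|.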

SLIDES 66-70

Adapting the size of the network via regularization

  • Sum the regularization function over all the subnetworks.
  • Allow the number of subnetworks to vary.
  • Adding a subnetwork is penalized by an additional term in the sum.
  • This acts to constrain the number of subnetworks. (A sketch of the full regularizer follows below.)
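Putting the two previous slides together, a hedged sketch of the full regularizer (notation assumed): with θ the per-subnetwork function and r the number of subnetworks,

    % Summing theta over a variable number r of subnetworks; every added
    % subnetwork contributes one more non-negative term to the penalty.
    \Omega(W) \;=\; \sum_{i=1}^{r} \theta\big(w_i^1, \ldots, w_i^K\big),
    \qquad r \text{ allowed to vary}.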

SLIDE 71

Outline

  1. Network properties that allow efficient optimization
     • Positive Homogeneity
     • Parallel Subnetwork Structure
  2. Network regularization
  3. Theoretical guarantees
     • Sufficient conditions for global optimality
     • Local descent can reach global minimizers

[Diagram: Architecture, Optimization, Generalization/Regularization]

SLIDES 72-75

Our problem

  • The non-convex problem we’re interested in: minimize a loss on the network outputs plus the regularization above.
  • Loss function: assumed convex and once differentiable in the network outputs; it compares the outputs against the labels.
  • Examples: cross-entropy, least-squares.

(A hedged reconstruction of the objective follows below.)
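A hedged reconstruction of the training objective on these slides (the equation itself was an image): with data X, labels Y, a convex loss ℓ, the regularizer Ω from before, and a weight λ > 0,

    % Jointly minimize over the network size r and the weights W.
    \min_{r \in \mathbb{N}} \; \min_{W} \;\; \ell\big(Y,\, \Phi(X;\, W)\big) \;+\; \lambda\, \Omega(W).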

SLIDES 76-83

Why do all this?

  • The regularization induces a convex function on the network outputs.
  • The convex problem provides an achievable lower bound for the non-convex network training problem.
  • Use the convex function as an analysis tool to study the non-convex network training problem. (A sketch of the induced function follows below.)
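A hedged sketch of the induced function, following the construction in the authors' paper (notation assumed): the regularizer induces a function on network outputs Z by minimizing over all networks, of any size, that produce Z,

    % The induced regularizer on outputs Z; it can be shown to be convex,
    % and the resulting convex problem lower-bounds the non-convex one.
    \Omega_{\Phi}(Z) \;=\; \inf_{r,\, W} \Big\{ \textstyle\sum_{i=1}^{r} \theta(w_i)
    \;:\; \Phi(X;\, W) = Z \Big\},
    \qquad
    \min_{Z} \; \ell(Y, Z) + \lambda\, \Omega_{\Phi}(Z).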

SLIDES 84-86

Sufficient Conditions for Global Optimality

  • Theorem: A local minimum such that one subnetwork is all zero is a global minimum.
  • Intuition: Such a local minimum satisfies the optimality conditions of the convex problem.

SLIDES 87-89

Global Minima from Local Descent

  • Theorem: If the size of the network is large enough (it has enough subnetworks), then a global minimum can always be reached by local descent from any initialization.
  • Meta-Algorithm (a sketch follows below):
    • If not at a local minimum, perform local descent.
    • At a local minimum, test whether the first Theorem is satisfied.
    • If not, add a subnetwork in parallel and continue.
    • The maximum number of subnetworks is guaranteed to be bounded by the dimensions of the network.
    • Output the result.

[Figure: a generic non-convex function vs. today’s framework]
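A minimal sketch of the meta-algorithm in pseudocode-style Python; local_descent, is_zero_subnetwork_present, num_subnetworks, and add_parallel_subnetwork are hypothetical helpers standing in for the steps named above, not functions from any library:

    def meta_train(weights, max_subnetworks):
        """Hedged sketch of the talk's meta-algorithm; helper names are hypothetical."""
        while True:
            # Descend until a local minimum is reached.
            weights = local_descent(weights)
            # First Theorem's test: a local min with an all-zero subnetwork is global.
            if is_zero_subnetwork_present(weights):
                return weights  # output: a global minimum
            # The number of subnetworks needed is bounded by the network dimensions.
            if num_subnetworks(weights) >= max_subnetworks:
                return weights
            # Otherwise, add a subnetwork in parallel and continue descending.
            weights = add_parallel_subnetwork(weights)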

SLIDE 90

Conclusions

  • Network size matters.
    • Optimize network weights AND network size.
    • Current: size = number of parallel subnetworks.
    • Future: size = number of layers, neurons per layer, etc…
  • Regularization design matters.
    • Match the degrees of positive homogeneity between network and regularization.
    • Regularization can control the size of the network.
  • Not done yet: several practical and theoretical limitations remain.
SLIDE 91

Thank You

Vision Lab @ Johns Hopkins University: http://www.vision.jhu.edu
Center for Imaging Science @ Johns Hopkins University: http://www.cis.jhu.edu
Work supported by NSF grants 1447822, 1618485, and 1618637.