Training Neural Networks: Optimization
Intro to Deep Learning, Fall 2020
Quick Recap: Training a network
– Total loss: the average over all training instances of the divergence between the desired output and the actual output of the net for a given input:
  Loss(W) = (1/N) Σ_i div(Y(X_i; W), d_i)
– Quantifies the difference between the desired outputs d_i and the actual outputs Y(X_i; W) of the net
– Minimize the total loss (the average of the losses for the individual instances) w.r.t. the network parameters W
– Solved through gradient descent as
  W^(k+1) = W^(k) − η ∇_W Loss(W^(k))^T
– The divergence surface has different eccentricities w.r.t. different weights, so no single step size η in the update
  W^(k+1) = W^(k) − η ∇_W Loss(W^(k))^T
  is well matched to every direction
– Convergence is slower, and oscillation more likely, along directions where the divergence has a higher eccentricity
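The basic update can be sketched numerically. A minimal illustration on a made-up quadratic loss (the loss, its gradient, and the optimum `W_opt` are invented for the example, not from the slides):

```python
import numpy as np

# Gradient descent W <- W - eta * dLoss/dW on a made-up quadratic loss
# Loss(W) = ||W - W_opt||^2, whose gradient is 2*(W - W_opt).
W_opt = np.array([3.0, -1.0])  # hypothetical optimum

def loss(W):
    return float(np.sum((W - W_opt) ** 2))

def grad(W):
    return 2.0 * (W - W_opt)

W = np.zeros(2)   # initial weights
eta = 0.1         # step size (learning rate)
for _ in range(100):
    W = W - eta * grad(W)  # step against the gradient

print(loss(W))  # shrinks toward 0 as W approaches W_opt
```

With η above 1.0 the same loop diverges on this particular loss (its curvature is 2), which is the step-size sensitivity discussed here.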
– Having a step size that is larger than optimal can actually help escape local optima
– More likely to find better minima
Note: this is actually a reduced step size
– And this may be a good thing
– Second-order methods normalize out the differences between the dimensions, but are complex and expensive
– Simpler algorithms that emulate this improvement are demonstrably superior to other methods
– But the resulting updates will not be against the gradient and do not guarantee descent
– Shrink step size in directions where the weight oscillates; expand step size in directions where the weight moves consistently in one direction
– I.e., take longer steps where previous updates consistently moved the weight right, and shorter steps where previous updates kept changing direction
[Figure: over steps k = 1, 2, 3 the weight oscillates along one axis but increases consistently along w1]
The momentum update:
  ΔW^(k) = β ΔW^(k−1) − η ∇_W Loss(W^(k−1))^T
  W^(k) = W^(k−1) + ΔW^(k)
– Cancels out steps in directions where the weight value oscillates – Adaptively increases step size in directions of consistent change
Momentum:
  ΔW^(k) = β ΔW^(k−1) − η ∇_W Loss(W^(k−1))^T
  W^(k) = W^(k−1) + ΔW^(k)
Nesterov's accelerated gradient:
  ΔW^(k) = β ΔW^(k−1) − η ∇_W Loss(W^(k−1) + β ΔW^(k−1))^T
  W^(k) = W^(k−1) + ΔW^(k)
[Figure: desired function value vs. input X, with the training points marked]
– Gradient descent adjusts parameters to adjust the function value at all points
– Repeat this iteratively until we get arbitrarily close to the target function at the training points
Incremental update: adjust the function at one training point at a time
– Keep adjustments small
– Eventually, when we have processed all the training points, we will have adjusted the entire function
Stochastic gradient descent (one update per instance):
– Randomly permute the training instances (X_1, d_1), …, (X_N, d_N)
– For each instance (X_t, d_t):
  » Compute the output Y(X_t) and the gradient ∇_W Div(Y(X_t), d_t)
  » Update: W = W − η ∇_W Div(Y(X_t), d_t)^T
– Until the loss has converged
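This per-instance loop can be sketched on a toy least-squares problem (the data shapes, true weights, and learning rate are illustrative assumptions):

```python
import numpy as np

# Sketch of SGD: one small update per training instance, on a noiseless
# made-up linear problem d = X @ w_true.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true

w = np.zeros(3)
eta = 0.05
for epoch in range(20):
    order = rng.permutation(len(X))   # present instances in random order
    for i in order:
        y = X[i] @ w                  # "forward pass" for one instance
        g = 2 * (y - d[i]) * X[i]     # gradient of (y - d)^2 w.r.t. w
        w = w - eta * g               # update after every single instance

print(np.round(w, 3))
```

Each step follows a noisy single-instance gradient, yet over many passes the weights settle near `w_true`.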
[Figure: the trajectory of SGD updates over one epoch and over multiple epochs; each point marks one update]
Batch gradient descent:
– The final gradient is the average of the individual per-instance gradients
– It points in the net direction
  ∂Loss/∂w = (1/N) Σ_t ∂Div(Y(X_t), d_t)/∂w
SGD:
– The final gradient is simply the gradient for an individual instance
  ∂Loss/∂w ≈ ∂Div(Y(X_t), d_t)/∂w   (for the single current instance t)
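The contrast between the batch gradient (average over instances) and the SGD gradient (one instance) can be checked directly; the least-squares setup is a hypothetical example:

```python
import numpy as np

# Batch gradient = average of per-instance gradients; SGD substitutes a
# single instance's gradient as a noisy, unbiased stand-in.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
d = X @ np.array([2.0, -1.0])   # made-up targets
w = np.array([0.5, 0.5])        # current weights

def instance_grad(i):
    # gradient of the divergence (y - d)^2 for instance i
    return 2 * (X[i] @ w - d[i]) * X[i]

batch_grad = np.mean([instance_grad(i) for i in range(len(X))], axis=0)
sgd_grad = instance_grad(0)     # SGD: just one instance's gradient

print(batch_grad)  # net direction over all instances
print(sgd_grad)    # single-instance estimate of the same quantity
```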
[Figure: batch gradient descent vs. SGD update trajectories]
– Correcting the function for individual instances will lead to never-ending, non-convergent updates
– We must shrink the learning rate with iterations to prevent this
– Corrections for individual instances with the eventual minuscule learning rates will not modify the function
SGD with a shrinking learning rate:
– Randomize the input order: permute the training instances (X_1, d_1), …, (X_N, d_N)
– For each instance (X_t, d_t), with a learning rate η_j that reduces with the update index j:
  » Compute the gradient ∇_W Div(Y(X_t), d_t)
  » Update: W = W − η_j ∇_W Div(Y(X_t), d_t)^T
– Until the loss has converged
– SGD converges "almost surely" to a global or local minimum for most functions
– Sufficient condition: the step sizes satisfy (Robbins and Monro, 1951)
  Σ_k η_k = ∞  and  Σ_k η_k² < ∞
– The fastest converging series that satisfies both requirements is
  η_k ∝ 1/k
– More generally, the learning rates are heuristically determined
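The two Robbins–Monro conditions for the η_k ∝ 1/k schedule can be checked numerically (the constant c is an arbitrary choice for illustration):

```python
import numpy as np

# Check the Robbins-Monro conditions for eta_k = c/k: the partial sums of
# eta_k keep growing (the harmonic series diverges), while the partial sums
# of eta_k^2 approach a finite limit, c^2 * pi^2 / 6.
c = 0.1
k = np.arange(1, 1_000_001)
eta = c / k

print(eta.sum())          # keeps growing as more terms are added
print((eta ** 2).sum())   # approaches c**2 * np.pi**2 / 6
```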
– Convergence rate: how many iterations are needed to get within ε of the optimal solution
– For convex functions, SGD guarantees Loss(W^(k)) − Loss(W*) = O(1/√k)
– Note: here Loss is the optimization objective on the entire training data, although SGD itself updates after every training instance
– Similarly, for strongly convex functions, Loss(W^(k)) − Loss(W*) = O(1/k)
– Strongly convex: the function can be placed inside a quadratic bowl, touching it at any point
– Giving us the iterations to convergence as O(1/ε²) for convex and O(1/ε) for strongly convex functions
– Let's try to understand these results.
  Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
  Ŵ = argmin_W Loss(W)
  E[Loss(W)] = E[div(f(X; W), d)]
– The empirical risk Loss(W) is an unbiased estimate of the expected divergence
– Though there is no guarantee that minimizing it will minimize the expected divergence
– The variance of the empirical risk: var(Loss) = (1/N) var(div)
– The variance of the estimator is proportional to 1/N
– The larger this variance, the greater the likelihood that the W that minimizes the empirical risk will differ significantly from the W that minimizes the expected divergence
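The 1/N scaling of the estimator's variance can be verified empirically; the divergence values are simulated from an arbitrary distribution purely for illustration:

```python
import numpy as np

# Empirical check of var(Loss) = var(div)/N: the mean of N divergence
# values is unbiased, and its variance shrinks as 1/N.
rng = np.random.default_rng(2)

def empirical_risk(N):
    div = rng.exponential(scale=2.0, size=N)  # stand-in divergence values
    return div.mean()

trials = 20_000
var_small = np.var([empirical_risk(10) for _ in range(trials)])
var_large = np.var([empirical_risk(100) for _ in range(trials)])

print(var_small / var_large)  # close to 10: variance scales as 1/N
```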
– The sample divergence used by SGD is also an unbiased estimate of the expected divergence
– The variance of the sample divergence is the variance of the divergence itself: var(div). This is N times the variance of the empirical average minimized by the batch update
– The divergence is a function of the error; we want to find the W that minimizes the average divergence
– With only one sample, the estimate of the divergence (and hence of its gradient) is unbiased but has a much higher variance
Alternative: mini-batch update
– Adjust the function at a small, randomly chosen subset of points at a time
– Keep adjustments small
– If the subsets cover the training set, we will have adjusted the entire function
– Vary the subset of training data from iteration to iteration
Mini-batch descent:
– Initialize all weights; j = 0
– Do:
  – Randomly permute the training instances (X_1, d_1), …, (X_N, d_N)
  – For each batch of b instances: j = j + 1; ΔW_k = 0 for every layer k
    » For every instance (X, d) in the batch, for every layer k: compute ∇_{W_k} Div(Y(X), d)
    » ΔW_k = ΔW_k + (1/b) ∇_{W_k} Div(Y(X), d)^T
  – For every layer k: update W_k = W_k − η_j ΔW_k
– Until the loss has converged
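The accumulate-then-update structure of the loop can be sketched on a toy least-squares problem (sizes, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

# Mini-batch gradient descent: accumulate per-instance gradients over a
# batch of size b, then make one update with the averaged gradient.
rng = np.random.default_rng(3)
X = rng.normal(size=(256, 4))
w_true = np.array([1.0, 0.0, -1.0, 2.0])
d = X @ w_true                 # noiseless made-up targets

w = np.zeros(4)
eta, b = 0.1, 32
for epoch in range(50):
    order = rng.permutation(len(X))        # randomly permute the training set
    for start in range(0, len(X), b):
        idx = order[start:start + b]
        err = X[idx] @ w - d[idx]
        grad = 2 * X[idx].T @ err / b      # average gradient over the batch
        w = w - eta * grad                 # one update per mini-batch

print(np.round(w, 3))
```

Compared with the per-instance loop, each update here averages b gradients, so it is less noisy, and the inner matrix products vectorize well.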
(Parameters: mini-batch size b; shrinking step size η_j)
– The mini-batch loss is also an unbiased estimate of the expected divergence
– The variance of the mini-batch loss: var(BatchLoss) = (1/b) var(div)
– This will be much smaller than the variance of the single-sample estimate used by SGD
– Apparently an improvement of √b in convergence over SGD
– But since the batch size is b, we perform b times as many computations per iteration as SGD
– Per computation, we actually get a degradation of √b
– In practice, however, the objectives are generally not convex; mini-batches are more effective with the right learning rates
– We also get additional benefits of vector processing
SGD:
– Converges faster (per computation) than full-batch training
– Provided the instances are presented in random order over iterations
– Otherwise the learning will continuously "chase" the latest sample
Mini-batch updates:
– Estimates have lower variance than SGD
– Convergence rate is theoretically worse than SGD per computation
– But we compensate by being able to perform batch (vector) processing
Mini-batch SGD with momentum (one update per batch, against the batch loss):
– Initialize all weights; ΔW_k = 0 for every layer k
– Do:
  – Randomly permute the training instances (X_1, d_1), …, (X_N, d_N)
  – For each batch of b instances: ∇Loss = 0
    » For every instance (X, d) in the batch, for every layer k: compute ∇_{W_k} Div(Y(X), d)
    » ∇_{W_k} Loss += (1/b) ∇_{W_k} Div(Y(X), d)
  – For every layer k: ΔW_k = β ΔW_k − η (∇_{W_k} Loss)^T;  W_k = W_k + ΔW_k
– Until the loss has converged
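The momentum rule ΔW ← βΔW − η∇Loss, W ← W + ΔW can be sketched on a made-up eccentric quadratic bowl (the loss, η, and β are illustrative choices):

```python
import numpy as np

# Momentum update on a hypothetical eccentric quadratic bowl
# Loss(W) = 0.5 * (100*w0^2 + w1^2): steps that keep agreeing accumulate,
# oscillating components partially cancel.
def grad(W):
    return np.array([100.0 * W[0], W[1]])

W = np.array([1.0, 1.0])
dW = np.zeros(2)
eta, beta = 0.01, 0.9
for _ in range(300):
    dW = beta * dW - eta * grad(W)  # momentum-smoothed step
    W = W + dW

print(W)  # near the minimum at the origin
```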
– First extend the previous step – Then compute the gradient at the resultant position – Add the two to obtain the final step
– The accelerated gradient smooths out the variance in the gradients
Mini-batch SGD with Nesterov's accelerated gradient:
– Initialize all weights; ΔW_k = 0 for every layer k
– Do:
  – Randomly permute the training instances (X_1, d_1), …, (X_N, d_N)
  – For t = 1 : b : T (one batch of b instances at a time):
    » For every layer k: W_k = W_k + β ΔW_k   (first extend the previous step)
    » ∇Loss = 0
    » For every instance (X, d) in the batch, for every layer k: compute ∇_{W_k} Div(Y(X), d); ∇_{W_k} Loss += (1/b) ∇_{W_k} Div(Y(X), d)
    » For every layer k: W_k = W_k − η (∇_{W_k} Loss)^T   (then step against the gradient at the resultant position)
    » For every layer k: ΔW_k = β ΔW_k − η (∇_{W_k} Loss)^T
– Until the loss has converged
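The lookahead idea (extend the previous step first, then take the gradient at the resultant position) can be sketched on the same hypothetical eccentric bowl used for plain momentum:

```python
import numpy as np

# Nesterov's accelerated gradient on a made-up quadratic bowl
# Loss(W) = 0.5 * (100*w0^2 + w1^2): evaluate the gradient at the
# lookahead point W + beta*dW, not at W itself.
def grad(W):
    return np.array([100.0 * W[0], W[1]])

W = np.array([1.0, 1.0])
dW = np.zeros(2)
eta, beta = 0.01, 0.9
for _ in range(300):
    lookahead = W + beta * dW               # first extend the previous step
    dW = beta * dW - eta * grad(lookahead)  # gradient at the lookahead point
    W = W + dW

print(W)
```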
– The magnitude of the movement can differ greatly between components of the weights
– In the example, total motion in the vertical direction is much greater than in the horizontal direction
– This can happen even when momentum or Nesterov's method is used
– Fix: scale each component's step by a second-order term that tracks its recent movement
[Table: X and Y components of update steps 1–5; the X components (1, 1, 2, 1, 1.5) move consistently in one direction, while the Y components (+2.5, …, +2.5, …, 1.5) keep flipping sign]
– Scale updates in every component in inverse proportion to the total movement of that component in the recent past
– This slows down fast-oscillating components and speeds up consistently moving components
– In the above example it would scale down the Y component, and scale up the X component (in comparison)
– Updates are by parameter
– The derivative of the loss w.r.t. an individual parameter w is written ∂D/∂w; the squared derivative is (∂D/∂w)²
– The mean squared derivative is a running estimate of the average squared derivative, written E[(∂D/∂w)²]
– Modify the updates to scale down updates with large mean squared derivatives and scale up updates with small mean squared derivatives
RMSprop:
– Maintain a running estimate of the mean squared value of derivatives for each parameter
– Scale the update of the parameter by the inverse of the root mean squared derivative
– Note the similarity to RPROP: the magnitude of the derivative is being normalized out
RMSprop algorithm:
– Initialize: k = 0; for all weights w in all layers, E[(∂D/∂w)²]_0 = 0
– Do (incrementing k in blocks of b inputs):
  – Randomly shuffle the inputs to change their order
  – For each block:
    » Compute the output Y(X) and the gradient ∂D/∂w
    » Update the running estimate of the mean squared derivative, for all w:
      E[(∂D/∂w)²]_k = γ E[(∂D/∂w)²]_{k−1} + (1 − γ) (∂D/∂w)²_k
    » Update the weights:
      w_{k+1} = w_k − (η / √(E[(∂D/∂w)²]_k + ε)) (∂D/∂w)_k
– Until the loss has converged
– Typical value: γ = 0.9
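The per-parameter normalization can be sketched on the same hypothetical eccentric bowl; note that both components end up taking steps of comparable size despite very different curvatures:

```python
import numpy as np

# RMSprop on a made-up quadratic bowl Loss(W) = 0.5*(100*w0^2 + w1^2):
# keep a running mean of squared derivatives per weight and divide each
# component's step by its root.
def grad(W):
    return np.array([100.0 * W[0], W[1]])

W = np.array([1.0, 1.0])
ms = np.zeros(2)                 # running mean squared derivative, per weight
eta, gamma, eps = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad(W)
    ms = gamma * ms + (1 - gamma) * g ** 2   # running second-moment estimate
    W = W - eta * g / (np.sqrt(ms) + eps)    # per-component normalized step

print(W)  # both components settle near the origin
```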
Adam: RMSprop with momentum
– Considers both the first and second moments of the gradient
– Maintain a running estimate of the mean derivative for each parameter:
  m_k = δ m_{k−1} + (1 − δ) (∂D/∂w)_k
– Maintain a running estimate of the mean squared value of derivatives for each parameter:
  v_k = γ v_{k−1} + (1 − γ) (∂D/∂w)²_k
– Bias-correct both running estimates against their zero initialization:
  m̂_k = m_k / (1 − δ^k),  v̂_k = v_k / (1 − γ^k)
– Scale the update of the parameter by the inverse of the root mean squared derivative:
  w_{k+1} = w_k − (η / (√v̂_k + ε)) m̂_k
– The corrections ensure that the m and v estimates do not stay dominated by their zero initialization in early iterations
– Typical values: δ = 0.9, γ = 0.999, ε = 10⁻⁸
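The full Adam recipe (first moment m, second moment v, bias correction, normalized step) can be sketched on the same made-up quadratic bowl; η and the iteration count are illustrative choices:

```python
import numpy as np

# Adam on a hypothetical quadratic bowl Loss(W) = 0.5*(100*w0^2 + w1^2):
# running means of the derivative (m) and squared derivative (v),
# bias-corrected, with the step scaled by m_hat / (sqrt(v_hat) + eps).
def grad(W):
    return np.array([100.0 * W[0], W[1]])

W = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
eta, delta, gamma, eps = 0.01, 0.9, 0.999, 1e-8
for k in range(1, 1001):
    g = grad(W)
    m = delta * m + (1 - delta) * g          # first-moment estimate
    v = gamma * v + (1 - gamma) * g ** 2     # second-moment estimate
    m_hat = m / (1 - delta ** k)             # bias correction: both start
    v_hat = v / (1 - gamma ** k)             # at 0, so rescale early values
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)

print(W)
```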