Neural Networks: Optimization Part 1
Intro to Deep Learning, Fall 2020
Story so far:
– Neural networks are universal approximators
– They can model any odd thing, provided they have the right architecture
– We must train them to approximate: specify the architecture, then learn their weights and biases
– We do so through empirical risk minimization
We minimize the empirical risk

Loss(W) = (1/N) Σᵢ div(f(Xᵢ; W), dᵢ)

with respect to the weights W via gradient descent, using the gradient of the loss with respect to the weights:
Computed using backprop
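As an illustration (a sketch added here, not from the slides), the empirical risk and its gradient for a single logistic neuron; the analytic gradient, which backprop would compute, is checked against a finite-difference estimate. The data and all names are invented for the example.

```python
import numpy as np

def loss(W, X, d):
    """Empirical risk: mean squared divergence between f(X; W) and targets d."""
    y = 1.0 / (1.0 + np.exp(-(X @ W[:-1] + W[-1])))   # single logistic neuron
    return np.mean((y - d) ** 2)

def grad(W, X, d):
    """Analytic gradient of the empirical risk (what backprop computes)."""
    z = X @ W[:-1] + W[-1]
    y = 1.0 / (1.0 + np.exp(-z))
    delta = 2.0 * (y - d) * y * (1.0 - y) / len(d)    # chain rule through MSE and sigmoid
    return np.concatenate([X.T @ delta, [delta.sum()]])

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
d = (X.sum(1) > 0).astype(float)
W = rng.normal(size=3)

# Check the analytic gradient against central finite differences
eps = 1e-6
num = np.array([(loss(W + eps * e, X, d) - loss(W - eps * e, X, d)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(grad(W, X, d), num, atol=1e-6))   # True
```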
– Shifting the threshold from T1 to T2 does not change the classification error
– So the error does not indicate whether moving the threshold left was good or not
– “Distance” = divergence
– Perturbing the function changes this quantity, indicating whether the perturbation was good or not
Example: three training points — (1,0) labeled +1, (0,1) labeled +1, and (−1,0) labeled −1.
– E.g. a = σ⁻¹(0.99), representing a 99% confidence in the class
The training targets require:

w₁·1 + w₂·0 + 0 = a
w₁·0 + w₂·1 + 0 = a
w₁·(−1) + w₂·0 + 0 = −a

– The solution (w₁ = a, w₂ = a) represents a unique line regardless of the value of a
A fourth training point (0, −t), labeled +1, is added.
Consider the derivative of the L2 error (Notation: y = σ(·), the logistic activation):

Err = (1 − e − σ(−w₂t + b))²

dErr/dw₂ = 2(1 − e − σ(−w₂t + b)) σ′(−w₂t + b) t
dErr/db = −2(1 − e − σ(−w₂t + b)) σ′(−w₂t + b)
Here 1 − e is the actual achievable value of the output (the logistic saturates below 1).
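The derivatives above can be checked numerically. A small sketch (assuming t = 1 and e = 0.01, values chosen only for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def err(w2, b, t=1.0, e=0.01):
    """L2 error for the point (0, -t) with achievable target 1 - e."""
    return (1.0 - e - sigmoid(-w2 * t + b)) ** 2

def derr_dw2(w2, b, t=1.0, e=0.01):
    s = sigmoid(-w2 * t + b)
    return 2.0 * (1.0 - e - s) * s * (1.0 - s) * t     # sigma'(z) = s (1 - s)

def derr_db(w2, b, t=1.0, e=0.01):
    s = sigmoid(-w2 * t + b)
    return -2.0 * (1.0 - e - s) * s * (1.0 - s)

# Sanity-check the analytic derivatives with central differences
w2, b, eps = 0.7, -0.3, 1e-6
num_w = (err(w2 + eps, b) - err(w2 - eps, b)) / (2 * eps)
num_b = (err(w2, b + eps) - err(w2, b - eps)) / (2 * eps)
print(abs(num_w - derr_dw2(w2, b)) < 1e-6, abs(num_b - derr_db(w2, b)) < 1e-6)   # True True
```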
– The minimum of the divergence near the optimal solution for the 3-point problem remains a local optimum (with near-zero gradient) for the 4-point problem!
– It will be found by backprop nearly all the time
Note that the four points are linearly separable: this is a case where the perceptron succeeds.
– The perceptron finds the linear separator, but backprop does not find a separator; the two solutions can differ significantly
– Assuming weights are constrained to be bounded
– 1000 training points, 660 hidden neurons: the network is heavily overdesigned, even for shallow nets
– The behavior also depends on depth and on the amount of data
(Plots: networks with 3, 4, 6, and 11 layers, trained on 10000 training instances.)
– In large networks, saddle points are far more common than local minima
– Most local minima are equivalent; this is not true for small networks
– At a saddle point the slope is zero, and the surface increases in some directions but decreases in others
– Gradient descent algorithms often get “stuck” at saddle points
– Baldi and Hornik (1989), “Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima”: an MLP with a single hidden layer has only saddle points and no local minima
– Dauphin et al. (2014), “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”: there are exponentially many saddle points in large networks
– Choromanska et al. (2015), “The Loss Surfaces of Multilayer Networks”: in large networks, most local minima lie in a band and are equivalent; based on analysis of spin-glass models
– However, for networks of finite size, trained on finite data, you can have horrible local minima
– Even when a “true” solution exists and lies within the capacity of the network to model, the optimum of the loss function may not be that solution
– The loss surface has local optima and unpleasant saddle points, which backpropagation may find
A function is convex if we can connect any two points on or above it with a straight line without intersecting it
– Many equivalent mathematical definitions exist
– We focus on convex functions because they are analyzable: the “streetlight effect”
(Figure: contour plot of a convex function.)
Gradient descent converges to a solution if the value updates arrive at a fixed point
– Where the gradient is 0 and further updates do not change the estimate
Gradient descent need not converge:
– It may jitter around the local minimum
– It may even diverge
(Figure: converging, jittering, and diverging behavior.)
Convergence rate: how fast the iterations arrive at the solution, measured by

R = (f(x⁽ᵏ⁺¹⁾) − f(x*)) / (f(x⁽ᵏ⁾) − f(x*))

– x⁽ᵏ⁾ is the k-th iterate, and x* is the optimal value of x
– If R is bounded by a constant less than 1, the convergence is linear
– In reality, this means arriving at the solution exponentially fast:

f(x⁽ᵏ⁾) − f(x*) ≤ Rᵏ (f(x⁽⁰⁾) − f(x*))
Example: gradient descent with fixed step size η to estimate a scalar parameter w, minimizing f(w).
Second-order Taylor expansion around w⁽ᵏ⁾:

f(w) ≈ f(w⁽ᵏ⁾) + f′(w⁽ᵏ⁾)(w − w⁽ᵏ⁾) + ½ f″(w⁽ᵏ⁾)(w − w⁽ᵏ⁾)²

Minimizing this quadratic gives

w_min = w⁽ᵏ⁾ − f′(w⁽ᵏ⁾) / f″(w⁽ᵏ⁾)

Comparing with the gradient descent update w⁽ᵏ⁺¹⁾ = w⁽ᵏ⁾ − η df(w⁽ᵏ⁾)/dw shows that we can arrive at the optimum in a single step using the optimum step size

η_opt = f″(w⁽ᵏ⁾)⁻¹
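A one-line demonstration of this fact on an arbitrary quadratic (the coefficients are invented for the example):

```python
def f(w):            # an arbitrary 1-D quadratic: f(w) = 2 w^2 - 3 w + 1
    return 2.0 * w * w - 3.0 * w + 1.0

def df(w):           # its first derivative
    return 4.0 * w - 3.0

d2f = 4.0            # its second derivative (constant for a quadratic)

w = 10.0                      # arbitrary starting point
eta_opt = 1.0 / d2f           # optimal step size: inverse second derivative
w = w - eta_opt * df(w)       # a single step...
print(w)                      # 0.75, the true minimum of f (df(0.75) == 0)
```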
The same result in derivative notation, from the Taylor expansion

f(w) ≈ f(w⁽ᵏ⁾) + (w − w⁽ᵏ⁾) df(w⁽ᵏ⁾)/dw + ½ (w − w⁽ᵏ⁾)² d²f(w⁽ᵏ⁾)/dw² + ⋯

gives the optimum step size η_opt = (d²f(w⁽ᵏ⁾)/dw²)⁻¹.
In multiple dimensions, consider a quadratic

f(w) = ½ wᵀAw + wᵀb + c

– Since fᵀ = f (f is a scalar), A can always be made symmetric
If A is diagonal:

f(w) = ½ Σᵢ aᵢᵢwᵢ² + Σᵢ bᵢwᵢ + c

– The wᵢ are uncoupled
– For convex (paraboloid) f, the aᵢᵢ values are all positive
– f is just a sum of independent quadratic functions, one per dimension
– All “slices” parallel to an axis are shifted versions of one another:

f(w) = ½ aᵢᵢwᵢ² + bᵢwᵢ + c + C(w₍¬ᵢ₎)

where C(w₍¬ᵢ₎) collects the terms that do not involve wᵢ.
– I.e. we could optimize each coordinate independently:

f = ½ a₁₁w₁² + b₁w₁ + c + C(w₍¬₁₎)
f = ½ a₂₂w₂² + b₂w₂ + c + C(w₍¬₂₎)
…

with per-coordinate optimum step sizes η₁,opt = 1/a₁₁, η₂,opt = 1/a₂₂, …
The per-coordinate update is

wᵢ⁽ᵏ⁺¹⁾ = wᵢ⁽ᵏ⁾ − ηᵢ df(w⁽ᵏ⁾)/dwᵢ

With a single global step size η, we need η < 2ηᵢ,opt in every direction i
– Otherwise the learning will diverge
– And it will oscillate in all directions where ηᵢ,opt ≤ η < 2ηᵢ,opt
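The three regimes can be seen on a toy one-dimensional quadratic; a sketch with an invented curvature a = 2, so η_opt = 0.5:

```python
def df(w):
    """Derivative of f(w) = 0.5 * a * w^2 with a = 2, so eta_opt = 1/a = 0.5."""
    return 2.0 * w

def run(eta, steps=50):
    w = 1.0
    for _ in range(steps):
        w = w - eta * df(w)          # fixed-step gradient descent
    return w

print(abs(run(0.25)) < 1e-10)    # eta < eta_opt: converges monotonically (True)
print(abs(run(0.75)) < 1e-10)    # eta_opt <= eta < 2 eta_opt: oscillates, still converges (True)
print(abs(run(1.25)) > 1e6)      # eta > 2 eta_opt: diverges (True)
```

With η between η_opt and 2η_opt, the single-step iterate overshoots the minimum (its sign flips each step) yet the magnitude still shrinks.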
– With a single step size, convergence is slow when the ratio a_max/a_min of the largest to the smallest second derivative is large
– Quadratic functions form a benchmark: the convergence of gradient descent on them is linear
– For general functions, convergence is characterized locally, between the current location and the nearest local minimum
– The relevant properties are strong convexity, Lipschitz continuity, and Lipschitz smoothness, and how they affect the convergence of gradient descent
Quadratic convex functions f(w) = ½ wᵀAw + wᵀb + c:
– Every “slice” is a quadratic bowl
– Other convex functions will be steeper in some regions, but flatter in others
– Gradient descent takes O(log 1/ε) steps to get within ε of the optimal solution
Strongly convex functions:
– Have a lower bound on the second derivative
– At any location, you can draw a quadratic bowl of fixed convexity (quadratic constant equal to the lower bound of the 2nd derivative) touching the function at that point, which contains it
Lipschitz continuity:
– The slope of the bounding cone surface is the Lipschitz constant L: |f(x) − f(y)| ≤ L|x − y|
(Figure from Wikipedia.)
Lipschitz smooth functions:
– Need not be convex (or even differentiable)
– Have an upper bound on the second derivative (where it exists)
– At any location, a quadratic bowl touching the function can contain it, provided the minimum curvature of the quadratic is ≥ the upper bound on the function’s second derivative
In summary:
– Strongly convex functions: the second derivative has upper and lower bounds; convergence depends on the curvature of strong convexity (at least linear)
– Lipschitz smooth functions: convex, but with only an upper bound on the second derivative; weaker convergence guarantees, if any (at best linear). This is often a reasonable assumption for the local structure of your loss function

Convergence of gradient descent in this setting is, at best, linearly fast:
– f(w⁽ᵏ⁾) − f(w*) ∝ (1/k) (f(w⁽⁰⁾) − f(w*))
– And inversely proportional to the learning rate: f(w⁽ᵏ⁾) − f(w*) ≤ ‖w⁽⁰⁾ − w*‖² / (2ηk)
– It takes O(1/ε) iterations to get to within ε of the solution
– An inappropriate learning rate will destroy your happiness
– Convergence behavior will still depend on the nature of the original function
– Different curvatures in different directions result in different optimal learning rates for different directions
– The problem is more difficult when the ellipsoid is not axis-aligned: the steps along the two directions are coupled! Moving in one direction changes the gradient along the other
– If we scale the axes so that all directions have identical curvature, then all of them will have identical optimal learning rates, and it is easier to find a working learning rate
– After scaling, equal-value contours are circular, and movement along the coordinate axes becomes independent
– In the scaled variables ŵᵢ = sᵢwᵢ (i.e. ŵ = Sw), the quadratic can again be written as f = ½ ŵᵀÂŵ + ŵᵀb̂ + c
A generic quadratic has cross-terms:

f(w) = ½ wᵀAw + wᵀb + c = ½ Σᵢ aᵢᵢwᵢ² + Σ₍ᵢ₍₎ⱼ₎ aᵢⱼwᵢwⱼ + Σᵢ bᵢwᵢ + c

– Because of the cross-terms aᵢⱼwᵢwⱼ, the equal-value contours are rotated ellipsoids
– The major axes of the ellipsoids are the eigenvectors of A, and their diameters are proportional to the eigenvalues of A
– This is merely a rotation of the space from the axis-aligned case
– The component-wise optimal learning rates along the major and minor axes of the equal-contour ellipsoids will be different, causing problems
– The optimal learning rates along the axes are inversely proportional to the eigenvalues of A
– We can normalize all the directions to obtain the same normalized update rule as before:

w⁽ᵏ⁺¹⁾ = w⁽ᵏ⁾ − η A⁻¹ ∇_w f(w⁽ᵏ⁾)ᵀ
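A sketch of the normalized update on an invented quadratic with cross-terms; with η = 1 it reaches the optimum in a single step:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite, with cross-terms
b = np.array([1.0, -1.0])

def grad(w):                        # gradient of f(w) = 0.5 w^T A w + w^T b + c
    return A @ w + b

w = np.array([5.0, -7.0])                    # arbitrary start
w = w - 1.0 * np.linalg.solve(A, grad(w))    # normalized update with eta = 1
print(grad(w))                               # ~[0, 0]: the optimum in one step
```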
For a generic differentiable function, take the second-order Taylor expansion:

f(w) ≈ f(w⁽ᵏ⁾) + ∇f(w⁽ᵏ⁾)(w − w⁽ᵏ⁾) + ½ (w − w⁽ᵏ⁾)ᵀ H_f(w⁽ᵏ⁾) (w − w⁽ᵏ⁾) + ⋯

Comparing with the quadratic ½ wᵀAw + wᵀb + c, the Hessian H_f(w⁽ᵏ⁾) plays the role of A, giving Newton’s method:

w⁽ᵏ⁺¹⁾ = w⁽ᵏ⁾ − η H_f(w⁽ᵏ⁾)⁻¹ ∇_w f(w⁽ᵏ⁾)ᵀ

– The optimal step size is now η = 1
– And it should not be greater than 2!
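A sketch of Newton iterations with η = 1 on an invented non-quadratic scalar function:

```python
import math

def df(w):                               # derivative of f(w) = sin(w) + 0.25 w^2
    return math.cos(w) + 0.5 * w

def d2f(w):                              # second derivative
    return -math.sin(w) + 0.5

w = 0.0
for _ in range(20):                      # Newton iterations with eta = 1
    w = w - df(w) / d2f(w)               # fit a quadratic, jump to its minimum
print(w, df(w))                          # df(w) ~ 0 at convergence
```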
Newton’s method, in other words: fit a quadratic at each point (with η = 1), move to the minimum of that quadratic, and repeat.
(Sequence of figures: successive quadratic fits converging to the minimum of f(w).)
– Broyden-Fletcher-Goldfarb-Shanno (BFGS)
– Levenberg-Marquardt
– Other “quasi-Newton” methods
Note: this is actually a reduced step size
– Linear decay: ηₖ = η₀/(k + 1)
– Quadratic decay: ηₖ = η₀/(k + 1)²
– Exponential decay: ηₖ = η₀e^(−βk), where β > 0
A popular alternative (step decay):
1. Train with a fixed learning rate η until the loss (or performance on a held-out data set) stagnates
2. η ← αη, where α < 1 (typically 0.1)
3. Return to step 1 and continue training from where we left off
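The schedules can be sketched as small helper functions (the η₀, β, and α values are illustrative):

```python
import math

def linear_decay(eta0, k):
    return eta0 / (k + 1)

def quadratic_decay(eta0, k):
    return eta0 / (k + 1) ** 2

def exponential_decay(eta0, k, beta=0.1):
    return eta0 * math.exp(-beta * k)

def step_decay(eta, stagnated, alpha=0.1):
    """Multiply eta by alpha < 1 whenever held-out performance stagnates."""
    return alpha * eta if stagnated else eta

print(linear_decay(0.1, 9), quadratic_decay(0.1, 9), step_decay(0.1, stagnated=True))
```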
– A single global learning rate may be too high for some components and too low for others
One family of algorithms instead maintains a separate step size for each parameter:

wᵢ⁽ᵏ⁺¹⁾ = wᵢ⁽ᵏ⁾ − ηᵢ df(w⁽ᵏ⁾)/dwᵢ
RProp: adjust the step size for each component independently
– I.e. steps in different directions are not coupled
– If the derivative at the current location recommends continuing in the same direction as before (i.e. has not changed sign from earlier): increase the step size
– If the derivative has changed sign (i.e. we’ve overshot a minimum): reduce the step size and backtrack
Illustration:
– Take an initial step Δw against the derivative: w = w − Δw
– While the derivative keeps the same sign, scale the step up by a factor a > 1 each iteration: w = w − aΔw, then w = w − a²Δw, and so on
(Figures: successive steps Δw⁰, aΔw⁰, a²Δw⁰ on the curve E(w); the orange arrow shows the direction of the derivative, i.e. the direction of increasing E(w).)
– If the derivative has changed sign:
– Return to the previous location: w = w + Δw
– Shrink the step: Δw = bΔw, with b < 1
– Take the smaller step forward: w = w − Δw
(Figures: the overshooting step is undone and retried with progressively smaller steps; the orange arrow shows the direction of the derivative, i.e. the direction of increasing E(w).)
Rprop pseudocode (with a ceiling and floor on the step):

Initialize w₍i,j,k₎ and Δw₍i,j,k₎ > 0
prevD(i, j, k) = ∂Loss(w₍i,j,k₎)/∂w₍i,j,k₎
Δw₍i,j,k₎ = sign(prevD(i, j, k)) · Δw₍i,j,k₎
While not converged:
    w₍i,j,k₎ = w₍i,j,k₎ − Δw₍i,j,k₎
    D(i, j, k) = ∂Loss(w₍i,j,k₎)/∂w₍i,j,k₎
    If sign(prevD(i, j, k)) == sign(D(i, j, k)):
        Δw₍i,j,k₎ = min(aΔw₍i,j,k₎, Δ_max)    (a > 1)
        prevD(i, j, k) = D(i, j, k)
    Else:
        w₍i,j,k₎ = w₍i,j,k₎ + Δw₍i,j,k₎
        Δw₍i,j,k₎ = max(bΔw₍i,j,k₎, Δ_min)    (b < 1)
The basic variant is identical but without the ceiling Δ_max and floor Δ_min: on matching signs Δw₍i,j,k₎ = aΔw₍i,j,k₎, and on a sign flip Δw₍i,j,k₎ = bΔw₍i,j,k₎.
– The derivatives are obtained via backprop
– Note: different parameters are updated independently
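A minimal single-parameter sketch of the Rprop idea (the function being minimized and all constants are invented; a real implementation runs one such update per weight):

```python
def rprop(dloss, w, delta=0.1, a=1.2, b=0.5, dmax=50.0, dmin=1e-6, steps=200):
    """Rprop on a single parameter; dloss(w) returns dLoss/dw at w.

    The signed step grows by a while the derivative keeps its sign, and
    shrinks by b (undoing the last step) when the sign flips.
    """
    prev = dloss(w)
    step = delta if prev >= 0 else -delta            # step carries the sign of dLoss/dw
    for _ in range(steps):
        w = w - step
        d = dloss(w)
        if d * prev > 0:                             # same direction: accelerate
            step = max(min(a * abs(step), dmax), dmin) * (1 if d > 0 else -1)
            prev = d
        else:                                        # overshot: undo and slow down
            w = w + step
            step = max(min(b * abs(step), dmax), dmin) * (1 if prev > 0 else -1)
    return w

# Minimize f(w) = (w - 3)^2, so dLoss/dw = 2 (w - 3)
w_star = rprop(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_star)   # close to 3
```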
QuickProp employs the Newton update, but with a finite-difference approximation to the second derivative:

w⁽ᵏ⁺¹⁾ = w⁽ᵏ⁾ − E″(w⁽ᵏ⁾)⁻¹ E′(w⁽ᵏ⁾)
Within each component, QuickProp fits a quadratic to E as a function of wᵢ alone (holding all wⱼ, j ≠ i, fixed at their current values) and jumps toward that quadratic’s minimum.
For the weight w₍i,j₎ from node j in one layer to node i in the next, the second derivative is approximated by a finite difference of first derivatives, giving:

w₍i,j₎⁽ᵏ⁺¹⁾ = w₍i,j₎⁽ᵏ⁾ − Δw₍i,j₎⁽ᵏ⁾

Δw₍i,j₎⁽ᵏ⁾ = Δw₍i,j₎⁽ᵏ⁻¹⁾ · E′(w₍i,j₎⁽ᵏ⁾) / (E′(w₍i,j₎⁽ᵏ⁻¹⁾) − E′(w₍i,j₎⁽ᵏ⁾))
– The first derivatives E′ are computed using backprop
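A single-weight sketch of this update, i.e. a secant iteration on E′; the objective E is invented for the example:

```python
def quickprop(dE, w0, dw0=0.1, steps=25):
    """QuickProp-style update for a single weight.

    Uses  dw_k = dw_{k-1} * E'(w_k) / (E'(w_{k-1}) - E'(w_k)),
    a secant (finite-difference) approximation to Newton's method.
    """
    w_prev = w0
    w = w0 - dw0                     # one initial plain step of size dw0
    dw = dw0                         # invariant: dw == w_prev - w
    for _ in range(steps):
        num = dE(w)
        den = dE(w_prev) - dE(w)
        if abs(den) < 1e-12:         # secant slope vanished: converged
            break
        dw = dw * num / den
        w_prev, w = w, w - dw
    return w

# Minimize E(w) = (w - 2)^4 + w^2, so E'(w) = 4 (w - 2)^3 + 2 w (convex)
w_star = quickprop(lambda w: 4.0 * (w - 2.0) ** 3 + 2.0 * w, w0=0.0)
print(w_star, 4.0 * (w_star - 2.0) ** 3 + 2.0 * w_star)   # derivative ~ 0
```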
Story so far:
– Gradient descent can miss obvious answers
– And this may be a good thing
With a single learning rate, the updates converge smoothly in some directions, but oscillate or diverge in others. A proposal:
– Keep track of oscillations
– Emphasize steps in directions that converge smoothly
– Shrink steps in directions that bounce around
Maintain a running average of the steps:
– In directions in which the convergence is smooth, the average will have a large value
– In directions in which the estimate swings, the positive and negative swings will cancel out in the average
The momentum method maintains a running average of all gradients until the current step:

Δw⁽ᵏ⁾ = βΔw⁽ᵏ⁻¹⁾ − η∇Loss(w⁽ᵏ⁻¹⁾)ᵀ
w⁽ᵏ⁾ = w⁽ᵏ⁻¹⁾ + Δw⁽ᵏ⁾

– The typical β value is 0.9
– The steps get longer in directions where the gradient retains the same sign
– They become shorter in directions where the sign keeps flipping
(Figure: plain gradient update vs. update with momentum.)
Plain batch gradient descent over training instances (X₁, d₁), …, (X_N, d_N):
– For every layer k, initialize ∇Loss = 0
– For all i = 1:N
    – Compute the output Yᵢ and the gradient ∇Div(Yᵢ, dᵢ)
    – ∇Loss += (1/N) ∇Div(Yᵢ, dᵢ)
– For every layer k: Wₖ = Wₖ − η(∇Loss)ᵀ
The same loop with momentum:
– For every layer k, initialize ∇Loss = 0, ΔWₖ = 0
– For all i = 1:N
    – Compute the output Yᵢ and the gradient ∇Div(Yᵢ, dᵢ)
    – ∇Loss += (1/N) ∇Div(Yᵢ, dᵢ)
– For every layer k:
    ΔWₖ = βΔWₖ − η(∇Loss)ᵀ
    Wₖ = Wₖ + ΔWₖ
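The momentum loop can be sketched for a generic gradient function (the narrow-valley quadratic objective is invented; it stands in for the full backprop gradient):

```python
import numpy as np

def momentum_descent(grad, w, eta=0.05, beta=0.9, steps=300):
    """Gradient descent with momentum: dw = beta*dw - eta*grad(w); w += dw."""
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = beta * dw - eta * grad(w)
        w = w + dw
    return w

# Narrow-valley quadratic f(w) = 0.5 * (20 w0^2 + w1^2): curvatures differ 20x
grad_f = lambda w: np.array([20.0 * w[0], w[1]])
w = momentum_descent(grad_f, np.array([1.0, 1.0]))
print(w)   # near [0, 0]
```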
The momentum method:
– First computes the gradient step at the current location
– Then adds in the scaled previous step
– To get the final step
– First: we take a step against the gradient at the current location
– Second: then we add a scaled version of the previous step
Nesterov’s accelerated gradient reverses this order: first extend by the scaled previous step, then take a step against the gradient computed at the resultant position.
Nesterov’s accelerated gradient over training instances (X₁, d₁), …, (X_N, d_N):
– For all layers k, initialize ∇Loss = 0, ΔWₖ = 0
– For every layer k (look ahead along the previous step):
    Wₖ = Wₖ + βΔWₖ
– For all i = 1:N
    – Compute the output Yᵢ and the gradient ∇Div(Yᵢ, dᵢ)
    – ∇Loss += (1/N) ∇Div(Yᵢ, dᵢ)
– For every layer k:
    Wₖ = Wₖ − η(∇Loss)ᵀ
    ΔWₖ = βΔWₖ − η(∇Loss)ᵀ
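The same sketch with Nesterov’s look-ahead (again with an invented quadratic standing in for the backprop gradient):

```python
import numpy as np

def nesterov_descent(grad, w, eta=0.05, beta=0.9, steps=300):
    """Nesterov's accelerated gradient: look ahead along the previous
    step, then correct with the gradient at the look-ahead point."""
    dw = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w + beta * dw)          # gradient at the look-ahead position
        dw = beta * dw - eta * g
        w = w + dw
    return w

grad_f = lambda w: np.array([20.0 * w[0], w[1]])   # f = 0.5*(20 w0^2 + w1^2)
w = nesterov_descent(grad_f, np.array([1.0, 1.0]))
print(w)   # near [0, 0]
```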
Story so far:
– Gradient descent can miss obvious answers, and this may be a good thing
– Vanilla gradient descent may be too slow or unstable due to differences between the dimensions
– Second-order methods “normalize” the variation across dimensions, but are complex
– Momentum methods which emphasize directions of steady improvement are demonstrably superior to other methods