Numerical Computation for Deep Learning
Lecture slides for Chapter 4 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last modified 2017-10-14
Thanks to Justin Gilmer and Jacob Buckman for helpful discussions
(Goodfellow 2017)
Numerical concerns for implementations of deep learning algorithms:
- Algorithms are often specified in terms of real numbers; real numbers cannot be implemented in a finite computer
- Does the algorithm still work when implemented with a finite number of bits?
- Do small changes in the input to a function cause large changes to an output? (Rounding errors, noise, numerical stability)
Figure 4.1: Gradient descent uses the derivative of f(x) = ½x², for which f′(x) = x. The global minimum is at x = 0; since f′(x) = 0 there, gradient descent halts. For x < 0, we have f′(x) < 0, so we can decrease f by moving rightward. For x > 0, we have f′(x) > 0, so we can decrease f by moving leftward.
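The descent rule in Figure 4.1 can be checked with a minimal sketch; the starting point and step size below are assumed values, not from the slides:

```python
# Gradient descent on f(x) = 0.5 * x**2, whose derivative is f'(x) = x.
# Each step moves opposite the derivative, toward the minimum at x = 0.
def f_prime(x):
    return x

x = 2.0        # starting point (assumed)
epsilon = 0.1  # learning rate (assumed)
for _ in range(100):
    x = x - epsilon * f_prime(x)

print(x)  # very close to the global minimum at x = 0
```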
Figure 4.3: Approximate minimization (f(x) vs. x). Ideally, we would like to arrive at the global minimum, but this might not be possible. One local minimum performs nearly as well as the global one, so it is an acceptable halting point; another local minimum performs poorly and should be avoided.
[Plots: gradient norm vs. training time (epochs), and classification error rate vs. training time (epochs)]
Critical points have zero gradient. A local minimum is a point where the value is locally smallest, so no infinitesimal step can decrease f.
Figure 4.2: Types of critical points: a minimum, a maximum, and a saddle point.
Figure 4.5: A saddle point containing both positive and negative curvature (surface plot of f(x1, x2) over x1, x2 in [−15, 15]).
Saddle points attract Newton’s method. (Gradient descent escapes; see Appendix C of “Qualitatively Characterizing Neural Network Optimization Problems”.)
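A small NumPy sketch (an assumed example, not from the slides) shows the contrast on the quadratic f(x) = 0.5(x1² − x2²): the Newton step lands exactly on the saddle, while gradient descent drifts away along the negative-curvature direction:

```python
import numpy as np

# f(x) = 0.5 * (x1**2 - x2**2) has a saddle point at the origin:
# positive curvature along x1, negative curvature along x2.
H = np.diag([1.0, -1.0])     # the (constant) Hessian of f
grad = lambda x: H @ x       # the gradient of f

x0 = np.array([1.0, 0.5])

# Newton's method jumps straight to the critical point, i.e. the saddle.
x_newton = x0 - np.linalg.solve(H, grad(x0))

# Gradient descent shrinks x1 but grows x2, escaping along the
# negative-curvature direction.
x_gd = x0.copy()
for _ in range(50):
    x_gd = x_gd - 0.1 * grad(x_gd)

print(x_newton)  # [0. 0.]
print(x_gd)      # the x2 coordinate has grown large
```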
Figure 4.4: Curvature. Three functions plotted as f(x) vs. x: one with negative curvature, one with no curvature, and one with positive curvature.
Effect of eigenvectors and eigenvalues: before multiplication, the eigenvectors v(1) and v(2) are unit vectors; after multiplication by the matrix, space is scaled along each eigenvector v(i) by the corresponding eigenvalue, giving λ1 v(1) and λ2 v(2).
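The scaling behavior can be verified numerically; the symmetric matrix below is an assumed example:

```python
import numpy as np

# Multiplying by a symmetric matrix scales space along each eigenvector
# by the corresponding eigenvalue: A @ v = lam * v.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lams, V = np.linalg.eigh(A)  # eigenvalues and orthonormal eigenvectors

for lam, v in zip(lams, V.T):
    assert np.allclose(A @ v, lam * v)

print(lams)  # [1. 3.]
```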
f(x^(0) − εg) ≈ f(x^(0)) − ε gᵀg + (1/2) ε² gᵀHg.   (4.9)

Big gradients speed you up; big eigenvalues slow you down if you align with their eigenvectors.
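Equation 4.9 can be checked numerically: on a quadratic, the second-order expansion is exact. The H, b, x0, and ε values below are assumed for illustration:

```python
import numpy as np

# Check equation 4.9 on a quadratic f(x) = 0.5 * x @ H @ x + b @ x,
# where the second-order Taylor expansion is exact.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ H @ x + b @ x

x0 = np.array([1.0, 2.0])
g = H @ x0 + b   # gradient of f at x0
eps = 0.1

lhs = f(x0 - eps * g)                                  # true value after the step
rhs = f(x0) - eps * g @ g + 0.5 * eps**2 * g @ H @ g   # equation 4.9
assert np.isclose(lhs, rhs)
```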
Condition number of the Hessian: max over i, j of |λi / λj|.

When the condition number is large, you sometimes hit large eigenvalues and sometimes small ones. The large ones force you to keep the learning rate small, and you miss out on moving fast in the small-eigenvalue directions.
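A quick NumPy sketch (assumed condition number of 100) shows the effect: the learning rate must be small enough for the largest eigenvalue, so the small-eigenvalue direction crawls:

```python
import numpy as np

# f(x) = 0.5 * x @ H @ x with eigenvalues 1 and 100 (condition number 100).
H = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])

# Stability requires eps < 2 / 100, so each step multiplies the
# small-eigenvalue coordinate by only (1 - 0.01): slow progress there.
eps = 0.01
for _ in range(100):
    x = x - eps * (H @ x)

print(x)  # second coordinate converged immediately; first is still far from 0
```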
Figure 4.6: Gradient descent on a quadratic with a poorly conditioned Hessian (plotted over x1, x2) zig-zags across the narrow valley instead of moving along it.
At the end of learning (from “Qualitatively Characterizing Neural Network Optimization Problems”):
min_x max_λ max_{α, α≥0} f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x).   (4.19)
In this book, the KKT machinery is mostly used for theory (e.g., to show that the Gaussian is the highest-entropy distribution). In practice, we usually just project back to the constraint region after each step.
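Projecting back after each step can be sketched as follows; the constraint (a unit ball) and the objective are hypothetical examples, not from the slides:

```python
import numpy as np

# Hypothetical sketch: minimize f(x) = ||x - c||^2 subject to ||x|| <= 1
# by taking a gradient step and then projecting back onto the unit ball.
def project_to_ball(x, radius=1.0):
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

c = np.array([3.0, 0.0])        # unconstrained optimum, outside the ball
grad = lambda x: 2.0 * (x - c)

x = np.zeros(2)
for _ in range(200):
    x = project_to_ball(x - 0.1 * grad(x))

print(x)  # the feasible point closest to c: [1. 0.]
```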
Numerical precision is a deep learning super skill. Often an algorithm “sort of works”: the loss goes down and accuracy lands within a few percentage points of state-of-the-art. Fixing numerical problems (NaNs, extreme values) is often what closes the gap.
In a digital computer, we use float32 or similar schemes to represent real numbers. A real number x is rounded to x + δ for some small δ.
have no effect. This can cause large changes downstream:

>>> import numpy as np
>>> a = np.array([0., 1e-8]).astype('float32')
>>> a.argmax()
1
>>> (a + 1).argmax()
0
Underflow to 0 and overflow to inf then poison later operations: division, logarithm, etc.
Subtracting nearly equal quantities suffers badly from rounding error. Compare two mathematically equivalent formulas for variance:

Var(f(x)) = E[(f(x) − E[f(x)])²]   (3.12)   ← Safe
          = E[f(x)²] − E[f(x)]²              ← Dangerous
argument!
round to negative?
tf.log(tf.reduce_sum(tf.exp(array)))

The exp underflows… and then the log is -inf.
mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))

Built-in version: tf.reduce_logsumexp
m + log Σ_i exp(a_i − m)
= m + log Σ_i [exp(a_i) / exp(m)]
= m + log [(1 / exp(m)) Σ_i exp(a_i)]
= m − log exp(m) + log Σ_i exp(a_i)
= log Σ_i exp(a_i)
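The identity above translates directly into a stable NumPy implementation (a sketch; SciPy also ships scipy.special.logsumexp):

```python
import numpy as np

# Stable log-sum-exp: subtract the max m before exponentiating, add it back.
def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1000.0])
# Naive np.log(np.sum(np.exp(a))) would overflow to inf.
print(logsumexp(a))  # 1000 + log(2) = 1000.693...
```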
safe_logits = logits - tf.reduce_max(logits)
softmax = tf.nn.softmax(safe_logits)
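The same trick can be demonstrated in NumPy; the logits below are an assumed example, and the naive version would normally emit an overflow warning:

```python
import numpy as np

# Subtracting the max leaves the softmax unchanged (the shift cancels in
# the ratio) but keeps exp from overflowing.
logits = np.array([1000.0, 0.0])

with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(logits) / np.sum(np.exp(logits))  # inf / inf = nan

safe = np.exp(logits - np.max(logits))
safe = safe / np.sum(safe)

print(naive)  # [nan  0.]
print(safe)   # [1. 0.] to float precision
```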
The sigmoid is equivalent to a two-class softmax with one of the logits hard-coded to 0.
Compute cross-entropy with a built-in function that has a stable softmax and logsumexp in it. A naive implementation takes the log of a probability that has already underflowed to 0 when the softmax saturates; working directly on the logits with logsumexp avoids this.
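A hypothetical sketch of this idea (the function name and example values are mine, not from the slides): computing cross-entropy straight from logits so the log never sees an underflowed probability.

```python
import numpy as np

# Hypothetical sketch: cross-entropy straight from logits. Using
# log softmax(a)[k] = a[k] - logsumexp(a), the loss never takes the log
# of an underflowed probability.
def cross_entropy_from_logits(logits, label):
    m = np.max(logits)
    lse = m + np.log(np.sum(np.exp(logits - m)))  # stable logsumexp
    return lse - logits[label]

logits = np.array([1000.0, 0.0])
# The probability of class 1 underflows to 0 in the softmax, so a
# probability-based loss would be log(0) = -inf; this version stays finite.
print(cross_entropy_from_logits(logits, 1))  # 1000.0
```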
If your model gets stuck, you are probably rounding your gradient to zero somewhere: maybe you are computing cross-entropy using probabilities instead of logits. As a sanity check, an extremely high learning rate should usually cause an explosion; if it does not, your gradient is probably zero.
To localize a numerical bug, inspect the values of the tensors you suspect: