Training Neural Nets
COMPSCI 371D — Machine Learning
COMPSCI 371D — Machine Learning Training Neural Nets 1 / 40
Outline
1 The Softmax Simplex
2 Loss and Risk
3 Back-Propagation
4 Stochastic Gradient Descent
5 Regularization
6 Network Depth and Batch Normalization
7 Experiments with SGD
The Softmax Simplex
\[ p = \frac{\exp(z)}{\mathbf{1}^T \exp(z)} \in \mathbb{R}^e \]
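The simplex is defined as
\[ \left\{ p \in \mathbb{R}^e \;:\; p \geq 0 \ \text{and} \ \sum_{i=1}^{e} p_i = 1 \right\} \]
[Figure: the simplex for e = 2, a segment in the (p_1, p_2) plane with midpoint (1/2, 1/2), and for e = 3, a triangle in (p_1, p_2, p_3) space with center (1/3, 1/3, 1/3).]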
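The softmax formula and the simplex constraints can be checked numerically. The following is a minimal sketch in plain Python; the function names `softmax` and `on_simplex` are illustrative and not from the slides, and subtracting the maximum activation is a standard stability trick added here as an assumption (it does not change the result).

```python
import math

def softmax(z):
    """Return exp(z) / (1^T exp(z)) for a list of real activations z."""
    # Subtracting max(z) leaves the ratio unchanged and avoids overflow.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)  # this is 1^T exp(z - m)
    return [v / total for v in exps]

def on_simplex(p, tol=1e-12):
    """Check that p >= 0 entrywise and that the entries sum to 1."""
    return all(v >= 0.0 for v in p) and abs(sum(p) - 1.0) <= tol

p = softmax([2.0, 1.0, 0.1])
print(p, on_simplex(p))
```

Equal activations map to the center of the simplex, for example (1/2, 1/2) when e = 2.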
Loss and Risk
\[ L_T(w) = \frac{1}{N} \sum_{n=1}^{N} \ell_n(w) \]
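As a small illustration, the risk is the average of the per-sample losses. The sketch below assumes a cross-entropy loss on predicted class probabilities; the slides as extracted do not show which loss is used, so `cross_entropy` and the toy data are assumptions for illustration only.

```python
import math

def cross_entropy(y, p):
    """Assumed per-sample loss: negative log-probability of the true class y."""
    return -math.log(p[y])

def risk(labels, predictions, loss=cross_entropy):
    """Average the per-sample losses l_n over the N training samples."""
    losses = [loss(y, p) for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)

# Two samples with two classes each; predictions are probability vectors.
labels = [0, 1]
predictions = [[0.5, 0.5], [0.25, 0.75]]
print(risk(labels, predictions))
```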
Back-Propagation
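For $k = K, \ldots, 1$:
\[ \frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}} \]
\[ \frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}} \]
\[ \frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p} \]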
The local Jacobians, $\frac{\partial x^{(k)}}{\partial w^{(k)}}$ and $\frac{\partial x^{(k)}}{\partial x^{(k-1)}}$, are computed for each layer $k$. The recursion starts from $\frac{\partial \ell_n}{\partial x^{(K)}} = \frac{\partial \ell}{\partial p}$, and the per-layer gradients $\frac{\partial \ell_n}{\partial w^{(k)}}$ are collected into a vector $\frac{\partial \ell_n}{\partial w}$.
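[Figure: a three-layer network. The input $x^{(0)} = x_n$ passes through layers $f^{(1)}, f^{(2)}, f^{(3)}$ with weights $w^{(1)}, w^{(2)}, w^{(3)}$, producing $x^{(1)}$, $x^{(2)}$, and $x^{(3)} = p$; the loss $\ell_n$ compares $p$ with the label $y_n$.]

General recursion:
\[ \frac{\partial \ell_n}{\partial w^{(k)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial w^{(k)}} \qquad \frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}} \]

For $K = 3$:
\begin{align*}
\frac{\partial \ell_n}{\partial x^{(3)}} &= \frac{\partial \ell}{\partial p} \\
\frac{\partial \ell_n}{\partial w^{(3)}} &= \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial w^{(3)}} &
\frac{\partial \ell_n}{\partial x^{(2)}} &= \frac{\partial \ell_n}{\partial x^{(3)}} \frac{\partial x^{(3)}}{\partial x^{(2)}} \\
\frac{\partial \ell_n}{\partial w^{(2)}} &= \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial w^{(2)}} &
\frac{\partial \ell_n}{\partial x^{(1)}} &= \frac{\partial \ell_n}{\partial x^{(2)}} \frac{\partial x^{(2)}}{\partial x^{(1)}} \\
\frac{\partial \ell_n}{\partial w^{(1)}} &= \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial w^{(1)}} &
\frac{\partial \ell_n}{\partial x^{(0)}} &= \frac{\partial \ell_n}{\partial x^{(1)}} \frac{\partial x^{(1)}}{\partial x^{(0)}}
\end{align*}

\[ \frac{\partial \ell_n}{\partial w} = \begin{bmatrix} \frac{\partial \ell_n}{\partial w^{(1)}} & \frac{\partial \ell_n}{\partial w^{(2)}} & \frac{\partial \ell_n}{\partial w^{(3)}} \end{bmatrix} \]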
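The backward recursion for K = 3 can be traced in a few lines of code. This is a minimal sketch under strong simplifying assumptions that are not from the slides: every layer is a scalar function x_k = tanh(w_k x_{k-1}), and the loss is the squared error (p - y)^2 / 2. The helper `numeric_grad` is a hypothetical finite-difference check added only to confirm the recursion.

```python
import math

def forward(x0, w):
    """Return [x0, x1, ..., xK] for scalar layers x_k = tanh(w_k * x_{k-1})."""
    xs = [x0]
    for wk in w:
        xs.append(math.tanh(wk * xs[-1]))
    return xs

def loss(p, y):
    """Assumed loss: half the squared error between output p and label y."""
    return 0.5 * (p - y) ** 2

def backprop(x0, y, w):
    """Return [dl/dw_1, ..., dl/dw_K] using the backward recursion."""
    xs = forward(x0, w)
    K = len(w)
    grad_w = [0.0] * K
    dl_dx = xs[K] - y  # dl/dx^(K) = dl/dp for the assumed loss
    for k in range(K, 0, -1):
        dact = 1.0 - xs[k] ** 2      # tanh'(a) written with the layer output
        dx_dw = dact * xs[k - 1]     # local Jacobian dx^(k)/dw^(k)
        dx_dx = dact * w[k - 1]      # local Jacobian dx^(k)/dx^(k-1)
        grad_w[k - 1] = dl_dx * dx_dw
        dl_dx = dl_dx * dx_dx        # now holds dl/dx^(k-1)
    return grad_w

def numeric_grad(x0, y, w, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        diff = loss(forward(x0, w_plus)[-1], y) - loss(forward(x0, w_minus)[-1], y)
        grads.append(diff / (2.0 * eps))
    return grads

w = [0.8, -1.2, 0.5]
print(backprop(0.5, 0.3, w))
print(numeric_grad(0.5, 0.3, w))
```

The two printed gradients should agree to several digits; with vector-valued layers the scalar products become Jacobian matrix products in the same order.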
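\[ \frac{\partial z}{\partial x} = V \quad \text{(easy!)} \]
$\frac{\partial z}{\partial w}$: What is $\frac{\partial z}{\partial V}$? It has three subscripts: $\frac{\partial z_i}{\partial v_{jk}}$.
$\frac{\partial z}{\partial w}$ is a $2 \times 8$ matrix.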
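\[ \frac{\partial z}{\partial w} = \begin{bmatrix}
\frac{\partial z_1}{\partial w_1} & \frac{\partial z_1}{\partial w_2} & \frac{\partial z_1}{\partial w_3} & \frac{\partial z_1}{\partial w_4} & \frac{\partial z_1}{\partial w_5} & \frac{\partial z_1}{\partial w_6} & \frac{\partial z_1}{\partial w_7} & \frac{\partial z_1}{\partial w_8} \\
\frac{\partial z_2}{\partial w_1} & \frac{\partial z_2}{\partial w_2} & \frac{\partial z_2}{\partial w_3} & \frac{\partial z_2}{\partial w_4} & \frac{\partial z_2}{\partial w_5} & \frac{\partial z_2}{\partial w_6} & \frac{\partial z_2}{\partial w_7} & \frac{\partial z_2}{\partial w_8}
\end{bmatrix} \]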
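To make the 2 × 8 shape concrete, the sketch below assumes the layer is affine, z = Vx + b, with V of size 2 × 3 and b of size 2, and assumes the parameters are flattened row by row as w = (v11, v12, v13, v21, v22, v23, b1, b2). Both the layer form and the ordering are assumptions chosen to match the 2 × 8 count; the extracted slides do not show them. The function names are illustrative.

```python
def affine_from_w(w, x, rows=2):
    """Compute z = V x + b after unpacking the flat parameter list w."""
    d = len(x)
    V = [w[i * d:(i + 1) * d] for i in range(rows)]
    b = w[rows * d:]
    return [sum(V[i][k] * x[k] for k in range(d)) + b[i] for i in range(rows)]

def jacobian_wrt_w(x, rows=2):
    """Return dz/dw as a nested list with `rows` rows and rows*len(x)+rows columns."""
    d = len(x)
    J = []
    for i in range(rows):
        row = [0.0] * (rows * d + rows)
        # dz_i/dv_jk equals x_k when j == i and 0 otherwise.
        row[i * d:(i + 1) * d] = x
        # dz_i/db_j equals 1 when j == i and 0 otherwise.
        row[rows * d + i] = 1.0
        J.append(row)
    return J

x = [1.0, 2.0, 3.0]
for row in jacobian_wrt_w(x):
    print(row)
```

Under these assumptions each row of the Jacobian contains a copy of x in the columns for that row of V, a single 1 in the column for its bias, and zeros elsewhere.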
Stochastic Gradient Descent
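The gradient of the risk is the average of the per-sample gradients:
\[ \nabla L_T(w) = \frac{1}{N} \sum_{n=1}^{N} \nabla \ell_n(w) . \]
A gradient-descent step from $w_t$ with step size $\alpha$ is therefore the sum of the $N$ per-sample steps
\[ -\frac{\alpha}{N} \nabla \ell_1(w_t), \; \ldots, \; -\frac{\alpha}{N} \nabla \ell_N(w_t) \]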
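The decomposition of one gradient-descent step into N per-sample steps can be verified on a toy problem. The sketch assumes a scalar weight and per-sample losses l_n(w) = (w - a_n)^2 / 2, so that the per-sample gradient is w - a_n; these choices and the names below are illustrative assumptions, not part of the slides.

```python
def per_sample_steps(w, data, alpha):
    """Return the N steps -(alpha/N) * grad l_n(w), all evaluated at the same w."""
    N = len(data)
    return [-(alpha / N) * (w - a) for a in data]

def risk_gradient(w, data):
    """Average of the per-sample gradients (w - a_n)."""
    return sum(w - a for a in data) / len(data)

data = [1.0, 2.0, 6.0]
w_t, alpha = 0.0, 0.5

steps = per_sample_steps(w_t, data, alpha)
w_next = w_t + sum(steps)                       # sum of the N small steps
w_check = w_t - alpha * risk_gradient(w_t, data)  # one full gradient step
print(w_next, w_check)
```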
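Stochastic gradient descent takes the per-sample steps one at a time, each of the form
\[ -\frac{\alpha}{N} \nabla \ell_n(w) \]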
[Figure source: https://towardsdatascience.com/]
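With mini-batches of size $B$, step $j$ uses the gradient averaged over the $j$-th mini-batch, evaluated at the previous iterate $w^{(j-1)}$:
\[ \frac{1}{B} \sum_{n=(j-1)B+1}^{jB} \nabla \ell_n(w^{(j-1)}) \]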
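A mini-batch pass over the data can be sketched as follows. The example assumes a scalar weight, a least-squares loss l_n(w) = (w x_n - y_n)^2 / 2, a fixed sample order, a batch size B that divides N, and an update w^(j) = w^(j-1) - alpha * (mini-batch gradient); the update rule and all names are assumptions for illustration rather than the slides' exact formulation.

```python
def minibatch_sgd_epoch(w, xs, ys, B, alpha):
    """One pass over the data in consecutive mini-batches of size B."""
    N = len(xs)
    for j in range(1, N // B + 1):
        # Samples (j-1)B+1 .. jB in 1-based indexing.
        batch = range((j - 1) * B, j * B)
        # Gradient of the assumed loss, averaged over the batch,
        # evaluated at the weight from the previous step.
        grad = sum((w * xs[n] - ys[n]) * xs[n] for n in batch) / B
        w = w - alpha * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the minimizer is w = 2

w = 0.0
for epoch in range(200):
    w = minibatch_sgd_epoch(w, xs, ys, B=2, alpha=0.05)
print(w)
```

In practice the samples are usually shuffled before each pass; that is omitted here to keep the indexing identical to the formula.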
Regularization
Network Depth and Batch Normalization
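\[ \frac{\partial \ell_n}{\partial x^{(k-1)}} = \frac{\partial \ell_n}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial x^{(k-1)}} \]
Unrolling this recursion from layer $K$ down to layer $i$, with $J_k = \frac{\partial x^{(k)}}{\partial x^{(k-1)}}$:
\[ \frac{\partial \ell_n}{\partial x^{(i)}} = \frac{\partial \ell_n}{\partial x^{(K)}} \frac{\partial x^{(K)}}{\partial x^{(K-1)}} \cdots \frac{\partial x^{(i+1)}}{\partial x^{(i)}} = \frac{\partial \ell_n}{\partial x^{(K)}} J_K \cdot \ldots \cdot J_{i+1} \]
The gradient with respect to the weights of layer $i$,
\[ \frac{\partial \ell_n}{\partial w^{(i)}} = \frac{\partial \ell_n}{\partial x^{(i)}} \frac{\partial x^{(i)}}{\partial w^{(i)}} , \]
therefore depends on the product $J_K \cdot \ldots \cdot J_{i+1}$.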
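The effect of the Jacobian product on the size of the back-propagated gradient can be illustrated with a toy model. The sketch assumes every Jacobian is a 2 × 2 rotation scaled by a factor s, which is a contrived choice made so that the norm changes by exactly s per layer; real networks have more varied Jacobians, and the names below are illustrative.

```python
import math

def scaled_rotation(s, theta):
    """A 2x2 Jacobian that rotates by theta and scales lengths by s."""
    c, sn = math.cos(theta), math.sin(theta)
    return [[s * c, -s * sn], [s * sn, s * c]]

def row_times_matrix(g, J):
    """Multiply the row vector g by the 2x2 matrix J."""
    return [g[0] * J[0][j] + g[1] * J[1][j] for j in range(2)]

def backpropagated_norm(s, depth, theta=0.3):
    """Norm of g J_K ... J_{i+1} when each of the `depth` Jacobians is s times a rotation."""
    g = [1.0, 0.0]  # stands in for dl/dx^(K)
    for _ in range(depth):
        g = row_times_matrix(g, scaled_rotation(s, theta))
    return math.hypot(g[0], g[1])

for s in (0.5, 1.0, 1.5):
    print(s, backpropagated_norm(s, depth=20))
```

With s below 1 the norm shrinks geometrically with depth, and with s above 1 it grows geometrically, which is one way to see why the product matters for deep networks.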
\[ \hat{x}_k^{(c)} = \frac{x_k^{(c)} - \mu_k^{(c)}}{\sigma_k^{(c)}} \]
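The normalization step can be sketched in code as subtracting a mean and dividing by a standard deviation. The version below assumes the statistics are computed per feature over the samples of a mini-batch, adds a small `eps` inside the square root for numerical safety, and omits any learned scale and shift; these are common conventions assumed here, not details confirmed by the extracted slide.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Standardize each feature (column) of a list of equal-length rows."""
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        col = [row[k] for row in batch]
        mu = sum(col) / n                          # mean over the mini-batch
        var = sum((v - mu) ** 2 for v in col) / n  # variance over the mini-batch
        sigma = math.sqrt(var + eps)               # eps guards against sigma = 0
        for i in range(n):
            out[i][k] = (batch[i][k] - mu) / sigma
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
for row in batch_norm(batch):
    print(row)
```

After normalization the two features have the same scale even though the second was ten times larger on input.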
[Equation garbled in extraction; the surviving fragment reads "... $= 2\sigma^2$ before BN".]
Experiments with SGD
\[ L_T(w) = \frac{1}{N} \sum_{n=1}^{N} \ell_n(w) \]
[Figure: risk values $L_T(\cdot)$ at successive iterates $w_t$ and $w_{t+1}$.]