CS 6316 Machine Learning
Neural Networks
Yangfeng Ji
Department of Computer Science University of Virginia
Overview

1. From Logistic Regression to Neural Networks
2. Expressive Power of Neural Networks
3. Learning Neural Networks
4.
[Figure: logistic regression drawn as a network: inputs x1, x2, x3, x4 form the input layer, connected directly to the single output y in the output layer.]
A network with one hidden layer of K units computes

z_k = σ(Σ_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}),   k ∈ [K]                    (4)

P(y = 1 | x) = σ(Σ_{k=1}^{K} w^{(o)}_{k} z_k)                           (5)

where {w^{(1)}_{k,j}} and {w^{(o)}_{k}} are the two sets of parameters.
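As a sketch (not from the slides), equations (4) and (5) can be implemented directly in plain Python; the names `sigmoid`, `forward`, `W1`, and `wo` are mine, and σ is taken to be the logistic function throughout:

```python
import math

def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, wo):
    """One-hidden-layer forward pass following Eqs. (4)-(5).

    x  : list of d input features
    W1 : K x d list of lists, first-layer weights w^(1)_{k,j}
    wo : list of K output weights w^(o)_k
    Returns (z, p) where z is the hidden layer and p = P(y = 1 | x).
    """
    K, d = len(W1), len(x)
    # Eq. (4): z_k = sigma(sum_j w^(1)_{k,j} x_j) for each k in [K]
    z = [sigmoid(sum(W1[k][j] * x[j] for j in range(d))) for k in range(K)]
    # Eq. (5): P(y = 1 | x) = sigma(sum_k w^(o)_k z_k)
    p = sigmoid(sum(wo[k] * z[k] for k in range(K)))
    return z, p
```

With all-zero weights, every hidden unit outputs σ(0) = 0.5 and the output probability is 0.5, which is a quick sanity check on the implementation.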
[Figure: a network with one hidden layer: inputs x·,1, …, x·,4 (input layer), hidden units z1, …, z5 (hidden layer), and output y (output layer).]
Common activation functions: (a) the sign function, (b) the tanh function, and (c) the ReLU function [Jarrett et al., 2009]. [Figure: plots of the three activations.]
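A minimal sketch of the three activations in Python (function names are mine; returning +1 at exactly 0 for the sign function is an assumption):

```python
import math

def sign(a):
    # Sign activation: +1 for nonnegative inputs, -1 otherwise
    # (the value at exactly 0 is a convention, assumed here)
    return 1.0 if a >= 0 else -1.0

def tanh(a):
    # Hyperbolic tangent: smooth, saturates at -1 and +1
    return math.tanh(a)

def relu(a):
    # Rectified linear unit [Jarrett et al., 2009]: max(0, a)
    return max(0.0, a)
```

Unlike the sign function, tanh and ReLU have useful derivatives almost everywhere, which matters for the gradient-based learning discussed later.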
[Figure: a deeper network: inputs x·,1, …, x·,4 (input layer), two hidden layers, and output y (output layer).]
Given a training set {(x_i, y_i)}_{i=1}^{m}, the learning objective is to minimize the negative log-likelihood

L(θ) = − Σ_{i=1}^{m} log P(y_i | x_i)
The parameters are updated by gradient descent,¹ with learning rate η:

θ ← θ − η · ∂L(θ)/∂θ

¹ More detail will be discussed in the next lecture.
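The update rule can be sketched as a generic loop; `gradient_descent`, `eta`, and the one-dimensional quadratic example are illustrative, not from the slides:

```python
def gradient_descent(grad, theta0, eta=0.1, steps=100):
    """Repeatedly apply theta <- theta - eta * dL/dtheta.

    grad   : function returning dL/dtheta at a given theta
    theta0 : initial parameter value
    eta    : learning rate (illustrative choice)
    """
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Illustrative example: L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3);
# the minimizer is theta = 3.
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

For a neural network, θ stands for all the weights ({w^(1)_{k,j}} and {w^(o)_k}), and the gradient is supplied by the chain-rule computations below.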
∂L(θ)/∂w^{(o)} = ∂ log σ(·)/∂σ(·) · ∂[(w^{(o)})^T σ(W^{(1)}x)]/∂w^{(o)}
               = ∂ log σ(·)/∂σ(·) · σ(W^{(1)}x)                          (20)
∂L(θ)/∂W^{(1)} = ∂ log σ(·)/∂σ(·) · ∂[(w^{(o)})^T σ(W^{(1)}x)]/∂σ(W^{(1)}x) · ∂σ(W^{(1)}x)/∂(W^{(1)}x) · ∂(W^{(1)}x)/∂W^{(1)}   (21)
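Equations (20) and (21) leave ∂ log σ/∂σ unexpanded; the sketch below applies them to the single-example loss L = −log P(y = 1 | x), expanding each chain-rule factor by hand. All names are mine, and the element-wise form is an assumption about the slides' vectorized notation:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(x, W1, wo):
    # Forward pass ending in L = -log sigma((w^(o))^T sigma(W^(1) x))
    z = [sigmoid(sum(W1[k][j] * x[j] for j in range(len(x))))
         for k in range(len(W1))]
    p = sigmoid(sum(wo[k] * z[k] for k in range(len(wo))))
    return -math.log(p)

def grads(x, W1, wo):
    """Chain-rule gradients in the shape of Eqs. (20)-(21), for L = -log p."""
    d, K = len(x), len(W1)
    z = [sigmoid(sum(W1[k][j] * x[j] for j in range(d))) for k in range(K)]
    p = sigmoid(sum(wo[k] * z[k] for k in range(K)))
    # d(-log sigma(a))/da = sigma(a) - 1, the expanded "d log sigma" factor
    da = p - 1.0
    # Eq. (20): dL/dw^(o)_k = da * z_k
    g_wo = [da * z[k] for k in range(K)]
    # Eq. (21): dL/dW^(1)_{k,j} = da * w^(o)_k * sigma'(u_k) * x_j,
    # using sigma'(u_k) = z_k * (1 - z_k)
    g_W1 = [[da * wo[k] * z[k] * (1.0 - z[k]) * x[j] for j in range(d)]
            for k in range(K)]
    return g_wo, g_W1
```

A finite-difference check (perturb one weight, recompute the loss) is a standard way to confirm that hand-derived gradients like these are correct.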
² For simplicity, the transpose operation is ignored in the graph.
Basic derivatives needed for the chain rule:

∂σ(x)/∂x = σ(x)(1 − σ(x))
∂(a^T x)/∂x = a
∂ log(x)/∂x = 1/x
∂(Wx)/∂x = W
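The first identity in the table can be checked numerically with a central finite difference (a small sanity-check sketch, not part of the slides):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

eps = 1e-4
for x in (-2.0, 0.0, 1.5):
    # Finite-difference estimate of d sigma / dx at x
    num = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    # Closed form from the table: sigma(x) * (1 - sigma(x))
    assert abs(num - sigmoid(x) * (1.0 - sigmoid(x))) < 1e-6
```

The same check applies to the other rows; in practice such checks catch most sign and indexing mistakes in hand-derived gradients.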
[Figure: the computation graph² of the network: x → W^{(1)} · x → σ → (w^{(o)})^T z → σ → − log p(Y | x), with the parameters W^{(1)} and w^{(o)} feeding the two linear nodes. The backward pass sends the local derivatives ∂(W^{(1)} · x)/∂W^{(1)}, ∂σ, ∂((w^{(o)})^T z)/∂w^{(o)}, and ∂σ back along the edges.]
References

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the 12th International Conference on Computer Vision, pages 2146–2153. IEEE.

LeCun, Y. (2020). Self-supervised learning.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.