Neural Networks
Part 2
Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison
[Based on slides from Jerry Zhu, Mohit Gupta]
slide 2
Limited power of one single neuron (perceptron)
slide 3
slide 4
slide 5
[Figure: multi-layer network — Layer 3 (output), Layer 2 (hidden), Layer 1 (input)]
slide 6
[Figure: first hidden neuron, with incoming weights $x_{11}^{(2)}$, $x_{12}^{(2)}$]
$a_1^{(2)} = g\big(\sum_j y_j \, x_{1j}^{(2)}\big)$
slide 7
[Figure: two hidden neurons, weights $x_{11}^{(2)}, x_{21}^{(2)}, x_{12}^{(2)}, x_{22}^{(2)}$]
$a_1^{(2)} = g\big(\sum_j y_j \, x_{1j}^{(2)}\big), \quad a_2^{(2)} = g\big(\sum_j y_j \, x_{2j}^{(2)}\big)$
slide 8
[Figure: three hidden neurons, weights $x_{11}^{(2)}, x_{21}^{(2)}, x_{31}^{(2)}, x_{12}^{(2)}, x_{22}^{(2)}, x_{32}^{(2)}$]
$a_1^{(2)} = g\big(\sum_j y_j \, x_{1j}^{(2)}\big), \quad a_2^{(2)} = g\big(\sum_j y_j \, x_{2j}^{(2)}\big), \quad a_3^{(2)} = g\big(\sum_j y_j \, x_{3j}^{(2)}\big)$
slide 9
[Figure: the three hidden activations feed one output neuron through weights $x_1^{(3)}, x_2^{(3)}, x_3^{(3)}$]
$a_j^{(2)} = g\big(\sum_k y_k \, x_{jk}^{(2)}\big)$ for the hidden layer, and for the output:
$a = g\big(\sum_j a_j^{(2)} \, x_j^{(3)}\big)$
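The layer-by-layer computation above can be sketched in NumPy. This is only an illustrative sketch: the input values and weights below are made up, and $g$ is taken to be the sigmoid (a choice the slides make explicit later).

```python
import numpy as np

def g(A):
    # activation function; the slides later use the sigmoid
    return 1.0 / (1.0 + np.exp(-A))

# made-up example: 2 inputs, 3 hidden neurons, 1 output
y = np.array([0.5, -1.0])               # inputs y_j
x2 = np.array([[0.1, 0.4],              # x2[j, k] = x_{jk}^{(2)}
               [-0.3, 0.2],
               [0.6, -0.1]])
x3 = np.array([0.2, -0.5, 0.3])         # x3[j] = x_j^{(3)}

a2 = g(x2 @ y)    # hidden layer: a_j^{(2)} = g(sum_k y_k x_{jk}^{(2)})
a = g(x3 @ a2)    # output: a = g(sum_j a_j^{(2)} x_j^{(3)})
```

Each matrix-vector product computes all the per-neuron weighted sums of one layer at once.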
slide 10
[Figure: first output neuron with weights $x_{11}^{(3)}, x_{12}^{(3)}, x_{13}^{(3)}$]
$a_1 = g\big(\sum_j a_j^{(2)} \, x_{1j}^{(3)}\big)$
slide 11
[Figure: $L$ output neurons; the last one has weights $x_{L1}^{(3)}, x_{L2}^{(3)}, x_{L3}^{(3)}$]
$a_1 = g\big(\sum_j a_j^{(2)} \, x_{1j}^{(3)}\big), \quad \ldots, \quad a_L = g\big(\sum_j a_j^{(2)} \, x_{Lj}^{(3)}\big)$
slide 14
slide 15
$F_y = \sum_{j=1}^{L} (z_j - a_j)^2$: cost of the network on a single training image $y$, where $z_j$ is the desired output and $a_j$ the actual output of the $j$'th output neuron.
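As a quick sketch of this cost (output and desired values below are made up):

```python
import numpy as np

# squared-error cost on one training image, following the slide:
# F_y = sum_j (z_j - a_j)^2
def cost(a, z):
    return float(np.sum((z - a) ** 2))

a = np.array([0.8, 0.2])   # actual outputs (made-up)
z = np.array([1.0, 0.0])   # desired outputs (made-up)
F_y = cost(a, z)
```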
slide 18
[Figure: four-layer network — Layer (1) inputs $y_1, y_2$; Layer (4) outputs $a_1, a_2$; each output contributes $(z - a)^2$ to the cost $F_y$; weight $x_{11}^{(4)}$ highlighted]
Goal: compute $\partial F_y / \partial x_{11}^{(4)}$.
slide 19
slide 20
$\boldsymbol{A}_1^{(4)} = x_{11}^{(4)} a_1^{(3)} + x_{12}^{(4)} a_2^{(3)}$ (the linear combination feeding output neuron 1), and $a_1 = g\big(\boldsymbol{A}_1^{(4)}\big)$.
slide 21
[Figure: the two products $x_{11}^{(4)} a_1^{(3)}$ and $x_{12}^{(4)} a_2^{(3)}$ shown on the edges into $\boldsymbol{A}_1^{(4)}$]
slide 22
By Chain Rule:
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = \dfrac{\partial F_y}{\partial a_1} \cdot \dfrac{\partial a_1}{\partial \boldsymbol{A}_1^{(4)}} \cdot \dfrac{\partial \boldsymbol{A}_1^{(4)}}{\partial x_{11}^{(4)}}$, with $\dfrac{\partial F_y}{\partial a_1} = 2(a_1 - z_1)$ and $\dfrac{\partial a_1}{\partial \boldsymbol{A}_1^{(4)}} = g'\big(\boldsymbol{A}_1^{(4)}\big)$.
slide 23
By Chain Rule:
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = 2(a_1 - z_1)\, g'\big(\boldsymbol{A}_1^{(4)}\big) \cdot \dfrac{\partial \boldsymbol{A}_1^{(4)}}{\partial x_{11}^{(4)}}$
slide 24
Since $\partial \boldsymbol{A}_1^{(4)} / \partial x_{11}^{(4)} = a_1^{(3)}$:
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = 2(a_1 - z_1)\, g'\big(\boldsymbol{A}_1^{(4)}\big)\, a_1^{(3)}$
slide 25
For the sigmoid, $g'(\boldsymbol{A}) = g(\boldsymbol{A})\big(1 - g(\boldsymbol{A})\big)$, so:
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = 2(a_1 - z_1)\, g\big(\boldsymbol{A}_1^{(4)}\big)\Big(1 - g\big(\boldsymbol{A}_1^{(4)}\big)\Big)\, a_1^{(3)}$
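The sigmoid-derivative identity used in this substitution is easy to verify against a finite difference; the evaluation point below is made up.

```python
import numpy as np

def g(A):
    # sigmoid
    return 1.0 / (1.0 + np.exp(-A))

def g_prime(A):
    # the identity used on the slide: g'(A) = g(A) (1 - g(A))
    return g(A) * (1.0 - g(A))

# sanity-check the identity against a central finite difference
A = 0.7
h = 1e-6
fd = (g(A + h) - g(A - h)) / (2.0 * h)
```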
slide 26
Using $a_1 = g\big(\boldsymbol{A}_1^{(4)}\big)$:
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = 2(a_1 - z_1)\, a_1 (1 - a_1)\, a_1^{(3)}$
slide 27
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = 2(a_1 - z_1)\, a_1 (1 - a_1)\, a_1^{(3)}$
Every factor can be computed from the network activations (and the desired output $z_1$).
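That claim can be checked numerically: the formula, evaluated from activations alone, should match a finite-difference derivative of the cost. All the values below are made up for illustration.

```python
import numpy as np

def g(A):
    return 1.0 / (1.0 + np.exp(-A))

# made-up pieces feeding output neuron 1
a3 = np.array([0.4, 0.9])      # a_1^{(3)}, a_2^{(3)}
x4 = np.array([0.3, -0.2])     # x_{11}^{(4)}, x_{12}^{(4)}
z1 = 1.0                       # desired output z_1

def F(x11):
    # cost contribution of output neuron 1 as a function of x_{11}^{(4)}
    a1 = g(x11 * a3[0] + x4[1] * a3[1])
    return (z1 - a1) ** 2

a1 = g(x4 @ a3)
analytic = 2.0 * (a1 - z1) * a1 * (1.0 - a1) * a3[0]   # formula from the slide
h = 1e-6
numeric = (F(x4[0] + h) - F(x4[0] - h)) / (2.0 * h)
```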
slide 28
By Chain Rule:
$\dfrac{\partial F_y}{\partial \boldsymbol{A}_1^{(4)}} = 2(a_1 - z_1)\, g'\big(\boldsymbol{A}_1^{(4)}\big)$
slide 29
Define $\delta_1^{(4)} = \dfrac{\partial F_y}{\partial \boldsymbol{A}_1^{(4)}} = 2(a_1 - z_1)\, g'\big(\boldsymbol{A}_1^{(4)}\big)$.
slide 30
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = \delta_1^{(4)}\, a_1^{(3)}$
slide 31
$\dfrac{\partial F_y}{\partial x_{11}^{(4)}} = \delta_1^{(4)}\, a_1^{(3)}, \qquad \dfrac{\partial F_y}{\partial x_{12}^{(4)}} = \delta_1^{(4)}\, a_2^{(3)}$
slide 32
Similarly, with $\delta_2^{(4)} = \dfrac{\partial F_y}{\partial \boldsymbol{A}_2^{(4)}} = 2(a_2 - z_2)\, g'\big(\boldsymbol{A}_2^{(4)}\big)$:
$\dfrac{\partial F_y}{\partial x_{21}^{(4)}} = \delta_2^{(4)}\, a_1^{(3)}, \qquad \dfrac{\partial F_y}{\partial x_{22}^{(4)}} = \delta_2^{(4)}\, a_2^{(3)}$
slide 33
Thus, for any weight in the network:
$\dfrac{\partial F_y}{\partial x_{jk}^{(l)}} = \delta_j^{(l)}\, a_k^{(l-1)}$
$\delta_j^{(l)}$: $\delta$ of the $j$'th neuron in Layer $l$
$a_k^{(l-1)}$: activation of the $k$'th neuron in Layer $l-1$
$x_{jk}^{(l)}$: weight from the $k$'th neuron in Layer $l-1$ to the $j$'th neuron in Layer $l$
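Stacking the deltas of Layer $l$ and the activations of Layer $l-1$ into vectors, the rule above gives every weight gradient of the layer as a single outer product. A sketch with made-up values:

```python
import numpy as np

# dF_y/dx_{jk}^{(l)} = delta_j^{(l)} a_k^{(l-1)}
delta_l = np.array([0.1, -0.3])       # delta_j^{(l)} (made-up)
a_prev = np.array([0.5, 0.2, 0.8])    # a_k^{(l-1)} (made-up)

grad_x = np.outer(delta_l, a_prev)    # grad_x[j, k] = delta_j^{(l)} a_k^{(l-1)}
```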
slide 34
Show that for any bias in the network:
$\dfrac{\partial F_y}{\partial b_j^{(l)}} = \delta_j^{(l)}$
$\delta_j^{(l)}$: $\delta$ of the $j$'th neuron in Layer $l$
$b_j^{(l)}$: bias of the $j$'th neuron in Layer $l$, i.e., $\boldsymbol{A}_j^{(l)} = \sum_k x_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}$
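The exercise follows from the same chain rule: since $\partial \boldsymbol{A}_j^{(l)} / \partial b_j^{(l)} = 1$, the bias gradient is just $\delta_j^{(l)}$. A numerical sketch for one sigmoid output neuron, with made-up values:

```python
import numpy as np

def g(A):
    return 1.0 / (1.0 + np.exp(-A))

# A_j = sum_k x_jk a_k + b_j, so dA_j/db_j = 1 and dF_y/db_j = delta_j
a_prev = np.array([0.4, 0.9])   # incoming activations (made-up)
x = np.array([0.3, -0.2])       # incoming weights (made-up)
b = 0.1                         # bias
z = 1.0                         # desired output

def F(b_):
    a_out = g(x @ a_prev + b_)
    return (z - a_out) ** 2

a_out = g(x @ a_prev + b)
delta = 2.0 * (a_out - z) * a_out * (1.0 - a_out)   # delta of the neuron
h = 1e-6
numeric = (F(b + h) - F(b - h)) / (2.0 * h)
```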
slide 35
Thus, for any neuron in the network:
$\delta_j^{(l)} = \Big(\sum_k \delta_k^{(l+1)}\, x_{kj}^{(l+1)}\Big)\, g'\big(\boldsymbol{A}_j^{(l)}\big)$
$\delta_j^{(l)}$: $\delta$ of the $j$'th neuron in Layer $l$
$\delta_k^{(l+1)}$: $\delta$ of the $k$'th neuron in Layer $l+1$
$g'\big(\boldsymbol{A}_j^{(l)}\big)$: derivative of the $j$'th neuron in Layer $l$ w.r.t. its linear combination input
$x_{kj}^{(l+1)}$: weight from the $j$'th neuron in Layer $l$ to the $k$'th neuron in Layer $l+1$
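In vector form the recursion is $\delta^{(l)} = \big(x^{(l+1)\top} \delta^{(l+1)}\big) \odot g'\big(\boldsymbol{A}^{(l)}\big)$, and for the sigmoid $g'\big(\boldsymbol{A}_j^{(l)}\big) = a_j^{(l)}\big(1 - a_j^{(l)}\big)$. A sketch with made-up values:

```python
import numpy as np

delta_next = np.array([0.2, -0.1])        # delta_k^{(l+1)} (made-up)
x_next = np.array([[0.5, -0.4, 0.1],      # x_next[k, j] = x_{kj}^{(l+1)}
                   [0.3, 0.8, -0.6]])
a_l = np.array([0.7, 0.4, 0.9])           # a_j^{(l)} (made-up)

# delta_j^{(l)} = (sum_k delta_k^{(l+1)} x_{kj}^{(l+1)}) * a_j^{(l)} (1 - a_j^{(l)})
delta_l = (x_next.T @ delta_next) * a_l * (1.0 - a_l)
```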
slide 36
a. Compute activations for the entire network.
b. Compute $\delta$ for neurons in the output layer using the network activation and the desired activation: $\delta_j^{(L)} = 2(a_j - z_j)\, a_j (1 - a_j)$.
c. Compute $\delta$ for all neurons in the previous layers: $\delta_j^{(l)} = \Big(\sum_k \delta_k^{(l+1)}\, x_{kj}^{(l+1)}\Big)\, a_j^{(l)} \big(1 - a_j^{(l)}\big)$.
d. Compute the gradient of the cost w.r.t. each weight and bias for the training image using $\delta$: $\dfrac{\partial F_y}{\partial x_{jk}^{(l)}} = \delta_j^{(l)}\, a_k^{(l-1)}$, $\dfrac{\partial F_y}{\partial b_j^{(l)}} = \delta_j^{(l)}$.
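Steps a-d assemble into one backward pass. The sketch below follows the slides' conventions (sigmoid activations, squared-error cost); the helper name `backprop` and the tiny example network are made up for illustration.

```python
import numpy as np

def g(A):
    # sigmoid activation
    return 1.0 / (1.0 + np.exp(-A))

def backprop(y, z, weights, biases):
    # a. compute activations for the entire network
    acts = [y]
    for x, b in zip(weights, biases):
        acts.append(g(x @ acts[-1] + b))
    # b. deltas for the output layer: 2 (a - z) a (1 - a)
    a_out = acts[-1]
    delta = 2.0 * (a_out - z) * a_out * (1.0 - a_out)
    # c./d. walk backwards: gradients from delta, then propagate delta
    grads_x, grads_b = [], []
    for i in range(len(weights) - 1, -1, -1):
        grads_x.insert(0, np.outer(delta, acts[i]))   # delta_j a_k^{(l-1)}
        grads_b.insert(0, delta.copy())               # dF_y/db_j = delta_j
        if i > 0:
            delta = (weights[i].T @ delta) * acts[i] * (1.0 - acts[i])
    return grads_x, grads_b

# made-up tiny network: 2 inputs -> 2 hidden -> 1 output
weights = [np.array([[0.1, 0.2], [0.3, -0.1]]),
           np.array([[0.5, -0.4]])]
biases = [np.array([0.0, 0.1]), np.array([-0.2])]
grads_x, grads_b = backprop(np.array([1.0, 0.5]),  # training image y
                            np.array([1.0]),       # desired output z
                            weights, biases)
```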
slide 37
Entire training set of $n$ images $y \in E$:
$\dfrac{\partial F}{\partial x_{jk}^{(l)}} = \dfrac{1}{n} \sum_y \dfrac{\partial F_y}{\partial x_{jk}^{(l)}}, \qquad \dfrac{\partial F}{\partial b_j^{(l)}} = \dfrac{1}{n} \sum_y \dfrac{\partial F_y}{\partial b_j^{(l)}}$
Gradient descent update with learning rate $\eta$:
$x_{jk}^{(l)} \leftarrow x_{jk}^{(l)} - \eta\, \dfrac{\partial F}{\partial x_{jk}^{(l)}}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\, \dfrac{\partial F}{\partial b_j^{(l)}}$
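The averaging and update step might look like the following sketch, for one weight matrix; the learning rate, weights, and per-image gradients are made-up stand-ins for the quantities the backward pass computes.

```python
import numpy as np

eta = 0.5                                    # learning rate (made-up)
x = np.array([[0.1, 0.2], [0.3, -0.1]])      # one weight matrix x^{(l)} (made-up)

# stand-ins for dF_y/dx^{(l)} from two training images y
per_image_grads = [np.array([[0.2, 0.0], [0.0, 0.2]]),
                   np.array([[0.0, 0.2], [0.2, 0.0]])]

grad = sum(per_image_grads) / len(per_image_grads)  # dF/dx = (1/n) sum_y dF_y/dx
x = x - eta * grad                                  # x <- x - eta dF/dx
```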