Natural Language Processing with Deep Learning CS224N/Ling284
Christopher Manning Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)
Assignment 2 is all about making sure you really understand the math of neural networks … then we’ll let the software do it! We’ll go through it quickly today, but also look at the readings! This will be a tough week for some, so make sure to get help if you need it: visit office hours Friday/Tuesday. Note: Monday is MLK Day, so no office hours, sorry, but we will be on Piazza. Read the tutorial materials given in the syllabus.
$J_t(\theta) = \sigma(s) = \dfrac{1}{1 + e^{-s}}$
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
Update equation: $\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the step size or learning rate. How can we compute $\nabla_\theta J(\theta)$?
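As a minimal sketch of the update above (the function name sgd_step and the toy objective are placeholders, not from the assignment):

    import numpy as np

    def sgd_step(theta, grad_J, alpha=0.01):
        """One step of vanilla gradient descent: theta_new = theta_old - alpha * grad_J(theta)."""
        return theta - alpha * grad_J(theta)

    # Toy usage with a made-up objective J(theta) = ||theta||^2, whose gradient is 2*theta
    theta = np.array([1.0, -2.0])
    theta = sgd_step(theta, lambda t: 2 * t, alpha=0.1)   # -> array([ 0.8, -1.6])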
Lecture 4: Gradients by hand and algorithmically
Matrix calculus: fully vectorized gradients. "Multivariable calculus is just like single-variable calculus if you use matrices."
Doing a non-vectorized gradient can be good for intuition; watch last week's lecture for an example.
The lecture notes and matrix calculus notes cover this material in more detail.
For a linear algebra refresher, see the Math 51 textbook: http://web.stanford.edu/class/math51/textbook.html
For example, for $f(x) = x^3$: $\dfrac{df}{dx} = 3x^2$
“How much will the output change if we change the input a bit?”
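For instance (a small worked check of the slope interpretation, not spelled out on the slide): at $x = 2$, $\frac{df}{dx} = 3 \cdot 2^2 = 12$, so nudging the input from $2$ to $2.01$ changes the output from $f(2) = 8$ to $f(2.01) \approx 8.12$, i.e. by about $12 \times 0.01$.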
Function has n outputs and n inputs → n by n Jacobian
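Concretely, for an elementwise activation $\boldsymbol{h} = f(\boldsymbol{z})$, this is the standard diagonal-Jacobian result the next slides use:

$$\left(\frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}\right)_{ij} = \frac{\partial h_i}{\partial z_j} = \begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \qquad\Rightarrow\qquad \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}} = \mathrm{diag}\!\left(f'(\boldsymbol{z})\right)$$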
Fine print: This is the correct Jacobian. Later we discuss the “shape convention”; using it, the answer would be h.
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
For simplicity, we will compute the gradient of the score s (rather than of the full loss).
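A minimal NumPy sketch of this model's forward pass (the sizes and the choice of sigmoid for f are illustrative assumptions, not the assignment's setup):

    import numpy as np

    d, n = 20, 10                      # made-up sizes: 20-dim window, 10 hidden units
    x = np.random.randn(d)             # concatenated window of word vectors
    W, b, u = np.random.randn(n, d), np.random.randn(n), np.random.randn(n)

    z = W @ x + b                      # affine transform
    h = 1 / (1 + np.exp(-z))           # elementwise nonlinearity f (sigmoid, as an example)
    s = u @ h                          # scalar score s = u^T h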
Useful Jacobians from the previous slide
The same! Let’s avoid duplicated computation…
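A sketch of what avoiding the duplicated computation looks like in code, continuing the assumptions of the forward-pass sketch above (sigmoid f, made-up sizes):

    import numpy as np

    d, n = 20, 10
    x = np.random.randn(d)
    W, b, u = np.random.randn(n, d), np.random.randn(n), np.random.randn(n)
    z = W @ x + b
    h = 1 / (1 + np.exp(-z))

    delta = u * h * (1 - h)            # delta = u^T o f'(z): computed once, reused below
    grad_b = delta                     # ds/db (shape convention: same shape as b)
    grad_W = np.outer(delta, x)        # ds/dW = delta^T x^T (same shape as W)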
[Network diagram: inputs x1, x2, x3 and a bias +1; hidden units h1 = f(z1), h2 = f(z2); output score s]
$$\frac{\partial s}{\partial \boldsymbol{W}} = \boldsymbol{\delta}\,\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}} = \boldsymbol{\delta}\,\frac{\partial}{\partial \boldsymbol{W}}\left(\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}\right)$$

Consider the derivative with respect to a single weight $W_{ij}$:

$$\frac{\partial z_i}{\partial W_{ij}} = \frac{\partial}{\partial W_{ij}}\left(\boldsymbol{W}_{i\cdot}\,\boldsymbol{x} + b_i\right) = \frac{\partial}{\partial W_{ij}}\sum_{k=1}^{d} W_{ik}\,x_k = x_j$$
What shape should derivatives be? $\frac{\partial s}{\partial \boldsymbol{b}} = \boldsymbol{u}^T \circ f'(\boldsymbol{z}) = \boldsymbol{\delta}$ is a row vector, but convention says our gradient should be a column vector because b is a column vector. Two options: (1) use the Jacobian form as much as possible and reshape at the end, i.e. transpose the derivative into a column vector, resulting in $\boldsymbol{\delta}^T$; (2) always follow the shape convention, looking at dimensions to figure out when to transpose and/or reorder terms.
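Putting the single-element result together with the shape convention gives the outer-product form (stated here for reference; δ is the local error signal at z and x the input):

$$\frac{\partial s}{\partial \boldsymbol{W}} = \boldsymbol{\delta}^{T}\,\boldsymbol{x}^{T} \qquad (n \times 1)\,(1 \times d) = n \times d,\ \text{the shape of } \boldsymbol{W}$$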
Deriving gradients: tips
Carefully define your variables and keep track of their dimensionality!
Keep straight what variables feed into what computations.
For the top softmax part of a model, first consider the derivative wrt fc when c = y (the correct class), then consider the derivative wrt fc when c ≠ y (all the incorrect classes); see the worked derivative after this list.
Work out element-wise partial derivatives if you're getting confused by matrix calculus!
Use the shape convention. Note: the error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer.
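As a worked instance of the softmax tip above (the standard cross-entropy result, not spelled out on this slide): for $J = -\log \mathrm{softmax}(f)_y$ with $p = \mathrm{softmax}(f)$,

$$\frac{\partial J}{\partial f_c} = \begin{cases} p_c - 1 & \text{if } c = y \text{ (the correct class)} \\ p_c & \text{if } c \neq y \text{ (an incorrect class)} \end{cases}$$

i.e. the gradient is the probability vector minus the one-hot target.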
Backpropagation at a single node:
Each node receives an “upstream gradient”: the gradient of the final output with respect to the node's output.
Each node has a “local gradient”: the gradient of the node's output with respect to its input.
Chain rule! [downstream gradient] = [upstream gradient] x [local gradient]
A node with multiple inputs has multiple local gradients, and passes a downstream gradient back along each input.
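A minimal sketch of this bookkeeping for a single * node (the class and method names are illustrative, not a framework API):

    class Multiply:
        """f(a, b) = a * b, with the upstream/local/downstream pattern described above."""
        def forward(self, a, b):
            self.a, self.b = a, b              # cache inputs for the backward pass
            return a * b
        def backward(self, upstream):
            # local gradients: d(ab)/da = b, d(ab)/db = a
            return upstream * self.b, upstream * self.a   # one downstream gradient per input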
An example: f(x, y, z) = (x + y) max(y, z), with x = 1, y = 2, z = 0.
Forward prop steps: a = x + y = 3, b = max(y, z) = 2, f = a · b = 6.
Local gradients: ∂f/∂a = b = 2, ∂f/∂b = a = 3; ∂a/∂x = 1, ∂a/∂y = 1; ∂b/∂y = 1 (since y > z), ∂b/∂z = 0.
Backward pass, upstream * local = downstream: start from ∂f/∂f = 1; at the * node, 1*3 = 3 (toward b) and 1*2 = 2 (toward a); at the max node, 3*1 = 3 (toward y) and 3*0 = 0 (toward z); at the + node, 2*1 = 2 (toward x) and 2*1 = 2 (toward y).
Gradients sum at outward branches, so ∂f/∂x = 2, ∂f/∂y = 3 + 2 = 5, ∂f/∂z = 0.
Node intuitions: + “distributes” the upstream gradient to each summand; max “routes” the upstream gradient to the larger input; * “switches” the upstream gradient, scaling it by the value of the other input.
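The same example, worked through in a few lines of Python to check the numbers above:

    x, y, z = 1.0, 2.0, 0.0
    a = x + y                              # 3
    b = max(y, z)                          # 2
    f = a * b                              # 6

    # backward pass: upstream * local = downstream
    df_da = 1.0 * b                        # 2
    df_db = 1.0 * a                        # 3
    df_dx = df_da * 1.0                    # 2
    df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)   # 2 + 3 = 5 (gradients sum at branches)
    df_dz = df_db * (1.0 if z > y else 0.0)                 # 0
    print(df_dx, df_dy, df_dz)             # 2.0 5.0 0.0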
We have computed gradients by hand; backpropagation computes them algorithmically over the computation graph.
Back-prop in a general computation graph with a single scalar output z:
Compute the gradient wrt each node using the gradients wrt its successors, {y_1, …, y_n} = successors of x (see the chain-rule formula below).
Done correctly, the big O() complexity of fprop and bprop is the same.
In general our nets have regular layer structure, so we can use matrices and Jacobians…
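The "gradient wrt successors" step is just the multivariable chain rule applied at each node:

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$$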
Automatic differentiation: the gradient computation can be automatically inferred from the symbolic expression of the fprop.
Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.
Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you but mainly leave the layer/node writer to hand-calculate the local derivative.
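For example, a hand-written node in PyTorch supplies exactly these two pieces via torch.autograd.Function (a minimal sketch; sigmoid is just a convenient choice of node, not one the slides prescribe):

    import torch

    class MySigmoid(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            out = 1 / (1 + torch.exp(-x))        # how to compute the output
            ctx.save_for_backward(out)
            return out

        @staticmethod
        def backward(ctx, grad_output):
            (out,) = ctx.saved_tensors
            # hand-calculated local derivative: sigmoid'(x) = out * (1 - out)
            return grad_output * out * (1 - out)

    x = torch.randn(3, requires_grad=True)
    MySigmoid.apply(x).sum().backward()          # populates x.grad via the backward above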
Numeric gradient checking is useful for verifying an implementation; in the old days, when everything was hand-written, it was key to do this everywhere.
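A minimal sketch of such a numeric gradient check, using the standard two-sided formula f'(x) ≈ (f(x+h) − f(x−h)) / 2h (the helper name numeric_grad is illustrative):

    import numpy as np

    def numeric_grad(f, x, h=1e-4):
        """Approximate df/dx elementwise by central differences."""
        grad = np.zeros_like(x)
        for i in range(x.size):
            old = x.flat[i]
            x.flat[i] = old + h; f_plus = f(x)
            x.flat[i] = old - h; f_minus = f(x)
            x.flat[i] = old
            grad.flat[i] = (f_plus - f_minus) / (2 * h)
        return grad

    # e.g. compare against an analytic gradient: np.allclose(analytic, numeric_grad(f, x), atol=1e-6)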
Why learn all these details about gradients when deep learning frameworks compute them for you? Well, why take a class on compilers or systems when they are implemented for you? Understanding what is going on under the hood is useful: backpropagation doesn't always work perfectly, and understanding why is crucial for debugging and improving models. See Karpathy's “Yes you should understand backprop”: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b