Natural Language Processing with Deep Learning CS224N/Ling284
Christopher Manning Lecture 4: Backpropagation and computation graphs
Lecture Plan
1. Matrix gradients for our simple neural net, and some tips
2. Computation graphs and backpropagation
3. Stuff you should know:
   a. Regularization to prevent overfitting
   b. Vectorization
   c. Nonlinearities
   d. Initialization
   e. Optimizers
   f. Learning rates
Our simple neural net for window classification:
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
h = f(z),  z = Wx + b,  s = uᵀh
[Figure: the network drawn with inputs x1, x2, x3 and a bias +1, hidden units h1 = f(z1) and h2 = f(z2), and scalar output score s.]
∂s/∂W = δ ∂z/∂W = δ ∂/∂W (Wx + b)
Considering the derivative for a single weight W_ij:
∂z_i/∂W_ij = ∂/∂W_ij (W_i· x + b_i) = ∂/∂W_ij Σ_k W_ik x_k = x_j
Hence ∂s/∂W_ij = δ_i x_j.
That is, ∂s/∂W is the outer product of δ (the error signal from above) and x (the local gradient signal).
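A minimal numpy sketch of this derivation (the sizes, random values, and the choice of f = tanh are assumptions for illustration): it runs the forward pass s = uᵀ f(Wx + b), forms the error signal δ, builds ∂s/∂W as the outer product of δ and x, and checks one entry against a finite difference.

```python
import numpy as np

np.random.seed(0)
n, m = 3, 5                        # hidden size and input size (illustrative)
x = np.random.randn(m)             # e.g., a concatenated window of word vectors
W = np.random.randn(n, m)
b = np.random.randn(n)
u = np.random.randn(n)

f = np.tanh                        # assume f is tanh for this sketch
df = lambda z: 1.0 - np.tanh(z) ** 2   # its elementwise derivative

# Forward pass: z = Wx + b, h = f(z), s = u^T h
z = W @ x + b
h = f(z)
s = u @ h

# Backward pass: delta is the error signal at z, ds/dW_ij = delta_i * x_j
delta = u * df(z)
dW = np.outer(delta, x)

# Finite-difference check of one entry of dW
eps = 1e-6
W2 = W.copy()
W2[0, 1] += eps
s2 = u @ f(W2 @ x + b)
print(dW[0, 1], (s2 - s) / eps)    # the two numbers should match closely
```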
Tips for deriving gradients:
Carefully define your variables and keep track of their dimensionality!
Chain rule: keep straight what variables feed into what computations.
For the softmax part of the model, first work out the derivative wrt f_c when c = y (the correct class), then consider the derivative wrt f_c when c ≠ y (all the incorrect classes).
Work out element-wise partial derivatives if you're getting confused by matrix calculus!
Use the shape convention: the error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer.
The gradient with respect to the full window of word vectors can simply be split up for each word vector:
∂s/∂x = [ ∂s/∂x_museums  ∂s/∂x_in  ∂s/∂x_Paris  ∂s/∂x_are  ∂s/∂x_amazing ]
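A small numpy sketch of this splitting (the 4-dimensional word vectors and random values are assumptions): the gradient for the whole window, ∂s/∂x = Wᵀδ, is one long vector, and reshaping it gives one gradient per word in the window.

```python
import numpy as np

np.random.seed(1)
d, window, n = 4, 5, 3              # word-vector size, window length, hidden size (assumed)
W = np.random.randn(n, d * window)
delta = np.random.randn(n)          # error signal at z, as derived above

dx = W.T @ delta                    # ds/dx for the whole concatenated window
per_word = dx.reshape(window, d)    # row 0 -> x_museums, row 1 -> x_in, ...
print(per_word.shape)               # (5, 4)
```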
Updating the word vectors in this way pushes them around so that they will (in principle) be more helpful in determining named entities. For example, the model might learn that seeing x_in as the word just before the center word is indicative for the center word to be a location.
A pitfall when retraining word vectors. Setting: we are training a logistic regression model for movie review sentiment using single words. In the pre-trained vectors, "TV", "telly", and "television" all start out close together, but only the words that occur in the training data get updated, so a word seen only at test time (say "television") is left behind while its training-set neighbours move. This can be bad!
[Figure: word-vector plots before and after training, showing "TV" and "telly" moving while "television" stays put.]
Question: Should I use available "pre-trained" word vectors?
Answer: Almost always, yes. They are trained on a huge amount of data, so they will know about words not in your training data and will know more about words that are in your training data.
Question: Should I update ("fine-tune") my own word vectors?
Answer: If you only have a small training dataset, don't train the word vectors. If you have a large dataset, it will probably work better to train = update = fine-tune word vectors to the task.
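A short PyTorch sketch of the two choices (the vocabulary size, dimensionality, and random "pre-trained" vectors are placeholders): freeze=True keeps the word vectors fixed, while freeze=False lets the optimizer fine-tune them to the task.

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10_000, 300)   # placeholder pre-trained vectors

# Small training set: keep the pre-trained word vectors fixed
emb_frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large training set: fine-tune the word vectors to the task
emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(emb_frozen.weight.requires_grad)  # False: not updated during training
print(emb_tuned.weight.requires_grad)   # True: updated by the optimizer
```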
Backpropagation at a single node: the node receives an "upstream gradient" (the gradient of the loss with respect to its output) and computes a "local gradient" (the gradient of its output with respect to its input). Chain rule! The "downstream gradient" that it passes back is
downstream gradient = upstream gradient × local gradient.
[Figure: a single node annotated with its upstream, local, and downstream gradients; a node with multiple inputs has one local gradient and one downstream gradient per input.]
[Figure: worked example of backpropagation on a small computation graph containing a + node, a max node, and a * node. Forward prop computes the node values step by step (inputs 1 and 2 feed forward to intermediate values 3 and 2 and output 6). Local gradients are then computed at each node, and downstream = upstream × local is applied in reverse order: 1×3 = 3 and 1×2 = 2 at the * node, 3×1 = 3 and 3×0 = 0 at the max node (the gradient is routed only to the larger input), and 2×1 = 2 and 2×1 = 2 at the + node (the gradient is distributed to both inputs).]
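As a check on the hand computation, here is a short PyTorch sketch of the graph read off the diagram, f(x, y, z) = (x + y) · max(y, z) with x = 1 and y = 2 (z = 0 is an assumption for this sketch): autograd reproduces the downstream gradients, and y collects gradient along both paths (2 through the + node and 3 through the max node).

```python
import torch

# Inputs for the worked example (z = 0 is assumed here)
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

a = x + y              # a = 3
b = torch.max(y, z)    # b = 2
f = a * b              # f = 6

f.backward()           # downstream = upstream * local at every node, in reverse order
print(x.grad, y.grad, z.grad)   # tensor(2.), tensor(5.), tensor(0.)
```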
It is more efficient to compute all the gradients at once, analogous to using δ when we computed gradients by hand, rather than recomputing shared parts of the graph for each parameter.
Back-prop in a general computation graph with a single scalar output z: compute the gradient wrt each node using the gradient wrt its successors. If {y_1, ..., y_n} = successors of x, then
∂z/∂x = Σ_{i=1}^{n} (∂z/∂y_i)(∂y_i/∂x)
Done correctly, the big O() complexity of fprop and bprop is the same. In general our nets have a regular layer structure, so we can use matrices and Jacobians…
Automatic differentiation: the gradient computation can be automatically inferred from the symbolic expression of the fprop. Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output. Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you, but mainly leave it to the layer/node writer to hand-calculate the local derivative.
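A toy sketch of that contract (the class names and the tiny graph are illustrative, not any framework's real API): each node caches what it needs in the forward pass, and its backward method turns the gradient wrt its output into gradients wrt its inputs.

```python
class AddGate:
    """Toy node: out = a + b."""
    def forward(self, a, b):
        return a + b

    def backward(self, d_out):
        # local gradients are 1 and 1: + distributes the upstream gradient
        return d_out, d_out


class MultiplyGate:
    """Toy node: out = a * b."""
    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b

    def backward(self, d_out):
        # local gradients: d(out)/da = b, d(out)/db = a
        return d_out * self.b, d_out * self.a


# Forward pass on a tiny graph: f = (x + y) * y with x = 1, y = 2
add, mul = AddGate(), MultiplyGate()
a = add.forward(1.0, 2.0)
f = mul.forward(a, 2.0)

# Backward pass in reverse order, starting from df/df = 1
d_a, d_y_from_mul = mul.backward(1.0)
d_x, d_y_from_add = add.backward(d_a)
print(f, d_x, d_y_from_add + d_y_from_mul)   # 6.0 2.0 5.0 (gradients to y add up)
```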
to do this everywhere.
Why learn all these details about gradients when backprop is implemented for you? Backpropagation doesn't always work perfectly, and understanding why is crucial for debugging and improving models. See Karpathy's article "Yes you should understand backprop": https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
We really optimize a full loss function that includes regularization over all parameters θ, e.g., L2 regularization:
J(θ) = (1/N) Σ_{i=1}^{N} −log( e^{f_{y_i}} / Σ_c e^{f_c} ) + λ Σ_k θ_k²
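A small PyTorch sketch (the model, λ, and the data here are placeholders): the L2 term can be added to the loss directly, or (up to a factor of 2 in λ) via the optimizer's weight_decay argument.

```python
import torch
import torch.nn as nn

model = nn.Linear(50, 3)                      # placeholder classifier
criterion = nn.CrossEntropyLoss()             # the -log softmax term, averaged over N
lam = 1e-4                                    # regularization strength (placeholder)

inputs = torch.randn(8, 50)
targets = torch.randint(0, 3, (8,))

loss = criterion(model(inputs), targets)
loss = loss + lam * sum((p ** 2).sum() for p in model.parameters())

# Roughly equivalent shortcut: let the optimizer apply the L2 penalty
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=lam)
```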
[Figure: training error and test error plotted against model power, illustrating overfitting.]
Non-linearities, the starting points: logistic ("sigmoid"), tanh, hard tanh. tanh is just a rescaled and shifted sigmoid (2× as steep, range [−1, 1]): tanh(z) = 2 logistic(2z) − 1. Both logistic and tanh are still used in particular cases, but are no longer the defaults for making deep networks.
[Figure: plots of the logistic, tanh, and hard tanh curves.]
Other choices: hard tanh, ReLU (rectified linear unit), Leaky ReLU, Parametric ReLU.
For building a feed-forward deep network, the first thing you should try is ReLU: it trains quickly and performs well due to good gradient backflow.
rect(z) = max(z,0)
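A small numpy sketch of these nonlinearities (the leaky slope value is an illustrative default; in a Parametric ReLU the slope would be learned):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return 2.0 * logistic(2.0 * z) - 1.0   # rescaled, shifted sigmoid

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)

def relu(z):
    return np.maximum(z, 0.0)              # rect(z) = max(z, 0)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # slope is learned per unit in Parametric ReLU

z = np.linspace(-2.0, 2.0, 5)
print(relu(z), leaky_relu(z))
```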
Initialize hidden layer biases to 0, and output (or reconstruction) biases to the optimal value if the weights were 0 (e.g., mean target or inverse sigmoid of mean target). Initialize all other weights ~ Uniform(−r, r), with r chosen so that the numbers get neither too big nor too small. Xavier initialization has variance inversely proportional to fan-in n_in (previous layer size) and fan-out n_out (next layer size):
Var(W_i) = 2 / (n_in + n_out)
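A brief PyTorch sketch (the layer sizes are placeholders): nn.init.xavier_uniform_ implements this fan-in/fan-out scaling, and biases start at 0.

```python
import torch.nn as nn

layer = nn.Linear(100, 50)              # fan-in 100, fan-out 50 (placeholders)
nn.init.xavier_uniform_(layer.weight)   # Var(W) ~ 2 / (n_in + n_out)
nn.init.zeros_(layer.bias)              # hidden-layer biases initialized to 0
```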
Usually, plain SGD will work just fine! However, getting good results will often require hand-tuning the learning rate (next slide). For more complex nets and situations, or just to avoid worry, you often do better with one of a family of more sophisticated "adaptive" optimizers that scale the parameter adjustment by an accumulated gradient, such as Adagrad, RMSprop, or Adam.
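A short PyTorch sketch contrasting the two (the model and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)            # placeholder model

# Plain SGD with a hand-tuned learning rate
sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# Adam: an adaptive optimizer that scales each parameter's update
# using accumulated gradient statistics
adam = torch.optim.Adam(model.parameters(), lr=0.001)
```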
You can just use a constant learning rate, but better results can generally be obtained by allowing learning rates to decrease as you train. By hand: halve the learning rate every k epochs. By a formula: lr = lr_0 e^{−kt}, for epoch t. Fancier optimizers still use a learning rate, but it may be an initial rate that the optimizer shrinks, so you may be able to start high.
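A brief PyTorch sketch of decaying the rate per epoch (the step size and decay factors are placeholders): StepLR halves the rate every k epochs, and ExponentialLR would give the lr = lr_0 · gamma^t form instead.

```python
import torch

model = torch.nn.Linear(10, 2)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every k = 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
# torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95) gives exponential decay

for epoch in range(20):
    # ... one epoch of training would go here ...
    scheduler.step()                                  # decay once per epoch
```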