Aykut Erdem // Hacettepe University // Fall 2019
Lecture 12:
Computational Graph Backpropagation
BBM406
Fundamentals of Machine Learning
Illustration: 3Blue1Brown
Last time… Multilayer Perceptron
Layer representation: each layer applies a linear mapping $y_i = W_i x_i$ followed by a nonlinear function $x_{i+1} = \sigma(y_i)$, chaining $x_1 \to x_2 \to x_3 \to x_4 \to y$ through the weights $W_1, W_2, W_3, W_4$, with a loss $l(y, y_i)$ to measure the quality of the estimate so far.
(slide credit: Alex Smola)

Last time… Forward Pass
Output of unit $k$:

$$o_k(x) = g\Big(w_{k0} + \sum_{j=1}^{J} h_j(x)\, w_{kj}\Big), \qquad h_j(x) = f\Big(v_{j0} + \sum_{i=1}^{D} x_i v_{ji}\Big)$$

($j$ indexing the hidden units, $k$ indexing the output units, $D$ the number of inputs)

Typical choices for the activation functions:

$$\sigma(z) = \frac{1}{1 + \exp(-z)}, \qquad \tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}, \qquad \mathrm{ReLU}(z) = \max(0, z)$$

(slide credit: Raquel Urtasun, Richard Zemel, Sanja Fidler)
Last time… Forward Pass in Python

(Question from the slide: where are the biases and $W_3$?)

[http://cs231n.github.io/neural-networks-1/]
(slide credit: Raquel Urtasun, Richard Zemel, Sanja Fidler)
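The code itself was lost in extraction; a minimal runnable sketch in the spirit of the linked cs231n page, with assumed layer sizes (3 inputs, two hidden layers of 4 units, 1 output):

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # activation function (sigmoid)

# assumed shapes for this sketch: 3 -> 4 -> 4 -> 1
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

x = np.random.randn(3, 1)        # random input vector (3x1)
h1 = f(np.dot(W1, x) + b1)       # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)      # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3        # output neuron (1x1)
```

The biases are b1, b2, b3; W3 is the final 1x4 matrix mapping the last hidden layer to the output.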
Recap: Loss function/Optimization
We defined a (linear) score function: $f(x, W) = Wx$

[Figure: class scores for three example images under some setting of $W$]

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
Softmax Classifier (Multinomial Logistic Regression)

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
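The loss these slides build up step by step was lost in extraction; it is the standard softmax cross-entropy. For an example $x_i$ with label $y_i$ and scores $s = f(x_i; W)$:

$$L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$

In code, with the usual max-subtraction for numeric stability (a sketch, not from the slides):

```python
import numpy as np

def softmax_loss(s, y):
    """s: score vector for one example; y: index of the correct class."""
    s = s - np.max(s)                   # shift scores for numeric stability
    p = np.exp(s) / np.sum(np.exp(s))   # class probabilities
    return -np.log(p[y])                # cross-entropy loss L_i
```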
We want to compute the gradient of the loss with respect to the parameters. In practice we derive the analytic gradient to compute the gradient efficiently (checking the implementation against the numerical gradient), then repeatedly update the parameters by gradient descent. There are also fancier update formulas (momentum, Adagrad, RMSProp, Adam, …), as sketched below.

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
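As one illustration of such an update formula, the classic momentum update; a sketch with assumed names (dw for the gradient, v for the velocity):

```python
import numpy as np

mu, learning_rate = 0.9, 1e-2          # typical momentum and step size
w = np.zeros(10)                       # toy parameter vector
v = np.zeros_like(w)                   # velocity, initialized to zero

def momentum_step(w, v, dw):
    v = mu * v - learning_rate * dw    # integrate velocity
    w = w + v                          # integrate position
    return w, v
```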
The effects of different update formulas (image credits to Alec Radford)
Computational graph: inputs $x$ and $W$ feed a multiply node (*) producing the scores $s$; the hinge loss and the regularization term $R$ are combined at (+) to give the loss $L$.

Convolutional Network (AlexNet): the same picture at scale, with input image, weights, and loss.

Neural Turing Machine: input tape, weights, loss.

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
A worked example (the running circuit from the cs231n slides): $f(x, y, z) = (x + y)\,z$, e.g. $x = -2$, $y = 5$, $z = -4$. Want: $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$.

Introduce an intermediate variable $q = x + y$, so that $f = qz$. Forward pass: $q = 3$, $f = -12$. Backward pass: $\frac{\partial f}{\partial z} = q = 3$ and $\frac{\partial f}{\partial q} = z = -4$; then, since $\frac{\partial q}{\partial x} = \frac{\partial q}{\partial y} = 1$, the chain rule gives $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial y} = -4$.

Chain rule: $\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial q}\,\dfrac{\partial q}{\partial x}$

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
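The same example in code, mirroring the cs231n walkthrough of this circuit:

```python
# set some inputs
x, y, z = -2, 5, -4

# perform the forward pass
q = x + y    # q becomes 3
f = q * z    # f becomes -12

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q           # df/dz = q, so the gradient on z becomes 3
dfdq = z           # df/dq = z, so the gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0 * dfdq  # dq/dx = 1; the multiplication here is the chain rule!
dfdy = 1.0 * dfdq  # dq/dy = 1
```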
During the forward pass, a gate $f$ computes its output $z = f(x, y)$ from its input activations. During the backward pass, it receives the upstream gradient $\frac{\partial L}{\partial z}$ and multiplies it by its "local gradients" $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ to produce the gradients it passes back to its inputs:

$$\frac{\partial L}{\partial x} = \frac{\partial z}{\partial x}\,\frac{\partial L}{\partial z}, \qquad \frac{\partial L}{\partial y} = \frac{\partial z}{\partial y}\,\frac{\partial L}{\partial z}$$

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
Another example: $\displaystyle f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$, with $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $w_2 = -3$.

Stepping backward through the graph, every gate computes [local gradient] × [its gradient]:
- the *(−1) gate: (−1) × (−0.20) = 0.20
- the + gate routes the gradient unchanged to both inputs: [1] × [0.2] = 0.2 (both inputs!)
- the * gate: $x_0$ gets [2] × [0.2] = 0.40, while $w_0$ gets [−1] × [0.2] = −0.20

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$, whose derivative is $\frac{d\sigma}{dz} = (1 - \sigma(z))\,\sigma(z)$. The entire sigmoid gate can therefore be backpropped in a single step: with forward output 0.73, its local gradient is (0.73) × (1 − 0.73) = 0.2, reproducing the same 0.2 (and the 0.40 on $x_0$) as the gate-by-gate computation.

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
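The same neuron in code, mirroring the cs231n snippet for this example:

```python
import math

w = [2, -3, -3]  # weights w0, w1 and bias w2 (assume some weights and data)
x = [-1, -2]     # inputs x0, x1

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]           # = 1.0
f = 1.0 / (1 + math.exp(-dot))               # sigmoid function, = 0.73

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f                           # sigmoid gate in one step, = 0.2
dx = [w[0] * ddot, w[1] * ddot]              # backprop into x: [0.4, -0.6]
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # backprop into w: [-0.2, -0.4, 0.2]
```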
Gradients add at branches: when a variable feeds into multiple parts of the graph, the gradients flowing back along each branch are summed (+) at that variable.

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
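A tiny numeric check of this rule (my own example, not from the slides): take $f = xy + x$, where $x$ feeds both branches.

```python
x, y = 3.0, -2.0

# forward: x feeds two branches, a = x*y and b = x
a = x * y
b = x
f = a + b                 # f = -3.0

# backward: the + gate gives df/da = df/db = 1
da, db = 1.0, 1.0
dx = y * da + 1.0 * db    # gradients ADD at the branch: -2 + 1 = -1
dy = x * da               # = 3
```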
Implementation: forward/backward API

Graph (or Net) object. (Rough pseudo code; see the sketch below.)

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
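The pseudocode itself was not preserved in this extraction; a minimal sketch, assuming gate objects that are already wired to their inputs and expose no-argument forward()/backward():

```python
class ComputationalGraph:
    def __init__(self, gates):
        self.gates = gates            # gates in topologically sorted order

    def forward(self):
        out = None
        for gate in self.gates:       # forward the computational graph
            out = gate.forward()
        return out                    # the final gate outputs the loss

    def backward(self):
        for gate in reversed(self.gates):
            gate.backward()           # one little piece of backprop (chain rule)
```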
Example gate: the multiply gate $z = x \cdot y$ (x, y, z are scalars).

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
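The slide's code was lost; a sketch of the multiply gate in the forward/backward API, in the style of the cs231n example:

```python
class MultiplyGate:
    def forward(self, x, y):
        z = x * y
        self.x = x    # must keep these around to compute the local gradients!
        self.y = y
        return z

    def backward(self, dz):      # dz is the upstream gradient dL/dz
        dx = self.y * dz         # [dz/dx * dL/dz]
        dy = self.x * dz         # [dz/dy * dL/dz]
        return [dx, dy]

# usage:
gate = MultiplyGate()
z = gate.forward(-2.0, 5.0)      # forward: z = -10.0
dx, dy = gate.backward(1.0)      # backward: dx = 5.0, dy = -2.0
```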
Summary so far:
- Neural nets will be very large: there is no hope of writing down the gradient formula by hand for all parameters.
- Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates.
- Implementations maintain a graph structure, where the nodes implement the forward() / backward() API.
- forward: compute the result of an operation and save any intermediates needed for gradient computation in memory.
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs.

(slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
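The loop in code; a sketch in which sample_training_data and evaluate_gradient stand in for helpers not shown here:

```python
# Vanilla minibatch SGD loop (sketch; helper names are assumptions)
while True:
    data_batch = sample_training_data(data, 256)        # 1. sample 256 examples
    loss, weights_grad = evaluate_gradient(loss_fun,
                                           data_batch,
                                           weights)     # 2+3. forward, then backprop
    weights += -step_size * weights_grad                # 4. parameter update
```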
Next Lecture: Introduction to Deep Learning