CMP784 Deep Learning
Lecture #03 – Multi-layer Perceptrons
Aykut Erdem // Hacettepe University // Spring 2020
Image: Jose-Luis Olivares
Breaking news!
— Practical 1 is out! Learning neural word embeddings
— Due Friday, Mar. 26, 23:59:59 (in two weeks!)
— Choose your papers and your roles
Previously on CMP784
Puppy or bagel? // Karen Zack
Lecture overview
— Hugo Larochelle's Neural networks slides
— Nick Locascio's MIT 6.S191 slides
— Efstratios Gavves and Max Welling's UvA deep learning class
— Leonid Sigal's CPSC532L class
— Richard Socher's CS224d class
— Dan Jurafsky's CS124 class
A Brief History of Neural Networks
Image: VUNI Inc.
The Perceptron
[Diagram: inputs x0 … xn plus a constant 1; weights w0 … wn and bias b; a weighted sum ∑ followed by a non-linearity]
Perceptron Forward Pass
— Pre-activation (or input activation):
  a(x) = b + Σ_i w_i x_i = b + wᵀx
— Output activation of the neuron:
  h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
where
w are the weights (parameters), b is the bias term, and g(·) is called the activation function.
Output Activation of the Neuron
  h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
— The bias b only changes the position of the ridge
— The range of h(x) is determined by g(·)
(from Pascal Vincent's slides) Image credit: Pascal Vincent
Linear Activation Function
  g(a) = a, so h(x) = b + Σ_i w_i x_i
— Performs no squashing of its input
Sigmoid Activation Function
  g(a) = σ(a) = 1 / (1 + exp(−a)), so h(x) = σ(b + Σ_i w_i x_i)
— Squashes the neuron's output between 0 and 1
— Always positive
— Bounded
— Strictly increasing
Perceptron Forward Pass
— Example: inputs x = (2, 3, −1, 5), weights w = (0.1, 0.5, 2.5, 0.2), bias b = 3.0, sigmoid non-linearity:
  a(x) = (2 × 0.1) + (3 × 0.5) + (−1 × 2.5) + (5 × 0.2) + (1 × 3.0) = 3.2
  h(x) = g(3.2) = σ(3.2) = 1 / (1 + e⁻³·²) = 0.96
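A minimal NumPy sketch of this forward pass (illustrative code, not from the original slides); the numbers reproduce the example above:

```python
import numpy as np

def perceptron_forward(x, w, b, g):
    """Weighted sum of inputs plus bias, passed through a non-linearity g."""
    a = b + np.dot(w, x)   # pre-activation a(x) = b + w^T x
    return g(a)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([2.0, 3.0, -1.0, 5.0])
w = np.array([0.1, 0.5, 2.5, 0.2])
print(perceptron_forward(x, w, b=3.0, g=sigmoid))  # ~0.96, i.e. sigma(3.2)
```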
Hyperbolic Tangent (tanh) Activation Function
  g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)
— Squashes the neuron's output between −1 and 1
— Bounded
— Strictly increasing
Rectified Linear (ReLU) Activation Function
  g(a) = max(0, a)
— Bounded below by 0 (always non-negative)
— Not upper bounded
— Monotonically increasing
— Tends to produce units with sparse activities
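The four activation functions above, sketched side by side in NumPy for comparison (illustrative code, not from the slides):

```python
import numpy as np

def linear(a):  return a                         # no squashing
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))  # squashes to (0, 1)
def tanh(a):    return np.tanh(a)                # squashes to (-1, 1)
def relu(a):    return np.maximum(0.0, a)        # non-negative, sparse

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (linear, sigmoid, tanh, relu):
    print(g.__name__, np.round(g(a), 3))
```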
Decision Boundary of a Neuron
—with sigmoid, one can interpret neuron as estimating p(y = 1 | x) —also known as logistic regression classifier —if activation is greater than 0.5, predict 1 —otherwise predict 0
Same idea can be applied to a tanh activation
17Image credit: Pascal Vincent
(from Pascal Vincent’s slides)
han
Decision boundary is linear
Capacity of a Single Neuron
— A single neuron can implement any linearly separable Boolean function of its inputs, e.g., OR(x1, x2) and AND(x1, x2): a single line (hyperplane) separates the 0s from the 1s.
Capacity of a Single Neuron
— XOR(x1, x2) cannot be implemented by a single neuron: no single line separates its 0s from its 1s.
— It can, however, be expressed as a combination of linearly separable functions, e.g., XOR(x1, x2) = AND(OR(x1, x2), NOT(AND(x1, x2))), which motivates stacking neurons into layers.
Perceptron Diagram Simplified
[Diagram: the explicit weights, bias, sum, and non-linearity are collapsed into the node; we draw only the inputs x0 … xn and the output]
Multi-Output Perceptron
— We need multiple outputs (1 output per class)
— We need to estimate the conditional probability p(y = c | x)
— Discriminative learning
— Use the softmax activation function at the output:
  o(a) = softmax(a) = [ exp(a1) / Σ_c exp(a_c), …, exp(aC) / Σ_c exp(a_c) ]ᵀ
— Strictly positive
— Sums to one
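A small sketch of the softmax in NumPy; the shift by the maximum is a standard numerical-stability trick, assumed here rather than taken from the slides:

```python
import numpy as np

def softmax(a):
    """Strictly positive outputs that sum to one."""
    e = np.exp(a - np.max(a))  # shift by max to avoid overflow
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(np.round(p, 3), p.sum())  # [0.09 0.245 0.665] 1.0
```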
Multi-Layer Perceptron
Single Hidden Layer Neural Network
— Hidden layer pre-activation:
  a(x)_i = b(1)_i + Σ_j W(1)_{i,j} x_j   (i.e., a(x) = b(1) + W(1) x)
— Hidden layer activation:
  h(1)(x) = g(a(x))
— Output layer activation:
  f(x) = o( b(2) + w(2)ᵀ h(1)(x) )
Multi-Layer Perceptron (MLP)
— Consider a network with L hidden layers.
— Layer pre-activation for k > 0 (with h(0)(x) = x):
  a(k)(x) = b(k) + W(k) h(k−1)(x)
— Hidden layer activation, for k from 1 to L:
  h(k)(x) = g(a(k)(x))
— Output layer activation (k = L+1):
  h(L+1)(x) = o(a(L+1)(x)) = f(x)
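A compact sketch of the forward pass these equations define; the layer sizes and random weights below are hypothetical:

```python
import numpy as np

def mlp_forward(x, Ws, bs, g, o):
    """Hidden layers use activation g; the last layer uses output activation o."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = g(b + W @ h)            # h^(k) = g(b^(k) + W^(k) h^(k-1))
    return o(bs[-1] + Ws[-1] @ h)   # f(x) = o(a^(L+1))

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]  # 4 -> 3 -> 2
bs = [np.zeros(3), np.zeros(2)]
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()
print(mlp_forward(rng.normal(size=4), Ws, bs, g=sigmoid, o=softmax))
```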
Deep Neural Network
[Diagram: inputs x0 … xn feeding several stacked hidden layers, followed by the output layer]
Capacity of Neural Networks
[Diagram: a single-hidden-layer network, with input units i, hidden units j, output units k, and bias; weights wji connect input to hidden, wkj connect hidden to output, over inputs x1, x2]
(from Pascal Vincent's slides) Image credit: Pascal Vincent
Capacity of Neural Networks
[Diagram: hidden units y1 … y4 over inputs x1, x2 are combined into an output z1]
(from Pascal Vincent's slides) Image credit: Pascal Vincent
Capacity of Neural Networks
(from Pascal Vincent's slides) Image credit: Pascal Vincent
Universal Approximation
—“a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units’’
find the necessary parameter values.
Applying Neural Networks
Example Problem: Will my flight be delayed?
— Input features, e.g., Wind Speed: 45 mph → x = [−20, 45]
[Diagram: a small network with inputs x0, x1, hidden units h0, h1, h2, and one output]
— For this example the actual outcome is 1 (delayed).
Quantifying Loss
— For input x = [−20, 45]: Predicted: 0.05, Actual: 1
— The loss measures the mismatch between the prediction and the actual label:
  ℓ(f(x(i); θ), y(i))
Total Loss
— Inputs: [ [−20, 45], [80, 0], [4, 15], [45, 60] ]
— Predicted: [ 0.05 0.02 0.96 0.35 ]   Actual: [ 1 1 1 ]
— The total loss averages the per-example losses over the whole dataset:
  J(θ) = (1/N) Σ_i ℓ(f(x(i); θ), y(i))
Binary Cross Entropy Loss
— Inputs: [ [−20, 45], [80, 0], [4, 15], [45, 60] ]; Predicted: [ 0.05 0.02 0.96 0.35 ]; Actual: [ 1 1 1 ]
— For binary classification, use the binary cross entropy (note the leading minus sign; we minimize the negative log-likelihood):
  J_cross-entropy(θ) = −(1/N) Σ_i [ y(i) log f(x(i); θ) + (1 − y(i)) log(1 − f(x(i); θ)) ]
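A NumPy sketch of this loss; the eps clipping is a standard guard against log(0), and the label vector is hypothetical since the slide truncates it:

```python
import numpy as np

def bce(pred, y, eps=1e-12):
    """Mean binary cross-entropy over N predictions."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(pred) + (1.0 - y) * np.log(1.0 - pred))

pred = np.array([0.05, 0.02, 0.96, 0.35])
y    = np.array([1.0, 0.0, 1.0, 1.0])  # hypothetical labels for the 4 flights
print(bce(pred, y))
```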
Mean Squared Error Loss
— Inputs: [ [−20, 45], [80, 0], [4, 15], [45, 60] ]; Predicted: [ 0.05 0.02 0.96 0.35 ]; Actual: [ 1 1 1 ]
— For regression-style targets, use the mean squared error:
  J_MSE(θ) = (1/N) Σ_i ( f(x(i); θ) − y(i) )²
Training Neural Networks
— Empirical risk minimization with regularization:
  arg min_θ (1/T) Σ_t ℓ(f(x(t); θ), y(t)) + λ Ω(θ)
  (the first term is the loss function, the second the regularizer)
— For classification problems, we would like to minimize classification error
— The loss function can sometimes be viewed as a surrogate for what we actually want to optimize (e.g., an upper bound)
How to minimize the loss?
— The loss is a function of the model's parameters.
— Compute the gradient of the loss with respect to the parameters.
— Move the parameters in the direction opposite to the gradient.
— Repeat! Iterating these steps is gradient descent; when each step uses a single example or a mini-batch, it is called Stochastic Gradient Descent (SGD).
Stochastic Gradient Descent (SGD)
— Why is it Stochastic Gradient Descent? The gradient over the full training set is the true gradient; the gradient over a single example or a mini-batch is only an estimate of it.
— Update rule: θ(t+1) = θ(t) − η ∇θ ℒ
— Mini-batching:
  — Smoother convergence
  — Allows for larger learning rates
  — Can parallelize computation and achieve significant speed increases on GPUs
— Training epoch = iteration over all examples
Stochastic Gradient Descent (SGD)
— Initialize θ
— For N iterations:
  — For each training example (x(t), y(t)) (or mini-batch):
    Δ = −∇θ ℓ(f(x(t); θ), y(t)) − λ ∇θ Ω(θ)
    θ ← θ + η Δ
— Training epoch = iteration over all examples
— To apply this algorithm, we need:
  — the loss function ℓ(f(x(t); θ), y(t))
  — a procedure to compute the parameter gradients ∇θ ℓ
  — the regularizer Ω(θ) (and its gradient ∇θ Ω)
  — an initialization method for θ
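A sketch of this training loop; grad_loss and grad_reg stand in for gradients that backpropagation would supply, and the toy 1-D data below is hypothetical:

```python
import numpy as np

def sgd(theta, grad_loss, grad_reg, data, lr=0.01, lam=0.01, epochs=10):
    """Plain SGD: one parameter update per training example."""
    for _ in range(epochs):            # one epoch = one pass over all examples
        for x, y in data:
            delta = grad_loss(theta, x, y) + lam * grad_reg(theta)
            theta = theta - lr * delta  # move against the gradient
    return theta

# Toy usage: fit a 1-D linear model y = theta * x with squared loss
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
g_loss = lambda th, x, y: 2 * (th * x - y) * x
g_reg  = lambda th: 2 * th
print(sgd(np.array(0.0), g_loss, g_reg, data, lam=0.0, epochs=50))  # -> ~2.0
```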
What is a neural network, again?
— A hierarchy of (non-)linear functions composed layer by layer:
  aL(x; θ1,…,L) = hL(hL−1(… h1(x, θ1), θL−1), θL)
— Training finds the parameters that minimize the loss over the data:
  θ* ← arg min_θ Σ_{(x,y)∈(X,Y)} ℓ(y, aL(x; θ1,…,L))
Neural network models
[Diagram: Input → h1(xi; θ) → h2(xi; θ) → h3(xi; θ) → h4(xi; θ) → h5(xi; θ) → Loss]
Forward connections (Feedforward architecture)
Neural network models
[Diagram: modules h1 … h5 with skip connections, e.g., h2 and h4 also feeding later modules]
Interweaved connections (Directed Acyclic Graph architecture – DAGNN)
Neural network models
[Diagram: modules h1 … h5 with feedback connections]
Loopy connections (Recurrent architecture; special care needed)
Neural network models
— Functions → Modules: each function hl(xi; θ) becomes a module, and the same modules can be wired into feedforward, DAG, or recurrent architectures.
What is a module?
— Contains trainable parameters θ
— Receives as an argument an input x
— And returns an output a based on the activation function h(…)
— Must be differentiable (almost) everywhere
— In practice we store the module input, so that it is
  — easy to get the module output fast
  — easy to compute derivatives
Anything goes, or do special constraints exist?
— A module can implement any (differentiable) computation
— Loops form recurrent connections (revisited later)
What is a module (in a network)?
— A neural network is a composition of modules hl(…), where the output of one module becomes the input of the next:
  al = hl(xl; θl),  with al = xl+1 (equivalently xl = al−1)
What is a module (in training)?
— To train, we define a loss for our data and need the gradients of every module ∂hl(xl; θl) w.r.t. its inputs xl and parameters θl
— We build the computational graph and traverse it backwards, combining the modules' gradients via the backpropagation algorithm to obtain dLoss(Input)/dθ
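A minimal module sketch in Python (illustrative, not a real framework API): it stores what the forward pass computes so the backward pass is cheap, exactly as described above.

```python
import numpy as np

class Sigmoid:
    """Module with forward/backward; caches its output for the derivative."""
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))   # store output for reuse
        return self.y
    def backward(self, grad_out):
        # local derivative dy/dx = y(1 - y), combined with incoming gradient
        return grad_out * self.y * (1.0 - self.y)

m = Sigmoid()
m.forward(np.array([0.0, 1.0]))
print(m.backward(np.ones(2)))  # [0.25, ~0.197]
```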
Again, what is a neural network?
  aL(x; θ1,…,L) = hL(hL−1(… h1(x, θ1), θL−1), θL)
— x: input, θl: parameters for layer l, al = hl(x, θl): (non-)linear function
— Training: θ* ← arg min_θ Σ_{(x,y)∈(X,Y)} ℓ(y, aL(x; θ1,…,L))
— Gradient descent ( θt+1 = θt − ηt ∂ℒ/∂θt ) requires the gradients
  ∂ℒ/∂θl,  l = 1, …, L
— But how do we compute the gradients of such a complicated function that encloses other functions, like aL(…)?
How do we compute gradients?
Numerical Differentiation
— By definition, ∂f(x)/∂x_i = lim_{h→0} [f(x + h·1_i) − f(x)] / h, so for small h we can approximate:
  ∂f(x)/∂x_i ≈ [f(x + h·1_i) − f(x)] / h                (forward differences)
  ∂f(x)/∂x_i ≈ [f(x + h·1_i) − f(x − h·1_i)] / (2h)     (central differences)
— 1_i: vector of all zeros, except for one 1 in the i-th location
Numerical Differentiation
— These approximations are inexact and far too slow for learning (they are very good tools for checking the correctness of an implementation though, e.g., use h = 0.000001).
slide adopted from T. Chen, H. Shen, A. Krishnamurthy
Numerical Differentiation
— Applied to a network's parameters (1_ij: matrix of all zeros, except for one 1 in the (i,j)-th location; 1_j as before for the bias):
  ∂L(W, b)/∂w_ij ≈ [L(W + h·1_ij, b) − L(W, b)] / h
  ∂L(W, b)/∂b_j ≈ [L(W, b + h·1_j) − L(W, b)] / h
— or, with central differences:
  ∂L(W, b)/∂w_ij ≈ [L(W + h·1_ij, b) − L(W − h·1_ij, b)] / (2h)
  ∂L(W, b)/∂b_j ≈ [L(W, b + h·1_j) − L(W, b − h·1_j)] / (2h)
slide adopted from T. Chen, H. Shen, A. Krishnamurthy
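A gradient-check sketch using central differences, verified here on the running example f(x1, x2) = ln(x1) + x1·x2 − sin(x2):

```python
import numpy as np

def numerical_grad(f, x, h=1e-6):
    """Central differences: (f(x + h*1_i) - f(x - h*1_i)) / (2h) for each i."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                        # the vector h * 1_i
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: np.log(x[0]) + x[0] * x[1] - np.sin(x[1])
print(numerical_grad(f, np.array([2.0, 5.0])))  # ~[5.5, 1.716]
```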
Symbolic Differentiation
— Input formula → a computational graph (a symbolic tree), e.g., for
  y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)
— Implements differentiation rules for composite functions:
  Sum Rule:     d(f(x) + g(x))/dx = df(x)/dx + dg(x)/dx
  Product Rule: d(f(x) · g(x))/dx = (df(x)/dx) g(x) + f(x) (dg(x)/dx)
  Chain Rule:   d(f(g(x)))/dx = (df/dg)(g(x)) · dg(x)/dx
— Problem: for complex functions, expressions can be exponentially large; it is also difficult to deal with piece-wise functions (creates many symbolic cases)
slide adopted from T. Chen, H. Shen, A. Krishnamurthy
Automatic Differentiation (AutoDiff)
— Intuition: interleave symbolic differentiation and simplification
— Key Idea: apply symbolic differentiation at the level of elementary operations, and keep intermediate numerical results rather than symbolic expressions
— The success of deep learning owes A LOT to the success of AutoDiff algorithms (also to advances in parallel architectures, and large datasets, …)
— Running example: y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)
slide adopted from T. Chen, H. Shen, A. Krishnamurthy
Automatic Differentiation (AutoDiff)
— Represent the function as a computational graph (a DAG) with variable nodes; every node is an input, intermediate, or output variable.
  y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)
  [Graph: x1 → v0, x2 → v1; ln(v0) → v2; v0 · v1 → v3; sin(v1) → v4; v2 + v3 → v5; v5 − v4 → v6 → y]
— The computational graph is governed by these equations:
  v0 = x1,  v1 = x2
  v2 = ln(v0),  v3 = v0 · v1,  v4 = sin(v1)
  v5 = v2 + v3,  v6 = v5 − v4,  y = v6
— Let's see how we can evaluate a function using the computational graph (this is exactly DNN inference). Forward Evaluation Trace for f(2, 5):
  v0 = 2
  v1 = 5
  v2 = ln(2) = 0.693
  v3 = 2 × 5 = 10
  v4 = sin(5) = −0.959
  v5 = 0.693 + 10 = 10.693
  v6 = 10.693 + 0.959 = 11.652
  y = 11.652
slide adopted from T. Chen, H. Shen, A. Krishnamurthy
Automatic Differentiation (AutoDiff)
— To compute ∂f(x1, x2)/∂x1 we use forward mode first: introduce, for each variable node, its derivative with respect to the input variable, and propagate it alongside the forward evaluation.
— Forward Derivative Trace (at x1 = 2, x2 = 5):
  ∂v0/∂x1 = 1                                                  → 1
  ∂v1/∂x1 = 0                                                  → 0
  ∂v2/∂x1 = (1/v0) · ∂v0/∂x1                    (Chain Rule)   → (1/2) × 1 = 0.5
  ∂v3/∂x1 = (∂v0/∂x1) · v1 + v0 · (∂v1/∂x1)     (Product Rule) → 1×5 + 2×0 = 5
  ∂v4/∂x1 = (∂v1/∂x1) · cos(v1)                                → 0 × cos(5) = 0
  ∂v5/∂x1 = ∂v2/∂x1 + ∂v3/∂x1                                  → 0.5 + 5 = 5.5
  ∂v6/∂x1 = ∂v5/∂x1 − ∂v4/∂x1                                  → 5.5 − 0 = 5.5
  ∂y/∂x1 = ∂v6/∂x1                                             → 5.5
— We now have ∂f(x1, x2)/∂x1 = 5.5. We still need ∂f(x1, x2)/∂x2, which requires a second forward pass with the derivative seeded on x2.
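Forward mode can be implemented by overloading arithmetic on (value, derivative) pairs, i.e., dual numbers. A minimal sketch (illustrative, not the slides' code) that reproduces the trace above:

```python
import math

class Dual:
    """Forward-mode AD: carry (value, derivative w.r.t. one chosen input)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o): return Dual(self.val + o.val, self.dot + o.dot)
    def __sub__(self, o): return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):  # product rule
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def ln(d):  return Dual(math.log(d.val), d.dot / d.val)        # chain rule
def sin(d): return Dual(math.sin(d.val), d.dot * math.cos(d.val))

def f(x1, x2):  # y = ln(x1) + x1*x2 - sin(x2)
    return ln(x1) + x1 * x2 - sin(x2)

# Seed dot = 1 on x1 for df/dx1; a second pass seeded on x2 gives df/dx2
print(f(Dual(2.0, 1.0), Dual(5.0, 0.0)).dot)  # 5.5
print(f(Dual(2.0, 0.0), Dual(5.0, 1.0)).dot)  # ~1.716
```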
AutoDiff: Forward Mode
— For y = f(x) : Rᵐ → Rⁿ, forward mode needs m forward passes to get the full Jacobian (all gradients of the outputs with respect to every input).
— Problem: a DNN typically takes an image as input, plus all the weights and biases of its layers (millions of inputs!), and has very few outputs (many DNNs have n = 1).
— Reverse mode computes all the gradients in n backward passes, so for most DNNs in a single backward pass: backpropagation.
slide adopted from T. Chen, H. Shen, A. Krishnamurthy
AutoDiff: Reverse Mode
— Traverse the original graph in reverse topological order and, for each node vi, introduce an adjoint node v̄i = ∂y/∂vi, which computes the derivative of the output with respect to that node by combining the "local" derivative with the adjoints of its successors (Chain Rule).
— Recall the Forward Evaluation Trace for f(2, 5):
  v0 = 2, v1 = 5, v2 = ln(2) = 0.693, v3 = 2 × 5 = 10, v4 = sin(5) = −0.959,
  v5 = 0.693 + 10 = 10.693, v6 = 10.693 + 0.959 = 11.652, y = 11.652
— Backwards Derivative Trace:
  v̄6 = ∂y/∂v6                                                  → 1
  v̄5 = v̄6 · ∂v6/∂v5 = v̄6 · 1                                   → 1 × 1 = 1
  v̄4 = v̄6 · ∂v6/∂v4 = v̄6 · (−1)                                → 1 × (−1) = −1
  v̄3 = v̄5 · ∂v5/∂v3 = v̄5 · 1                                   → 1 × 1 = 1
  v̄2 = v̄5 · ∂v5/∂v2 = v̄5 · 1                                   → 1 × 1 = 1
  v̄1 = v̄3 · ∂v3/∂v1 + v̄4 · ∂v4/∂v1 = v̄3 · v0 + v̄4 · cos(v1)    → 1×2 + (−1)×cos(5) = 1.716
  v̄0 = v̄3 · ∂v3/∂v0 + v̄2 · ∂v2/∂v0 = v̄3 · v1 + v̄2 · (1/v0)     → 1×5 + 1×(1/2) = 5.5
— One backward pass yields both gradients at once: ∂f/∂x1 = v̄0 = 5.5 and ∂f/∂x2 = v̄1 = 1.716.
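The backward trace above, written out as a minimal Python sketch: run the forward pass, store the intermediates, then accumulate the adjoints in reverse order (illustrative code, not from the slides):

```python
import math

# Forward pass, storing intermediate values (the "tape")
x1, x2 = 2.0, 5.0
v0, v1 = x1, x2
v2 = math.log(v0); v3 = v0 * v1; v4 = math.sin(v1)
v5 = v2 + v3;      v6 = v5 - v4  # y = v6

# Reverse pass: adjoints v_bar = dy/dv, in reverse topological order
v6b = 1.0
v5b = v6b * 1.0                        # from v6 = v5 - v4
v4b = v6b * (-1.0)
v3b = v5b * 1.0                        # from v5 = v2 + v3
v2b = v5b * 1.0
v1b = v3b * v0 + v4b * math.cos(v1)    # from v3 = v0*v1 and v4 = sin(v1)
v0b = v3b * v1 + v2b / v0              # from v3 = v0*v1 and v2 = ln(v0)

print(v0b, v1b)  # 5.5, ~1.716: both gradients from one backward pass
```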
Automatic Differentiation (AutoDiff)
— AutoDiff can be applied at different granularities for y = f(x1, x2) = ln(x1) + x1x2 − sin(x2):
  — Elementary function granularity: every node is a primitive operation (ln, ·, sin, +, −), as in the graph above
  — Complex function granularity: a node can encapsulate a composite function (e.g., all of f(x1, x2)) with a hand-derived gradient
Backpropagation: Practical Issues
[Diagram: inputs x1 … x5, 1st hidden layer (Wh1, bh1), 2nd hidden layer (Wh2, bh2), output layer (Wo, bo) producing y1, y2]
— Backpropagation is much easier to deal with in vector form: each layer keeps a weight matrix and a bias vector, and gradients are propagated layer by layer.
— At each layer we combine the "local" Jacobian (the matrix of partial derivatives of the layer's outputs with respect to its inputs, e.g., of size |x| × |y|) with the "backprop" gradient arriving from the layer above.
Jacobian of a Sigmoid Layer
— Consider an elementwise sigmoid layer with x, y ∈ R²⁰⁴⁸.
— What is the dimension of the Jacobian? 2048 × 2048.
— What does it look like? Diagonal: since the sigmoid is applied elementwise, yi depends only on xi, so all off-diagonal entries are zero.
— If we are working with a mini-batch of 100 input-output pairs, the Jacobian is a 204,800 × 204,800 matrix! In practice we never materialize it; we exploit its structure.
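This is why backprop through elementwise layers is cheap: a sketch showing the diagonal structure exploited as an elementwise product instead of a 2048 × 2048 matrix multiply (illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.random.randn(2048)
y = sigmoid(x)
diag = y * (1.0 - y)            # the 2048 non-zero (diagonal) Jacobian entries

grad_y = np.random.randn(2048)  # gradient flowing in from the layer above
grad_x = grad_y * diag          # equivalent to J^T @ grad_y, at O(n) cost
print(grad_x.shape)             # (2048,), no 2048 x 2048 matrix ever built
```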
Backpropagation: Common questions
stion: Does BackProp only work for certain layers? Answ swer: No, for any differentiable functions
stion: What is computational cost of BackProp? Answ swer: On average about twice the forward pass
stion: Is BackProp a dual of forward propagation? Answ swer: Yes
139 slide adopted from Marc’Aurelio RanzatoBackpropagation: Common questions
stion: Does BackProp only work for certain layers? Answ swer: No, for any differentiable functions
stion: What is computational cost of BackProp? Answ swer: On average about twice the forward pass
stion: Is BackProp a dual of forward propagation? Answ swer: Yes
140 slide adopted from Marc’Aurelio Ranzato+
Sum Copy Copy Sum
+ FProp BackProp FP FPro rop BackP ckProp
Sum Copy Copy Sum
Demo time
http://playground.tensorflow.org
Shallow yet very powerful: word2vec
From symbolic to distributed word representations
— Most rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk
— In vector-space terms, each word is a "one-hot" vector: a single 1 at the word's index, 0 everywhere else, e.g.,
  "hotel" = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
— Dimensionality: 20K (speech) – 50K (Penn Treebank) – 500K (a large dictionary) – 13M (Google 1T)
— Problem: one-hot vectors carry no notion of similarity, e.g., for
  "hotel" = [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] and "motel" = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0],
  hotelᵀ motel = 0.
Distributional similarity-based representations
— Represent a word by means of its neighbors: "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
— e.g., the contexts of "banking":
  "government debt problems turning into banking crises as has happened in …"
  "… saying that Europe needs unified banking regulation to replace the hodgepodge …"
— These context words will represent "banking"
Distributional hypothesis
— The meaning of a word can be characterized by the set of contexts in which it occurs in texts, e.g.:
  "He filled the wampimuk, passed it around and we all drunk some"
  "We found a little, hairy wampimuk sleeping behind the tree"
— Testing the distributional hypothesis: the influence of context on judgements of semantic similarity [McDonald & Ramscar'01]
Slide credit: Marco Baroni
Distributional semantics
[Concordance: corpus contexts of the word "moon", e.g., "the moon shining in", "a crescent moon", "the moon has risen full and cold", "the moon or the stars", …]
Window-based co-occurrence matrix
— Example corpus (window size 1, symmetric): "I like deep learning." "I like NLP." "I enjoy flying."

  counts    | I | like | enjoy | deep | learning | NLP | flying | .
  I         | 0 |  2   |  1    |  0   |   0      |  0  |   0    | 0
  like      | 2 |  0   |  0    |  1   |   0      |  1  |   0    | 0
  enjoy     | 1 |  0   |  0    |  0   |   0      |  0  |   1    | 0
  deep      | 0 |  1   |  0    |  0   |   1      |  0  |   0    | 0
  learning  | 0 |  0   |  0    |  1   |   0      |  0  |   0    | 1
  NLP       | 0 |  1   |  0    |  0   |   0      |  0  |   0    | 1
  flying    | 0 |  0   |  1    |  0   |   0      |  0  |   0    | 1
  .         | 0 |  0   |  0    |  0   |   1      |  1  |   1    | 0

Richard Socher
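A short sketch that builds such counts from the (assumed) three-sentence corpus:

```python
from collections import defaultdict

def cooccurrence(sentences, window=1):
    """Symmetric window-based co-occurrence counts."""
    counts = defaultdict(int)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    counts[(w, sent[j])] += 1
    return counts

corpus = [["I", "like", "deep", "learning", "."],
          ["I", "like", "NLP", "."],
          ["I", "enjoy", "flying", "."]]
print(cooccurrence(corpus)[("I", "like")])  # -> 2
```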
Three methods for getting short dense vectors
— Singular Value Decomposition (SVD) of the co-occurrence matrix X, the basis of Latent Semantic Analysis (LSA)
  [Embedded excerpt, Landauer & Dumais, "An Introduction to Singular Value Decomposition and an LSA Example": any rectangular matrix X equals the product of three matrices W, S, C, where W maps rows onto m uncorrelated derived dimensions (principal components, basis vectors, factors), S is a diagonal m × m matrix of singular values, and C maps the original columns onto the same dimensions; the worked LSA example uses nine technical-memorandum titles, five about human-computer interaction and four about graph theory]
— Neural-network-inspired prediction models, e.g., the skip-gram model, which predicts the context words wt−2, wt−1, wt+1, wt+2 from the center word wt
Prediction-based models: an alternative way to get dense vectors
Basic idea of learning neural network word embeddings:
— Define a model that predicts a word from the words in its context, and which has a loss function, e.g.,
  prediction: argmax_w w · ((wj−1 + wj+1) / 2)
  loss: J(θ) = 1 − wj · ((wj−1 + wj+1) / 2)   (for unit-norm vectors)
— Look at many samples from a big language corpus and adjust the word vectors so as to minimize this loss
Neural Embedding Models (Mikolov et al. 2013)
[Diagrams: the CBoW model (predict the center word from its context) and the Skip-gram model (predict the context from the center word)]
Image credit: Ed Grefenstette
Details of word2vec
— Maximize the (log-)probability of the context words given the current center word:
  J(θ) = (1/T) Σ_{t=1}^T Σ_{−m ≤ j ≤ m, j≠0} log p(wt+j | wt)
  where θ represents all the variables we optimize and m is the context-window size.
Distributed representations of words and phrases and their compositionality [Mikolov et al.'13]
Details of word2vec
— The simplest formulation for the prediction probability is the softmax:
  p(o | c) = exp(u_oᵀ v_c) / Σ_{w=1}^W exp(u_wᵀ v_c)
  where o is the outside (or output) word id, c is the center word id, and u and v are the "outside" and "center" vectors of o and c.
— Intuition: similarity as a dot product between a target vector and a context vector.
  [Diagram: target embeddings W (|Vw| × d) and context embeddings C (d × |Vw|); Similarity(j, k) is the dot product of the target embedding for word j with the context embedding for word k]
— Turned into probabilities:
  p(wk | wj) = exp(ck · vj) / Σ_{i∈|V|} exp(ci · vj)
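A sketch of this softmax probability with hypothetical U (outside) and V (center) embedding matrices; the max-shift is a standard stability trick:

```python
import numpy as np

def p_outside_given_center(o, c, U, V):
    """p(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c) over the vocabulary."""
    scores = U @ V[c]          # dot of every "outside" vector with v_c
    scores -= scores.max()     # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

W, d = 1000, 100               # hypothetical vocabulary and vector sizes
rng = np.random.default_rng(0)
U, V = rng.normal(size=(W, d)), rng.normal(size=(W, d))
print(p_outside_given_center(o=3, c=7, U=U, V=V))
```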
Learning
— Start with some (e.g., random) vector for each word
— Iteratively make the embedding of each word
  — more like the embeddings of its neighbors, and
  — less like the embeddings of other words
Visualizing W and C as a network for doing error backprop
[Diagram: Input layer, the 1-hot input vector for wt (1 × |V|); Projection layer, the embedding for wt (1 × d) obtained via W (|V| × d); Output layer, the probabilities of context words such as wt+1 (1 × |V|) obtained via C (d × |V|)]
Problem with the softmax
— The denominator of p(wk | wj) = exp(ck · vj) / Σ_{i∈|V|} exp(ci · vj) sums over the entire vocabulary: far too expensive to compute for every training pair.
Goal in learning (skip-gram with negative sampling)
— For a center word w and its true context words, e.g.,
  lemon, a [tablespoon of apricot preserves or] jam   (c1 c2 w c3 c4)
  we want σ(c1·w) + σ(c2·w) + σ(c3·w) + σ(c4·w) to be high, where σ(x) = 1 / (1 + e⁻ˣ).
— In addition, for k randomly sampled "noise" words, e.g.,
  [cement metaphysical dear coaxial apricot attendant whence forever puddle]   (n1 n2 n3 n4 n5 n6 n7 n8)
  we want σ(n1·w) + σ(n2·w) + … + σ(n8·w) to be low.
Skip-gram with negative sampling: loss function
— The learning objective for one word/context pair (w, c):
  log σ(c · w) + Σ_{i=1}^k E_{wi∼p(w)} [ log σ(−wi · w) ]
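A NumPy sketch of this objective for one (word, context) pair; the dimensions and random vectors are hypothetical, and the expectation is approximated by the k sampled negatives, as it would be in SGD:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w, ctx, negs):
    """log sigma(c.w) + sum_i log sigma(-n_i.w); maximized during training."""
    pos = np.log(sigmoid(ctx @ w))
    neg = np.sum(np.log(sigmoid(-(negs @ w))))
    return pos + neg

d, k = 50, 8                      # hypothetical: 50-d vectors, 8 negatives
rng = np.random.default_rng(0)
w    = rng.normal(size=d)         # center word vector
ctx  = rng.normal(size=d)         # one true context word vector
negs = rng.normal(size=(k, d))    # k sampled "noise" word vectors
print(sgns_objective(w, ctx, negs))
```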
Stochastic gradients with word vectors!
— In each window we only have at most 2m + 1 words plus the sampled negatives, so the gradient of J(θ) is very sparse!
— We may as well only update the word vectors that actually appear: either keep around a hash for the word vectors, or only update the relevant columns of the full embedding matrices U and V (each d × |V|).
— This matters if you have millions of word vectors and do distributed computing, to not have to send gigantic updates around.
Richard Socher
Embeddings capture semantics!
— Nearest neighbors of "frog": 1. frogs 2. toad 3. litoria 4. leptodactylidae 5. rana 6. lizard 7. eleutherodactylus
  ("litoria", "leptodactylidae", "rana", and "eleutherodactylus" are all frog genera)
GloVe: Global Vectors for Word Representation [Pennington et al.'14]
Embeddings capture relational meaning!
vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’) vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
Demo time
http://projector.tensorflow.org
Next Lecture: Training Deep Neural Networks