

slide-1
SLIDE 1

Lecture #03 – Multi-layer Perceptrons

Aykut Erdem // Hacettepe University // Spring 2020

CMP784

DEEP LEARNING

Image: Jose-Luis Olivares

slide-2
SLIDE 2

Breaking news!

  • Practical 1 is out!

— Learning neural word embeddings
— Due Friday, Mar. 26, 23:59:59

  • Paper presentations and quizzes will start

in two weeks!

− Choose your papers and your roles

2
slide-3
SLIDE 3

Previously on CMP784

  • Learning problem
  • Parametric vs. non-parametric models
  • Nearest-neighbor classifier
  • Linear classification
  • Linear regression
  • Capacity
  • Hyperparameter
  • Underfitting
  • Overfitting
  • Variance-Bias tradeoff
  • Model selection
  • Cross-validation
3

Puppy or bagel? // Karen Zack

slide-4
SLIDE 4

Lecture overview

  • the perceptron
  • the multi-layer perceptron
  • stochastic gradient descent
  • backpropagation
  • shallow yet very powerful: word2vec
  • Disclaimer: Much of the material and slides for this lecture were borrowed from

— Hugo Larochelle's Neural networks slides
— Nick Locascio's MIT 6.S191 slides
— Efstratios Gavves and Max Welling's UvA deep learning class
— Leonid Sigal's CPSC532L class
— Richard Socher's CS224d class
— Dan Jurafsky's CS124 class

4
slide-5
SLIDE 5

A Brief History of Neural Networks

5 Image: VUNI Inc.

[Figure: timeline of the history of neural networks, ending at today]

slide-6
SLIDE 6 6

The Perceptron

slide-7
SLIDE 7

The Perceptron

7

[Diagram: perceptron — inputs x₀…xₙ (plus constant 1), weights w₀…wₙ, bias b, weighted sum Σ, non-linearity]

slide-8
SLIDE 8

Perceptron Forward Pass

  • Neuron pre-activation

(or input activation)

  • Neuron output activation:

where

w are the weights (parameters) b is the bias term g(·) is called the activation function

8
a(x) = b + Σᵢ wᵢxᵢ = b + wᵀx

h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

[Diagram: perceptron — inputs x₀…xₙ, weights w₀…wₙ, bias b, weighted sum Σ, non-linearity]

slide-9
SLIDE 9

Output Activation of The Neuron

9

h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

The bias b only changes the position of the ridge; the range is determined by g(·).

Image credit: Pascal Vincent

[Diagram: perceptron — inputs, weights, bias, sum, non-linearity]

slide-10
SLIDE 10

Linear Activation Function

10
h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

g(a) = a

  • No nonlinear transformation
  • No input squashing

[Diagram: perceptron with linear activation]
slide-11
SLIDE 11

Sigmoid Activation Function

11

[Diagram: perceptron with sigmoid non-linearity]

h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

g(a) = sigm(a) = 1 / (1 + exp(−a))

  • Squashes the neuron's output between 0 and 1
  • Always positive
  • Bounded
  • Strictly increasing

slide-12
SLIDE 12

Perceptron Forward Pass

12

Inputs: 2, 3, −1, 5 (and constant 1 for the bias)
Weights: 0.1, 0.5, 2.5, 0.2; bias: 3.0

h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

[Diagram: perceptron — inputs, weights, bias, sum, non-linearity]

slide-13
SLIDE 13

Perceptron Forward Pass

13

Inputs: 2, 3, −1, 5 (and constant 1 for the bias)
Weights: 0.1, 0.5, 2.5, 0.2; bias: 3.0

h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

a(x) = (2 × 0.1) + (3 × 0.5) + (−1 × 2.5) + (5 × 0.2) + (1 × 3.0) = 3.2

slide-14
SLIDE 14

Perceptron Forward Pass

14

Inputs: 2, 3, −1, 5 (and constant 1 for the bias)
Weights: 0.1, 0.5, 2.5, 0.2; bias: 3.0

h(x) = g(3.2) = σ(3.2) = 1 / (1 + e^(−3.2)) ≈ 0.96
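As a quick check, the forward pass above can be reproduced in a few lines of NumPy. This is a minimal sketch (function and variable names are ours, not from the slides), using the example's inputs, weights, and bias:

```python
import numpy as np

# Perceptron forward pass for the slide's example values.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([2.0, 3.0, -1.0, 5.0])   # inputs
w = np.array([0.1, 0.5, 2.5, 0.2])    # weights
b = 3.0                                # bias

a = b + w @ x        # pre-activation: 3.2
h = sigmoid(a)       # output activation: ~0.96
print(a, h)
```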

slide-15
SLIDE 15

Hyperbolic Tangent (tanh) Activation Function

15

[Diagram: perceptron with tanh non-linearity]

h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)

  • Squashes the neuron's output between −1 and 1
  • Can be positive or negative
  • Bounded
  • Strictly increasing

slide-16
SLIDE 16

Rectified Linear (ReLU) Activation Function

16

[Diagram: perceptron with ReLU non-linearity]

h(x) = g(a(x)) = g(b + Σᵢ wᵢxᵢ)

g(a) = reclin(a) = max(0, a)

  • Bounded below by 0 (always non-negative)
  • Not upper bounded
  • Strictly increasing
  • Tends to produce units with sparse activities
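For reference, the activation functions discussed on the last few slides can be written compactly in NumPy. A minimal sketch (names are ours):

```python
import numpy as np

# The four activation functions from the slides.
def linear(a):
    return a

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # output in (0, 1)

def tanh(a):
    return np.tanh(a)                   # output in (-1, 1)

def relu(a):
    return np.maximum(0.0, a)           # bounded below by 0, not above

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (linear, sigmoid, tanh, relu):
    print(g.__name__, g(a))
```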
slide-17
SLIDE 17

Decision Boundary of a Neuron

  • Could do binary classification:

— With sigmoid, one can interpret the neuron as estimating p(y = 1 | x)
— Also known as the logistic regression classifier
— If the activation is greater than 0.5, predict 1
— Otherwise predict 0

Same idea can be applied to a tanh activation.

17

Image credit: Pascal Vincent

The decision boundary is linear.

slide-18
SLIDE 18

Capacity of Single Neuron

  • Can solve linearly separable problems
18

[Figure: linear decision boundaries realizing OR(x₁, x₂) and AND(x₁, x₂)]

slide-19
SLIDE 19

Capacity of Single Neuron

  • Can not solve non-linearly separable problems
  • Need to transform the input into a better representation
  • Remember basis functions!
19

[Figure: XOR(x₁, x₂) is not linearly separable, unlike the AND(x₁, x₂)-style problems shown before]

slide-20
SLIDE 20

Perceptron Diagram Simplified

20

[Diagram: perceptron — inputs x₀…xₙ, weights w₀…wₙ, bias b, weighted sum Σ, non-linearity]

slide-21
SLIDE 21

Perceptron Diagram Simplified

21

[Simplified diagram: inputs x₀…xₙ → output]
slide-22
SLIDE 22

Multi-Output Perceptron

  • Remember multi-way classification

— We need multiple outputs (1 output per class)
— We need to estimate the conditional probability p(y = c | x)
— Discriminative learning

  • Softmax activation function at the output

— Strictly positive
— Sums to one

  • Predict class with the highest estimated class conditional probability.
22

[Simplified diagram: inputs x₀…xₙ → multiple outputs, one per class]

o(a) = softmax(a) = [ exp(a₁)/Σ_c exp(a_c), …, exp(a_C)/Σ_c exp(a_c) ]ᵀ
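A minimal sketch of the softmax output activation in NumPy (names are ours); the outputs are strictly positive and sum to one:

```python
import numpy as np

# Softmax output layer: class-conditional probabilities.
def softmax(a):
    e = np.exp(a - np.max(a))   # subtract max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 0.5])
p = softmax(a)
print(p, p.sum())               # strictly positive, sums to 1
print(np.argmax(p))             # predict class with highest probability
```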

slide-23
SLIDE 23 23

Multi-Layer Perceptron

slide-24
SLIDE 24

Single Hidden Layer Neural Network

  • Hidden layer pre-activation:
  • Hidden layer activation:
  • Output layer activation:
24

[Diagram: inputs x₀…xₙ → hidden layer h₁…hₙ → output layer]

a(x) = b⁽¹⁾ + W⁽¹⁾x,   with a(x)ᵢ = bᵢ⁽¹⁾ + Σⱼ W⁽¹⁾ᵢⱼ xⱼ

h(x) = g(a(x))

f(x) = o( b⁽²⁾ + w⁽²⁾ᵀ h⁽¹⁾(x) )
slide-25
SLIDE 25

Multi-Layer Perceptron (MLP)

  • Consider a network with L hidden

layers.

—layer pre-activation for k>0 —hidden layer activation from 1 to L: —output layer activation (k=L+1)

25

[Diagram: inputs → hidden layers → output layer]

a⁽ᵏ⁾(x) = b⁽ᵏ⁾ + W⁽ᵏ⁾ h⁽ᵏ⁻¹⁾(x)   (with h⁽⁰⁾(x) = x)

h⁽ᵏ⁾(x) = g(a⁽ᵏ⁾(x)),  k = 1, …, L

h⁽ᴸ⁺¹⁾(x) = o(a⁽ᴸ⁺¹⁾(x)) = f(x)
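The recursion above is easy to write down directly. A minimal NumPy sketch of the forward pass of an MLP with L hidden layers (shapes, names, and the chosen activations are ours, for illustration only):

```python
import numpy as np

# MLP forward pass: h^(0)=x, a^(k)=b^(k)+W^(k) h^(k-1), h^(k)=g(a^(k)), f(x)=o(a^(L+1)).
def forward(x, weights, biases, g, o):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = g(b + W @ h)                         # hidden layers 1..L
    return o(biases[-1] + weights[-1] @ h)       # output layer (k = L+1)

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                             # input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()
y = forward(rng.normal(size=4), weights, biases, np.tanh, softmax)
print(y, y.sum())
```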
slide-26
SLIDE 26

Deep Neural Network

26

[Diagram: deep neural network — inputs, several hidden layers, output layer]

slide-27
SLIDE 27

Capacity of Neural Networks

  • Consider a single layer neural network
27

Image credit: Pascal Vincent

[Figure: a single hidden layer network — inputs x₁, x₂; hidden units y₁, y₂ with weights wⱼᵢ; output z with weights wₖⱼ; bias units. Labels translated from French: entrée = input, cachée = hidden, sortie = output, biais = bias. (from Pascal Vincent's slides)]

slide-28
SLIDE 28

Capacity of Neural Networks

  • Consider a single layer neural network
28

Image credit: Pascal Vincent

[Figure: hidden unit activations y₁…y₄ and the output z₁ carving regions of the (x₁, x₂) input space (from Pascal Vincent's slides)]

slide-29
SLIDE 29

Capacity of Neural Networks

  • Consider a single layer neural network
29

Image credit: Pascal Vincent

slide-30
SLIDE 30

Universal Approximation

  • Universal Approximation Theorem (Hornik, 1991):

—“a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units’’

  • This applies for sigmoid, tanh and many other activation functions.
  • However, this does not mean that there is a learning algorithm that can find the necessary parameter values.

30
slide-31
SLIDE 31 31

Applying Neural Networks

slide-32
SLIDE 32

Example Problem: Will my flight be delayed?

32
  • Temperature: -20 F

Wind Speed: 45 mph

slide-33
SLIDE 33

Example Problem: Will my flight be delayed?

33
  • Predicted: 0.05

Input: [−20, 45]

[Diagram: two inputs x₀, x₁ → hidden units h₀, h₁, h₂ → predicted output]

slide-34
SLIDE 34

Example Problem: Will my flight be delayed?

34
  • Predicted: 0.05

Actual: 1

Input: [−20, 45]

[Diagram: two inputs x₀, x₁ → hidden units h₀, h₁, h₂ → predicted output]

slide-35
SLIDE 35

Quantifying Loss

35

Predicted: 0.05   Actual: 1

Input: [−20, 45]

[Diagram: network output compared against the actual label]

The loss ℓ(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾) measures the discrepancy between the predicted value f(x⁽ⁱ⁾; θ) and the actual label y⁽ⁱ⁾.

slide-36
SLIDE 36

Total Loss

36

Inputs: [ [−20, 45], [80, 0], [4, 15], [45, 60], … ]

[Diagram: network producing one prediction per input]

J(θ) = (1/N) Σᵢ ℓ(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)

Predicted: [ 0.05, 0.02, 0.96, 0.35 ]   Actual: [ 1 1 1 ]

slide-37
SLIDE 37

Total Loss

37

Inputs: [ [−20, 45], [80, 0], [4, 15], [45, 60], … ]

[Diagram: network producing one prediction per input]

J(θ) = (1/N) Σᵢ ℓ(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)

Predicted: [ 0.05, 0.02, 0.96, 0.35 ]   Actual: [ 1 1 1 ]

slide-38
SLIDE 38

Binary Cross Entropy Loss

38

Inputs: [ [−20, 45], [80, 0], [4, 15], [45, 60], … ]

[Diagram: network producing one prediction per input]

Predicted: [ 0.05, 0.02, 0.96, 0.35 ]   Actual: [ 1 1 1 ]

J_cross-entropy(θ) = −(1/N) Σᵢ [ y⁽ⁱ⁾ log f(x⁽ⁱ⁾; θ) + (1 − y⁽ⁱ⁾) log(1 − f(x⁽ⁱ⁾; θ)) ]

  • For classification problems with a softmax output layer.
  • Maximize log-probability of the correct class given an input
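A minimal NumPy sketch of the binary cross-entropy loss above, using the slide's example predictions (we assume all actual labels are 1, as listed; the clipping constant is ours, to guard log(0)):

```python
import numpy as np

# Binary cross-entropy: -(1/N) sum_i [y log f + (1-y) log(1-f)].
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_pred = np.array([0.05, 0.02, 0.96, 0.35])
y_true = np.array([1.0, 1.0, 1.0, 1.0])     # assumed labels for illustration
print(binary_cross_entropy(y_true, y_pred))
```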
slide-39
SLIDE 39

Mean Squared Error Loss

39

Inputs: [ [−20, 45], [80, 0], [4, 15], [45, 60], … ]

[Diagram: network producing one prediction per input]

Predicted: [ 0.05, 0.02, 0.96, 0.35 ]   Actual: [ 1 1 1 ]

J_MSE(θ) = (1/N) Σᵢ ( f(x⁽ⁱ⁾; θ) − y⁽ⁱ⁾ )²

slide-40
SLIDE 40 40

Training Neural Networks

slide-41
SLIDE 41

Training

  • Learning is cast as optimization:

arg min_θ  (1/T) Σₜ ℓ(f(x⁽ᵗ⁾; θ), y⁽ᵗ⁾) + λ Ω(θ)
              (loss function)         (regularizer)

— For classification problems, we would like to minimize classification error
— The loss function can sometimes be viewed as a surrogate for what we actually want to optimize (e.g., an upper bound)

41
slide-42
SLIDE 42

Loss is a function of the model’s parameters

42
slide-43
SLIDE 43

How to minimize loss?

43
  • Start at random point
slide-44
SLIDE 44

How to minimize loss?

44

Compute:

slide-45
SLIDE 45

How to minimize loss?

45

Move in the direction opposite of the gradient to a new point
slide-46
SLIDE 46

How to minimize loss?

46

Move in the direction opposite of the gradient to a new point
slide-47
SLIDE 47

How to minimize loss?

47

Repeat!

slide-48
SLIDE 48

This is called Stochastic Gradient Descent (SGD)

48

Repeat!

slide-49
SLIDE 49

Stochastic Gradient Descent (SGD)

  • Initialize θ randomly
  • For N Epochs
  • For each training example (x, y):
  • Compute Loss Gradient:
  • Update θ with update rule:
49
  • θₜ₊₁ ← θₜ − α ∇_θ ℒ
slide-50
SLIDE 50

Why is it Stochastic Gradient Descent?

  • Initialize θ randomly
  • For N Epochs
  • For each training example (x, y):
  • Compute Loss Gradient:
  • Update θ with update rule:
50
  • Only an estimate of

true gradient!

θₜ₊₁ ← θₜ − α ∇_θ ℒ

slide-51
SLIDE 51

θₜ₊₁ ← θₜ − α ∇_θ ℒ

  • Why is it Stochastic Gradient Descent?
  • Initialize θ randomly
  • For N Epochs
  • For each training batch {(x0, y0),…, (xB, yB)}:
  • Compute Loss Gradient:
  • Update θ with update rule:
51
  • More accurate

estimate!

  • Advantages:
  • More accurate estimation of gradient

⎯ Smoother convergence
⎯ Allows for larger learning rates

  • Minibatches lead to fast training!

⎯ Can parallelize computation + achieve significant speed increases on GPUs
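A minimal sketch of a minibatch SGD training loop (all names are ours; the least-squares loss in the usage example is only there to make the sketch runnable end to end):

```python
import numpy as np

# Minibatch SGD: theta_{t+1} = theta_t - lr * grad, one epoch = pass over all examples.
def sgd(theta, X, y, loss_and_grad, lr=0.1, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                     # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            _, grad = loss_and_grad(theta, X[batch], y[batch])
            theta = theta - lr * grad                # gradient step on the minibatch
    return theta

# Usage example: least-squares regression, grad = X^T (X theta - y) / m.
def loss_and_grad(theta, Xb, yb):
    err = Xb @ theta - yb
    return 0.5 * np.mean(err ** 2), Xb.T @ err / len(Xb)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)
print(sgd(np.zeros(3), X, y, loss_and_grad, lr=0.1, epochs=50))   # ~[1, -2, 0.5]
```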

slide-52
SLIDE 52

Stochastic Gradient Descent (SGD)

  • Algorithm that performs updates after each example
— initialize θ ≡ {W⁽¹⁾, b⁽¹⁾, …, W⁽ᴸ⁺¹⁾, b⁽ᴸ⁺¹⁾}
— for N iterations
  — for each training example (x⁽ᵗ⁾, y⁽ᵗ⁾) (or batch):
      Δ = −∇_θ ℓ(f(x⁽ᵗ⁾; θ), y⁽ᵗ⁾) − λ ∇_θ Ω(θ)
      θ ← θ + α Δ

Training epoch = iteration over all examples

  • To apply this algorithm to neural network training, we need:
— the loss function ℓ(f(x⁽ᵗ⁾; θ), y⁽ᵗ⁾)
— a procedure to compute the parameter gradients ∇_θ ℓ(f(x⁽ᵗ⁾; θ), y⁽ᵗ⁾)
— the regularizer Ω(θ) (and its gradient ∇_θ Ω(θ))

52
slide-53
SLIDE 53

Stochastic Gradient Descent (SGD)

  • Algorithm that performs updates after each example
— initialize θ ≡ {W⁽¹⁾, b⁽¹⁾, …, W⁽ᴸ⁺¹⁾, b⁽ᴸ⁺¹⁾}
— for N iterations
  — for each training example (x⁽ᵗ⁾, y⁽ᵗ⁾) (or batch):
      Δ = −∇_θ ℓ(f(x⁽ᵗ⁾; θ), y⁽ᵗ⁾) − λ ∇_θ Ω(θ)
      θ ← θ + α Δ

Training epoch = iteration over all examples

  • To apply this algorithm to neural network training, we need:
— the loss function ℓ(f(x⁽ᵗ⁾; θ), y⁽ᵗ⁾)
— a procedure to compute the parameter gradients ∇_θ ℓ(f(x⁽ᵗ⁾; θ), y⁽ᵗ⁾)
— the regularizer Ω(θ) (and its gradient ∇_θ Ω(θ))

53
slide-54
SLIDE 54

What is a neural network again?

  • A family of parametric, non-linear and hierarchical representation learning

functions

  • ⎯ x: input, θl: parameters for layer l, al = hl(x, θl): (non-)linear function
  • Given training corpus {X, Y} find optimal parameters
54

a_L(x; θ₁,…,L) = h_L( h_{L−1}( … h₁(x, θ₁) …, θ_{L−1} ), θ_L )

θ* ← arg min_θ  Σ_{(x,y)∈(X,Y)} ℓ( y, a_L(x; θ₁,…,L) )

slide-55
SLIDE 55

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
55

[Diagram: Input → h₁(xᵢ; θ) → h₂(xᵢ; θ) → h₃(xᵢ; θ) → h₄(xᵢ; θ) → h₅(xᵢ; θ) → Loss]

Forward connections (feedforward architecture)

slide-56
SLIDE 56

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
56

[Diagram: Input → modules h₁…h₅ → Loss, with additional skip connections between non-adjacent modules]

Interweaved connections (Directed Acyclic Graph architecture – DAGNN)

slide-57
SLIDE 57

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
57

[Diagram: Input → modules h₁…h₅ → Loss, with feedback edges between modules]

Loopy connections (recurrent architecture, special care needed)

slide-58
SLIDE 58

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex
58

Functions → Modules

[Diagram: the feedforward, DAG, and recurrent architectures above, with each function hₗ(xᵢ; θ) drawn as a module]

slide-59
SLIDE 59

What is a module

  • A module is a building block for our network
  • Each module is an object/function a = h(x; θ) that

⎯ contains trainable parameters θ
⎯ receives as an argument an input x
⎯ and returns an output a based on the activation function h(…)

  • The activation function should be (at least) first-order differentiable (almost) everywhere

  • For easier/more efficient backpropagation → store the module input

⎯ easy to get the module output fast
⎯ easy to compute derivatives

59

[Diagram: a DAG of modules h₁…h₅ feeding the Loss]
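A minimal sketch of modules as described above: each one stores its input (or output) during the forward pass so that the backward pass is cheap. Class and variable names are ours, not the course's reference implementation:

```python
import numpy as np

class Sigmoid:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))   # store output for backward
        return self.out

    def backward(self, grad_out):
        # d sigmoid(x)/dx = sigmoid(x) * (1 - sigmoid(x))
        return grad_out * self.out * (1.0 - self.out)

class Linear:
    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                              # store input for backward
        return self.W @ x + self.b

    def backward(self, grad_out):
        self.dW = np.outer(grad_out, self.x)    # gradient w.r.t. parameters
        self.db = grad_out
        return self.W.T @ grad_out              # gradient w.r.t. the input

layers = [Linear(3, 4), Sigmoid(), Linear(4, 1)]
x = np.array([0.2, -1.0, 0.5])
for layer in layers:                            # forward: visit modules in order
    x = layer.forward(x)
grad = np.ones(1)
for layer in reversed(layers):                  # backward: traverse reversed connections
    grad = layer.backward(grad)
print(x, grad)
```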

slide-60
SLIDE 60

Anything goes or do special constraints exist?

  • A neural network is a composition of modules (building blocks)
  • Any architecture works
  • If the architecture is a feedforward cascade, no special care is needed
  • If it is acyclic, there is a right order for computing the forward computations
  • If there are loops, these form recurrent connections (revisited later)

60
slide-61
SLIDE 61

What is a module

  • Simply compute the activation of each module in the network
  • We need to know the precise function behind each module hₘ(…)

  • Recursive operations
  • One module’s output is another’s input
  • Steps
  • Visit modules one by one starting from the data input
  • Some modules might have several inputs from multiple modules
  • Compute modules activations with the right order
  • Make sure all the inputs computed at the right time
61

[Diagram: a DAG of modules h₁…h₅ feeding the Loss]

a_l = h_l(x_l; θ_l), where one module's output is the next module's input: x_l = a_{l−1}
slide-62
SLIDE 62

What is a module

  • Simply compute the gradients of each module for our data
  • We need to know the gradient formulation of each module ∂hₘ(xₘ; θₘ) w.r.t. their inputs xₘ and parameters θₘ

  • We need the forward computations first
  • Their result is the sum of losses for our input data
  • Then take the reverse network (reverse connections)

and traverse it backwards

  • Instead of using the activation functions, we use

their gradients

  • The whole process can be described very neatly and concisely

with the backpropagation algorithm

62

[Diagram: the reversed module graph, traversed backwards starting from dLoss(Input)]

slide-63
SLIDE 63

Again, what is a neural network again?

  • A neural network is a family of parametric, non-linear and hierarchical representation learning functions:

⎯ x: input, θl: parameters for layer l, al = hl(x, θl): (non-)linear function

  • Given training corpus {X, Y} find optimal parameters
  • To use any gradient descent based optimization

we need the gradients

  • How to compute the gradients for such a complicated function enclosing other functions, like a_L(…)?

63

a_L(x; θ₁,…,L) = h_L( h_{L−1}( … h₁(x, θ₁) …, θ_{L−1} ), θ_L )

θ* ← arg min_θ  Σ_{(x,y)∈(X,Y)} ℓ( y, a_L(x; θ₁,…,L) )

θₜ₊₁ = θₜ − ηₜ ∂ℒ/∂θₜ

∂ℒ/∂θ_l,  l = 1, …, L

slide-64
SLIDE 64

Again, what is a neural network again?

  • A neural network is a family of parametric, non-linear and hierarchical representation learning functions:

⎯ x: input, θl: parameters for layer l, al = hl(x, θl): (non-)linear function

  • Given training corpus {X, Y} find optimal parameters
  • To use any gradient descent based optimization we need the gradients
  • How to compute the gradients for such a complicated function enclosing other functions, like a_L(…)?

64

a_L(x; θ₁,…,L) = h_L( h_{L−1}( … h₁(x, θ₁) …, θ_{L−1} ), θ_L )

θ* ← arg min_θ  Σ_{(x,y)∈(X,Y)} ℓ( y, a_L(x; θ₁,…,L) )

θₜ₊₁ = θₜ − ηₜ ∂ℒ/∂θₜ

∂ℒ/∂θ_l,  l = 1, …, L

slide-65
SLIDE 65

How do we compute gradients?

  • Numerical Differentiation
  • Symbolic Differentiation
  • Automatic Differentiation (AutoDiff)
65
slide-66
SLIDE 66

Numerical Differentiation

  • We can approximate the gradient numerically, using:
66 slide adopted from T. Chen, H. Shen, A. Krishnamurthy

∂f(x)/∂xᵢ = lim_{h→0} [ f(x + h·1ᵢ) − f(x) ] / h  ≈  [ f(x + h·1ᵢ) − f(x) ] / h  for small h

1ᵢ — vector of all zeros, except for one 1 in the i-th location

slide-67
SLIDE 67

Numerical Differentiation

  • We can approximate the gradient numerically, using:
  • Even better, we can use central differencing:
67 slide adopted from T. Chen, H. Shen, A. Krishnamurthy

∂f(x)/∂xᵢ ≈ [ f(x + h·1ᵢ) − f(x) ] / h

∂f(x)/∂xᵢ ≈ [ f(x + h·1ᵢ) − f(x − h·1ᵢ) ] / 2h   (central differencing)

1ᵢ — vector of all zeros, except for one 1 in the i-th location

slide-68
SLIDE 68

Numerical Differentiation

  • We can approximate the gradient numerically, using:
  • Even better, we can use central differencing:
  • However, both of these suffer from rounding errors and are not good enough

for learning (they are very good tools for checking the correctness of implementation though, e.g., use h = 0.000001).

68 slide adopted from T. Chen, H. Shen, A. Krishnamurthy

∂f(x)/∂xᵢ ≈ [ f(x + h·1ᵢ) − f(x) ] / h

∂f(x)/∂xᵢ ≈ [ f(x + h·1ᵢ) − f(x − h·1ᵢ) ] / 2h   (central differencing)

1ᵢ — vector of all zeros, except for one 1 in the i-th location
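A minimal sketch of gradient checking with central differences, as suggested above (names are ours); the numerical gradient should match the analytic one to roughly the step size:

```python
import numpy as np

# Central-difference gradient: (f(x + h*1_i) - f(x - h*1_i)) / (2h) for each i.
def numerical_grad(f, x, h=1e-6):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                 # the 1_i vector scaled by h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: np.sum(x ** 3)                     # analytic gradient: 3 x^2
x = np.array([1.0, -2.0, 0.5])
print(numerical_grad(f, x))
print(3 * x ** 2)                                # should agree to ~1e-6
```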

slide-69
SLIDE 69

Numerical Differentiation

  • We can approximate the gradient numerically, using:
  • Even better, we can use central differencing:
  • However, both of these suffer from rounding errors and are not good enough

for learning (they are very good tools for checking the correctness of implementation though, e.g., use h = 0.000001).

69 slide adopted from T. Chen, H. Shen, A. Krishnamurthy

1ᵢ — vector of all zeros, except for one 1 in the i-th location; 1ᵢⱼ — matrix of all zeros, except for one 1 in the (i,j)-th location

∂L(W, b)/∂wᵢⱼ ≈ [ L(W + h·1ᵢⱼ, b) − L(W, b) ] / h
∂L(W, b)/∂bⱼ ≈ [ L(W, b + h·1ⱼ) − L(W, b) ] / h

∂L(W, b)/∂wᵢⱼ ≈ [ L(W + h·1ᵢⱼ, b) − L(W − h·1ᵢⱼ, b) ] / 2h
∂L(W, b)/∂bⱼ ≈ [ L(W, b + h·1ⱼ) − L(W, b − h·1ⱼ) ] / 2h

slide-70
SLIDE 70

Symbolic Differentiation

  • The input function is represented as a computational graph (a symbolic tree)

  • Implements differentiation rules for composite functions:
70

Implements differentiation rules for composite functions:

  Sum rule:     d(f(x) + g(x))/dx = df(x)/dx + dg(x)/dx
  Product rule: d(f(x)·g(x))/dx = (df(x)/dx)·g(x) + f(x)·(dg(x)/dx)
  Chain rule:   d(f(g(x)))/dx = (df(g)/dg)·(dg(x)/dx)

[Figure: the expression below drawn as a symbolic tree / computational graph with nodes v₀…v₆]

slide adopted from T. Chen, H. Shen, A. Krishnamurthy

Problem: For complex functions, expressions can be exponentially large; it is also difficult to deal with piecewise functions (creates many symbolic cases).

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)
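For illustration, a computer algebra system can differentiate this expression symbolically. A minimal sketch with SymPy (assuming SymPy is available; the variable names are ours):

```python
import sympy as sp

# Symbolic differentiation of the running example y = ln(x1) + x1*x2 - sin(x2).
x1, x2 = sp.symbols('x1 x2')
y = sp.ln(x1) + x1 * x2 - sp.sin(x2)

print(sp.diff(y, x1))                          # x2 + 1/x1
print(sp.diff(y, x2))                          # x1 - cos(x2)
print(sp.diff(y, x1).subs({x1: 2, x2: 5}))     # 11/2 = 5.5 at (2, 5)
```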

slide-71
SLIDE 71

Automatic Differentiation (AutoDiff)

  • Intuition: Interleave symbolic differentiation and simplification
  • Key Idea: Apply symbolic differentiation at the elementary operation level, evaluate and keep intermediate results
71

The success of deep learning owes A LOT to the success of AutoDiff algorithms (and also to advances in parallel architectures, large datasets, …)

slide adopted from T. Chen, H. Shen, A. Krishnamurthy

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-72
SLIDE 72

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
72

[Figure: computational graph of the expression below, with nodes v₀…v₆ and output y]

slide adopted from T. Chen, H. Shen, A. Krishnamurthy

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-73
SLIDE 73

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
73

[Figure: computational graph with nodes v₀…v₆]

The computational graph is governed by these equations:
v₀ = x₁,  v₁ = x₂,  v₂ = ln(v₀),  v₃ = v₀·v₁,  v₄ = sin(v₁),  v₅ = v₂ + v₃,  v₆ = v₅ − v₄,  y = v₆

slide adopted from T. Chen, H. Shen, A. Krishnamurthy

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-74
SLIDE 74

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
74

[Figure: computational graph with nodes v₀…v₆]
v₀ = x₁,  v₁ = x₂,  v₂ = ln(v₀),  v₃ = v₀·v₁,  v₄ = sin(v₁),  v₅ = v₂ + v₃,  v₆ = v₅ − v₄,  y = v₆

slide adopted from T. Chen, H. Shen, A. Krishnamurthy

Let's see how we can evaluate a function using the computational graph (DNN inference):

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-75
SLIDE 75

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
75

[Figure: computational graph with nodes v₀…v₆]
Let's see how we can evaluate a function using the computational graph (DNN inference):
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace:
y = f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂)

slide-76
SLIDE 76

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
76

[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace so far: v₀ = 2

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-77
SLIDE 77

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
77

[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace so far: v₀ = 2, v₁ = 5

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-78
SLIDE 78

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
78

[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace so far: v₀ = 2, v₁ = 5, v₂ = ln(2) = 0.693

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-79
SLIDE 79

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
79

[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace so far: v₀ = 2, v₁ = 5, v₂ = ln(2) = 0.693, v₃ = 2 × 5 = 10

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-80
SLIDE 80

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
80

[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace so far: v₀ = 2, v₁ = 5, v₂ = ln(2) = 0.693, v₃ = 2 × 5 = 10, v₄ = sin(5) = −0.959

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-81
SLIDE 81

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
81

[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace so far: v₀ = 2, v₁ = 5, v₂ = ln(2) = 0.693, v₃ = 2 × 5 = 10, v₄ = sin(5) = −0.959, v₅ = 0.693 + 10 = 10.693

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-82
SLIDE 82

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
82

[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace so far: v₀ = 2, v₁ = 5, v₂ = ln(2) = 0.693, v₃ = 2 × 5 = 10, v₄ = sin(5) = −0.959, v₅ = 0.693 + 10 = 10.693, v₆ = 10.693 + 0.959 = 11.652

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

slide-83
SLIDE 83

Automatic Differentiation (AutoDiff)

  • Each node is an input, intermediate, or output variable
  • Computational graph (a DAG) with variable ordering from topological sort
83

y = f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂)
[Figure: computational graph with nodes v₀…v₆]
Evaluate f(2, 5) with v₀ = x₁, v₁ = x₂, v₂ = ln(v₀), v₃ = v₀·v₁, v₄ = sin(v₁), v₅ = v₂ + v₃, v₆ = v₅ − v₄, y = v₆

Forward evaluation trace: v₀ = 2, v₁ = 5, v₂ = ln(2) = 0.693, v₃ = 2 × 5 = 10, v₄ = sin(5) = −0.959, v₅ = 0.693 + 10 = 10.693, v₆ = 10.693 + 0.959 = 11.652, y = 11.652

slide-84
SLIDE 84

Automatic Differentiation (AutoDiff)

84

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

[Figure: computational graph with nodes v₀…v₆]

Forward evaluation trace for f(2, 5):
v₀ = x₁ = 2
v₁ = x₂ = 5
v₂ = ln(v₀) = ln(2) = 0.693
v₃ = v₀ · v₁ = 2 × 5 = 10
v₄ = sin(v₁) = sin(5) = −0.959
v₅ = v₂ + v₃ = 0.693 + 10 = 10.693
v₆ = v₅ − v₄ = 10.693 + 0.959 = 11.652
y = v₆ = 11.652
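A minimal Python sketch of this forward evaluation trace for f(2, 5) (variable names mirror the graph nodes):

```python
import math

# Forward evaluation of y = ln(x1) + x1*x2 - sin(x2) at (2, 5).
x1, x2 = 2.0, 5.0
v0 = x1                      # 2
v1 = x2                      # 5
v2 = math.log(v0)            # ln(2)  = 0.693
v3 = v0 * v1                 # 2 * 5  = 10
v4 = math.sin(v1)            # sin(5) = -0.959
v5 = v2 + v3                 # 10.693
v6 = v5 - v4                 # 11.652
y = v6
print(y)                     # ~11.652
```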

slide-85
SLIDE 85

Automatic Differentiation (AutoDiff)

85

y = f(x₁, x₂) = ln(x₁) + x₁x₂ − sin(x₂)

We now want  ∂f(x₁, x₂)/∂x₁ |_(x₁=2, x₂=5).

We will do this with forward mode first, by introducing a derivative of each variable node with respect to the input variable.

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

slide-86
SLIDE 86

Automatic Differentiation (AutoDiff)

86

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time (the complete trace appears on Slide 98).

slide-87
SLIDE 87

Automatic Differentiation (AutoDiff)

87

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time (the complete trace appears on Slide 98).

slide-88
SLIDE 88

Automatic Differentiation (AutoDiff)

88

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time (the complete trace appears on Slide 98).

slide-89
SLIDE 89

Automatic Differentiation (AutoDiff)

89

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time (the complete trace appears on Slide 98).

slide-90
SLIDE 90

Automatic Differentiation (AutoDiff)

90

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time (the complete trace appears on Slide 98).

slide-91
SLIDE 91

Automatic Differentiation (AutoDiff)

91

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time, applying the chain rule (the complete trace appears on Slide 98).

slide-92
SLIDE 92

Automatic Differentiation (AutoDiff)

92

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time, applying the chain rule (the complete trace appears on Slide 98).

slide-93
SLIDE 93

Automatic Differentiation (AutoDiff)

93

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time, applying the chain rule (the complete trace appears on Slide 98).

slide-94
SLIDE 94

Automatic Differentiation (AutoDiff)

94

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time, applying the chain and product rules (the complete trace appears on Slide 98).

slide-95
SLIDE 95

Automatic Differentiation (AutoDiff)

95

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time, applying the chain and product rules (the complete trace appears on Slide 98).

slide-96
SLIDE 96

Automatic Differentiation (AutoDiff)

96

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time, applying the chain and product rules (the complete trace appears on Slide 98).

slide-97
SLIDE 97

Automatic Differentiation (AutoDiff)

97

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace: built up one node at a time, applying the chain and product rules (the complete trace appears on Slide 98).

slide-98
SLIDE 98

Automatic Differentiation (AutoDiff)

98

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Forward derivative trace:
∂v₀/∂x₁ = 1
∂v₁/∂x₁ = 0
∂v₂/∂x₁ = (1/v₀) · ∂v₀/∂x₁ = 1/2 × 1 = 0.5   (chain rule)
∂v₃/∂x₁ = (∂v₀/∂x₁) · v₁ + v₀ · (∂v₁/∂x₁) = 1×5 + 2×0 = 5   (product rule)
∂v₄/∂x₁ = (∂v₁/∂x₁) · cos(v₁) = 0 × cos(5) = 0
∂v₅/∂x₁ = ∂v₂/∂x₁ + ∂v₃/∂x₁ = 0.5 + 5 = 5.5
∂v₆/∂x₁ = ∂v₅/∂x₁ − ∂v₄/∂x₁ = 5.5 − 0 = 5.5
∂y/∂x₁ = ∂v₆/∂x₁ = 5.5
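A minimal Python sketch of this forward-mode trace: each node carries its value together with its derivative with respect to x₁ (seeding dx₁ = 1, dx₂ = 0):

```python
import math

# Forward-mode AD for y = ln(x1) + x1*x2 - sin(x2) at (2, 5), w.r.t. x1.
x1, x2 = 2.0, 5.0
v0, dv0 = x1, 1.0                              # d v0 / d x1 = 1
v1, dv1 = x2, 0.0                              # d v1 / d x1 = 0
v2, dv2 = math.log(v0), dv0 / v0               # chain rule: 0.5
v3, dv3 = v0 * v1, dv0 * v1 + v0 * dv1         # product rule: 5
v4, dv4 = math.sin(v1), dv1 * math.cos(v1)     # 0
v5, dv5 = v2 + v3, dv2 + dv3                   # 5.5
v6, dv6 = v5 - v4, dv5 - dv4                   # 5.5
print(dv6)                                     # dy/dx1 at (2, 5) = 5.5
```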

slide-99
SLIDE 99

Automatic Differentiation (AutoDiff)

99

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

[Figure: computational graph with nodes v₀…v₆]

(Forward derivative trace as on the previous slide.)

We now have:  ∂f(x₁, x₂)/∂x₁ |_(x₁=2, x₂=5) = 5.5

slide-100
SLIDE 100

Automatic Differentiation (AutoDiff)

100

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We want ∂f(x₁, x₂)/∂x₁ at (x₁ = 2, x₂ = 5).

[Figure: computational graph with nodes v₀…v₆]

(Forward derivative trace as on Slide 98.)

We now have:  ∂f(x₁, x₂)/∂x₁ |_(x₁=2, x₂=5) = 5.5
Still need:   ∂f(x₁, x₂)/∂x₂ |_(x₁=2, x₂=5)
slide-101
SLIDE 101

AutoDiff: Forward Mode

  • Forward mode needs m forward passes to get a full Jacobian (all gradients of the output with respect to each input), where m is the number of inputs
101

y = f(x) : ℝᵐ → ℝⁿ

slide-102
SLIDE 102

AutoDiff: Forward Mode

  • Forward mode needs m forward passes to get a full Jacobian (all gradients of the output with respect to each input), where m is the number of inputs
102

y = f(x) : ℝᵐ → ℝⁿ

slide adopted from T. Chen, H. Shen, A. Krishnamurthy

Problem: A DNN typically has a large number of inputs (an image as input, plus all the weights and biases of the layers = millions of inputs!) and very few outputs (many DNNs have n = 1).

slide-103
SLIDE 103

AutoDiff: Forward Mode

  • Forwar

ard mode mode needs m forward passes to get a full Jacobian (all gradients of

  • utput with respect to each input), where m is the number of inputs
  • Automatic differentiation in reverse

se mode mode computes all gradients in n backwards passes (so for most DNNs in a single back pass — back ck pr propa

  • paga

gation ion)

103

y = f(x) : Rm → Rn

slide adopted from T. Chen, H. Shen, A. Krishnamurthy

Pr Probl

  • blem: DNN typically has large number of inputs:

image as an input, plus all the weights and biases of layers = millions of inputs! and very few outputs (many DNNs have n = 1) image as an input, plus all the weights and biases of layers = millions of inputs!

slide-104
SLIDE 104

AutoDiff: Reverse Mode

104

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆ and their adjoint nodes v̄₀…v̄₆]

Traverse the original graph in reverse topological order and, for each node in the original graph, introduce an adjoint node, which computes the derivative of the output with respect to the local node (using the chain rule): v̄ᵢ = ∂y/∂vᵢ, accumulated from the "local" derivatives ∂vⱼ/∂vᵢ of its children vⱼ.

slide-105
SLIDE 105

AutoDiff: Reverse Mode

105

Forward evaluation trace for f(2, 5): v₀ = 2, v₁ = 5, v₂ = 0.693, v₃ = 10, v₄ = −0.959, v₅ = 10.693, v₆ = 11.652, y = 11.652

[Figure: computational graph with nodes v₀…v₆]

Backwards derivative trace (adjoints v̄ᵢ = ∂y/∂vᵢ, with the "local" derivatives shown):
v̄₆ = ∂y/∂v₆ = 1
v̄₅ = v̄₆ · ∂v₆/∂v₅ = v̄₆ · 1 = 1
v̄₄ = v̄₆ · ∂v₆/∂v₄ = v̄₆ · (−1) = −1
v̄₃ = v̄₅ · ∂v₅/∂v₃ = v̄₅ · 1 = 1
v̄₂ = v̄₅ · ∂v₅/∂v₂ = v̄₅ · 1 = 1
v̄₁ = v̄₃ · ∂v₃/∂v₁ + v̄₄ · ∂v₄/∂v₁ = v̄₃ · v₀ + v̄₄ · cos(v₁) = 1.716
v̄₀ = v̄₃ · ∂v₃/∂v₀ + v̄₂ · ∂v₂/∂v₀ = v̄₃ · v₁ + v̄₂ · (1/v₀) = 5.5
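A minimal Python sketch of this reverse-mode trace: one backward pass produces both input gradients (variable names mirror the adjoint nodes above):

```python
import math

# Reverse-mode AD for y = ln(x1) + x1*x2 - sin(x2) at (2, 5).
x1, x2 = 2.0, 5.0
v0, v1 = x1, x2
v2, v3, v4 = math.log(v0), v0 * v1, math.sin(v1)
v5 = v2 + v3
v6 = v5 - v4

v6_bar = 1.0                                     # dy/dv6
v5_bar = v6_bar * 1.0                            # dv6/dv5 = 1
v4_bar = v6_bar * (-1.0)                         # dv6/dv4 = -1
v3_bar = v5_bar * 1.0                            # dv5/dv3 = 1
v2_bar = v5_bar * 1.0                            # dv5/dv2 = 1
v1_bar = v3_bar * v0 + v4_bar * math.cos(v1)     # 1.716
v0_bar = v3_bar * v1 + v2_bar * (1.0 / v0)       # 5.5
print(v0_bar, v1_bar)                            # dy/dx1 = 5.5, dy/dx2 = 1.716
```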

slide-106
SLIDE 106

AutoDiff: Reverse Mode

106

(Build slide: repeats the forward evaluation and backwards derivative traces from Slide 105.)

slide-107
SLIDE 107

AutoDiff: Reverse Mode

107

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-108
SLIDE 108

AutoDiff: Reverse Mode

108

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-109
SLIDE 109

AutoDiff: Reverse Mode

109

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-110
SLIDE 110

AutoDiff: Reverse Mode

110

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-111
SLIDE 111

AutoDiff: Reverse Mode

111

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-112
SLIDE 112

AutoDiff: Reverse Mode

112

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-113
SLIDE 113

AutoDiff: Reverse Mode

113

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-114
SLIDE 114

AutoDiff: Reverse Mode

114

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-115
SLIDE 115

AutoDiff: Reverse Mode

115

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-116
SLIDE 116

AutoDiff: Reverse Mode

116

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-117
SLIDE 117

AutoDiff: Reverse Mode

117

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-118
SLIDE 118

AutoDiff: Reverse Mode

118

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-119
SLIDE 119

AutoDiff: Reverse Mode

119

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-120
SLIDE 120

AutoDiff: Reverse Mode

120

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-121
SLIDE 121

AutoDiff: Reverse Mode

121

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-122
SLIDE 122

AutoDiff: Reverse Mode

122

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-123
SLIDE 123

AutoDiff: Reverse Mode

123

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-124
SLIDE 124

AutoDiff: Reverse Mode

124

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-125
SLIDE 125

AutoDiff: Reverse Mode

125

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-126
SLIDE 126

AutoDiff: Reverse Mode

126

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-127
SLIDE 127

AutoDiff: Reverse Mode

127

(Build slide: repeats the traces from Slide 105, filling in the next value of the backwards derivative trace.)

slide-128
SLIDE 128

AutoDiff: Reverse Mode

128

v0 = x1    v1 = x2    v2 = ln(v0)    v3 = v0 · v1    v4 = sin(v1)    v5 = v2 + v3    v6 = v5 − v4    y = v6        evaluated at f(2, 5)

Forward Evaluation Trace and Backwards Derivative formulas as on slide 123.

Newly computed adjoint:
v̄1 = v̄3 · v0 + v̄4 · cos(v1) = 1 · 2 + (−1) · cos(5) = 1.716

Values computed so far: v̄6 = 1,  v̄5 = 1,  v̄4 = −1,  v̄3 = 1,  v̄2 = 1,  v̄1 = 1.716

slide-129
SLIDE 129

AutoDiff: Reverse Mode

129

v0 = x1    v1 = x2    v2 = ln(v0)    v3 = v0 · v1    v4 = sin(v1)    v5 = v2 + v3    v6 = v5 − v4    y = v6        evaluated at f(2, 5)

Forward Evaluation Trace and Backwards Derivative formulas as on slide 123.

Newly computed adjoint:
v̄0 = v̄3 · v1 + v̄2 · (1/v0) = 1 · 5 + 1 · 0.5 = 5.5

Values computed so far: v̄6 = 1,  v̄5 = 1,  v̄4 = −1,  v̄3 = 1,  v̄2 = 1,  v̄1 = 1.716,  v̄0 = 5.5

slide-130
SLIDE 130

AutoDiff: Reverse Mode

130

v0 = x1    v1 = x2    v2 = ln(v0)    v3 = v0 · v1    v4 = sin(v1)    v5 = v2 + v3    v6 = v5 − v4    y = v6        evaluated at f(2, 5)

Forward Evaluation Trace:
v0 = 2,  v1 = 5,  v2 = ln(2) = 0.693,  v3 = 2 × 5 = 10,  v4 = sin(5) = −0.959,  v5 = 0.693 + 10 = 10.693,  v6 = 10.693 + 0.959 = 11.652,  y = 11.652

Backwards Derivative Trace (complete):
v̄6 = 1,  v̄5 = 1,  v̄4 = −1,  v̄3 = 1,  v̄2 = 1,  v̄1 = 1.716,  v̄0 = 5.5

So ∂y/∂x1 = v̄0 = 5.5 and ∂y/∂x2 = v̄1 = 1.716.
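As a sanity check, here is a minimal Python sketch (not from the slides) that carries out the same forward evaluation trace and backwards derivative trace by hand; the values it prints should match the ones above.

import math

# f(x1, x2) = ln(x1) + x1*x2 - sin(x2), evaluated at (2, 5)

def forward(x1, x2):
    # Forward evaluation trace
    v0, v1 = x1, x2
    v2 = math.log(v0)
    v3 = v0 * v1
    v4 = math.sin(v1)
    v5 = v2 + v3
    v6 = v5 - v4
    return v6, (v0, v1)

def reverse(v0, v1):
    # Backwards derivative trace: propagate adjoints from y back to the inputs
    v6_bar = 1.0                                  # dy/dv6
    v5_bar = v6_bar * 1.0                         # v6 = v5 - v4
    v4_bar = v6_bar * (-1.0)
    v3_bar = v5_bar * 1.0                         # v5 = v2 + v3
    v2_bar = v5_bar * 1.0
    v1_bar = v3_bar * v0 + v4_bar * math.cos(v1)  # v3 = v0*v1, v4 = sin(v1)
    v0_bar = v3_bar * v1 + v2_bar * (1.0 / v0)    # v2 = ln(v0)
    return v0_bar, v1_bar

y, (v0, v1) = forward(2.0, 5.0)
print(y)                # ~11.652
print(reverse(v0, v1))  # ~(5.5, 1.716)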

slide-131
SLIDE 131

  • AutoDiff can be done at various granularities

131

Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

+ sin +

[Diagram: the same function at two granularities. Elementary function granularity: the full computational graph (ln, ×, sin, +, −) over v0…v6. Complex function granularity: a single node computing f(x1, x2) directly.]

slide-132
SLIDE 132

Backpropagation: Practical Issues

132

[Diagram: a feed-forward network with an Input Layer x1…x5, a 1st Hidden Layer with parameters (Wh1, bh1), a 2nd Hidden Layer with parameters (Wh2, bh2), and an Output Layer y1, y2 with parameters (Wo, bo).]

Easier to deal with in vector form
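To make the "vector form" concrete, here is a rough NumPy sketch of one forward and backward pass through such a network. The parameter names (Wh1, bh1, Wh2, bh2, Wo, bo) follow the slide; the layer sizes, sigmoid activations, and squared-error loss are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 1))              # inputs x1..x5 as a column vector
t = rng.normal(size=(2, 1))              # targets for the outputs y1, y2

Wh1, bh1 = 0.1 * rng.normal(size=(4, 5)), np.zeros((4, 1))
Wh2, bh2 = 0.1 * rng.normal(size=(3, 4)), np.zeros((3, 1))
Wo,  bo  = 0.1 * rng.normal(size=(2, 3)), np.zeros((2, 1))

# Forward pass: one matrix-vector product per layer
h1 = sigmoid(Wh1 @ x + bh1)
h2 = sigmoid(Wh2 @ h1 + bh2)
y  = Wo @ h2 + bo
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: multiply the incoming gradient by each layer's local Jacobian
dy = y - t                               # dL/dy
dWo, dbo = dy @ h2.T, dy
dh2 = Wo.T @ dy
da2 = dh2 * h2 * (1 - h2)                # through the element-wise sigmoid
dWh2, dbh2 = da2 @ h1.T, da2
dh1 = Wh2.T @ da2
da1 = dh1 * h1 * (1 - h1)
dWh1, dbh1 = da1 @ x.T, da1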

slide-133
SLIDE 133

Backpropagation: Practical Issues

133
slide-134
SLIDE 134

Backpropagation: Practical Issues

134

"local cal" Jacobians (matrix of partial derivatives, e.g. |x| x |y| "backp ackprop" Gradient

slide-135
SLIDE 135

Jacobian of Sigmoid layer

  • Element-wise sigmoid layer:
135

x, y ∈ R^2048,  y = sigmoid(x) applied element-wise

slide-136
SLIDE 136

Jacobian of Sigmoid layer

  • Element-wise sigmoid layer:
136

x, y ∈ R^2048,  y = sigmoid(x) applied element-wise

− What is the dimension of the Jacobian?

slide-137
SLIDE 137

Jacobian of Sigmoid layer

  • Element-wise sigmoid layer:
137

x, y ∈ R^2048,  y = sigmoid(x) applied element-wise

− What is the dimension of the Jacobian?
− What does it look like?

slide-138
SLIDE 138

Jacobian of Sigmoid layer

  • Element-wise sigmoid layer:
138

x, y ∈ R^2048,  y = sigmoid(x) applied element-wise

− What is the dimension of the Jacobian?
− What does it look like?
− If we are working with a mini-batch of 100 input-output pairs, the Jacobian is a 204,800 × 204,800 matrix!
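The answer implied here is that the Jacobian of an element-wise sigmoid is diagonal (entry (i, i) is σ(xi)(1 − σ(xi)), everything else 0), so in practice backprop never materializes it. A small NumPy sketch of the corresponding vector-Jacobian product:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.random.randn(2048)
y = sigmoid(x)

dL_dy = np.random.randn(2048)      # gradient arriving from the layer above
dL_dx = dL_dy * y * (1 - y)        # vector-Jacobian product: O(n), no 2048 x 2048 matrix

# Only for illustration: the explicit (diagonal) Jacobian would be np.diag(y * (1 - y)).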

slide-139
SLIDE 139

Backpropagation: Common questions

  • Question: Does BackProp only work for certain layers?
    Answer: No, it works for any differentiable function.

  • Question: What is the computational cost of BackProp?
    Answer: On average, about twice the forward pass.

  • Question: Is BackProp a dual of forward propagation?
    Answer: Yes.

139 slide adopted from Marc’Aurelio Ranzato
slide-140
SLIDE 140

Backpropagation: Common questions

  • Question: Does BackProp only work for certain layers?
    Answer: No, it works for any differentiable function.

  • Question: What is the computational cost of BackProp?
    Answer: On average, about twice the forward pass.

  • Question: Is BackProp a dual of forward propagation?
    Answer: Yes.

140 slide adopted from Marc’Aurelio Ranzato
[Diagram: forward propagation and backpropagation are duals. A Sum node in FProp becomes a Copy in BackProp, and a Copy in FProp becomes a Sum in BackProp.]

slide-141
SLIDE 141

Demo time

http://playground.tensorflow.org

141
slide-142
SLIDE 142 142

Shallow yet very powerful: word2vec

slide-143
SLIDE 143

From symbolic to distributed word representations

  • The vast majority of rule-based or statistical NLP and IR work regarded words

as atomic symbols: hotel, conference, walk

  • In vector space terms, this is a vector with one 1 and a lot of zeroes
  • We now call this a one-hot representation.
143

0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

“hotel”

slide-144
SLIDE 144

From symbolic to distributed word representations

  • The size of a word vector is equal to the number of words in the dictionary
  • Vector size therefore grows with the dictionary:

20K (speech) – 50K (Penn Treebank) – 500K (a large dictionary) – 13M (Google 1T)

  • One-hot vectors are orthogonal
  • There is no natural notion of similarity in a set of one-hot vectors
144

“hotel” = [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
“motel” = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

hotelᵀ · motel = 0

slide-145
SLIDE 145

Distributional similarity-based representations

  • You can get a lot of value by representing a word

by means of its neighbors

  • “You shall know a word by the company it keeps”

(J. R. Firth 1957:11)

  • One of the most successful ideas of modern NLP
145

…government debt problems turning into banking crises as has happened in…
…saying that Europe needs unified banking regulation to replace the hodgepodge…

These words will represent “banking”

slide-146
SLIDE 146

Distributional hypothesis

  • The meaning of a word is (can be approximated by, derived from) the

set of contexts in which it occurs in texts

He filled the wampimuk, passed it around and we all drunk some We found a little, hairy wampimuk sleeping behind the tree

146 Slide credit: Marco Baroni

Testing the distributional hypothesis: The influence of context on judgements of semantic similarity [McDonald & Ramscar’01]

slide-147
SLIDE 147

Distributional semantics

147

he curtains open and the moon shining in on the barely ars and the cold , close moon " . And neither of the w rough the night with the moon shining so brightly , it made in the light of the moon . It all boils down , wr surely under a crescent moon , thrilled by ice-white sun , the seasons of the moon ? Home , alone , Jay pla m is dazzling snow , the moon has risen full and cold un and the temple of the moon , driving out of the hug in the dark and now the moon rises , full and amber a bird on the shape of the moon over the trees in front But I could n’t see the moon or the stars , only the rning , with a sliver of moon hanging among the stars they love the sun , the moon and the stars . None of the light of an enormous moon . The plash of flowing w man ’s first step on the moon ; various exhibits , aer the inevitable piece of moon rock . Housing The Airsh

…ud obscured part of the moon . The Allied guns behind

Slide credit: Marco Baroni
A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge [Landauer and Dumais’97]
From frequency to meaning: Vector space models of semantics [Turney and Pantel’10]
slide-148
SLIDE 148

Window-based co-occurrence matrix

  • Example corpus:
  • I like deep learning.
  • I like NLP.
  • I enjoy flying.
  • Increase in size with vocabulary
  • Very high dimensional: require a lot of storage
  • Subsequent classification models have sparsity issues
  • Models are less robust
148

Slide credit: Richard Socher

counts     I   like  enjoy  deep  learning  NLP  flying  .
I          0    2     1      0      0        0     0     0
like       2    0     0      1      0        1     0     0
enjoy      1    0     0      0      0        0     1     0
deep       0    1     0      0      1        0     0     0
learning   0    0     0      1      0        0     0     1
NLP        0    1     0      0      0        0     0     1
flying     0    0     1      0      0        0     0     1
.          0    0     0      0      1        1     1     0
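A minimal Python sketch (illustrative, window size 1) of how such a co-occurrence table can be collected from the example corpus:

from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1                          # count neighbours within one position

counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1

print(counts[("I", "like")])        # 2
print(counts[("like", "deep")])     # 1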

slide-149
SLIDE 149

Three methods for getting short dense vectors

  • Singular Value Decomposition of cooccurrence

matrix X

  • A special case of this is called LSA – Latent Semantic

Analysis

  • Neural Language Model-inspired predictive models
  • skip-grams and CBOW
  • Brown clustering
149

[Excerpt shown on slide: Landauer & Dumais (1997), Appendix, “An Introduction to Singular Value Decomposition and an LSA Example”, including Figure A1, a schematic of the SVD of a rectangular word (w) × context (c) matrix X into W, a diagonal S, and C.]

[Diagram: skip-gram model predicting the context words wt−2, wt−1, wt+1, wt+2 from the center word wt.]

slide-151
SLIDE 151

Prediction-based models: An alternative way to get dense vectors

  • Skip-gram (Mikolov et al. 2013a), CBOW (Mikolov et al. 2013b)
  • Learn embeddings as part of the process of word prediction.
  • Train a neural network to predict neighboring words
  • Inspired by neural net language models.
  • In so doing, learn dense embeddings for the words in the training corpus.
  • Advantages:
  • Fast, easy to train (much faster than SVD)
  • Available online in the word2vec package
  • Including sets of pretrained embeddings!
151
slide-152
SLIDE 152

Basic idea of learning neural network word embeddings

  • We define some model that aims to predict a word based on other

words in its context which has a loss function, e.g.,

  • We look at many samples from a big language corpus
  • We keep adjusting the vector representations of words to minimize

this loss

152

Predicted word: argmax_w  w · ((w_{j−1} + w_{j+1}) / 2)

Loss: J(θ) = 1 − w_j · ((w_{j−1} + w_{j+1}) / 2)        (unit-norm word vectors)

slide-153
SLIDE 153

Neural Embedding Models (Mikolov et al. 2013)

153
  • Distributed representations of words and phrases and their compositionality [Mikolov et al.'13]

CBoW model Skip-gram model

Image credit: Ed Grefenstette

slide-154
SLIDE 154

Details of Word2Vec

  • Predict surrounding words in a window of length m of every word.
  • Objective function: maximize the log probability of any context word given the current center word (θ represents all the variables we optimize):

154

J(θ) = (1/T) Σ_{t=1}^{T}  Σ_{−m ≤ j ≤ m, j ≠ 0}  log p(w_{t+j} | w_t)

Distributed representations of words and phrases and their compositionality [Mikolov et al.'13]
slide-155
SLIDE 155

Details of Word2Vec

  • Predict surrounding words in a window of length m of every word.
  • The simplest first formulation is:

where o is the outside (or output) word index, c is the center word index, and u_o and v_c are the “outside” and “center” vectors of words o and c

  • Every word has two vectors!
  • This is essentially “dynamic” logistic regression
155
Distributed representations of words and phrases and their compositionality [Mikolov et al.'13]

p(o | c) = exp(u_oᵀ v_c) / Σ_{w=1}^{W} exp(u_wᵀ v_c)
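A small NumPy sketch of this softmax formulation, with an illustrative (tiny) vocabulary and random vectors standing in for the learned u and v:

import numpy as np

W, d = 10, 4                        # vocabulary size and embedding size (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(W, d))         # "outside" vectors u_w
V = rng.normal(size=(W, d))         # "center" vectors v_w

def p_outside_given_center(o, c):
    scores = U @ V[c]               # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()          # for numerical stability
    e = np.exp(scores)
    return e[o] / e.sum()           # exp(u_o^T v_c) / sum_w exp(u_w^T v_c)

print(p_outside_given_center(o=3, c=7))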

slide-156
SLIDE 156

Intuition: similarity as dot-product between a target vector and context vector

[Diagram: a target embedding matrix W (one d-dimensional target embedding per word) and a context embedding matrix C (one d-dimensional context embedding per word); Similarity(j, k) is computed between the target embedding for word j and the context embedding for word k.]

156
  • Similarity(j,k) = ck · vj
  • We use softmax to

turn into probabilities

p(w_k | w_j) = exp(c_k · v_j) / Σ_{i∈|V|} exp(c_i · v_j)

slide-157
SLIDE 157

Details of Word2Vec

  • Predict surrounding words in a window of length m of every word.
  • The simplest first formulation is:
  • Every word has two vectors!
  • We can either:
  • Just use vj
  • Sum them
  • Concatenate them to make a double-length embedding
157
Distributed representations of words and phrases and their compositionality [Mikolov et al.'13]

p(o | c) = exp(u_oᵀ v_c) / Σ_{w=1}^{W} exp(u_wᵀ v_c)

slide-158
SLIDE 158

Learning

  • Start with some initial embeddings

(e.g., random)

  • iteratively make the embeddings for a

word

⎯ more like the embeddings of its neighbors ⎯ less like the embeddings of other words.

158


slide-159
SLIDE 159

Visualizing W and C as a network for doing error backprop

[Diagram: Input layer → Projection layer → Output layer. A 1 × |V| one-hot input vector for wt is multiplied by W (|V| × d) to give the 1 × d embedding for wt, which is multiplied by C (d × |V|) to give the 1 × |V| probabilities of the context words wt+1.]

159
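A rough NumPy sketch of this pipeline (sizes are illustrative): the one-hot input just selects a row of W, and C maps that embedding to scores that a softmax turns into context-word probabilities.

import numpy as np

V, d = 8, 3                                      # vocabulary and embedding sizes (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))                      # |V| x d target embeddings
C = rng.normal(size=(d, V))                      # d x |V| context embeddings

x = np.zeros(V)
x[2] = 1.0                                       # 1-hot input vector for w_t
h = x @ W                                        # 1 x d embedding for w_t (row 2 of W)
scores = h @ C                                   # 1 x |V| scores
probs = np.exp(scores) / np.exp(scores).sum()    # probabilities of context words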
slide-160
SLIDE 160

Problem with the softmax

  • The denominator: have to compute over every word in vocabulary
  • Instead: just sample a few of those negative words
160

p(w_k | w_j) = exp(c_k · v_j) / Σ_{i∈|V|} exp(c_i · v_j)

slide-161
SLIDE 161

Goal in learning

  • Make the word like the context words
  • We want this to be high:
  • And not like k randomly selected “noise words”
  • We want this to be low:
161

lemon, a [tablespoon of apricot preserves or] jam c1 c2 w c3 c4

[cement metaphysical dear coaxial apricot attendant whence forever puddle] n1 n2 n3 n4 n5 n6 n7 n8

We want σ(c1·w) + σ(c2·w) + σ(c3·w) + σ(c4·w) to be high. In addition, we want the noise words n to have a low dot-product with our target word w: we want σ(n1·w) + σ(n2·w) + ... + σ(n8·w) to be low. In practice σ(x) = 1 / (1 + e^(−x)). This gives the learning objective for one word/context pair (w, c).

slide-162
SLIDE 162

Skipgram with negative sampling: Loss function

log σ(c·w) + Σ_{i=1}^{k} E_{wi∼p(w)} [log σ(−wi · w)]

162
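A minimal NumPy sketch of evaluating this objective for a single (w, c) pair with k sampled noise words; random vectors stand in here for learned embeddings:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, k = 50, 8
w = rng.normal(size=d)                   # target word vector
c = rng.normal(size=d)                   # true context word vector
noise = rng.normal(size=(k, d))          # k sampled noise-word vectors

objective = np.log(sigmoid(c @ w)) + np.sum(np.log(sigmoid(-noise @ w)))
# Training pushes this up: c.w large (positive pair) and each n_i.w small (negatives).
print(objective)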
slide-163
SLIDE 163

Stochastic gradients with word vectors!

  • But in each window, we only have at most 2m + 1 words, so the gradient ∇θ Jt(θ) is very sparse!

163

Slide credit: Richard Socher

slide-164
SLIDE 164

Stochastic gradients with word vectors!

  • We may as well only update the word vectors that actually appear!
  • Solution: either keep around a hash for word vectors, or only update certain columns of the full embedding matrices U and V

  • Important if you have millions of word vectors and do distributed computing

to not have to send gigantic updates around.

164

[Diagram: a d × |V| embedding matrix; only the columns for words that appear get updated.]
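A rough sketch of such a sparse update, assuming a hypothetical dict mapping the few word ids seen in the current window to their gradient vectors:

import numpy as np

V, d = 100_000, 300
U = np.zeros((V, d), dtype=np.float32)   # stand-in embedding matrix

def sparse_sgd_step(U, grads, lr=0.025):
    # grads: {word_id: gradient vector} for the handful of words in this window
    for word_id, g in grads.items():
        U[word_id] -= lr * g             # update only these rows, not all |V| of them
    return U

U = sparse_sgd_step(U, {17: np.ones(d, dtype=np.float32)})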

slide-165
SLIDE 165

Embeddings capture semantics!

  • Words similar to “frog”

1. frogs 2. toad 3. litoria 4. leptodactylidae 5. rana 6. lizard 7. eleutherodactylus

165

[Images of: “litoria”, “leptodactylidae”, “rana”, “eleutherodactylus”]

GloVe: Global Vectors for Word Representation [Pennington et al.'14]

slide-166
SLIDE 166

Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’) vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)

166
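A small sketch of how such analogies are typically queried, assuming a hypothetical dict emb of unit-normalized word vectors (the function name and arguments are illustrative):

import numpy as np

def analogy(emb, a, b, c, topn=1):
    # Return the word(s) closest to vector(b) - vector(a) + vector(c)
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    scored = [(np.dot(v, target), w) for w, v in emb.items() if w not in (a, b, c)]
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# With real trained embeddings: analogy(emb, "man", "king", "woman") -> ["queen"]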
slide-167
SLIDE 167

Demo time

http://projector.tensorflow.org

167
slide-168
SLIDE 168 168

Next Lecture: Training Deep Neural Networks