Lecture 12: Computational Graph / Backpropagation. Aykut Erdem, March 2016, Hacettepe University.


slide-1
SLIDE 1

Lecture 12:

− Computational Graph
− Backpropagation

Aykut Erdem

March 2016 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 2 due March 20, 2016!

  • Midterm exam on Thursday, March 24, 2016

− You are responsible for everything from the beginning of this class to the end

− You can prepare and bring a full-page copy sheet (A4-paper, both sides) to the exam.


  • Assignment 3 will be out soon!

− It is due April 7, 2016
− You will implement a 2-layer Neural Network

2

slide-3
SLIDE 3

Last time… 
 Multilayer Perceptron

3

  • Layer Representation
  • (typically) iterate between a linear mapping Wx and a nonlinear function

yi = Wi xi
xi+1 = σ(yi)

  • Loss function l(y, yi) to measure the quality of the estimate so far

(figure: layers x1 → x2 → x3 → x4 → y, with weight matrices W1, W2, W3, W4)
slide by Alex Smola

slide-4
SLIDE 4

Last time… Forward Pass

  • Output of the network can be written as:

hj(x) = f(vj0 + Σ_{i=1}^{D} xi vji)
ok(x) = g(wk0 + Σ_{j=1}^{J} hj(x) wkj)

(j indexing hidden units, k indexing the output units, D the number of inputs)

  • Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)

σ(z) = 1 / (1 + exp(−z))
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
ReLU(z) = max(0, z)

4

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

slide-5
SLIDE 5

Last time… Forward Pass in Python

  • Example code for a forward pass for a 3-layer network in Python:

(the code is shown as an image on the slide)
  • Can be implemented efficiently using matrix operations
  • Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?

5

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

[http://cs231n.github.io/neural-networks-1/]
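The code on the slide is only an image in this transcript. A runnable sketch consistent with the stated sizes (W1: 4 × 3, W2: 4 × 4), following the style of the linked cs231n notes, might look like the following; it also answers the question above, since W3 must then be 1 × 4 and the biases 4 × 1, 4 × 1, and 1 × 1. The random initialization is illustrative.

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))             # a random input vector (3 x 1)
h1 = f(W1 @ x + b1)                         # first hidden layer (4 x 1)
h2 = f(W2 @ h1 + b2)                        # second hidden layer (4 x 1)
out = W3 @ h2 + b3                          # output neuron (1 x 1)
```

Every layer is a matrix multiply plus a bias followed by the nonlinearity, which is exactly the Wx / σ iteration from the previous slide.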

slide-6
SLIDE 6

Today

  • Backpropagation and Neural Networks
  • Tips and Tricks

6

slide-7
SLIDE 7

Backpropagation and Neural Networks

7

slide-8
SLIDE 8

Recap: Loss function/Optimization

We defined a (linear) score function:

(figure: per-class scores for three example images)

TODO:

  • 1. Define a loss function that quantifies our unhappiness with the scores across the training data.

  • 2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

8

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-9
SLIDES 9-18

Softmax Classifier (Multinomial Logistic Regression)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
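These slides work the softmax classifier through on the board; the worked numbers are not in the transcript. As a minimal sketch of the loss they define (cross-entropy on softmax probabilities, with the standard max-shift for numerical stability; the example scores are the classic cs231n ones and are purely illustrative):

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    # shift by the max so exp() cannot overflow; softmax is invariant to shifts
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # loss is the negative log of the probability assigned to the correct class
    return -np.log(probs[correct_class])

scores = np.array([3.2, 5.1, -1.7])             # unnormalized class scores
loss = softmax_cross_entropy(scores, correct_class=0)
```

A confident correct prediction drives the loss toward 0; a confident wrong one makes it large.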

slide-19
SLIDE 19

Optimization

19

slide-20
SLIDE 20

Gradient Descent

20

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-21
SLIDE 21

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

21

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-22
SLIDE 22

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

22

there are also fancier update formulas (momentum, Adagrad, RMSProp, Adam, …)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
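The fancier update formulas mentioned above all modify the same basic step. As a sketch (parameter names and values are illustrative), vanilla SGD next to the momentum variant:

```python
import numpy as np

def sgd_update(w, grad, lr=1e-2):
    # vanilla mini-batch SGD: step against the gradient
    return w - lr * grad

def momentum_update(w, grad, v, lr=1e-2, mu=0.9):
    # momentum: a velocity v accumulates a decayed running sum of gradients
    v = mu * v - lr * grad
    return w + v, v

w = np.zeros(3)
v = np.zeros(3)
g = np.array([1.0, -2.0, 0.5])
w_sgd = sgd_update(w, g)
w_mom, v = momentum_update(w, g, v)   # with v = 0, the first step matches SGD
```

Momentum only starts to differ from SGD on later steps, once the velocity has built up a consistent direction.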

slide-23
SLIDE 23

The effects of different update formulas

23

(image credits to Alec Radford)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson


slide-26
SLIDE 26

Back-propagation

26

slide-27
SLIDE 27

Computational Graph

27

(graph: inputs x and W feed a * node producing s (scores); the scores go through the hinge loss and W through the regularizer R, and the two results are added (+) to give the loss L)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-28
SLIDE 28

Convolutional Network (AlexNet)

(figure: the AlexNet computational graph, with the input image and weights at one end and the loss at the other)

28

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-29
SLIDE 29

Neural Turing Machine

(figure: the Neural Turing Machine's computational graph, with input, tape, and loss)

29

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-30
SLIDES 30-42

e.g. x = -2, y = 5, z = -4
Want: the gradients with respect to x, y, and z

Chain rule:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
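The circuit worked on these slides is only an image in this transcript. Assuming it is the classic cs231n example f(x, y, z) = (x + y) z with the stated values, the forward and backward passes work out as:

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y          # intermediate q = 3
f = q * z          # output f = -12

# backward pass: chain rule, applied from the output back to the inputs
df = 1.0           # gradient of f with respect to itself
dz = q * df        # df/dz = q          -> 3
dq = z * df        # df/dq = z          -> -4
dx = 1.0 * dq      # df/dx = df/dq * dq/dx = dq * 1 -> -4
dy = 1.0 * dq      # df/dy = dq * 1                 -> -4
```

Each local step only needs the values cached during the forward pass; the chain rule stitches them together into the gradients the slide asks for.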

slide-43
SLIDES 43-48

(figure: a single gate f in the graph; activations flow forward through it, gradients flow backward, and each gate multiplies its "local gradient" by the gradient arriving from the layer above)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-49
SLIDES 49-62

Another example:

(-1) * (-0.20) = 0.20

[local gradient] x [its gradient]
[1] x [0.2] = 0.2
[1] x [0.2] = 0.2 (both inputs!)

[local gradient] x [its gradient]
x0: [2] x [0.2] = 0.4
w0: [-1] x [0.2] = -0.2

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
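The numbers on these slides match the classic cs231n sigmoid-neuron circuit f(w, x) = 1 / (1 + exp(−(w0·x0 + w1·x1 + w2))) with w = (2, −3, −3) and x = (−1, −2). Assuming that circuit, the full backward pass is:

```python
import math

w = [2.0, -3.0, -3.0]                  # w0, w1, w2 (w2 acts as the bias)
x = [-1.0, -2.0]                       # x0, x1

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]     # = 1.0
f = 1.0 / (1.0 + math.exp(-dot))       # sigmoid -> 0.73

# backward pass
ddot = f * (1.0 - f)                   # sigmoid's local gradient -> ~0.2
dx = [w[0]*ddot, w[1]*ddot]            # [local gradient] x [its gradient]
dw = [x[0]*ddot, x[1]*ddot, 1.0*ddot]  # dx0 -> ~0.4, dw0 -> ~-0.2, as above
```

Note how the multiply gates "switch" the inputs: the gradient on x0 is scaled by w0, and the gradient on w0 by x0.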

slide-63
SLIDES 63-64

sigmoid function / sigmoid gate

(0.73) * (1 - 0.73) = 0.2

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
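The shortcut used on the slide follows from differentiating the sigmoid directly:

```latex
\sigma(z) = \frac{1}{1+e^{-z}}
\quad\Rightarrow\quad
\frac{d\sigma}{dz}
= \frac{e^{-z}}{(1+e^{-z})^2}
= \left(\frac{1+e^{-z}-1}{1+e^{-z}}\right)\frac{1}{1+e^{-z}}
= \bigl(1-\sigma(z)\bigr)\,\sigma(z)
```

so with σ(z) = 0.73 the local gradient is 0.73 · 0.27 ≈ 0.2, using only the value already computed in the forward pass.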

slide-65
SLIDE 65

Patterns in backward flow

  • add gate: gradient distributor
  • max gate: gradient router
  • mul gate: gradient… “switcher”?

65

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
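The three patterns above can be sketched as tiny backward rules (dout is the gradient flowing in from above; the function names are illustrative):

```python
def add_backward(x, y, dout):
    # add gate distributes: both inputs receive the upstream gradient unchanged
    return dout, dout

def max_backward(x, y, dout):
    # max gate routes: only the input that won the forward max gets the gradient
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # mul gate "switches": each input's gradient is scaled by the OTHER input
    return y * dout, x * dout
```

The "switcher" behavior is why large inputs to a multiply gate produce large gradients on the other input, which matters later for preprocessing and weight scales.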

slide-66
SLIDE 66

66

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Gradients add at branches

+

slide-67
SLIDE 67

Implementation: forward/backward API

67

Graph (or Net) object. (Rough pseudo code)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
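The rough pseudocode on the slide is an image here. A minimal sketch of such a Graph (or Net) object, assuming a simple chain of gates, could be:

```python
class Graph:
    def __init__(self, gates):
        self.gates = gates                  # gates in topological order

    def forward(self, x):
        for gate in self.gates:             # pipe the data through every gate
            x = gate.forward(x)
        return x                            # the last gate outputs the loss

    def backward(self, dloss=1.0):
        d = dloss
        for gate in reversed(self.gates):   # same gates, reverse order
            d = gate.backward(d)
        return d                            # gradient w.r.t. the graph's input
```

The object itself stays trivial; all the real work lives in each gate's forward() / backward() pair.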

slide-68
SLIDE 68

Implementation: forward/backward API

68

(x, y, z are scalars)

(graph: a single * gate taking inputs x and y and producing z)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
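For the single * gate on the slide, with scalar x, y, z, the forward/backward API reduces to a few lines; a sketch:

```python
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y        # cache the inputs: backward needs them
        z = x * y
        return z

    def backward(self, dz):
        dx = self.y * dz             # [local gradient] x [upstream gradient]
        dy = self.x * dz
        return dx, dy
```

Caching x and y in forward() is the "save any intermediates" rule from the summary: without them, backward() could not compute the local gradients.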


slide-70
SLIDE 70

Summary

  • neural nets will be very large: no hope of writing down gradient formulas by hand for all parameters

  • backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates

  • implementations maintain a graph structure, where the nodes implement the forward() / backward() API

  • forward: compute the result of an operation and save any intermediates needed for gradient computation in memory

  • backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

70

slide-71
SLIDE 71

Where are we now…

71

Mini-batch SGD Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
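The four steps of the loop can be sketched end to end on a toy problem. Here the "graph" is just a linear model with a squared-error loss, so forward and backward fit on one line each; the data and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))            # toy training data
y = X @ np.array([1.0, -2.0, 0.5])           # toy targets (noiseless)
w = np.zeros(3)

for step in range(200):
    idx = rng.choice(len(X), size=32, replace=False)   # 1. sample a batch
    Xb, yb = X[idx], y[idx]
    pred = Xb @ w                                      # 2. forward prop, get loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2 * Xb.T @ (pred - yb) / len(Xb)            # 3. backprop the gradients
    w -= 0.1 * grad                                    # 4. update the parameters
```

For a real network, steps 2 and 3 become the graph's forward() and backward() calls; the loop itself does not change.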

slide-72
SLIDE 72

Next Lecture: 
 
 Convolutional Neural Networks

72