Lecture 12: Computational Graph / Backpropagation. Aykut Erdem, March 2016, Hacettepe University.


slide-1
SLIDE 1

Lecture 12:

− Computational Graph
− Backpropagation

Aykut Erdem

March 2016 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 2 due March 20, 2016!

  • Midterm exam on Thursday, March 24, 2016

− You are responsible for everything from the beginning of this class to the end

− You can prepare and bring a full-page copy sheet (A4-paper, both sides) to the exam.


  • Assignment 3 will be out soon!

− It is due April 7, 2016
− You will implement a 2-layer Neural Network

2

slide-3
SLIDE 3

Last time… 
 Multilayer Perceptron

3

  • Layer Representation
  • (typically) iterate between a linear mapping Wx and a nonlinear function

yi = Wi xi
xi+1 = σ(yi)

  • Loss function l(y, yi) to measure the quality of the estimate so far

(figure: layers x1 → x2 → x3 → x4 → y, with weight matrices W1, W2, W3, W4)
slide by Alex Smola

slide-4
SLIDE 4

Last time… Forward Pass

  • Output of the network can be written as:

hj(x) = f(vj0 + Σ_{i=1}^{D} xi vji)
ok(x) = g(wk0 + Σ_{j=1}^{J} hj(x) wkj)

(j indexing hidden units, k indexing the output units, D the number of inputs)

  • Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)

σ(z) = 1 / (1 + exp(−z))
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
ReLU(z) = max(0, z)

4

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

slide-5
SLIDE 5

Last time… Forward Pass in Python

  • Example code for a forward pass for a 3-layer network in Python:

(the code is shown as an image on the slide)
  • Can be implemented efficiently using matrix operations
  • Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?

5

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

[http://cs231n.github.io/neural-networks-1/]
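The code on the slide is only an image in this transcript. A runnable sketch consistent with the stated sizes (W1: 4 × 3, W2: 4 × 4), following the style of the linked cs231n notes, might look like the following; it also answers the question above, since W3 must then be 1 × 4 and the biases 4 × 1, 4 × 1, and 1 × 1. The random initialization is illustrative.

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

x = rng.standard_normal((3, 1))             # a random input vector (3 x 1)
h1 = f(W1 @ x + b1)                         # first hidden layer (4 x 1)
h2 = f(W2 @ h1 + b2)                        # second hidden layer (4 x 1)
out = W3 @ h2 + b3                          # output neuron (1 x 1)
```

Every layer is a matrix multiply plus a bias followed by the nonlinearity, which is exactly the Wx / σ iteration from the previous slide.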

slide-6
SLIDE 6

Today

  • Backpropagation and Neural Networks
  • Tips and Tricks

6

slide-7
SLIDE 7

Backpropagation and Neural Networks

7

slide-8
SLIDE 8

Recap: Loss function/Optimization

We defined a (linear) score function:

(figure: per-class scores for three example images)

TODO:

  • 1. Define a loss function that quantifies our unhappiness with the scores across the training data.

  • 2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

8

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-9
SLIDES 9-18

Softmax Classifier (Multinomial Logistic Regression)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
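These slides work the softmax classifier through on the board; the worked numbers are not in the transcript. As a minimal sketch of the loss they define (cross-entropy on softmax probabilities, with the standard max-shift for numerical stability; the example scores are the classic cs231n ones and are purely illustrative):

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    # shift by the max so exp() cannot overflow; softmax is invariant to shifts
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # loss is the negative log of the probability assigned to the correct class
    return -np.log(probs[correct_class])

scores = np.array([3.2, 5.1, -1.7])             # unnormalized class scores
loss = softmax_cross_entropy(scores, correct_class=0)
```

A confident correct prediction drives the loss toward 0; a confident wrong one makes it large.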

slide-19
SLIDE 19

Optimization

19

slide-20
SLIDE 20

Gradient Descent

20

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-21
SLIDE 21

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

21

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-22
SLIDE 22

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

22

there are also fancier update formulas (momentum, Adagrad, RMSProp, Adam, …)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
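The fancier update formulas mentioned above all modify the same basic step. As a sketch (parameter names and values are illustrative), vanilla SGD next to the momentum variant:

```python
import numpy as np

def sgd_update(w, grad, lr=1e-2):
    # vanilla mini-batch SGD: step against the gradient
    return w - lr * grad

def momentum_update(w, grad, v, lr=1e-2, mu=0.9):
    # momentum: a velocity v accumulates a decayed running sum of gradients
    v = mu * v - lr * grad
    return w + v, v

w = np.zeros(3)
v = np.zeros(3)
g = np.array([1.0, -2.0, 0.5])
w_sgd = sgd_update(w, g)
w_mom, v = momentum_update(w, g, v)   # with v = 0, the first step matches SGD
```

Momentum only starts to differ from SGD on later steps, once the velocity has built up a consistent direction.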

slide-23
SLIDE 23

The effects of different update formulas

23

(image credits to Alec Radford)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson


slide-26
SLIDE 26

Back-propagation

26

slide-27
SLIDE 27

Computational Graph

27

(graph: inputs x and W feed a * node producing s (scores); the scores go through the hinge loss and W through the regularizer R, and the two results are added (+) to give the loss L)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-28
SLIDE 28

Convolutional Network (AlexNet)

(figure: the AlexNet computational graph, with the input image and weights at one end and the loss at the other)

28

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-29
SLIDE 29

Neural Turing Machine

(figure: the Neural Turing Machine's computational graph, with input, tape, and loss)

29

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-30
SLIDES 30-42

e.g. x = -2, y = 5, z = -4
Want: the gradients with respect to x, y, and z

Chain rule:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
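The circuit worked on these slides is only an image in this transcript. Assuming it is the classic cs231n example f(x, y, z) = (x + y) z with the stated values, the forward and backward passes work out as:

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y          # intermediate q = 3
f = q * z          # output f = -12

# backward pass: chain rule, applied from the output back to the inputs
df = 1.0           # gradient of f with respect to itself
dz = q * df        # df/dz = q          -> 3
dq = z * df        # df/dq = z          -> -4
dx = 1.0 * dq      # df/dx = df/dq * dq/dx = dq * 1 -> -4
dy = 1.0 * dq      # df/dy = dq * 1                 -> -4
```

Each local step only needs the values cached during the forward pass; the chain rule stitches them together into the gradients the slide asks for.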

slide-43
SLIDES 43-48

(figure: a single gate f in the graph; activations flow forward through it, gradients flow backward, and each gate multiplies its "local gradient" by the gradient arriving from the layer above)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-49
SLIDES 49-62

Another example:

(-1) * (-0.20) = 0.20

[local gradient] x [its gradient]
[1] x [0.2] = 0.2
[1] x [0.2] = 0.2 (both inputs!)

[local gradient] x [its gradient]
x0: [2] x [0.2] = 0.4
w0: [-1] x [0.2] = -0.2

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
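The numbers on these slides match the classic cs231n sigmoid-neuron circuit f(w, x) = 1 / (1 + exp(−(w0·x0 + w1·x1 + w2))) with w = (2, −3, −3) and x = (−1, −2). Assuming that circuit, the full backward pass is:

```python
import math

w = [2.0, -3.0, -3.0]                  # w0, w1, w2 (w2 acts as the bias)
x = [-1.0, -2.0]                       # x0, x1

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]     # = 1.0
f = 1.0 / (1.0 + math.exp(-dot))       # sigmoid -> 0.73

# backward pass
ddot = f * (1.0 - f)                   # sigmoid's local gradient -> ~0.2
dx = [w[0]*ddot, w[1]*ddot]            # [local gradient] x [its gradient]
dw = [x[0]*ddot, x[1]*ddot, 1.0*ddot]  # dx0 -> ~0.4, dw0 -> ~-0.2, as above
```

Note how the multiply gates "switch" the inputs: the gradient on x0 is scaled by w0, and the gradient on w0 by x0.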

slide-63
SLIDES 63-64

sigmoid function / sigmoid gate

(0.73) * (1 - 0.73) = 0.2

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
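The shortcut used on the slide follows from differentiating the sigmoid directly:

```latex
\sigma(z) = \frac{1}{1+e^{-z}}
\quad\Rightarrow\quad
\frac{d\sigma}{dz}
= \frac{e^{-z}}{(1+e^{-z})^2}
= \left(\frac{1+e^{-z}-1}{1+e^{-z}}\right)\frac{1}{1+e^{-z}}
= \bigl(1-\sigma(z)\bigr)\,\sigma(z)
```

so with σ(z) = 0.73 the local gradient is 0.73 · 0.27 ≈ 0.2, using only the value already computed in the forward pass.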

slide-65
SLIDE 65

Patterns in backward flow

  • add gate: gradient distributor
  • max gate: gradient router
  • mul gate: gradient… “switcher”?

65

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
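The three patterns above can be sketched as tiny backward rules (dout is the gradient flowing in from above; the function names are illustrative):

```python
def add_backward(x, y, dout):
    # add gate distributes: both inputs receive the upstream gradient unchanged
    return dout, dout

def max_backward(x, y, dout):
    # max gate routes: only the input that won the forward max gets the gradient
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # mul gate "switches": each input's gradient is scaled by the OTHER input
    return y * dout, x * dout
```

The "switcher" behavior is why large inputs to a multiply gate produce large gradients on the other input, which matters later for preprocessing and weight scales.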

slide-66
SLIDE 66

66

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Gradients add at branches

+

slide-67
SLIDE 67

Implementation: forward/backward API

67

Graph (or Net) object. (Rough pseudo code)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
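The rough pseudocode on the slide is an image here. A minimal sketch of such a Graph (or Net) object, assuming a simple chain of gates, could be:

```python
class Graph:
    def __init__(self, gates):
        self.gates = gates                  # gates in topological order

    def forward(self, x):
        for gate in self.gates:             # pipe the data through every gate
            x = gate.forward(x)
        return x                            # the last gate outputs the loss

    def backward(self, dloss=1.0):
        d = dloss
        for gate in reversed(self.gates):   # same gates, reverse order
            d = gate.backward(d)
        return d                            # gradient w.r.t. the graph's input
```

The object itself stays trivial; all the real work lives in each gate's forward() / backward() pair.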

slide-68
SLIDE 68

Implementation: forward/backward API

68

(x, y, z are scalars)

(graph: a single * gate taking inputs x and y and producing z)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
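For the single * gate on the slide, with scalar x, y, z, the forward/backward API reduces to a few lines; a sketch:

```python
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y        # cache the inputs: backward needs them
        z = x * y
        return z

    def backward(self, dz):
        dx = self.y * dz             # [local gradient] x [upstream gradient]
        dy = self.x * dz
        return dx, dy
```

Caching x and y in forward() is the "save any intermediates" rule from the summary: without them, backward() could not compute the local gradients.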


slide-70
SLIDE 70

Summary

  • neural nets will be very large: no hope of writing down gradient formulas by hand for all parameters

  • backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates

  • implementations maintain a graph structure, where the nodes implement the forward() / backward() API

  • forward: compute the result of an operation and save any intermediates needed for gradient computation in memory

  • backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

70

slide-71
SLIDE 71

Where are we now…

71

Mini-batch SGD Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
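The four steps of the loop can be sketched end to end on a toy problem. Here the "graph" is just a linear model with a squared-error loss, so forward and backward fit on one line each; the data and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))            # toy training data
y = X @ np.array([1.0, -2.0, 0.5])           # toy targets (noiseless)
w = np.zeros(3)

for step in range(200):
    idx = rng.choice(len(X), size=32, replace=False)   # 1. sample a batch
    Xb, yb = X[idx], y[idx]
    pred = Xb @ w                                      # 2. forward prop, get loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2 * Xb.T @ (pred - yb) / len(Xb)            # 3. backprop the gradients
    w -= 0.1 * grad                                    # 4. update the parameters
```

For a real network, steps 2 and 3 become the graph's forward() and backward() calls; the loop itself does not change.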

slide-72
SLIDE 72

Next Lecture: 
 
 Convolutional Neural Networks

72