

slide-1
SLIDE 1

Aykut Erdem // Hacettepe University // Fall 2019

Lecture 12:

Computational Graph Backpropagation

BBM406

Fundamentals of 
 Machine Learning

Illustration: 3Blue1Brown

slide-2
SLIDE 2

Last time… 
 Multilayer Perceptron

2

  • Layer Representation
  • (typically) iterate between a linear mapping Wx and a nonlinear function
  • Loss function to measure quality of the estimate so far

y_i = W_i x_i,   x_{i+1} = σ(y_i)

[Figure: a chain of layers x_1 → x_2 → x_3 → x_4 → y with weight matrices W_1, W_2, W_3, W_4, followed by the loss l(y, y_i)]

slide by Alex Smola
slide-3
SLIDE 3

Last time… Forward Pass

  • Output of the network can be written as:

(j indexing hidden units, k indexing the output units, D number of inputs)

  • Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)

3

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

  • o_k(x) = g( w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj} ),  where  h_j(x) = f( v_{j0} + Σ_{i=1}^{D} x_i v_{ji} )

  σ(z) = 1 / (1 + exp(−z)),   tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),   ReLU(z) = max(0, z)

slide-4
SLIDE 4

Last time… Forward Pass in Python

  • Example code for a forward pass for a 3-layer network in Python (a sketch is given below)
  • Can be implemented efficiently using matrix operations
  • Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?
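The code shown on the slide is not reproduced in this extraction; below is a minimal sketch in the spirit of the cs231n example linked on this slide, assuming sigmoid activations and randomly initialized weights and biases (the shapes of W3 and the biases are assumptions answering the question above):

import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid), applied elementwise

x = np.random.randn(3, 1)                                # random input vector of three numbers (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)    # first layer: 4x3 weights, 4x1 bias
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)    # second layer: 4x4 weights, 4x1 bias
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)    # output layer (assumed 1x4 weights, 1x1 bias)

h1 = f(np.dot(W1, x) + b1)               # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)              # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                # output (1x1)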

4

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

[http://cs231n.github.io/neural-networks-1/]

slide-5
SLIDE 5

Backpropagation

5

slide-6
SLIDE 6

Recap: Loss function/Optimization

6

[Figure: a few training images with the class scores assigned to them by the current score function]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
We defined a (linear) score function.

TODO:
  • 1. Define a loss function that quantifies our unhappiness with the scores across the training data.
  • 2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

slide-7
SLIDE 7

7

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-8
SLIDE 8

8

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-9
SLIDE 9

9

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-10
SLIDE 10

10

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-11
SLIDE 11

11

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-12
SLIDE 12

12

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-13
SLIDE 13

13

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-14
SLIDE 14

14

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-15
SLIDE 15

15

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-16
SLIDE 16

Softmax Classifier (Multinomial Logistic Regression)

16

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
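The formulas on the softmax slides are not reproduced in this extraction. As a reminder, the softmax classifier interprets the scores as unnormalized log-probabilities of the classes and minimizes the cross-entropy loss L_i = −log( exp(s_{y_i}) / Σ_j exp(s_j) ). A tiny numerically stable sketch (the score values and the choice of correct class are hypothetical):

import numpy as np

scores = np.array([3.2, 5.1, -1.7])      # hypothetical unnormalized class scores for one example
probs = np.exp(scores - np.max(scores))  # shift scores for numerical stability
probs /= np.sum(probs)                   # softmax probabilities
loss = -np.log(probs[0])                 # cross-entropy loss if class 0 is the correct class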
slide-17
SLIDE 17

Optimization

17

slide-18
SLIDE 18

Gradient Descent

18

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-19
SLIDE 19

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

19

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-20
SLIDE 20

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient

20

there are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)
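The update loop itself is not reproduced in this extraction; a minimal runnable sketch of vanilla mini-batch SGD on toy data (the data, the squared-error loss, and all names here are assumptions for illustration, not the lecture's classifier loss):

import numpy as np

X, y = np.random.randn(1000, 3), np.random.randn(1000)   # toy dataset
weights, step_size = np.zeros(3), 1e-2

for step in range(100):
    idx = np.random.choice(len(X), 256, replace=False)    # sample a mini-batch of 256 examples
    Xb, yb = X[idx], y[idx]
    preds = Xb.dot(weights)
    loss = np.mean((preds - yb) ** 2)                      # loss on the mini-batch
    weights_grad = 2 * Xb.T.dot(preds - yb) / len(yb)      # gradient of the batch loss
    weights += -step_size * weights_grad                   # vanilla parameter update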

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-21
SLIDE 21

The effects of different update formulas

21

(image credits to Alec Radford)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-22
SLIDE 22

22

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-23
SLIDE 23

23

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-24
SLIDE 24

Computational Graph

24

[Graph: inputs x and W feed a * node that produces the scores s; the scores go into a hinge loss node, W also goes into a regularization node R, and the two are added (+) to give the total loss L]

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-25
SLIDE 25

Convolutional Network (AlexNet)

[Graph: input image and weights flow through the network to the loss]

25

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-26
SLIDE 26

Neural Turing Machine

[Graph: input and tape flow through the network to the loss]

26

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-27
SLIDE 27

27

e.g. x = -2, y = 5, z = -4

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-28
SLIDE 28

28

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-29
SLIDE 29

29

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-30
SLIDE 30

30

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-31
SLIDE 31

31

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-32
SLIDE 32

32

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-33
SLIDE 33

33

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-34
SLIDE 34

34

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-35
SLIDE 35

35

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-36
SLIDE 36

36

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-37
SLIDE 37

37

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Chain rule:

slide-38
SLIDE 38

38

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-39
SLIDE 39

39

e.g. x = -2, y = 5, z = -4 Want:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Chain rule:
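The function being differentiated on these slides is not reproduced in this extraction. Assuming the standard example that accompanies these values, f(x, y, z) = (x + y) · z with the intermediate q = x + y, the forward and backward passes work out as follows:

Forward:  q = x + y = 3,   f = q · z = −12
Backward (chain rule):
  ∂f/∂z = q = 3
  ∂f/∂q = z = −4
  ∂f/∂x = ∂f/∂q · ∂q/∂x = (−4) · 1 = −4
  ∂f/∂y = ∂f/∂q · ∂q/∂y = (−4) · 1 = −4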

slide-40
SLIDE 40

40

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations

slide-41
SLIDE 41

41

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations

“local gradient”

slide-42
SLIDE 42

42

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”

slide-43
SLIDE 43

43

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”

slide-44
SLIDE 44

44

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”

slide-45
SLIDE 45

45

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

f

activations gradients

“local gradient”
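In symbols: for a gate that computes z = f(x, y) somewhere inside the circuit, with final loss L, backpropagation multiplies the gate's local gradients by the gradient arriving from above, ∂L/∂x = (∂z/∂x) · (∂L/∂z) and ∂L/∂y = (∂z/∂y) · (∂L/∂z), and passes these back to the gate's inputs.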

slide-46
SLIDE 46

46

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-47
SLIDE 47

47

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-48
SLIDE 48

48

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-49
SLIDE 49

49

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-50
SLIDE 50

50

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-51
SLIDE 51

51

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-52
SLIDE 52

52

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-53
SLIDE 53

53

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-54
SLIDE 54

54

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-55
SLIDE 55

55

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

(-1) * (-0.20) = 0.20

slide-56
SLIDE 56

56

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-57
SLIDE 57

57

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

[local gradient] × [its gradient]:  [1] × [0.2] = 0.2  and  [1] × [0.2] = 0.2  (both inputs!)

slide-58
SLIDE 58

58

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-59
SLIDE 59

59

Another example:

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

[local gradient] × [its gradient]:  x0: [2] × [0.2] = 0.4,  w0: [−1] × [0.2] = −0.2

0.40
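The expression being differentiated in this running example is not reproduced in this extraction. The numbers on these slides are consistent with the standard two-dimensional sigmoid-neuron example f(w, x) = 1 / (1 + exp(−(w0·x0 + w1·x1 + w2))) evaluated at w0 = 2, x0 = −1, w1 = −3, x1 = −2, w2 = −3 (an assumption, stated here so the intermediate values above can be followed): the forward pass gives 0.73 at the output, the gradient flowing back into the dot product is 0.2, and multiplying by the local gradients gives ∂f/∂x0 = w0 · 0.2 = 0.4 and ∂f/∂w0 = x0 · 0.2 = −0.2, as shown.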

slide-60
SLIDE 60

60

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

sigmoid function sigmoid gate

0.40

slide-61
SLIDE 61

61

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

sigmoid function / sigmoid gate:  (0.73) * (1 − 0.73) = 0.2

0.40
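These slides use the fact that the sigmoid's derivative can be written in terms of its output: dσ(x)/dx = (1 − σ(x)) · σ(x). With the forward-pass output 0.73, the local gradient is (1 − 0.73) · 0.73 ≈ 0.2, and multiplying by the gradient flowing from above (1.0 at the output) gives the 0.2 that then propagates into the rest of the expression.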

slide-62
SLIDE 62

Patterns in backward flow

  • add gate: gradient distributor
  • max gate: gradient router
  • mul gate: gradient… “switcher”?
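A minimal sketch of why these patterns hold, for scalar inputs (the helper functions below are illustrative, not from the slides): the local derivatives of each gate determine how the upstream gradient dz is passed back.

# add gate: z = x + y, so dz/dx = dz/dy = 1 and the upstream gradient
# is distributed unchanged to both inputs
def add_backward(dz):
    return dz, dz

# max gate: z = max(x, y), so the gradient is routed only to the larger input
def max_backward(x, y, dz):
    return (dz, 0.0) if x >= y else (0.0, dz)

# mul gate: z = x * y, so each input receives the upstream gradient
# scaled ("switched") by the other input's value
def mul_backward(x, y, dz):
    return y * dz, x * dz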

62

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-63
SLIDE 63

63

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Gradients add at branches

+
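In other words, when a value x feeds several downstream nodes z_1, …, z_k, the multivariate chain rule sums the contributions coming back along each branch: ∂L/∂x = Σ_i (∂z_i/∂x) · (∂L/∂z_i).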

slide-64
SLIDE 64

Implementation: forward/backward API

64

Graph (or Net) object. (Rough pseudo code)
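The pseudocode on this slide is not reproduced in this extraction; a rough sketch of what such a Graph/Net object can look like (all names, and the assumption that gates expose forward(), backward(), and output, are illustrative):

class Net(object):
    def __init__(self, gates):
        self.gates = gates                 # gate objects already in topological order

    def forward(self):
        for gate in self.gates:            # run each gate once its inputs are ready
            gate.forward()
        return self.gates[-1].output       # the final gate outputs the loss

    def backward(self):
        for gate in reversed(self.gates):  # visit gates in reverse order
            gate.backward()                # each gate applies the chain rule locally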

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-65
SLIDE 65

Implementation: forward/backward API

65

(x,y,z are scalars)

* x y z

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
slide-66
SLIDE 66

Implementation: forward/backward API

66

(x,y,z are scalars)

* x y z

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
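A minimal runnable sketch of the forward/backward API for a single multiply gate with scalar inputs (class and variable names are assumptions in the spirit of the slide):

class MultiplyGate(object):
    def forward(self, x, y):
        z = x * y
        self.x = x        # cache the inputs: they are needed in the backward pass
        self.y = y
        return z

    def backward(self, dz):
        dx = self.y * dz  # [dz/dx] * [gradient on z from above]
        dy = self.x * dz  # [dz/dy] * [gradient on z from above]
        return dx, dy

gate = MultiplyGate()
z = gate.forward(-2.0, 5.0)   # forward pass: z = x * y = -10.0
dx, dy = gate.backward(1.0)   # backward pass with upstream gradient 1.0: dx = 5.0, dy = -2.0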
slide-67
SLIDE 67

Summary

  • neural nets will be very large: no hope of writing down the gradient formula by hand for all parameters
  • backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
  • implementations maintain a graph structure, where the nodes implement the forward() / backward() API
  • forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
  • backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

67

slide-68
SLIDE 68

Where are we now…

68

Mini-batch SGD

Loop:
  1. Sample a batch of data
  2. Forward prop it through the graph, get loss
  3. Backprop to calculate the gradients
  4. Update the parameters using the gradient

slide-69
SLIDE 69

Next Lecture: 
 
 Introduction to Deep Learning

69