SLIDE 1

CS109A Introduction to Data Science

Pavlos Protopapas and Kevin Rader

Lecture 12: Perceptron and Back Propagation

SLIDE 2

Outline

  • 1. Review of Classification and Logistic Regression
  • 2. Introduction to Optimization: Gradient Descent – Stochastic Gradient Descent
  • 3. Single Neuron Network (‘Perceptron’)
  • 4. Multi-Layer Perceptron
  • 5. Back Propagation

SLIDE 4

Classification and Logistic Regression

SLIDE 5

Classification

Methods centered around modeling and prediction of a quantitative response variable (e.g., number of taxi pickups, number of bike rentals) are called regression methods (including Ridge, LASSO, etc.). When the response variable is categorical, the problem is no longer called a regression problem but is instead labeled a classification problem. The goal is to classify each observation into a category (aka class or cluster) defined by Y, based on a set of predictor variables X.

SLIDE 6

Heart Data

Age  Sex  ChestPain     RestBP  Chol  Fbs  RestECG  MaxHR  ExAng  Oldpeak  Slope  Ca   Thal        AHD
63   1    typical       145     233   1    2        150    0      2.3      3      0.0  fixed       No
67   1    asymptomatic  160     286   0    2        108    1      1.5      2      3.0  normal      Yes
67   1    asymptomatic  120     229   0    2        129    1      2.6      2      2.0  reversable  Yes
37   1    nonanginal    130     250   0    0        187    0      3.5      3      0.0  normal      No
41   0    nontypical    130     204   0    2        172    0      1.4      1      0.0  normal      No

The response variable Y (AHD) is Yes/No.

SLIDE 7

Heart Data: logistic estimation

We'd like to predict whether or not a person has heart disease, and we'd like to make this prediction, for now, based only on MaxHR.

SLIDE 8

Logistic Regression

Logistic regression addresses the problem of estimating a probability, $P(Y=1)$, given an input $X$. The logistic regression model uses a function, called the logistic function, to model $P(Y=1)$:

$$P(Y = 1) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$
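As a quick illustration (a sketch, not from the slides; the coefficient values are made up), the logistic function can be computed directly in NumPy:

import numpy as np

def logistic_prob(X, beta0, beta1):
    # P(Y = 1 | X) under the logistic regression model
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * X)))

# e.g., probability of heart disease at MaxHR = 150 with illustrative
# coefficients beta0 = 6.3, beta1 = -0.043
print(logistic_prob(150, 6.3, -0.043))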

SLIDE 9

Logistic Regression

As a result the model will predict $P(Y=1)$ with an S-shaped curve, which is the general shape of the logistic function.

$\beta_0$ shifts the curve right or left by $c = -\frac{\beta_0}{\beta_1}$.

$\beta_1$ controls how steep the S-shaped curve is; the distance from ½ to ~1 (or from ½ to ~0) is $\frac{2}{\beta_1}$.

Note: if $\beta_1$ is positive, then the predicted $P(Y=1)$ goes from zero for small values of $X$ to one for large values of $X$, and if $\beta_1$ is negative, then $P(Y=1)$ has the opposite association.

SLIDE 10

Logistic Regression

[Figure: the logistic curve, annotated with the shift $c = -\beta_0/\beta_1$ and the steepness scale set by $\beta_1$]

SLIDE 11

Logistic Regression

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

SLIDE 12

Logistic Regression

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

SLIDE 13

Estimating the coefficients for Logistic Regression

Find the coefficients that minimize the loss function

$$\mathcal{L}(\beta_0, \beta_1) = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
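This loss (the negative log-likelihood, i.e. binary cross-entropy) translates directly to NumPy; this is a sketch, with a clipping guard added to avoid log(0):

import numpy as np

def bce_loss(y, p, eps=1e-12):
    # L(beta0, beta1) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))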

SLIDE 14

But what is the idea?

Start with Regression or Logistic Regression:

$$Y = f(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4)$$

$\beta_1, \dots, \beta_4$ are the coefficients or weights; $\beta_0$ is the intercept or bias.

Classification: $f(X) = \frac{1}{1 + e^{-W^T X}}$.  Regression: $f(X) = W^T X$,

where $W = (W_0, W_1, \dots, W_4) = [\beta_0, \beta_1, \dots, \beta_4]$.

SLIDE 15

But what is the idea?

Start with all randomly selected weights. Most likely it will perform horribly. For example, on our heart data the model will give us the wrong answer.

Input: MaxHR = 200, Age = 52, Sex = Male, Chol = 152
Model output: $\hat{p} = 0.9 \Rightarrow$ Yes (bad computer!)
Correct answer: y = No

SLIDE 16

But what is the idea?

Start with all randomly selected weights. Most likely it will perform horribly. For example, on our heart data the model will give us the wrong answer.

Input: MaxHR = 170, Age = 42, Sex = Male, Chol = 342
Model output: $\hat{p} = 0.4 \Rightarrow$ No (bad computer!)
Correct answer: y = Yes

SLIDE 17

But what is the idea?

  • Loss function: takes all of these results, averages them, and tells us how bad or good the computer (those weights) is.
  • Telling the computer how bad or good it is does not help.
  • You want to tell it how to change those weights so it gets better.

Loss function: $\mathcal{L}(w_0, w_1, w_2, w_3, w_4)$. For now, let's only consider one weight, $\mathcal{L}(w_1)$.

SLIDE 18

Minimizing the Loss function

Ideally we want to know the value of $w_1$ that gives the minimum of $\mathcal{L}(W)$. To find the optimal point of a function $\mathcal{L}(W)$, we solve

$$\frac{d\mathcal{L}(W)}{dW} = 0$$

and find the $W$ that satisfies that equation. Sometimes there is no explicit solution for that.

SLIDE 19

Minimizing the Loss function

A more flexible method is:

  • Start from any point
  • Determine which direction to go to reduce the loss (left or right)
  • Specifically, we can calculate the slope of the function at this point
  • Shift to the right if slope is negative or shift to the left if slope is positive
  • Repeat
SLIDE 20

Minimization of the Loss Function

If the step is proportional to the slope then you avoid overshooting the minimum.

Question: What is the mathematical function that describes the slope?
Question: How do we generalize this to more than one predictor?
Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better?

SLIDE 21

Minimization of the Loss Function

If the step is proportional to the slope then you avoid overshooting the minimum.

Question: What is the mathematical function that describes the slope? The derivative.
Question: How do we generalize this to more than one predictor? Take the derivative with respect to each coefficient and do the same sequentially.
Question: What do you think is a good approach for telling the model how to change (what is the step size) to become better? More on this later.

SLIDE 22

Let’s play the Pavlos game

We know that we want to go in the opposite direction of the derivative, and we know we want to be making a step proportional to the derivative.

Making a step means: $w_{\text{new}} = w_{\text{old}} + \text{step}$

Opposite direction of the derivative means: $w_{\text{new}} = w_{\text{old}} - \lambda \frac{d\mathcal{L}}{dw}$

Changing to more conventional notation: $w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$

where $\lambda$ is the learning rate.
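A minimal sketch of this update rule (the loss here is a toy one-dimensional example, not the logistic loss):

def gradient_descent(dL_dw, w0, lam=0.1, n_steps=100):
    # iterate w <- w - lam * dL/dw starting from w0
    w = w0
    for _ in range(n_steps):
        w = w - lam * dL_dw(w)
    return w

# example on L(w) = (w - 2)**2, whose derivative is 2(w - 2); minimum at w = 2
print(gradient_descent(lambda w: 2 * (w - 2), w0=0.0))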

SLIDE 23

Gradient Descent

  • Gradient descent is a first-order optimization algorithm for finding a minimum of a function.
  • It is an iterative method.
  • $\mathcal{L}$ decreases in the direction of the negative derivative.
  • The learning rate is controlled by the magnitude of $\lambda$.

[Figure: loss $\mathcal{L}(w)$ as a function of a weight $w$, with steps following the negative slope toward the minimum]

$$w^{(i+1)} = w^{(i)} - \lambda \frac{d\mathcal{L}}{dw}$$

SLIDE 24

Considerations

  • We still need to derive the derivatives.
  • We need to know what the learning rate is, or how to set it.
  • We need to avoid local minima.
  • Finally, the full likelihood function includes summing up all individual ‘errors’. Unless you are a statistician, this can be hundreds of thousands of examples.

SLIDE 26

Derivatives: Memories from middle school

SLIDE 27

Linear Regression

Minimize the sum of squared residuals:

$$f = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

$$\frac{df}{d\beta_0} = 0 \Rightarrow -2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0 \Rightarrow \sum_i y_i - \beta_0 n - \beta_1 \sum_i x_i = 0 \Rightarrow \beta_0 = \bar{y} - \beta_1 \bar{x}$$

$$\frac{df}{d\beta_1} = 0 \Rightarrow 2\sum_i (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0 \Rightarrow -\sum_i x_i y_i + \beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$$

Substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$:

$$-\sum_i x_i y_i + (\bar{y} - \beta_1 \bar{x}) \sum_i x_i + \beta_1 \sum_i x_i^2 = 0$$

$$\beta_1 \left( \sum_i x_i^2 - n\bar{x}^2 \right) = \sum_i x_i y_i - n\bar{x}\bar{y} \;\Rightarrow\; \beta_1 = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
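These closed-form expressions translate directly to NumPy (a sketch; the data values are made up):

import numpy as np

def ols_fit(x, y):
    # closed-form simple linear regression coefficients
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(ols_fit(x, y))  # beta0 ~ 0.15, beta1 ~ 1.94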

SLIDE 28

Logistic Regression Derivatives


Can we do it? Wolfram Alpha can do it for us! We need a formalism to deal with these derivatives.

SLIDE 29

Chain Rule

  • Chain rule for computing gradients: for $y = g(x)$ and $z = f(y) = f(g(x))$,

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$

  • For vector-valued intermediates, $\mathbf{y} = g(\mathbf{x})$ and $z = f(\mathbf{y}) = f(g(\mathbf{x}))$:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$

  • For longer chains:

$$\frac{\partial z}{\partial x_i} = \sum_{j_1} \cdots \sum_{j_m} \frac{\partial z}{\partial y_{j_1}} \cdots \frac{\partial y_{j_m}}{\partial x_i}$$
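A quick numeric sanity check of the chain rule (toy functions, not from the slides): for $z = \sin(x^2)$ the chain rule gives $dz/dx = \cos(x^2) \cdot 2x$, which a finite difference should match.

import numpy as np

x = 1.3
analytic = np.cos(x ** 2) * 2 * x                            # chain rule
numeric = (np.sin((x + 1e-6) ** 2) - np.sin(x ** 2)) / 1e-6  # finite difference
print(analytic, numeric)  # the two values agree closely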

SLIDE 30

Logistic Regression derivatives

For logistic regression, the negative log of the likelihood is:

$$\mathcal{L} = \sum_i \mathcal{L}_i = -\sum_i \log L_i = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

$$\mathcal{L}_i = -y_i \log \frac{1}{1 + e^{-W^T X}} - (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$$

To simplify the analysis, let us split it into two parts: $\mathcal{L}_i = \mathcal{L}_i^A + \mathcal{L}_i^B$. So the derivative with respect to $W$ is:

$$\frac{\partial \mathcal{L}}{\partial W} = \sum_i \frac{\partial \mathcal{L}_i}{\partial W} = \sum_i \left( \frac{\partial \mathcal{L}_i^A}{\partial W} + \frac{\partial \mathcal{L}_i^B}{\partial W} \right)$$

SLIDE 31

Chain of intermediate variables for $\mathcal{L}_i^A = -y_i \log \frac{1}{1 + e^{-W^T X}}$:

| Variable | Partial derivative | Evaluated |
| $\xi_1 = -W^T X$ | $\partial\xi_1/\partial W = -X$ | $-X$ |
| $\xi_2 = e^{\xi_1} = e^{-W^T X}$ | $\partial\xi_2/\partial\xi_1 = e^{\xi_1}$ | $e^{-W^T X}$ |
| $\xi_3 = 1 + \xi_2 = 1 + e^{-W^T X}$ | $\partial\xi_3/\partial\xi_2 = 1$ | $1$ |
| $\xi_4 = 1/\xi_3 = \frac{1}{1 + e^{-W^T X}} = p$ | $\partial\xi_4/\partial\xi_3 = -1/\xi_3^2$ | $-\left(\frac{1}{1 + e^{-W^T X}}\right)^2$ |
| $\xi_5 = \log \xi_4 = \log p$ | $\partial\xi_5/\partial\xi_4 = 1/\xi_4$ | $1 + e^{-W^T X}$ |
| $\mathcal{L}_i^A = -y\,\xi_5$ | $\partial\mathcal{L}_i^A/\partial\xi_5 = -y$ | $-y$ |

By the chain rule:

$$\frac{\partial \mathcal{L}_i^A}{\partial W} = \frac{\partial \mathcal{L}_i^A}{\partial \xi_5} \frac{\partial \xi_5}{\partial \xi_4} \frac{\partial \xi_4}{\partial \xi_3} \frac{\partial \xi_3}{\partial \xi_2} \frac{\partial \xi_2}{\partial \xi_1} \frac{\partial \xi_1}{\partial W} = -y\,X\,\frac{e^{-W^T X}}{1 + e^{-W^T X}}$$

SLIDE 32

Chain for $\mathcal{L}_i^B = -(1 - y_i) \log\left[1 - \frac{1}{1 + e^{-W^T X}}\right]$, with $\xi_1, \dots, \xi_4$ as before:

| Variable | Partial derivative | Evaluated |
| $\xi_5 = 1 - \xi_4 = 1 - \frac{1}{1 + e^{-W^T X}}$ | $\partial\xi_5/\partial\xi_4 = -1$ | $-1$ |
| $\xi_6 = \log \xi_5 = \log(1 - p)$ | $\partial\xi_6/\partial\xi_5 = 1/\xi_5$ | $\frac{1 + e^{-W^T X}}{e^{-W^T X}}$ |
| $\mathcal{L}_i^B = -(1 - y)\,\xi_6$ | $\partial\mathcal{L}_i^B/\partial\xi_6 = -(1 - y)$ | $-(1 - y)$ |

By the chain rule:

$$\frac{\partial \mathcal{L}_i^B}{\partial W} = \frac{\partial \mathcal{L}_i^B}{\partial \xi_6} \frac{\partial \xi_6}{\partial \xi_5} \frac{\partial \xi_5}{\partial \xi_4} \frac{\partial \xi_4}{\partial \xi_3} \frac{\partial \xi_3}{\partial \xi_2} \frac{\partial \xi_2}{\partial \xi_1} \frac{\partial \xi_1}{\partial W} = (1 - y)\,X\,\frac{1}{1 + e^{-W^T X}}$$

SLIDE 33

Learning Rate

SLIDE 34

Learning Rate

Trial and error. There are many alternative methods which address how to set or adjust the learning rate, using the derivative or second derivatives and/or the momentum. To be discussed in the next lectures on NN.

  • J. Nocedal and S. Wright, “Numerical Optimization”, Springer, 1999.
  • TL;DR: J. Bullinaria, “Learning with Momentum, Conjugate Gradient Learning”, 2015.

SLIDE 35

Local and Global minima

SLIDE 36

Local vs Global Minima

[Figure: loss curve $\mathcal{L}(\theta)$ with a local minimum and a global minimum]

SLIDE 38
Local vs Global Minima

No guarantee that we get the global minimum. Question: What would be a good strategy?

SLIDE 39

Large data

SLIDE 40

Batch and Stochastic Gradient Descent

Instead of using all the examples for every step, use a subset of them (a batch).

The full loss function is

$$\mathcal{L} = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

For each iteration $k$, use the following loss function to derive the derivatives:

$$\mathcal{L}^k = -\sum_{i \in b_k} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

which is an approximation to the full loss function ($b_k$ is the set of indices in batch $k$).
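A minimal mini-batch SGD sketch for this loss (illustrative, assuming X holds one row per observation and no intercept; the gradient of the batch loss is $X_b^T (p - y_b)$):

import numpy as np

def sgd_logistic(X, y, lam=0.1, batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(n)                  # shuffle the observations
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]     # indices of batch k
            p = 1.0 / (1.0 + np.exp(-X[b] @ W))   # predictions on the batch
            W -= lam * (X[b].T @ (p - y[b]))      # step along the batch gradient
    return W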
SLIDE 41-49

Batch and Stochastic Gradient Descent

[Figure, animated across these slides: the loss $\mathcal{L}(\theta)$ under the full likelihood versus successive batch likelihoods; each batch yields a slightly different approximation to the full loss surface, so each step follows a slightly different curve.]

SLIDE 50

Artificial Neural Networks (ANN)

SLIDE 51

Logistic Regression Revisited

Logistic regression can be drawn as a pipeline with three stages per observation $x_i$:

Affine: $h_i = \beta_0 + \beta_1 x_i$
Activation: $p_i = \frac{1}{1 + e^{-h_i}}$
Loss: $\mathcal{L}_i(\beta) = -y_i \ln p_i - (1 - y_i) \ln(1 - p_i)$

and the total loss is $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$.
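A minimal sketch of this single-neuron pipeline in NumPy (function names are illustrative, not from the slides):

import numpy as np

def neuron_forward(x, beta0, beta1):
    h = beta0 + beta1 * x            # affine
    return 1.0 / (1.0 + np.exp(-h))  # activation (sigmoid)

def neuron_loss(y, p):
    # loss for one observation
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))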

SLIDE 52

Build our first ANN

The same pipeline, written progressively more generally:

Affine: $h = \beta_0 + \beta_1 x$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$

Affine: $h = \beta^T X$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$

Affine: $h = W^T X$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(W) = \sum_i \mathcal{L}_i(W)$

SLIDE 53

Build our first ANN

Affine: $h = W^T X$ → Activation: $p = \frac{1}{1 + e^{-h}}$ → Loss: $\mathcal{L}(W) = \sum_i \mathcal{L}_i(W)$

[Diagram: input $X$ feeding a single neuron that outputs $y$]

SLIDE 54

Example Using Heart Data


Slightly modified data to illustrate a concept.

SLIDE 55

Example Using Heart Data

[Figure: the heart data with the single-neuron (logistic) fit, $X$ vs $y$]

SLIDE 56

Example

[Figure: two panels comparing model output $y'$ and data $y$ as functions of $X$]

SLIDE 57

Pavlos game #232

[Diagram: two single-neuron networks, with weights W1 and W2, each take $X$ as input and produce $h_1$ and $h_2$; their outputs are summed to give $h_1 + h_2$]

SLIDE 58

Pavlos game #232

[Diagram: the outputs $h_1, h_2$ of the two neurons (weights W1, W2) are combined by a third affine node with weights W3]

$$r = W_{3,1} h_1 + W_{3,2} h_2 + W_{3,0}$$

SLIDE 59

Pavlos game #232

[Diagram: the same network; the combining node is followed by a sigmoid activation and the loss]

$$r = W_{3,1} h_1 + W_{3,2} h_2 + W_{3,0}, \qquad p = \frac{1}{1 + e^{-r}}, \qquad \mathcal{L} = -y \ln p - (1 - y) \ln(1 - p)$$

We need to learn W1, W2 and W3.
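A sketch of this small network in NumPy, assuming each hidden node is itself a sigmoid neuron and using an illustrative weight layout (W1 = $(w_{1,0}, w_{1,1})$, W2 likewise, W3 = $(w_{3,0}, w_{3,1}, w_{3,2})$):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, W1, W2, W3):
    h1 = sigmoid(W1[0] + W1[1] * x)      # first hidden neuron
    h2 = sigmoid(W2[0] + W2[1] * x)      # second hidden neuron
    r = W3[0] + W3[1] * h1 + W3[2] * h2  # affine combination of h1, h2
    return sigmoid(r)                    # p = P(y = 1 | x)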

SLIDE 60

Backpropagation

SLIDE 61

Backpropagation: Logistic Regression Revisited

The pipeline: Affine $h = \beta_0 + \beta_1 X$ → Activation $p = \frac{1}{1 + e^{-h}}$ → Loss $\mathcal{L}(\beta) = \sum_i \mathcal{L}_i(\beta)$.

Applying the chain rule, $\frac{\partial\mathcal{L}}{\partial\beta_1} = \frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial h}\frac{\partial h}{\partial\beta_1}$ and $\frac{\partial\mathcal{L}}{\partial\beta_0} = \frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial h}\frac{\partial h}{\partial\beta_0}$, with

$$\frac{\partial\mathcal{L}}{\partial p} = -y\frac{1}{p} + (1 - y)\frac{1}{1 - p}, \qquad \frac{\partial p}{\partial h} = \sigma(h)(1 - \sigma(h)), \qquad \frac{\partial h}{\partial\beta_1} = X, \quad \frac{\partial h}{\partial\beta_0} = 1$$

$$\frac{\partial\mathcal{L}}{\partial\beta_1} = -X\,\sigma(h)(1 - \sigma(h))\left[y\frac{1}{p} - (1 - y)\frac{1}{1 - p}\right], \qquad \frac{\partial\mathcal{L}}{\partial\beta_0} = -\sigma(h)(1 - \sigma(h))\left[y\frac{1}{p} - (1 - y)\frac{1}{1 - p}\right]$$
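These expressions translate directly into a gradient function (a sketch; note that the product $\frac{\partial\mathcal{L}}{\partial p}\frac{\partial p}{\partial h}$ simplifies to $p - y$):

import numpy as np

def grad_logistic(x, y, beta0, beta1):
    # per-observation gradients of the loss wrt beta0 and beta1
    h = beta0 + beta1 * x
    p = 1.0 / (1.0 + np.exp(-h))
    dL_dp = -y / p + (1 - y) / (1 - p)
    dp_dh = p * (1 - p)            # sigma(h) * (1 - sigma(h))
    dL_dh = dL_dp * dp_dh          # simplifies to p - y
    return dL_dh * 1.0, dL_dh * x  # (dL/dbeta0, dL/dbeta1)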

SLIDE 62

Backpropagation

  • 1. Derivatives need to be evaluated at some values of X, y and W.
  • 2. But since we have an expression, we can build a function that takes X, y, W as input and returns the derivatives, and then we can use gradient descent to update.
  • 3. This approach works well but it does not generalize: if the network is changed, we need to write a new function to evaluate the derivatives. For example, this network will need a different function for the derivatives:

[Diagram: a network where the input $X$ feeds two neurons with weights W1 and W2, whose outputs feed a node with weights W3 producing $y$]

SLIDE 63

Backpropagation

(The same points, illustrated with a larger network.)

[Diagram: the previous network extended with additional nodes and weights (W4, …); each new architecture would need yet another hand-derived derivative function]

SLIDE 64

Backpropagation: Pavlos game #456

We need to find a formalism to calculate the derivatives of the loss with respect to the weights that is:

  • 1. Flexible enough that adding a node or a layer or changing something in the network won’t require re-deriving the functional form from scratch.
  • 2. Exact.
  • 3. Computationally efficient.

Hints:

  • 1. Remember, we only need to evaluate the derivatives at $X_i$, $y_i$ and $W^{(k)}$.
  • 2. We should take advantage of the chain rule we learned before.
SLIDE 65

Idea 1: Evaluate the derivative at: X={3}, y=1, W=3

Evaluating the chain for $\mathcal{L}_i^A$ at X = 3, y = 1, W = 3:

| Variable | Partial derivative | Value of the variable | Value of the partial derivative |
| $\xi_1 = -W^T X$ | $\partial\xi_1/\partial W = -X$ | $-9$ | $-3$ |
| $\xi_2 = e^{\xi_1}$ | $\partial\xi_2/\partial\xi_1 = e^{\xi_1}$ | $e^{-9}$ | $e^{-9}$ |
| $\xi_3 = 1 + \xi_2$ | $\partial\xi_3/\partial\xi_2 = 1$ | $1 + e^{-9}$ | $1$ |
| $\xi_4 = 1/\xi_3 = p$ | $\partial\xi_4/\partial\xi_3 = -1/\xi_3^2$ | $\frac{1}{1 + e^{-9}}$ | $-\frac{1}{(1 + e^{-9})^2}$ |
| $\xi_5 = \log \xi_4$ | $\partial\xi_5/\partial\xi_4 = 1/\xi_4$ | $\log\frac{1}{1 + e^{-9}}$ | $1 + e^{-9}$ |
| $\mathcal{L}_i^A = -y\,\xi_5$ | $\partial\mathcal{L}_i^A/\partial\xi_5 = -y$ | $-\log\frac{1}{1 + e^{-9}}$ | $-1$ |

Multiplying the partial-derivative column down the chain:

$$\frac{\partial\mathcal{L}_i^A}{\partial W} = \frac{\partial\mathcal{L}_i^A}{\partial\xi_5} \frac{\partial\xi_5}{\partial\xi_4} \frac{\partial\xi_4}{\partial\xi_3} \frac{\partial\xi_3}{\partial\xi_2} \frac{\partial\xi_2}{\partial\xi_1} \frac{\partial\xi_1}{\partial W} = -3\,\frac{e^{-9}}{1 + e^{-9}} \approx -0.00037018372$$

SLIDE 66

Basic functions

We still need to derive the derivatives of $\mathcal{L}$ ourselves. (The table from the previous slide is repeated here.)

SLIDE 67

Basic functions

Notice, though, that these are basic functions that my grandparent can do (assuming numpy has been imported as np):

$\xi_0 = X$,  $\partial\xi_0/\partial X = 1$:

def x0(x): return x
def derx0(): return 1

$\xi_1 = -W^T \xi_0$,  $\partial\xi_1/\partial W = -X$:

def x1(a, x): return -a*x    # a plays the role of the data X, x of the weight W
def derx1(a, x): return -a

$\xi_2 = e^{\xi_1}$,  $\partial\xi_2/\partial\xi_1 = e^{\xi_1}$:

def x2(x): return np.exp(x)
def derx2(x): return np.exp(x)

$\xi_3 = 1 + \xi_2$,  $\partial\xi_3/\partial\xi_2 = 1$:

def x3(x): return 1 + x
def derx3(x): return 1

$\xi_4 = 1/\xi_3$,  $\partial\xi_4/\partial\xi_3 = -1/\xi_3^2$:

def x4(x): return 1/x
def derx4(x): return -(1/x)**2

$\xi_5 = \log \xi_4$,  $\partial\xi_5/\partial\xi_4 = 1/\xi_4$:

def x5(x): return np.log(x)
def derx5(x): return 1/x

$\mathcal{L}_i^A = -y\,\xi_5$,  $\partial\mathcal{L}_i^A/\partial\xi_5 = -y$:

def lossA(y, x): return -y*x
def derL(y): return -y
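As a quick check (not on the original slide), chaining these functions in forward mode, accumulating the derivative alongside each evaluation, reproduces the number from the earlier table:

X, y, W = 3.0, 1.0, 3.0
xi1, d1 = x1(X, W), derx1(X, W)  # xi1 = -9, d(xi1)/dW = -3
xi2, d2 = x2(xi1), derx2(xi1) * d1
xi3, d3 = x3(xi2), derx3(xi2) * d2
xi4, d4 = x4(xi3), derx4(xi3) * d3
xi5, d5 = x5(xi4), derx5(xi4) * d4
L_A, dLA_dW = lossA(y, xi5), derL(y) * d5
print(dLA_dW)  # ~ -0.00037018372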

SLIDE 68

Putting it altogether

  • 1. We specify the network structure.

[Diagram: the network with weights W1, W2 and W3]

  • 2. We create the computational graph …

What is a computational graph?

SLIDE 69

Computational Graph

The computational graph for the logistic loss chains elementary operations, starting from the inputs X and W:

$\times$:  $\xi_1 = -W^T X$
$\exp$:  $\xi_2 = e^{-W^T X}$
$+1$:  $\xi_3 = 1 + e^{-W^T X}$
$\div$:  $\xi_4 = \frac{1}{1 + e^{-W^T X}}$
$\log$:  $\xi_5 = \log \frac{1}{1 + e^{-W^T X}}$;  $1 - (\cdot)$:  $\xi_6 = 1 - \frac{1}{1 + e^{-W^T X}}$
$\log$:  $\xi_7 = \log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$
$\times\,y$:  $\xi_8 = y \log \frac{1}{1 + e^{-W^T X}}$;  $\times\,(1-y)$:  $\xi_9 = (1 - y)\log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$
$+$, $\times(-1)$:  $-\mathcal{L} = y \log \frac{1}{1 + e^{-W^T X}} + (1 - y)\log\left(1 - \frac{1}{1 + e^{-W^T X}}\right)$

SLIDE 70

Putting it altogether

  • 1. We specify the network structure.

[Diagram: the network with weights W1, W2 and W3]

  • 2. We create the computational graph.
  • 3. At each node of the graph we build two functions: the evaluation of the variable and its partial derivative with respect to the previous variable (as shown in the table three slides back).
  • 4. Now we can either go forward or backward depending on the situation. In general, forward is easier to implement and to understand. The difference is clearer when there are multiple nodes per layer.

SLIDE 71

Forward mode: Evaluate the derivative at: X={3}, y=1, W=3

Forward mode repeats the table from Slide 65, accumulating the derivative in the same order the variables are evaluated: starting from $\partial\xi_1/\partial W = -X = -3$, multiply by each local derivative while moving from the input toward the loss, giving

$$\frac{\partial\mathcal{L}_i^A}{\partial W} = -3\,\frac{e^{-9}}{1 + e^{-9}} \approx -0.00037018372$$

SLIDE 72

Backward mode: Evaluate the derivative at: X={3}, y=1, W=3

Backward mode first evaluates all the variables in a forward pass and stores all these values:

$$\xi_1 = -9, \quad \xi_2 = e^{-9}, \quad \xi_3 = 1 + e^{-9}, \quad \xi_4 = \frac{1}{1 + e^{-9}}, \quad \xi_5 = \log\frac{1}{1 + e^{-9}}$$

It then accumulates $\partial\mathcal{L}/\partial\xi$ from the loss backward, multiplying by one stored local derivative at a time: $\partial\mathcal{L}_i^A/\partial\xi_5 = -y = -1$, then $\partial\mathcal{L}_i^A/\partial\xi_4$, and so on down to $\partial\mathcal{L}_i^A/\partial W$.
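A sketch of the same backward accumulation, reusing the basic functions from the "Basic functions" slide: the forward values are stored, then the gradient is multiplied together from the loss back to the input.

X, y, W = 3.0, 1.0, 3.0
xi1 = x1(X, W); xi2 = x2(xi1); xi3 = x3(xi2); xi4 = x4(xi3)  # forward pass, stored
grad = derL(y)       # dL/dxi5 = -y
grad *= derx5(xi4)   # dL/dxi4
grad *= derx4(xi3)   # dL/dxi3
grad *= derx3(xi2)   # dL/dxi2
grad *= derx2(xi1)   # dL/dxi1
grad *= derx1(X, W)  # dL/dW ~ -0.00037018372
print(grad)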