Neural Networks: Backpropagation (Machine Learning) - PowerPoint PPT Presentation



SLIDE 1

Machine Learning

Neural Networks: Backpropagation

Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

SLIDE 2

This lecture

  • What is a neural network?
  • Predicting with a neural network
  • Training neural networks
    – Backpropagation
  • Practical concerns

SLIDE 3

Training a neural network

  • Given
    – A network architecture (layout of neurons, their connectivity and activations)
    – A dataset of labeled examples: S = {(x_i, y_i)}
  • The goal: Learn the weights of the neural network
  • Remember: For a fixed architecture, a neural network is a function parameterized by its weights
    – Prediction: y = NN(x, w)

SLIDE 4

Recall: Learning as loss minimization

We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss L, perhaps with a regularizer:

min_w Σ_i L(NN(x_i, w), y_i)
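
For concreteness, the objective with an explicit regularizer can be written as below; the squared-L2 penalty shown is only one common choice, not something these slides commit to:

    \min_{w} \sum_{i} L\big(NN(x_i, w),\, y_i\big) + \lambda R(w),
    \qquad \text{e.g. } R(w) = \tfrac{1}{2}\lVert w \rVert_2^2,\ \lambda \ge 0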

So far, we saw that this strategy worked for:

  1. Logistic Regression
  2. Support Vector Machines
  3. Perceptron
  4. LMS regression

All of these are linear models. The same idea works for non-linear models too!

Each of these minimizes a different loss function.

SLIDE 5

Back to our running example

(Figure: the running example network, with inputs x_1 and x_2, hidden units z_1 and z_2, and output y)

Given an input x, how is the output predicted?

y = w_01^o + w_11^o z_1 + w_21^o z_2
z_2 = σ(w_02^h + w_12^h x_1 + w_22^h x_2)
z_1 = σ(w_01^h + w_11^h x_1 + w_21^h x_2)

SLIDE 6

Back to our running example

Given an input x, how is the output predicted?

y = w_01^o + w_11^o z_1 + w_21^o z_2
z_2 = σ(w_02^h + w_12^h x_1 + w_22^h x_2)
z_1 = σ(w_01^h + w_11^h x_1 + w_21^h x_2)

Suppose the true label for this example is a number y_i. We can write the square loss for this example as:

L = ½ (y − y_i)²
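
Below is a minimal Python sketch of this forward pass and squared loss. The function and variable names, the weight values, and the example input and label are made up for illustration; sigma is the logistic function used on the slides.

    import math

    def sigma(s):
        # logistic activation
        return 1.0 / (1.0 + math.exp(-s))

    def forward(x1, x2, w_h, w_o):
        # w_h[j] = (w_0j^h, w_1j^h, w_2j^h); w_o = (w_01^o, w_11^o, w_21^o)
        z1 = sigma(w_h[0][0] + w_h[0][1] * x1 + w_h[0][2] * x2)
        z2 = sigma(w_h[1][0] + w_h[1][1] * x1 + w_h[1][2] * x2)
        y = w_o[0] + w_o[1] * z1 + w_o[2] * z2
        return y, z1, z2

    # made-up weights, input, and label, purely for illustration
    w_h = [(0.1, -0.2, 0.3), (0.0, 0.5, -0.4)]
    w_o = (0.2, 0.7, -0.1)
    y, z1, z2 = forward(1.0, 2.0, w_h, w_o)
    loss = 0.5 * (y - 1.5) ** 2   # L = 1/2 (y - y_i)^2 with true label y_i = 1.5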

SLIDE 7

Learning as loss minimization

We have a classifier NN that is completely defined by its weights. Learn the weights by minimizing a loss L, perhaps with a regularizer:

min_w Σ_i L(NN(x_i, w), y_i)

How do we solve the optimization problem?

SLIDE 8

Stochastic gradient descent

Goal: min_w Σ_i L(NN(x_i, w), y_i)

Given a training set S = {(x_i, y_i)}, x ∈ ℝ^d:

  1. Initialize parameters w
  2. For epoch = 1 … T:
     1. Shuffle the training set
     2. For each training example (x_i, y_i) ∈ S:
        • Treat this example as the entire dataset
        • Compute the gradient of the loss: ∇L(NN(x_i, w), y_i)
        • Update: w ← w − γ_t ∇L(NN(x_i, w), y_i)
  3. Return w

γ_t: learning rate; many tweaks possible. The objective is not convex, so initialization can be important.

Have we solved everything?
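
A minimal Python sketch of this loop. The function grad_loss and the other names here are stand-ins (the gradient could come, for example, from backpropagation, covered next), and the constant learning rate is just the simplest choice of γ_t:

    import random

    def sgd(train, w, grad_loss, lr=0.1, epochs=10):
        # train: list of (x, y) pairs; w: list of parameters
        # grad_loss(x, y, w): gradient of the per-example loss w.r.t. w
        for epoch in range(epochs):
            random.shuffle(train)        # 1. shuffle the training set
            for x, y in train:           # 2. one example at a time
                g = grad_loss(x, y, w)   # gradient on this single example
                w = [wi - lr * gi for wi, gi in zip(w, g)]  # w <- w - gamma_t * grad
        return w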

SLIDE 15

The derivative of the loss function?

If the neural network is a differentiable function, we can find the gradient
  – Or maybe its sub-gradient
  – This is decided by the activation functions and the loss function

It was easy for SVMs and logistic regression
  – Only one layer

But how do we find the sub-gradient of a more complex function?
  – E.g., a recent paper used a ~150 layer neural network for image classification!

We need an efficient algorithm to compute ∇L(NN(x_i, w), y_i): Backpropagation


SLIDE 18

Checkpoint

Where are we?

  • If we have a neural network (structure, activations, and weights), we can make a prediction for an input.
  • If we have the true label of the input, then we can define the loss for that example.
  • If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD.

Questions?


SLIDE 22

Reminder: Chain rule for derivatives

  – If z is a function of y, and y is a function of x
      • Then z is a function of x as well
  – Question: how do we find dz/dx?

Slide courtesy Richard Socher

SLIDE 23

Reminder: Chain rule for derivatives

  – If z = (a function of y_1) + (a function of y_2), and the y_i's are functions of x
      • Then z is a function of x as well
  – Question: how do we find dz/dx?

Slide courtesy Richard Socher

SLIDE 24

Reminder: Chain rule for derivatives

  – If z is a sum of functions of the y_i's, and the y_i's are functions of x
      • Then z is a function of x as well
  – Question: how do we find dz/dx?

Slide courtesy Richard Socher
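
Written out, the simple case and the sum case these reminders build toward are the usual chain rule and its sum form:

    \frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}
    \qquad\text{and}\qquad
    \frac{dz}{dx} = \sum_i \frac{\partial z}{\partial y_i}\,\frac{dy_i}{dx}
    \quad\text{when } z = \sum_i f_i(y_i) \text{ and each } y_i \text{ is a function of } x.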

SLIDE 25

Backpropagation

L = ½ (y − y*)²
y = w_01^o + w_11^o z_1 + w_21^o z_2
z_2 = σ(w_02^h + w_12^h x_1 + w_22^h x_2)
z_1 = σ(w_01^h + w_11^h x_1 + w_21^h x_2)

We want to compute ∂L/∂w_ij^o and ∂L/∂w_ij^h

Applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up)

SLIDE 28

Output layer

L = ½ (y − y*)²
y = w_01^o + w_11^o z_1 + w_21^o z_2

∂L/∂w_01^o = (∂L/∂y) · (∂y/∂w_01^o)

∂L/∂y = y − y*
∂y/∂w_01^o = 1
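
In code this derivative is just the two factors multiplied together; the values below are stand-ins for whatever a forward pass and the label produced:

    y, y_star = 0.8, 1.0          # stand-in network output and true label
    dL_dy = y - y_star            # dL/dy for L = 1/2 (y - y*)^2
    dy_dw01_o = 1.0               # the bias weight w_01^o enters y directly
    dL_dw01_o = dL_dy * dy_dw01_o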

SLIDE 32

Output layer

L = ½ (y − y*)²
y = w_01^o + w_11^o z_1 + w_21^o z_2

∂L/∂w_11^o = (∂L/∂y) · (∂y/∂w_11^o)

∂L/∂y = y − y*
∂y/∂w_11^o = z_1

We have already computed ∂L/∂y for the previous case. Cache it to speed things up!
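
The same pattern for w_11^o, again with stand-in values; note that dL_dy is reused rather than recomputed:

    y, y_star, z1 = 0.8, 1.0, 0.6   # stand-in output, label, and hidden activation
    dL_dy = y - y_star              # cached: identical to the previous case
    dL_dw11_o = dL_dy * z1          # dy/dw_11^o = z_1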

SLIDE 37

Hidden layer derivatives

L = ½ (y − y*)²
y = w_01^o + w_11^o z_1 + w_21^o z_2
z_2 = σ(w_02^h + w_12^h x_1 + w_22^h x_2)
z_1 = σ(w_01^h + w_11^h x_1 + w_21^h x_2)

We want ∂L/∂w_22^h:

∂L/∂w_22^h = (∂L/∂y) · (∂y/∂w_22^h)
           = (∂L/∂y) · ∂/∂w_22^h (w_01^o + w_11^o z_1 + w_21^o z_2)
           = (∂L/∂y) · (w_11^o ∂z_1/∂w_22^h + w_21^o ∂z_2/∂w_22^h)
           = (∂L/∂y) · w_21^o · (∂z_2/∂w_22^h)              [z_1 is not a function of w_22^h]
           = (∂L/∂y) · w_21^o · (∂z_2/∂s) · (∂s/∂w_22^h)    [call the argument of σ in z_2 "s"]

Each of these partial derivatives is easy:

∂L/∂y = y − y*
∂z_2/∂s = z_2 (1 − z_2)      (because z_2 = σ(s) is the logistic function we have already seen)
∂s/∂w_22^h = x_2

More important: we have already computed many of these partial derivatives, because we are proceeding from top to bottom (i.e. backwards).
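
And the hidden-layer derivative from this slide, with stand-in values; the cached dL_dy plus the forward-pass quantities are all that is needed:

    y, y_star = 0.8, 1.0            # stand-in output and label
    z2, x2, w21_o = 0.7, 2.0, -0.1  # stand-in activation, input, and output weight
    dL_dy = y - y_star              # cached from the output layer
    dz2_ds = z2 * (1.0 - z2)        # derivative of the logistic function
    ds_dw22_h = x2                  # s = w_02^h + w_12^h x_1 + w_22^h x_2
    dL_dw22_h = dL_dy * w21_o * dz2_ds * ds_dw22_h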

SLIDE 53

The Backpropagation Algorithm

The same algorithm works for multiple layers: repeated application of the chain rule for partial derivatives.

  – First perform a forward pass from the inputs to the output
  – Compute the loss
  – From the loss, proceed backwards to compute partial derivatives using the chain rule
  – Cache partial derivatives as you compute them
      • They will be used for lower layers
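
Putting the pieces together for the running two-input, two-hidden-unit example, here is a sketch of one forward pass plus a backward pass that caches dL/dy and the logistic derivatives, following the derivatives worked out above. The weight layout, names, and example values are my own, not from the slides:

    import math

    def sigma(s):
        return 1.0 / (1.0 + math.exp(-s))

    def backprop(x, y_star, w_h, w_o):
        # w_h[j] = [w_0j^h, w_1j^h, w_2j^h]; w_o = [w_01^o, w_11^o, w_21^o]
        # forward pass
        z = [sigma(wj[0] + wj[1] * x[0] + wj[2] * x[1]) for wj in w_h]
        y = w_o[0] + w_o[1] * z[0] + w_o[2] * z[1]
        loss = 0.5 * (y - y_star) ** 2
        # backward pass: compute dL/dy once and reuse it everywhere below
        dL_dy = y - y_star
        g_o = [dL_dy * 1.0, dL_dy * z[0], dL_dy * z[1]]   # output-layer gradients
        g_h = []
        for j in range(2):                                # hidden unit j
            dz_ds = z[j] * (1.0 - z[j])                   # logistic derivative
            delta = dL_dy * w_o[j + 1] * dz_ds            # shared cached factor
            g_h.append([delta * 1.0, delta * x[0], delta * x[1]])  # bias, x1, x2
        return loss, g_o, g_h

    # made-up weights and example, purely for illustration
    w_h = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
    w_o = [0.2, 0.7, -0.1]
    loss, g_o, g_h = backprop([1.0, 2.0], 1.5, w_h, w_o)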

SLIDE 54

Mechanizing learning

  • Backpropagation gives you the gradient that will be used for gradient descent
    – SGD gives us a generic learning algorithm
    – Backpropagation is a generic method for computing partial derivatives
  • A recursive algorithm that proceeds from the top of the network to the bottom
  • Modern neural network libraries implement automatic differentiation using backpropagation
    – Allows easy exploration of network architectures
    – Don't have to keep deriving the gradients by hand each time
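
As an illustration of the last point, a sketch (assuming the PyTorch library is available; the names and values are mine) of how an automatic-differentiation library computes the same gradients without deriving anything by hand:

    import torch

    x = torch.tensor([1.0, 2.0])
    y_star = torch.tensor(1.5)
    W_h = torch.randn(2, 3, requires_grad=True)  # hidden weights, bias in column 0
    w_o = torch.randn(3, requires_grad=True)     # output weights, bias first

    z = torch.sigmoid(W_h[:, 0] + W_h[:, 1] * x[0] + W_h[:, 2] * x[1])
    y = w_o[0] + w_o[1] * z[0] + w_o[2] * z[1]
    loss = 0.5 * (y - y_star) ** 2

    loss.backward()             # backpropagation, done by the library
    print(W_h.grad, w_o.grad)   # dL/dW_h and dL/dw_o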

SLIDE 55

Stochastic gradient descent

Goal: min_w Σ_i L(NN(x_i, w), y_i)

Given a training set S = {(x_i, y_i)}, x ∈ ℝ^d:

  1. Initialize parameters w
  2. For epoch = 1 … T:
     1. Shuffle the training set
     2. For each training example (x_i, y_i) ∈ S:
        • Treat this example as the entire dataset
        • Compute the gradient of the loss ∇L(NN(x_i, w), y_i) using backpropagation
        • Update: w ← w − γ_t ∇L(NN(x_i, w), y_i)
  3. Return w

γ_t: learning rate; many tweaks possible. The objective is not convex, so initialization can be important.