Multi-Layer Networks & Back-Propagation, M. Soleymani (PowerPoint PPT Presentation)




slide-1
SLIDE 1

Multi-Layer Networks & Back-Propagation

  • M. Soleymani

Deep Learning, Sharif University of Technology, Spring 2019. Most slides have been adapted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li et al., cs231n, Stanford 2017.

slide-2
SLIDE 2

These boxes are functions

[Figure: black-box functions realized by networks: Voice signal → N.Net → Transcription; Image → N.Net → Text caption; Game State → N.Net → Next move]

  • Take an input
  • Produce an output
  • Can be modeled by a neural network!
slide-3
SLIDE 3

Questions

[Figure: Something → N.Net → Something weird]

  • Preliminaries:
    – How do we represent the input?
    – How do we represent the output?
  • How do we compose the network that performs the requisite function?

slide-5
SLIDE 5

Preliminaries : The units in the networks

[Figure: a single unit: inputs x_1 … x_d with weights w_1 … w_d, a bias, a weighted sum, and an activation]

  • Units or neurons
    – In the general setting, inputs are real-valued
    – A bias b represents a threshold to trigger the unit
    – Activation functions are not necessarily threshold functions

slide-6
SLIDE 6

Preliminaries : Redrawing the neuron

[Figure: the same unit redrawn, with the bias as the weight of an extra (d+1)-th input fixed at 1]

  • The bias can also be viewed as the weight of another input component that is always set to 1

slide-7
SLIDE 7

Learning problem

  • Given: the architecture of the network
  • Training data: a set of input-output pairs

    (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))

  • We want to find the function f on the input space that produces the desired outputs
    – We consider a neural network as a parametric function f(x; W)

slide-8
SLIDE 8

What is f()? A typical network

[Figure: a layered network with input units, hidden units, and output units]

  • We assume a “layered” network for simplicity
  • Generic terminology:
    – We will refer to the inputs as the input units
      • No neurons here – the “input units” are just the inputs
    – We refer to the outputs as the output units
    – Intermediate units are “hidden” units

slide-9
SLIDE 9

First : the structure of the network

  • We will assume a feed-forward network
    – No loops: neuron outputs do not feed back to their inputs directly or indirectly
    – Loopy networks are a future topic
  • Part of the design of a network: the architecture
    – How many layers/neurons, which neuron connects to which and how, etc.
  • For now, assume the architecture of the network is capable of representing the needed function

slide-10
SLIDE 10

What we learn: The parameters of the network

  • Given: the architecture of the network
  • The parameters of the network: the weights and biases
    – The weights associated with the blue arrows in the picture
  • Learning the network: determining the values of these parameters such that the network computes the desired function

slide-11
SLIDE 11

Problem setup

  • Given: the architecture of the network
  • Training data: a set of input-output pairs

    (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))

  • We want to find the function f
    – We consider a neural network as a parametric function f(x; W)
  • We need a loss function that penalizes the obtained output f(x; W) when the desired output is y:

    E(W) = (1/N) Σ_{n=1}^{N} loss(f(x^(n); W), y^(n))

slide-12
SLIDE 12

Training an MLP

  • We define a differentiable loss or divergence between the output of the network and the desired output for the training instances
    – And a total error, which is the average divergence over all training instances
  • We optimize network parameters to minimize this error

slide-13
SLIDE 13

Training an MLP: Activation function

  • Learning networks of threshold-activation neurons requires solving a hard combinatorial-optimization problem
    – Because we cannot compute the influence of small changes to the parameters on the overall error
  • Instead we use continuous activation functions, which enable us to estimate network parameters
    – This makes the output of the network differentiable w.r.t. every parameter in the network
    – The logistic activation neuron actually computes the a posteriori probability of the output given the input

slide-14
SLIDE 14

Activation function

  • With a threshold activation, the neuron’s output is a flat function with zero derivative everywhere, except at 0 where it is non-differentiable
    – You can vary the weights a lot without changing the error
    – There is no indication of which direction to change the weights to reduce error

slide-15
SLIDE 15

Activation function

[Figure: the unit with a sigmoid activation in place of the threshold]

  • A smooth activation makes the neuron differentiable, with non-zero derivatives over much of the input space
    – Small changes in weight can result in non-negligible changes in output
    – This enables us to estimate the parameters using gradient descent techniques

slide-16
SLIDE 16

Vector notation

Given a training set of input-output pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)):

  • x^(n) = (x_1^(n), x_2^(n), …, x_d^(n)) is the nth input vector
  • y^(n) = (y_1^(n), y_2^(n), …, y_K^(n)) is the nth desired output
  • o^(n) = (o_1^(n), o_2^(n), …, o_K^(n)) is the nth vector of actual outputs of the network
  • We will sometimes drop the superscript when referring to a specific instance

[Figure: network mapping inputs x_1 … x_d to outputs, compared against targets y_1 … y_K]

slide-17
SLIDE 17

Representing the input

[Figure: input layer, hidden layers, output layer]

  • Vectors of numbers
    – (or may even be just a scalar, if the input layer is of size 1)
    – E.g. vector of pixel values
    – E.g. vector of speech features
    – E.g. real-valued vector representing text
      • We will see how this happens later in the course
    – Other real-valued vectors
slide-18
SLIDE 18

Representing the output

[Figure: input layer, hidden layers, output layer]

  • If the desired output is real-valued, no special tricks are necessary
    – Scalar output: a single output neuron
      • o = scalar (real value)
    – Vector output: as many output neurons as the dimension of the desired output
      • o = [o_1, o_2, …, o_K] (vector of real values)
slide-19
SLIDE 19

Examples of loss functions

[Figure: network outputs feeding a divergence Div between output and target]

  • For real-valued output vectors, the (scaled) L2 divergence is popular:

    Err(y, o) = ½ ‖y − o‖² = ½ Σ_k (y_k − o_k)²

    – Squared Euclidean distance between desired and actual output
    – Note: this is differentiable

    ∂Err(y, o)/∂o_k = −(y_k − o_k) = (o_k − y_k)

    ∇_o Err(y, o) = [o_1 − y_1, o_2 − y_2, …, o_K − y_K]
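As a quick numeric check, the divergence and its gradient above can be sketched in plain Python (function names are illustrative, not from the slides):

```python
# Squared-error divergence Err(y, o) = 1/2 * sum_k (y_k - o_k)^2
# and its gradient dErr/do_k = o_k - y_k, for the desired output y
# and the actual output o.

def squared_error(y, o):
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))

def squared_error_grad(y, o):
    # The vector [o_1 - y_1, ..., o_K - y_K]
    return [ok - yk for yk, ok in zip(y, o)]
```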

slide-20
SLIDE 20

Representing the output

  • If the desired output is binary (is this a cat or not), use a simple 1/0 representation of the desired output
    – 1 = YES it’s a cat
    – 0 = NO it’s not a cat

slide-21
SLIDE 21

Typical Problem statement: binary classification

Training data: [Figure: training images, each paired with a 0/1 label]

Input: vector of pixel values. Output: sigmoid

  • Given many positive and negative examples (training data),
    – learn all weights such that the network does the desired job

slide-22
SLIDE 22

Activation function?

  • Real-life data are rarely clean
    – Overlapping classes
    – Rosenblatt’s perceptron wouldn’t work in the first place

slide-23
SLIDE 23

Non-linearly separable data

[Figure: 3-D scatter of two classes over features x1, x2]

  • Two-dimensional example
    – Blue dots (on the floor) on the “red” side
    – Red dots (suspended at Y=1) on the “blue” side
    – No line will cleanly separate the two colors

slide-24
SLIDE 24

Non-linearly separable data: 1-D example

[Figure: red and blue dots along a single axis y]

  • One-dimensional example for visualization
    – All (red) dots at Y=1 represent instances of class Y=1
    – All (blue) dots at Y=0 are from class Y=0
    – The data are not linearly separable
      • No threshold will cleanly separate red and blue dots
slide-25
SLIDE 25

The probability of y=1

  • Consider this differently: at each point look at a small window around that point
  • Plot the average value within the window
    – This is an approximation of the probability of Y=1 at that point

[Figure: sliding-window average of the labels as a function of x, a smooth sigmoid-like curve]

slide-26
SLIDE 26

Logistic regression

[Figure: a logistic unit over inputs x1, x2]

  • This is the perceptron with a sigmoid activation
    – It actually computes the probability that the input belongs to class 1
  • Decision: y > 0.5?
  • Shown here when x is a 2-D variable
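A minimal sketch of the logistic unit and its 0.5-threshold decision; the weights and bias below are made up for illustration:

```python
import math

# Logistic-regression unit: sigmoid of an affine function of the input,
# interpreted as P(class = 1 | x), followed by the 0.5 decision threshold.

def logistic_unit(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def decide(p):
    return 1 if p > 0.5 else 0
```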

slide-27
SLIDE 27

Representing the output

σ(z)

  • If the desired output is binary (is this a cat or not), use a simple 1/0 representation of the desired output
  • Output activation: typically a sigmoid
    – Viewed as the probability P(Y = 1 | x) of class value 1
      • Indicating the fact that for actual data, in general, a feature vector may occur for both classes, but with different probabilities
    – Is differentiable
slide-28
SLIDE 28

Differentiable Activation

[Figure: a unit computing y = σ(Σ_i w_i x_i + b)]

  • A differentiable activation
  • This particular one has a nice interpretation
slide-29
SLIDE 29

For binary classifier: Logistic regression

[Figure: sigmoid output feeding a KL divergence against the target]

  • For a binary classifier with scalar output o ∈ (0,1) and target y ∈ {0,1}, the cross-entropy between the output distribution [o, 1 − o] and the ideal output probability [y, 1 − y] is popular:

    L(y, o) = −y log(o) − (1 − y) log(1 − o)

  • Derivative:

    ∂L(y, o)/∂o = −1/o        if y = 1
                  1/(1 − o)   if y = 0

    where o = σ(z)
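The loss and its derivative can be sketched directly (names illustrative):

```python
import math

# Binary cross-entropy L(y, o) = -y log(o) - (1 - y) log(1 - o)
# with derivative dL/do = -1/o when y = 1 and 1/(1 - o) when y = 0.

def bce(y, o):
    return -y * math.log(o) - (1 - y) * math.log(1 - o)

def bce_grad(y, o):
    return -1.0 / o if y == 1 else 1.0 / (1.0 - o)
```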

slide-30
SLIDE 30

Choosing cost function: Examples

  • Regression problem
    – SSE
  • Classification problem
    – Cross-entropy
      • Binary classification

    Total error: E = (1/N) Σ_{n=1}^{N} E_n

    One-dimensional output:   E_n = ½ (o^(n) − y^(n))²
    Multi-dimensional output: E_n = ½ ‖o^(n) − y^(n)‖² = ½ Σ_{k=1}^{K} (o_k^(n) − y_k^(n))²

    Binary classification: loss_n = −y^(n) log o^(n) − (1 − y^(n)) log(1 − o^(n))

    The output layer uses a sigmoid activation function.

slide-31
SLIDE 31

Multi-class output: One-hot representations

  • Consider a network that must distinguish if an input is a cat, a dog, a camel, a hat, or a flower
  • For inputs of each of the five classes the desired output is:

    cat:    [1 0 0 0 0]ᵀ
    dog:    [0 1 0 0 0]ᵀ
    camel:  [0 0 1 0 0]ᵀ
    hat:    [0 0 0 1 0]ᵀ
    flower: [0 0 0 0 1]ᵀ

  • For input of any class, we will have a five-dimensional vector output with four zeros and a single 1 at the position of the class
  • This is a one-hot vector

slide-32
SLIDE 32

Multi-class networks

[Figure: input layer, hidden layers, output layer]

  • For a multi-class classifier with N classes, the one-hot representation will have N binary outputs
    – An N-dimensional binary vector
  • The neural network’s output too must ideally be binary (N−1 zeros and a single 1 in the right place)
  • More realistically, it will be a probability vector
    – N probability values that sum to 1

slide-33
SLIDE 33

Vector activation example: Softmax

  • Example: softmax vector activation

    z_j = Σ_i w_{ij} x_i + b_j,    o_j = exp(z_j) / Σ_k exp(z_k)

    – Parameters are the weights and biases
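A sketch of the softmax activation; subtracting max(z) before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import math

# Softmax vector activation: o_j = exp(z_j) / sum_k exp(z_k).
# Subtracting max(z) leaves the result unchanged but avoids overflow.

def softmax(z):
    m = max(z)
    exps = [math.exp(zj - m) for zj in z]
    s = sum(exps)
    return [e / s for e in exps]
```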

slide-34
SLIDE 34

Vector Activations

[Figure: input layer, hidden layers, output layer]

  • We can also have neurons that have multiple coupled outputs

    (o_1, o_2, …, o_M) = f(x_1, x_2, …, x_d; W)

    – The function f(.) operates on a set of inputs to produce a set of outputs
    – Modifying a single parameter in W will affect all outputs

slide-35
SLIDE 35

Multi-class classification: Output

[Figure: input layer, hidden layers, softmax output layer]

  • Softmax vector activation is often used at the output of multi-class classifier nets:

    z_j = Σ_i w_{ij} a_i,    o_j = exp(z_j) / Σ_k exp(z_k)

  • This can be viewed as the probability o_j = P(class = j | x)
slide-36
SLIDE 36

For multi-class classification

[Figure: outputs feeding a KL divergence]

  • The desired output y is a one-hot vector [0 0 … 1 … 0 0 0] with the 1 in the c-th position (for class c)
  • The actual output will be a probability distribution [o_1, o_2, …, o_M]
  • The cross-entropy between the desired one-hot output and the actual output:

    L(y, o) = −Σ_j y_j log o_j = −log o_c

  • Derivative:

    ∂L(y, o)/∂o_j = −1/o_c for the c-th component, 0 for the remaining components

    ∇_o L(y, o) = [0 0 … −1/o_c … 0 0]

    The slope is negative w.r.t. o_c: increasing o_c will reduce the divergence
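A direct sketch of the cross-entropy and its gradient against a one-hot target (names illustrative):

```python
import math

# Cross-entropy against a one-hot target: L = -log o_c, with gradient
# -1/o_c in the c-th slot and 0 elsewhere.

def cross_entropy(one_hot_y, o):
    c = one_hot_y.index(1)
    return -math.log(o[c])

def cross_entropy_grad(one_hot_y, o):
    c = one_hot_y.index(1)
    return [-1.0 / o[c] if j == c else 0.0 for j in range(len(o))]
```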

slide-37
SLIDE 37

For multi-class classification

  • Same setting as the previous slide: the desired output y is a one-hot vector with the 1 in the c-th position (for class c), and the actual output is a probability distribution [o_1, …, o_M]

    L(y, o) = −Σ_j y_j log o_j = −log o_c

    ∇_o L(y, o) = [0 0 … −1/o_c … 0 0]

  • Note: even when y = o, the derivative is not 0
  • The slope is negative w.r.t. o_c: increasing o_c will reduce the divergence

[Figure: outputs feeding a KL divergence]

slide-38
SLIDE 38

Choosing cost function: Examples

  • Regression problem
    – SSE
  • Classification problem
    – Cross-entropy
      • Binary classification
      • Multi-class classification

    Total error: E = (1/N) Σ_{n=1}^{N} E_n

    One-dimensional output:   E_n = ½ (o^(n) − y^(n))²
    Multi-dimensional output: E_n = ½ ‖o^(n) − y^(n)‖² = ½ Σ_{k=1}^{K} (o_k^(n) − y_k^(n))²

    Binary classification: loss_n = −y^(n) log o^(n) − (1 − y^(n)) log(1 − o^(n)), where the output layer uses a sigmoid activation o = 1/(1 + e^{−z})

    Multi-class classification: loss_n = −log o_{c(n)}, where the output is found by a softmax layer o_j = exp(z_j)/Σ_k exp(z_k)

slide-39
SLIDE 39

Problem setup

  • Given: the architecture of the network
  • Training data: a set of input-output pairs

    (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))

  • We need a loss function that penalizes the obtained output o = f(x; W) when the desired output is y:

    E(W) = (1/N) Σ_{n=1}^{N} loss(o^(n), y^(n)) = (1/N) Σ_{n=1}^{N} loss(f(x^(n); W), y^(n))

  • Minimize E w.r.t. W, which contains the weights w_{ij}^[k] and biases b_j^[k]

slide-40
SLIDE 40

How to adjust weights for multi-layer networks?

  • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
    – We need an efficient way of adapting all the weights, not just the last layer.
    – Learning the weights going into hidden units is equivalent to learning features.
    – This is difficult because nobody is telling us directly what the hidden units should do.

slide-41
SLIDE 41

Find the weights by optimizing the cost

  • Start from random weights and then adjust them iteratively to get lower cost.
  • Update the weights according to the gradient of the cost function.

Source: http://3b1b.co

slide-42
SLIDE 42

How does the network learn?

  • Which changes to the weights improve the cost the most?
  • The magnitude of each element of the gradient ∇E shows how sensitive the cost is to that weight or bias.

Source: http://3b1b.co

slide-43
SLIDE 43

Training multi-layer networks

  • Back-propagation
    – Training algorithm that is used to adjust weights in multi-layer networks (based on the training data)
    – The back-propagation algorithm is based on gradient descent
    – Uses the chain rule and dynamic programming to efficiently compute gradients

slide-44
SLIDE 44

Training Neural Nets through Gradient Descent

Total training error:

    E = (1/N) Σ_{n=1}^{N} loss(o^(n), y^(n))

  • Gradient descent algorithm:
  • Initialize all weights and biases w_{ij}^[k]
    – Using the extended notation: the bias is also a weight
  • Do:
    – For every layer k, for all i, j, update:
      • w_{ij}^[k] = w_{ij}^[k] − η ∂E/∂w_{ij}^[k]
  • Until E has converged
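The update rule above can be sketched on a toy one-parameter problem E(w) = (w − 3)², whose minimum is at w = 3; the learning rate and starting point are illustrative:

```python
# Gradient descent w <- w - eta * dE/dw on E(w) = (w - 3)^2.
# For this quadratic, dE/dw = 2 * (w - 3), and the iterates
# converge geometrically toward the minimizer w = 3.

def gradient_descent(w0, eta=0.1, steps=100):
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # dE/dw
        w = w - eta * grad
    return w
```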

slide-45
SLIDE 45

The derivative

  • Computing the derivative

Total training error:

    E = (1/N) Σ_{n=1}^{N} loss(o^(n), y^(n))

Total derivative:

    ∂E/∂w_{ij}^[k] = (1/N) Σ_{n=1}^{N} ∂loss(o^(n), y^(n)) / ∂w_{ij}^[k]

slide-46
SLIDE 46

Training by gradient descent

  • Initialize all weights w_{ij}^[k]
  • Do:
    – For all i, j, k, initialize ∂E/∂w_{ij}^[k] = 0
    – For all n = 1 … N:
      • For every layer k, for all i, j:
        • Compute ∂loss(o^(n), y^(n)) / ∂w_{ij}^[k]
        • ∂E/∂w_{ij}^[k] += ∂loss(o^(n), y^(n)) / ∂w_{ij}^[k]
    – For every layer k, for all i, j:

      w_{ij}^[k] = w_{ij}^[k] − (η/N) ∂E/∂w_{ij}^[k]
  • Until E has converged

slide-47
SLIDE 47

The derivative

  • So we must first figure out how to compute the derivative of the divergence for individual training inputs

Total training error:

    E = (1/N) Σ_{n=1}^{N} loss(o^(n), y^(n))

Total derivative:

    ∂E/∂w_{ij}^[k] = (1/N) Σ_{n=1}^{N} ∂loss(o^(n), y^(n)) / ∂w_{ij}^[k]

slide-48
SLIDE 48

Calculus Refresher: Basic rules of calculus

For any differentiable function y = f(x) with derivative dy/dx, the following must hold for sufficiently small Δx:

    Δy ≈ (dy/dx) Δx

For any differentiable function y = f(x_1, x_2, …, x_M) with partial derivatives ∂y/∂x_1, ∂y/∂x_2, …, ∂y/∂x_M, the following must hold for sufficiently small Δx_1, …, Δx_M:

    Δy ≈ (∂y/∂x_1) Δx_1 + (∂y/∂x_2) Δx_2 + ⋯ + (∂y/∂x_M) Δx_M

slide-49
SLIDE 49

Calculus Refresher: Chainrule

For any nested function y = f(g(x)):

    dy/dx = (df/dg) (dg/dx)

Check: we can confirm that Δy = (dy/dx) Δx

slide-50
SLIDE 50

Calculus Refresher: Distributed Chain rule

For y = f(g_1(x), g_2(x), …, g_M(x)):

    dy/dx = (∂f/∂g_1)(dg_1/dx) + (∂f/∂g_2)(dg_2/dx) + ⋯ + (∂f/∂g_M)(dg_M/dx)

Check: Δy = (dy/dx) Δx

slide-51
SLIDE 51

Distributed Chain Rule: Influence Diagram

[Influence diagram: x feeds g_1, g_2, …, g_M, all of which feed z]

  • x affects z through each of g_1, …, g_M
slide-52
SLIDE 52

Distributed Chain Rule: Influence Diagram

[Influence diagram: x feeds g_1, g_2, …, g_M, all of which feed z]

  • Small perturbations in x cause small perturbations in each of g_1, …, g_M, each of which individually additively perturbs z

slide-53
SLIDE 53

Simple chain rule

  • z = f(g(x)), with intermediate y = g(x):

    dz/dx = (dz/dy) (dy/dx)

slide-54
SLIDE 54

Multiple paths chain rule


slide-55
SLIDE 55

Backpropagation: a simple example

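The figures for this worked example are not preserved in the text. A standard small example in the same spirit, the cs231n computational graph f(x, y, z) = (x + y)·z, can be sketched as:

```python
# Forward pass through f(x, y, z) = (x + y) * z, then a backward pass
# applying the chain rule at each node of the graph.

def forward(x, y, z):
    q = x + y        # add node
    f = q * z        # multiply node
    return q, f

def backward(x, y, z, q):
    df_dq = z        # from f = q * z
    df_dz = q
    df_dx = df_dq * 1   # from q = x + y, dq/dx = 1
    df_dy = df_dq * 1   # dq/dy = 1
    return df_dx, df_dy, df_dz
```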

slide-61
SLIDE 61

How to propagate the gradients backward

    z = f(x, y)

    ∂loss/∂x = (∂loss/∂z)(∂z/∂x),    ∂loss/∂y = (∂loss/∂z)(∂z/∂y)

slide-62
SLIDE 62

How to propagate the gradients backward


slide-63
SLIDE 63

Returning to our problem

  • How to compute ∂loss(o, y) / ∂w_{ij}^[k]

slide-64
SLIDE 64

A first closer look at the network

[Figure: a tiny 2-input network, with each unit split into a weighted sum followed by an activation f(.), ending in output o]

  • Showing a tiny 2-input network for illustration
    – An actual network would have many more neurons and inputs
  • Explicitly separating the weighted sum of inputs from the activation

slide-65
SLIDE 65

A first closer look at the network

[Figure: the same network expanded, with all weights w_{ij}^[k] and activations shown, ending in output o]

  • Showing a tiny 2-input network for illustration
    – An actual network would have many more neurons and inputs
  • Expanded with all weights and activations shown
  • The overall function is differentiable w.r.t. every weight, bias and input

slide-66
SLIDE 66

Backpropagation: Notation

  • a^[0] ← input
  • output ← a^[L]

[Figure: chain of layers alternating affine sums z^[k] and activations a^[k] = f(z^[k])]

slide-67
SLIDE 67

Backpropagation: Last layer gradient

[Figure: last layer, output unit j: a_i^[L−1] → (× w_{ij}^[L]) → z_j^[L] → f → a_j^[L] = o_j]

    a_j^[L] = f(z_j^[L]),    z_j^[L] = Σ_i w_{ij}^[L] a_i^[L−1]

For squared error loss:

    loss = ½ Σ_j (o_j − y_j)²,    o_j = a_j^[L]

    ∂loss/∂a_j^[L] = (a_j^[L] − y_j)

    ∂loss/∂w_{ij}^[L] = (∂loss/∂a_j^[L]) (∂a_j^[L]/∂w_{ij}^[L])

    ∂a_j^[L]/∂w_{ij}^[L] = f′(z_j^[L]) (∂z_j^[L]/∂w_{ij}^[L]) = f′(z_j^[L]) a_i^[L−1]

    ∂loss/∂w_{ij}^[L] = (∂loss/∂a_j^[L]) f′(z_j^[L]) a_i^[L−1]

slide-68
SLIDE 68

Activations and theirderivatives

[Figure: popular activation functions and their derivatives]

  • Some popular activation functions and their derivatives
slide-69
SLIDE 69

Previous layers gradients

[Figure: layer k: a_i^[k−1] → (× w_{ij}^[k]) → z_j^[k] → f → a_j^[k]]

    a_j^[k] = f(z_j^[k]),    z_j^[k] = Σ_i w_{ij}^[k] a_i^[k−1]

Weight gradient at layer k:

    ∂loss/∂w_{ij}^[k] = (∂loss/∂a_j^[k]) (∂a_j^[k]/∂w_{ij}^[k])

    ∂a_j^[k]/∂w_{ij}^[k] = (∂a_j^[k]/∂z_j^[k]) (∂z_j^[k]/∂w_{ij}^[k]) = f′(z_j^[k]) a_i^[k−1]

Propagating to the previous layer’s activations (summing over the N^[k] units of layer k):

    ∂loss/∂a_i^[k−1] = Σ_{j=1}^{N^[k]} (∂loss/∂a_j^[k]) (∂a_j^[k]/∂z_j^[k]) (∂z_j^[k]/∂a_i^[k−1])
                     = Σ_{j=1}^{N^[k]} (∂loss/∂a_j^[k]) f′(z_j^[k]) w_{ij}^[k]

slide-70
SLIDE 70

Backpropagation:

    ∂loss/∂w_{ij}^[k] = (∂loss/∂a_j^[k]) (∂a_j^[k]/∂w_{ij}^[k]) = δ_j^[k] · f′(z_j^[k]) · a_i^[k−1]

  • δ_j^[k] = ∂loss/∂a_j^[k] is the sensitivity of the output to a_j^[k]
  • Sensitivity vectors can be obtained by running a backward process in the network architecture (hence the name backpropagation)

We will compute δ^[k−1] from δ^[k]:

    δ_i^[k−1] = Σ_{j=1}^{N^[k]} δ_j^[k] · f′(z_j^[k]) · w_{ij}^[k]

[Figure: layer k with a_j^[k] = f(z_j^[k]), z_j^[k] = Σ_i w_{ij}^[k] a_i^[k−1]]

slide-71
SLIDE 71

Find and save δ^[L]

  • Called the error, computed recursively in a backward manner
  • For the final layer k = L:

    δ_j^[L] = ∂loss/∂a_j^[L]

slide-72
SLIDE 72

Compute δ^[k−1] from δ^[k]

  • δ_j^[k] = ∂loss/∂a_j^[k] is the sensitivity of the output to a_j^[k]
  • Sensitivity vectors can be obtained by running a backward process in the network architecture (hence the name backpropagation)

    δ_i^[k−1] = Σ_{j=1}^{N^[k]} δ_j^[k] · f′(z_j^[k]) · w_{ij}^[k]

[Figure: layer k with a_j^[k] = f(z_j^[k]), z_j^[k] = Σ_i w_{ij}^[k] a_i^[k−1]]

slide-73
SLIDE 73

Backpropagation Algorithm

  • Initialize all weights to small random numbers.
  • While not satisfied
    • For each training example do:
      1. Feed the training example forward through the network, computing the outputs of all units (z and a) in the forward step, and the loss
      2. For each unit, find its δ in the backward step
      3. Update each network weight w_{ij}^[k] as

         w_{ij}^[k] ← w_{ij}^[k] − η ∂loss/∂w_{ij}^[k],  where  ∂loss/∂w_{ij}^[k] = δ_j^[k] · f′(z_j^[k]) · a_i^[k−1]
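A minimal sketch of the three steps above on a tiny one-hidden-layer network with sigmoid activations and squared-error loss; all sizes and values are illustrative:

```python
import math

# Forward step computes z and a for every unit; backward step computes
# the per-weight gradients delta * f'(z) * a from the layer above.

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    z1 = [sum(W1[j][i] * x[i] for i in range(len(x))) for j in range(len(W1))]
    a1 = [sig(z) for z in z1]
    z2 = sum(W2[i] * a1[i] for i in range(len(a1)))
    a2 = sig(z2)
    return z1, a1, z2, a2

def backprop(x, y, W1, W2):
    z1, a1, z2, a2 = forward(x, W1, W2)
    delta2 = a2 - y                                   # d loss / d a2, loss = 1/2 (a2 - y)^2
    dW2 = [delta2 * sig(z2) * (1 - sig(z2)) * a1[i] for i in range(len(a1))]
    delta1 = [delta2 * sig(z2) * (1 - sig(z2)) * W2[j] for j in range(len(W2))]
    dW1 = [[delta1[j] * sig(z1[j]) * (1 - sig(z1[j])) * x[i]
            for i in range(len(x))] for j in range(len(W1))]
    return dW1, dW2
```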

slide-74
SLIDE 74

Another example

SLIDE 87

Another example

[local gradient] × [upstream gradient]:
    x0: [2] × [0.2] = 0.4
    w0: [−1] × [0.2] = −0.2

slide-88
SLIDE 88

Derivative of sigmoid function

    σ(z) = 1 / (1 + e^{−z})

    dσ/dz = e^{−z} / (1 + e^{−z})² = σ(z) (1 − σ(z))
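The identity dσ/dz = σ(z)(1 − σ(z)) can be checked numerically with a minimal sketch:

```python
import math

# Sigmoid and its derivative in the factored form sigma * (1 - sigma),
# which lets the backward pass reuse the forward-pass output.

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def sig_prime(z):
    return sig(z) * (1.0 - sig(z))
```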

slide-90
SLIDE 90

Patterns in backward flow

  • add gate: gradient distributor
  • max gate: gradient router
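The two patterns can be sketched as local backward rules, where `upstream` is the gradient flowing in from above:

```python
# add gate distributes the upstream gradient to both inputs unchanged;
# max gate routes it only to the input that won the max.

def add_backward(x, y, upstream):
    return upstream, upstream

def max_backward(x, y, upstream):
    return (upstream, 0.0) if x >= y else (0.0, upstream)
```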

slide-91
SLIDE 91

Modularized implementation: forward / backward API

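The code screenshots for these slides are not preserved. A minimal gate object in the style of the cs231n multiply-gate example, caching its inputs in forward for use in backward:

```python
# A node implementing the forward() / backward() API: forward computes the
# result and saves the intermediates; backward applies the chain rule to
# the upstream gradient dz and returns gradients w.r.t. the inputs.

class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y      # save intermediates for the backward pass
        return x * y

    def backward(self, dz):
        dx = self.y * dz           # d(x*y)/dx = y
        dy = self.x * dz           # d(x*y)/dy = x
        return dx, dy
```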

slide-94
SLIDE 94

Vector formulation

  • For layered networks it is generally simpler to think of the process in terms of vector operations
    – Simpler arithmetic
    – Fast matrix libraries make operations much faster
  • We can restate the entire process in vector terms
    – This is what is actually used in any real system

slide-95
SLIDE 95

The Jacobian

  • The derivative of a vector function w.r.t. a vector input is called a Jacobian
  • It is the matrix of partial derivatives given below. Using vector notation y = f(x), with y = (y_1, …, y_M) and x = (x_1, …, x_d):

    ∂y/∂x = [ ∂y_1/∂x_1  ∂y_1/∂x_2  ⋯  ∂y_1/∂x_d
              ∂y_2/∂x_1  ∂y_2/∂x_2  ⋯  ∂y_2/∂x_d
              ⋯          ⋯          ⋱  ⋯
              ∂y_M/∂x_1  ∂y_M/∂x_2  ⋯  ∂y_M/∂x_d ]

    Check: Δy = (∂y/∂x) Δx

slide-96
SLIDE 96

Matrix calculus

  • Scalar-by-Vector
  • Vector-by-Vector
  • Scalar-by-Matrix
  • Vector-by-Matrix

    Scalar-by-vector: ∂z/∂x = [∂z/∂x_1, …, ∂z/∂x_d]

    Vector-by-vector: (∂z/∂x)_{ij} = ∂z_i/∂x_j, an M×d matrix

    Scalar-by-matrix: (∂z/∂W)_{ij} = ∂z/∂W_{ij}, the same shape as W

    Vector-by-matrix: ∂z/∂W_{ij} = (∂z/∂a)(∂a/∂W_{ij}), through an intermediate vector a

slide-97
SLIDE 97

Vector-by-matrix gradients


slide-98
SLIDE 98

Examples


slide-99
SLIDE 99

Jacobians can describe the derivatives of neural activations w.r.t. their input

  • For scalar activations
    – The number of outputs is identical to the number of inputs
  • The Jacobian is a diagonal matrix
    – Diagonal entries are individual derivatives of outputs w.r.t. inputs
    – Not showing the superscript “[k]” in the equations, for brevity

    ∂a/∂z = diag(da_1/dz_1, da_2/dz_2, …, da_M/dz_M)

slide-100
SLIDE 100
  • For scalar activations (shorthand notation):
    – The Jacobian is a diagonal matrix
    – Diagonal entries are individual derivatives of outputs w.r.t. inputs

    a_i = f(z_i),    ∂a/∂z = diag(f′(z_1), f′(z_2), …, f′(z_M))

slide-101
SLIDE 101
  • For the sigmoid activation (shorthand notation):
    – The Jacobian is a diagonal matrix
    – Diagonal entries are individual derivatives of outputs w.r.t. inputs

    a_i = σ(z_i),    ∂a/∂z = diag(σ(z_1)(1 − σ(z_1)), …, σ(z_M)(1 − σ(z_M)))

slide-102
SLIDE 102

For vector activations

  • The Jacobian is a full matrix
    – Entries are partial derivatives of individual outputs w.r.t. individual inputs

    (∂a/∂z)_{ij} = ∂a_i/∂z_j

slide-103
SLIDE 103

Special case: Affine functions

  • A matrix W^[k] and bias b^[k] operating on vector a^[k−1] to produce vector z^[k]:

    z^[k] = W^[k] a^[k−1] + b^[k],    ∂z^[k]/∂a^[k−1] = W^[k]
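A plain-Python sketch of the affine map and of multiplying an upstream gradient by the Jacobian Wᵀ (helper names are illustrative):

```python
# Affine layer z = W a + b, and the backward rule that multiplies the
# upstream gradient dz by the transpose of the Jacobian W.

def affine_forward(W, a, b):
    return [sum(W[j][i] * a[i] for i in range(len(a))) + b[j]
            for j in range(len(W))]

def affine_backward_input(W, dz):
    # d loss / d a = W^T (d loss / d z)
    return [sum(W[j][i] * dz[j] for j in range(len(W)))
            for i in range(len(W[0]))]
```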

slide-104
SLIDE 104

Vector derivatives: Chain rule

  • We can define a chain rule for Jacobians
  • For vector functions of vector inputs:

    z = g(h(x)), with intermediate a = h(x), z = g(a):

    ∂z/∂x = (∂z/∂a)(∂a/∂x)

    Check:  Δa = (∂a/∂x) Δx,  Δz = (∂z/∂a) Δa  ⇒  Δz = (∂z/∂a)(∂a/∂x) Δx = (∂z/∂x) Δx

Note the order: the derivative of the outer function comes first

slide-105
SLIDE 105

Dimension balancing


slide-106
SLIDE 106

Backpropagation shape rule

  • When you take gradients against a scalar, the gradient at each intermediate step has the shape of the denominator


slide-109
SLIDE 109

    r = Wx,    g(r) = ‖r‖² = rᵀr

slide-110
SLIDE 110

    r = Wx,    g(r) = ‖r‖² = rᵀr

    We want the gradients ∂g/∂r and ∂g/∂W

slide-111
SLIDE 111

    r = Wx,    g(r) = ‖r‖² = rᵀr

    ∂g/∂r = 2r

slide-112
SLIDE 112

    r = Wx

    ∂g/∂W = (∂g/∂r) xᵀ = 2 r xᵀ
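A numerical spot-check of ∂g/∂W = 2 r xᵀ on a small made-up example:

```python
# r = W x, g = r^T r; the analytic gradient w.r.t. W is the outer
# product 2 * r * x^T, checked below against a finite difference.

def matvec(W, x):
    return [sum(wij * xi for wij, xi in zip(row, x)) for row in W]

def g(W, x):
    r = matvec(W, x)
    return sum(ri * ri for ri in r)

def grad_W(W, x):
    r = matvec(W, x)
    return [[2 * ri * xi for xi in x] for ri in r]
```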

slide-113
SLIDE 113

Always check: the gradient with respect to a variable should have the same shape as the variable

slide-114
SLIDE 114


slide-115
SLIDE 115

Output as a composite function

    output = a^[L] = f(z^[L]) = f(W^[L] a^[L−1]) = f(W^[L] f(W^[L−1] a^[L−2])) = ⋯ = f(W^[L] f(W^[L−1] ⋯ f(W^[2] f(W^[1] x))))

For convenience, we use the same activation function for all layers. However, output-layer neurons most commonly do not need an activation function (they produce class scores or real-valued targets).

[Figure: x → ×W^[1] → z^[1] → f → a^[1] → ×W^[2] → ⋯ → ×W^[L] → z^[L] → f → a^[L] = output]

slide-116
SLIDE 116

Backward-pass vector

  • Assume we have ∂loss/∂a^[L]

    ∂loss/∂z^[k] = (∂loss/∂a^[k]) (∂a^[k]/∂z^[k])

    ∂loss/∂W^[k] = (∂loss/∂z^[k]) (a^[k−1])ᵀ

    ∂loss/∂a^[k−1] = (W^[k])ᵀ (∂loss/∂z^[k])

  • The Jacobian ∂a^[k]/∂z^[k] will be a diagonal matrix for scalar activations

[Figure: x → ×W^[1] → z^[1] → f → a^[1] → ⋯ → ×W^[L] → z^[L] → f → a^[L] = output]

slide-117
SLIDE 117

Mini-batch SGD

  • Loop:
    1. Sample a batch of data
    2. Forward-prop it through the graph (network), get the loss
    3. Backprop to calculate the gradients
    4. Update the parameters using the gradient
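The four-step loop can be sketched on a one-parameter linear model o = w·x fit to data generated with w = 2; all numbers are illustrative:

```python
import random

# Mini-batch SGD on loss 1/2 (w*x - y)^2: sample a batch, compute the
# per-example gradients (w*x - y)*x, and update with their mean.

def minibatch_sgd(data, w=0.0, eta=0.05, batch=4, steps=500):
    for _ in range(steps):
        sample = random.sample(data, batch)            # 1. sample a batch
        grads = [(w * x - y) * x for x, y in sample]   # 2-3. forward + backprop
        w -= eta * sum(grads) / batch                  # 4. update parameters
    return w
```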

slide-118
SLIDE 118

Summary

  • Neural nets may be very large: impractical to write down the gradient formula by hand for all parameters
  • Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
  • Implementations maintain a graph structure, where the nodes implement the forward() / backward() API
    – forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
    – backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs

slide-119
SLIDE 119

Converting error derivatives into a learning procedure

  • The backpropagation algorithm is an efficient way of computing the gradient of the error function w.r.t. weights and biases.
  • There are many other decisions to be made to turn these derivatives into a learning procedure:
    – Convergence or optimization issues: how do we use the error derivatives?
    – Generalization issues: how can we improve its decisions on unseen data?

slide-120
SLIDE 120

Resources

  • Deep Learning Book, Chapter 6.
  • Please see the following notes:
    – http://cs231n.stanford.edu/handouts/derivatives.pdf
    – http://cs231n.stanford.edu/handouts/linear-backprop.pdf
    – http://cs231n.github.io/optimization-2/