

SLIDE 1

Neural Networks: Multi-Layer Networks & Back-Propagation

  • M. Soleymani

Artificial Intelligence, Sharif University of Technology, Spring 2019. Most slides have been adapted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li et al., cs231n, Stanford 2017.

SLIDE 2

Reasons to study neural computation

  • Neuroscience: To understand how the brain actually works.

– It's very big, very complicated, and made of stuff that dies when you poke it around, so we need to use computer simulations.

  • AI: To solve practical problems by using novel learning algorithms inspired by the brain.

– Learning algorithms can be very useful even if they are not how the brain actually works.

SLIDE 3

SLIDE 4

A typical cortical neuron

  • Gross physical structure:

– There is one axon that branches.
– There is a dendritic tree that collects input from other neurons.

  • Axons typically contact dendritic trees at synapses.

– A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.

  • Spike generation:

– There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.

SLIDE 5

Binary threshold neurons

  • McCulloch-Pitts (1943): influenced von Neumann.

– First compute a weighted sum of the inputs.
– Send out a spike of activity if the weighted sum exceeds a threshold.
– McCulloch and Pitts thought that each spike is like the truth value of a proposition, and each neuron combines truth values to compute the truth value of another proposition!

  • The unit computes g(Σ_i w_i x_i) over inputs x_1, …, x_N, where g is the activation function.

SLIDE 6

A better figure

  • A threshold unit

– "Fires" if the weighted sum of the inputs minus the "bias" T is positive:

z = Σ_i w_i x_i − T
y = 1 if z ≥ 0, else 0
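As a minimal sketch of the threshold unit above (the function and variable names are mine, not from the slides):

```python
import numpy as np

def threshold_unit(x, w, T):
    """McCulloch-Pitts threshold unit: fires iff the weighted sum reaches T."""
    z = np.dot(w, x) - T
    return 1 if z >= 0 else 0

# With weights of 1 and threshold 1.5, the unit behaves as an AND gate.
w = np.array([1.0, 1.0])
print(threshold_unit(np.array([1, 1]), w, 1.5))  # 1
print(threshold_unit(np.array([1, 0]), w, 1.5))  # 0
```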

SLIDE 7

McCulloch-Pitts neuron: binary threshold

z = 1 if Σ_i w_i x_i ≥ T, else 0    (T: activation threshold)

Equivalently, add an input fixed at 1 with bias weight b = −T:

z = 1 if Σ_i w_i x_i + b ≥ 0, else 0

This is equivalent to the binary McCulloch-Pitts neuron.

SLIDE 8

Neural nets and the brain

  • Neural nets are composed of networks of computational models of neurons called perceptrons.

SLIDE 9

The perceptron

  • A threshold unit

– "Fires" if the weighted sum of inputs exceeds a threshold.
– Electrical engineers call this a threshold gate.

  • A basic unit of Boolean circuits.

y = 1 if Σ_i w_i x_i ≥ T, else 0

SLIDE 10

The "soft" perceptron (logistic)

z = Σ_i w_i x_i − T
y = 1 / (1 + exp(−z))

  • A "squashing" function instead of a threshold at the output.

– The sigmoid "activation" replaces the threshold.

  • Activation: the function that acts on the weighted combination of inputs (and threshold).

SLIDE 11

Sigmoid neurons

  • These give a real-valued output that is a smooth and bounded function of their total input.
  • Typically they use the logistic function.

– It has nice derivatives.

SLIDE 12

Other "activations"

  • The activation does not always have to be a squashing function.

– We will hear more about activations later.

  • We will continue to assume a "threshold" activation in this lecture.

Examples: tanh(z), the softplus log(1 + e^z), and the sigmoid 1 / (1 + exp(−z)).
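The activations listed above can be written directly (a sketch; the function names are mine):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)); squashes to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    """log(1 + e^z): smooth, non-squashing, unbounded above."""
    return np.log1p(np.exp(z))

z = np.array([-2.0, 0.0, 2.0])
print(np.tanh(z))   # squashes to (-1, 1); 0 at z = 0
print(sigmoid(z))   # squashes to (0, 1); 0.5 at z = 0
print(softplus(z))  # near 0 for very negative z, near z for large positive z
```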

SLIDE 13

Perceptron

z = Σ_i w_i x_i + b

  • Learn this function

– A step function across a hyperplane

SLIDE 14

Learning the perceptron

  • Given a number of input-output pairs, learn the weights and bias.

– Learn W = [w_1, …, w_N] and b, given several (X, z) pairs, where

z = 1 if Σ_i w_i x_i + b ≥ 0, else 0

SLIDE 15

Restating the perceptron

  • Restating the perceptron equation by adding another dimension to X:

z = 1 if Σ_{i=1}^{d+1} w_i x_i ≥ 0, else 0

where x_{d+1} = 1 and w_{d+1} = −T.

  • Note that the boundary Σ_{i=1}^{d+1} w_i x_i = 0 is now a hyperplane through the origin.

SLIDE 16

The Perceptron Problem

  • Find the hyperplane Σ_{i=1}^{d+1} w_i x_i = 0 that perfectly separates the two groups of points.

SLIDE 17

Perceptron Algorithm: Summary

  • Cycle through the training instances.
  • Only update w on misclassified instances.
  • If an instance is misclassified:

– If the instance is positive class: w = w + x
– If the instance is negative class: w = w − x
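The summary above translates almost line-for-line into code. A sketch, assuming NumPy, with the bias absorbed as a trailing weight (the helper name and toy data are mine):

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Cycle through the training instances; update w only on misclassified
    ones: w += x for the positive class, w -= x for the negative class."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:   # misclassified (sign 0 counts as wrong)
                w += yi * xi
    return w

# Linearly separable toy data; the last column of 1s absorbs the bias.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])
w = perceptron_train(X, y)
print(np.sign(X @ w))  # [-1. -1.  1.  1.]
```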

SLIDE 18

A Simple Method: The Perceptron Algorithm

  • Initialize: randomly initialize the hyperplane.

– i.e., randomly initialize the normal vector w.

  • Classification rule: sign(wᵀx)

– Vectors on the same side of the hyperplane as w are assigned the +1 class (blue), and those on the other side are assigned −1 (red).

  • The random initial plane will make mistakes.

SLIDE 19

How to learn the weights: multi-class example

This example has been adapted from Hinton's slides, "NN for Machine Learning", Coursera, 2015.

SLIDE 20

How to learn the weights: multi-class example

  • If correct: no change.
  • If wrong:

– Lower the score of the wrong answer (by subtracting the input from the weight vector of the wrong answer).
– Raise the score of the target (by adding the input to the weight vector of the target class).

This example has been adapted from Hinton's slides, "NN for Machine Learning", Coursera, 2015.

SLIDES 21–25

(These slides step through the multi-class example with the same update rule as Slide 20. Adapted from Hinton's slides, "NN for Machine Learning", Coursera, 2015.)

SLIDE 26

Single-layer networks as template matching

  • The weights for each class act as a template (sometimes also called a prototype) for that class.

– The winner is the most similar template.

  • The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes.
  • To capture all the allowable variations of a digit, we need to learn the features that it is composed of.

SLIDE 27

What binary threshold neurons cannot do

  • A binary threshold output unit cannot even tell if two single-bit features are the same!
  • A geometric view of what binary threshold neurons cannot do: the positive and negative cases cannot be separated by a plane.

SLIDE 28

Networks with hidden units

  • Networks without hidden units are very limited in the input-output mappings they can learn to model.

– More layers of linear units do not help: the result is still linear.
– Fixed output non-linearities are not enough.

  • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?

SLIDE 29

The multi-layer perceptron

  • A network of perceptrons

– Generally "layered"

SLIDE 30

Feed-forward neural networks

  • Also called the Multi-Layer Perceptron (MLP)

SLIDE 31

MLP with a single hidden layer

  • Two-layer MLP (the number of layers of adaptive weights is counted):

o_k(x) = f( Σ_{j=0}^{M} w_jk^[2] z_j )
⇒ o_k(x) = f( Σ_{j=0}^{M} w_jk^[2] g( Σ_{i=0}^{d} w_ij^[1] x_i ) )

with bias units x_0 = 1 and z_0 = 1; i = 0, …, d (inputs), j = 1, …, M (hidden units), k = 1, …, L (outputs).
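The two-layer formula above amounts to two matrix-vector products with a nonlinearity in between. A sketch with made-up shapes (all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer MLP: a sigmoid hidden layer followed by
    a linear output layer. Shapes: W1 is (M, d), W2 is (L, M)."""
    z = sigmoid(W1 @ x + b1)   # hidden activations g(W1 x + b1)
    o = W2 @ z + b2            # outputs (no squashing at the output here)
    return o

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # d = 3 inputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # M = 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # L = 2 outputs
print(mlp_forward(x, W1, b1, W2, b2).shape)     # (2,)
```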

SLIDE 32

Beyond linear models

f = W x    →    f = W^[2] g(W^[1] x)

SLIDE 33

Beyond linear models

f = W x    →    f = W^[2] g(W^[1] x)    →    f = W^[3] g(W^[2] g(W^[1] x))

SLIDE 34

The MLP learns to extract features

  • An MLP with one hidden layer is a generalized linear model:

– o_k(x) = f( Σ_{j=1}^{M} w_jk^[2] φ_j(x) )
– φ_j(x) = g( Σ_{i=0}^{d} w_ij^[1] x_i )
– The form of the nonlinearity (the basis functions φ_j) is adapted from the training data (not fixed in advance).

  • φ_j is defined by parameters that can also be adapted during training.
  • Thus, we don't need expert knowledge or time-consuming tuning of hand-crafted features.

SLIDE 35

Defining "depth"

  • What is a "deep" network?
SLIDE 36

The multi-layer perceptron

  • Inputs are real or Boolean stimuli.
  • Outputs are real or Boolean values.

– Can have multiple outputs for a single input.

  • What can this network compute?

– What kinds of input/output relationships can it model?

SLIDE 37

MLPs approximate functions

  • MLPs can compose Boolean functions.
  • MLPs can compose real-valued functions.
  • What are the limitations?
SLIDE 38

The perceptron as a Boolean gate

  • A perceptron can model any simple binary Boolean gate.

– With weights of 1 on X and Y, a threshold of 2 gives X AND Y, and a threshold of 1 gives X OR Y.

SLIDE 39

MLP as Boolean Functions

  • MLPs are universal Boolean functions.

– Any function over any number of inputs and any number of outputs.

  • But how many "layers" will they need?
SLIDE 40

How many layers for a Boolean MLP?

  • A Boolean function is just a truth table.

Truth table (showing the input combinations for which the output is 1):

X1 X2 X3 X4 X5 | Y
 0  0  1  1  0 | 1
 0  1  0  1  1 | 1
 0  1  1  0  0 | 1
 1  0  0  0  1 | 1
 1  0  1  1  1 | 1
 1  1  0  0  1 | 1
SLIDE 41

How many layers for a Boolean MLP?

  • Expressed in disjunctive normal form (one AND term per row of the truth table where the output is 1):

Y = X̄1 X̄2 X3 X4 X̄5 + X̄1 X2 X̄3 X4 X5 + X̄1 X2 X3 X̄4 X̄5 + X1 X̄2 X̄3 X̄4 X5 + X1 X̄2 X3 X4 X5 + X1 X2 X̄3 X̄4 X5
SLIDES 42–48

(These slides repeat the DNF of Slide 41 while building the network: one AND gate per term in a single hidden layer over X1 … X5, followed by an OR gate at the output.)

SLIDE 49

How many layers for a Boolean MLP?

  • Any truth table can be expressed in this manner!
  • A one-hidden-layer MLP is a universal Boolean function.
  • But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function?
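The DNF construction can be checked in code: one hidden threshold unit (an AND gate) per row of the truth table whose output is 1, and an OR unit on top. A sketch (the helper name and weight encoding are mine):

```python
import numpy as np

def dnf_mlp(minterms, x):
    """One-hidden-layer threshold MLP implementing a truth table.
    Each hidden unit is an AND gate for one minterm; the output is an OR."""
    hidden = []
    for row in minterms:
        # AND gate: weight +1 for a positive literal, -1 for a negated one.
        # w @ x equals the count of positive literals only when x matches row.
        w = np.where(np.array(row) == 1, 1, -1)
        T = sum(row)
        hidden.append(1 if w @ x >= T else 0)
    return 1 if sum(hidden) >= 1 else 0   # OR gate over the hidden layer

# The six rows of the slide's truth table where Y = 1.
minterms = [(0, 0, 1, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0),
            (1, 0, 0, 0, 1), (1, 0, 1, 1, 1), (1, 1, 0, 0, 1)]
print(dnf_mlp(minterms, np.array([0, 0, 1, 1, 0])))  # 1 (a listed row)
print(dnf_mlp(minterms, np.array([0, 0, 0, 0, 0])))  # 0
```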

SLIDE 50

MLPs as universal approximators

SLIDE 51

MLP as a continuous-valued regression

  • A simple 3-unit MLP with a "summing" output unit can generate a "square pulse" over an input.

– The output f(x) is 1 only if the input lies between T1 and T2.
– T1 and T2 can be arbitrarily specified.
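A sketch of the square-pulse construction, with two threshold units combined at the summing output with weights +1 and −1 (the function names are mine):

```python
def step(z):
    """Threshold unit on a scalar pre-activation."""
    return 1 if z >= 0 else 0

def square_pulse(x, T1, T2):
    """Unit A fires for x >= T1, unit B for x >= T2 (with T1 < T2);
    the summing output A - B is 1 exactly when T1 <= x < T2."""
    return step(x - T1) - step(x - T2)

print(square_pulse(1.5, 1.0, 2.0))  # 1 (inside the pulse)
print(square_pulse(2.5, 1.0, 2.0))  # 0 (outside)
```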

SLIDE 52

MLP as a continuous-valued regression

  • A simple 3-unit MLP can generate a "square pulse" over an input.
  • An MLP with many units can model an arbitrary function over an input, to arbitrary precision.

– Simply make the individual pulses narrower.

  • A one-layer MLP can model an arbitrary function of a single input.
SLIDE 53

Summary

  • MLPs are universal Boolean functions.
  • MLPs are universal classifiers.
  • MLPs are universal function approximators.
  • A single-layer MLP can approximate anything to arbitrary precision.

– But it could be exponentially or even infinitely wide in its input size.

  • Deeper MLPs can achieve the same precision with far fewer neurons.

– Deeper networks are more expressive.

SLIDE 54

Learning problem

  • Given: the architecture of the network.
  • Training data: a set of input-output pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)).
  • We want to find the function f on the input space that produces the output.

– We consider the neural network as a parametric function f(x; W).

SLIDE 55

What is f()? Typical network

  • We assume a "layered" network for simplicity.
  • Generic terminology:

– We will refer to the inputs as the input units.

  • No neurons here: the "input units" are just the inputs.

– We refer to the outputs as the output units.
– Intermediate units are "hidden" units.

SLIDE 56

What we learn: the parameters of the network

  • Given: the architecture of the network.
  • The parameters of the network: the weights and biases.

– The weights associated with the blue arrows in the picture.

  • Learning the network: determining the values of these parameters such that the network computes the desired function.

SLIDE 57

Problem setup

  • Given: the architecture of the network.
  • Training data: a set of input-output pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)).
  • We want to find the function f.

– We consider the neural network as a parametric function f(x; W).

  • We need a loss function that penalizes the obtained output f(x; W) when the desired output is y:

E = (1/N) Σ_t loss( f(x^(t); W), y^(t) )

SLIDE 58

Training an MLP

  • We define a differentiable loss or divergence between the output of the network and the desired output for the training instances.

– And a total error, which is the average divergence over all training instances.

  • We optimize the network parameters to minimize this error.

SLIDE 59

Representing the output

  • If the desired output is real-valued, no special tricks are necessary.

– Scalar output: a single output neuron, o = scalar (real value).
– Vector output: as many output neurons as the dimension of the desired output, o = [o_1, o_2, …, o_L] (vector of real values).
SLIDE 60

Examples of loss functions

  • For real-valued output vectors, the (scaled) L2 divergence is popular:

Err(y, o) = ½ ‖y − o‖² = ½ Σ_k (y_k − o_k)²

– Squared Euclidean distance between the true and the desired output.
– Note: this is differentiable:

dErr/do_k = −(y_k − o_k) = o_k − y_k
∇_o Err(y, o) = [o_1 − y_1, o_2 − y_2, …, o_L − y_L]
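The L2 divergence and its gradient in code (a sketch; the names are mine):

```python
import numpy as np

def l2_loss(y, o):
    """Scaled L2 divergence between desired output y and network output o."""
    return 0.5 * np.sum((y - o) ** 2)

def l2_grad(y, o):
    """Gradient of the loss w.r.t. the output: d/do of 0.5*||y - o||^2 = o - y."""
    return o - y

y = np.array([1.0, 0.0])
o = np.array([0.8, 0.3])
print(l2_loss(y, o))  # 0.5*(0.2^2 + 0.3^2), about 0.065
print(l2_grad(y, o))  # [-0.2  0.3]
```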

SLIDE 61

Classification: activation function

  • With a threshold activation, the neuron's output is a flat function with zero derivative everywhere, except at 0, where it is non-differentiable.

– You can vary the weights a lot without changing the error.
– There is no indication of which direction to change the weights to reduce the error.

SLIDE 62

Activation function

  • A differentiable activation makes the neuron differentiable, with non-zero derivatives over much of the input space.

– Small changes in a weight can result in non-negligible changes in the output.
– This enables us to estimate the parameters using gradient descent techniques.

SLIDE 63

Differentiable Activation

  • This particular one has a nice interpretation.
SLIDE 64

For a binary classifier: logistic regression

  • For a binary classifier with scalar output o ∈ (0, 1), label y ∈ {0, 1}, and o = σ(z), the cross-entropy between the probability distribution [o, 1 − o] and the ideal output probability [y, 1 − y] is popular:

L(y, o) = −y log(o) − (1 − y) log(1 − o)

  • Derivative:

dL(y, o)/do = −1/o if y = 1,  and  1/(1 − o) if y = 0
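A sketch of the binary cross-entropy and its derivative (the function names are mine):

```python
import math

def bce(y, o):
    """Binary cross-entropy between label y in {0,1} and predicted probability o."""
    return -y * math.log(o) - (1 - y) * math.log(1 - o)

def bce_grad(y, o):
    """Derivative w.r.t. o: -1/o when y = 1, +1/(1-o) when y = 0."""
    return -1.0 / o if y == 1 else 1.0 / (1.0 - o)

print(round(bce(1, 0.9), 4))  # 0.1054: small loss for a confident correct guess
print(round(bce(1, 0.1), 4))  # 2.3026: large loss for a confident wrong guess
print(bce_grad(1, 0.5))       # -2.0
```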

SLIDE 65

Choosing the cost function: examples

  • Regression problem: SSE

E = Σ_t E_t
E_t = ½ (o^(t) − y^(t))²   (one-dimensional output)
E_t = ½ ‖o^(t) − y^(t)‖² = ½ Σ_k (o_k^(t) − y_k^(t))²   (multi-dimensional output)

  • Classification problem: cross-entropy

– Binary classification: loss_t = −y^(t) log o^(t) − (1 − y^(t)) log(1 − o^(t)), where the output layer uses a sigmoid activation function.

SLIDE 66

Multi-class output: one-hot representations

  • Consider a network that must distinguish whether an input is a cat, a dog, a camel, a hat, or a flower.
  • For inputs of each of the five classes, the desired output is:

cat: [1 0 0 0 0]ᵀ  dog: [0 1 0 0 0]ᵀ  camel: [0 0 1 0 0]ᵀ  hat: [0 0 0 1 0]ᵀ  flower: [0 0 0 0 1]ᵀ

  • For an input of any class, we will have a five-dimensional vector output with four zeros and a single 1 at the position of the class.
  • This is a one-hot vector.

SLIDE 67

Multi-class networks

  • For a multi-class classifier with N classes, the one-hot representation will have N binary outputs.

– An N-dimensional binary vector.

  • The neural network's output too must ideally be binary (N−1 zeros and a single 1 in the right place).
  • More realistically, it will be a probability vector.

– N probability values that sum to 1.

SLIDE 68

Vector activation example: Softmax

  • Example: the softmax vector activation, o_i = exp(z_i) / Σ_j exp(z_j).

– The parameters are the weights and biases.

SLIDE 69

Vector Activations

  • We can also have neurons with multiple coupled outputs:

[o_1, o_2, …, o_M] = f(x_1, x_2, …, x_d; W)

– The function f(·) operates on a set of inputs to produce a set of outputs.
– Modifying a single parameter in W will affect all outputs.

SLIDE 70

Multi-class classification: output

  • Softmax vector activation is often used at the output of multi-class classifier nets:

z_i = Σ_j w_ji a_j   (a_j: outputs of the previous layer)
o_i = exp(z_i) / Σ_j exp(z_j)

  • This can be viewed as the probability o_i = P(class = i | x).
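A sketch of the softmax activation; subtracting max(z) before exponentiating is a standard numerical-stability trick that is not on the slide:

```python
import numpy as np

def softmax(z):
    """Softmax over a score vector; shift by max(z) to avoid overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)        # probabilities, largest for the largest score
print(p.sum())  # 1.0
```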
SLIDE 71

For multi-class classification

  • The desired output y is a one-hot vector [0 0 … 1 … 0 0 0] with the 1 in the c-th position (for class c).
  • The actual output will be a probability distribution [o_1, o_2, …, o_L].
  • The cross-entropy between the desired one-hot output and the actual output:

L(y, o) = −Σ_i y_i log o_i = −log o_c

  • Derivative:

dL(y, o)/do_i = −1/o_c for the c-th component, 0 for the remaining components
∇_o L(y, o) = [0 0 … −1/o_c … 0 0]

The slope is negative w.r.t. o_c, which indicates that increasing o_c will reduce the divergence.

SLIDE 72

For multi-class classification

  • The cross-entropy and its derivative are as on the previous slide.
  • Note: even when y = o, the derivative is not 0.

SLIDE 73

For multi-class classification

  • It is sometimes useful to set the target output to [ε ε … 1 − (K−1)ε … ε ε], with the value 1 − (K−1)ε in the c-th position (for class c) and ε elsewhere, for some small ε.
  • The cross-entropy remains:

L(y, o) = −Σ_i y_i log o_i

  • Derivative:

dL(y, o)/do_i = −(1 − (K−1)ε)/o_c for the c-th component, and −ε/o_i for the remaining components.

SLIDE 74

Choosing the cost function: examples

  • Regression problem: SSE

E = Σ_t E_t, with E_t = ½ (o^(t) − y^(t))² for one-dimensional output, and E_t = ½ ‖o^(t) − y^(t)‖² for multi-dimensional output.

  • Classification problem: cross-entropy

– Binary classification: loss_t = −y^(t) log o^(t) − (1 − y^(t)) log(1 − o^(t)), with a sigmoid output o = 1/(1 + e^(−z)).
– Multi-class classification: loss_t = −log o_c(t), where the output is found by a softmax layer, o_i = exp(z_i) / Σ_j exp(z_j).

SLIDE 75

Problem setup

  • Given: the architecture of the network.
  • Training data: a set of input-output pairs (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)).
  • We need a loss function that penalizes the obtained output o = f(x; W) when the desired output is y:

E(W) = (1/N) Σ_t loss(o^(t), y^(t)) = (1/N) Σ_t loss( f(x^(t); W), y^(t) )

  • Minimize E w.r.t. W, which contains the weights w_ij^[l] and biases b_j^[l].

SLIDE 76

How to adjust the weights for multi-layer networks?

  • We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?

– We need an efficient way of adapting all the weights, not just the last layer.
– Learning the weights going into hidden units is equivalent to learning features.
– This is difficult because nobody is telling us directly what the hidden units should do.

SLIDE 77

Find the weights by optimizing the cost

  • Start from random weights, then adjust them iteratively to get a lower cost.
  • Update the weights according to the gradient of the cost function.

Source: http://3b1b.co

SLIDE 78

How does the network learn?

  • Which changes to the weights improve the cost the most?
  • The magnitude of each element of the gradient ∇E shows how sensitive the cost is to that weight or bias.

Source: http://3b1b.co

SLIDE 79

Training multi-layer networks

  • Back-propagation

– The training algorithm used to adjust the weights in multi-layer networks (based on the training data).
– The back-propagation algorithm is based on gradient descent.
– It uses the chain rule and dynamic programming to compute gradients efficiently.

SLIDE 80

Training Neural Nets through Gradient Descent

Total training error: E = Σ_t loss(o^(t), y^(t))

  • Gradient descent algorithm:
  • Initialize all weights and biases w_ij^[l].

– Using the extended notation: the bias is also represented as a weight.

  • Do:

– For every layer l, for all i, j, update:

w_ij^[l] = w_ij^[l] − η · dE/dw_ij^[l]

  • Until E has converged.
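The loop above, sketched on a toy one-neuron problem. The gradient is taken by finite differences just to keep the example self-contained; backprop computes the same quantity far more efficiently (all names and the toy cost are mine):

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Finite-difference estimate of dE/dw at w (slow but simple)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (E(w + d) - E(w - d)) / (2 * eps)
    return g

# Toy cost: fit a linear neuron o = w.x to one target, E = 0.5*(o - y)^2.
x, y = np.array([1.0, 2.0]), 3.0
E = lambda w: 0.5 * (w @ x - y) ** 2

w = np.zeros(2)
for _ in range(200):                      # "do until E has converged"
    w -= 0.1 * numerical_gradient(E, w)   # w = w - eta * dE/dw
print(E(w) < 1e-6)  # True: the cost has been driven to (nearly) zero
```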

SLIDE 81

The derivative

  • Computing the derivative:

Total training error: E = Σ_t loss(o^(t), y^(t))
Total derivative: dE/dw_ij^[l] = Σ_t d loss(o^(t), y^(t)) / dw_ij^[l]

SLIDE 82

Training by gradient descent

  • Initialize all weights w_ij^[l].
  • Do:

– For all i, j, l, initialize dE/dw_ij^[l] = 0.
– For all t = 1 … N:

  • For every layer l, for all i, j: compute d loss(o^(t), y^(t))/dw_ij^[l] and accumulate dE/dw_ij^[l] += d loss(o^(t), y^(t))/dw_ij^[l]

– For every layer l, for all i, j, update:

w_ij^[l] = w_ij^[l] − (η/N) · dE/dw_ij^[l]

SLIDE 83

The derivative

  • So we must first figure out how to compute the derivative of the divergences of individual training inputs.

Total training error: E = Σ_t loss(o^(t), y^(t))
Total derivative: dE/dw_ij^[l] = Σ_t d loss(o^(t), y^(t)) / dw_ij^[l]

SLIDE 84

Calculus Refresher: Basic rules of calculus

  • For any differentiable function y(x) with derivative dy/dx, the following must hold for sufficiently small Δx:

Δy ≈ (dy/dx) · Δx

  • For any differentiable function y(x_1, x_2, …, x_M) with partial derivatives ∂y/∂x_1, ∂y/∂x_2, …, ∂y/∂x_M, the following must hold for sufficiently small Δx_1, …, Δx_M:

Δy ≈ (∂y/∂x_1) Δx_1 + (∂y/∂x_2) Δx_2 + … + (∂y/∂x_M) Δx_M

SLIDE 85

Calculus Refresher: Chain rule

  • For any nested function y = f(g(x)): dy/dx = (df/dg) · (dg/dx).

SLIDE 86

Simple chain rule

  • z = f(g(x)) with y = g(x) ⇒ dz/dx = (dz/dy) · (dy/dx)

SLIDE 87

Multiple paths chain rule

  • If y depends on x through several intermediate variables g_1(x), …, g_M(x), the derivatives along each path add up: dy/dx = Σ_i (∂y/∂g_i) · (dg_i/dx).

SLIDE 88

Returning to our problem

  • How to compute d loss(o, y) / dw_ij^[l] ?

SLIDE 89

Backpropagation: Notation

  • a^[0] ← input
  • output ← a^[L]
  • Each layer applies g(·): z^[l] = W^[l] a^[l−1], a^[l] = g(z^[l]).

SLIDE 90

Output as a composite function

Output = a^[L] = g(z^[L]) = g(W^[L] a^[L−1]) = g(W^[L] g(W^[L−1] a^[L−2])) = g(W^[L] g(W^[L−1] … g(W^[2] g(W^[1] x))))

For convenience, we use the same activation function for all layers. However, output-layer neurons most commonly do not need an activation function (they show class scores or real-valued targets).

SLIDE 91

Special case: affine functions

  • A matrix W^[l] and bias b^[l] operate on the vector a^[l−1] to produce the vector z^[l]:

z^[l] = W^[l] a^[l−1] + b^[l],  with ∂z^[l]/∂a^[l−1] = W^[l]

SLIDE 92

Backward-pass vector

  • Assume we have ∂loss/∂a^[l]. Then:

∂loss/∂z^[l] = ∂loss/∂a^[l] · ∂a^[l]/∂z^[l] = ∂loss/∂a^[l] ⊙ g′(z^[l])
∂loss/∂a^[l−1] = (W^[l])ᵀ ∂loss/∂z^[l]
∂loss/∂W^[l] = ∂loss/∂z^[l] (a^[l−1])ᵀ
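The three recursions above are enough to implement backpropagation for a small network. A sketch for a two-layer sigmoid net with L2 loss, checked against a finite difference (the names, shapes, and loss choice are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, W2):
    """One forward/backward pass implementing the slide's recursions:
    dz = da * g'(z), da_prev = W.T @ dz, dW = outer(dz, a_prev)."""
    # Forward
    z1 = W1 @ x
    a1 = sigmoid(z1)
    z2 = W2 @ a1
    a2 = sigmoid(z2)
    loss = 0.5 * np.sum((a2 - y) ** 2)
    # Backward
    da2 = a2 - y                # dloss/da for the L2 loss
    dz2 = da2 * a2 * (1 - a2)   # sigmoid' = a * (1 - a)
    dW2 = np.outer(dz2, a1)
    da1 = W2.T @ dz2
    dz1 = da1 * a1 * (1 - a1)
    dW1 = np.outer(dz1, x)
    return loss, dW1, dW2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0, 0.0])
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
loss, dW1, dW2 = forward_backward(x, y, W1, W2)
# Check one weight's gradient against a finite difference.
eps = 1e-6
Wp = W1.copy()
Wp[0, 0] += eps
num = (forward_backward(x, y, Wp, W2)[0] - loss) / eps
print(abs(num - dW1[0, 0]) < 1e-4)  # True: backprop matches the numeric check
```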

SLIDE 93

How to propagate the gradients backward

  • For a node z = f(x, y): given the upstream gradient dL/dz, the chain rule gives the downstream gradients dL/dx = (dL/dz)(∂z/∂x) and dL/dy = (dL/dz)(∂z/∂y).

SLIDE 94

How to propagate the gradients backward (continued)

SLIDE 95

Patterns in backward flow

  • add gate: gradient distributor (passes the upstream gradient unchanged to both inputs)
  • max gate: gradient router (routes the full upstream gradient to the larger input, zero to the other)
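These two patterns in code (a sketch; the function names are mine):

```python
def add_gate_backward(dz):
    """Add gate: distributes the upstream gradient to both inputs unchanged."""
    return dz, dz

def max_gate_backward(x, y, dz):
    """Max gate: routes the upstream gradient to whichever input was larger."""
    return (dz, 0.0) if x >= y else (0.0, dz)

print(add_gate_backward(2.0))            # (2.0, 2.0)
print(max_gate_backward(3.0, 1.0, 2.0))  # (2.0, 0.0), all gradient goes to x
```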

SLIDE 96

Modularized implementation: forward / backward API

SLIDE 97

Modularized implementation: forward / backward API (continued)
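In the modularized style, each node caches its inputs in `forward` and converts the upstream gradient into input gradients in `backward`. A sketch of a multiply gate in that API (the class name follows cs231n's example; the details are mine):

```python
class MultiplyGate:
    """A computation-graph node with a forward/backward API."""
    def forward(self, x, y):
        self.x, self.y = x, y   # cache inputs for the backward pass
        return x * y
    def backward(self, dz):
        dx = self.y * dz        # local gradient dz/dx = y
        dy = self.x * dz        # local gradient dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
dx, dy = gate.backward(1.0)
print(z, dx, dy)  # -12.0 -4.0 3.0
```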

SLIDE 98

Mini-batch SGD

  • Loop:

1. Sample a batch of data.
2. Forward-prop it through the graph (network), get the loss.
3. Backprop to calculate the gradients.
4. Update the parameters using the gradient.
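The four-step loop above, sketched on a toy problem whose optimum is the data mean (all names and the toy objective are mine):

```python
import numpy as np

def sgd(grad_fn, w, data, lr=0.1, batch_size=2, steps=100, seed=0):
    """Mini-batch SGD: sample a batch, get its gradient, update the parameter."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        w = w - lr * grad_fn(w, batch)   # step 4: parameter update
    return w

# Toy objective: average of 0.5*(w - x)^2 over the data; its gradient on a
# batch is mean(w - x), and the optimum is the mean of the data.
data = np.array([1.0, 2.0, 3.0, 4.0])
grad_fn = lambda w, batch: np.mean(w - batch)
w = sgd(grad_fn, 0.0, data)
print(abs(w - 2.5) < 0.5)  # True: w hovers near the data mean 2.5
```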

SLIDE 99

Converting error derivatives into a learning procedure

  • The backpropagation algorithm is an efficient way of computing the gradient of the error function w.r.t. the weights and biases.
  • There are many other decisions to be made to turn these derivatives into a learning procedure:

– Convergence or optimization issues: how do we use the error derivatives?
– Generalization issues: how can we improve its decisions on unseen data?

SLIDE 100

Resources

  • Please see the following notes:

– http://cs231n.stanford.edu/handouts/derivatives.pdf
– http://cs231n.stanford.edu/handouts/linear-backprop.pdf
– http://cs231n.github.io/optimization-2/