CSC 411: Lecture 10: Neural Networks I


slide-1
SLIDE 1

CSC 411: Lecture 10: Neural Networks I

Class based on Raquel Urtasun & Rich Zemel’s lectures

Sanja Fidler

University of Toronto

Feb 10, 2016

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 1 / 62

slide-2
SLIDE 2

Today

Multi-layer Perceptron
Forward propagation
Backward propagation

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 2 / 62

slide-3
SLIDE 3

Motivating Examples

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 3 / 62

slide-4
SLIDE 4

Are You Excited about Deep Learning?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 4 / 62

slide-5
SLIDE 5

Limitations of Linear Classifiers

Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of features x_i

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 5 / 62

slide-6
SLIDE 6

Limitations of Linear Classifiers

Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of features x_i

Many decisions involve non-linear functions of the input

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 5 / 62

slide-7
SLIDE 7

Limitations of Linear Classifiers

Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of features x_i

Many decisions involve non-linear functions of the input

Canonical example: do 2 input elements have the same value?

Figure: the four points (0,0), (0,1), (1,0), (1,1), labeled output = 1 when the two inputs are equal and output = 0 otherwise

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 5 / 62

slide-8
SLIDE 8

Limitations of Linear Classifiers

Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of features x_i

Many decisions involve non-linear functions of the input

Canonical example: do 2 input elements have the same value?

Figure: the four points (0,0), (0,1), (1,0), (1,1), labeled output = 1 when the two inputs are equal and output = 0 otherwise

The positive and negative cases cannot be separated by a plane

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 5 / 62

slide-9
SLIDE 9

Limitations of Linear Classifiers

Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of features x_i

Many decisions involve non-linear functions of the input

Canonical example: do 2 input elements have the same value?

Figure: the four points (0,0), (0,1), (1,0), (1,1), labeled output = 1 when the two inputs are equal and output = 0 otherwise

The positive and negative cases cannot be separated by a plane What can we do?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 5 / 62

slide-10
SLIDE 10

How to Construct Nonlinear Classifiers?

We would like to construct non-linear discriminative classifiers that utilize functions of input variables

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 6 / 62

slide-11
SLIDE 11

How to Construct Nonlinear Classifiers?

We would like to construct non-linear discriminative classifiers that utilize functions of input variables

Use a large number of simpler functions

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 6 / 62

slide-12
SLIDE 12

How to Construct Nonlinear Classifiers?

We would like to construct non-linear discriminative classifiers that utilize functions of input variables

Use a large number of simpler functions

◮ If these functions are fixed (Gaussian, sigmoid, polynomial basis functions), then optimization still involves linear combinations of (fixed functions of) the inputs

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 6 / 62

slide-13
SLIDE 13

How to Construct Nonlinear Classifiers?

We would like to construct non-linear discriminative classifiers that utilize functions of input variables

Use a large number of simpler functions

◮ If these functions are fixed (Gaussian, sigmoid, polynomial basis functions), then optimization still involves linear combinations of (fixed functions of) the inputs
◮ Or we can make these functions depend on additional parameters → we need an efficient method of training the extra parameters

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 6 / 62

slide-14
SLIDE 14

Inspiration: The Brain

Many machine learning methods are inspired by biology, e.g., the (human) brain

Our brain has ∼ 10^11 neurons, each of which communicates with (is connected to) ∼ 10^4 other neurons

Figure: The basic computational unit of the brain: the neuron

[Pic credit: http://cs231n.github.io/neural-networks-1/]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 7 / 62

slide-15
SLIDE 15

Mathematical Model of a Neuron

Neural networks define functions of the inputs (hidden features), computed by neurons

Artificial neurons are called units

Figure: A mathematical model of the neuron in a neural network

[Pic credit: http://cs231n.github.io/neural-networks-1/]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 8 / 62

slide-16
SLIDE 16

Activation Functions

Most commonly used activation functions:

Sigmoid: σ(z) = 1 / (1 + exp(−z))

Tanh: tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))

ReLU (Rectified Linear Unit): ReLU(z) = max(0, z)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 9 / 62

slide-17
SLIDE 17

Neuron in Python

Example in Python of a neuron with a sigmoid activation function

Figure: Example code for computing the activation of a single neuron

[http://cs231n.github.io/neural-networks-1/]
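The code image from the original slide is not reproduced in this transcript. A minimal sketch of such a neuron in NumPy (class name and weight initialization are illustrative, in the spirit of the cs231n example):

import numpy as np

class Neuron:
    # a single unit: weighted sum of inputs, then a sigmoid activation
    def __init__(self, n_inputs):
        self.weights = 0.01 * np.random.randn(n_inputs)   # illustrative initialization
        self.bias = 0.0

    def forward(self, x):
        z = np.dot(self.weights, x) + self.bias           # linear combination of inputs
        return 1.0 / (1.0 + np.exp(-z))                   # sigmoid activation

# usage: activation of one neuron on a 3-dimensional input
neuron = Neuron(3)
print(neuron.forward(np.array([1.0, -2.0, 0.5])))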

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 10 / 62

slide-18
SLIDE 18

Neural Network Architecture (Multi-Layer Perceptron)

Network with one layer of four hidden units:

output units

input units

Figure: Two different visualizations of a 2-layer neural network. In this example: 3 input units, 4 hidden units and 2 output units

Each unit computes its value based on linear combination of values of units that point into it, and an activation function

[http://cs231n.github.io/neural-networks-1/] Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 11 / 62

slide-19
SLIDE 19

Neural Network Architecture (Multi-Layer Perceptron)

Network with one layer of four hidden units:

output units

input units

Figure: Two different visualizations of a 2-layer neural network. In this example: 3 input units, 4 hidden units and 2 output units

Naming conventions: a 2-layer neural network has

◮ One layer of hidden units
◮ One output layer

(we do not count the inputs as a layer)

[http://cs231n.github.io/neural-networks-1/] Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 12 / 62

slide-20
SLIDE 20

Neural Network Architecture (Multi-Layer Perceptron)

Going deeper: a 3-layer neural network with two layers of hidden units

Figure: A 3-layer neural net with 3 input units, 4 hidden units in each of the first and second hidden layers, and 1 output unit

Naming conventions: an N-layer neural network has

◮ N − 1 layers of hidden units
◮ One output layer

[http://cs231n.github.io/neural-networks-1/]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 13 / 62

slide-21
SLIDE 21

Representational Power

Neural network with at least one hidden layer is a universal approximator (can represent any function).

Proof in: Approximation by Superpositions of Sigmoidal Function, Cybenko, paper

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 14 / 62

slide-22
SLIDE 22

Representational Power

Neural network with at least one hidden layer is a universal approximator (can represent any function).

Proof in: Approximation by Superpositions of Sigmoidal Function, Cybenko, paper

The capacity of the network increases with more hidden units and more hidden layers

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 14 / 62

slide-23
SLIDE 23

Representational Power

Neural network with at least one hidden layer is a universal approximator (can represent any function).

Proof in: Approximation by Superpositions of Sigmoidal Function, Cybenko, paper

The capacity of the network increases with more hidden units and more hidden layers

Why go deeper? Read e.g.: Do Deep Nets Really Need to be Deep? Jimmy Ba, Rich Caruana

[http://cs231n.github.io/neural-networks-1/] Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 14 / 62

slide-24
SLIDE 24

Neural Networks

We only need to know two algorithms

◮ Forward pass: performs inference
◮ Backward pass: performs learning

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 15 / 62

slide-25
SLIDE 25

Forward Pass: What does the Network Compute?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 16 / 62

slide-26
SLIDE 26

Forward Pass: What does the Network Compute?

Output of the network can be written as:

h_j(x) = f( v_j0 + Σ_{i=1}^D x_i v_ji )

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 16 / 62

slide-27
SLIDE 27

Forward Pass: What does the Network Compute?

Output of the network can be written as:

h_j(x) = f( v_j0 + Σ_{i=1}^D x_i v_ji )

o_k(x) = g( w_k0 + Σ_{j=1}^J h_j(x) w_kj )

(j indexes the hidden units, k indexes the output units, and D is the number of inputs)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 16 / 62

slide-28
SLIDE 28

Forward Pass: What does the Network Compute?

Output of the network can be written as:

h_j(x) = f( v_j0 + Σ_{i=1}^D x_i v_ji )

o_k(x) = g( w_k0 + Σ_{j=1}^J h_j(x) w_kj )

(j indexes the hidden units, k indexes the output units, and D is the number of inputs)

Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)

σ(z) = 1 / (1 + exp(−z)),   tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),   ReLU(z) = max(0, z)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 16 / 62

slide-29
SLIDE 29

Forward Pass in Python

Example code for a forward pass for a 3-layer network in Python:

Can be implemented efficiently using matrix operations

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 17 / 62

slide-30
SLIDE 30

Forward Pass in Python

Example code for a forward pass for a 3-layer network in Python:

Can be implemented efficiently using matrix operations

Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?

[http://cs231n.github.io/neural-networks-1/] Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 17 / 62
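The code image from this slide is also not reproduced. A minimal sketch of the forward pass under the stated assumptions (3 inputs, two hidden layers of 4 units with sigmoid activations, one output; all weights here are random placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))   # 4 x 3, as on the slide
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))   # 4 x 4
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))   # output layer

x = rng.standard_normal((3, 1))     # a single 3-dimensional input
h1 = sigmoid(W1 @ x + b1)           # first hidden layer
h2 = sigmoid(W2 @ h1 + b2)          # second hidden layer
out = W3 @ h2 + b3                  # output (left linear here)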

slide-31
SLIDE 31

Special Case

What is a single layer (no hiddens) network with a sigmoid act. function?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 18 / 62

slide-32
SLIDE 32

Special Case

What is a single layer (no hiddens) network with a sigmoid act. function?

Network:

o_k(x) = 1 / (1 + exp(−z_k)),   z_k = w_k0 + Σ_{j=1}^J x_j w_kj

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 18 / 62

slide-33
SLIDE 33

Special Case

What is a single layer (no hiddens) network with a sigmoid act. function?

Network:

o_k(x) = 1 / (1 + exp(−z_k)),   z_k = w_k0 + Σ_{j=1}^J x_j w_kj

Logistic regression!

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 18 / 62
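As a quick sanity check, a no-hidden-layer network with a sigmoid output is exactly logistic regression; a minimal sketch (the weight vector w and bias w0 are illustrative placeholders):

import numpy as np

def single_layer_sigmoid(x, w, w0):
    z = w0 + np.dot(w, x)               # z = w_0 + sum_j x_j w_j (one output unit)
    return 1.0 / (1.0 + np.exp(-z))     # o(x) = sigmoid(z), i.e. logistic regression

print(single_layer_sigmoid(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))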

slide-34
SLIDE 34

Example Application

Classify image of handwritten digit (32x32 pixels): 4 vs non-4

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 19 / 62

slide-35
SLIDE 35

Example Application

Classify image of handwritten digit (32x32 pixels): 4 vs non-4

How would you build your network?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 19 / 62

slide-36
SLIDE 36

Example Application

Classify image of handwritten digit (32x32 pixels): 4 vs non-4

How would you build your network?

For example, use one hidden layer and the sigmoid activation function:

o_k(x) = 1 / (1 + exp(−z_k)),   z_k = w_k0 + Σ_{j=1}^J h_j(x) w_kj

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 19 / 62

slide-37
SLIDE 37

Example Application

Classify image of handwritten digit (32x32 pixels): 4 vs non-4

How would you build your network?

For example, use one hidden layer and the sigmoid activation function:

o_k(x) = 1 / (1 + exp(−z_k)),   z_k = w_k0 + Σ_{j=1}^J h_j(x) w_kj

How can we train the network, that is, adjust all the parameters w?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 19 / 62

slide-38
SLIDE 38

Training Neural Networks

Find weights:

w* = argmin_w Σ_{n=1}^N loss(o^(n), t^(n))

where o = f(x; w) is the output of the neural network

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 20 / 62

slide-39
SLIDE 39

Training Neural Networks

Find weights:

w* = argmin_w Σ_{n=1}^N loss(o^(n), t^(n))

where o = f(x; w) is the output of the neural network

Define a loss function, e.g.:

◮ Squared loss: Σ_k (1/2) (o_k^(n) − t_k^(n))^2
◮ Cross-entropy loss: − Σ_k t_k^(n) log o_k^(n)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 20 / 62

slide-40
SLIDE 40

Training Neural Networks

Find weights:

w* = argmin_w Σ_{n=1}^N loss(o^(n), t^(n))

where o = f(x; w) is the output of the neural network

Define a loss function, e.g.:

◮ Squared loss: Σ_k (1/2) (o_k^(n) − t_k^(n))^2
◮ Cross-entropy loss: − Σ_k t_k^(n) log o_k^(n)

Gradient descent:

w^{t+1} = w^t − η ∂E/∂w^t

where η is the learning rate (and E is the error/loss)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 20 / 62
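A minimal NumPy sketch of the two losses and one gradient descent step (grad_E stands in for whatever ∂E/∂w backpropagation returns; it is not computed here):

import numpy as np

def squared_loss(o, t):
    return 0.5 * np.sum((o - t) ** 2)        # sum_k 1/2 (o_k - t_k)^2

def cross_entropy_loss(o, t):
    return -np.sum(t * np.log(o))            # -sum_k t_k log o_k

def gradient_descent_step(w, grad_E, lr=0.1):
    return w - lr * grad_E                   # w_{t+1} = w_t - eta * dE/dw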

slide-41
SLIDE 41

Useful Derivatives

name      function                                             derivative
Sigmoid   σ(z) = 1 / (1 + exp(−z))                             σ(z) · (1 − σ(z))
Tanh      tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))    1 / cosh^2(z)
ReLU      ReLU(z) = max(0, z)                                  1 if z > 0, 0 if z ≤ 0

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 21 / 62
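A minimal sketch of these activations and their derivatives in NumPy, matching the table above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # sigma(z) * (1 - sigma(z))

def d_tanh(z):
    return 1.0 / np.cosh(z) ** 2         # equivalently 1 - tanh(z)**2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)         # 1 if z > 0, 0 if z <= 0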

slide-42
SLIDE 42

Training Neural Networks: Back-propagation

Back-propagation: an efficient method for computing gradients needed to perform gradient-based optimization of the weights in a multi-layer network

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 22 / 62

slide-43
SLIDE 43

Training Neural Networks: Back-propagation

Back-propagation: an efficient method for computing gradients needed to perform gradient-based optimization of the weights in a multi-layer network

Training neural nets: Loop until convergence:

◮ for each example n
  1. Given input x^(n), propagate activity forward (x^(n) → h^(n) → o^(n)) (forward pass)
  2. Propagate gradients backward (backward pass)
  3. Update each weight (via gradient descent)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 22 / 62

slide-44
SLIDE 44

Training Neural Networks: Back-propagation

Back-propagation: an efficient method for computing gradients needed to perform gradient-based optimization of the weights in a multi-layer network

Training neural nets: Loop until convergence:

◮ for each example n
  1. Given input x^(n), propagate activity forward (x^(n) → h^(n) → o^(n)) (forward pass)
  2. Propagate gradients backward (backward pass)
  3. Update each weight (via gradient descent)

Given any error function E and activation functions g() and f(), we just need to derive gradients

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 22 / 62

slide-45
SLIDE 45

Key Idea behind Backpropagation

We don’t have targets for a hidden unit, but we can compute how fast the error changes as we change its activity

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 23 / 62

slide-46
SLIDE 46

Key Idea behind Backpropagation

We don’t have targets for a hidden unit, but we can compute how fast the error changes as we change its activity

◮ Instead of using desired activities to train the hidden units, use error

derivatives w.r.t. hidden activities

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 23 / 62

slide-47
SLIDE 47

Key Idea behind Backpropagation

We don’t have targets for a hidden unit, but we can compute how fast the error changes as we change its activity

◮ Instead of using desired activities to train the hidden units, use error

derivatives w.r.t. hidden activities

◮ Each hidden activity can affect many output units and can therefore

have many separate effects on the error. These effects must be combined

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 23 / 62

slide-48
SLIDE 48

Key Idea behind Backpropagation

We don’t have targets for a hidden unit, but we can compute how fast the error changes as we change its activity

◮ Instead of using desired activities to train the hidden units, use error

derivatives w.r.t. hidden activities

◮ Each hidden activity can affect many output units and can therefore

have many separate effects on the error. These effects must be combined

◮ We can compute error derivatives for all the hidden units efficiently Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 23 / 62

slide-49
SLIDE 49

Key Idea behind Backpropagation

We don’t have targets for a hidden unit, but we can compute how fast the error changes as we change its activity

◮ Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
◮ Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined
◮ We can compute error derivatives for all the hidden units efficiently
◮ Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 23 / 62

slide-50
SLIDE 50

Key Idea behind Backpropagation

We don’t have targets for a hidden unit, but we can compute how fast the error changes as we change its activity

◮ Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
◮ Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined
◮ We can compute error derivatives for all the hidden units efficiently
◮ Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit

This is just the chain rule!

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 23 / 62

slide-51
SLIDE 51

Computing Gradients: Single Layer Network

Let’s take a single layer network

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 24 / 62

slide-52
SLIDE 52

Computing Gradients: Single Layer Network

Let’s take a single layer network and draw it a bit differently

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 24 / 62

slide-53
SLIDE 53

Computing Gradients: Single Layer Network

Error gradients for single layer network:

∂E/∂w_ki =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 25 / 62

slide-54
SLIDE 54

Computing Gradients: Single Layer Network

Error gradients for single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 25 / 62

slide-55
SLIDE 55

Computing Gradients: Single Layer Network

Error gradients for single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki)

Error gradient is computable for any continuous activation function g(), and any continuous error function

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 25 / 62

slide-56
SLIDE 56

Computing Gradients: Single Layer Network

Error gradients for single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki),   where δ^o_k := ∂E/∂o_k

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 26 / 62

slide-57
SLIDE 57

Computing Gradients: Single Layer Network

Error gradients for single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki) = δ^o_k (∂o_k/∂z_k) (∂z_k/∂w_ki)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 27 / 62

slide-58
SLIDE 58

Computing Gradients: Single Layer Network

Error gradients for single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki) = δ^o_k · (∂o_k/∂z_k) (∂z_k/∂w_ki),   where δ^z_k := δ^o_k · ∂o_k/∂z_k

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 28 / 62

slide-59
SLIDE 59

Computing Gradients: Single Layer Network

Error gradients for single layer network:

∂E/∂w_ki = (∂E/∂o_k) (∂o_k/∂z_k) (∂z_k/∂w_ki) = δ^z_k (∂z_k/∂w_ki) = δ^z_k · x_i

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 29 / 62

slide-60
SLIDE 60

Gradient Descent for Single Layer Network

Assuming the error function is mean-squared error (MSE), on a single training example n, we have

∂E/∂o_k^(n) = o_k^(n) − t_k^(n) := δ^o_k

Using logistic activation functions:

o_k^(n) = g(z_k^(n)) = (1 + exp(−z_k^(n)))^(−1),   ∂o_k^(n)/∂z_k^(n) = o_k^(n) (1 − o_k^(n))

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 30 / 62

slide-61
SLIDE 61

Gradient Descent for Single Layer Network

Assuming the error function is mean-squared error (MSE), on a single training example n, we have

∂E/∂o_k^(n) = o_k^(n) − t_k^(n) := δ^o_k

Using logistic activation functions:

o_k^(n) = g(z_k^(n)) = (1 + exp(−z_k^(n)))^(−1),   ∂o_k^(n)/∂z_k^(n) = o_k^(n) (1 − o_k^(n))

The error gradient is then:

∂E/∂w_ki =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 30 / 62

slide-62
SLIDE 62

Gradient Descent for Single Layer Network

Assuming the error function is mean-squared error (MSE), on a single training example n, we have

∂E/∂o_k^(n) = o_k^(n) − t_k^(n) := δ^o_k

Using logistic activation functions:

o_k^(n) = g(z_k^(n)) = (1 + exp(−z_k^(n)))^(−1),   ∂o_k^(n)/∂z_k^(n) = o_k^(n) (1 − o_k^(n))

The error gradient is then:

∂E/∂w_ki = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_ki) =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 30 / 62

slide-63
SLIDE 63

Gradient Descent for Single Layer Network

Assuming the error function is mean-squared error (MSE), on a single training example n, we have

∂E/∂o_k^(n) = o_k^(n) − t_k^(n) := δ^o_k

Using logistic activation functions:

o_k^(n) = g(z_k^(n)) = (1 + exp(−z_k^(n)))^(−1),   ∂o_k^(n)/∂z_k^(n) = o_k^(n) (1 − o_k^(n))

The error gradient is then:

∂E/∂w_ki = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_ki) = Σ_{n=1}^N (o_k^(n) − t_k^(n)) o_k^(n) (1 − o_k^(n)) x_i^(n)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 30 / 62

slide-64
SLIDE 64

Gradient Descent for Single Layer Network

Assuming the error function is mean-squared error (MSE), on a single training example n, we have

∂E/∂o_k^(n) = o_k^(n) − t_k^(n) := δ^o_k

Using logistic activation functions:

o_k^(n) = g(z_k^(n)) = (1 + exp(−z_k^(n)))^(−1),   ∂o_k^(n)/∂z_k^(n) = o_k^(n) (1 − o_k^(n))

The error gradient is then:

∂E/∂w_ki = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_ki) = Σ_{n=1}^N (o_k^(n) − t_k^(n)) o_k^(n) (1 − o_k^(n)) x_i^(n)

The gradient descent update rule is given by:

w_ki ← w_ki − η ∂E/∂w_ki =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 30 / 62

slide-65
SLIDE 65

Gradient Descent for Single Layer Network

Assuming the error function is mean-squared error (MSE), on a single training example n, we have

∂E/∂o_k^(n) = o_k^(n) − t_k^(n) := δ^o_k

Using logistic activation functions:

o_k^(n) = g(z_k^(n)) = (1 + exp(−z_k^(n)))^(−1),   ∂o_k^(n)/∂z_k^(n) = o_k^(n) (1 − o_k^(n))

The error gradient is then:

∂E/∂w_ki = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_ki) = Σ_{n=1}^N (o_k^(n) − t_k^(n)) o_k^(n) (1 − o_k^(n)) x_i^(n)

The gradient descent update rule is given by:

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N (o_k^(n) − t_k^(n)) o_k^(n) (1 − o_k^(n)) x_i^(n)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 30 / 62
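A minimal NumPy sketch of this batch update for a single-layer sigmoid network (X is N×D with a leading column of ones for the bias, T is N×K of targets; the names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_update(W, X, T, lr=0.1):
    O = sigmoid(X @ W)                    # N x K outputs o_k^(n)
    delta_z = (O - T) * O * (1.0 - O)     # (o - t) * o * (1 - o) per example and output
    grad = X.T @ delta_z                  # dE/dW, summed over the N examples
    return W - lr * grad                  # gradient descent update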

slide-66
SLIDE 66

Multi-layer Neural Network

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 31 / 62

slide-67
SLIDE 67

Back-propagation: Sketch on One Training Case

Convert discrepancy between each output and its target value into an error derivative:

E = (1/2) Σ_k (o_k − t_k)^2;   ∂E/∂o_k = o_k − t_k

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 32 / 62

slide-68
SLIDE 68

Back-propagation: Sketch on One Training Case

Convert discrepancy between each output and its target value into an error derivative:

E = (1/2) Σ_k (o_k − t_k)^2;   ∂E/∂o_k = o_k − t_k

Compute error derivatives in each hidden layer from error derivatives in the layer above. [Assign blame for error at k to each unit j according to its influence on k (depends on w_kj)]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 32 / 62

slide-69
SLIDE 69

Back-propagation: Sketch on One Training Case

Convert discrepancy between each output and its target value into an error derivative:

E = (1/2) Σ_k (o_k − t_k)^2;   ∂E/∂o_k = o_k − t_k

Compute error derivatives in each hidden layer from error derivatives in the layer above. [Assign blame for error at k to each unit j according to its influence on k (depends on w_kj)]

Use error derivatives w.r.t. activities to get error derivatives w.r.t. the weights.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 32 / 62

slide-70
SLIDE 70

Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network

∂E/∂w_kj = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^N δ^{z,(n)}_k h_j^(n)

where δ_k is the error w.r.t. the net input for unit k

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 33 / 62

slide-71
SLIDE 71

Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network

∂E/∂w_kj = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^N δ^{z,(n)}_k h_j^(n)

where δ_k is the error w.r.t. the net input for unit k

Hidden weight gradients are then computed via back-prop:

∂E/∂h_j^(n) =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 33 / 62

slide-72
SLIDE 72

Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network

∂E/∂w_kj = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^N δ^{z,(n)}_k h_j^(n)

where δ_k is the error w.r.t. the net input for unit k

Hidden weight gradients are then computed via back-prop:

∂E/∂h_j^(n) = Σ_k (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂h_j^(n)) =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 34 / 62

slide-73
SLIDE 73

Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network

∂E/∂w_kj = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^N δ^{z,(n)}_k h_j^(n)

where δ_k is the error w.r.t. the net input for unit k

Hidden weight gradients are then computed via back-prop:

∂E/∂h_j^(n) = Σ_k (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂h_j^(n)) = Σ_k δ^{z,(n)}_k w_kj := δ^{h,(n)}_j

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 34 / 62

slide-74
SLIDE 74

Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network

∂E/∂w_kj = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^N δ^{z,(n)}_k h_j^(n)

where δ_k is the error w.r.t. the net input for unit k

Hidden weight gradients are then computed via back-prop:

∂E/∂h_j^(n) = Σ_k (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂h_j^(n)) = Σ_k δ^{z,(n)}_k w_kj := δ^{h,(n)}_j

∂E/∂v_ji = Σ_{n=1}^N (∂E/∂h_j^(n)) (∂h_j^(n)/∂u_j^(n)) (∂u_j^(n)/∂v_ji) =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 35 / 62

slide-75
SLIDE 75

Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network

∂E/∂w_kj = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^N δ^{z,(n)}_k h_j^(n)

where δ_k is the error w.r.t. the net input for unit k

Hidden weight gradients are then computed via back-prop:

∂E/∂h_j^(n) = Σ_k (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂h_j^(n)) = Σ_k δ^{z,(n)}_k w_kj := δ^{h,(n)}_j

∂E/∂v_ji = Σ_{n=1}^N (∂E/∂h_j^(n)) (∂h_j^(n)/∂u_j^(n)) (∂u_j^(n)/∂v_ji) = Σ_{n=1}^N δ^{h,(n)}_j f′(u_j^(n)) (∂u_j^(n)/∂v_ji) =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 35 / 62

slide-76
SLIDE 76

Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network

∂E/∂w_kj = Σ_{n=1}^N (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂w_kj) = Σ_{n=1}^N δ^{z,(n)}_k h_j^(n)

where δ_k is the error w.r.t. the net input for unit k

Hidden weight gradients are then computed via back-prop:

∂E/∂h_j^(n) = Σ_k (∂E/∂o_k^(n)) (∂o_k^(n)/∂z_k^(n)) (∂z_k^(n)/∂h_j^(n)) = Σ_k δ^{z,(n)}_k w_kj := δ^{h,(n)}_j

∂E/∂v_ji = Σ_{n=1}^N (∂E/∂h_j^(n)) (∂h_j^(n)/∂u_j^(n)) (∂u_j^(n)/∂v_ji) = Σ_{n=1}^N δ^{h,(n)}_j f′(u_j^(n)) (∂u_j^(n)/∂v_ji) = Σ_{n=1}^N δ^{u,(n)}_j x_i^(n)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 36 / 62
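A minimal NumPy sketch of these gradients for one hidden layer with sigmoid activations and squared error (X is N×D, T is N×K; V holds the hidden weights and W the output weights; biases are omitted to keep the sketch short):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, T, V, W):
    U = X @ V                            # net inputs u_j to hidden units
    H = sigmoid(U)                       # hidden activities h_j
    Z = H @ W                            # net inputs z_k to output units
    O = sigmoid(Z)                       # outputs o_k
    delta_z = (O - T) * O * (1 - O)      # delta at the output net inputs
    dW = H.T @ delta_z                   # dE/dW_kj, summed over examples
    delta_h = delta_z @ W.T              # dE/dh_j, blame propagated back
    delta_u = delta_h * H * (1 - H)      # delta at the hidden net inputs (f'(u) = h(1 - h))
    dV = X.T @ delta_u                   # dE/dV_ji, summed over examples
    return dW, dV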

slide-77
SLIDE 77

Choosing Activation and Loss Functions

When using a neural network for regression, sigmoid activation and MSE as the loss function work well

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 37 / 62

slide-78
SLIDE 78

Choosing Activation and Loss Functions

When using a neural network for regression, sigmoid activation and MSE as the loss function work well

For classification, if it is a binary (2-class) problem, then the cross-entropy error function often does better (as we saw with logistic regression):

E = − Σ_{n=1}^N [ t^(n) log o^(n) + (1 − t^(n)) log(1 − o^(n)) ]

o^(n) = (1 + exp(−z^(n)))^(−1)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 37 / 62

slide-79
SLIDE 79

Choosing Activation and Loss Functions

When using a neural network for regression, sigmoid activation and MSE as the loss function work well

For classification, if it is a binary (2-class) problem, then the cross-entropy error function often does better (as we saw with logistic regression):

E = − Σ_{n=1}^N [ t^(n) log o^(n) + (1 − t^(n)) log(1 − o^(n)) ]

o^(n) = (1 + exp(−z^(n)))^(−1)

We can then compute via the chain rule:

∂E/∂o = (o − t) / (o(1 − o)),   ∂o/∂z = o(1 − o),   ∂E/∂z = (∂E/∂o)(∂o/∂z) = o − t

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 37 / 62

slide-80
SLIDE 80

Multi-class Classification

For multi-class classification problems, use cross-entropy as loss and the softmax activation function

E = − Σ_n Σ_k t_k^(n) log o_k^(n)

o_k^(n) = exp(z_k^(n)) / Σ_j exp(z_j^(n))

And the derivatives become

∂o_k/∂z_k = o_k(1 − o_k),   ∂E/∂z_k = Σ_j (∂E/∂o_j)(∂o_j/∂z_k) = (o_k − t_k) o_k (1 − o_k)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 38 / 62
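A minimal NumPy sketch of softmax plus cross-entropy on a single example (as the later backprop slides use, the combined gradient of the loss w.r.t. the logits simplifies to o − t):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])            # illustrative logits z_k
t = np.array([1.0, 0.0, 0.0])            # one-hot target
o = softmax(z)
loss = -np.sum(t * np.log(o))            # cross-entropy
grad_z = o - t                           # dE/dz for softmax + cross-entropy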

slide-81
SLIDE 81

Example Application

Now trying to classify image of handwritten digit: 32x32 pixels

10 output units, 1 per digit

Use the softmax function:

o_k = exp(z_k) / Σ_j exp(z_j),   z_k = w_k0 + Σ_{j=1}^J h_j(x) w_kj

What is J?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 39 / 62

slide-82
SLIDE 82

Ways to Use Weight Derivatives

How often to update

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62

slide-83
SLIDE 83

Ways to Use Weight Derivatives

How often to update

◮ after a full sweep through the training data (batch gradient descent)

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N ∂E(o^(n), t^(n); w)/∂w_ki

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62

slide-84
SLIDE 84

Ways to Use Weight Derivatives

How often to update

◮ after a full sweep through the training data (batch gradient descent)

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N ∂E(o^(n), t^(n); w)/∂w_ki

◮ after each training case (stochastic gradient descent)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62

slide-85
SLIDE 85

Ways to Use Weight Derivatives

How often to update

◮ after a full sweep through the training data (batch gradient descent)

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N ∂E(o^(n), t^(n); w)/∂w_ki

◮ after each training case (stochastic gradient descent)
◮ after a mini-batch of training cases

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62

slide-86
SLIDE 86

Ways to Use Weight Derivatives

How often to update

◮ after a full sweep through the training data (batch gradient descent)

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N ∂E(o^(n), t^(n); w)/∂w_ki

◮ after each training case (stochastic gradient descent)
◮ after a mini-batch of training cases

How much to update

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62

slide-87
SLIDE 87

Ways to Use Weight Derivatives

How often to update

◮ after a full sweep through the training data (batch gradient descent)

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N ∂E(o^(n), t^(n); w)/∂w_ki

◮ after each training case (stochastic gradient descent)
◮ after a mini-batch of training cases

How much to update

◮ Use a fixed learning rate

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62

slide-88
SLIDE 88

Ways to Use Weight Derivatives

How often to update

◮ after a full sweep through the training data (batch gradient descent)

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N ∂E(o^(n), t^(n); w)/∂w_ki

◮ after each training case (stochastic gradient descent)
◮ after a mini-batch of training cases

How much to update

◮ Use a fixed learning rate
◮ Adapt the learning rate

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62

slide-89
SLIDE 89

Ways to Use Weight Derivatives

How often to update

◮ after a full sweep through the training data (batch gradient descent)

w_ki ← w_ki − η ∂E/∂w_ki = w_ki − η Σ_{n=1}^N ∂E(o^(n), t^(n); w)/∂w_ki

◮ after each training case (stochastic gradient descent)
◮ after a mini-batch of training cases

How much to update

◮ Use a fixed learning rate
◮ Adapt the learning rate
◮ Add momentum

w_ki ← w_ki − v,   v ← γv + η ∂E/∂w_ki

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 40 / 62
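A minimal sketch of these update schemes (grad_fn stands in for whatever computes ∂E/∂w on the given examples; all names are illustrative):

def batch_step(w, X, T, grad_fn, lr=0.1):
    return w - lr * grad_fn(w, X, T)                  # full sweep through the training data

def sgd_step(w, x_n, t_n, grad_fn, lr=0.1):
    return w - lr * grad_fn(w, x_n, t_n)              # one training case at a time

def momentum_step(w, v, X, T, grad_fn, lr=0.1, gamma=0.9):
    v = gamma * v + lr * grad_fn(w, X, T)             # v <- gamma*v + eta*dE/dw
    return w - v, v                                   # w <- w - v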

slide-90
SLIDE 90

Comparing Optimization Methods

[http://cs231n.github.io/neural-networks-3/, Alec Radford]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 41 / 62

slide-91
SLIDE 91

Monitor Loss During Training

Check how your loss behaves during training, to spot wrong hyperparameters, bugs, etc.

Figure: Left: good vs bad parameter choices. Right: how a real loss might look during training. What are the bumps caused by? How could we get a smoother loss?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 42 / 62

slide-92
SLIDE 92

Monitor Accuracy on Train/Validation During Training

Check how your desired performance metric behaves during training

[http://cs231n.github.io/neural-networks-3/]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 43 / 62

slide-93
SLIDE 93

Why ”Deep”?

Supervised Learning: Examples

Figure: classification example, mapping an input image to the label “dog”

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 44 / 62

slide-94
SLIDE 94

Why ”Deep”?

Supervised Learning: Examples

Figure: classification example, mapping an input image to the label “dog”

Supervised Deep Learning

“dog” Classification

[Picture from M. Ranzato] Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 44 / 62

slide-95
SLIDE 95

Neural Networks

Deep learning uses composite of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 45 / 62

slide-96
SLIDE 96

Neural Networks

Deep learning uses composite of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions Note: a composite of linear functions is linear!

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 45 / 62

slide-97
SLIDE 97

Neural Networks

Deep learning uses composite of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions

Note: a composite of linear functions is linear!

Example: 2 hidden layer NNet (now matrix and vector form!) with ReLU as nonlinearity

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 45 / 62

slide-98
SLIDE 98

Neural Networks

Deep learning uses composite of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions

Note: a composite of linear functions is linear!

Example: 2 hidden layer NNet (now matrix and vector form!) with ReLU as nonlinearity

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

◮ x is the input

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 45 / 62

slide-99
SLIDE 99

Neural Networks

Deep learning uses composite of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions

Note: a composite of linear functions is linear!

Example: 2 hidden layer NNet (now matrix and vector form!) with ReLU as nonlinearity

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

◮ x is the input
◮ y is the output (what we want to predict)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 45 / 62

slide-100
SLIDE 100

Neural Networks

Deep learning uses composite of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions

Note: a composite of linear functions is linear!

Example: 2 hidden layer NNet (now matrix and vector form!) with ReLU as nonlinearity

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

◮ x is the input
◮ y is the output (what we want to predict)
◮ h_i is the i-th hidden layer

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 45 / 62

slide-101
SLIDE 101

Neural Networks

Deep learning uses composite of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions

Note: a composite of linear functions is linear!

Example: 2 hidden layer NNet (now matrix and vector form!) with ReLU as nonlinearity

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

◮ x is the input
◮ y is the output (what we want to predict)
◮ h_i is the i-th hidden layer
◮ W_i are the parameters of the i-th layer

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 45 / 62

slide-102
SLIDE 102

Evaluating the Function

Assume we have learned the weights and we want to do inference

Forward Propagation: compute the output given the input

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 46 / 62

slide-103
SLIDE 103

Evaluating the Function

Assume we have learned the weights and we want to do inference

Forward Propagation: compute the output given the input

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

Do it in a compositional way: h_1 = max(0, W_1^T x + b_1)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 46 / 62

slide-104
SLIDE 104

Evaluating the Function

Assume we have learned the weights and we want to do inference

Forward Propagation: compute the output given the input

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

Do it in a compositional way:

h_1 = max(0, W_1^T x + b_1)
h_2 = max(0, W_2^T h_1 + b_2)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 47 / 62

slide-105
SLIDE 105

Evaluating the Function

Assume we have learned the weights and we want to do inference

Forward Propagation: compute the output given the input

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

Do it in a compositional way:

h_1 = max(0, W_1^T x + b_1)
h_2 = max(0, W_2^T h_1 + b_2)
y = max(0, W_3^T h_2 + b_3)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 48 / 62
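A minimal NumPy sketch of this compositional forward pass (the output layer is kept linear here, matching the diagram above; weight shapes are whatever the architecture dictates):

import numpy as np

def forward(x, W1, b1, W2, b2, W3, b3):
    h1 = np.maximum(0.0, W1.T @ x + b1)    # h1 = max(0, W1^T x + b1)
    h2 = np.maximum(0.0, W2.T @ h1 + b2)   # h2 = max(0, W2^T h1 + b2)
    return W3.T @ h2 + b3                  # y = W3^T h2 + b3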

slide-106
SLIDE 106

Learning

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

We want to estimate the parameters, biases and hyper-parameters (e.g., number of layers, number of units) such that we make good predictions

Collect a training set of input-output pairs {x^(n), t^(n)}

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 49 / 62

slide-107
SLIDE 107

Learning

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

We want to estimate the parameters, biases and hyper-parameters (e.g., number of layers, number of units) such that we make good predictions

Collect a training set of input-output pairs {x^(n), t^(n)}

For classification: encode the output with 1-of-K encoding, t = [0, .., 1, .., 0]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 49 / 62

slide-108
SLIDE 108

Learning

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y

We want to estimate the parameters, biases and hyper-parameters (e.g., number of layers, number of units) such that we make good predictions

Collect a training set of input-output pairs {x^(n), t^(n)}

For classification: encode the output with 1-of-K encoding, t = [0, .., 1, .., 0]

Define a loss per training example and minimize the empirical risk

L(w) = (1/N) Σ_n ℓ(w, x^(n), t^(n))

where N is the number of examples and w contains all parameters

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 49 / 62

slide-109
SLIDE 109

Loss Function: Classification

L(w) = (1/N) Σ_n ℓ(w, x^(n), t^(n))

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 50 / 62

slide-110
SLIDE 110

Loss Function: Classification

L(w) = (1/N) Σ_n ℓ(w, x^(n), t^(n))

Probability of class k given input (softmax):

p(c_k = 1|x) = exp(y_k) / Σ_{j=1}^C exp(y_j)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 50 / 62

slide-111
SLIDE 111

Loss Function: Classification

L(w) = (1/N) Σ_n ℓ(w, x^(n), t^(n))

Probability of class k given input (softmax):

p(c_k = 1|x) = exp(y_k) / Σ_{j=1}^C exp(y_j)

Cross entropy is the most used loss function for classification:

ℓ(w, x^(n), t^(n)) = − Σ_k t_k^(n) log p(c_k|x)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 50 / 62

slide-112
SLIDE 112

Loss Function: Classification

L(w) = (1/N) Σ_n ℓ(w, x^(n), t^(n))

Probability of class k given input (softmax):

p(c_k = 1|x) = exp(y_k) / Σ_{j=1}^C exp(y_j)

Cross entropy is the most used loss function for classification:

ℓ(w, x^(n), t^(n)) = − Σ_k t_k^(n) log p(c_k|x)

Use gradient descent to train the network:

min_w (1/N) Σ_n ℓ(w, x^(n), t^(n))

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 50 / 62

slide-113
SLIDE 113

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 51 / 62

slide-114
SLIDE 114

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output

p(c_k = 1|x) = exp(y_k) / Σ_{j=1}^C exp(y_j)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 51 / 62

slide-115
SLIDE 115

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output

p(c_k = 1|x) = exp(y_k) / Σ_{j=1}^C exp(y_j)

ℓ(x^(n), t^(n), w) = − Σ_k t_k^(n) log p(c_k|x)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 51 / 62

slide-116
SLIDE 116

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output

p(c_k = 1|x) = exp(y_k) / Σ_{j=1}^C exp(y_j)

ℓ(x^(n), t^(n), w) = − Σ_k t_k^(n) log p(c_k|x)

Compute the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 51 / 62

slide-117
SLIDE 117

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output

p(c_k = 1|x) = exp(y_k) / Σ_{j=1}^C exp(y_j)

ℓ(x^(n), t^(n), w) = − Σ_k t_k^(n) log p(c_k|x)

Compute the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Note that the forward pass is necessary to compute ∂ℓ/∂y

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 51 / 62

slide-118
SLIDE 118

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-119
SLIDE 119

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-120
SLIDE 120

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module:

∂ℓ/∂W_3 =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-121
SLIDE 121

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module:

∂ℓ/∂W_3 = (∂ℓ/∂y)(∂y/∂W_3) =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-122
SLIDE 122

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module:

∂ℓ/∂W_3 = (∂ℓ/∂y)(∂y/∂W_3) = (p(c|x) − t)(h_2)^T

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-123
SLIDE 123

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module:

∂ℓ/∂W_3 = (∂ℓ/∂y)(∂y/∂W_3) = (p(c|x) − t)(h_2)^T

∂ℓ/∂h_2 =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-124
SLIDE 124

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module:

∂ℓ/∂W_3 = (∂ℓ/∂y)(∂y/∂W_3) = (p(c|x) − t)(h_2)^T

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-125
SLIDE 125

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module:

∂ℓ/∂W_3 = (∂ℓ/∂y)(∂y/∂W_3) = (p(c|x) − t)(h_2)^T

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) = (W_3)^T (p(c|x) − t)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-126
SLIDE 126

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y at the output and ∂ℓ/∂h_2 flowing back into the last module

We have computed the derivative of the loss w.r.t. the output: ∂ℓ/∂y = p(c|x) − t

Given ∂ℓ/∂y, if we can compute the Jacobian of each module:

∂ℓ/∂W_3 = (∂ℓ/∂y)(∂y/∂W_3) = (p(c|x) − t)(h_2)^T

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) = (W_3)^T (p(c|x) − t)

Need to compute the gradient w.r.t. the inputs and parameters in each layer

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 52 / 62

slide-127
SLIDE 127

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y, ∂ℓ/∂h_2 and ∂ℓ/∂h_1 flowing backward

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) = (W_3)^T (p(c|x) − t)

Given ∂ℓ/∂h_2, if we can compute the Jacobian of each module

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 53 / 62

slide-128
SLIDE 128

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y, ∂ℓ/∂h_2 and ∂ℓ/∂h_1 flowing backward

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) = (W_3)^T (p(c|x) − t)

Given ∂ℓ/∂h_2, if we can compute the Jacobian of each module:

∂ℓ/∂W_2 =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 53 / 62

slide-129
SLIDE 129

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y, ∂ℓ/∂h_2 and ∂ℓ/∂h_1 flowing backward

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) = (W_3)^T (p(c|x) − t)

Given ∂ℓ/∂h_2, if we can compute the Jacobian of each module:

∂ℓ/∂W_2 = (∂ℓ/∂h_2)(∂h_2/∂W_2)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 53 / 62

slide-130
SLIDE 130

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y, ∂ℓ/∂h_2 and ∂ℓ/∂h_1 flowing backward

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) = (W_3)^T (p(c|x) − t)

Given ∂ℓ/∂h_2, if we can compute the Jacobian of each module:

∂ℓ/∂W_2 = (∂ℓ/∂h_2)(∂h_2/∂W_2),   ∂ℓ/∂h_1 =

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 53 / 62

slide-131
SLIDE 131

Backpropagation

Efficient computation of the gradients by applying the chain rule

x → [max(0, W_1^T x + b_1)] → h_1 → [max(0, W_2^T h_1 + b_2)] → h_2 → [W_3^T h_2 + b_3] → y, with ∂ℓ/∂y, ∂ℓ/∂h_2 and ∂ℓ/∂h_1 flowing backward

∂ℓ/∂h_2 = (∂ℓ/∂y)(∂y/∂h_2) = (W_3)^T (p(c|x) − t)

Given ∂ℓ/∂h_2, if we can compute the Jacobian of each module:

∂ℓ/∂W_2 = (∂ℓ/∂h_2)(∂h_2/∂W_2),   ∂ℓ/∂h_1 = (∂ℓ/∂h_2)(∂h_2/∂h_1)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 53 / 62

slide-132
SLIDE 132

Toy Code (Matlab): Neural Net Trainer

% F-PROP
for i = 1 : nr_layers - 1
    [h{i} jac{i}] = nonlinearity(W{i} * h{i-1} + b{i});
end
h{nr_layers-1} = W{nr_layers-1} * h{nr_layers-2} + b{nr_layers-1};
prediction = softmax(h{l-1});

% CROSS ENTROPY LOSS
loss = - sum(sum(log(prediction) .* target)) / batch_size;

% B-PROP
dh{l-1} = prediction - target;
for i = nr_layers - 1 : -1 : 1
    Wgrad{i} = dh{i} * h{i-1}';
    bgrad{i} = sum(dh{i}, 2);
    dh{i-1} = (W{i}' * dh{i}) .* jac{i-1};
end

% UPDATE
for i = 1 : nr_layers - 1
    W{i} = W{i} - (lr / batch_size) * Wgrad{i};
    b{i} = b{i} - (lr / batch_size) * bgrad{i};
end

[Ranzato] This code has a few bugs with indices...

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 54 / 62

slide-133
SLIDE 133

Overfitting

The training data contains information about the regularities in the mapping from input to output. But it also contains noise

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 55 / 62

slide-134
SLIDE 134

Overfitting

The training data contains information about the regularities in the mapping from input to output. But it also contains noise

◮ The target values may be unreliable. Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 55 / 62

slide-135
SLIDE 135

Overfitting

The training data contains information about the regularities in the mapping from input to output. But it also contains noise

◮ The target values may be unreliable. ◮ There is sampling error: There will be accidental regularities just

because of the particular training cases that were chosen

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 55 / 62

slide-136
SLIDE 136

Overfitting

The training data contains information about the regularities in the mapping from input to output. But it also contains noise

◮ The target values may be unreliable. ◮ There is sampling error: There will be accidental regularities just

because of the particular training cases that were chosen When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 55 / 62

slide-137
SLIDE 137

Overfitting

The training data contains information about the regularities in the mapping from input to output. But it also contains noise

◮ The target values may be unreliable. ◮ There is sampling error: There will be accidental regularities just

because of the particular training cases that were chosen When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.

◮ So it fits both kinds of regularity. Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 55 / 62

slide-138
SLIDE 138

Overfitting

The training data contains information about the regularities in the mapping from input to output. But it also contains noise

◮ The target values may be unreliable.
◮ There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen

When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.

◮ So it fits both kinds of regularity.
◮ If the model is very flexible it can model the sampling error really well. This is a disaster.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 55 / 62

slide-139
SLIDE 139

Preventing Overfitting

Use a model that has the right capacity:

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 56 / 62

slide-140
SLIDE 140

Preventing Overfitting

Use a model that has the right capacity:

◮ enough to model the true regularities Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 56 / 62

slide-141
SLIDE 141

Preventing Overfitting

Use a model that has the right capacity:

◮ enough to model the true regularities ◮ not enough to also model the spurious regularities (assuming they are

weaker)

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 56 / 62

slide-142
SLIDE 142

Preventing Overfitting

Use a model that has the right capacity:

◮ enough to model the true regularities ◮ not enough to also model the spurious regularities (assuming they are

weaker) Standard ways to limit the capacity of a neural net:

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 56 / 62

slide-143
SLIDE 143

Preventing Overfitting

Use a model that has the right capacity:

◮ enough to model the true regularities ◮ not enough to also model the spurious regularities (assuming they are

weaker) Standard ways to limit the capacity of a neural net:

◮ Limit the number of hidden units. Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 56 / 62

slide-144
SLIDE 144

Preventing Overfitting

Use a model that has the right capacity:

◮ enough to model the true regularities ◮ not enough to also model the spurious regularities (assuming they are

weaker) Standard ways to limit the capacity of a neural net:

◮ Limit the number of hidden units. ◮ Limit the norm of the weights. Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 56 / 62

slide-145
SLIDE 145

Preventing Overfitting

Use a model that has the right capacity:

◮ enough to model the true regularities
◮ not enough to also model the spurious regularities (assuming they are weaker)

Standard ways to limit the capacity of a neural net:

◮ Limit the number of hidden units.
◮ Limit the norm of the weights.
◮ Stop the learning before it has time to overfit.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 56 / 62

slide-146
SLIDE 146

Limiting the size of the Weights

Weight-decay involves adding an extra term to the cost function that penalizes the squared weights:

C = ℓ + (λ/2) Σ_i w_i^2

Keeps weights small unless they have big error derivatives:

∂C/∂w_i = ∂ℓ/∂w_i + λ w_i

When ∂C/∂w_i = 0, w_i = −(1/λ) ∂ℓ/∂w_i

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 57 / 62
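A minimal sketch of adding this penalty to the cost and its gradient (grad_loss stands in for ∂ℓ/∂w from backprop; lam is the weight-decay coefficient λ):

import numpy as np

def cost(loss, w, lam):
    return loss + 0.5 * lam * np.sum(w ** 2)     # C = loss + (lambda/2) * sum_i w_i^2

def grad_cost(grad_loss, w, lam):
    return grad_loss + lam * w                   # dC/dw_i = dloss/dw_i + lambda * w_i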

slide-147
SLIDE 147

The Effect of Weight-decay

It prevents the network from using weights that it does not need

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 58 / 62

slide-148
SLIDE 148

The Effect of Weight-decay

It prevents the network from using weights that it does not need

◮ This can often improve generalization a lot. Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 58 / 62

slide-149
SLIDE 149

The Effect of Weight-decay

It prevents the network from using weights that it does not need

◮ This can often improve generalization a lot. ◮ It helps to stop it from fitting the sampling error. Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 58 / 62

slide-150
SLIDE 150

The Effect of Weight-decay

It prevents the network from using weights that it does not need

◮ This can often improve generalization a lot. ◮ It helps to stop it from fitting the sampling error. ◮ It makes a smoother model in which the output changes more slowly as

the input changes.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 58 / 62

slide-151
SLIDE 151

The Effect of Weight-decay

It prevents the network from using weights that it does not need

◮ This can often improve generalization a lot.
◮ It helps to stop it from fitting the sampling error.
◮ It makes a smoother model in which the output changes more slowly as the input changes.

But, if the network has two very similar inputs it prefers to put half the weight on each rather than all the weight on one → other form of weight decay?

Figure: two equivalent units, one putting weight w/2 on each of two similar inputs, the other putting weight w on a single input

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 58 / 62

slide-152
SLIDE 152

Deciding How Much to Restrict the Capacity

How do we decide which regularizer to use and how strong to make it?

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 59 / 62

slide-153
SLIDE 153

Deciding How Much to Restrict the Capacity

How do we decide which regularizer to use and how strong to make it? So use a separate validation set to do model selection.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 59 / 62

slide-154
SLIDE 154

Using a Validation Set

Divide the total dataset into three subsets:

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 60 / 62

slide-155
SLIDE 155

Using a Validation Set

Divide the total dataset into three subsets:

◮ Training data is used for learning the parameters of the model. Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 60 / 62

slide-156
SLIDE 156

Using a Validation Set

Divide the total dataset into three subsets:

◮ Training data is used for learning the parameters of the model. ◮ Validation data is not used for learning but is used for deciding what

type of model and what amount of regularization works best

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 60 / 62

slide-157
SLIDE 157

Using a Validation Set

Divide the total dataset into three subsets:

◮ Training data is used for learning the parameters of the model. ◮ Validation data is not used for learning but is used for deciding what

type of model and what amount of regularization works best

◮ Test data is used to get a final, unbiased estimate of how well the

network works. We expect this estimate to be worse than on the validation data

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 60 / 62

slide-158
SLIDE 158

Using a Validation Set

Divide the total dataset into three subsets:

◮ Training data is used for learning the parameters of the model.
◮ Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
◮ Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.

We could then re-divide the total dataset to get another unbiased estimate of the true error rate.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 60 / 62

slide-159
SLIDE 159

Preventing Overfitting by Early Stopping

If we have lots of data and a big model, it's very expensive to keep re-training it with different amounts of weight decay

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 61 / 62

slide-160
SLIDE 160

Preventing Overfitting by Early Stopping

If we have lots of data and a big model, it's very expensive to keep re-training it with different amounts of weight decay

It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 61 / 62

slide-161
SLIDE 161

Preventing Overfitting by Early Stopping

If we have lots of data and a big model, it's very expensive to keep re-training it with different amounts of weight decay

It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse

The capacity of the model is limited because the weights have not had time to grow big.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 61 / 62
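A minimal sketch of this early-stopping loop (train_step, validation_error and the patience value are illustrative placeholders):

def train_with_early_stopping(w, train_step, validation_error, max_epochs=1000, patience=5):
    best_err, best_w, bad_epochs = float("inf"), w, 0
    for epoch in range(max_epochs):
        w = train_step(w)                      # one pass of gradient-based training
        err = validation_error(w)              # monitor performance on the validation set
        if err < best_err:
            best_err, best_w, bad_epochs = err, w, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # validation error stopped improving
                break
    return best_w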

slide-162
SLIDE 162

Why Early Stopping Works

Figure: a network diagram with the inputs at the bottom and the outputs at the top

When the weights are very small, every hidden unit is in its linear range.

◮ So a net with a large layer of hidden units is linear.
◮ It has no more capacity than a linear net in which the inputs are directly connected to the outputs!

As the weights grow, the hidden units start using their non-linear ranges so the capacity grows.

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016 62 / 62