Lecture 11: Multi-layer Perceptron, Forward Pass, Backpropagation



slide-1
SLIDE 1

Lecture 11:

− Multi-layer Perceptron
− Forward Pass
− Backpropagation

Aykut Erdem

November 2017 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 3 is out!

− It is due November 24, 2017
− You will implement backpropagation to train multi-layer neural networks
− Dataset: Fashion-MNIST

2

slide-3
SLIDE 3

3

A reminder about course projects

  • From now on, you are required to write regular (weekly) blog posts about your progress on the course projects!

  • We will use medium.com
slide-4
SLIDE 4

Last time… Linear classification

4

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-5
SLIDE 5

5

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Last time… Linear classification

slide-6
SLIDE 6

6

Last time… Linear classification

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-7
SLIDE 7

Interactive web demo time….

7

http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-8
SLIDE 8

Last time… Perceptron

8

$f(x) = \sum_i w_i x_i = \langle w, x \rangle$

(Figure: a perceptron with inputs $x_1, x_2, x_3, \ldots, x_n$, synaptic weights $w_1, \ldots, w_n$, and a single output.)

slide by Alex Smola

slide-9
SLIDE 9

This Week

  • Multi-layer perceptron

  • Forward Pass
  • Backward Pass


9

slide-10
SLIDE 10

Introduction

10

slide-11
SLIDE 11

A brief history of computers

11

        1970s   1980s   1990s    2000s    2010s
Data    10^2    10^3    10^5     10^8     10^11
RAM     ?       1 MB    100 MB   10 GB    1 TB
CPU     ?       10 MF   1 GF     100 GF   1 PF (GPU)

  • Data grows at a higher exponent
  • Moore’s law (silicon) vs. Kryder’s law (disks)
  • Early algorithms data bound, now CPU/RAM bound

(Timeline annotation: deep nets → kernel methods → deep nets)

slide by Alex Smola

slide-12
SLIDE 12

Not linearly separable data

  • Some datasets are not linearly separable!
  • e.g. XOR problem

  • Nonlinear separation is trivial

12

slide by Alex Smola

slide-13
SLIDE 13

Addressing non-linearly separable data

  • Two options:
  • Option 1: Non-linear features
  • Option 2: Non-linear classifiers

13

slide by Dhruv Batra

slide-14
SLIDE 14

Option 1 — Non-linear features

14

  • Choose non-linear features, e.g.,
  • Typical linear features: w0 + Σi wi xi
  • Example of non-linear features:
  • Degree 2 polynomials, w0 + Σi wi xi + Σij wij xi xj
  • Classifier hw(x) still linear in parameters w
  • As easy to learn
  • Data is linearly separable in higher dimensional spaces
  • Express via kernels

slide by Dhruv Batra
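To make the degree-2 polynomial features above concrete, here is a small sketch (our own illustration, not from the slide; the function name and shapes are made up). A linear classifier trained on these expanded features is still linear in the parameters w, but it produces a quadratic decision boundary in the original input space.

```python
import numpy as np

def degree2_features(x):
    """Map x = (x_1, ..., x_D) to [1, x_i ..., x_i * x_j ...] (degree-2 polynomial features)."""
    x = np.asarray(x, dtype=float)
    D = x.shape[0]
    cross = [x[i] * x[j] for i in range(D) for j in range(i, D)]
    return np.concatenate(([1.0], x, cross))

# A linear classifier on the expanded features, e.g. sign(<w, phi(x)>),
# is still linear in w but non-linear in the original x.
phi = degree2_features([0.5, -1.0, 2.0])
print(phi.shape)   # 1 + 3 + 6 = 10 features
```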

slide-15
SLIDE 15

Option 2 — Non-linear classifiers

15

  • Choose a classifier hw(x) that is non-linear in parameters w, e.g.,
  • Decision trees, neural networks, …
  • More general than linear classifiers
  • But, can often be harder to learn (non-convex optimization required)
  • Often very useful (outperforms linear classifiers)
  • In a way, both ideas are related

slide by Dhruv Batra

slide-16
SLIDE 16

Biological Neurons

  • Soma (CPU): cell body, combines signals
  • Dendrite (input bus): combines the inputs from several other nerve cells
  • Synapse (interface): interface and parameter store between neurons
  • Axon (cable): may be up to 1 m long and transports the activation signal to neurons at different locations

16

slide by Alex Smola

slide-17
SLIDE 17

Recall: The Neuron Metaphor

  • Neurons
  • accept information from multiple inputs,
  • transmit information to other neurons.
  • Multiply inputs by weights along edges
  • Apply some function to the set of inputs at each node

17

slide by Dhruv Batra

slide-18
SLIDE 18

Types of Neuron

18

Linear Neuron

(Diagram: inputs weighted by $\theta_1, \ldots, \theta_D$, a bias $\theta_0$ on a constant input 1, output $f(x, \theta)$.)

$y = \theta_0 + \sum_i x_i \theta_i$

slide by Dhruv Batra

slide-19
SLIDE 19

Types of Neuron

19

Linear Neuron, Perceptron

(Diagrams: each neuron has inputs weighted by $\theta_1, \ldots, \theta_D$, a bias $\theta_0$ on a constant input 1, and output $f(x, \theta)$.)

Linear neuron: $y = \theta_0 + \sum_i x_i \theta_i$

Perceptron: $y = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$, where $z = \theta_0 + \sum_i x_i \theta_i$

slide by Dhruv Batra

slide-20
SLIDE 20

Types of Neuron

20

Linear Neuron, Logistic Neuron, Perceptron

(Diagrams: each neuron has inputs weighted by $\theta_1, \ldots, \theta_D$, a bias $\theta_0$ on a constant input 1, and output $f(x, \theta)$.)

Linear neuron: $y = \theta_0 + \sum_i x_i \theta_i$

Perceptron: $y = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$, where $z = \theta_0 + \sum_i x_i \theta_i$

Logistic neuron: $y = \dfrac{1}{1 + e^{-z}}$, where $z = \theta_0 + \sum_i x_i \theta_i$

slide by Dhruv Batra
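As a compact illustration of the three neuron types on this slide, here is a minimal sketch (our own code, not from the slide; θ0 is the bias and θ the remaining weights):

```python
import numpy as np

def pre_activation(x, theta, theta0):
    # z = theta_0 + sum_i x_i * theta_i
    return theta0 + np.dot(x, theta)

def linear_neuron(x, theta, theta0):
    return pre_activation(x, theta, theta0)            # y = z

def perceptron(x, theta, theta0):
    return 1.0 if pre_activation(x, theta, theta0) >= 0 else 0.0

def logistic_neuron(x, theta, theta0):
    z = pre_activation(x, theta, theta0)
    return 1.0 / (1.0 + np.exp(-z))                    # y = sigmoid(z)

x, theta, theta0 = np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1
print(linear_neuron(x, theta, theta0), perceptron(x, theta, theta0), logistic_neuron(x, theta, theta0))
```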

slide-21
SLIDE 21

Types of Neuron

  • Potentially more. Requires a convex loss function for gradient descent training.

21

(Same diagrams and equations as the previous slide: linear neuron, perceptron, and logistic neuron.)

slide by Dhruv Batra

slide-22
SLIDE 22

Limitation

  • A single “neuron” is still a linear decision boundary

  • What to do?
  • Idea: Stack a bunch of them together!

22

slide by Dhruv Batra

slide-23
SLIDE 23

Nonlinearities via Layers

  • Cascade neurons together
  • The output from one layer is the input to the next
  • Each layer has its own set of weights

23

Deep nets: $y_{1i}(x) = \sigma(\langle w_{1i}, x \rangle)$, $y_2(x) = \sigma(\langle w_2, y_1 \rangle)$ (optimize all weights)

Kernels: $y_{1i} = k(x_i, x)$

slide by Alex Smola

slide-24
SLIDE 24

Nonlinearities via Layers

24

$y_{1i}(x) = \sigma(\langle w_{1i}, x \rangle), \quad y_{2i}(x) = \sigma(\langle w_{2i}, y_1 \rangle), \quad y_3(x) = \sigma(\langle w_3, y_2 \rangle)$

slide by Alex Smola

slide-25
SLIDE 25

Representational Power

  • A neural network with at least one hidden layer is a universal approximator (it can approximate any continuous function to arbitrary accuracy).
    Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko, 1989.

  • The capacity of the network increases with more hidden units and more hidden layers.

25

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

slide-26
SLIDE 26

A simple example

  • Consider a neural network with two layers of neurons.
    • Neurons in the top layer represent known shapes.
    • Neurons in the bottom layer represent pixel intensities.

  • A pixel gets to vote if it has ink on it.
  • Each inked pixel can vote for several different shapes.

  • The shape that gets the most votes wins.

26

(Figure: output units for the digit classes 0–9 computing weighted votes over the input pixel units.)

slide by Geoffrey Hinton

slide-27
SLIDE 27

How to display the weights

27

Give each output unit its own “map” of the input image and display the weight coming from each pixel in the location of that pixel in the map. Use a black or white blob with the area representing the magnitude of the weight and the color representing the sign.

(Figure: the input image and one weight map per output unit, labeled 1 2 3 4 5 6 7 8 9 0.)

slide by Geoffrey Hinton

slide-28
SLIDE 28

How to learn the weights

28

Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses.

(Figure: the input image and the weight maps for classes 1 2 3 4 5 6 7 8 9 0.)

slide by Geoffrey Hinton
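A minimal sketch of the voting-and-update rule described on this slide (our own illustration; `image` is assumed to be a binary vector of active pixels and `W` a matrix with one row of per-pixel weights per digit class):

```python
import numpy as np

def update(W, image, correct_class):
    """One step of the rule on the slide: vote, then adjust weights from active pixels."""
    votes = W @ image                  # each class sums the weights of inked pixels
    guess = int(np.argmax(votes))      # the shape with the most votes wins
    W[correct_class] += image          # increment weights from active pixels to the correct class
    W[guess]         -= image          # decrement weights from active pixels to the guessed class
    return guess

# Example: 10 digit classes, 16x16 binary images flattened to 256 pixels.
W = np.zeros((10, 256))
image = (np.random.rand(256) > 0.8).astype(float)
update(W, image, correct_class=4)
```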

slide-29
SLIDE 29

29

(Figure: the input image and the current weight maps for classes 1 2 3 4 5 6 7 8 9 0.)

slide by Geoffrey Hinton

slide-30
SLIDE 30

30

(Figure: the input image and the current weight maps for classes 1 2 3 4 5 6 7 8 9 0.)

slide by Geoffrey Hinton

slide-31
SLIDE 31

31

(Figure: the input image and the current weight maps for classes 1 2 3 4 5 6 7 8 9 0.)

slide by Geoffrey Hinton

slide-32
SLIDE 32

32

(Figure: the input image and the current weight maps for classes 1 2 3 4 5 6 7 8 9 0.)

slide by Geoffrey Hinton

slide-33
SLIDE 33

33

(Figure: the input image and the current weight maps for classes 1 2 3 4 5 6 7 8 9 0.)

slide by Geoffrey Hinton

slide-34
SLIDE 34

The learned weights

34

(Figure: the input image and the learned weight maps for classes 1 2 3 4 5 6 7 8 9 0.)

The details of the learning algorithm will be explained later.

slide by Geoffrey Hinton

slide-35
SLIDE 35

Why is this insufficient?

  • A two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape.
    • The winner is the template that has the biggest overlap with the ink.

  • The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes.

  • To capture all the allowable variations of a digit we need to learn the features that it is composed of.

35

slide by Geoffrey Hinton

slide-36
SLIDE 36

Multilayer Perceptron

36

  • Layer Representation: (typically) iterate between a linear mapping $Wx$ and a nonlinear function:
    $y_i = W_i x_i, \quad x_{i+1} = \sigma(y_i)$
  • Loss function $l(y, y_i)$ to measure the quality of the estimate so far

(Diagram: layers $x_1 \to x_2 \to x_3 \to x_4 \to y$ connected by weight matrices $W_1, W_2, W_3, W_4$.)

slide by Alex Smola
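As a sketch of the layer representation above, a forward pass alternates the linear map W_i x_i with the nonlinearity σ and then evaluates a loss on the final output (illustrative code with made-up shapes; the slide does not prescribe an implementation, and here the last layer is left linear):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))              # elementwise nonlinearity

def forward(x, weights):
    """Iterate y_i = W_i x_i, x_{i+1} = sigma(y_i); the final layer is kept linear."""
    for W in weights[:-1]:
        x = sigma(W @ x)
    return weights[-1] @ x

def squared_loss(y_pred, y_true):
    return 0.5 * np.sum((y_pred - y_true) ** 2)  # l(y, y_i)

# Hypothetical network: 3 -> 4 -> 4 -> 1
weights = [np.random.randn(4, 3), np.random.randn(4, 4), np.random.randn(1, 4)]
x, y_true = np.random.randn(3), np.array([1.0])
print(squared_loss(forward(x, weights), y_true))
```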

slide-37
SLIDE 37

Forward Pass

37

slide-38
SLIDE 38

Forward Pass: What does the Network Compute?

  • Output of the network can be written as:

(j indexing hidden units, k indexing the output units, D number of inputs)

  • Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)

38

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

$o_k(x) = g\Big(w_{k0} + \sum_{j=1}^{J} h_j(x)\, w_{kj}\Big), \qquad h_j(x) = f\Big(v_{j0} + \sum_{i=1}^{D} x_i v_{ji}\Big)$

$\sigma(z) = \frac{1}{1 + \exp(-z)}, \qquad \tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}, \qquad \mathrm{ReLU}(z) = \max(0, z)$

slide-39
SLIDE 39

Forward Pass in Python

  • Example code for a forward pass for a 3-layer network in Python (shown on the slide; a sketch is reproduced below):
  • Can be implemented efficiently using matrix operations
  • Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?

39

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

[http://cs231n.github.io/neural-networks-1/]
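The code on the slide did not survive the export. Below is a minimal sketch in the spirit of the cs231n notes linked above, with hypothetical random weights whose shapes match the bullet (W1 is 4 × 3, W2 is 4 × 4; in this sketch W3 is 1 × 4, and each layer gets its own bias vector):

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # activation function (sigmoid)

# Hypothetical weights/biases; in practice these are learned.
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))
W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))

x   = np.random.randn(3, 1)              # random input vector (3x1)
h1  = f(np.dot(W1, x) + b1)              # first hidden layer activations (4x1)
h2  = f(np.dot(W2, h1) + b2)             # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                # output neuron (1x1)
```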

slide-40
SLIDE 40

Special Case

  • What is a single-layer (no hiddens) network with a sigmoid activation function?

  • Network:
  • Logistic regression!

40

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

$o_k(x) = \frac{1}{1 + \exp(-z_k)}, \qquad z_k = w_{k0} + \sum_{j=1}^{J} x_j w_{kj}$
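As a short illustration of this special case (our own code, not from the slide), a single-layer sigmoid network computes exactly the logistic-regression probability:

```python
import numpy as np

def single_layer_sigmoid(x, w, w0):
    """o_k(x) = 1 / (1 + exp(-z_k)) with z_k = w_k0 + sum_j x_j w_kj: logistic regression."""
    z = w0 + w @ x
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -1.3, 0.7])
w, w0 = np.random.randn(2, 3), np.random.randn(2)   # 2 output units, 3 inputs
print(single_layer_sigmoid(x, w, w0))
```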

slide-41
SLIDE 41

Example

  • Classify image of handwritten digit (32x32 pixels): 4 vs non-4
  • How would you build your network?
  • For example, use one hidden layer and the sigmoid activation function:
  • How can we train the network, that is, adjust all the parameters w?

41


$o_k(x) = \frac{1}{1 + \exp(-z_k)}, \qquad z_k = w_{k0} + \sum_{j=1}^{J} h_j(x)\, w_{kj}$

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler

slide-42
SLIDE 42

Training Neural Networks

  • Find weights:
    $w^* = \arg\min_w \sum_{n=1}^{N} \mathrm{loss}\big(o^{(n)}, t^{(n)}\big)$
    where $o = f(x; w)$ is the output of a neural network
  • Define a loss function, e.g.:
    • Squared loss: $\sum_k \frac{1}{2}\big(o_k^{(n)} - t_k^{(n)}\big)^2$
    • Cross-entropy loss: $-\sum_k t_k^{(n)} \log o_k^{(n)}$
  • Gradient descent:
    $w^{t+1} = w^t - \eta \, \frac{\partial E}{\partial w^t}$
    where η is the learning rate (and E is error/loss)

42

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
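For concreteness, here is a tiny sketch of the gradient-descent update above under the squared loss, using a single training example and a single linear output so the gradient has a closed form (our own illustration, not from the slide):

```python
import numpy as np

def sgd_step(w, x, t, eta=0.1):
    """One step of w <- w - eta * dE/dw for E = 0.5 * (o - t)^2 with o = w . x."""
    o = np.dot(w, x)                 # network output (here: a single linear neuron)
    grad = (o - t) * x               # dE/dw = (o - t) * x
    return w - eta * grad

w = np.zeros(3)
x, t = np.array([1.0, 2.0, -1.0]), 0.5
for _ in range(100):
    w = sgd_step(w, x, t)
print(np.dot(w, x))                  # approaches the target t = 0.5
```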

slide-43
SLIDE 43

Useful derivatives

43

Name      Function                                                      Derivative
Sigmoid   $\sigma(z) = \frac{1}{1 + \exp(-z)}$                          $\sigma(z) \cdot (1 - \sigma(z))$
Tanh      $\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$    $1 / \cosh^2(z)$
ReLU      $\mathrm{ReLU}(z) = \max(0, z)$                               $1$ if $z > 0$, $0$ if $z \le 0$

slide by Raquel Urtasun, Richard Zemel, Sanja Fidler
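These derivatives are exactly the factors backpropagation multiplies by at each nonlinearity; a quick sketch of the table in code (our own illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # sigma(z) * (1 - sigma(z))

def d_tanh(z):
    return 1.0 / np.cosh(z) ** 2         # 1 / cosh^2(z)

def d_relu(z):
    return (z > 0).astype(float)         # 1 if z > 0, else 0

z = np.linspace(-2.0, 2.0, 5)
print(d_sigmoid(z), d_tanh(z), d_relu(z))
```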

slide-44
SLIDE 44

Backpropagation and Neural Networks

44

slide-45
SLIDE 45

Recap: Loss function/Optimization

45

(Figure: a table of class scores for a few example images under the current weights.)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

We defined a (linear) score function.

TODO:
  1. Define a loss function that quantifies our unhappiness with the scores across the training data.
  2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

slide-46
SLIDE 46

46

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-47
SLIDE 47

47

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-48
SLIDE 48

48

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-49
SLIDE 49

49

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-50
SLIDE 50

50

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-51
SLIDE 51

51

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-52
SLIDE 52

52

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-53
SLIDE 53

53

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-54
SLIDE 54

54

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Softmax Classifier (Multinomial Logistic Regression)

slide-55
SLIDE 55

Softmax Classifier (Multinomial Logistic Regression)

55

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
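Slides 46–55 build up the softmax classifier step by step, but only the titles survived the export. As a reminder of the idea, here is a minimal, numerically stable sketch of the softmax cross-entropy loss (our own illustration, not code from the slides; the scores are arbitrary example numbers):

```python
import numpy as np

def softmax_loss(scores, correct_class):
    """Cross-entropy loss -log p_correct for unnormalized class scores."""
    shifted = scores - np.max(scores)                 # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted)) # softmax probabilities
    return -np.log(probs[correct_class])

scores = np.array([3.2, 5.1, -1.7])                   # example class scores, e.g. s = W x
print(softmax_loss(scores, correct_class=0))
```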

slide-56
SLIDE 56

Optimization

56

slide-57
SLIDE 57

Gradient Descent

57

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-58
SLIDE 58

Mini-batch Gradient Descent

  • Only use a small portion of the training set to compute the gradient

58

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
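A sketch of the mini-batch loop described above (illustrative; `evaluate_gradient`, the data arrays, and the batch size are placeholders in the spirit of the course's pseudo-code, not definitions from the slide):

```python
import numpy as np

def minibatch_gd(weights, data, labels, evaluate_gradient, step_size=1e-2,
                 batch_size=256, num_steps=1000):
    """Vanilla mini-batch gradient descent: estimate the gradient on a small
    random portion of the training set at each step."""
    n = data.shape[0]
    for _ in range(num_steps):
        idx = np.random.choice(n, batch_size, replace=False)   # sample a mini-batch
        grad = evaluate_gradient(weights, data[idx], labels[idx])
        weights = weights - step_size * grad                    # parameter update
    return weights

# Example usage with a linear least-squares gradient (purely illustrative):
X = np.random.randn(1000, 5); w_true = np.random.randn(5); y = X @ w_true
grad_fn = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)
w = minibatch_gd(np.zeros(5), X, y, grad_fn, batch_size=64)
```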

slide-59
SLIDE 59

Mini-batch Gradient Descent

  • Only use a small portion of the training set to compute the gradient

59

There are also more fancy update formulas (momentum, Adagrad, RMSProp, Adam, …)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-60
SLIDE 60

The effects of different update formulas

60

(image credits to Alec Radford)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-61
SLIDE 61

61

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-62
SLIDE 62

62

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson