COMP24111: Machine Learning and Optimisation, Chapter 5: Neural Networks and Deep Learning


slide-1
SLIDE 1

COMP24111: Machine Learning and Optimisation

  • Dr. Tingting Mu

Email: tingting.mu@manchester.ac.uk

Chapter 5: Neural Networks and Deep Learning

slide-2
SLIDE 2

Outline

  • Single-layer perceptron, the perceptron algorithm.
  • Multi-layer perceptron.
  • Back-propagation method.
  • Convolutional neural network.
  • Deep learning.

1

slide-3
SLIDE 3

Neuron Structure

  • Simulating a neuron: each artificial neural network (ANN) neuron receives multiple inputs and generates one output.

Figure from http://2centsapiece.blogspot.co.uk/2015/10/identifying-subatomic-particles-with.html

A neuron is an electrically excitable cell that processes and transmits information by electro-chemical signaling.

slide-4
SLIDE 4

Neuron Structure

  • Simulating a neuron: each artificial neural network (ANN) neuron receives multiple inputs and generates one output.

Figure from http://2centsapiece.blogspot.co.uk/2015/10/identifying-subatomic-particles-with.html

Input signals sent from other neurons. If enough signals accumulate, the neuron fires a signal. Connection strengths determine how the signals are accumulated.

A neuron is an electrically excitable cell that processes and transmits information by electro-chemical signaling.

slide-5
SLIDE 5

History

  • 1951, the first randomly wired neural network learning machine, SNARC (Stochastic Neural Analog Reinforcement Calculator), was designed by Marvin Lee Minsky (09/08/1927 – 24/01/2016, American cognitive scientist).

  • 1957, the perceptron algorithm was invented at the Cornell Aeronautical Laboratory by Frank Rosenblatt (11/07/1928 – 11/07/1971, American psychologist notable in A.I.).

  • 1980, the Neocognitron (a type of artificial neural network) proposed by Kunihiko Fukushima was used for handwritten character recognition, and served as the inspiration for convolutional neural networks.

4

slide-6
SLIDE 6

History

  • 1982, the Hopfield network (a form of recurrent artificial neural network) was popularized by John Hopfield, but described earlier by Little in 1974.

  • 1986, the process of backpropagation was described by David Rumelhart, Geoff Hinton and Ronald J. Williams. But the basics of continuous backpropagation were derived in the context of control theory by Henry J. Kelley (1926-1988, Professor of Aerospace and Ocean Engineering) in 1960.

  • 1997, the long short-term memory (LSTM) recurrent neural network was invented by Sepp Hochreiter and Jürgen Schmidhuber, improving the efficiency and practicality of recurrent neural networks.

5

slide-7
SLIDE 7

Single Neuron Model

  • An ANN neuron: multiple inputs [x1, x2,…, xd] and one output y.

y = \phi\left(\sum_{i=1}^{d} w_i x_i + b\right)

Figure: inputs x_1, …, x_d with weights w_1, …, w_d and a bias b feed an adder, followed by an activation function \phi that produces the output y.

Basic elements of a typical neuron include:

  • A set of synapses or connections, each characterised by a weight (strength).
  • An adder for summing the input signals, weighted by the respective synapses.
  • An activation function, which squashes the permissible amplitude range of the output signal.

Given d inputs, a neuron is modelled by d+1 parameters.

slide-8
SLIDE 8

Types of Activation Function

  • Identity function: \phi(v) = v

  • Threshold function: \phi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}

  • Sigmoid function (“S”-shaped curve): the logistic function \phi(v) = \frac{1}{1 + \exp(-v)} \in (0, +1), or \phi(v) = \tanh(v) = \frac{\exp(2v) - 1}{\exp(2v) + 1} \in (-1, +1)

  • Rectified linear unit (ReLU): \phi(v) = \begin{cases} v & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases}

Figure: plots of the identity, threshold, sigmoid, tanh and ReLU activation functions \phi(v) against v.
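A minimal NumPy sketch of the activation functions listed above; the function names here are illustrative, not part of any course-provided code.

```python
import numpy as np

def identity(v):
    return v

def threshold(v):
    # +1 if v >= 0, -1 otherwise
    return np.where(v >= 0, 1.0, -1.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))   # output in (0, +1)

def tanh(v):
    return np.tanh(v)                  # output in (-1, +1)

def relu(v):
    return np.maximum(v, 0.0)          # v if v >= 0, else 0

v = np.linspace(-3, 3, 7)
print(sigmoid(v), relu(v))
```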

slide-9
SLIDE 9

The Perceptron Algorithm

  • When the activation function is set as the identity function, the single neuron model becomes the linear model we learned in the previous chapters. The neuron weights and bias are equivalent to the coefficient vector of the linear model.

  • When the activation function is set as the threshold function, the model is still linear, and it is known as the perceptron of Rosenblatt.

  • The perceptron algorithm is for two-class classification, and it occupies an important place in the history of pattern recognition algorithms.

Identity activation: model output = w^T x + b

Threshold activation function: \phi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}

Figure: plots of the identity and threshold activation functions.

slide-10
SLIDE 10

The Perceptron Algorithm

  • Training a perceptron:

Perceptron output:

\text{output} = \begin{cases} 1 & \text{if } w^T \tilde{x} \ge 0 \\ -1 & \text{if } w^T \tilde{x} < 0 \end{cases}, \quad \text{where } w = [b, w_1, \ldots, w_d]^T \text{ and } \tilde{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}

Update using a misclassified sample in each iteration:

w^{(t+1)} = w^{(t)} - \eta \nabla O(w^{(t)}) = w^{(t)} + \eta y_i \tilde{x}_i

slide-11
SLIDE 11

The Perceptron Algorithm

  • Perceptron Training:
  • What weight changes do the following cases produce?

Initialise the weights (stored in w(0)) to random numbers in the range -1 to +1.
For t = 1 to NUM_ITERATIONS
    For each training sample (xi, yi)
        Calculate the activation using the current weights (stored in w(t)).
        Update the weights (stored in w(t+1)) by the learning rule.
    end
end

Update using one misclassified sample in each iteration: w^{(t+1)} = w^{(t)} + \eta y_i \tilde{x}_i

  • if (true label = -1, activation output = -1), then: no change.
  • if (true label = +1, activation output = +1), then: no change.
  • if (true label = -1, activation output = +1), then: add -\eta \tilde{x}_i.
  • if (true label = +1, activation output = -1), then: add +\eta \tilde{x}_i.
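A small NumPy sketch of the training loop above, using the augmented vectors w = [b, w1, ..., wd] and x̃ = [1, x]. The variable names and toy data are illustrative only.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, num_iterations=100, seed=0):
    """X: (N, d) inputs; y: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    X_tilde = np.hstack([np.ones((N, 1)), X])      # prepend the bias input 1
    w = rng.uniform(-1, 1, size=d + 1)             # random init in [-1, +1]
    for t in range(num_iterations):
        for i in range(N):
            activation = 1.0 if w @ X_tilde[i] >= 0 else -1.0
            if activation != y[i]:                 # update only on a misclassified sample
                w = w + eta * y[i] * X_tilde[i]
    return w

# Toy linearly separable data: class +1 roughly when x1 + x2 is large.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 0.8]])
y = np.array([-1, 1, -1, 1])
print(train_perceptron(X, y))
```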

slide-12
SLIDE 12

Why train like this?

  • Parameters stored in w are optimised by minimising an error function, called the perceptron criterion:

O(w) = -\sum_{i \in \text{Misclassified Set}} y_i \left( w^T \tilde{x}_i \right)

If a sample is correctly classified, it contributes an error penalty of zero; if it is incorrectly classified, it contributes the penalty -y_i \left( w^T \tilde{x}_i \right) > 0. We want to reduce the number of misclassified samples, and therefore to minimise the above error penalty.

  • Stochastic gradient descent is used for training.
  • Estimate the gradient using a misclassified sample:

O_i(w) = -y_i w^T \tilde{x}_i \ \Rightarrow \ \frac{\partial O_i(w)}{\partial w} = -y_i \tilde{x}_i

w^{(t+1)} = w^{(t)} - \eta \nabla O_i(w^{(t)}) = w^{(t)} + \eta y_i \tilde{x}_i

slide-13
SLIDE 13

One neuron can be used to construct a linear model.

y = \phi\left(\sum_{i=1}^{d} w_i x_i + b\right)

Figure: input nodes x_1, …, x_d (the input layer) feed a single neuron with weights w_1, …, w_d, bias b, an adder and an activation function \phi, producing the output y.

It has only one layer (the input layer), and is called a single-layer perceptron.

slide-14
SLIDE 14

Adding Hidden Layers!

  • The presence of hidden layers allows the network to formulate more complex functions.

  • Each hidden node finds a partial solution to the problem, to be combined in the next layer.

Example figures: a network with an input layer and one hidden layer, and a network with an input layer and two hidden layers, each mapping inputs x_1, …, x_d to an output y.

slide-15
SLIDE 15

Multilayer Perceptron

  • A multilayer perceptron (MLP), also called a feedforward artificial neural network, consists of at least three layers of nodes (input, hidden and output layers).

Figure: a network with an input layer, two hidden layers and an output layer.

  • The number of neurons in the input layer is equal to the number of input features.
  • The number of hidden layers is a hyperparameter to be set.
  • The numbers of neurons in the hidden layers are also hyperparameters to be set.
  • The number of neurons in the output layer depends on the task to be solved.

slide-16
SLIDE 16

Multilayer Perceptron

  • An MLP example with one hidden layer consisting of 4 hidden neurons. It takes 9 input features and returns 2 output variables (9 input neurons in the input layer, 2 output neurons in the output layer).

Figure: the input layer feeds a hidden layer of 4 neurons, followed by an output layer of 2 neurons.

  • Output of the j-th neuron in the hidden layer (j = 1, 2, 3, 4), for the n-th training sample:

z_j(n) = \phi\left( \sum_{i=1}^{9} w_{ij}^{(h)} x_i(n) + b_j^{(h)} \right)

  • Output of the k-th neuron in the output layer (k = 1, 2), for the n-th training sample:

y_k(n) = \phi\left( \sum_{j=1}^{4} w_{jk}^{(o)} z_j(n) + b_k^{(o)} \right)

Feed-forward information flow when computing the output variables.

slide-17
SLIDE 17

Multilayer Perceptron

  • The same MLP example as the previous slide (one hidden layer of 4 neurons, 9 input features, 2 output variables), with the same hidden-layer and output-layer equations.
  • Each hidden neuron j has 9 + 1 = 10 weights (including its bias); each output neuron k has 4 + 1 = 5 weights (including its bias).
  • How many weights in total?

Figure: weights W^{(h)}_{ij} connect x_i(n) to z_j(n); weights W^{(o)}_{jk} connect z_j(n) to y_k(n). Feed-forward information flow when computing the output variables.

slide-18
SLIDE 18

Multilayer Perceptron

  • The same MLP example: 10 x 4 = 40 weights in the hidden layer and 5 x 2 = 10 weights in the output layer, giving a total of 40 + 10 = 50 weights to be optimised in this neural network (including bias parameters).

Feed-forward information flow when computing the output variables.
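A NumPy sketch of the feed-forward pass for the 9-4-2 MLP example above. The weight shapes match the slide's counting: 9 x 4 hidden weights plus 4 biases (40), and 4 x 2 output weights plus 2 biases (10), 50 in total. The random initialisation and the sigmoid choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h, b_h = rng.standard_normal((9, 4)), np.zeros(4)   # hidden layer: w_ij^(h), b_j^(h)
W_o, b_o = rng.standard_normal((4, 2)), np.zeros(2)   # output layer: w_jk^(o), b_k^(o)

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))                    # sigmoid activation

def forward(x):
    z = phi(x @ W_h + b_h)                             # z_j(n), j = 1..4
    y = phi(z @ W_o + b_o)                             # y_k(n), k = 1..2
    return y

x = rng.standard_normal(9)                             # one sample with 9 features
print(forward(x))
print(W_h.size + b_h.size + W_o.size + b_o.size)       # 50 weights in total
```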

slide-19
SLIDE 19

Neural Network Training

  • Neural network training is the process of finding the optimal setting of the neural network weights.

Figure: a network with an input layer, two hidden layers and an output layer.

slide-20
SLIDE 20

Neural Network Training

  • Neural network training is the process of finding the optimal setting of the neural network weights W_NN.

Figure: a network with an input layer, hidden layers 1-3 and a prediction layer (the new output layer), mapping the original features x to new features \phi(x, W_NN).

A neural network can be viewed as a powerful feature extractor to compute an effective representation for the sample, which helps the prediction task.

Loss(φ(x, WNN))

slide-21
SLIDE 21

Neural Network Training

  • Treating z = \phi(x, W_NN) as the new features and using these as the input of a linear model, all the objective functions we learned in previous chapters can be used to optimise the neural network weights:
    – Sum-of-squares error (as used by the least squares model, Chapter 2)
    – A mixture of the sum-of-squares error and a regularisation term (as used by the regularised least squares model, Chapter 2)
    – …

  • Training (optimisation) methods: stochastic gradient descent, mini-batch gradient descent.

20

slide-22
SLIDE 22

Example: Sum-of-Squares Error

21

Figure: a network with an input layer and hidden layers 1-3 maps the original features x = [x_1, …, x_d] to the new features z = \phi(x, W_NN) = [z_1, …, z_D].

slide-23
SLIDE 23

Example: Sum-of-Squares Error

22

Figure: the same network maps the original features x = [x_1, …, x_d] to the new features z = \phi(x, W_NN) = [z_1, …, z_D], which feed a least squares model (W, b):

\| \hat{y} - y \|_2^2 = \| W^T z + b - y \|_2^2 = \| W^T \phi(x, W_{NN}) + b - y \|_2^2

slide-24
SLIDE 24

Example: Sum-of-Squares Error

23

Figure: the same network maps the original features x to the new features z = \phi(x, W_NN), which feed a least squares model (W, b):

\| \hat{y} - y \|_2^2 = \| W^T z + b - y \|_2^2 = \| W^T \phi(x, W_{NN}) + b - y \|_2^2

Train the network over N training samples \{x_i\}_{i=1}^{N} by minimising:

O(W_{NN}, W, b) = \sum_{i=1}^{N} \| W^T \phi(x_i, W_{NN}) + b - y_i \|_2^2
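A minimal sketch of the sum-of-squares objective O(W_NN, W, b) above, treating the network output \phi(x, W_NN) as a black-box feature extractor. The function phi_net here is an illustrative stand-in for the network, not course-provided code.

```python
import numpy as np

def sse_objective(phi_net, W, b, X, Y):
    """X: (N, d) inputs; Y: (N, m) targets; phi_net maps (N, d) -> (N, D)."""
    Z = phi_net(X)                       # new features z = phi(x, W_NN)
    Y_hat = Z @ W + b                    # linear prediction layer W^T z + b
    return np.sum((Y_hat - Y) ** 2)      # sum over samples of ||W^T z + b - y||^2

# Example with an identity "network" and a toy dataset.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[1.0], [0.0]])
W, b = np.zeros((2, 1)), np.zeros(1)
print(sse_objective(lambda X: X, W, b, X, Y))
```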

slide-25
SLIDE 25

24

slide-26
SLIDE 26
Cross-Entropy Loss Function

  • Binary classification: we want to model the probability that a sample is from class 1 and the probability that it is from class 0, based on a linear prediction function:

w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w^T \tilde{x}

p(y = 1 \mid x): given an observed sample x, the probability that it is from class 1.
p(y = 0 \mid x): given an observed sample x, the probability that it is from class 0.

This can be done by using the logistic sigmoid function \sigma(v) = \frac{1}{1 + \exp(-v)}:

p(y = 1 \mid x) = \sigma(w^T \tilde{x}) = \frac{1}{1 + \exp(-w^T \tilde{x})}

p(y = 0 \mid x) = 1 - \sigma(w^T \tilde{x}) = \frac{1}{1 + \exp(w^T \tilde{x})}

slide-27
SLIDE 27
Cross-Entropy Loss Function

  • Multi-class classification: we want to model the probability that a sample is from class k (k = 1, 2, …, c) based on a linear prediction function:

w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w^T \tilde{x}

Construct c different linear functions, one for each of the c classes.

  • Each sample has c binary output variables, each indicating whether the sample belongs to a class:

y = [y_1, y_2, \ldots, y_c], \quad y_k \in \{0, 1\}

p(y_k = 1 \mid x): given an observed sample x, the probability that it is from class k. We model it by a softmax function:

p(y_k = 1 \mid x) = \frac{\exp(a_k)}{\sum_{j=1}^{c} \exp(a_j)}, \quad \text{where } a_k = w_k^T \tilde{x}

slide-28
SLIDE 28
Cross-Entropy Loss Function

  • A commonly used loss function for training a linear model is the cross-entropy loss. It is computed from the estimated probabilities based on a linear model, as introduced in the previous slides:

For binary classification:

O(w) = -\left( \sum_{i=1}^{N} y_i \log p(y_i = 1 \mid x_i) + \sum_{i=1}^{N} (1 - y_i) \log p(y_i = 0 \mid x_i) \right)
     = -\sum_{i=1}^{N} y_i \log\left( \sigma(w^T \tilde{x}_i) \right) - \sum_{i=1}^{N} (1 - y_i) \log\left( 1 - \sigma(w^T \tilde{x}_i) \right)

For multi-class classification:

O(W) = -\sum_{i=1}^{N} \sum_{k=1}^{c} y_{ik} \log p(y_{ik} = 1 \mid x_i)
     = -\sum_{i=1}^{N} \sum_{k=1}^{c} y_{ik} \log\left( \frac{\exp(w_k^T \tilde{x}_i)}{\sum_{j=1}^{c} \exp(w_j^T \tilde{x}_i)} \right)

A linear classifier trained using the cross-entropy loss is called logistic regression.
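A NumPy sketch of the two cross-entropy losses above. Labels are assumed to be 0/1 in the binary case and one-hot vectors y_ik in the multi-class case; the max-subtraction in the softmax is a standard numerical-stability step, not something stated on the slide.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(w, X_tilde, y):
    """X_tilde: (N, d+1) augmented inputs; y: (N,) labels in {0, 1}."""
    p = sigmoid(X_tilde @ w)                            # p(y=1|x) for each sample
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def multiclass_cross_entropy(W, X_tilde, Y):
    """W: (d+1, c), one weight vector per class; Y: (N, c) one-hot labels."""
    A = X_tilde @ W                                     # a_k = w_k^T x for each class
    A = A - A.max(axis=1, keepdims=True)                # stabilise the softmax
    P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
    return -np.sum(Y * np.log(P))

X_tilde = np.array([[1.0, 0.5], [1.0, -0.5]])           # bias input 1 prepended
print(binary_cross_entropy(np.array([0.0, 1.0]), X_tilde, np.array([1, 0])))
```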

slide-29
SLIDE 29

Example: Two-class classification

28

  • Convert the output of the neural network into a single probability value using the logistic sigmoid function.

  • Optimise the neural network weights and the prediction weight vector w by minimising the cross-entropy loss.

Figure: the network maps the original features x = [x_1, …, x_d] to the new features z = \phi(x, W_NN) = [z_1, …, z_D]; the sigmoid prediction layer then computes \frac{1}{1 + \exp(-w^T \tilde{z})}, the probability of whether the sample is from a class. Use the sigmoid function to build the prediction layer.

slide-30
SLIDE 30

Example: Multi-class classification

29

  • Convert the output of the neural network into a set of c probability values using the softmax function.

  • Optimise the neural network weights and the prediction weight vectors w_1, …, w_c by minimising the cross-entropy loss.

Figure: the network maps the original features x = [x_1, …, x_d] to the new features z = \phi(x, W_NN) = [z_1, …, z_D]; the softmax prediction layer then computes, for each class k (e.g. red, green, purple), the probability

\frac{\exp(w_k^T \tilde{z})}{\sum_{j=1}^{c} \exp(w_j^T \tilde{z})}

Use the softmax function to build the prediction layer.

slide-31
SLIDE 31

30

slide-32
SLIDE 32

Backpropagation

  • Technically, backpropagation is a method of calculating the gradient of the loss function with respect to layers of the neural network weights.

  • It uses the chain rule to iteratively compute gradients for each layer.

  • It can be viewed as a process of calculating the error contribution of each neuron after processing a batch of training data.

Given z(y(x)), the chain rule gives: \frac{dz}{dx} = \frac{dz}{dy} \times \frac{dy}{dx}

slide-33
SLIDE 33

32

Figure: a network with an input layer and hidden layers 1-3 maps the original features x to the new features z, followed by the prediction layer.

The loss depends on the input and on the weights of every layer: O(x, W^{(h1)}, W^{(h2)}, W^{(h3)}, W_p).

\frac{\partial O}{\partial W^{(h1)}}, \ \frac{\partial O}{\partial W^{(h2)}}, \ \frac{\partial O}{\partial W^{(h3)}}, \ \frac{\partial O}{\partial W_p} = ?

slide-34
SLIDE 34

33

Figure: the network computes z^{(h1)}(x, W^{(h1)}), then z^{(h2)}(z^{(h1)}, W^{(h2)}), then z(z^{(h2)}, W^{(h3)}); each layer's output is a function of the previous layer's output and that layer's weights. The loss depends on the final features through O(z, W_p), so overall O(x, W^{(h1)}, W^{(h2)}, W^{(h3)}, W_p).

\frac{\partial O}{\partial W^{(h1)}}, \ \frac{\partial O}{\partial W^{(h2)}}, \ \frac{\partial O}{\partial W^{(h3)}}, \ \frac{\partial O}{\partial W_p} = ?

slide-35
SLIDE 35

34

Figure: as before, the network computes z^{(h1)}(x, W^{(h1)}), z^{(h2)}(z^{(h1)}, W^{(h2)}), z(z^{(h2)}, W^{(h3)}), and the loss O(z, W_p), so overall O(x, W^{(h1)}, W^{(h2)}, W^{(h3)}, W_p).

Applying the chain rule layer by layer:

\frac{\partial O}{\partial W_p}

\frac{\partial O}{\partial W^{(h3)}} = \frac{\partial O}{\partial z} \times \frac{\partial z}{\partial W^{(h3)}}

\frac{\partial O}{\partial W^{(h2)}} = \frac{\partial O}{\partial z} \times \frac{\partial z}{\partial z^{(h2)}} \times \frac{\partial z^{(h2)}}{\partial W^{(h2)}}

\frac{\partial O}{\partial W^{(h1)}} = \frac{\partial O}{\partial z} \times \frac{\partial z}{\partial z^{(h2)}} \times \frac{\partial z^{(h2)}}{\partial z^{(h1)}} \times \frac{\partial z^{(h1)}}{\partial W^{(h1)}}
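A compact NumPy sketch of the chain-rule decomposition above, for a small network with two hidden layers, a linear prediction layer and a squared-error loss. The sigmoid activation, the specific layer sizes and the single-sample setting are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), np.array([1.0])
W1, W2, Wp = rng.standard_normal((3, 4)), rng.standard_normal((4, 4)), rng.standard_normal((4, 1))

# Forward pass: z1(x, W1), z2(z1, W2), prediction and loss O(z2, Wp).
z1 = sigmoid(x @ W1)
z2 = sigmoid(z1 @ W2)
y_hat = z2 @ Wp
O = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: the factor dO/dz2 is computed once and reused by earlier layers.
dO_dyhat = y_hat - y                                   # dO/dy_hat
dO_dWp = np.outer(z2, dO_dyhat)                        # dO/dWp
dO_dz2 = Wp @ dO_dyhat                                 # dO/dz2, shared factor
dO_dW2 = np.outer(z1, dO_dz2 * z2 * (1 - z2))          # dO/dz2 x dz2/dW2
dO_dz1 = W2 @ (dO_dz2 * z2 * (1 - z2))                 # dO/dz2 x dz2/dz1
dO_dW1 = np.outer(x, dO_dz1 * z1 * (1 - z1))           # ... x dz1/dW1

print(O, dO_dWp.shape, dO_dW2.shape, dO_dW1.shape)
```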

slide-36
SLIDE 36

Same figure and chain-rule equations as the previous slide, with the shared factor \frac{\partial O}{\partial z} highlighted: it is computed once at the prediction layer and reused in the gradients of all hidden-layer weights.

slide-37
SLIDE 37

Same figure and chain-rule equations, now with the factors \frac{\partial O}{\partial z} and \frac{\partial z}{\partial z^{(h2)}} highlighted: the gradient is propagated backwards from the prediction layer into hidden layer 2.

slide-38
SLIDE 38

Same figure and chain-rule equations, with the factors \frac{\partial O}{\partial z}, \frac{\partial z}{\partial z^{(h2)}} and \frac{\partial z^{(h2)}}{\partial z^{(h1)}} highlighted: the gradient is propagated backwards layer by layer, down to hidden layer 1.
slide-39
SLIDE 39

Convolutional Neural Network

  • The main ingredients of a convolutional neural network (CNN) include:
    – 2D/3D neurons: the CNN supports layers that have neurons arranged in 2 dimensions (width and height) or 3 dimensions (width, height and depth).
    – Convolutional layers
    – Pooling layers
    – Fully connected layers

  • Its training is based on backpropagation and stochastic gradient descent.

Figure: 2D neurons (width x height) and 3D neurons (width x height x depth).

The CNN notes are prepared by consulting Lecture 5, CS231.

slide-40
SLIDE 40

Convolutional Layer

  • Local connectivity: each neuron inside a layer is connected to only a small region of the previous layer, called a receptive field.

  • The output of a neuron is a number computed from the outputs of the neurons in the corresponding local region and the weights of the convolutional filter: y = Activation(w^T x + b).

  • Weight sharing: the same filter, of the size of the local region, slides over all spatial locations (in other words, it slides over all the neurons in the layer).

Figure: local connections between layer h-1 and layer h.

slide-41
SLIDE 41

Convolutional Layer

  • We start from the simpler case of 2D neurons.

40

Figure: a layer of 7 x 7 neurons, where each neuron's output is denoted by h_{ij}; this corresponds to a 7 x 7 input. A 3 x 3 filter with weights w_1, …, w_9.

slide-42
SLIDE 42

Convolutional Layer

  • The operation for applying the filter to a local region of neurons:

41

Applying the filter to the top-left 3 x 3 local region of the layer:

x = [h_{11}, h_{12}, h_{13}, h_{21}, h_{22}, h_{23}, h_{31}, h_{32}, h_{33}]^T
w = [w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9]^T

output = Activation(w^T x + b)

A commonly used activation function is ReLU.
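A small NumPy sketch of applying one 3 x 3 filter to the top-left local region of a 7 x 7 layer, as the dot product w^T x + b above. ReLU and the random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal((7, 7))        # outputs h_ij of the previous layer
w = rng.standard_normal((3, 3))        # the 3x3 convolutional filter
b = 0.1                                # the bias of the filter

x = h[0:3, 0:3].ravel()                # x = [h11, h12, h13, h21, ..., h33]
output = np.maximum(w.ravel() @ x + b, 0.0)   # Activation(w^T x + b), with ReLU
print(output)
```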

slide-43
SLIDE 43

Convolutional Layer

  • Slide the filter over all spatial locations.

42

Input: 7 x 7, filter: 3 x 3, stride 1. Output 1.

slide-44
SLIDE 44

Convolutional Layer

  • Slide the filter over all spatial locations.

43

Input: 7 x 7, filter: 3 x 3, stride 1. Output 2.

slide-45
SLIDE 45

Convolutional Layer

  • Slide the filter over all spatial locations.

44

Input: 7 x 7, filter: 3 x 3, stride 1. Output 3.

slide-46
SLIDE 46

Convolutional Layer

  • Slide the filter over all spatial locations.

45

Input: 7 x 7, filter: 3 x 3, stride 1. Output 4.

slide-47
SLIDE 47

Convolutional Layer

  • Slide the filter over all spatial locations.

46

Input: 7 x 7, filter: 3 x 3, stride 1. Output 5.

slide-48
SLIDE 48

Convolutional Layer

  • Slide the filter over all spatial locations.

47

Input: 7 x 7, filter: 3 x 3, stride 1. Move down one row. Output 6.

slide-49
SLIDE 49

Convolutional Layer

  • Slide the filter over all spatial locations.

48

Input: 7 x 7, filter: 3 x 3, stride 1. Output 7.

slide-50
SLIDE 50

Convolutional Layer

  • Slide the filter over all spatial locations.

49

Input: 7 x 7, filter: 3 x 3, stride 1. Output 8.

slide-51
SLIDE 51

Convolutional Layer

  • Slide the filter over all spatial locations.

50

Input: 7 x 7, filter: 3 x 3, stride 1. Output 9.

slide-52
SLIDE 52

Convolutional Layer

  • Slide the filter over all spatial locations.

51

Input: 7 x 7, filter: 3 x 3, stride 1. Output 10.

slide-53
SLIDE 53

Convolutional Layer

  • Slide the filter over all spatial locations.

52

Input: 7 x 7, filter: 3 x 3, stride 1. Following this sliding process, the output is a 5 x 5 activation map.

slide-54
SLIDE 54

Convolutional Layer

  • Slide the filter over all spatial locations.

53

Input: 7 x 7, filter: 3 x 3, stride 2. Output 1.

slide-55
SLIDE 55

Convolutional Layer

  • Slide the filter over all spatial locations.

54

Input: 7 x 7, filter: 3 x 3, stride 2. Output 2.

slide-56
SLIDE 56

Convolutional Layer

  • Slide the filter over all spatial locations.

55

Input: 7 x 7, filter: 3 x 3, stride 2. Output 3.

slide-57
SLIDE 57

Convolutional Layer

  • Slide the filter over all spatial locations.

56

Input: 7 x 7, filter: 3 x 3, stride 2. Output 4.

slide-58
SLIDE 58

Convolutional Layer

  • Slide the filter over all spatial locations.

57

Input: 7 x 7, filter: 3 x 3, stride 2. Output 5.

slide-59
SLIDE 59

Convolutional Layer

  • Slide the filter over all spatial locations.

58

Input: 7 x 7, filter: 3 x 3, stride 2. Output 6.

slide-60
SLIDE 60

Convolutional Layer

  • Slide the filter over all spatial locations.

59

Input: 7 x 7, filter: 3 x 3, stride 2. Following this sliding process, the output is a 3 x 3 activation map.

slide-61
SLIDE 61

Convolutional Layer

  • Slide the filter over all spatial locations.

60

Input: N x N, filter: F x F, stride S.

  • The sliding process results in an activation map of size:

\left( \frac{N - F}{S} + 1 \right) \times \left( \frac{N - F}{S} + 1 \right)

  • You will need to set appropriate filter and stride sizes so that (N - F)/S is an integer.
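A sketch of the full sliding process: a naive 2D convolution with stride S, producing the ((N - F)/S + 1) x ((N - F)/S + 1) activation map from the formula above. The ReLU activation is assumed, and the loop is written for clarity rather than efficiency.

```python
import numpy as np

def conv2d(h, w, b=0.0, stride=1):
    N, F = h.shape[0], w.shape[0]
    out = (N - F) // stride + 1                        # assumes (N - F)/S is an integer
    a = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = h[i*stride:i*stride+F, j*stride:j*stride+F]
            a[i, j] = np.maximum(np.sum(w * patch) + b, 0.0)   # ReLU(w^T x + b)
    return a

h = np.arange(49, dtype=float).reshape(7, 7)
w = np.ones((3, 3))
print(conv2d(h, w, stride=1).shape)   # (5, 5), as in the stride-1 example
print(conv2d(h, w, stride=2).shape)   # (3, 3), as in the stride-2 example
```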

slide-62
SLIDE 62

Convolutional Layer

  • Let’s look at the case of 3D neurons.
  • The resulting activation map based on one convolutional filter is 2D, of size:

\left( \frac{N_1 - F_1}{S} + 1 \right) \times \left( \frac{N_2 - F_2}{S} + 1 \right)

Figure: the red cube is one layer of 3D neurons of size N_1 (width) x N_2 (height) x d (depth); the green cube is a convolutional filter of size F_1 (width) x F_2 (height) x d (depth); S is the stride.

Example: given a layer of 16 x 16 x 3 neurons and a 2 x 2 x 3 convolutional filter, the output of each neuron is denoted by h_{ijk} and the filter weights by w_i. The output of the first neuron in the next layer is computed as follows:

x = [h_{111}, h_{121}, h_{211}, h_{221}, h_{112}, h_{122}, h_{212}, h_{222}, h_{113}, h_{123}, h_{213}, h_{223}]^T
w = [w_1, w_2, \ldots, w_{12}]^T

output = Activation(w^T x + b)

slide-63
SLIDE 63

Convolutional Layer

  • Usually, a set of T (T > 1) convolutional filters is applied to obtain T separate activation maps.

For instance, applying six 5 x 5 x 3 filters to a 32 x 32 x 3 input with stride 1 results in 6 activation maps of size 28 x 28:

\left( \frac{32 - 5}{1} + 1 \right) \times \left( \frac{32 - 5}{1} + 1 \right) = 28 \times 28

slide-64
SLIDE 64

Convolutional Layer

  • Many connected convolutional layers:

63

Different numbers of filters are used in different layers.

slide-65
SLIDE 65

Pooling Layer

  • The pooling layer takes the activation maps returned by a convolutional layer as its input, and reduces the size of each activation map separately.

  • It can be viewed as an operation of down-sampling.

  • A pooling filter slides over all the spatial locations in an activation map in the same way as a 2D convolutional filter.

Example of the max pooling filter: apply a 2 x 2 max pooling filter with stride 2, taking the maximum within each window, e.g. max(1, 3, 4, 1) = 4. Figure: original map 7 x 7, reduced map 3 x 3.
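A sketch of 2 x 2 max pooling with stride 2 over a single activation map, following the same sliding pattern as the convolution above; the 4 x 4 toy map is illustrative.

```python
import numpy as np

def max_pool(a, size=2, stride=2):
    out = (a.shape[0] - size) // stride + 1
    p = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            p[i, j] = a[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return p

a = np.array([[1, 3, 2, 5],
              [4, 1, 2, 1],
              [1, 1, 3, 1],
              [2, 2, 2, 4]], dtype=float)
print(max_pool(a))      # the top-left window gives max(1, 3, 4, 1) = 4
```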

slide-66
SLIDE 66

Fully Connected Layer

  • The output of a convolutional (or pooling) layer is a set of activation maps, stored in a 3D array.

  • The values in the 3D array can be moved to a 1D array. For instance: a 32 x 32 x 5 array becomes a 5120 x 1 array.

  • The fully connected layers are equivalent to a multilayer perceptron (MLP) taking the generated 1D array as the input.

Figure: a 32 x 32 x 5 array of activation maps is flattened into a 5120 x 1 array and fed into an MLP.
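A one-line illustration of moving the 3D array of activation maps into a 1D array before the fully connected (MLP) layers, matching the 32 x 32 x 5 to 5120 x 1 example above.

```python
import numpy as np

maps = np.zeros((32, 32, 5))     # activation maps from the last conv/pooling layer
flat = maps.reshape(-1, 1)       # 1D column vector of length 32*32*5 = 5120
print(flat.shape)                # (5120, 1)
```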

slide-67
SLIDE 67

CNN Architecture

  • A CNN is a sequence of convolutional (and pooling) layers, followed by fully connected layers at the end.

  • Its unique architecture (2D/3D neurons, local connectivity, etc.) makes it suitable for processing 2D/3D data with typical local patterns, particularly images (an n x n x 3 input, where 3 corresponds to the R, G and B channels).

Example image from MathWorks: https://ww2.mathworks.cn/solutions/deep-learning/convolutional-neural-network.html

slide-68
SLIDE 68

Deep Learning

  • Deep learning refers to techniques for learning using neural networks with more hidden layers.

  • Deep learning is considered a kind of representation (feature) learning technique.

The two diagrams are from Figs. 1.5 and 1.4 of the Deep Learning book (I. Goodfellow, et al., 2016).

Example: AlexNet contains a total of 5 convolutional layers and 3 fully connected layers.

slide-69
SLIDE 69

Popular Neural Networks

  • Convolutional neural networks are used to automatically learn a good feature vector for an image from its pixels.
    – NeuralStyle, https://github.com/jcjohnson/neural-style
    – DeepDream, https://deepdreamgenerator.com

  • Recurrent neural networks (RNNs) are particularly useful for learning from sequential data. Each neuron can use its internal memory to maintain information about the previous input. This makes them suitable for processing natural language, speech, music, etc.
    – PoemGenerator, https://github.com/dvictor/lstm-poetry

  • There are neural networks suitable for processing videos, and for joint language/text and image learning.
    – NeuralTalk, http://cs.stanford.edu/people/karpathy/neuraltalk/
    – TalkingMachines, https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Another example: a system that learns from images, sound, etc., https://teachablemachine.withgoogle.com

slide-70
SLIDE 70

Summary

  • We have learned the basics of neural networks:
    – Single-layer perceptron
    – Multi-layer perceptron
    – Back-propagation

  • We have learned the architecture of a convolutional neural network.
  • We have learned the concept of deep learning.

69

slide-71
SLIDE 71

70