COMP24111: Machine Learning and Optimisation
- Dr. Tingting Mu
Chapter 5: Neural Networks and Deep Learning
Email: tingting.mu@manchester.ac.uk

Outline: Single-layer perceptron, the perceptron algorithm. Multi-layer perceptron.
Figure is from http://2centsapiece.blogspot.co.uk/2015/10/identifying-subatomic-particles-with.html
A neuron is an electrically excitable cell that processes and transmits information by electro-chemical signaling.
Input signals sent from other neurons. If enough signals accumulate, the neuron fires a signal. Connection strengths determine how the signals are accumulated.
A neuron computes its output from d input signals as
$$y = \varphi\left(\sum_{i=1}^{d} w_i x_i + b\right)$$

Basic elements of a typical neuron include:
– A set of connections (synapses), each of these is characterised by a weight (strength).
– An adder for summing the input signals, weighted by the respective synapses.
– An activation function, which squashes the permissible amplitude range of the output signal.

Given d inputs, a neuron is modelled by d+1 parameters.
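As an illustration of this d+1 parameter model, the following sketch (Python/NumPy, with made-up inputs and weights, and a tanh activation chosen purely for illustration) computes one neuron's output:

```python
import numpy as np

def neuron_output(x, w, b, activation=np.tanh):
    """Compute the output of a single neuron: phi(sum_i w_i * x_i + b)."""
    v = np.dot(w, x) + b          # weighted sum of the inputs plus bias (the "adder")
    return activation(v)          # squash with the activation function

# Example with d = 3 inputs: the neuron has 3 weights + 1 bias = 4 parameters.
x = np.array([0.5, -1.2, 2.0])    # input signals (made up for illustration)
w = np.array([0.1, 0.4, -0.3])    # synaptic weights (made up)
b = 0.05                          # bias
print(neuron_output(x, w, b))
```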
Common activation functions:

– Threshold: $\varphi(v) = \begin{cases} +1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$
– Sigmoid: $\varphi(v) = \dfrac{1}{1+\exp(-v)} \in (0, +1)$
– Tanh: $\varphi(v) = \dfrac{\exp(2v) - 1}{\exp(2v) + 1} \in (-1, +1)$
– ReLU: $\varphi(v) = \begin{cases} v & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases}$
– Identity: $\varphi(v) = v$

[Figure: plots of the identity, sigmoid, tanh, ReLU and threshold activation functions.]
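A possible NumPy implementation of these activation functions (a sketch for illustration; the tanh form follows the exp(2v) expression above):

```python
import numpy as np

def threshold(v):
    return np.where(v >= 0, 1.0, -1.0)           # +1 if v >= 0, -1 otherwise

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))              # output in (0, +1)

def tanh(v):
    return (np.exp(2*v) - 1) / (np.exp(2*v) + 1) # output in (-1, +1)

def relu(v):
    return np.maximum(0.0, v)                    # v if v >= 0, 0 otherwise

def identity(v):
    return v

v = np.linspace(-3, 3, 7)
for phi in (threshold, sigmoid, tanh, relu, identity):
    print(phi.__name__, phi(v))
```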
[Figure: plot of the identity activation, $\varphi(v) = v$.]

Activation function (threshold):
$$\varphi(v) = \begin{cases} +1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$$

[Figure: plot of the threshold activation function.]
The perceptron algorithm: update using a misclassified sample in each iteration!

Initialise the weights (stored in w(0)) to random numbers in the range −1 to +1.
For t = 1 to NUM_ITERATIONS
    For each training sample (xi, yi)
        Calculate the activation using the current weights (stored in w(t)).
        Update the weights (stored in w(t+1)) by the learning rule.
    end
end

Learning rule, using one misclassified sample in each iteration:
– Correctly classified sample: no change.
– Misclassified sample with target +1: add $+\eta\vec{x}$ to the weights.
– Misclassified sample with target −1: add $-\eta\vec{x}$ to the weights.
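A minimal sketch of the perceptron algorithm above (Python/NumPy), assuming labels in {−1, +1}; the learning rate eta, the iteration count and the toy data are illustrative choices, not values from the slides:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, num_iterations=100, seed=0):
    """Perceptron algorithm: update the weights using misclassified samples."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = rng.uniform(-1, 1, size=d + 1)                  # d weights + 1 bias, initialised in [-1, +1]
    X_ext = np.hstack([X, np.ones((X.shape[0], 1))])    # append 1 for the bias term
    for _ in range(num_iterations):
        for xi, yi in zip(X_ext, y):
            activation = 1 if np.dot(w, xi) >= 0 else -1   # threshold activation
            if activation != yi:                  # misclassified: apply the learning rule
                w += eta * yi * xi                # add +eta*x if yi = +1, -eta*x if yi = -1
            # correctly classified: no change
    return w

# Tiny illustrative dataset (linearly separable, labels in {-1, +1})
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y))
```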
The perceptron algorithm minimises an error function, called the perceptron criterion:
$$O(\mathbf{w}) = -\sum_{i \in \text{Misclassified Set}} y_i \, \mathbf{w}^T \vec{x}_i$$
The goal is to reduce the number of misclassified samples, therefore to minimise the above error penalty.

If a sample is correctly classified, it receives an error penalty of zero; if incorrectly classified, it receives an error penalty of the following quantity:
$$O_i(\mathbf{w}) = -y_i \, \mathbf{w}^T \vec{x}_i \;\Rightarrow\; \frac{\partial O_i(\mathbf{w})}{\partial \mathbf{w}} = -y_i \vec{x}_i$$
The single-layer perceptron, with an input node for each feature, an adder and an activation function:
$$y = \varphi\left(\sum_{i=1}^{d} w_i x_i + b\right)$$
It has only one layer (the input layer), and is called a single-layer perceptron.
Multi-layer perceptron: the network contains an input layer, one or more hidden layers, and an output layer.

[Figure: example networks with an input layer, hidden layer 1, hidden layer 2 and an output layer.]

– The number of neurons in the input layer is equal to the number of input features.
– The number of hidden layers, and the number of neurons in each hidden layer, are hyperparameters to be set.
– The number of neurons in the output layer depends on the task to be solved.
Example: a network with one hidden layer and an output layer that takes 9 input features and returns 2 output variables (9 input neurons in the input layer, 2 output neurons in the output layer), with 4 neurons in the hidden layer.

For neuron j in the hidden layer (j = 1, 2, 3, 4), for the n-th training sample:
$$z_j(n) = \varphi\left(\sum_{i=1}^{9} w_{ij}^{(h)} x_i(n) + b_j^{(h)}\right)$$

For neuron k in the output layer (k = 1, 2), for the n-th training sample:
$$y_k(n) = \varphi\left(\sum_{j=1}^{4} w_{jk}^{(o)} z_j(n) + b_k^{(o)}\right)$$

This is the feed-forward information flow used when computing the output variables.

How many weights in total?
– Each hidden neuron has 9 + 1 = 10 weights; each output neuron has 4 + 1 = 5 weights.
– Hidden layer: 10 × 4 = 40 weights; output layer: 5 × 2 = 10 weights.
– A total of 40 + 10 = 50 weights to be learned (including the bias parameters).
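A feed-forward pass for this 9-4-2 network could be sketched as follows (NumPy); the sigmoid activation and the random weights are illustrative assumptions, and each bias is stored as an extra weight so the parameter count matches the 40 + 10 = 50 above:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
W_h = rng.uniform(-1, 1, size=(4, 10))   # hidden layer: 4 neurons x (9 weights + 1 bias) = 40 parameters
W_o = rng.uniform(-1, 1, size=(2, 5))    # output layer: 2 neurons x (4 weights + 1 bias) = 10 parameters

def forward(x):
    """Feed-forward pass: 9 inputs -> 4 hidden neurons -> 2 outputs."""
    x_ext = np.append(x, 1.0)            # append 1 for the hidden-layer bias
    z = sigmoid(W_h @ x_ext)             # z_j(n) = phi(sum_i w_ij x_i(n) + bias)
    z_ext = np.append(z, 1.0)            # append 1 for the output-layer bias
    y = sigmoid(W_o @ z_ext)             # y_k(n) = phi(sum_j w_jk z_j(n) + bias)
    return y

x = rng.uniform(0, 1, size=9)            # one sample with 9 input features (made up)
print(forward(x))                        # 2 output variables
```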
[Figure: input layer, hidden layers 1–3 and a prediction layer (new output layer); the original features x are mapped to new features φ(x, W_NN).]

A neural network can be viewed as a powerful feature extractor to compute an effective representation for the sample, which helps the prediction task.

The prediction layer is trained by minimising a loss computed on the extracted features, Loss(φ(x, W_NN)), for example:
– Sum-of-squares error (as used by the least squares model, Chapter 2)
– A mixture of sum-of-squares error and a regularisation term (as used by the regularised least squares model, Chapter 2)
– …
Regression with neural network features:

[Figure: input layer and hidden layers 1–3; the original features x = (x1, …, xd) are mapped to new features z = φ(x, W_NN) = (z1, …, zD), which feed a least squares model (W, b).]

The loss for one sample is
$$\mathrm{Loss}\big(\varphi(\mathbf{x}, W_{NN})\big) = \left\| \mathbf{W}^T \mathbf{z} + \mathbf{b} - \mathbf{y} \right\|_2^2, \quad \text{where } \mathbf{z} = \varphi(\mathbf{x}, W_{NN}).$$

Train the network over N training samples:
$$O = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{W}^T \varphi(\mathbf{x}_i, W_{NN}) + \mathbf{b} - \mathbf{y}_i \right\|_2^2$$
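To make the objective concrete, here is a small sketch that evaluates the averaged sum-of-squares loss over N samples, assuming the new features z_i = φ(x_i, W_NN) have already been computed by the network (all arrays below are made up):

```python
import numpy as np

def sum_of_squares_objective(Z, Y, W, b):
    """Average of ||W^T z_i + b - y_i||_2^2 over N training samples.
    Z: N x D matrix of extracted features, Y: N x m matrix of targets."""
    predictions = Z @ W + b                       # each row is W^T z_i + b
    residuals = predictions - Y
    return np.mean(np.sum(residuals**2, axis=1))  # (1/N) * sum_i ||.||_2^2

rng = np.random.default_rng(0)
N, D, m = 5, 3, 2                # 5 samples, 3 extracted features, 2 targets (illustrative)
Z = rng.normal(size=(N, D))      # stand-in for phi(x_i, W_NN)
Y = rng.normal(size=(N, m))
W = rng.normal(size=(D, m))
b = np.zeros(m)
print(sum_of_squares_objective(Z, Y, W, b))
```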
Binary classification: model the class posterior probabilities.
– $p(y = 1 \mid \mathbf{x})$: given an observed sample x, the probability it is from class 1.
– $p(y = 0 \mid \mathbf{x})$: given an observed sample x, the probability it is from class 0.

This can be done by using the logistic sigmoid function:
$$p(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \vec{x}), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)} \;\; \text{(the logistic sigmoid function)}.$$
Multi-class classification: model the probability that a sample belongs to class k (k = 1, 2, …, c) based on a linear prediction function.
– $p(y = k \mid \mathbf{x})$: given an observed sample x, the probability it is from class k.
– We model it by a softmax function:
$$p(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^T \vec{x})}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^T \vec{x})}$$
– Construct c different linear functions for the c classes.
The cross-entropy loss:

– For binary classification:
$$O(\mathbf{w}) = -\sum_{i=1}^{N} y_i \log p(y_i = 1 \mid \mathbf{x}_i) - \sum_{i=1}^{N} (1 - y_i) \log\big(1 - p(y_i = 1 \mid \mathbf{x}_i)\big) = -\sum_{i=1}^{N} y_i \log \sigma(\mathbf{w}^T \vec{x}_i) - \sum_{i=1}^{N} (1 - y_i) \log\big(1 - \sigma(\mathbf{w}^T \vec{x}_i)\big)$$

– For multi-class classification:
$$O(\mathbf{W}) = -\sum_{i=1}^{N} \sum_{k=1}^{c} y_{ik} \log p(y_{ik} = 1 \mid \mathbf{x}_i) = -\sum_{i=1}^{N} \sum_{k=1}^{c} y_{ik} \log \frac{\exp(\mathbf{w}_k^T \vec{x}_i)}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^T \vec{x}_i)}$$

A linear classifier trained using the cross-entropy loss is called logistic regression.
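A sketch of the multi-class cross-entropy loss for a linear softmax classifier (NumPy); subtracting the row-wise maximum is a standard numerical-stability trick, not something required by the formulas above:

```python
import numpy as np

def softmax(scores):
    """p(y=k|x) = exp(w_k^T x) / sum_j exp(w_j^T x), computed row-wise."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # stability trick
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(X, Y_onehot, W):
    """O(W) = - sum_i sum_k y_ik log p(y_ik = 1 | x_i) for a linear softmax classifier."""
    P = softmax(X @ W)                     # N x c matrix of class probabilities
    return -np.sum(Y_onehot * np.log(P))

rng = np.random.default_rng(0)
N, d, c = 6, 4, 3                          # illustrative sizes
X = rng.normal(size=(N, d))                # rows are the input samples
W = rng.normal(size=(d, c))                # one weight vector per class
Y_onehot = np.eye(c)[rng.integers(0, c, size=N)]   # one-hot targets y_ik
print(cross_entropy(X, Y_onehot, W))
```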
Binary classification with neural network features: use the sigmoid function to build the prediction layer.

[Figure: input layer and hidden layers 1–3; the original features x = (x1, …, xd) are mapped to new features z = φ(x, W_NN) = (z1, …, zD).]

$$p(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \vec{z}) = \frac{1}{1 + \exp(-\mathbf{w}^T \vec{z})}$$
gives the probability that the sample is from a class.
Multi-class classification with neural network features: use the softmax function to build the prediction layer.

[Figure: input layer and hidden layers 1–3; the original features x are mapped to new features z = φ(x, W_NN). The prediction layer has one weight vector per class, w1, w2, …, wc (e.g. the classes red, green and purple).]

The class probabilities are
$$p(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^T \vec{z})}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^T \vec{z})}, \qquad k = 1, 2, \ldots, c.$$
Back-propagation.

[Figure: a network with an input layer, hidden layers 1–3 and a prediction layer; hidden layer 3 outputs the new features z.]

The objective depends on the input and on the weights of every layer:
$$O\big(\mathbf{x}, W^{(h_1)}, W^{(h_2)}, W^{(h_3)}, W^{p}\big)$$
and we need its gradients with respect to all of them:
$$\frac{\partial O}{\partial W^{(h_1)}},\; \frac{\partial O}{\partial W^{(h_2)}},\; \frac{\partial O}{\partial W^{(h_3)}},\; \frac{\partial O}{\partial W^{p}} = ?$$

The feed-forward computation links the layers:
$$\mathbf{z} = \mathbf{z}\big(\mathbf{z}^{(h_2)}, W^{(h_3)}\big), \qquad \mathbf{z}^{(h_2)} = \mathbf{z}^{(h_2)}\big(\mathbf{z}^{(h_1)}, W^{(h_2)}\big), \qquad \mathbf{z}^{(h_1)} = \mathbf{z}^{(h_1)}\big(\mathbf{x}, W^{(h_1)}\big).$$

Apply the chain rule, working backwards from the output layer:
$$\frac{\partial O}{\partial W^{(h_3)}} = \frac{\partial O}{\partial \mathbf{z}} \times \frac{\partial \mathbf{z}}{\partial W^{(h_3)}}$$
$$\frac{\partial O}{\partial W^{(h_2)}} = \frac{\partial O}{\partial \mathbf{z}} \times \frac{\partial \mathbf{z}}{\partial \mathbf{z}^{(h_2)}} \times \frac{\partial \mathbf{z}^{(h_2)}}{\partial W^{(h_2)}}$$
$$\frac{\partial O}{\partial W^{(h_1)}} = \frac{\partial O}{\partial \mathbf{z}} \times \frac{\partial \mathbf{z}}{\partial \mathbf{z}^{(h_2)}} \times \frac{\partial \mathbf{z}^{(h_2)}}{\partial \mathbf{z}^{(h_1)}} \times \frac{\partial \mathbf{z}^{(h_1)}}{\partial W^{(h_1)}}$$
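The chain-rule factorisation can be checked on a tiny network. The sketch below (NumPy) assumes two hidden layers with tanh activations and a squared-error objective, all illustrative choices, and propagates ∂O/∂z backwards exactly as in the equations above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))     # hidden layer 1 weights
W2 = rng.normal(size=(2, 4))     # hidden layer 2 weights
Wp = rng.normal(size=(1, 2))     # prediction layer weights
y_target = np.array([0.5])

# Forward pass: z1 = z1(x, W1), z2 = z2(z1, W2), output from (z2, Wp)
z1 = np.tanh(W1 @ x)
z2 = np.tanh(W2 @ z1)
y = Wp @ z2
O = 0.5 * np.sum((y - y_target) ** 2)              # objective O(x, W1, W2, Wp)

# Backward pass: apply the chain rule from the output towards the input
dO_dy = y - y_target
dO_dWp = np.outer(dO_dy, z2)                       # dO/dWp
dO_dz2 = Wp.T @ dO_dy                              # dO/dz2
dO_dW2 = np.outer(dO_dz2 * (1 - z2**2), z1)        # dO/dW2 = dO/dz2 * dz2/dW2
dO_dz1 = W2.T @ (dO_dz2 * (1 - z2**2))             # dO/dz1 = dO/dz2 * dz2/dz1
dO_dW1 = np.outer(dO_dz1 * (1 - z1**2), x)         # dO/dW1 = dO/dz1 * dz1/dW1

print(float(O), dO_dWp.shape, dO_dW2.shape, dO_dW1.shape)
```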
Convolutional neural networks: neurons are arranged in 3D volumes (width × height × depth) rather than in flat 2D layers.

[Figure: 2D neurons vs. 3D neurons; connections from layer h−1 to layer h.]

The CNN notes are prepared by consulting Lecture 5, CS231.
[Figure: a layer of 7×7 neurons, h11 … h77, and a 3×3 filter with weights w1 … w9.]

A layer of 7×7 neurons corresponds to a 7×7 input data; a 3×3 filter is applied to it.

Applying the filter to a local region: collect the local outputs and the filter weights as
$$\mathbf{x} = [h_{11}, h_{12}, h_{13}, h_{21}, h_{22}, h_{23}, h_{31}, h_{32}, h_{33}]^T, \qquad \mathbf{w} = [w_1, w_2, \ldots, w_9]^T,$$
and compute the filter response $\varphi(\mathbf{w}^T \mathbf{x})$. A commonly used activation function is ReLU.
Convolution with stride 1. Input: 7×7, filter: 3×3.

Slide the 3×3 filter over the 7×7 input one position at a time, moving right along each row and then down to the next row; each position produces one output (Output 1, Output 2, …).

Following this sliding process, the output is a 5×5 activation map.
Convolution with stride 2. Input: 7×7, filter: 3×3.

Slide the 3×3 filter over the 7×7 input, moving two positions at a time; each position produces one output.

Following this sliding process, the output is a 3×3 activation map.
In general, for an N×N input, an F×F filter and stride S, the output activation map has size
$$\left(\frac{N - F}{S} + 1\right) \times \left(\frac{N - F}{S} + 1\right).$$
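A small sketch (NumPy) that slides an F×F filter over an N×N input with stride S and confirms the output-size formula; the input and filter values are made up:

```python
import numpy as np

def convolve2d(inp, filt, stride):
    """Slide an FxF filter over an NxN input with the given stride (no padding)."""
    N, F = inp.shape[0], filt.shape[0]
    out_size = (N - F) // stride + 1              # (N - F)/S + 1 positions per dimension
    out = np.zeros((out_size, out_size))
    for r in range(out_size):
        for c in range(out_size):
            region = inp[r*stride:r*stride+F, c*stride:c*stride+F]
            out[r, c] = np.sum(region * filt)     # dot product of filter and local region
    return out

inp = np.arange(49, dtype=float).reshape(7, 7)    # 7x7 input (illustrative values)
filt = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter (illustrative)
print(convolve2d(inp, filt, stride=1).shape)      # (5, 5) activation map
print(convolve2d(inp, filt, stride=2).shape)      # (3, 3) activation map
```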
For 3D neurons:

[Figure: the red cube is one layer of 3D neurons, N1 (width) × N2 (height) × d (depth); the green cube is a convolutional filter, F1 (width) × F2 (height) × d (depth), applied with stride S.]

Example: given a layer of 16 × 16 × 3 neurons and a 2 × 2 × 3 convolutional filter, denote the neuron outputs by h_ijk and the filter weights by w_i. For instance, the output of the first neuron in the next layer is computed from
$$\mathbf{x} = \left[h_{111}, h_{121}, h_{211}, h_{221}, h_{112}, h_{122}, h_{212}, h_{222}, h_{113}, h_{123}, h_{213}, h_{223}\right]^T,$$
$$\mathbf{w} = \left[w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9, w_{10}, w_{11}, w_{12}\right]^T,$$
as $\varphi(\mathbf{w}^T \mathbf{x})$.
For instance, when applying six 5 x 5 x 3 filters to a 32 x 32 x 3 input with stride 1, it results in 6 activation maps of size 28 x 28.
Different numbers of filters are used in different layers.
An example: apply a 2 × 2 max pooling filter with stride 2 to a 7 × 7 map; each output is the maximum of the values in its 2 × 2 region, e.g. max(1, 3, 4, 1) = 4. The result is a 3 × 3 reduced map.

[Figure: the 7 × 7 map of values and the 3 × 3 reduced map produced by the pooling filter.]
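A matching sketch of 2×2 max pooling with stride 2 (NumPy); the toy 7×7 input is randomly generated, so only the mechanics correspond to the example above:

```python
import numpy as np

def max_pool(inp, pool=2, stride=2):
    """Apply max pooling: each output is the maximum over a pool x pool region."""
    N = inp.shape[0]
    out_size = (N - pool) // stride + 1
    out = np.zeros((out_size, out_size))
    for r in range(out_size):
        for c in range(out_size):
            region = inp[r*stride:r*stride+pool, c*stride:c*stride+pool]
            out[r, c] = region.max()              # e.g. max(1, 3, 4, 1) = 4
    return out

rng = np.random.default_rng(0)
inp = rng.integers(1, 7, size=(7, 7)).astype(float)   # 7x7 map of values (made up)
print(max_pool(inp))                                   # 3x3 reduced map
```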
[Figure: an example convolutional neural network architecture whose final stage is an MLP (fully connected layers). Example image from MathWorks: https://ww2.mathworks.cn/solutions/deep-learning/convolutional-neural-network.html]
Deep learning refers to learning using neural networks with more hidden layers; deep learning methods are representation (feature) learning techniques.

The two diagrams are from Figs. 1.5 and 1.4 of the Deep Learning book (I. Goodfellow, et al. 2016).

Example: AlexNet contains a total of 5 convolutional layers and 3 fully connected layers.
– NeuralStyle, https://github.com/jcjohnson/neural-style
– DeepDream, https://deepdreamgenerator.com
– PoemGenerator, https://github.com/dvictor/lstm-poetry
– NeuralTalk, http://cs.stanford.edu/people/karpathy/neuraltalk/
– TalkingMachines, https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Another example: a system that learns from images, sound, etc., https://teachablemachine.withgoogle.com
Summary:
– Single-layer perceptron
– Multi-layer perceptron
– Back-propagation