Introduction to Neural Networks
I2DL: Prof. Niessner, Prof. Leal-Taixé

Lecture 2 Recap

Linear Regression
= a supervised learning method to find a linear model of the form

Goal: find a model that explains a target $y$ given the input $x$:

$\hat{y}_i = \theta_0 + \sum_{k=1}^{d} x_{ik}\,\theta_k = \theta_0 + x_{i1}\theta_1 + x_{i2}\theta_2 + \dots + x_{id}\theta_d$

where $\theta_0$ is the bias.
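As a quick illustration, a minimal NumPy sketch of this prediction (function and variable names are our own, not from the slides):

```python
import numpy as np

def linear_predict(X, theta, theta_0):
    """X: (n, d) inputs, theta: (d,) weights, theta_0: scalar bias."""
    return theta_0 + X @ theta  # y_hat_i = theta_0 + sum_k x_ik * theta_k

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(linear_predict(X, np.array([0.5, -1.0]), 0.1))  # one prediction per row
```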
Logistic Regression: Minimization

Per-sample loss (binary cross-entropy):

$\mathcal{L}(y_i, \hat{y}_i) = -[\,y_i \cdot \log \hat{y}_i + (1 - y_i) \cdot \log(1 - \hat{y}_i)\,]$

Cost over the dataset:

$\mathcal{L}(\boldsymbol{\theta}) = -\sum_{i=1}^{n} \big(y_i \cdot \log \hat{y}_i + (1 - y_i) \cdot \log[1 - \hat{y}_i]\big)$

with predictions $\hat{y}_i = \sigma(\boldsymbol{x}_i \boldsymbol{\theta})$ (sigmoid).
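A minimal NumPy sketch of the sigmoid and this cost (the clipping against log(0) is our own numerical safeguard, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # numerical safety: avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

scores = np.array([2.0, -1.0, 0.5])  # x_i @ theta for three samples
y = np.array([1.0, 0.0, 1.0])
print(bce_cost(y, sigmoid(scores)))
```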
[Plot: data with labels y=1 and y=0; sigmoid fit vs. linear fit]
Logistic regression: predictions are guaranteed to be within $[0, 1]$. Linear regression: predictions can exceed that range
→ for classification in $[0, 1]$ this becomes a real issue
The general setup: data points $\boldsymbol{x}$, model parameters $\boldsymbol{\theta}$, estimation $\hat{\boldsymbol{y}}$, labels (ground truth) $\boldsymbol{y}$, a loss function comparing $\hat{\boldsymbol{y}}$ to $\boldsymbol{y}$, and optimization.
$\hat{y}_i = \sum_{k} w_{i,k}\, x_k \quad\Leftrightarrow\quad \hat{\boldsymbol{y}} = \boldsymbol{W}\boldsymbol{x}$ (matrix notation)
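In code this is a single matrix-vector product; a tiny NumPy sketch with shapes chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(10, 4))  # e.g. 10 outputs, 4 input features
x = rng.normal(size=4)
y_hat = W @ x                        # y_hat_i = sum_k w_ik * x_k, all at once
```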
[Figure: learned weights visualized on CIFAR-10 and on ImageNet]
Source: Li/Karpathy/Johnson
Logistic Regression: linear separation impossible! [Figure: two classes that no single line can separate]
– Multiply with another weight matrix $\boldsymbol{W_2}$:
$\hat{\boldsymbol{y}} = \boldsymbol{W_2} \cdot \hat{\boldsymbol{y}} = \boldsymbol{W_2} \cdot \boldsymbol{W_1} \cdot \boldsymbol{x}$
$\boldsymbol{W} = \boldsymbol{W_2} \cdot \boldsymbol{W_1} \;\Rightarrow\; \hat{\boldsymbol{y}} = \boldsymbol{W}\boldsymbol{x}$ — still a linear model!
– 2-layers: $\hat{\boldsymbol{y}} = \boldsymbol{W_2} \max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x})$
– 3-layers: $\hat{\boldsymbol{y}} = \boldsymbol{W_3} \max(\boldsymbol{0}, \boldsymbol{W_2} \max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x}))$
– 4-layers: $\hat{\boldsymbol{y}} = \boldsymbol{W_4} \tanh(\boldsymbol{W_3} \max(\boldsymbol{0}, \boldsymbol{W_2} \max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x})))$
– 5-layers: $\hat{\boldsymbol{y}} = \boldsymbol{W_5}\, \sigma(\boldsymbol{W_4} \tanh(\boldsymbol{W_3} \max(\boldsymbol{0}, \boldsymbol{W_2} \max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x}))))$
– … up to hundreds of layers (a minimal sketch of this nesting follows below)
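A NumPy sketch of the 2- and 3-layer compositions; the layer sizes are hypothetical, chosen only to make the nesting concrete:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
x  = rng.normal(size=64)                # input
W1 = 0.01 * rng.normal(size=(100, 64))
W2 = 0.01 * rng.normal(size=(100, 100))
W3 = 0.01 * rng.normal(size=(10, 100))

y2 = W2 @ relu(W1 @ x)                  # 2 layers
y3 = W3 @ relu(W2 @ relu(W1 @ x))       # 3 layers: one more matrix + nonlinearity
```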
Source: http://beamlab.org/deeplearning/2017/02/23/deep_learning_101_part1.html
Logistic Regression vs. Neural Networks
On CIFAR-10: visualizing the activations of the first layer.
Source: ConvNetJS
1-layer network: $\hat{\boldsymbol{y}} = \boldsymbol{W}\boldsymbol{x}$
[Diagram: input $\boldsymbol{x}$ with $128 \times 128 = 16384$ values → $\boldsymbol{W}$ → output $\hat{\boldsymbol{y}}$ with 10 values]
Why is this structure useful?
2-layer network: $\hat{\boldsymbol{y}} = \boldsymbol{W_2} \max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x})$
[Diagram: input $\boldsymbol{x}$ ($128 \times 128 = 16384$) → $\boldsymbol{W_1}$ → hidden layer (1000) → $\boldsymbol{W_2}$ → output $\hat{\boldsymbol{y}}$ (10)]
2-layer network: $\hat{\boldsymbol{y}} = \boldsymbol{W_2} \max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x})$
[Diagram as above, with the layers named: input layer ($128 \times 128 = 16384$), hidden layer (1000), output layer (10)]
[Diagram: inputs $x_1, x_2, x_3$ feed a hidden layer of neurons $g(\boldsymbol{w}_{0,i}\,\boldsymbol{x} + b_{0,i})$, $i = 0, \dots, 3$, followed by a layer of neurons $g(\boldsymbol{w}_{1,j}\,\boldsymbol{x} + b_{1,j})$, $j = 0, 1, 2$, and an output neuron $g(\boldsymbol{w}_{2,0}\,\boldsymbol{x} + b_{2,0})$; each neuron applies an affine map followed by the activation $g$]
Source: https://towardsdatascience.com/training-deep-neural-networks-9fdb1964b964
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
tanh: $\tanh(x)$
ReLU: $\max(0, x)$
Leaky ReLU: $\max(0.1x, x)$
Parametric ReLU: $\max(\alpha x, x)$
Maxout: $\max(\boldsymbol{w}_1^T \boldsymbol{x} + b_1, \boldsymbol{w}_2^T \boldsymbol{x} + b_2)$
ELU: $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \le 0 \end{cases}$
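Most of these are one-liners in NumPy; a minimal sketch (tanh is built in, and maxout needs two affine maps, so it is only noted in a comment):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def prelu(x, alpha):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# tanh: np.tanh(x); maxout: np.maximum(w1 @ x + b1, w2 @ x + b2)
```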
$\hat{\boldsymbol{y}} = \boldsymbol{W_3} \cdot (\boldsymbol{W_2} \cdot (\boldsymbol{W_1} \cdot \boldsymbol{x}))$
Why activation functions? Simply concatenating linear layers would be so much cheaper — but the stack collapses into a single linear map, as the toy check below shows.
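A quick numerical check of that collapse (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=5)
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(3, 4))
W3 = rng.normal(size=(2, 3))

deep   = W3 @ (W2 @ (W1 @ x))     # three "layers" without activations
single = (W3 @ W2 @ W1) @ x       # one collapsed matrix W = W3 W2 W1
print(np.allclose(deep, single))  # True: stacking linear layers adds nothing
```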
Why organize a neural network into layers?
Credit: Stanford CS 231n
Credit: Stanford CS 231n
Artificial neural networks are inspired by the brain, but not even close in terms of complexity! The comparison is great for the media and news articles, however...
[Diagram repeated from before: inputs $x_1, x_2, x_3$ feeding layers of neurons $g(\boldsymbol{w}_{l,i}\,\boldsymbol{x} + b_{l,i})$]
– Given a dataset with ground-truth training pairs $\{\boldsymbol{x}_i; \boldsymbol{y}_i\}$,
– Find optimal weights $\boldsymbol{W}$ using stochastic gradient descent, such that the loss function is minimized (more later).
Computational graphs consist of compute nodes:
log(), exp() …
Example: $f(x, y, z) = (x + y) \cdot z$ as a compute graph with a sum node and a mult node.
Initialization: $x = 1$, $y = -3$, $z = 4$
sum node: $d = x + y = -2$
mult node: $f = d \cdot z = -8$
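The forward pass through this toy graph, node by node:

```python
# f(x, y, z) = (x + y) * z as two compute nodes
x, y, z = 1.0, -3.0, 4.0
d = x + y   # sum node:  d = -2
f = d * z   # mult node: f = -8
print(d, f)
```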
The same idea scales up: a deep network such as
$\hat{\boldsymbol{y}} = \boldsymbol{W_5}\,\sigma(\boldsymbol{W_4}\tanh(\boldsymbol{W_3}\max(\boldsymbol{0}, \boldsymbol{W_2}\max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x}))))$
is also just a compute graph of sums, products, and activations.
A neural network can be represented as a computational graph...
– it has compute nodes (operations) – it has edges that connect nodes (data flow) – it is directional – it can be organized into ‘layers’
$z_i^{(2)} = \sum_j x_j\, w_{ji}^{(2)} + b_i^{(2)} \qquad a_i^{(2)} = g\big(z_i^{(2)}\big) \qquad z_i^{(3)} = \sum_j a_j^{(2)}\, w_{ji}^{(3)} + b_i^{(3)} \qquad \dots$

[Diagram: inputs $x_1, x_2, x_3$ and a $+1$ bias unit connect via weights $w_{ji}^{(2)}$ and biases $b_i^{(2)}$ to hidden units $z_1^{(2)}, z_2^{(2)}, z_3^{(2)}$ with activations $a_i^{(2)} = g(z_i^{(2)})$, which connect via weights $w_{ji}^{(3)}$, biases $b_i^{(3)}$, and another $+1$ unit to outputs $z_1^{(3)}, z_2^{(3)}$]
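This forward pass is a few lines of NumPy; a minimal sketch (the layer sizes follow the diagram, and using the sigmoid for $g$ is our own choice):

```python
import numpy as np

def g(z):                      # activation, e.g. sigmoid
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=3)        # inputs x_1..x_3
W2 = rng.normal(size=(3, 3))   # weights w^(2), one row per hidden unit
b2 = rng.normal(size=3)
W3 = rng.normal(size=(2, 3))   # weights w^(3)
b3 = rng.normal(size=2)

z2 = W2 @ x + b2               # z_i^(2) = sum_j x_j w_ji^(2) + b_i^(2)
a2 = g(z2)                     # a_i^(2) = g(z_i^(2))
z3 = W3 @ a2 + b3              # z_i^(3) = sum_j a_j^(2) w_ji^(3) + b_i^(3)
```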
[Szegedy et al.,CVPR’15] Going Deeper with Convolutions
Underlying meanings:
– The multiplication of $\boldsymbol{W_i}$ and $\boldsymbol{x}$: encodes the input information
– The activation function: selects the key features
Source: https://www.zybuluo.com/liuhui0803/note/981434
Underlying meanings:
– The convolutional layers: extract useful features with shared weights
Source: https://www.zcfy.cc/original/understanding-convolutions-colah-s-blog
Underlying meanings:
– The convolutional layers: extract useful features with shared weights
Source: https://www.zybuluo.com/liuhui0803/note/981434
[Diagram: Inputs → Neural Network → Outputs, compared against Targets]
Are these reasonably close? We need a way to describe how close the network's outputs are to the targets.
Idea: calculate a ‘distance’ between prediction and target!
[Figure: a prediction far from the target → large distance, bad prediction; a prediction near the target → small distance, good prediction]
A loss function measures the goodness of the predictions (or equivalently, the network's performance). Intuitively, ...
– a large loss indicates bad predictions/performance (→ performance needs to be improved by training the model)
– the choice of the loss function depends on the concrete problem or the distribution of the target variable
L1 loss: $L(\boldsymbol{y}, \hat{\boldsymbol{y}}; \boldsymbol{\theta}) = \frac{1}{n}\sum_{i}^{n} \lVert y_i - \hat{y}_i \rVert_1$

MSE (L2) loss: $L(\boldsymbol{y}, \hat{\boldsymbol{y}}; \boldsymbol{\theta}) = \frac{1}{n}\sum_{i}^{n} \lVert y_i - \hat{y}_i \rVert_2^2$
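Both losses in a short NumPy sketch, with a toy comparison:

```python
import numpy as np

def l1_loss(y, y_hat):
    return np.mean(np.abs(y - y_hat))   # mean absolute error

def mse_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)    # mean squared error

y     = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.8, 3.5])
print(l1_loss(y, y_hat), mse_loss(y, y_hat))
```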
Yes! (0.8) No! (0.2) The network predicts the probability of the input belonging to the "yes" class!
Binary cross-entropy:
$L(\boldsymbol{y}, \hat{\boldsymbol{y}}; \boldsymbol{\theta}) = -\sum_{i=1}^{n} \big(y_i \cdot \log \hat{y}_i + (1 - y_i) \cdot \log[1 - \hat{y}_i]\big)$
Cross-entropy = loss function for multi-class classification
[Example: predicted class probabilities dog (0.1), rabbit (0.2), duck (0.7), …]
$L(\boldsymbol{y}, \hat{\boldsymbol{y}}; \boldsymbol{\theta}) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \cdot \log \hat{y}_{ik}$
This generalizes the binary case from the slide before!
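A minimal NumPy sketch of this loss on the duck example above (the clip against log(0) is our own safeguard):

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Y, Y_hat: (n, K) one-hot targets and predicted class probabilities."""
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)))

Y     = np.array([[0, 0, 1]])        # ground truth: duck
Y_hat = np.array([[0.1, 0.2, 0.7]])  # dog, rabbit, duck
print(cross_entropy(Y, Y_hat))       # = -log(0.7)
```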
– minimize the loss ⇔ find better predictions $\hat{\boldsymbol{y}}$
– predictions are generated by the NN
– find better predictions ⇔ find a better NN
[Figures: prediction vs. targets at training times $t_1$, $t_2$, $t_3$ — the initially bad prediction improves and the loss decreases over training time]
[Plot: loss $L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x}))$ as a function of the model parameters $\boldsymbol{\theta}$]
Plotting the loss against the model parameters.
Optimization! We train compute graphs with optimization techniques:
– prediction: $\hat{\boldsymbol{y}} = f_{\boldsymbol{\theta}}(\boldsymbol{x})$
– loss: $L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x}))$
– minimize the loss w.r.t. $\boldsymbol{\theta}$
Gradient Descent

[Plot: loss $L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x}))$ over parameters $\boldsymbol{\theta}$]

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x}))$ — minimize the loss w.r.t. $\boldsymbol{\theta}$

Follow the negative direction of the gradient $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x}))$:

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha\, \nabla_{\boldsymbol{\theta}} L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x}))$

where $\alpha$ is the learning rate (update sketched below).
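The update rule as a generic Python loop, applied to a toy 1-D loss (our own example, not from the slides):

```python
def gradient_descent(theta, grad_fn, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad L(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# toy loss L(theta) = (theta - 3)^2 with gradient 2*(theta - 3)
print(gradient_descent(0.0, lambda t: 2 * (t - 3)))  # -> approx. 3
```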
For linear regression: $f_{\boldsymbol{\theta}}(\boldsymbol{x}) = \boldsymbol{W}\boldsymbol{x}$ with $\boldsymbol{\theta} = \boldsymbol{W}$ (later: $\boldsymbol{\theta} = \{\boldsymbol{W}, \boldsymbol{b}\}$)

$L(\boldsymbol{y}, \hat{\boldsymbol{y}}; \boldsymbol{\theta}) = \frac{1}{n}\sum_{i}^{n} \lVert y_i - \hat{y}_i \rVert_2^2$
$L(\boldsymbol{y}; \boldsymbol{\theta}) = \frac{1}{n}\sum_{i}^{n} \lVert y_i - \boldsymbol{W}\boldsymbol{x}_i \rVert_2^2$
[Compute graph: $\boldsymbol{x}$ → Multiply ($\boldsymbol{W}$) → $\hat{\boldsymbol{y}}$ → $L$; the gradient flows backwards through the graph]
With $f_{\boldsymbol{\theta}}(\boldsymbol{x}) = \boldsymbol{W}\boldsymbol{x}$, $\boldsymbol{\theta} = \boldsymbol{W}$:

$L(\boldsymbol{y}; \boldsymbol{\theta}) = \frac{1}{n}\sum_{i}^{n} \lVert \boldsymbol{W}\boldsymbol{x}_i - \boldsymbol{y}_i \rVert_2^2$

$\nabla_{\boldsymbol{W}} L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x})) = \frac{1}{n}\sum_{i}^{n} 2\,(\boldsymbol{W}\boldsymbol{x}_i - \boldsymbol{y}_i)\,\boldsymbol{x}_i^{T}$

(the factor 2 comes from differentiating the square)
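Putting the analytic gradient into the update rule; a NumPy sketch on synthetic data (data, shapes, and learning rate are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # n = 50 inputs x_i
W_true = rng.normal(size=(2, 3))
Y = X @ W_true.T                        # targets y_i = W_true x_i

W = np.zeros((2, 3))
for _ in range(500):
    R = X @ W.T - Y                     # residuals W x_i - y_i
    grad = 2.0 / len(X) * R.T @ X       # (1/n) sum_i 2 (W x_i - y_i) x_i^T
    W -= 0.1 * grad                     # gradient descent step

print(np.allclose(W, W_true, atol=1e-3))  # True: W_true is recovered
```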
$\hat{\boldsymbol{y}} = \boldsymbol{W_5}\,\sigma(\boldsymbol{W_4}\tanh(\boldsymbol{W_3}\max(\boldsymbol{0}, \boldsymbol{W_2}\max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x}))))$
How to compute $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{y}, f_{\boldsymbol{\theta}}(\boldsymbol{x}))$ for such a deep model?
– Need to propagate gradients from the end back to the first layer ($\boldsymbol{W_1}$).
[Compute graph: $\boldsymbol{x}$ → Multiply ($\boldsymbol{W_1}$) → $\max(\boldsymbol{0}, \cdot)$ → Multiply ($\boldsymbol{W_2}$) → $\hat{\boldsymbol{y}}$ → $L$; the gradient flows backwards through the graph]
$\hat{\boldsymbol{y}} = \boldsymbol{W_5}\,\sigma(\boldsymbol{W_4}\tanh(\boldsymbol{W_3}\max(\boldsymbol{0}, \boldsymbol{W_2}\max(\boldsymbol{0}, \boldsymbol{W_1}\boldsymbol{x}))))$
– Need to propagate gradients from the end back to the first layer ($\boldsymbol{W_1}$)
– Need an efficient way to compute all the gradients
– Compute graphs come in handy!
Gradient descent:
– Easy to compute using compute graphs
Alternatives:
– Newton's method
– L-BFGS
– Adaptive moments
– Conjugate gradient
– Many options (more in the next lectures)
– Nice, because complex functions can easily be modularized
Next Lecture:
– Backpropagation and optimization of neural networks
– Check for updates on website/moodle regarding the exercises
– http://cs231n.github.io/optimization-1/
– http://www.deeplearningbook.org/contents/optimization.html
– Pattern Recognition and Machine Learning – C. Bishop
– http://www.deeplearningbook.org/