Neural Networks, Chapter 11 in ESL II
STK-IN4300 Statistical Learning Methods in Data Science
Odd Kolbjørnsen, oddkol@math.uio.no

Learning today: neural nets
– Projection pursuit
  – What is it?
  – How to solve it: stagewise
– Neural nets
– with enough data and the correct algorithm you will get it right eventually…
Constructed example from: Ribeiro et al. (2016) "Why Should I Trust You?": Explaining the Predictions of Any Classifier
Loss function, general form $M(Z, \hat{g}(Y))$, for responses $z_j = (z_{j1}, \ldots, z_{jL})^T$, $j = 1, \ldots, O$:

Squared error (common):
$$M(Z, \hat{g}(Y)) = \sum_{j=1}^{O} \sum_{l=1}^{L} \left(z_{jl} - \hat{g}_l(y_j)\right)^2$$

Cross-entropy:
$$M(Z, \hat{g}(Y)) = \sum_{j=1}^{O} \sum_{l=1}^{L} -\log\left(\hat{g}_l(y_j)\right) \cdot z_{jl}$$

For classification, code the response as
$$z_{jl} = \begin{cases} 0 & \text{if } z_j \neq l \\ 1 & \text{if } z_j = l \end{cases}$$
so that $\hat{g}_l(y_j) \approx \text{Prob}(z_{jl} = 1)$.
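As a concrete illustration, here is a minimal numpy sketch of the two losses in the notation above ($O$ records, $L$ classes, one-hot coded $z_{jl}$); the toy arrays are hypothetical.

```python
import numpy as np

def squared_error(Z, G):
    # M(Z, g(Y)) = sum_j sum_l (z_jl - g_l(y_j))^2
    return np.sum((Z - G) ** 2)

def cross_entropy(Z, G, eps=1e-12):
    # M(Z, g(Y)) = -sum_j sum_l log(g_l(y_j)) * z_jl,
    # where g_l(y_j) is read as Prob(z_jl = 1)
    return -np.sum(np.log(G + eps) * Z)

labels = np.array([0, 2, 1])       # class of each of O = 3 records
Z = np.eye(3)[labels]              # one-hot coding z_jl, shape (O, L)
G = np.array([[0.8, 0.1, 0.1],     # model outputs g_l(y_j), rows sum to 1
              [0.2, 0.2, 0.6],
              [0.3, 0.5, 0.2]])
print(squared_error(Z, G), cross_entropy(Z, G))
```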
General form: neural nets are defined by a specific form of the model $g(Y)$.
Projection pursuit (general form):
$$g(Y) = \sum_{n=1}^{N} \phi_n(x_n^T Y)$$
– Derived feature (number $n$): $W_n = x_n^T Y$ (size $1 \times 1$)
– $\phi_n$: unknown functions ($\mathbb{R} \to \mathbb{R}$)
– $x_n$: unknown weight (size $p \times 1$), a unit vector
– $Y$: features = explanatory variables (size $p \times 1$)
Friedman and Tukey (1974) Friedman and Stuetzle (1981)
– Estimate $\phi$ given $x$ with a scatterplot smoother: smoothing spline, local linear (or polynomial) regression, kernel smoothing, K-nearest…
Estimate $x$ given $\phi$ by a Gauss–Newton step. Linearize around the current direction $x_{\text{old}}$:
$$\phi(x^T y_j) \approx \phi(x_{\text{old}}^T y_j) + \phi'(x_{\text{old}}^T y_j)\,(x - x_{\text{old}})^T y_j$$
so that
$$\sum_{j=1}^{O} \left(z_j - \phi(x^T y_j)\right)^2 \approx \sum_{j=1}^{O} \phi'(x_{\text{old}}^T y_j)^2 \left[\left(x_{\text{old}}^T y_j + \frac{z_j - \phi(x_{\text{old}}^T y_j)}{\phi'(x_{\text{old}}^T y_j)}\right) - x^T y_j\right]^2$$
Solve for $x$ using weighted regression with weights $\phi'(x_{\text{old}}^T y_j)^2$.
Fit one term at a time:
$$[\hat{\phi}_n(\cdot), \hat{x}_n] = \arg\min_{\phi_n(\cdot),\, x_n} \sum_{j=1}^{O} \left(z_{j,n} - \phi_n(x_n^T y_j)\right)^2$$
alternating between $\hat{\phi}_n(\cdot)$ and $\hat{x}_n$ until convergence; the fitted term is $\hat{\phi}_n(\hat{x}_n^T y_j)$.
The full model is the sum of the fitted terms:
$$g(Y) = \sum_{n=1}^{N} \hat{\phi}_n(\hat{x}_n^T Y)$$
– Estimate $\phi_n$ by local regression or smoothing splines.
– Update the residuals, $z_{j,n+1} = z_{j,n} - \hat{\phi}_n(\hat{x}_n^T y_j)$ for all $j$ (and center the results).
– When to stop adding terms (see the sketch below):
  1. When the model does not improve appreciably
  2. Use cross-validation to determine the number of terms $N$
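A minimal sketch of one stagewise term, alternating a kernel smoother for $\phi$ (one of the smoothers listed above; the bandwidth $h$ and the toy data are assumptions) with the weighted Gauss–Newton update of $x$:

```python
import numpy as np

def smooth(t_eval, t, z, h=0.3):
    # Nadaraya-Watson kernel smoother: estimate of phi at t_eval
    W = np.exp(-0.5 * ((t_eval[:, None] - t[None, :]) / h) ** 2)
    return W @ z / W.sum(axis=1)

def fit_ppr_term(Y, z, n_iter=10, h=0.3, fd=1e-4):
    O, p = Y.shape
    x = np.ones(p) / np.sqrt(p)                   # initial unit direction
    for _ in range(n_iter):
        t = Y @ x                                 # projections x^T y_j
        phi = smooth(t, t, z, h)                  # estimate phi given x
        dphi = (smooth(t + fd, t, z, h) - smooth(t - fd, t, z, h)) / (2 * fd)
        d = np.where(np.abs(dphi) < 1e-6, 1e-6, dphi)
        w = d ** 2                                # weights phi'(x_old^T y_j)^2
        target = t + (z - phi) / d                # adjusted response
        WY = Y * w[:, None]                       # weighted least squares for x
        x = np.linalg.solve(Y.T @ WY + 1e-8 * np.eye(p), WY.T @ target)
        x /= np.linalg.norm(x)                    # keep x a unit vector
    return x

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))
x_true = np.array([2.0, -2.0, 1.0]); x_true /= np.linalg.norm(x_true)
z = np.sin(Y @ x_true) + 0.1 * rng.normal(size=200)
print(fit_ppr_term(Y, z))    # close to x_true (up to sign)
```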
A single unit forms a weighted sum of its inputs,
$$w = \sum_{j=0}^{q} \beta_j y_j,$$
with $y_0 = 1$ carrying the intercept, before an activation function is applied.
– ArcTan
– Rectified linear (ReLU)
– Gaussian (NB: not monotone, gives different behavior)
Illustrations from: https://en.wikipedia.org/wiki/Activation_function
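For concreteness, the activation functions above as numpy one-liners (the sigmoid is included since it is used elsewhere in the lecture):

```python
import numpy as np

def sigmoid(v):  return 1.0 / (1.0 + np.exp(-v))   # classic, monotone
def arctan(v):   return np.arctan(v)               # monotone, bounded
def relu(v):     return np.maximum(0.0, v)         # rectified linear
def gaussian(v): return np.exp(-v ** 2)            # NB: not monotone

v = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, arctan, relu, gaussian):
    print(f.__name__, np.round(f(v), 3))
```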
Derived feature (number $n$): $a_n = \beta_n^T Y + \beta_0$ (size $1 \times 1$)
– Activation function $\sigma$ ($\mathbb{R} \to \mathbb{R}$)
– $\beta_n$: unknown weight (size $p \times 1$), not a unit vector
– $Y$: feature = explanatory variables (size $p \times 1$)

[Figure: sigmoid activation $\sigma(s v)$ for scales $s = 10$, $s = 1$, $s = 0.5$]

Compare the «PP feature» $\phi_n(x_n^T Y)$ with the neural net feature $\sigma(\beta_n^T Y + \beta_0)$.
Note! With respect to the model definition, feed-forward means:
– edges in the graph are directional
– information flows from input to output
We will, however, traverse the graph in the reverse direction as well (backpropagation).

$$g_l(Y) = \sum_{n=1}^{N} \gamma_{l,n}\, \sigma(\beta_n^T Y + \beta_{0,n})$$
Single hidden layer network:
– Hidden units: $a_n = \sigma(\beta_{0,n} + \beta_n^T Y)$, $n = 1, \ldots, N$
– Linear predictors: $U_l = \gamma_{0,l} + \gamma_l^T a$
– Output function $g_l(\cdot)$:
  – Regression: $g_l(Y) = U_l$
  – Classification (softmax):
$$g_l(Y) = \frac{\exp(U_l)}{\sum_{k=1}^{L} \exp(U_k)} = \frac{\exp(\gamma_{0,l} + \gamma_l^T a)}{\sum_{k=1}^{L} \exp(\gamma_{0,k} + \gamma_k^T a)}$$
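A minimal sketch of the forward pass in this notation; the sizes and random parameters are hypothetical.

```python
import numpy as np

def forward(Y, beta0, beta, gamma0, gamma, task="classification"):
    # hidden units: a_n = sigma(beta_{0,n} + beta_n^T Y)
    a = 1.0 / (1.0 + np.exp(-(beta0 + beta @ Y)))
    # linear predictors: U_l = gamma_{0,l} + gamma_l^T a
    U = gamma0 + gamma @ a
    if task == "regression":
        return U                        # g_l(Y) = U_l
    expU = np.exp(U - U.max())          # softmax (shifted for stability)
    return expU / expU.sum()            # g_l(Y) = exp(U_l) / sum_k exp(U_k)

rng = np.random.default_rng(1)
p, N, L = 4, 5, 3
g = forward(rng.normal(size=p),
            rng.normal(size=N), rng.normal(size=(N, p)),
            rng.normal(size=L), rng.normal(size=(L, N)))
print(g, g.sum())                       # class probabilities, sum to 1
```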
Comparison with projection pursuit:
– Neural net: $g(Y) = \sum_{n=1}^{N_{\text{NN}}} \gamma_n\, \sigma(\beta_n^T Y + \beta_{0,n})$, a fixed parametric activation $\sigma$, typically many terms
– Projection pursuit: $g(Y) = \sum_{n=1}^{N_{\text{PP}}} \phi_n(x_n^T Y)$, flexible nonparametric functions $\phi_n$, typically few terms
$$R(\theta) = \sum_{j=1}^{O} R_j, \qquad R_j = \sum_{l=1}^{L} \left(z_{jl} - g_l(y_j)\right)^2$$
Quadratic loss with $L$ output variables; $R_j$ is the contribution of the $j$'th data record.
Gradient descent on $R(\theta) = \sum_{j=1}^{O} R_j$:
$$\theta^{(k+1)} = \theta^{(k)} - \delta_k \sum_{j=1}^{O} \left.\frac{\partial R_j}{\partial \theta}\right|_{\theta = \theta^{(k)}}$$
Derivatives of $R_j$ (single hidden layer, hidden activation $\tau$, output function $g$):
$$\frac{\partial R_j}{\partial \gamma_{l,n}} = -2\left(z_{jl} - g_l(y_j)\right) g'(\gamma_l^T A_j)\, A_{n,j}$$
$$\frac{\partial R_j}{\partial \beta_{n,m}} = -\sum_{l=1}^{L} 2\left(z_{jl} - g_l(y_j)\right) g'(\gamma_l^T A_j)\, \gamma_{l,n}\, \tau'(\beta_n^T y_j)\, y_{j,m}$$
Writing $\delta_{l,j} = -2(z_{jl} - g_l(y_j))\, g'(\gamma_l^T A_j)$ and $s_{n,j} = \tau'(\beta_n^T y_j) \sum_{l=1}^{L} \gamma_{l,n}\, \delta_{l,j}$, the gradients factor as $\delta_{l,j}\, A_{n,j}$ and $s_{n,j}\, y_{j,m}$: the output errors $\delta_{l,j}$ are propagated backwards to give $s_{n,j}$.
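To make the factorization concrete, a sketch for one record with sigmoid hidden units and identity output (so $g' = 1$); the finite-difference check at the end verifies the $\gamma$ gradient. Sizes are hypothetical.

```python
import numpy as np

def backprop_record(y, z, beta0, beta, gamma0, gamma):
    a = 1.0 / (1.0 + np.exp(-(beta0 + beta @ y)))  # hidden units A_j
    g = gamma0 + gamma @ a                         # outputs g_l(y_j), identity link
    delta = -2.0 * (z - g)                         # delta_{l,j} (g' = 1)
    s = a * (1 - a) * (gamma.T @ delta)            # s_{n,j} = tau' * sum_l gamma_{l,n} delta_{l,j}
    return np.outer(delta, a), np.outer(s, y)      # dR_j/dgamma_{l,n}, dR_j/dbeta_{n,m}

rng = np.random.default_rng(2)
p, N, L = 3, 4, 2
y, z = rng.normal(size=p), rng.normal(size=L)
b0, B = rng.normal(size=N), rng.normal(size=(N, p))
g0, G = rng.normal(size=L), rng.normal(size=(L, N))
dG, dB = backprop_record(y, z, b0, B, g0, G)

def R(G_):                                         # R_j as a function of gamma
    a = 1 / (1 + np.exp(-(b0 + B @ y)))
    return np.sum((z - (g0 + G_ @ a)) ** 2)

E = np.zeros_like(G); E[0, 0] = 1e-6
print(dG[0, 0], (R(G + E) - R(G - E)) / 2e-6)      # the two should agree
```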
Gradient descent updates:
$$\gamma_{l,n}^{(s+1)} = \gamma_{l,n}^{(s)} - \delta_s \sum_{j=1}^{O} \frac{\partial R_j}{\partial \gamma_{l,n}}, \qquad \beta_{n,m}^{(s+1)} = \beta_{n,m}^{(s)} - \delta_s \sum_{j=1}^{O} \frac{\partial R_j}{\partial \beta_{n,m}}$$
Stochastic (mini-batch) gradient descent, sketched in code below:
– Perform a random partition of the training data into batches $\{C_k\}_{k=1}^{\#\text{Batches}}$
– For each batch, use the data in the batch to update the parameters
– Repeat

$$\gamma_{l,n}^{(s+1)} = \gamma_{l,n}^{(s)} - \delta_s \sum_{j \in C_k} \frac{\partial R_j}{\partial \gamma_{l,n}}, \qquad \beta_{n,m}^{(s+1)} = \beta_{n,m}^{(s)} - \delta_s \sum_{j \in C_k} \frac{\partial R_j}{\partial \beta_{n,m}}$$
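A sketch of the batching scheme with a generic parameter vector and per-record gradient; the toy problem (estimating a mean under quadratic loss) and the step size $\delta_s = 1/s$ are illustrative choices.

```python
import numpy as np

def sgd(params, grad_record, data, n_epochs=20, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    s = 0
    for _ in range(n_epochs):
        order = rng.permutation(len(data))        # random partition into batches C_k
        for start in range(0, len(data), batch_size):
            C_k = order[start:start + batch_size]
            s += 1
            delta_s = 1.0 / s                     # step length, see the conditions below
            grad = sum(grad_record(data[j], params) for j in C_k)
            params = params - delta_s * grad      # update with sum over j in C_k
    return params

# toy: R_j = (z_j - theta)^2, so dR_j/dtheta = -2 (z_j - theta)
data = np.random.default_rng(3).normal(loc=2.0, size=200)
theta = sgd(np.zeros(1), lambda zj, th: -2.0 * (zj - th), data)
print(theta)                                      # close to the mean, 2.0
```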
Step length requirements for convergence: $\delta_s \to 0$, with $\sum_s \delta_s = \infty$ and $\sum_s \delta_s^2 < \infty$; e.g. $\delta_s = \frac{1}{s}$. Online (per-record) updates:
$$\gamma_{l,n}^{(s)} = \gamma_{l,n}^{(s-1)} - \delta_s \frac{\partial R_j}{\partial \gamma_{l,n}}, \qquad \beta_{n,m}^{(s)} = \beta_{n,m}^{(s-1)} - \delta_s \frac{\partial R_j}{\partial \beta_{n,m}}$$
– Characteristics:
  – first layer takes the input
  – last layer produces the output
  – intermediate layers are hidden layers (no connection to the world outside)
– Other network types:
  – Self-organizing map (SOM): output is not defined (unsupervised)
  – Recurrent neural network (RNN): many forms
  – Hopfield networks (RNN with symmetric connections)
  – Boltzmann machine networks (Markov random fields)
  – Convolutional neural nets (locally connected)
[Figure: deep feed-forward network with hidden layers $1, 2, \ldots, R$, each with units $1, 2, 3, \ldots$, and output units $1, 2, \ldots, L$]
Weight sizes per unit: in layer $R$, a weight vector of size $(1 \times r_{R-1})$ and a bias of size $(1 \times 1)$; in the first layer, a weight vector of size $(1 \times q)$ and a bias of size $(1 \times 1)$.
Backpropagation in a deep network; errors are passed backwards layer by layer:
$$\varepsilon_l^{[\text{Top}]} = -2\left(z_l - g_l(y)\right) g'\!\left(\gamma_l^T A^{[R]}\right)$$
$$\varepsilon_n^{[R]} = \tau_R'\!\left(\beta_n^{[R]T} A^{[R-1]}\right) \sum_{l=1}^{L} \gamma_{l,n}\, \varepsilon_l^{[\text{Top}]}$$
$$\varepsilon_n^{[R-1]} = \tau_{R-1}'\!\left(\beta_n^{[R-1]T} A^{[R-2]}\right) \sum_{n'=1}^{r_R} \beta_{n',n}^{[R]}\, \varepsilon_{n'}^{[R]}$$
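A sketch of the recursion for sigmoid hidden layers and identity output; biases are omitted to keep it short, and the layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(v): return 1.0 / (1.0 + np.exp(-v))

def deep_backprop(y, z, B_list, gamma):
    # forward: A^[0] = y, A^[r] = sigma(B^[r] A^[r-1]), g = gamma A^[R]
    A = [y]
    for B in B_list:
        A.append(sigmoid(B @ A[-1]))
    g = gamma @ A[-1]
    # backward: eps^[Top], then eps^[R], eps^[R-1], ... layer by layer
    eps = -2.0 * (z - g)                           # eps^[Top] (identity output, g' = 1)
    grads = [np.outer(eps, A[-1])]                 # dR/dgamma
    eps = A[-1] * (1 - A[-1]) * (gamma.T @ eps)    # eps^[R]
    for r in range(len(B_list) - 1, -1, -1):
        grads.append(np.outer(eps, A[r]))          # dR/dB^[r+1]
        if r > 0:                                  # eps^[r] from eps^[r+1]
            eps = A[r] * (1 - A[r]) * (B_list[r].T @ eps)
    return grads[::-1]                             # [dB^[1], ..., dB^[R], dgamma]

rng = np.random.default_rng(4)
sizes = [3, 5, 4, 2]                               # input, two hidden layers, output
B_list = [rng.normal(size=(sizes[r + 1], sizes[r])) for r in range(2)]
gamma = rng.normal(size=(sizes[-1], sizes[-2]))
for dW in deep_backprop(rng.normal(size=3), rng.normal(size=2), B_list, gamma):
    print(dW.shape)
```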
Initialization: weight variance $= \dfrac{2}{\#\text{input features}}$
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
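A sketch of the initialization from the cited paper; the layer sizes are arbitrary.

```python
import numpy as np

def he_init(n_out, n_in, rng=None):
    # He et al. (2015): weights ~ N(0, 2 / #input features), suited to ReLU
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W = he_init(100, 50)
print(W.var())        # approximately 2/50 = 0.04
```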
Dropout:
– During training, set $y_{n,m} = 0$ with probability $q$
– During evaluation, no units are dropped; the weight is instead scaled by the retention probability, $\beta_{n,m} \cdot (1 - q)$
Srivastava et al. (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929–1958
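A sketch of the two modes, following Srivastava et al. with $q$ as the drop probability (so evaluation weights are scaled by the retention probability $1 - q$):

```python
import numpy as np

def dropout_train(a, q, rng):
    # training: zero each unit independently with probability q
    return a * (rng.random(a.shape) >= q)

def dropout_eval_weights(beta, q):
    # evaluation: keep all units, scale weights by retention probability 1 - q
    return beta * (1.0 - q)

rng = np.random.default_rng(5)
print(dropout_train(np.ones(10), 0.5, rng))          # about half the entries zeroed
print(dropout_eval_weights(np.full(3, 2.0), 0.5))    # weights halved
```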
[Figures: example fits; annotations: $\mathrm{Var}(\zeta_1) \approx 4$, weight decay $= 0.0005$, and the effect of $\mathrm{Var}(\zeta)$ on the fit]
Convolutional parameters are counted per feature; CNNs use multiple features (filters), e.g. 10 features $\Rightarrow$ # parameters $= 100$.
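A sketch of the count; the 3×3 kernel and single input channel are assumptions chosen to reproduce the slide's 10 features ⇒ 100 parameters.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_features):
    # weight sharing: each feature (filter) has kernel_h * kernel_w * in_channels
    # weights plus one bias, regardless of the image size
    per_feature = kernel_h * kernel_w * in_channels + 1
    return n_features * per_feature

print(conv_params(3, 3, 1, 10))   # 10 parameters per feature x 10 features = 100
```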
STK-IN 4300 Lecture 4- Neural nets
40
Video series: https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv