Neural Networks, Chapter 11 in ESL II (STK-IN4300 Statistical Learning Methods in Data Science)

SLIDE 1

Neural Networks, Chapter 11 in ESL II

STK-IN4300 Statistical Learning Methods in Data Science
Odd Kolbjørnsen, oddkol@math.uio.no

SLIDE 2

Learning today: Neural nets

  • Projection pursuit
    – What is it?
    – How to solve it: stagewise fitting
  • Neural nets
    – What is it?
    – Graphical display
    – Connection to projection pursuit
    – How to solve it: backpropagation
    – Stochastic gradient descent
    – Deep and wide
    – CNN
  • Example

SLIDE 3
Neural network

  • Used for prediction
  • Universal approximation
    – with enough data and the correct algorithm you will get it right eventually…
  • Used for both «regression type» and «classification type» problems
  • Many versions and forms; currently deep learning is a hot topic
  • Often portrayed as fully automatic, but tailoring might help
  • Performs highly advanced analysis
  • Can create utterly complex models which are hard to decipher and hard to use for knowledge transfer
  • The network provides good predictions, but are they right for the right reasons?


Constructed example from: Ribeiro et al. (2016), "Why Should I Trust You?": Explaining the Predictions of Any Classifier

SLIDE 4

In neural nets, training is based on minimizing a loss function over the training set.

General form:

$$M(Z, \hat{g}(Y)) = \sum_{j=1}^{O} M\big(z_j, \hat{g}(y_j)\big)$$

  • The target might be multi-dimensional: $z_j = (z_{j1}, \dots, z_{jL})^T$

  • Continuous response («regression type»): squared error

$$M(Z, \hat{g}(Y)) = \sum_{j=1}^{O} \sum_{l=1}^{L} \big(z_{jl} - \hat{g}_l(y_j)\big)^2$$

  • Discrete response (K classes), with indicator coding $z_{jl} = 1$ if $z_j = l$ and $z_{jl} = 0$ otherwise, and $\hat{g}_l(y_j) \approx \mathrm{Prob}(z_{jl} = 1)$:
    – Squared error (common): $M(Z, \hat{g}(Y)) = \sum_{j=1}^{O} \sum_{l=1}^{L} \big(z_{jl} - \hat{g}_l(y_j)\big)^2$
    – Cross-entropy (or deviance): $M(Z, \hat{g}(Y)) = -\sum_{j=1}^{O} \sum_{l=1}^{L} z_{jl} \log \hat{g}_l(y_j)$

Neural nets are defined by a specific form of the model $\hat{g}(Y)$.
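To make the two loss functions concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the array names `z` and `g_hat` are assumptions): it evaluates the squared error and the cross-entropy for O records with L outputs, using one-hot coded targets.

```python
import numpy as np

def squared_error(z, g_hat):
    """M(Z, g(Y)) = sum_j sum_l (z_jl - g_l(y_j))^2 for (O, L) arrays."""
    return np.sum((z - g_hat) ** 2)

def cross_entropy(z, g_hat):
    """M(Z, g(Y)) = -sum_j sum_l z_jl * log g_l(y_j), with z one-hot coded."""
    return -np.sum(z * np.log(g_hat))

# Tiny example: O = 2 records, L = 3 classes.
z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])          # one-hot targets, z_jl = 1 iff z_j = l
g_hat = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.6, 0.1]])      # predicted class probabilities g_l(y_j)

print(squared_error(z, g_hat))           # summed over all records and outputs
print(cross_entropy(z, g_hat))           # -log(0.7) - log(0.6)
```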

SLIDE 5

Projection pursuit regression

$$g(Y) = \sum_{n=1}^{N} h_n(x_n^T Y)$$

  • $h_n$: unknown ridge functions ($\mathbb{R} \to \mathbb{R}$)
  • $x_n$: unknown weight vector (size p×1), a unit vector
  • $Y$: features = explanatory variables (size p×1)
  • Derived feature number n: $W_n = x_n^T Y$ (size 1×1)

Ridge functions are constant along directions orthogonal to the directional unit vector $x_n$.

Friedman and Tukey (1974); Friedman and Stuetzle (1981)

SLIDE 6

Fitting projection pursuit: N = 1

  • The N = 1 model (a single ridge term) is known as the single index model in econometrics:
    – $g(Y) = h(x^T Y) = h(W)$, with $W = x^T Y$
  • If $x$ is known, fitting $\hat{h}(w)$ is just a 1D smoothing problem
    – Smoothing spline, local linear (or polynomial) regression, kernel smoothing, k-nearest neighbours, …
  • If $h(\cdot)$ is known, fitting $\hat{x}$ is obtained by a quasi-Newton search
    – Linearize around the current estimate $x_{\text{old}}$:
      $$h(x^T y_j) \approx h(x_{\text{old}}^T y_j) + h'(x_{\text{old}}^T y_j)\,(x - x_{\text{old}})^T y_j$$
    – Minimize the objective function with this approximation inserted:
      $$\sum_{j=1}^{O} \big(z_j - h(x^T y_j)\big)^2 \approx \sum_{j=1}^{O} \Big(z_j - h(x_{\text{old}}^T y_j) - h'(x_{\text{old}}^T y_j)(x - x_{\text{old}})^T y_j\Big)^2 = \sum_{j=1}^{O} h'(x_{\text{old}}^T y_j)^2 \left[\frac{z_j - h(x_{\text{old}}^T y_j)}{h'(x_{\text{old}}^T y_j)} + x_{\text{old}}^T y_j - x^T y_j\right]^2$$
    – Solve for $x$ using weighted regression with weights $h'(x_{\text{old}}^T y_j)^2$
  • Iterate between the two steps until convergence (a small code sketch follows below)
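As an illustration of the alternating scheme for N = 1, the following sketch (not the lecture's code; the function name `fit_single_term` and the cubic-polynomial stand-in for the 1D smoother are my own choices) alternates a 1D fit of $\hat{h}$ with the weighted least-squares update of $x$ derived above.

```python
import numpy as np

def fit_single_term(Y, z, n_iter=10, degree=3):
    """One projection pursuit term g(y) = h(x^T y): alternate
    (1) a 1D smoother for h given x and (2) a weighted LS update of x given h."""
    O, p = Y.shape
    x = np.ones(p) / np.sqrt(p)                  # initial unit direction
    for _ in range(n_iter):
        w = Y @ x                                # derived feature W_j = x^T y_j
        coef = np.polyfit(w, z, degree)          # cubic fit as a stand-in smoother
        h, dh = np.poly1d(coef), np.poly1d(np.polyder(coef))
        # weighted least-squares step for x (see the derivation above)
        weights = dh(w) ** 2 + 1e-8
        target = w + (z - h(w)) / (dh(w) + 1e-8)
        WY = Y * weights[:, None]
        x = np.linalg.solve(Y.T @ WY, Y.T @ (weights * target))
        x = x / np.linalg.norm(x)                # keep x a unit vector
    return x, h

# Example on simulated data: z = sin(x_true^T y) + noise
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))
x_true = np.array([0.8, 0.6, 0.0])
z = np.sin(Y @ x_true) + 0.1 * rng.normal(size=200)
x_hat, h_hat = fit_single_term(Y, z)
print(x_hat)                                     # estimated direction (identified only up to sign)
```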

SLIDE 7

Fitting projection pursuit, N > 1

Model: $g(Y) = \sum_{n=1}^{N} h_n(x_n^T Y)$

  • Stage-wise (greedy) fitting:
    – Set $z_{j,1} = z_j$
    – For $n = 1, \dots, N$:
      • Assume there is just one function to match (as on the previous page)
      • Minimize the loss for the current residuals $z_{j,n}$ to obtain $h_n(\cdot)$ and $x_n$:
        $$[\hat{h}_n(\cdot), \hat{x}_n] = \operatorname*{argmin}_{h_n(\cdot),\, x_n} \sum_{j=1}^{O} \big(z_{j,n} - h_n(x_n^T y_j)\big)^2$$
      • Store $\hat{h}_n(\cdot)$ and $\hat{x}_n$
      • Subtract the estimate from the data: $z_{j,n+1} = z_{j,n} - \hat{h}_n(\hat{x}_n^T y_j)$
    – Final prediction:
      $$\hat{g}(Y) = \sum_{n=1}^{N} \hat{h}_n(\hat{x}_n^T Y)$$
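A sketch of the stage-wise outer loop, assuming the single-term fitter `fit_single_term` from the previous sketch is available; the helper names are illustrative, not lecture code.

```python
import numpy as np
# assumes fit_single_term(Y, z) from the N = 1 sketch above

def fit_ppr(Y, z, n_terms=2):
    """Greedy stage-wise projection pursuit: fit one ridge term at a time
    to the current residuals and subtract its contribution."""
    residual = z.copy()
    terms = []
    for _ in range(n_terms):
        x_hat, h_hat = fit_single_term(Y, residual)   # minimize loss for current residuals
        terms.append((x_hat, h_hat))                  # store h_n and x_n
        residual = residual - h_hat(Y @ x_hat)        # z_{j,n+1} = z_{j,n} - h_n(x_n^T y_j)
    return terms

def predict_ppr(terms, Y):
    """Final prediction: g(Y) = sum_n h_n(x_n^T Y)."""
    return sum(h(Y @ x) for x, h in terms)
```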

SLIDE 8

Implementation details

  1. Need a smoothing method with efficient evaluation of $h(w)$ and $h'(w)$
     – Local regression or smoothing splines
  2. $h_n(w)$ from previous steps can be readjusted using a backfitting procedure (Chapter 9), but it is unclear whether this improves the performance
     – Set $s_j = z_j - \hat{g}(y_j) + \hat{h}_n(\hat{x}_n^T y_j)$
     – Re-estimate $h_n(\cdot)$ from $s_j$ (and center the result)
     – Do this repeatedly for $n = 1, \dots, N, 1, \dots, N, \dots$
  3. It is not common to readjust $\hat{x}_n$, as this is computationally demanding
  4. Stopping criterion for the number of terms to include:
     – Stop when the model does not improve appreciably
     – Use cross validation to determine $N$

SLIDE 9

Example

  • Training data: 1000 samples
  • Two terms

SLIDE 10

Neural network

  • Simplified model of a nerve system

Perceptron (diagram: inputs → weights → net input → activation function → output):

  • Inputs $y_0, y_1, \dots, y_q$ with weights $\beta_0, \beta_1, \dots, \beta_q$, where $y_0 = 1$
  • Net input: $w = \sum_{j=0}^{q} \beta_j y_j$
  • Output: $\tau(w) = \tau\Big(\sum_{j=0}^{q} \beta_j y_j\Big)$

SLIDE 11

Activation functions

  • Initially: the binary step function was used
  • Next: sigmoid = logistic = soft step
  • Now: there is a «rag bag» of alternatives, some more suited than others for specific tasks
    – ArcTan
    – Rectified linear (ReLU)
    – Gaussian (NB: not monotone, gives different behavior)

Illustrations from: https://en.wikipedia.org/wiki/Activation_function
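A small illustration (my own, not from the slides) of a few of the activation functions listed above, and of the perceptron from the previous slide as a weighted sum passed through $\tau$:

```python
import numpy as np

# A few of the activation functions mentioned above
def step(w):     return np.where(w >= 0.0, 1.0, 0.0)   # binary step
def sigmoid(w):  return 1.0 / (1.0 + np.exp(-w))        # logistic / soft step
def relu(w):     return np.maximum(0.0, w)              # rectified linear
def gaussian(w): return np.exp(-w ** 2)                 # not monotone

def perceptron(y, beta, tau=sigmoid):
    """Perceptron output tau(sum_j beta_j * y_j), with y_0 = 1 for the bias."""
    y_aug = np.concatenate(([1.0], y))                  # prepend y_0 = 1
    w = beta @ y_aug                                    # net input
    return tau(w)

beta = np.array([-0.5, 1.0, 2.0])                       # beta_0 (bias), beta_1, beta_2
print(perceptron(np.array([0.3, 0.1]), beta))           # sigmoid of 0.0 -> 0.5
```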

SLIDE 12

Single layer feed-forward neural nets

$$g(Y) = \sum_{n=1}^{N} \gamma_n\, \tau(\beta_n^T Y + \beta_{0,n})$$

  • $\tau$: activation function ($\mathbb{R} \to \mathbb{R}$), e.g. the sigmoid
  • $\beta_n$: unknown weight vector (size p×1), not a unit vector
  • $\beta_{0,n}$: «bias» or «shift»
  • $Y$: features = explanatory variables (size p×1)
  • Derived feature number n: $a_n = \beta_n^T Y + \beta_{0,n}$ (size 1×1)

Connection to projection pursuit: write $x_n = \beta_n / t_n$ with scale $t_n = \|\beta_n\|$, so that $x_n$ is a unit vector. Then

$$\tau(\beta_n^T Y + \beta_{0,n}) = \tau(t_n \cdot x_n^T Y + \beta_{0,n}) = \tau(t_n \cdot W_n + \beta_{0,n}),$$

i.e. a sigmoid applied to the scaled «PP feature» $W_n = x_n^T Y$ (plot of the sigmoid $\tau(s \cdot w)$ for scales s = 0.5, 1, 10 not reproduced).

SLIDE 13

Graphical display of a single hidden layer feed-forward neural network

Note! With respect to the model definition, feed-forward means:

  • Connections in the graph are directional
  • The direction goes from input to output

We will, however, traverse the graph in the opposite direction as well…

(Diagram: input layer → hidden layer → output layer; weights $\beta$ (or W1) from input to hidden, $\gamma$ (or W2) from hidden to output, activation $\tau(\cdot)$ in the hidden nodes.)

$$g_l(Y) = \sum_{n=1}^{N} \gamma_{l,n}\, \tau(\beta_n^T Y + \beta_{0,n})$$
SLIDE 14

The output layer is often «different»

Hidden layer: $a_n = \tau(\beta_{0,n} + \beta_n^T Y)$, for $n = 1, \dots, N$
Output layer: $U_l = \gamma_{0,l} + \gamma_l^T a$, for $l = 1, \dots, L$

Some alternatives for $g_l(\cdot)$:

  • Transform $\tau(U_l)$: same as the «hidden» layers
  • Identity $U_l$: common in the regression setting
    $$g_l(Y) = U_l = \gamma_{0,l} + \sum_{n=1}^{N} \gamma_{l,n}\, \tau(\beta_n^T Y + \beta_{0,n})$$
  • Joint transform $h_l(U)$: common for classification, e.g. softmax
    $$g_l(Y) = \frac{\exp(U_l)}{\sum_{k=1}^{L} \exp(U_k)} = \frac{\exp(\gamma_{0,l} + \gamma_l^T a)}{\sum_{k=1}^{L} \exp(\gamma_{0,k} + \gamma_k^T a)}$$

(Diagram: $Y \to a \to U$ and $Z$, with identity or softmax in the output layer.)
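As a sketch of how the pieces fit together (illustrative NumPy, not the lecture's code), the forward pass for one hidden layer with a sigmoid activation and either an identity or a softmax output could look like this; the shapes follow the notation above.

```python
import numpy as np

def forward(Y, beta0, beta, gamma0, gamma, output="identity"):
    """Single hidden layer network.
    Y: (O, q) inputs; beta0: (N,), beta: (N, q); gamma0: (L,), gamma: (L, N)."""
    a = 1.0 / (1.0 + np.exp(-(Y @ beta.T + beta0)))   # hidden: a_n = tau(beta_{0,n} + beta_n^T y)
    U = a @ gamma.T + gamma0                          # output: U_l = gamma_{0,l} + gamma_l^T a
    if output == "identity":                          # common in the regression setting
        return U
    expU = np.exp(U - U.max(axis=1, keepdims=True))   # softmax, shifted for numerical stability
    return expU / expU.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
O, q, N, L = 5, 4, 3, 2
Y = rng.normal(size=(O, q))
params = [rng.normal(size=s) * 0.1 for s in [(N,), (N, q), (L,), (L, N)]]
print(forward(Y, *params, output="softmax"))          # each row sums to 1
```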

SLIDE 15

Comparison of projection pursuit (PP) and neural nets (NN)

NN: $g(Y) = \sum_{n=1}^{N_{NN}} \gamma_n\, \tau(\beta_n^T Y + \beta_{0,n})$
PP: $g(Y) = \sum_{n=1}^{N_{PP}} h_n(x_n^T Y)$

Term by term: $h_n(x_n^T Y)$ vs $\gamma_n\, \tau(t_n \cdot x_n^T Y + \beta_{0,n})$, with $t_n = \|\beta_n\|$

  • The flexibility of $h_n$ is much larger than what is obtained with $t_n$ and $\beta_{0,n}$, which are the additional parameters of neural nets
  • There are usually fewer terms in PP than in NN, i.e. $N_{PP} \ll N_{NN}$
  • Both methods are powerful for regression and classification
  • Effective in problems with a high signal-to-noise ratio
  • Suited for prediction without interpretation
  • Identifiability of the weights is an open question and creates problems for interpretation
  • The fitting procedures are different

SLIDE 16

Fitting neural networks

$$S(\theta) = M(Z, \hat{g}(Y)) = \sum_{j=1}^{O} \sum_{l=1}^{L} \big(z_{jl} - \hat{g}_l(y_j)\big)^2 = \sum_{j=1}^{O} S_j(\theta), \qquad S_j(\theta) = \sum_{l=1}^{L} \big(z_{jl} - \hat{g}_l(y_j)\big)^2$$

  • $\theta$: statistical slang for the full set of parameters. Here: $\{\beta_{0,n}, \beta_n\}$ with $(q+1)N$ parameters and $\{\gamma_{0,l}, \gamma_l\}$ with $(N+1)L$ parameters
  • $S_j(\theta)$: contribution of the j'th data record (quadratic loss, L output variables)

The "standard" approach:

  • Minimize the loss
  • Use steepest descent to solve this minimization problem
  • The key to success is the efficient way of computing the gradient

SLIDE 17

Steepest descent

  • Minimize $S(\theta)$ with respect to $\theta$:
    – Initialize: $\theta^{(0)}$
    – Iterate: $\theta_k^{(s+1)} = \theta_k^{(s)} - \delta_s \left.\dfrac{\partial S(\theta)}{\partial \theta_k}\right|_{\theta = \theta^{(s)}}$, where $\delta_s$ is the learning rate
  • Since $S(\theta) = \sum_{j=1}^{O} S_j(\theta)$, the gradient decomposes as
    $$\frac{\partial S(\theta)}{\partial \theta_k} = \sum_{j=1}^{O} \frac{\partial S_j(\theta)}{\partial \theta_k},$$
    so we compute $\partial S_j(\theta)/\partial \theta_k$ per data record (easily aggregated, also from parallel computation)

SLIDE 18

Squared error loss

With $g_l(Y) = h_l(U)$ in the output layer, the gradients of $S_j(\theta)$ are

$$\frac{\partial S_j(\theta)}{\partial \gamma_{l,n}} = -2\big(z_{j,l} - g_l(y_j)\big)\, h_l'(\gamma_l^T A_j)\, A_{n,j} = \varepsilon_{l,j} \cdot A_{n,j}$$

$$\frac{\partial S_j(\theta)}{\partial \beta_{n,m}} = -\sum_{l=1}^{L} 2\big(z_{jl} - g_l(y_j)\big)\, h_l'(\gamma_l^T A_j)\, \gamma_{l,n}\, \tau'(\beta_n^T y_j)\, y_{j,m} = t_{n,j} \cdot y_{j,m}$$

Backpropagation equation (hidden layer errors from output layer errors):

$$t_{n,j} = \tau'(\beta_n^T y_j) \sum_{l=1}^{L} \gamma_{l,n}\, \varepsilon_{l,j}$$

SLIDE 19

Backpropagation (delta rule)

  • At the top level, compute:
    $$\varepsilon_{l,j} = -2\big(z_{j,l} - g_l(y_j)\big)\, h_l'(\gamma_l^T A_j), \quad \forall (j, l)$$
  • At the hidden level, compute:
    $$t_{n,j} = \tau'(\beta_n^T y_j) \sum_{l=1}^{L} \gamma_{l,n}\, \varepsilon_{l,j}, \quad \forall (j, n)$$
  • Evaluate:
    $$\frac{\partial S_j(\theta)}{\partial \gamma_{l,n}} = \varepsilon_{l,j}\, A_{n,j} \quad \text{and} \quad \frac{\partial S_j(\theta)}{\partial \beta_{n,m}} = t_{n,j}\, y_{j,m}$$
  • Update (with fixed learning rate $\delta_s$):
    $$\gamma_{l,n}^{(s+1)} = \gamma_{l,n}^{(s)} - \delta_s \sum_{j=1}^{O} \left.\frac{\partial S_j}{\partial \gamma_{l,n}}\right|_{\theta=\theta^{(s)}}, \qquad \beta_{n,m}^{(s+1)} = \beta_{n,m}^{(s)} - \delta_s \sum_{j=1}^{O} \left.\frac{\partial S_j}{\partial \beta_{n,m}}\right|_{\theta=\theta^{(s)}}$$
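Written out for one hidden layer with sigmoid activation, squared error, and identity output (so $h_l' \equiv 1$), the delta rule might be implemented as in this sketch (my own illustration; the name `backprop_step` is assumed and bias terms are omitted for brevity):

```python
import numpy as np

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

def backprop_step(Y, Z, beta, gamma, delta=0.1):
    """One steepest-descent step with the delta rule.
    Y: (O, q) inputs, Z: (O, L) targets, beta: (N, q), gamma: (L, N).
    Squared error, sigmoid hidden units, identity output (h_l' = 1); biases omitted."""
    A = sigmoid(Y @ beta.T)                    # hidden activations A_{n,j}, shape (O, N)
    G = A @ gamma.T                            # outputs g_l(y_j), shape (O, L)

    eps = -2.0 * (Z - G)                       # top level: eps_{l,j}, shape (O, L)
    t = (eps @ gamma) * A * (1.0 - A)          # hidden level: t_{n,j}, using tau' = A(1 - A)

    grad_gamma = eps.T @ A                     # sum_j eps_{l,j} * A_{n,j}, shape (L, N)
    grad_beta = t.T @ Y                        # sum_j t_{n,j} * y_{j,m}, shape (N, q)

    return beta - delta * grad_beta, gamma - delta * grad_gamma

# Toy usage: a few gradient steps on random data
rng = np.random.default_rng(1)
Y = rng.normal(size=(50, 4))
Z = rng.normal(size=(50, 2))
beta = rng.uniform(-0.7, 0.7, size=(3, 4))
gamma = rng.uniform(-0.7, 0.7, size=(2, 3))
for _ in range(100):
    beta, gamma = backprop_step(Y, Z, beta, gamma, delta=0.01)
```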

SLIDE 20

Stochastic gradient descent

  • The equations above update with all data at the same time
  • The form invites updating the estimate using fractions of the data:
    – Perform a random partition of the training data into batches $\{C_k\}_{k=1}^{\#\text{Batches}}$
    – For each batch, use the data in that batch to update the parameters
    – Repeat
  • One iteration is one update of the parameters (using one batch)
  • One epoch is one scan through all the data (using all batches in the partition)

Full data update:
$$\gamma_{l,n}^{(s+1)} = \gamma_{l,n}^{(s)} - \delta_s \sum_{j=1}^{O} \left.\frac{\partial S_j}{\partial \gamma_{l,n}}\right|_{\theta=\theta^{(s)}}, \qquad \beta_{n,m}^{(s+1)} = \beta_{n,m}^{(s)} - \delta_s \sum_{j=1}^{O} \left.\frac{\partial S_j}{\partial \beta_{n,m}}\right|_{\theta=\theta^{(s)}}$$

Batch update:
$$\gamma_{l,n}^{(s+1)} = \gamma_{l,n}^{(s)} - \delta_s \sum_{j \in C_k} \left.\frac{\partial S_j}{\partial \gamma_{l,n}}\right|_{\theta=\theta^{(s)}}, \qquad \beta_{n,m}^{(s+1)} = \beta_{n,m}^{(s)} - \delta_s \sum_{j \in C_k} \left.\frac{\partial S_j}{\partial \beta_{n,m}}\right|_{\theta=\theta^{(s)}}$$
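Only the set of records entering the gradient sum changes. A sketch of the partition-and-cycle loop, assuming `backprop_step` and the data from the previous sketch:

```python
import numpy as np
# assumes Y, Z, beta, gamma and backprop_step(...) from the backpropagation sketch above

rng = np.random.default_rng(2)
batch_size, n_epochs, delta = 10, 5, 0.01

for epoch in range(n_epochs):                       # one epoch = one scan through all data
    order = rng.permutation(len(Y))                 # random partition into batches C_k
    for start in range(0, len(Y), batch_size):
        C_k = order[start:start + batch_size]       # indices of batch C_k
        # one iteration = one parameter update using only this batch
        beta, gamma = backprop_step(Y[C_k], Z[C_k], beta, gamma, delta=delta)
```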

SLIDE 21

Online learning (extreme case: batch size = 1)

  • Learning based on one data point at a time
  • You might re-iterate (for several epochs) when completed, or, if you have an abundance of data, just take on new data as they come along (hence the name)
  • For convergence: $\delta_s \to 0$ with $\sum_s \delta_s = \infty$ and $\sum_s \delta_s^2 < \infty$, e.g. $\delta_s = 1/s$

$$\gamma_{l,n}^{(s)} = \gamma_{l,n}^{(s-1)} - \delta_s \left.\frac{\partial S_j}{\partial \gamma_{l,n}}\right|_{\theta=\theta^{(s-1)}}, \qquad \beta_{n,m}^{(s)} = \beta_{n,m}^{(s-1)} - \delta_s \left.\frac{\partial S_j}{\partial \beta_{n,m}}\right|_{\theta=\theta^{(s-1)}}$$

SLIDE 22

Other optimization methods can be used

  • Still use backpropagation to get the derivatives
  • Conjugate gradient
    – Method for minimizing a quadratic form
    – Needs «restarts» for nonlinear problems
  • Variable metric methods
    – E.g. quasi-Newton methods

SLIDE 23

Neural networks

  • Deep neural nets are currently a «hot topic»
  • Deep means many hidden layers
  • Multilayer feed-forward characteristics:
    – The network is arranged in layers:
      • the first layer takes the input
      • the last layer gives the output
      • intermediate layers are hidden layers (no connection to the world outside)
    – A node in one layer is connected to every node in the next layer (fully connected)
    – There are no connections among nodes in the same layer
  • Other types:
    – Self-organizing map (SOM): output is not defined (unsupervised)
    – Recurrent neural network (RNN): many forms
    – Hopfield networks (RNN with symmetric connections)
    – Boltzmann machine networks (Markov random fields)
    – Convolutional neural nets (locally connected)

SLIDE 24

Graphical display of a feed-forward neural network

(Diagram: input nodes $Y_1, \dots, Y_q$; hidden layers 1 through R with nodes $a_1^{[s]}, \dots, a_{r_s}^{[s]}$, weights $\beta_1, \dots, \beta_R$ and activations $\tau_1(\cdot), \dots, \tau_R(\cdot)$; output nodes $Z_1, \dots, Z_L$ with weights $\gamma$ and output functions $h_l(\cdot)$.)

  • Depth of the NN = the number of hidden layers (R)
  • Width of the NN may vary from layer to layer

SLIDE 25

Graphical display of a feed-forward neural network, matrix notation

(Diagram: input $\boldsymbol{y}$ → hidden layers $\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_R$ → output $\boldsymbol{z}$, with weight matrices $\boldsymbol{X}_1, \dots, \boldsymbol{X}_R$, activations $\tau_1(\cdot), \dots, \tau_R(\cdot)$ and output function $h(\cdot)$. Depth of the NN = R hidden layers.)

Matrix notation:
  • Size of $\boldsymbol{X}_1$: $(q+1) \times r_1$
  • Size of $\boldsymbol{X}_s$: $(r_{s-1}+1) \times r_s$ (the +1 accounts for the bias)
  • Layer update: $\mathcal{A}_s = \tau_s(\boldsymbol{X}_s^T \mathcal{A}_{s-1})$, with a constant 1 appended to $\mathcal{A}_{s-1}$ for the bias
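In this matrix notation the whole forward pass is a short loop over layers. A minimal sketch (illustrative, not the lecture's code; it assumes the bias is handled by appending a constant 1 to the previous layer, as indicated by the $(r_{s-1}+1) \times r_s$ weight sizes):

```python
import numpy as np

def forward_deep(y, X_list, tau=np.tanh):
    """Feed-forward pass A_s = tau(X_s^T A_{s-1}), with a 1 appended for the bias.
    X_list[s] has shape (r_{s-1} + 1, r_s); the input y plays the role of A_0.
    (The final layer would typically use the identity or softmax instead of tau.)"""
    A = y
    for X in X_list:
        A = tau(X.T @ np.concatenate(([1.0], A)))   # bias element first, then previous layer
    return A

# Example: q = 4 inputs, hidden widths (5, 3), final layer of width 2
rng = np.random.default_rng(3)
widths = [4, 5, 3, 2]
X_list = [0.1 * rng.normal(size=(r_prev + 1, r)) for r_prev, r in zip(widths[:-1], widths[1:])]
print(forward_deep(rng.normal(size=4), X_list))
```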

SLIDE 26

Nested definition (do not try to write this in closed form…)

$$g_l(Y) = \gamma_{0,l} + \sum_{n=1}^{r_R} \gamma_{n,l}\, a_n^{[R]}, \qquad l = 1, \dots, L$$

$$a_j^{[R]} = \tau_R\!\left(\beta_j^{[R]T} a^{[R-1]} + \beta_{0,j}^{[R]}\right), \qquad j = 1, \dots, r_R$$

$$a_j^{[R-1]} = \tau_{R-1}\!\left(\beta_j^{[R-1]T} a^{[R-2]} + \beta_{0,j}^{[R-1]}\right), \qquad j = 1, \dots, r_{R-1}$$

$$\vdots$$

$$a_j^{[1]} = \tau_1\!\left(\beta_j^{[1]T} Y + \beta_{0,j}^{[1]}\right), \qquad j = 1, \dots, r_1$$

  • $L$: number of outputs; $q$: number of inputs; $r_s$: width of layer $s$
  • Each $\beta_j^{[s]}$ has length $r_{s-1}$ (with $r_0 = q$); each bias $\beta_{0,j}^{[s]}$ has size 1×1
  • In practice: use computational graphs

SLIDE 27

Training neural networks

  • Backpropagation can still be used

Traversing the graph backwards (record number $j$ suppressed):

$$\varepsilon_l^{[\text{Top}]} = -2\big(z_l - g_l(y)\big)\, h_l'(\gamma_l^T A^{[R]})$$

$$\varepsilon_n^{[R]} = \tau_R'\!\left(\beta_n^{[R]T} A^{[R-1]}\right) \sum_{l=1}^{L} \gamma_{l,n}\, \varepsilon_l^{[\text{Top}]}, \qquad n = 1, \dots, r_R$$

$$\varepsilon_m^{[R-1]} = \tau_{R-1}'\!\left(\beta_m^{[R-1]T} A^{[R-2]}\right) \sum_{n=1}^{r_R} \beta_{n,m}^{[R]}\, \varepsilon_n^{[R]}, \qquad m = 1, \dots, r_{R-1}$$

In practice: use computational graphs.

SLIDE 28

Scaling of input & starting values

  • Standardize the input variables to avoid numerical scaling issues
    – Mean 0
    – Standard deviation 1
  • Choose weights
    – Close to zero (the model is then almost linear)
    – Do not choose exactly zero (then it does not get started)
    – Too large values generally give bad results
    – Common to use several random starting points
  • Weights, rule of thumb (with standardized input):
    – The book suggests weights from a uniform distribution on $[-0.7, 0.7]$
    – Also common (ReLU): weights from $N(0,\, 2/\#\text{input features})$

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
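Both rules of thumb are straightforward to write down. A small sketch (the shapes and variable names are illustrative; the normal rule is read with $2/\#\text{input features}$ as the variance):

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hidden = 20, 10          # number of (standardized) input features, hidden units

# Book's suggestion: small uniform weights on [-0.7, 0.7]
beta_uniform = rng.uniform(-0.7, 0.7, size=(n_hidden, n_in))

# Common for ReLU (He et al. 2015): N(0, 2 / #input features),
# i.e. standard deviation sqrt(2 / n_in)
beta_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_hidden, n_in))
```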

SLIDE 29

Avoiding overfitting

  • Early stopping
    – Since the starting regime is close to linear, we will usually end up with something close to linear if we stop early
    – Use a validation set to select when to stop
  • Regularization by weight decay
    – Minimize $S(\theta) + \mu K(\theta)$, with a penalty term e.g. $K(\theta) = \sum \beta_{n,m}^2 + \sum \gamma_{l,n}^2$
    – Use cross validation to select $\mu$
  • Dropout
    – During training, set node values to 0 with probability $1-q$ (i.e. keep them with probability $q$)
    – During evaluation, the weights are scaled to $\beta_{n,m} \cdot q$

Srivastava et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929-1958.
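A sketch of how the two regularizers enter training (illustrative, not lecture code; `grad` in `weight_decay_gradient` stands for the backpropagated gradient from the earlier slides, and q is the keep probability):

```python
import numpy as np

rng = np.random.default_rng(5)

def weight_decay_gradient(grad, weights, mu):
    """Gradient of S(theta) + mu * K(theta), with K = sum of squared weights."""
    return grad + 2.0 * mu * weights

def dropout_train(a, q):
    """During training: set each hidden activation to 0 with probability 1 - q."""
    mask = rng.random(a.shape) < q          # keep with probability q
    return a * mask

def dropout_eval_weights(gamma, q):
    """During evaluation: keep all units, but scale the outgoing weights by q."""
    return gamma * q

a = rng.normal(size=(32, 10))               # hidden activations for one batch
print(dropout_train(a, q=0.8))
```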

SLIDE 30

Effect of weight decay

(Figure: panels showing the predictions and the fitted weights under weight decay.)

SLIDE 31

Multiple minima

  • $S(\theta)$ is not convex and has many local minima
  • Try many starting configurations
    – Choose the solution with the lowest penalized error
    – Or use the average prediction of the collection of networks
  • NB: do not average the weights, as these are not well ordered
  • Use bagging, i.e. average predictions of networks trained on random perturbations of the training data

SLIDE 32

Number of hidden units and layers

  • Number of units
    – From 5 to 100 units is common
    – Increase with the number of inputs and the amount of training data
    – Better to have too many than too few
    – Start with a large number and use weight decay (regularization)
  • Number of hidden layers
    – Guided by background knowledge
    – Models hierarchical features at different levels of resolution
    – Trial and error
    – Trend in CNNs: smaller filters, more layers

SLIDE 33

What is deep learning?

  • There is no universally agreed-upon threshold of depth dividing shallow learning from deep learning, but most researchers in the field agree that deep learning involves multiple nonlinear layers (CAP > 2), i.e. 3 layers or more
  • «Deep learning»: Hinton et al. 2006 (3 layers)
  • «Very deep learning»: Simonyan et al. 2014 (16+ layers)
  • «Extremely deep»: He et al. 2016 (50 to 1000 layers)
  • Schmidhuber 2015 considers more than 10 layers to be very deep learning

SLIDE 34

Example: simulated data

  • Signal-to-noise ratio: $\mathrm{Var}(g(Y)) / \mathrm{Var}(\zeta_1) \approx 4$
  • Training data size: 100 samples
  • Test data size: 10 000 samples
  • Weight decay = 0.0005

(Figure: prediction performance, with linear regression shown for comparison.)

SLIDE 35

Weight decay vs number of hidden units

(Figure: test error for weight decay = 0.0005 and varying numbers of hidden units.)

Both approaches give good results: selecting a large number of hidden units and optimizing the weight decay, or fixing the weight decay and optimizing the number of hidden units.

SLIDE 36

Examples of simulated data

  • Signal-to-noise ratio: $\mathrm{Var}(g(Y)) / \mathrm{Var}(\zeta) \approx 4$
  • Training data size: 100 samples
  • Test data size: 10 000 samples

NN does not always work: here all cases are worse than just using the mean, and it gets even worse as the number of units increases.

SLIDE 37

Alternative models for neural networks

«Hand crafting» the NN might help, by reducing the number of parameters to be estimated:

  • Setting weights to zero (localize)
  • Setting weights equal (convolutional NN)

SLIDE 38

Neural net reduction in the number of parameters (time series example)

Layer i → layer i+1, both of width 100, with $A = X y$ and $X$ of size 100 × 100:

  • Fully connected: # parameters = (100+1)×100 = 10 100 (including bias)
  • Localized (±4): # parameters = (9+1)×100 = 1 000
  • Convolutional (±4): # parameters = (9+1)×1 = 10 per feature; a CNN uses multiple features, so 10 features => # = 100 parameters (reproduced in the small calculation below)
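The counts in the comparison above are easy to reproduce; a small illustrative calculation (not lecture code, and the function name is assumed):

```python
def n_params(width_in=100, width_out=100, window=4, n_features=10):
    """Parameter counts for a 100 -> 100 layer (time series example above)."""
    fully_connected = (width_in + 1) * width_out            # (100+1)*100 = 10100, incl. bias
    localized = (2 * window + 1 + 1) * width_out            # (9+1)*100  = 1000
    conv_per_feature = (2 * window + 1 + 1) * 1             # (9+1)*1    = 10, weights shared
    conv_total = conv_per_feature * n_features              # 10 features -> 100 parameters
    return fully_connected, localized, conv_per_feature, conv_total

print(n_params())   # (10100, 1000, 10, 100)
```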

SLIDE 39

Video: https://www.youtube.com/watch?v=bNb2fEVKeEo

Series: https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv

CNN design choices, per CONV layer: number of filters? size of filter? subsample (stride)? padding? pooling layer? activation function? number of layers?

SLIDE 40

Learning today: Neural nets

  • Projection pursuit
    – What is it?
    – How to solve it: stagewise fitting
  • Neural nets
    – What is it?
    – Graphical display
    – Connection to projection pursuit
    – How to solve it: backpropagation
    – Stochastic gradient search
    – Deep, wide, convolutional