Learning From Data, Lecture 22: Neural Networks and Overfitting



SLIDE 1

Learning From Data Lecture 22 Neural Networks and Overfitting

  • Approximation vs. Generalization
  • Regularization and Early Stopping
  • Minimizing Ein More Efficiently

  • M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: Neural Networks and Fitting the Data

Forward Propagation:

x = x(0) --W(1)--> s(1) --θ--> x(1) --W(2)--> s(2) ··· --W(L)--> s(L) --θ--> x(L) = h(x)

s(ℓ) = (W(ℓ))ᵀ x(ℓ−1),    x(ℓ) = [ 1 ; θ(s(ℓ)) ]

(Compute h and Ein.)

Choose W = {W(1), W(2), . . . , W(L)} to minimize Ein.

Gradient descent: W(t + 1) ← W(t) − η ∇Ein(W(t))

Computing the gradient requires ∂e/∂W(ℓ), which in turn requires the sensitivities δ(ℓ) = ∂e/∂s(ℓ):

∂e/∂W(ℓ) = x(ℓ−1) (δ(ℓ))ᵀ

Backpropagation (run backward, from the output layer to the first):

δ(1) ←− δ(2) ··· ←− δ(L−1) ←− δ(L),    δ(ℓ) = θ′(s(ℓ)) ⊗ [ W(ℓ+1) δ(ℓ+1) ]₁,…,d(ℓ)   (keep components 1, . . . , d(ℓ); drop the bias component)
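For concreteness, here is a minimal numpy sketch of one forward and one backward pass for a tanh network with squared error on a single example. The choice θ = tanh, the bias-row layout of the weight matrices, and the function name forward_backward are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def forward_backward(W, x, y):
    """One forward and one backward pass; squared error e = (h(x) - y)^2.

    W : list of weight matrices, W[l] of shape (d_{l-1} + 1, d_l) (bias row first).
    x : input vector of length d_0 (without the bias coordinate).
    y : target value.
    Returns h(x) and the per-layer gradients de/dW^(l).
    """
    L = len(W)
    xs = [np.concatenate(([1.0], x))]        # x^(0), with the bias coordinate
    ss = [None]                              # s^(0) is unused
    for l in range(1, L + 1):
        s = W[l - 1].T @ xs[l - 1]           # s^(l) = (W^(l))^T x^(l-1)
        a = np.tanh(s)                       # θ(s^(l)), with θ = tanh
        ss.append(s)
        xs.append(a if l == L else np.concatenate(([1.0], a)))  # add bias except at the output

    h = xs[L]                                # network output h(x)

    # Backward pass: δ^(L) from the output error, then propagate backwards.
    deltas = [None] * (L + 1)
    deltas[L] = 2 * (h - y) * (1 - h ** 2)               # θ'(s) = 1 - tanh(s)^2
    for l in range(L - 1, 0, -1):
        back = W[l] @ deltas[l + 1]                      # W^(l+1) δ^(l+1), includes the bias row
        deltas[l] = (1 - np.tanh(ss[l]) ** 2) * back[1:] # keep components 1..d^(l)

    grads = [np.outer(xs[l - 1], deltas[l]) for l in range(1, L + 1)]  # x^(l-1) (δ^(l))^T
    return h, grads
```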

[Figures: log10(error) versus log10(iteration) for gradient descent and SGD; the resulting fit on the digits data (average intensity vs. symmetry).]


SLIDE 3

2-Layer Neural Network

[Figure: a 2-layer network: the input x feeds m hidden units with weight vectors v1, . . . , vm, whose outputs are combined with weights w0, w1, . . . , wm to produce h(x).]

h(x) = θ( w0 + Σ_{j=1}^{m} wj θ(vjᵀ x) )

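As a quick illustration of the 2-layer formula above, a minimal sketch assuming θ = tanh for both the hidden and output units; the function name two_layer_h is made up.

```python
import numpy as np

def two_layer_h(x, V, w):
    """h(x) = θ( w0 + Σ_{j=1}^{m} w_j θ(v_jᵀ x) ), with θ = tanh (an assumption).

    V : (m, d) matrix whose rows are the hidden weight vectors v_j.
    w : vector of length m + 1; w[0] is the bias w0.
    """
    hidden = np.tanh(V @ x)                  # θ(v_jᵀ x) for j = 1, ..., m
    return np.tanh(w[0] + w[1:] @ hidden)    # θ( w0 + Σ_j w_j · hidden_j )
```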

SLIDE 4

The Neural Network has a Tunable Transform

Neural Network:       h(x) = θ( w0 + Σ_{j=1}^{m} wj θ(vjᵀ x) )

Nonlinear Transform:  h(x) = θ( w0 + Σ_{j=1}^{d̃} wj Φj(x) )

k-RBF-Network:        h(x) = θ( w0 + Σ_{j=1}^{k} wj φ(‖x − µj‖) )

Approximation: Ein = O(1/m).


SLIDE 5

Generalization

MLP:  dvc = O(md log(md))
tanh: dvc = O(md(m + d))

With m = √N: convergence to optimal for the MLP, just like k-NN.

Semi-parametric, because you still have to learn the parameters.
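A rough back-of-the-envelope for why m = √N works (my own gloss; constants and log factors suppressed): combine the approximation behaviour Ein = O(1/m) from the previous slide with the VC generalization bound,

    Eout ≲ Ein + O( √(dvc / N) ) = O(1/m) + O( √( m d log(md) / N ) ).

With m = √N the first term is O(1/√N) and the second is roughly O( (d log N / √N)^{1/2} ); both vanish as N → ∞, which is the sense in which the MLP converges to optimal.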


SLIDE 6

Regularization – Weight Decay

Eaug(w) = (1/N) Σ_{n=1}^{N} ( h(xn; w) − yn )² + (λ/N) Σ_{ℓ,i,j} ( w(ℓ)ij )²

∂Eaug(w)/∂W(ℓ) = ∂Ein(w)/∂W(ℓ) + (2λ/N) W(ℓ)

(The first term, ∂Ein(w)/∂W(ℓ), comes from backpropagation.)
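A sketch of how the weight-decay term enters the gradient, reusing the hypothetical forward_backward (and numpy as np) from the recap sketch; the name augmented_gradient, the argument lam, and averaging Ein over the N examples are assumptions.

```python
def augmented_gradient(W, X, Y, lam):
    """∂Eaug/∂W^(l) = ∂Ein/∂W^(l) + (2λ/N) W^(l), with Ein averaged over N examples."""
    N = len(X)
    grads = [np.zeros_like(Wl) for Wl in W]
    for x, y in zip(X, Y):
        _, g = forward_backward(W, x, y)        # per-example ∂e/∂W^(l) via backpropagation
        for l in range(len(W)):
            grads[l] += g[l] / N
    for l in range(len(W)):
        grads[l] += (2.0 * lam / N) * W[l]      # weight-decay contribution
    return grads
```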


SLIDE 7

Weight Decay with Digits Data

[Figure: decision boundaries on the digits data (average intensity vs. symmetry); left: no weight decay, right: weight decay with λ = 0.01.]


SLIDE 8

Early Stopping

Gradient descent: w1 = w0 − η g0 / ‖g0‖

H1 = {w : ‖w − w0‖ ≤ η}
H2 = H1 ∪ {w : ‖w − w1‖ ≤ η}
H3 = H2 ∪ {w : ‖w − w2‖ ≤ η}

Each iteration explores a larger hypothesis set: H1 ⊂ H2 ⊂ H3 ⊂ H4 ⊂ ···, so the effective dvc(Ht) grows with the number of iterations t.

[Figures: Ein(wt), the penalty Ω(dvc(Ht)) and Eout(wt) versus iteration t, with Eout minimized at t∗; the gradient-descent path from w(0) to w(t∗) over contours of constant Ein.]


SLIDE 9

Early Stopping on Digits Data

[Figures: log10(error) versus iteration t on the digits data, showing Ein and Eval with the validation error minimized at t∗; the resulting decision boundary in the (average intensity, symmetry) plane.]

Use a validation set to determine t∗. Output w∗; do not retrain on all the data up to iteration t∗.
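A sketch of early stopping against a validation set, assuming a gradient routine like augmented_gradient above (or plain ∇Ein), a fixed learning rate, and snapshotting the best weights seen; all names and defaults here are illustrative.

```python
def train_with_early_stopping(W, grad_fn, val_error, eta=0.01, n_iter=10**5):
    """Gradient descent that returns the weights w* minimizing the validation error.

    grad_fn(W)   -> list of gradient matrices computed on the training set.
    val_error(W) -> Eval of the current weights on the validation set.
    """
    best_W = [Wl.copy() for Wl in W]
    best_val, t_star = float("inf"), 0
    for t in range(1, n_iter + 1):
        g = grad_fn(W)
        W = [Wl - eta * gl for Wl, gl in zip(W, g)]     # W(t) = W(t-1) - η ∇E(W(t-1))
        Ev = val_error(W)
        if Ev < best_val:                               # new minimum of Eval: remember w*, t*
            best_val, t_star = Ev, t
            best_W = [Wl.copy() for Wl in W]
    return best_W, t_star                               # output w*; do not retrain up to t*
```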



SLIDE 10

Minimizing Ein

  • 1. Use regression for classification
  • 2. Use better algorithms than gradient descent

[Figure: log10(error) versus optimization time (sec) for gradient descent and conjugate gradients.]


SLIDE 11

Beefing Up Gradient Descent

Determine the gradient g.

[Figure: two in-sample error surfaces Ein(w) over the weights w; one shallow, one deep.]

Shallow: use a large η. Deep: use a small η.


SLIDE 12

Variable Learning Rate Gradient Descent

1: Initialize w(0) and η0 at t = 0. Set α > 1 and β < 1.
2: while stopping criterion has not been met do
3:     Let g(t) = ∇Ein(w(t)), and set v(t) = −g(t).
4:     if Ein(w(t) + ηt v(t)) < Ein(w(t)) then
5:         accept: w(t + 1) = w(t) + ηt v(t); increase η: ηt+1 = α ηt.        [α ∈ [1.05, 1.1]]
6:     else
7:         reject: w(t + 1) = w(t); decrease η: ηt+1 = β ηt.                  [β ∈ [0.7, 0.8]]
8:     end if
9:     Iterate to the next step, t ← t + 1.
10: end while
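A sketch of this accept/reject scheme; flattening the weights into a single vector, the error_fn/grad_fn interfaces, and the fixed iteration budget are assumptions.

```python
def variable_eta_gd(w, error_fn, grad_fn, eta=0.1, alpha=1.1, beta=0.8, n_iter=1000):
    """Variable learning rate gradient descent (accept the step only if Ein decreases).

    w : weight vector (e.g. all W^(l) flattened).
    error_fn(w) -> Ein(w);  grad_fn(w) -> ∇Ein(w).
    """
    E = error_fn(w)
    for t in range(n_iter):
        v = -grad_fn(w)                         # v(t) = -g(t)
        E_new = error_fn(w + eta * v)
        if E_new < E:                           # accept: keep the step, grow η
            w, E = w + eta * v, E_new
            eta *= alpha                        # α ∈ [1.05, 1.1]
        else:                                   # reject: keep w, shrink η
            eta *= beta                         # β ∈ [0.7, 0.8]
    return w
```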


SLIDE 13

Steepest Descent - Line Search

1: Initialize w(0) and set t = 0.
2: while stopping criterion has not been met do
3:     Let g(t) = ∇Ein(w(t)), and set v(t) = −g(t).
4:     Let η∗ = argmin_η Ein(w(t) + η v(t)).
5:     w(t + 1) = w(t) + η∗ v(t).
6:     Iterate to the next step, t ← t + 1.
7: end while

How to accomplish the line search (step 4)? Simple bisection (binary search) suffices in practice; a sketch follows below.

[Figures: bracketing the line-search minimum with η1 < η2 < η3, where E(η2) lies below E(η1) and E(η3), and the next bisection point η̄; the step from w(t) to w(t + 1) over contours of constant Ein, showing the search direction v(t) and the new negative gradient −g(t + 1).]
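A sketch of that bracket-and-bisect line search. The doubling scheme for the initial bracket, the tolerance, and the assumption that a small first step decreases E (true when v is a descent direction) are my own choices.

```python
def line_search(E, eta=1e-3, tol=1e-8):
    """Approximately minimize the one-dimensional E(η), e.g. E(η) = Ein(w(t) + η v(t))."""
    # 1. Bracket the minimum: double η until the error goes back up,
    #    giving η1 < η2 < η3 with E(η2) <= E(η1) and E(η2) <= E(η3).
    e1, e2, e3 = 0.0, eta, 2 * eta
    while E(e3) < E(e2):
        e1, e2, e3 = e2, e3, 2 * e3

    # 2. Bisect: test the midpoint of the larger sub-interval and keep a
    #    bracketing triple around the smallest value found so far.
    while e3 - e1 > tol:
        m = 0.5 * (e1 + e2) if (e2 - e1) > (e3 - e2) else 0.5 * (e2 + e3)
        lo, hi = (m, e2) if m < e2 else (e2, m)
        if E(lo) <= E(hi):
            e1, e2, e3 = e1, lo, hi
        else:
            e1, e2, e3 = lo, hi, e3
    return e2
```

Step 4 of the algorithm then becomes something like eta_star = line_search(lambda e: Ein_of(w + e * v)), where Ein_of is whatever routine evaluates the in-sample error.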


SLIDE 14

Comparison of Optimization Heuristics

[Figure: log10(error) versus optimization time (sec) for gradient descent, variable η, and steepest descent.]

                                 Optimization time
Method                          10 sec      1,000 sec     50,000 sec
Gradient Descent                0.122       0.0214        0.0113
Stochastic Gradient Descent     0.0203      0.000447      1.6310 × 10⁻⁵
Variable Learning Rate          0.0432      0.0180        0.000197
Steepest Descent                0.0497      0.0194        0.000140


SLIDE 15

Conjugate Gradients

  • 1. Line search, just like steepest descent.
  • 2. Choose a better direction than −g (a sketch follows after the comparison below).

[Figures: the step from w(t) to w(t + 1) over contours of constant Ein, with the successive conjugate-gradient directions v(t) and v(t + 1); log10(error) versus optimization time (sec) for conjugate gradients and steepest descent.]

                                 Optimization time
Method                          10 sec      1,000 sec      50,000 sec
Stochastic Gradient Descent     0.0203      0.000447       1.6310 × 10⁻⁵
Steepest Descent                0.0497      0.0194         0.000140
Conjugate Gradients             0.0200      1.13 × 10⁻⁶    2.73 × 10⁻⁹

There are better algorithms (e.g. Levenberg-Marquardt), but we will stop here.
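A sketch of the "better direction" idea as nonlinear conjugate gradients with the Polak-Ribière+ coefficient; the slide does not say which variant the lecture uses, and line_search is the hypothetical routine sketched earlier.

```python
def conjugate_gradient_descent(w, error_fn, grad_fn, n_iter=100):
    """Nonlinear conjugate gradients on numpy weight vectors."""
    g = grad_fn(w)
    v = -g                                                   # first direction: steepest descent
    for t in range(n_iter):
        eta = line_search(lambda e: error_fn(w + e * v))     # 1. line search, as in steepest descent
        w = w + eta * v
        g_new = grad_fn(w)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))       # Polak-Ribière+ coefficient (assumed variant)
        v = -g_new + beta * v                                # 2. a better direction than -g
        g = g_new
    return w
```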
