SLIDE 1

Learning From Data Lecture 11: Overfitting

  • What is Overfitting?
  • When does Overfitting Occur?
  • Stochastic and Deterministic Noise

M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: Nonlinear Transforms

  • 1. Original data: xn ∈ X
  • 2. Transform the data: zn = Φ(xn) ∈ Z
  • 3. Separate the data in Z-space: g̃(z) = sign(w̃ᵀz)
  • 4. Classify in X-space via ‘Φ⁻¹’: g(x) = g̃(Φ(x)) = sign(w̃ᵀΦ(x))

X-space is R^d:
  x = (1, x1, . . . , xd)ᵀ
  data x1, x2, . . . , xN with labels y1, y2, . . . , yN
  no weights; dvc = d + 1
  g(x) = sign(w̃ᵀΦ(x))

Z-space is R^d̃:
  z = Φ(x) = (1, Φ1(x), . . . , Φd̃(x))ᵀ = (1, z1, . . . , zd̃)ᵀ
  data z1, z2, . . . , zN with labels y1, y2, . . . , yN
  weights w̃ = (w0, w1, . . . , wd̃)ᵀ; dvc = d̃ + 1
  g̃(z) = sign(w̃ᵀz)
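A minimal sketch of these four steps in code (assuming a 1-D input, a 3rd order polynomial Φ, and least-squares weights via the pseudo-inverse; the tiny data set and all names are illustrative, not the lecture's):

    import numpy as np

    def phi(x, Q=3):
        """Feature transform: x -> (1, x, x^2, ..., x^Q)."""
        return np.array([x ** q for q in range(Q + 1)])

    # 1. Original data (xn, yn) with labels +/-1 (made-up numbers)
    X = np.array([-1.0, -0.5, 0.1, 0.6, 1.0])
    y = np.array([1, 1, -1, 1, 1])

    # 2. Transform the data to Z-space: one row per point, d~+1 = Q+1 columns
    Z = np.vstack([phi(x) for x in X])

    # 3. Separate the data in Z-space; here w~ comes from least squares
    w_tilde = np.linalg.pinv(Z) @ y

    # 4. Classify in X-space: g(x) = sign(w~' Phi(x))
    g = lambda x: np.sign(w_tilde @ phi(x))
    print([g(x) for x in X])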

SLIDE 3

recap: Digits Data “1” Versus “All”

[Two scatter plots of the digits data: average intensity vs. symmetry]

Model                        Ein     Eout
Linear model                 2.13%   2.38%
3rd order polynomial model   1.75%   1.87%

SLIDE 4

Superstitions – Myth or Reality?

  • Paraskevidekatriaphobia – fear of Friday the 13th.

– Are future Friday the 13ths really more dangerous?

  • OCD [medical journal, citation lost, can you find it?]

The subject performs an action that leads to a good outcome, and generalizes it as cause and effect: the action will always give good results. Having overfit the data, the subject compulsively engages in that activity.

Humans are overfitting machines, very good at “finding coincidences”.

SLIDE 5

An Illustration of Overfitting on a Simple Example

  • Quadratic target f
  • 5 data points
  • A little noise (measurement error)
  • 5 data points → 4th order polynomial fit

[Plot: the 5 data points and the quadratic target]

Classic overfitting: simple target with excessively complex H. The noise did us in. (why?)

SLIDE 6

An Illustration of Overfitting on a Simple Example

  • Quadratic target f
  • 5 data points
  • A little noise (measurement error)
  • 5 data points → 4th order polynomial fit

[Plot: the 5 data points, the quadratic target, and the 4th order fit]

Classic overfitting: simple target with excessively complex H.

Ein ≈ 0; Eout ≫ 0

The noise did us in. (why?)
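A quick numerical sketch of this slide (the quadratic target, noise level, and random seed are made-up illustrations):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: x ** 2                      # simple quadratic target

    # 5 data points with a little measurement noise
    x_train = np.linspace(-1, 1, 5)
    y_train = f(x_train) + 0.1 * rng.standard_normal(5)

    # 4th order polynomial: 5 coefficients for 5 points -> exact interpolation
    coeffs = np.polyfit(x_train, y_train, deg=4)

    x_test = np.linspace(-1, 1, 200)
    E_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    E_out = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
    print(f"E_in  = {E_in:.2e}")              # ~0: the data are fit exactly
    print(f"E_out = {E_out:.2e}")             # much larger: the noise was fit too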

SLIDE 7

What is Overfitting?

Fitting the data more than is warranted

SLIDE 8

Overfitting is Not Just Bad Generalization

[Plot of Error vs. VC dimension dvc, showing the in-sample error, the out-of-sample error, and the region of bad generalization]

VC Analysis: Covers bad generalization but with lots of slack – the VC bound is loose

SLIDE 9

Overfitting is Not Just Bad Generalization

[The same plot of in-sample and out-of-sample error vs. VC dimension dvc, now with the overfitting region marked]

Overfitting: Going for lower and lower Ein results in higher and higher Eout

SLIDE 10

Case Study: 2nd vs 10th Order Polynomial Fit

[Two plots: x vs. y, showing data and target for each problem]

The two problems: a 10th order f with noise, and a 50th order f with no noise.
The two models: H2, a 2nd order polynomial fit, and H10, a 10th order polynomial fit.

Both are special cases of linear models with the feature transform x → (1, x, x², · · · ).

Which model do you pick for which problem and why?

SLIDE 12

Case Study: 2nd vs 10th Order Polynomial Fit

[Two plots: the data with the 2nd order and 10th order fits, one per problem]

        simple noisy target          complex noiseless target
        2nd Order    10th Order      2nd Order    10th Order
Ein     0.050        0.034           0.029        10⁻⁵
Eout    0.127        9.00            0.120        7680

Go figure: Simpler H is better even for the more complex target with no noise.
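A rough sketch of how the noiseless complex-target column might be reproduced (the target here is a fixed random 50th order polynomial standing in for the book's construction; N, the trial count, and the test grid are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-in "complex noiseless target": a fixed random 50th order polynomial
    a = rng.standard_normal(51)
    f = lambda x: np.polyval(a, x)

    N, trials = 15, 500
    x_test = np.linspace(-1, 1, 500)
    E2, E10 = [], []
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = f(x)                               # no stochastic noise at all
        for deg, out in ((2, E2), (10, E10)):
            c = np.polyfit(x, y, deg)
            out.append(np.mean((np.polyval(c, x_test) - f(x_test)) ** 2))

    print(f"H2:  Eout = {np.mean(E2):.3g}")
    print(f"H10: Eout = {np.mean(E10):.3g}")   # typically far worse at this N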

SLIDE 13

Is there Really “No Noise” with the Complex f?

[Two plots: data and target, for the simple f and for the complex f]

Simple f with noise. Complex f with no noise.

H should match the quantity and quality of the data, not f.

SLIDE 14

Is there Really “No Noise” with the Complex f?

[The same two plots showing only the data, targets hidden]

Simple f with noise. Complex f with no noise. From the data alone, you cannot tell which is which.

H should match the quantity and quality of the data, not f.

SLIDE 15

When is H2 Better than H10?

[Two plots of expected error vs. number of data points N: learning curves (Ein and Eout) for H2 and for H10]

Overfitting: Eout(H10) > Eout(H2)
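A sketch of how such learning curves can be estimated: average Ein and Eout over many random data sets for each N (the target, noise level, and grid of N values are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    f = lambda x: np.sin(np.pi * x)            # illustrative target
    sigma = 0.2                                # illustrative noise level

    x_test = np.linspace(-1, 1, 500)
    for deg, name in ((2, "H2"), (10, "H10")):
        print(name)
        for N in (12, 25, 50, 100):
            e_in, e_out = [], []
            for _ in range(300):               # average over data sets
                x = rng.uniform(-1, 1, N)
                y = f(x) + sigma * rng.standard_normal(N)
                c = np.polyfit(x, y, deg)
                e_in.append(np.mean((np.polyval(c, x) - y) ** 2))
                e_out.append(np.mean((np.polyval(c, x_test) - f(x_test)) ** 2))
            print(f"  N={N:4d}  E_in={np.mean(e_in):.3f}  E_out={np.mean(e_out):.3f}")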

SLIDE 16

Overfit Measure: Eout(H10) − Eout(H2)

[Plot of the overfit measure as a function of the number of data points N and the noise level σ²]

SLIDE 17

Overfit Measure: Eout(H10) − Eout(H2)

[Two plots of the overfit measure: as a function of N and noise level σ², and as a function of N and target complexity Qf]

  • Number of data points ↑  ⇒  Overfitting ↓
  • Noise ↑  ⇒  Overfitting ↑
  • Target complexity ↑  ⇒  Overfitting ↑
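A sketch of an experiment behind these arrows: turn the noise knob and the target-complexity knob while measuring the overfit measure Eout(H10) − Eout(H2) (the targets, N, and trial counts are all assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    x_test = np.linspace(-1, 1, 500)
    N = 40                                     # fixed sample size

    def overfit_measure(f, sigma, trials=200):
        """Average Eout(H10) - Eout(H2) over random noisy data sets."""
        diffs = []
        for _ in range(trials):
            x = rng.uniform(-1, 1, N)
            y = f(x) + sigma * rng.standard_normal(N)
            e = {d: np.mean((np.polyval(np.polyfit(x, y, d), x_test)
                             - f(x_test)) ** 2) for d in (2, 10)}
            diffs.append(e[10] - e[2])
        return np.mean(diffs)

    # Knob 1: stochastic noise level, simple target
    f_simple = lambda x: np.sin(np.pi * x)
    for sigma in (0.0, 0.3, 0.6):
        print(f"sigma={sigma}: overfit measure {overfit_measure(f_simple, sigma):+.3f}")

    # Knob 2: target complexity Qf, no stochastic noise
    for Qf in (5, 20, 50):
        a = rng.standard_normal(Qf + 1)
        f_cplx = lambda x, a=a: np.polyval(a, x)
        print(f"Qf={Qf}: overfit measure {overfit_measure(f_cplx, 0.0):+.3f}")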

SLIDE 18

Noise

That part of y we cannot model

It has two sources . . .

SLIDE 19

Stochastic Noise — Data Error

We would like to learn from yn = f(xn).
Unfortunately, we only observe yn = f(xn) + ‘stochastic noise’, and no one can model the noise term.

[Plot: the target y = f(x) with noisy data points; the vertical deviations from f are the stochastic noise]

Stochastic Noise: fluctuations/measurement errors we cannot model.

SLIDE 20

Deterministic Noise — Model Error

We would like to learn from yn = h∗(xn), the best approximation to f in H.
Unfortunately, we only observe yn = f(xn) = h∗(xn) + ‘deterministic noise’, and H cannot model the noise term.

[Plot: y = f(x) together with h∗(x), the best approximation to f in H; the gap between them is the deterministic noise]

Deterministic Noise: the part of f we cannot model.
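A sketch making this concrete: compute the best 2nd order approximation h∗ to a fixed f over a dense grid, and look at the residual (the target and the grid are illustrative choices):

    import numpy as np

    f = lambda x: np.sin(np.pi * x)        # illustrative target outside H2
    x = np.linspace(-1, 1, 1000)           # dense grid standing in for all of X

    # h*: best 2nd order approximation to f itself (fit the target, not data)
    c_star = np.polyfit(x, f(x), deg=2)
    h_star = np.polyval(c_star, x)

    det_noise = f(x) - h_star              # the part of f that H2 cannot model
    print(f"mean squared deterministic noise = {np.mean(det_noise ** 2):.4f}")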

SLIDE 21

Stochastic & Deterministic Noise Hurt Learning

[Two plots: y = f(x) + stoch. noise, and y = h∗(x) + det. noise]

                 Stochastic Noise             Deterministic Noise
Source           random measurement errors    the learner’s H cannot model f
Re-measure yn    the noise changes            the noise stays the same
Change H         the noise stays the same     the noise changes

We have a single D and a fixed H, so we cannot distinguish the two noises.

SLIDE 22

Noise and the Bias-Variance Decomposition

y = f(x) + ǫ, where ǫ is the measurement error.

E[Eout(x)] = ED,ǫ[(g(x) − f(x) − ǫ)²]
           = ED,ǫ[(g(x) − f(x))² + 2(g(x) − f(x))ǫ + ǫ²]
           = (bias + var) + 0 + σ²
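Spelling out why the cross term contributes nothing (this uses the standard assumption that ǫ has zero mean and is independent of the data set D):

    \mathbb{E}_{\mathcal{D},\epsilon}\big[2(g(\mathbf{x})-f(\mathbf{x}))\,\epsilon\big]
      = 2\,\mathbb{E}_{\mathcal{D}}\big[g(\mathbf{x})-f(\mathbf{x})\big]\,\mathbb{E}_{\epsilon}[\epsilon] = 0,
    \qquad
    \mathbb{E}_{\epsilon}\big[\epsilon^{2}\big] = \sigma^{2}.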

SLIDE 23

Noise and the Bias-Variance Decomposition

y = f(x) + ǫ, where ǫ is the measurement error.

E[Eout(x)] = σ² + bias + var

  • σ²: the stochastic noise
  • bias: the deterministic noise (the part of f we cannot model)
  • var: indirectly impacted by the noise

SLIDE 24

Noise is the Culprit

Overfitting is the disease. Noise is the cause.

Learning is led astray by fitting the noise more than the signal

Cures:
  • Regularization: putting on the brakes (a sketch follows below).
  • Validation: a reality check from peeking at Eout (the bottom line).
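As a teaser for the next lecture, a sketch of one form of regularization (weight decay / ridge) applied to the earlier 5-point example; the penalty λ and the whole setup are illustrative, not the lecture's exact method:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: x ** 2
    x = np.linspace(-1, 1, 5)
    y = f(x) + 0.1 * rng.standard_normal(5)

    Z = np.vander(x, 5, increasing=True)   # 4th order features (1, x, ..., x^4)
    lam = 1e-2                             # weight-decay strength (illustrative)

    w_plain = np.linalg.solve(Z, y)        # exact interpolation: no brakes
    w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(5), Z.T @ y)  # with brakes

    x_test = np.linspace(-1, 1, 200)
    Zt = np.vander(x_test, 5, increasing=True)
    for name, w in (("no regularization", w_plain), ("regularization", w_reg)):
        print(f"{name}: Eout = {np.mean((Zt @ w - f(x_test)) ** 2):.4f}")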

SLIDE 25

Regularization

[Plot, “no regularization”: Data, Target, and Fit]

SLIDE 26

Regularization

[Two plots, “no regularization” vs. “regularization!”: Data, Target, and Fit in each]
