The Paradox of Overfitting
Volker Nannen, February 1, 2003 (PowerPoint presentation)


SLIDE 1

The Paradox of Overfitting

Volker Nannen February 1, 2003 Artificial Intelligence Rijksuniversiteit Groningen

SLIDE 2

SLIDE 3

Contents

  • 1. MDL – theory
  • 2. Experimental Verification
  • 3. Results
SLIDE 4

MDL – theory

SLIDE 5

1.1 the problem

The paradox of overfitting: Complex models contain more information about the training data but less about future data.

SLIDE 6

1.2 model selection

Machine learning uses models to describe reality.

SLIDE 7

1.2 model selection Models can be

  • statistical distributions
  • polynomials
  • Markov chains
  • neural networks
  • decision trees
  • etc.
SLIDE 8

1.2 model selection This work uses polynomial models.

m_k = p_k(x) = a_0 + a_1 x + · · · + a_k x^k   (1)

Polynomials are

  • well understood
  • used throughout mathematics
  • suffer badly from overfitting
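The slides contain no code, but a least-squares polynomial fit of the kind in equation (1) can be sketched with NumPy. The signal, noise level, and sample size below are illustrative assumptions, not the slides' actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the slides' data: a smooth signal plus noise.
x = np.linspace(-1.0, 1.0, 300)
y = np.sin(3.0 * x) + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of a degree-k polynomial m_k = a_0 + a_1 x + ... + a_k x^k.
k = 6
coeffs = np.polyfit(x, y, deg=k)  # highest-degree coefficient first
model = np.poly1d(coeffs)

print(model(0.0))  # the fitted polynomial evaluated at x = 0
```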
SLIDE 9

1.3 mean squared error

The mean squared error of a model m on a sample

  s = {(x_1, y_1), …, (x_n, y_n)}   (2)

of size n is

  σ²_s = (1/n) Σ_{i=1}^{n} (m(x_i) − y_i)²   (3)
SLIDE 10

1.3 mean squared error The error on the training sample is called training error. The error on future samples is called generalization error. We want to minimize the generalization error.
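A minimal sketch of the two errors, assuming (for illustration only) a known signal so that a fresh test sample can be drawn; the function names and noise level are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def signal(x):
    # Assumed "true" signal, used only so we can draw future samples.
    return np.sin(3.0 * x)

def draw_sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    return x, signal(x) + rng.normal(scale=0.5, size=n)

def mse(model, x, y):
    # Equation (3): (1/n) * sum_i (m(x_i) - y_i)^2
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = draw_sample(300)
model = np.poly1d(np.polyfit(x_train, y_train, deg=17))

training_error = mse(model, x_train, y_train)          # error on the training sample
x_test, y_test = draw_sample(3000)
generalization_error = mse(model, x_test, y_test)      # estimate of the error on future samples
```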

SLIDE 11

1.4 an example of overfitting

An example of overfitting: regression in the two-dimensional plane

SLIDE 12

1.4 an example of overfitting Continuous signal + noise, 300 point sample.

SLIDE 13

1.4 an example of overfitting Degree-6 polynomial, σ² = 13.8

SLIDE 14

1.4 an example of overfitting Degree-17 polynomial, σ² = 5.8

SLIDE 15

1.4 an example of overfitting Degree-43 polynomial, σ² = 1.5

SLIDE 16

1.4 an example of overfitting Degree-100 polynomial, σ² = 0.6

SLIDE 17

1.4 an example of overfitting 3,000 point test sample, σ²_t = 1012

SLIDE 18

1.4 an example of overfitting Generalization error on this 3,000 point test sample. Degree 6: σ² = 16, degree 17: σ² = 8.6, degree 43: σ² = 2.7, degree 100: σ² = 1012.
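The first half of this pattern — training error falling monotonically as the degree grows — can be reproduced with a short sketch; comparing the corresponding test errors then exposes the overfitting. The signal and noise level below are assumptions, not the slides' data:

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy(x):
    # Assumed signal plus Gaussian noise (illustrative).
    return np.sin(3.0 * x) + rng.normal(scale=0.5, size=x.size)

x_train = np.linspace(-1.0, 1.0, 300)
y_train = noisy(x_train)
x_test = rng.uniform(-1.0, 1.0, 3000)
y_test = noisy(x_test)

train_err, test_err = {}, {}
for k in (2, 6, 17):
    m = np.poly1d(np.polyfit(x_train, y_train, deg=k))
    train_err[k] = float(np.mean((m(x_train) - y_train) ** 2))
    test_err[k] = float(np.mean((m(x_test) - y_test) ** 2))
```

Because the degree-k models are nested, the training error can only decrease with k; the test error need not.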

SLIDE 19

1.5 Minimum Description Length

Rissanen's hypothesis: Minimum Description Length prevents overfitting.

SLIDE 20

1.5 Minimum Description Length

MDL minimizes the total code length

  min_m [ l(s|m) + l(m) ]   (4)

This is a two-part code: l(m) is the code length of the model and l(s|m) is the code length of the data given the model.

SLIDE 21

1.5 Minimum Description Length

We only look at the least-squares model per degree:

  min_k [ n log σ̂_mk + l(m) ]   (5)

Rissanen's original estimate:

  min_k [ n log σ̂_mk + k log √n ]   (6)

This penalty is too weak.
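Criterion (6) is easy to apply directly. A sketch, assuming σ̂_mk is the RMS training error of the degree-k least-squares fit (the data and the degree range are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 300)
y = np.sin(3.0 * x) + rng.normal(scale=0.5, size=x.size)
n = x.size

def rissanen_score(k):
    # Criterion (6): n * log(sigma_hat) + k * log(sqrt(n)).
    m = np.poly1d(np.polyfit(x, y, deg=k))
    sigma_hat = np.sqrt(np.mean((m(x) - y) ** 2))  # RMS training error
    return n * np.log(sigma_hat) + k * np.log(np.sqrt(n))

best_k = min(range(21), key=rissanen_score)  # degree with the shortest two-part code
```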

SLIDE 22

1.5 Minimum Description Length

Mixture MDL is a modern version of MDL:

  min_k [ − log ∫_{m_k ∈ M_k} p(M_k = m_k) p(s|m_k) dm_k ]   (7)

p(M_k = m_k) is a prior distribution over the models in M_k. Barron and Liang (2002) provide a simple algorithm based on the uniform prior.
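The integral in (7) can be made concrete in one dimension. The sketch below is not Barron and Liang's algorithm; it simply evaluates the mixture integral on a grid for a toy class of unit-variance Gaussians with a uniform prior on the mean (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.normal(loc=0.8, scale=1.0, size=50)  # toy sample

def log_lik(theta):
    # log p(s | theta) for the Gaussian model N(theta, 1)
    return -0.5 * np.sum((s - theta) ** 2) - 0.5 * s.size * np.log(2.0 * np.pi)

# Mixture code length of M1 = {N(theta, 1): theta in [-2, 2]} with a uniform
# prior, as in (7): -log of the integral of p(theta) p(s|theta) over theta.
thetas = np.linspace(-2.0, 2.0, 4001)
dtheta = thetas[1] - thetas[0]
log_prior = -np.log(4.0)  # uniform density 1/4 on [-2, 2]
log_integrand = log_prior + np.array([log_lik(t) for t in thetas])
peak = log_integrand.max()  # subtract the peak to avoid underflow
mixture_code_length = -(peak + np.log(np.sum(np.exp(log_integrand - peak)) * dtheta))

# For comparison: the one-model class M0 = {N(0, 1)} needs no integral.
code_length_m0 = -log_lik(0.0)
```

For this sample the richer class pays a small overhead for the integral over its prior but is rewarded for fitting the mean, so its mixture code length comes out shorter.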

SLIDE 23

Experimental Verification

SLIDE 24

2.1 the problem Problems with experiments on model selection:

  • shortage of appropriate data
  • inefficient setup of experiments
  • insufficient visualization
  • few tangible results
SLIDE 25

2.2 the solution

Solution: the Statistical Data Viewer, an advanced tool for statistical experiments.

SLIDE 26

2.3 A simple experiment

A simple experiment: the sine wave

SLIDE 27

2.3 A simple experiment A new project

SLIDE 28

2.3 A simple experiment A new process and sample

SLIDE 29

2.3 A simple experiment Selecting a method for a model

SLIDE 30

2.3 A simple experiment Analyzing the generalization error

SLIDE 31

2.3 A simple experiment Analysis: cross-validation, mixture MDL, and Rissanen's MDL. Optimum at 0 degrees.
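Cross-validation as used in this analysis can be sketched as follows (5 folds; the data are illustrative and this is not the Statistical Data Viewer's implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 150)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=x.size)

# Split the sample indices into 5 disjoint folds once, up front.
folds = np.array_split(rng.permutation(x.size), 5)

def cv_error(k):
    # Mean held-out MSE of the degree-k least-squares polynomial.
    errs = []
    for hold in folds:
        train = np.ones(x.size, dtype=bool)
        train[hold] = False
        m = np.poly1d(np.polyfit(x[train], y[train], deg=k))
        errs.append(np.mean((m(x[hold]) - y[hold]) ** 2))
    return float(np.mean(errs))

best_k = min(range(16), key=cv_error)  # degree with the lowest validation error
```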

SLIDE 32

2.3 A simple experiment 150 point sample. Optimum at 17 degrees.

SLIDE 33

2.3 A simple experiment 300 point sample. Optimum at 18 degrees.

SLIDE 34

Results

SLIDE 35

3.1 achievements Achievements:

  • generic problem space (files, broad selection of online signals, drawing by hand)
  • graphical, object-oriented setup of experiments (no scripting)
  • graphics integrated into the control structure
  • simple programming interfaces
SLIDE 36

3.2 Conclusion Conclusion for all experiments:

  • Rissanen's original version usually overfits.
  • Mixture MDL can prevent overfitting.
  • Smoothing is important for model selection.
  • Mixture MDL cannot deal with non-uniform support (but cross-validation can).
  • Mixture MDL can deal with different types of noise (the i.i.d. assumption can be relaxed).
  • The structure of a prediction graph contains valuable information by itself, and MDL can reproduce it.

SLIDE 37

3.3 further research Further research:

  • The structure of the generalization error
  • Other types of data
  • Other types of models
  • Improved interfaces
SLIDE 38

volker.nannen.com/mdl