Lecture 6: Overfitting. Princeton University COS 495. Instructor: Yingyu Liang. PowerPoint presentation transcript.

SLIDE 1

Machine Learning Basics Lecture 6: Overfitting

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review: machine learning basics

SLIDE 3

Math formulation

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Find y = f(x) ∈ 𝓗 that minimizes the empirical loss

L̂(f) = (1/n) Σ_{i=1}^{n} l(f, x_i, y_i)

  • s.t. the expected loss is small

L(f) = E_{(x,y)~D}[l(f, x, y)]
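As a toy illustration of the two quantities above, the sketch below computes the empirical loss of a fixed predictor on a small sample, and a Monte Carlo estimate of its expected loss. The distribution D, the predictor f, and the squared loss are all made-up choices for the example.

```python
import random

random.seed(0)

def f(x):
    """A fixed (made-up) predictor: f(x) = 2x."""
    return 2.0 * x

def loss(f, x, y):
    """Squared loss l(f, x, y) = (f(x) - y)^2."""
    return (f(x) - y) ** 2

def sample(n):
    """Draw n i.i.d. pairs from a toy distribution D: y = 2x + Gaussian noise."""
    return [(x, 2.0 * x + random.gauss(0, 0.1))
            for x in (random.uniform(-1, 1) for _ in range(n))]

# Empirical loss: average loss over a small training sample.
train = sample(20)
empirical = sum(loss(f, x, y) for x, y in train) / len(train)

# Expected loss: approximated by averaging over many fresh draws from D.
big = sample(100_000)
expected = sum(loss(f, x, y) for x, y in big) / len(big)

print(empirical, expected)  # both should be near the noise variance 0.01
```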

SLIDE 4

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class 𝓗 and loss function l
  • Optimization: minimize the empirical loss
SLIDE 5

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class 𝓗 and loss function l
  • Optimization: minimize the empirical loss

Annotations on the three steps: feature mapping (data); Occam's razor, maximum likelihood (model); gradient descent, convex optimization (optimization)

SLIDE 6

Overfitting

SLIDE 7

Linear vs nonlinear models

๐‘ฆ1 ๐‘ฆ2 ๐‘ฆ1

2

๐‘ฆ2

2

2๐‘ฆ1๐‘ฆ2 2๐‘‘๐‘ฆ1 2๐‘‘๐‘ฆ2 ๐‘‘ ๐‘ง = sign(๐‘ฅ๐‘ˆ๐œš(๐‘ฆ) + ๐‘) Polynomial kernel

SLIDE 8

Linear vs nonlinear models

  • Linear model: f(x) = a_0 + a_1 x
  • Nonlinear model: f(x) = a_0 + a_1 x + a_2 x² + a_3 x³ + … + a_M x^M
  • Linear model ⊆ nonlinear model (since one can always set a_i = 0 for i > 1)
  • Looks like the nonlinear model can always achieve the same or smaller error
  • Why, then, use Occam's razor (choose a smaller hypothesis class)?
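The containment claim is easy to verify concretely: any linear model is also a degree-M polynomial with the higher-order coefficients set to zero, so minimizing training error over the larger class can never do worse. A sketch with coefficients chosen purely for illustration:

```python
def poly(coeffs, x):
    """Evaluate a_0 + a_1*x + a_2*x^2 + ... at x."""
    return sum(a * x ** i for i, a in enumerate(coeffs))

linear = [1.0, 2.0]            # f(x) = 1 + 2x
cubic = [1.0, 2.0, 0.0, 0.0]   # the same function viewed in the degree-3 class

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
lin_vals = [poly(linear, x) for x in xs]
cub_vals = [poly(cubic, x) for x in xs]
print(lin_vals)
print(cub_vals)  # identical: the linear model lives inside the cubic class
```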
SLIDE 9

Example: regression using polynomial curve

t = sin(2πx) + ε

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 10

Example: regression using polynomial curve

Figure from Pattern Recognition and Machine Learning, Bishop

t = sin(2πx) + ε; regression using a polynomial of degree M

SLIDE 11

Example: regression using polynomial curve

Figure from Pattern Recognition and Machine Learning, Bishop

t = sin(2πx) + ε

SLIDE 12

Example: regression using polynomial curve

Figure from Pattern Recognition and Machine Learning, Bishop

t = sin(2πx) + ε

SLIDE 13

Example: regression using polynomial curve

t = sin(2πx) + ε

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 14

Example: regression using polynomial curve

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 15

Prevent overfitting

  • Empirical loss and expected loss are different
  • Also called training error and test/generalization error
  • The larger the data set, the smaller the difference between the two
  • The larger the hypothesis class, the easier it is to find a hypothesis that fits the training data but not the underlying distribution
  • Such a hypothesis has small training error but large test error (overfitting)
  • A larger data set helps!
  • Throwing away useless hypotheses also helps!
SLIDE 16

Prevent overfitting

  • Empirical loss and expected loss are different
  • Also called training error and test error
  • Larger the hypothesis class, easier to find a hypothesis that fits the

difference between the two

  • Thus has small training error but large test error (overfitting)
  • Larger the data set, smaller the difference between the two
  • Throwing away useless hypotheses also helps!
  • Larger data set helps!

Use prior knowledge/model to prune hypotheses Use experience/data to prune hypotheses
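The claim that a larger data set shrinks the gap between empirical and expected loss can be illustrated with a fixed predictor on synthetic data. The toy distribution below is invented for the example, and since a single run fluctuates with the seed, only the large-sample gap should be trusted to be small.

```python
import random

random.seed(1)

def sample(n):
    """n i.i.d. draws from a toy distribution: y = 2x + Gaussian noise."""
    return [(x, 2.0 * x + random.gauss(0, 0.1))
            for x in (random.uniform(-1, 1) for _ in range(n))]

def empirical_loss(data):
    """Average squared loss of the fixed predictor f(x) = 2x."""
    return sum((2.0 * x - y) ** 2 for x, y in data) / len(data)

EXPECTED = 0.01  # true expected loss of f: the noise variance

gaps = {}
for n in (10, 1000, 100_000):
    gaps[n] = abs(empirical_loss(sample(n)) - EXPECTED)
    print(n, gaps[n])  # the gap shrinks on the order of 1/sqrt(n)
```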

SLIDE 17

Prior vs. data

SLIDE 18

Prior vs experience

  • Super strong prior knowledge: 𝓗 = {f*}
  • No data is needed!

f*: the best function

SLIDE 19

Prior vs experience

  • Super strong prior knowledge: 𝓗 = {f*, f_1}
  • A few data points suffice to detect f*

f*: the best function

SLIDE 20

Prior vs experience

  • Super large data set: infinite data
  • The hypothesis class 𝓗 can be all functions!

f*: the best function

SLIDE 21

Prior vs experience

  • Practical scenarios: finite data, 𝓗 of medium capacity, f* in or not in 𝓗

f*: the best function (figure: nested hypothesis classes 𝓗_1, 𝓗_2)

SLIDE 22

Prior vs experience

  • Practical scenarios lie between the two extreme cases

(figure: a spectrum from 𝓗 = {f*} to infinite data, with practice in between)

SLIDE 23

General Phenomenon

Figure from Deep Learning, Goodfellow, Bengio and Courville
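The figure's qualitative shape (training error falls as capacity grows, while test error eventually rises) can be reproduced in miniature with polynomial regression on synthetic data. The least-squares solver below is a hand-rolled sketch via the normal equations, and the exact error values depend on the random seed, so only the guaranteed monotone decrease of training error should be read off literally.

```python
import math
import random

random.seed(2)

def design(xs, degree):
    """Vandermonde-style design matrix with columns 1, x, ..., x^degree."""
    return [[x ** j for j in range(degree + 1)] for x in xs]

def lstsq(A, y):
    """Solve the normal equations (A^T A) a = A^T y by Gaussian elimination."""
    m = len(A[0])
    M = [[sum(row[i] * row[j] for row in A) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * t for row, t in zip(A, y)) for i in range(m)]
    for i in range(m):
        p = max(range(i, m), key=lambda r: abs(M[r][i]))  # partial pivoting
        M[i], M[p] = M[p], M[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for j in range(i, m):
                M[r][j] -= f * M[i][j]
            b[r] -= f * b[i]
    a = [0.0] * m
    for i in range(m - 1, -1, -1):
        a[i] = (b[i] - sum(M[i][j] * a[j] for j in range(i + 1, m))) / M[i][i]
    return a

def mse(xs, ys, a):
    return sum((sum(c * x ** j for j, c in enumerate(a)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def make(n):
    xs = [random.random() for _ in range(n)]
    return xs, [math.sin(2 * math.pi * x) + random.gauss(0, 0.2) for x in xs]

train_x, train_y = make(10)   # small training set, as in the curve-fitting example
test_x, test_y = make(200)    # large held-out set standing in for the distribution

train_err, test_err = {}, {}
for d in range(7):
    a = lstsq(design(train_x, d), train_y)
    train_err[d] = mse(train_x, train_y, a)
    test_err[d] = mse(test_x, test_y, a)
    print(d, train_err[d], test_err[d])
```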

SLIDE 24

Cross validation

SLIDE 25

Model selection

  • How to choose the optimal capacity?
  • E.g., choose the best degree for polynomial curve fitting
  • This cannot be done with the training data alone
  • Create held-out data to approximate the test error
  • This held-out set is called the validation data set
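A minimal sketch of held-out validation: compute each candidate model's error on the validation set and keep the best. The three "already trained" candidates and the data distribution below are invented for the illustration.

```python
import math
import random

random.seed(3)

# Held-out validation data from a toy distribution: y = sin(2*pi*x) + noise.
xs = [random.random() for _ in range(50)]
val = [(x, math.sin(2 * math.pi * x) + random.gauss(0, 0.2)) for x in xs]

# Three hypothetical candidates of increasing capacity, assumed already trained.
candidates = {
    "constant": lambda x: 0.0,
    "linear": lambda x: 0.955 - 1.91 * x,  # a rough linear fit, hand-picked
    "sine": lambda x: math.sin(2 * math.pi * x),
}

# Pick the candidate with the smallest validation error.
val_err = {name: sum((g(x) - y) ** 2 for x, y in val) / len(val)
           for name, g in candidates.items()}
best = min(val_err, key=val_err.get)
print(best, val_err[best])
```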
SLIDE 26

Model selection: cross validation

  • Partition the training data into several groups
  • Each time use one group as validation set

Figure from Pattern Recognition and Machine Learning, Bishop
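The partition-and-rotate scheme can be sketched as pure index bookkeeping (assigning points to groups by striding is one arbitrary choice; shuffling first is also common):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k groups; each group in turn is the
    validation set while the remaining groups form the training set."""
    groups = [list(range(n))[i::k] for i in range(k)]
    folds = []
    for i in range(k):
        val = groups[i]
        train = [j for g in groups[:i] + groups[i + 1:] for j in g]
        folds.append((train, val))
    return folds

folds = kfold_indices(10, 5)
for train, val in folds:
    print(sorted(val), sorted(train))  # each point validates exactly once
```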

SLIDE 27

Model selection: cross validation

  • Also used for selecting other hyper-parameters of the model/algorithm
  • E.g., learning rate, stopping criterion of SGD, etc.
  • Pros: general and simple
  • Cons: computationally expensive; even worse when there are more hyper-parameters