

  1. Machine Learning Basics Lecture 6: Overfitting Princeton University COS 495 Instructor: Yingyu Liang

  2. Review: machine learning basics

  3. Math formulation • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D • Find y = f(x) ∈ 𝓗 that minimizes the empirical loss L̂(f) = (1/n) Σ_{i=1}^n l(f, x_i, y_i) • s.t. the expected loss L(f) = 𝔼_{(x,y)~D}[l(f, x, y)] is small
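A minimal Python sketch of these two quantities, assuming squared loss and synthetic data (both are assumptions, not from the slides): the training objective is the empirical average, and generalization asks that the expectation under D is also small.

```python
import numpy as np

# Sketch of the empirical loss L_hat(f) = (1/n) sum_i l(f, x_i, y_i).
# The squared loss and the synthetic data below are assumptions for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)                                   # inputs x_1..x_n
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)   # labels y_1..y_n

def empirical_loss(f, x, y):
    """Average squared loss of hypothesis f on the sample."""
    return np.mean((f(x) - y) ** 2)

f = lambda x: np.zeros_like(x)   # a trivial hypothesis, for illustration only
print(empirical_loss(f, x, y))   # ERM searches over f in H to drive this down
```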

  4. Machine learning 1-2-3 • Collect data and extract features • Build model: choose hypothesis class 𝓗 and loss function l • Optimization: minimize the empirical loss

  5. Machine learning 1-2-3 (annotated) • Collect data and extract features (e.g., a feature mapping) • Build model: choose hypothesis class 𝓗 and loss function l (e.g., maximum likelihood; Occam's razor) • Optimization: minimize the empirical loss (e.g., gradient descent; convex optimization)

  6. Overfitting

  7. Linear vs nonlinear models • Feature mapping for the degree-2 polynomial kernel: φ(x) = (x_1^2, x_2^2, √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c) • Classifier: y = sign(w^T φ(x) + b) • A model linear in φ(x) is nonlinear in x
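To make the feature-mapping claim concrete, here is a small check (2-D inputs, c = 1, and the test points are assumptions): the explicit map φ satisfies φ(x)·φ(z) = (x·z + c)^2, the degree-2 polynomial kernel, so a linear model in φ-space is a quadratic model in x-space.

```python
import numpy as np

# Verify phi(x) . phi(z) == (x . z + c)^2 for the degree-2 polynomial kernel.
# The choice c = 1 and the test points are assumptions for illustration.
def phi(x, c=1.0):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))        # 4.0
print((x @ z + 1.0) ** 2)     # 4.0: same value, computed without phi
```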

  8. Linear vs nonlinear models • Linear model: f(x) = a_0 + a_1 x • Nonlinear model: f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + … + a_M x^M • Linear model ⊆ nonlinear model (one can always set a_i = 0 for i > 1) • So the nonlinear model can always achieve the same or smaller training error • Why then use Occam's razor (choose a smaller hypothesis class)?
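A quick numerical illustration of that containment (synthetic data and squared loss are assumptions): as the degree M grows, the least-squares training error can only stay the same or shrink.

```python
import numpy as np

# Higher-degree polynomials contain the linear model (set a_i = 0 for i > 1),
# so their training error is never larger. The data below is synthetic.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 15)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(15)

for M in (1, 3, 9):
    coef = np.polyfit(x, y, deg=M)      # least-squares fit of degree M
    err = np.mean((np.polyval(coef, x) - y) ** 2)
    print(f"M={M}: training error {err:.4f}")   # non-increasing in M
```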

  9. Example: regression using polynomial curve t = sin(2πx) + ε Figure from Pattern Recognition and Machine Learning, Bishop

  10. Example: regression using polynomial curve t = sin(2πx) + ε Regression using a polynomial of degree M Figure from Pattern Recognition and Machine Learning, Bishop

  11. Example: regression using polynomial curve t = sin(2πx) + ε Figure from Pattern Recognition and Machine Learning, Bishop

  12. Example: regression using polynomial curve t = sin(2πx) + ε Figure from Pattern Recognition and Machine Learning, Bishop

  13. Example: regression using polynomial curve t = sin(2πx) + ε Figure from Pattern Recognition and Machine Learning, Bishop

  14. Example: regression using polynomial curve Figure from Pattern Recognition and Machine Learning, Bishop
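A rough reproduction of Bishop's experiment (the noise level, sample sizes, and degree grid are assumptions): the training error keeps falling as M grows, while the test error eventually rises, which is the overfitting pattern the figures show.

```python
import numpy as np

# Fit polynomials of increasing degree to t = sin(2*pi*x) + noise and compare
# training vs test error. Noise level and sample sizes are assumptions.
rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_tr, t_tr = sample(10)      # small training set, as in Bishop's example
x_te, t_te = sample(200)     # large test set approximates the expected loss

for M in (0, 1, 3, 9):
    coef = np.polyfit(x_tr, t_tr, deg=M)
    tr = np.mean((np.polyval(coef, x_tr) - t_tr) ** 2)
    te = np.mean((np.polyval(coef, x_te) - t_te) ** 2)
    print(f"M={M}: train={tr:.3f}  test={te:.3f}")   # M=9 overfits badly
```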

  15. Prevent overfitting • Empirical loss and expected loss are different • Also called training error and test/generalization error • The larger the data set, the smaller the difference between the two • The larger the hypothesis class, the easier it is to find a hypothesis that fits this difference, i.e., one with small training error but large test error (overfitting) • A larger data set helps! • Throwing away useless hypotheses also helps!

  16. Prevent overfitting • Empirical loss and expected loss are different • Also called training error and test error • The larger the hypothesis class, the easier it is to find a hypothesis that fits this difference, i.e., one with small training error but large test error (overfitting) • Throwing away useless hypotheses helps: use prior knowledge/model to prune hypotheses • The larger the data set, the smaller the difference between the two • A larger data set helps: use experience/data to prune hypotheses (see the sketch below)
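One way to see the "larger data set helps" bullet numerically (the setup and sizes are assumptions): hold the hypothesis class fixed at degree-9 polynomials and grow n; the gap between training and test error shrinks.

```python
import numpy as np

# Fixed hypothesis class (degree-9 polynomials); growing n shrinks the gap
# between empirical and expected loss. All constants here are assumptions.
rng = np.random.default_rng(3)

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_te, t_te = sample(2000)                # proxy for the expected loss
for n in (10, 100, 1000):
    x_tr, t_tr = sample(n)
    coef = np.polyfit(x_tr, t_tr, deg=9)
    tr = np.mean((np.polyval(coef, x_tr) - t_tr) ** 2)
    te = np.mean((np.polyval(coef, x_te) - t_te) ** 2)
    print(f"n={n}: train-test gap {abs(te - tr):.3f}")   # shrinks as n grows
```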

  17. Prior vs. data

  18. Prior vs experience • Super strong prior knowledge: 𝓗 = {f*} • No data is needed! (f*: the best function)

  19. Prior vs experience • Super strong prior knowledge: 𝓗 = {f*, f_1} • A few data points suffice to detect f* (f*: the best function)

  20. Prior vs experience • Super large data set: infinite data • The hypothesis class 𝓗 can be all functions! (f*: the best function)

  21. Prior vs experience • Practical scenarios: finite data, 𝓗 of medium capacity, f* in or not in 𝓗 (figure: nested hypothesis classes 𝓗_1, 𝓗_2; f*: the best function)

  22. Prior vs experience • Practical scenarios lie between the two extreme cases: 𝓗 = {f*} ← practice → infinite data

  23. General Phenomenon Figure from Deep Learning , Goodfellow, Bengio and Courville

  24. Cross validation

  25. Model selection • How to choose the optimal capacity? • e.g., choose the best degree M for polynomial curve fitting • This cannot be done with the training data alone • Create held-out data to approximate the test error • Called the validation data set
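A minimal sketch of that recipe (the split sizes and the candidate degrees are assumptions): train each candidate on the training split, score it on the held-out split, and keep the degree with the lowest validation error.

```python
import numpy as np

# Select the polynomial degree on a held-out validation set.
# The 30/10 split and degree grid 0..9 are assumptions for illustration.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(40)
x_tr, t_tr, x_val, t_val = x[:30], t[:30], x[30:], t[30:]

def val_error(M):
    coef = np.polyfit(x_tr, t_tr, deg=M)              # train on training split
    return np.mean((np.polyval(coef, x_val) - t_val) ** 2)

best_M = min(range(10), key=val_error)                # lowest validation error
print("selected degree:", best_M)
```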

  26. Model selection: cross validation • Partition the training data into several groups • Each time, use one group as the validation set (see the sketch below) Figure from Pattern Recognition and Machine Learning, Bishop
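A minimal k-fold version of the same idea (k = 4, n = 40, and the degree grid are assumptions): each group serves once as the validation set, and the validation errors are averaged across folds.

```python
import numpy as np

# k-fold cross validation for choosing the polynomial degree.
# k = 4, n = 40, and the degree grid are assumptions for illustration.
rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(40)
k = 4
folds = np.array_split(rng.permutation(40), k)        # disjoint index groups

def cv_error(M):
    errs = []
    for i in range(k):
        val = folds[i]                                # this group validates
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[tr], t[tr], deg=M)
        errs.append(np.mean((np.polyval(coef, x[val]) - t[val]) ** 2))
    return np.mean(errs)                              # average over the k folds

print("selected degree:", min(range(10), key=cv_error))
```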

  27. Model selection: cross validation • Also used to select other hyper-parameters of the model/algorithm • e.g., learning rate, stopping criterion of SGD, etc. • Pros: general and simple • Cons: computationally expensive; even worse when there are more hyper-parameters
