

  1. Machine Learning Basics Lecture 6: Overfitting Princeton University COS 495 Instructor: Yingyu Liang

  2. Review: machine learning basics

  3. Math formulation
  • Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$
  • s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$
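  As a concrete reading of these formulas, here is a minimal numpy sketch, assuming squared loss; the function name and data handling are illustrative, not from the lecture:

```python
import numpy as np

def empirical_loss(f, X, Y):
    """Empirical loss L_hat(f) = (1/n) * sum_i l(f, x_i, y_i),
    here with squared loss l(f, x, y) = (f(x) - y)^2."""
    preds = np.array([f(x) for x in X])
    return np.mean((preds - Y) ** 2)

# The expected loss L(f) = E_{(x,y)~D}[l(f, x, y)] averages over the whole
# distribution D, so it cannot be computed from a finite sample; it is
# typically estimated by the empirical loss on a large held-out sample.
```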

  4. Machine learning 1-2-3
  • Collect data and extract features
  • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
  • Optimization: minimize the empirical loss

  5. Machine learning 1-2-3 (annotated)
  • Collect data and extract features (feature mapping)
  • Build model: choose hypothesis class $\mathcal{H}$ (Occam's razor) and loss function $l$ (maximum likelihood)
  • Optimization: minimize the empirical loss (gradient descent; convex optimization)

  6. Overfitting

  7. Linear vs nonlinear models
  Feature mapping for the degree-2 polynomial kernel: $\varphi(x) = \left(x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2,\ c\right)$, with classifier $y = \mathrm{sign}(w^T \varphi(x) + b)$
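  A quick numpy check that this explicit map reproduces the degree-2 polynomial kernel $(x^T z + c)^2$; the helper `phi` and the test points are mine:

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 polynomial kernel (x^T z + c)^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
kernel = (x @ z + 1.0) ** 2          # kernel evaluated directly
explicit = phi(x) @ phi(z)           # inner product in feature space
assert np.isclose(kernel, explicit)  # the two agree
```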

  8. Linear vs nonlinear models
  • Linear model: $f(x) = a_0 + a_1 x$
  • Nonlinear model: $f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_M x^M$
  • Linear model $\subseteq$ nonlinear model (one can always set $a_i = 0$ for $i > 1$)
  • So it looks like the nonlinear model can always achieve the same or smaller error
  • Why, then, use Occam's razor (choose a smaller hypothesis class)?

  9. Example: regression using polynomial curve, $t = \sin(2\pi x) + \epsilon$. Figure from Pattern Recognition and Machine Learning, Bishop

  10. Example: regression using polynomial curve, $t = \sin(2\pi x) + \epsilon$, fit with a polynomial of degree $M$. Figure from Pattern Recognition and Machine Learning, Bishop

  11. Example: regression using polynomial curve, $t = \sin(2\pi x) + \epsilon$. Figure from Pattern Recognition and Machine Learning, Bishop

  12. Example: regression using polynomial curve, $t = \sin(2\pi x) + \epsilon$. Figure from Pattern Recognition and Machine Learning, Bishop

  13. Example: regression using polynomial curve, $t = \sin(2\pi x) + \epsilon$. Figure from Pattern Recognition and Machine Learning, Bishop

  14. Example: regression using polynomial curve. Figure from Pattern Recognition and Machine Learning, Bishop
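  The qualitative picture in Bishop's figures is easy to reproduce; below is a small numpy sketch (sample sizes, noise level, and degrees are illustrative choices, not Bishop's exact settings). Training error keeps shrinking as the degree $M$ grows, while test error eventually blows up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training and test data from t = sin(2*pi*x) + noise
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 10)
x_test = rng.uniform(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 100)

for M in [0, 1, 3, 9]:
    coeffs = np.polyfit(x_train, t_train, deg=M)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"M={M}: train error {train_err:.3f}, test error {test_err:.3f}")
```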

  15. Prevent overfitting
  • Empirical loss and expected loss are different; they are also called training error and test/generalization error
  • The larger the data set, the smaller the difference between the two
  • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two, and thus has small training error but large test error (overfitting)
  • A larger data set helps!
  • Throwing away useless hypotheses also helps!

  16. Prevent overfitting
  • Empirical loss and expected loss are different; they are also called training error and test error
  • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two, and thus has small training error but large test error (overfitting)
  • The larger the data set, the smaller the difference between the two
  • Throwing away useless hypotheses helps: use prior knowledge/model to prune hypotheses
  • A larger data set helps: use experience/data to prune hypotheses
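  A companion sketch for the data-set-size point: holding the hypothesis class fixed (degree-9 polynomials, an assumption for illustration), the gap between training and test error shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n points from t = sin(2*pi*x) + noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_test, t_test = sample(10_000)    # large sample approximates the expected loss

for n in [15, 100, 1000]:
    x_tr, t_tr = sample(n)
    coeffs = np.polyfit(x_tr, t_tr, deg=9)   # fixed hypothesis class
    train = np.mean((np.polyval(coeffs, x_tr) - t_tr) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"n={n}: train {train:.3f}, test {test:.3f}, gap {test - train:.3f}")
```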

  17. Prior vs. data

  18. Prior vs experience
  • Super strong prior knowledge: $\mathcal{H} = \{f^*\}$, where $f^*$ is the best function
  • No data is needed!

  19. Prior vs experience
  • Super strong prior knowledge: $\mathcal{H} = \{f^*, f_1\}$, where $f^*$ is the best function
  • A few data points suffice to detect $f^*$

  20. Prior vs experience
  • Super large data set: infinite data
  • The hypothesis class $\mathcal{H}$ can then be all functions! ($f^*$: the best function)

  21. Prior vs experience
  • Practical scenarios: finite data, $\mathcal{H}$ of medium capacity, and $f^*$ either in or not in $\mathcal{H}$ ($\mathcal{H}_1$, $\mathcal{H}_2$ in the figure; $f^*$: the best function)

  22. Prior vs experience
  • Practical scenarios lie between the two extreme cases: $\mathcal{H} = \{f^*\}$ at one end, infinite data at the other, with practice somewhere in between

  23. General phenomenon: as model capacity grows, training error decreases while generalization error eventually rises. Figure from Deep Learning, Goodfellow, Bengio and Courville

  24. Cross validation

  25. Model selection
  • How to choose the optimal capacity? E.g., choose the best degree for polynomial curve fitting
  • This cannot be done with the training data alone
  • Create held-out data to approximate the test error; this held-out set is called the validation data set

  26. Model selection: cross validation
  • Partition the training data into several groups
  • Each time, use one group as the validation set
  Figure from Pattern Recognition and Machine Learning, Bishop
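  A sketch of this partitioning in plain numpy (the helper `kfold_indices` is mine, not from the lecture):

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Partition indices 0..n-1 into k groups; yield one
    (train, validation) split per group."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

rng = np.random.default_rng(0)
for train_idx, val_idx in kfold_indices(n=10, k=5, rng=rng):
    print("train:", train_idx, "validation:", val_idx)
```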

  27. Model selection: cross validation
  • Also used for selecting other hyper-parameters of the model/algorithm, e.g., the learning rate or the stopping criterion of SGD
  • Pros: general and simple
  • Cons: computationally expensive; even worse when there are more hyper-parameters
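  Tying the last two slides together, a self-contained sketch that picks the polynomial degree by 5-fold cross validation on the earlier curve-fitting setup (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)

k = 5
folds = np.array_split(rng.permutation(len(x)), k)

best_M, best_err = None, np.inf
for M in range(10):                        # candidate hyper-parameter values
    errs = []
    for i in range(k):                     # k-fold cross validation
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[tr], t[tr], deg=M)
        errs.append(np.mean((np.polyval(coeffs, x[val]) - t[val]) ** 2))
    cv_err = np.mean(errs)
    if cv_err < best_err:
        best_M, best_err = M, cv_err

print(f"selected degree M={best_M} (cross-validation error {best_err:.3f})")
```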
