Machine Learning Basics Lecture 6: Overfitting Princeton University COS 495 Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i)$
• s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[\ell(f, x, y)]$
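To make the two quantities concrete, here is a minimal numpy sketch (not from the lecture) assuming squared loss and a toy distribution $D$ with $x \sim \mathrm{Uniform}[0,1]$ and $y = \sin(2\pi x)$ plus Gaussian noise; the expected loss is approximated by averaging over a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return 2 * x - 1          # some fixed candidate hypothesis (illustrative)

def loss(f, x, y):
    return (f(x) - y) ** 2    # squared loss l(f, x, y)

def sample(n):
    # Toy distribution D: x ~ Uniform[0,1], y = sin(2*pi*x) + noise
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)

x_train, y_train = sample(30)
empirical_loss = loss(f, x_train, y_train).mean()   # L_hat(f): average over n training points

x_big, y_big = sample(100_000)
expected_loss = loss(f, x_big, y_big).mean()        # Monte Carlo estimate of L(f)

print(empirical_loss, expected_loss)
```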
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class $\mathcal{H}$ and loss function $\ell$
• Optimization: minimize the empirical loss
Machine learning 1-2-3 (annotated)
• Collect data and extract features: feature mapping
• Build model: choose hypothesis class $\mathcal{H}$ (Occam's razor) and loss function $\ell$ (maximum likelihood)
• Optimization: minimize the empirical loss: gradient descent; convex optimization
Overfitting
Linear vs nonlinear models
• Classifier: $y = \mathrm{sign}(w^\top \phi(x) + b)$
• Degree-2 polynomial kernel feature map: $\phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c)$
• A linear classifier over $\phi(x)$ is a nonlinear classifier over $x$
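The feature map can be checked numerically: the inner product of the explicit features equals the polynomial kernel $(x^\top z + c)^2$. A small sketch, assuming 2-D inputs and treating $c$ as a free parameter:

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit degree-2 polynomial kernel feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
c = 1.0

# Inner product in feature space equals the kernel value (x.z + c)^2,
# so sign(w^T phi(x) + b) is linear in phi(x) but nonlinear in x.
print(phi(x, c) @ phi(z, c))   # feature-space inner product
print((x @ z + c) ** 2)        # polynomial kernel value
```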
Linear vs nonlinear models
• Linear model: $f(x) = a_0 + a_1 x$
• Nonlinear model: $f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_M x^M$
• Linear model ⊆ nonlinear model (we can always set $a_i = 0$ for $i > 1$)
• So it looks like the nonlinear model can always achieve the same or smaller error
• Why, then, would one use Occam's razor (choose a smaller hypothesis class)?
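The "same or smaller error" claim holds for the training error because the model classes are nested. A hedged sketch on toy data (the data-generating process and degrees are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

# Training error of the least-squares polynomial fit never increases with
# degree M: every degree-M polynomial is also a degree-(M+1) polynomial.
for M in [1, 3, 5, 9]:
    coeffs = np.polyfit(x, y, M)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"M={M}: training MSE = {train_mse:.4f}")
```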
Example: regression using polynomial curve
• Data generated as $t = \sin(2\pi x) + \epsilon$
• Regression using a polynomial of degree M, shown for several values of M
[Figures from Pattern Recognition and Machine Learning, Bishop]
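A sketch in the spirit of this experiment (the sample sizes, noise level, and degrees below are assumptions, not Bishop's exact setup): fit polynomials of several degrees on a small training sample and compare training error against test error on a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, t_train = make_data(10)    # small training set
x_test, t_test = make_data(1000)    # large test set approximates the expected loss

for M in [0, 1, 3, 9]:
    coeffs = np.polyfit(x_train, t_train, M)
    train_mse = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"M={M}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# Typically the training MSE keeps falling as M grows, while the test MSE
# falls and then blows up at M=9: overfitting.
```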
Prevent overfitting
• Empirical loss and expected loss are different; they are also called training error and test/generalization error
• The larger the data set, the smaller the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two, and thus has small training error but large test error (overfitting)
• A larger data set helps: use experience/data to prune hypotheses
• Throwing away useless hypotheses also helps: use prior knowledge/model to prune hypotheses
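The "larger data set helps" point can be simulated directly: keep the hypothesis class fixed (here, degree-9 polynomials) and grow $n$; the gap between training and test error shrinks. A toy sketch with illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_test, t_test = make_data(5000)    # large test set approximates the expected loss

# Fix a large hypothesis class (degree-9 polynomials) and grow the data set.
for n in [10, 30, 100, 1000]:
    x_tr, t_tr = make_data(n)
    coeffs = np.polyfit(x_tr, t_tr, 9)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - t_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"n={n}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```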
Prior vs. data
Prior vs experience
• Super strong prior knowledge: $\mathcal{H} = \{f^*\}$, where $f^*$ is the best function
• No data is needed!
Prior vs experience
• Super strong prior knowledge: $\mathcal{H} = \{f^*, f_1\}$, where $f^*$ is the best function
• A few data points suffice to detect $f^*$
Prior vs experience
• Super large data set: infinite data
• The hypothesis class $\mathcal{H}$ can be all functions! The data alone identify $f^*$, the best function
Prior vs experience
• Practical scenarios: finite data, $\mathcal{H}$ of medium capacity, and $f^*$ may or may not lie in $\mathcal{H}$ (candidate hypotheses $f_1, f_2$; $f^*$: the best function)
Prior vs experience
• Practical scenarios lie between the two extreme cases: $\mathcal{H} = \{f^*\}$ at one end, infinite data at the other, with practice in between
General Phenomenon
[Figure: training error and generalization error as functions of model capacity, from Deep Learning, Goodfellow, Bengio and Courville]
Cross validation
Model selection
• How to choose the optimal capacity? E.g., choose the best degree M for polynomial curve fitting
• This cannot be done with the training data alone
• Create held-out data to approximate the test error; this is called a validation data set
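A minimal held-out-validation sketch for choosing the polynomial degree (the split sizes and degree range are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)

# Hold out part of the data as a validation set and pick the degree M
# with the smallest validation error.
x_tr, t_tr = x[:30], t[:30]
x_val, t_val = x[30:], t[30:]

best_M, best_err = None, np.inf
for M in range(10):
    coeffs = np.polyfit(x_tr, t_tr, M)
    val_mse = np.mean((np.polyval(coeffs, x_val) - t_val) ** 2)
    if val_mse < best_err:
        best_M, best_err = M, val_mse
print(f"selected degree M={best_M} (validation MSE {best_err:.3f})")
```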
Model selection: cross validation
• Partition the training data into several groups
• Each time, use one group as the validation set and train on the rest; average the validation errors over the groups
[Figure from Pattern Recognition and Machine Learning, Bishop]
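A hand-rolled k-fold version of the same selection, assuming k = 4 folds (a sketch, not the figure's exact setup):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)

def cv_error(x, t, M, k=4):
    """Average validation MSE of a degree-M polynomial fit over k folds."""
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)   # all points outside this fold
        coeffs = np.polyfit(x[train], t[train], M)
        errs.append(np.mean((np.polyval(coeffs, x[fold]) - t[fold]) ** 2))
    return np.mean(errs)

for M in [0, 1, 3, 9]:
    print(f"M={M}: CV error {cv_error(x, t, M):.3f}")
```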
Model selection: cross validation
• Also used for selecting other hyper-parameters of the model/algorithm, e.g., the learning rate or stopping criterion of SGD
• Pros: general and simple
• Cons: computationally expensive, and even worse when there are more hyper-parameters
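The expense is easy to see in a library sketch (assuming scikit-learn is available; Ridge's regularization strength alpha stands in for a generic hyper-parameter): the search fits one model per candidate value per fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, (40, 1))
t = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 40)

# 4 candidate alphas x 5 folds = 20 fits, plus a final refit on all data.
search = GridSearchCV(Ridge(), {"alpha": [1e-3, 1e-2, 1e-1, 1.0]}, cv=5)
search.fit(X, t)
print(search.best_params_)
```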