CS 337: Artificial Intelligence & Machine Learning Instructor: Prof. Ganesh Ramakrishnan Lecture 8: Regularization, Overfitting, Bias and Variance August 2019
Recap: Regularization for Generalizability
Recall: Complex models could lead to overfitting. How to counter? Regularization: the main idea is to modify the error function so that model complexity is also explicitly penalized:

Loss_reg(w) = Loss_D(w) + λ · Reg(w)

A squared penalty on the weights, i.e. Reg(w) = ||w||_2^2, is a popular penalty function and is known as L2 regularization.
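As a concrete sketch (not from the slides; the name `ridge_fit` and the toy data are illustrative), the L2-regularized least-squares objective has the closed-form minimizer w = (Φ^T Φ + λI)^(-1) Φ^T y, and larger λ visibly shrinks the weights:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi w - y||_2^2 + lam * ||w||_2^2 via the closed form
    w = (Phi^T Phi + lam * I)^(-1) Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Larger lam penalizes ||w||_2^2 more heavily and shrinks the weights toward 0.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
y = Phi @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
w_small_lam = ridge_fit(Phi, y, lam=0.01)
w_large_lam = ridge_fit(Phi, y, lam=100.0)
```

Note that the weights shrink toward zero but, unlike lasso, do not become exactly zero.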
Recap: MAP objective and regularization
Bayesian view of regularization: regularization can be achieved using different types of priors on the parameters.

w_MAP = arg min_w  (1/(2σ²)) Σ_j (y_j − w^T x_j)² + (λ/2) ||w||_2^2

We get an L2-regularized solution for the linear regression problem using a Gaussian prior on the weights. What happens when ||w||_2^2 is replaced with ||w||_1? Contrast their level curves!
Number of zero w's for different values of λ (ridge):
λ:      1e-15  1e-10  1e-08  0.0001  0.001  0.01  1  5  10  20
zeros:  0      0      0      0       0      0     0  0  0   0
Contrasting Level Curves
Recap: Lasso Regularized Least Squares Regression
The general penalized (regularized) least-squares problem:

w_Reg = arg min_w ||Φw − y||_2^2 + λ Ω(w)

Ω(w) = ||w||_1 ⇒ Lasso
Lasso regression:

w_lasso = arg min_w ||Φw − y||_2^2 + λ ||w||_1

Lasso is the MAP estimate of linear regression under a Laplace prior on the weights, w ∼ Laplace(0, θ):

Laplace(w_i | µ, b) = (1/(2b)) exp(−|w_i − µ| / b)
Gaussian Hare vs. Laplacian Tortoise
Gaussian: easier to estimate. Laplacian: yields more sparsity.
Lasso: Iterative Soft Thresholding Algorithm (ISTA)
The LASSO-regularized least-squares problem:

w_Lasso = arg min_w E_Lasso(w) = arg min_w E_LS(w) + λ ||w||_1,  where E_LS(w) = ||Φw − y||_2^2
while the relative drop in E_Lasso(w^t) across t = k and t = k+1 is significant:
    LS iterate:  w_LS^{k+1} = w_Lasso^k − η ∇E_LS(w_Lasso^k)
    Proximal¹ step (componentwise soft thresholding):
        [w_Lasso^{k+1}]_i =  [w_LS^{k+1}]_i − λη   if [w_LS^{k+1}]_i > λη
                             [w_LS^{k+1}]_i + λη   if [w_LS^{k+1}]_i < −λη
                             0                     otherwise
¹See slide 1 of https://www.cse.iitb.ac.in/~cs709/notes/enotes/24-23-10-2018-generalized-proximal-projected-gradientdescent-examples-geometry-convergence-accelerated-annotated.pdf
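The two-step iteration above can be sketched in NumPy as follows (a minimal sketch, not the course's code; it assumes a step size η < 1/(2·λ_max(Φ^T Φ)) so that the gradient step is stable):

```python
import numpy as np

def soft_threshold(v, t):
    # Componentwise soft thresholding: shrink each entry toward 0 by t;
    # entries inside [-t, t] become exactly 0, which is the source of sparsity.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(Phi, y, lam, eta, tol=1e-10, max_iter=10000):
    """ISTA for min_w ||Phi w - y||_2^2 + lam * ||w||_1.
    eta should satisfy eta < 1 / (2 * largest eigenvalue of Phi^T Phi)."""
    w = np.zeros(Phi.shape[1])
    prev = np.inf
    for _ in range(max_iter):
        grad = 2 * Phi.T @ (Phi @ w - y)               # gradient of the LS term
        w = soft_threshold(w - eta * grad, lam * eta)  # proximal step
        obj = np.sum((Phi @ w - y) ** 2) + lam * np.sum(np.abs(w))
        if prev - obj < tol * max(prev, 1.0):          # relative drop no longer significant
            break
        prev = obj
    return w
```

As a sanity check, when Φ = I the objective separates per coordinate and the minimizer is exactly soft_threshold(y, λ/2).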
Note how LASSO yields greater sparsity
Number of w's that are zero for different values of λ (lasso):
λ:      1e-15  1e-10  1e-08  1e-05  0.0001  0.001  0.01  1   5   10
zeros:  0      0      0      8      10      12     13    15  15  15
CS 337: Artificial Intelligence & Machine Learning Instructor: Prof. Ganesh Ramakrishnan Lecture: Understanding Generalization and Overfitting through bias & variance August 2019
Evaluating model performance
We saw in the last class how to estimate linear predictors by minimizing a squared-loss objective function. How do we evaluate whether or not our estimated predictor is good? Measure 1: training error. Measure 2: test error.
Error vs. Model Complexity
[Plot: prediction error vs. model complexity]
Sources of error
Three main sources of test error:
1. Bias
2. Variance
3. Noise
Example: function
Fitting 50 lines after slight perturbation of points
Variance after slight perturbation of points
Bias (with respect to non-linear fit)
Noise
Overfitting
Overfitting: When the proposed hypothesis fits the training data too well
Underfitting
Underfitting: When the hypothesis is insufficient to fit the training data
Bias/Variance Decomposition for Regression
Bias-Variance Analysis in Regression
Say the true underlying function is y = g(x) + ε, where ε is a random variable with mean 0 and variance σ². Given a dataset of m samples, D = {(x_i, y_i)}, i = 1 … m, we fit a linear hypothesis parameterized by w, f_D(x) = w^T x, to minimize the sum of squared errors Σ_i (y_i − f_D(x_i))². Given a new test point x̂, whose corresponding ŷ = g(x̂) + ε̂, what is the expected test error for x̂, Err(x̂) = E_{D,ε̂}[(f_D(x̂) − ŷ)²]?
Decomposing expected test error
Writing f(x̂) for f_D(x̂) and f̄(x̂) = E_D[f(x̂)] for its average over training sets:

E[(f(x̂) − ŷ)²] = E[f(x̂)² + ŷ² − 2 f(x̂) ŷ]
              = E[f(x̂)²] + E[ŷ²] − 2 E[f(x̂)] E[ŷ]                         (f and ŷ are independent)
              = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[ŷ²] − 2 E[f(x̂)] E[ŷ]
              = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[ŷ²] − 2 f̄(x̂) g(x̂)        (1)

where we have used the fact that E[(x − E[x])²] + (E[x])² = E[x²], together with E[f(x̂)] = f̄(x̂) and E[ŷ] = g(x̂).
Decomposing expected test error
Applying the same trick used in Equation (1) to E[ŷ²], we get

E[(f(x̂) − ŷ)²] = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[(ŷ − g(x̂))²] + g(x̂)² − 2 f̄(x̂) g(x̂)
Bias-variance decomposition
E[(f(x̂) − ŷ)²] = E[(f(x̂) − f̄(x̂))²] + (f̄(x̂) − g(x̂))² + E[(ŷ − g(x̂))²]
              = Variance(f(x̂)) + Bias(f(x̂))² + σ²
Each error term
Bias: f̄(x̂) − g(x̂) — the average error of f(x̂).
Variance: E[(f(x̂) − f̄(x̂))²] — the variance of f(x̂) across different training datasets.
Noise: E[(ŷ − g(x̂))²] = E[ε̂²] = σ² — irreducible noise.
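The three terms can be estimated by simulation. The sketch below is my own toy setup (g(x) = sin 2x, a degree-1 polynomial fit, σ = 0.3, none of which come from the slides): it repeatedly redraws the training set D and checks that the measured test error at a fixed x̂ approximately equals variance + bias² + σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * x)   # assumed true underlying function
sigma = 0.3                   # noise standard deviation
x_hat = 1.0                   # fixed test point

def fit_linear(m=30):
    """Draw a fresh dataset D of m points and fit f_D(x) = w1*x + w0 by least squares."""
    x = rng.uniform(-2, 2, size=m)
    y = g(x) + sigma * rng.normal(size=m)
    w = np.polyfit(x, y, deg=1)      # degree-1 polynomial = linear hypothesis
    return np.polyval(w, x_hat)      # prediction f_D(x_hat)

preds = np.array([fit_linear() for _ in range(2000)])  # f_D(x_hat) across many datasets
variance = preds.var()                                 # E[(f_D - E[f_D])^2]
bias_sq = (preds.mean() - g(x_hat)) ** 2               # (E[f_D] - g(x_hat))^2
noise = sigma ** 2                                     # irreducible noise

# Monte Carlo estimate of the expected test error E[(f_D(x_hat) - y_hat)^2];
# it should be close to variance + bias_sq + noise.
y_hat = g(x_hat) + sigma * rng.normal(size=2000)
err = np.mean((preds - y_hat) ** 2)
```

Because a line cannot track sin 2x, the bias term dominates here: increasing model complexity would reduce it while increasing the variance term.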
Illustrating bias and variance
Image from http://scott.fortmann-roe.com/docs/BiasVariance.html
Model Selection
Given the bias-variance tradeoff, how do we choose the best predictor for the problem at hand? How do we set the model’s parameters?
TO BE DISCUSSED IN NEXT LAB SESSION
Measuring bias/variance
Bootstrap sampling: repeatedly sample observations from a dataset with replacement. For each bootstrap dataset D_b, let V_b refer to the left-out samples, which will be used for validation. Train on D_b to estimate f_b and test on each sample in V_b. Compute bias and variance.
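A sketch of one bootstrap resample (the helper name is mine): drawing m indices with replacement leaves out roughly a 1/e ≈ 36.8% fraction of the samples, which serves as the validation set V_b:

```python
import numpy as np

def bootstrap_indices(m, rng):
    """One bootstrap sample: draw m indices with replacement (D_b);
    the indices never drawn form the held-out validation set V_b."""
    train = rng.integers(0, m, size=m)
    val = np.setdiff1d(np.arange(m), train)
    return train, val

rng = np.random.default_rng(0)
m = 1000
train, val = bootstrap_indices(m, rng)
# P(a given index is never drawn) = (1 - 1/m)^m -> 1/e, so |V_b|/m is about 0.368.
frac_left_out = len(val) / m
```

Training f_b on each D_b and evaluating on the corresponding V_b gives the spread of predictions needed to estimate bias and variance.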
TO BE DISCUSSED IN NEXT LAB SESSION
Train-Validation-Test split
Divide the available samples into three sets:
1. Train set: used to train the learning algorithm
2. Validation/Development set: used for model selection and tuning hyperparameters
3. Test/Evaluation set: used for final testing
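A minimal index-splitting sketch of the three-way division (the 60/20/20 fractions and the function name are illustrative, not prescribed by the slides):

```python
import numpy as np

def train_val_test_split(n, frac_val=0.2, frac_test=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into disjoint train/val/test parts."""
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * frac_test)
    n_val = int(n * frac_val)
    test = idx[:n_test]                  # final evaluation only
    val = idx[n_test:n_test + n_val]     # model selection / hyperparameter tuning
    train = idx[n_test + n_val:]         # fitting the learner
    return train, val, test

train, val, test = train_val_test_split(100)
```

The key discipline is that the test set is touched only once, after all tuning on the validation set is done.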
TO BE DISCUSSED IN NEXT LAB SESSION
Cross-Validation
k-fold Cross-Validation
Given: a training set D of m examples, a set of hyperparameter values Θ, a learner F, and the number of folds k.
Split D into k folds D_1, …, D_k.
For each θ ∈ Θ:
    for i = 1 … k: estimate f_{i,θ} = F_θ(D \ D_i)
    err_θ = (1/k) Σ_{i=1}^{k} Loss(f_{i,θ}), with the loss measured on the held-out fold D_i
Output: θ* = arg min_θ err_θ, and the final model f_{θ*} = F_{θ*}(D) trained on all of D.
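The procedure can be sketched as follows, using ridge regression as a hypothetical learner F_θ with θ = λ (the function names and toy data are mine, not the course's):

```python
import numpy as np

def k_fold_cv(X, y, thetas, fit, loss, k=5, seed=0):
    """k-fold CV: for each theta, train on D \\ D_i and average the
    validation loss over the k held-out folds; return the best theta."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = {}
    for theta in thetas:
        total = 0.0
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            f = fit(X[trn], y[trn], theta)       # f_{i,theta} = F_theta(D \ D_i)
            total += loss(f, X[val], y[val])     # loss on the held-out fold D_i
        errs[theta] = total / k
    best = min(errs, key=errs.get)               # theta* = arg min err_theta
    return best, errs

# Hypothetical learner: ridge regression, theta = lambda.
def fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)
```

After selecting θ*, the final model would be retrained on the full dataset with that hyperparameter.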