

SLIDE 1

CS 337: Artificial Intelligence & Machine Learning
Instructor: Prof. Ganesh Ramakrishnan
Lecture 8: Regularization, Overfitting, Bias and Variance
August 2019

SLIDE 2

Recap: Regularization for Generalizability

Recall: Complex models could lead to overfitting. How to counter?
Regularization: The main idea is to modify the error function so that model complexity is also explicitly penalized:
$Loss_{reg}(w) = Loss_D(w) + \lambda \cdot Reg(w)$
A squared penalty on the weights, i.e. $Reg(w) = \|w\|_2^2$, is a popular penalty function and is known as L2 regularization.
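As a concrete illustration, here is a minimal NumPy sketch of this regularized objective for linear least squares (the function and variable names are ours, not from the lecture):

import numpy as np

def l2_regularized_loss(w, X, y, lam):
    """Squared-error data loss plus an L2 penalty: Loss_D(w) + lam * ||w||_2^2."""
    residuals = X @ w - y
    data_loss = np.sum(residuals ** 2)   # Loss_D(w)
    penalty = np.sum(w ** 2)             # Reg(w) = ||w||_2^2
    return data_loss + lam * penalty     # Loss_reg(w)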

SLIDE 3

SLIDE 4

Recap: MAP objective and regularization

Bayesian view of regularization: Regularization can be achieved using different types of priors on the parameters

$w_{MAP} = \arg\min_w \; \frac{1}{2\sigma^2} \sum_j (y_j - w^T x_j)^2 + \frac{\lambda}{2} \|w\|_2^2$

We get an L2 regularized solution for the linear regression problem using a Gaussian prior on the weights. What happens when $\|w\|_2^2$ is replaced with $\|w\|_1$? Contrast their level curves!
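A short derivation sketch of why the Gaussian prior produces exactly this objective, assuming i.i.d. Gaussian noise with variance $\sigma^2$ and a zero-mean isotropic Gaussian prior with variance $\tau^2$ on the weights (the symbol $\tau$ is ours):

$w_{MAP} = \arg\max_w \; p(w \mid D) = \arg\min_w \; \left[-\log p(D \mid w) - \log p(w)\right] = \arg\min_w \; \frac{1}{2\sigma^2} \sum_j (y_j - w^T x_j)^2 + \frac{1}{2\tau^2} \|w\|_2^2 + \text{const}$

so the $\lambda$ above corresponds to $1/\tau^2$: a tighter (smaller-variance) prior means stronger regularization.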

SLIDE 5

Number of zero w's for different lambdas (L2/ridge regularization):

lambda    # of zero weights
1e-15     0
1e-10     0
1e-08     0
0.0001    0
0.001     0
0.01      0
1         0
5         0
10        0
20        0
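For context, a minimal sketch of the closed-form ridge (L2) estimate from which such counts would come (the function name is ours). L2 shrinks weights toward zero but generically never makes them exactly zero, which is why every count above is 0:

import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form L2-regularized least squares: w = (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)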

SLIDE 6

Contrasting Level Curves

SLIDE 7

Recap: Lasso Regularized Least Squares Regression

The general penalized (regularized) least squares problem:

$w_{Reg} = \arg\min_w \|\Phi w - y\|_2^2 + \lambda \, \Omega(w)$

$\Omega(w) = \|w\|_1 \Rightarrow$ Lasso

Lasso Regression:

$w_{lasso} = \arg\min_w \|\Phi w - y\|_2^2 + \lambda \|w\|_1$

Lasso is the MAP estimate of linear regression subject to a Laplace prior on $w \sim \text{Laplace}(0, \theta)$:

$\text{Laplace}(w_i \mid \mu, b) = \frac{1}{2b} \exp\left(\frac{-|w_i - \mu|}{b}\right)$
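Taking the negative log of this prior (with $\mu = 0$ and scale $b$) shows in one line why it induces the L1 penalty:

$-\log p(w) = -\sum_i \log \text{Laplace}(w_i \mid 0, b) = \frac{1}{b} \sum_i |w_i| + \text{const} = \frac{1}{b} \|w\|_1 + \text{const}$

so the MAP objective picks up a $\|w\|_1$ term, with $\lambda$ proportional to $1/b$.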

SLIDE 8

Gaussian Hare vs. Laplacian Tortoise

Gaussian prior: easier to estimate. Laplacian prior: yields more sparsity.

SLIDE 9

Lasso: Iterative Soft Thresholding Algorithm (ISTA)

The LASSO regularized least squares problem:

$w_{Lasso} = \arg\min_w E_{Lasso}(w) = \arg\min_w E_{LS}(w) + \lambda \|w\|_1$, where $E_{LS}(w) = \|\Phi w - y\|_2^2$

while the relative drop in $E_{Lasso}(w^t)$ across $t = k$ and $t = k+1$ is significant:

LS iterate: $w^{k+1}_{LS} = w^k_{Lasso} - \eta \nabla E_{LS}(w^k_{Lasso})$

Proximal¹ step:

$\left(w^{k+1}_{Lasso}\right)_i = \begin{cases} \left(w^{k+1}_{LS}\right)_i - \lambda\eta & \text{if } \left(w^{k+1}_{LS}\right)_i > \lambda\eta \\ \left(w^{k+1}_{LS}\right)_i + \lambda\eta & \text{if } \left(w^{k+1}_{LS}\right)_i < -\lambda\eta \\ 0 & \text{otherwise} \end{cases}$

¹See slide 1 of https://www.cse.iitb.ac.in/~cs709/notes/enotes/24-23-10-2018-generalized-proximal-projected-gradientdescent-examples-geometry-convergence-accelerated-annotated.pdf
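A minimal NumPy sketch of this iterative soft-thresholding loop, assuming a fixed step size eta and a simple relative-decrease stopping rule (the names, tolerance, and iteration cap are ours):

import numpy as np

def soft_threshold(v, t):
    """Proximal step for the L1 norm: shrink each entry toward zero by t, zeroing small ones."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(Phi, y, lam, eta, tol=1e-6, max_iter=10000):
    """Minimize ||Phi w - y||_2^2 + lam * ||w||_1 by alternating gradient and proximal steps."""
    w = np.zeros(Phi.shape[1])
    prev = np.inf
    for _ in range(max_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - y)              # gradient of E_LS at the current iterate
        w = soft_threshold(w - eta * grad, lam * eta)   # LS iterate followed by proximal step
        obj = np.sum((Phi @ w - y) ** 2) + lam * np.sum(np.abs(w))
        if prev - obj < tol * max(prev, 1.0):           # relative drop no longer significant
            break
        prev = obj
    return w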

SLIDE 10

Note how LASSO yields greater sparsity

Number of w's that are zero for different values of lambda (Lasso):

lambda    # of zero weights
1e-15     0
1e-10     0
1e-08     0
1e-05     8
0.0001    10
0.001     12
0.01      13
1         15
5         15
10        15
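A table like this can be produced by sweeping lambda and counting exact zeros; a sketch reusing the hypothetical ista function from the block above, on synthetic data of our own making:

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 16))                                      # synthetic design matrix
y = Phi[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
for lam in [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10]:
    w = ista(Phi, y, lam, eta=1e-3)                                   # small fixed step size
    print(f"lambda={lam:g}  zero weights: {np.count_nonzero(w == 0)}")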

SLIDE 11

CS 337: Artificial Intelligence & Machine Learning
Instructor: Prof. Ganesh Ramakrishnan
Lecture: Understanding Generalization and Overfitting through bias & variance
August 2019

SLIDE 12

Evaluating model performance

We saw in the last class how to estimate linear predictors by minimizing a squared loss objective function. How do we evaluate whether or not our estimated predictor is good?
Measure 1: Training error

SLIDE 13

Evaluating model performance

We saw in the last class how to estimate linear predictors by minimizing a squared loss objective function. How do we evaluate whether or not our estimated predictor is good?
Measure 1: Training error
Measure 2: Test error

SLIDE 14

Error vs. Model Complexity

[Figure: prediction error plotted against model complexity]

SLIDE 15

Sources of error

Three main sources of test error:

1. Bias
2. Variance
3. Noise

SLIDE 16

Example: function

SLIDE 17

Fitting 50 lines after slight perturbation of points

SLIDE 18

Variance after slight perturbation of points

SLIDE 19

Bias (with respect to non-linear fit)

SLIDE 20

Noise

SLIDE 21

Overfitting

Overfitting: When the proposed hypothesis fits the training data too well, capturing noise rather than the underlying pattern, at the cost of generalization to unseen data

SLIDE 22

Underfitting

Underfitting: When the hypothesis is too simple (insufficient) to fit even the training data well

SLIDE 23

Bias/Variance Decomposition for Regression

SLIDE 24

Bias-Variance Analysis in Regression

Say the true underlying function is $y = g(x) + \varepsilon$, where $\varepsilon$ is a random variable with mean 0 and variance $\sigma^2$.

Given a dataset of $m$ samples, $D = \{(x_i, y_i)\}_{i=1}^{m}$, we fit a linear hypothesis parameterized by $w$, $f_D(x) = w^T x$, to minimize the sum of squared errors $\sum_i (y_i - f_D(x_i))^2$.

Given a new test point $\hat{x}$, whose corresponding $\hat{y} = g(\hat{x}) + \hat{\varepsilon}$, what is the expected test error for $\hat{x}$, $Err(\hat{x}) = E_{D,\hat{\varepsilon}}[(f_D(\hat{x}) - \hat{y})^2]$?

SLIDE 25

Decomposing expected test error

Writing $f(\hat{x})$ for $f_D(\hat{x})$ and $\bar{f}(\hat{x}) = E_D[f(\hat{x})]$ for its average over training sets (and noting that $f(\hat{x})$ and $\hat{y}$ are independent):

$E[(f(\hat{x}) - \hat{y})^2] = E[f(\hat{x})^2 + \hat{y}^2 - 2 f(\hat{x})\hat{y}]$
$= E[f(\hat{x})^2] + E[\hat{y}^2] - 2 E[f(\hat{x})] E[\hat{y}]$
$= E[(f(\hat{x}) - \bar{f}(\hat{x}))^2] + \bar{f}(\hat{x})^2 + E[\hat{y}^2] - 2 E[f(\hat{x})] E[\hat{y}]$
$= E[(f(\hat{x}) - \bar{f}(\hat{x}))^2] + \bar{f}(\hat{x})^2 + E[\hat{y}^2] - 2 \bar{f}(\hat{x}) g(\hat{x}) \qquad (1)$

where we have used the fact that $E[(x - E[x])^2] + (E[x])^2 = E[x^2]$, together with $E[f(\hat{x})] = \bar{f}(\hat{x})$ and $E[\hat{y}] = g(\hat{x})$.

SLIDE 26

Decomposing expected test error

Applying the same trick used in Equation (1) to $E[\hat{y}^2]$, we get

$E[(f(\hat{x}) - \hat{y})^2] = E[(f(\hat{x}) - \bar{f}(\hat{x}))^2] + \bar{f}(\hat{x})^2 + E[(\hat{y} - g(\hat{x}))^2] + g(\hat{x})^2 - 2 \bar{f}(\hat{x}) g(\hat{x})$

SLIDE 27

Bias-variance decomposition

$E[(f(\hat{x}) - \hat{y})^2] = E[(f(\hat{x}) - \bar{f}(\hat{x}))^2] + (\bar{f}(\hat{x}) - g(\hat{x}))^2 + E[(\hat{y} - g(\hat{x}))^2]$

$E[(f(\hat{x}) - \hat{y})^2] = \text{Variance}(f(\hat{x})) + \text{Bias}(f(\hat{x}))^2 + \sigma^2$

SLIDE 28

Each error term

Bias: $\bar{f}(\hat{x}) - g(\hat{x})$, the average error of $f(\hat{x})$ relative to the true function
Variance: $E[(f(\hat{x}) - \bar{f}(\hat{x}))^2]$, the variance of $f(\hat{x})$ across different training datasets
Noise: $E[(\hat{y} - g(\hat{x}))^2] = E[\hat{\varepsilon}^2] = \sigma^2$, irreducible noise
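A small simulation sketch of these three terms, fitting a linear model on many training sets drawn from a hypothetical nonlinear g (the choice of g, the noise level, and all names are ours, for illustration only):

import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.sin(2 * np.pi * x)              # assumed true function (illustrative)

sigma = 0.3                                   # noise standard deviation
x_hat = 0.4                                   # a fixed test point
m, n_datasets = 20, 500                       # samples per dataset, number of datasets

preds = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.uniform(0.0, 1.0, m)
    y = g(x) + sigma * rng.normal(size=m)             # y = g(x) + eps
    X = np.column_stack([np.ones(m), x])              # linear hypothesis (with intercept)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares fit f_D
    preds[d] = w[0] + w[1] * x_hat                    # f_D(x_hat)

bias = preds.mean() - g(x_hat)     # f_bar(x_hat) - g(x_hat)
variance = preds.var()             # E[(f_D(x_hat) - f_bar(x_hat))^2]
noise = sigma ** 2                 # E[(y_hat - g(x_hat))^2]
print(bias ** 2, variance, noise)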

SLIDE 29

Illustrating bias and variance

Image from http://scott.fortmann-roe.com/docs/BiasVariance.html

SLIDE 30

Model Selection

Given the bias-variance tradeoff, how do we choose the best predictor for the problem at hand? How do we set the model’s parameters?

TO BE DISCUSSED IN NEXT LAB SESSION

SLIDE 31

Measuring bias/variance

Bootstrap sampling: Repeatedly sample observations from a dataset with replacement.
For each bootstrap dataset Db, let Vb refer to the left-out samples, which will be used for validation.
Train on Db to estimate fb and test on each sample in Vb.

TO BE DISCUSSED IN NEXT LAB SESSION

SLIDE 32

Measuring bias/variance

Bootstrap sampling: Repeatedly sample observations from a dataset with replacement.
For each bootstrap dataset Db, let Vb refer to the left-out samples, which will be used for validation.
Train on Db to estimate fb and test on each sample in Vb.
Compute bias and variance.

TO BE DISCUSSED IN NEXT LAB SESSION
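A minimal sketch of this bootstrap procedure, assuming generic fit and predict callables and arrays X, y (all names are ours):

import numpy as np

def bootstrap_predictions(X, y, fit, predict, n_boot=200, seed=0):
    """Train on each bootstrap resample D_b and predict on its left-out samples V_b."""
    rng = np.random.default_rng(seed)
    m = len(y)
    preds = np.full((n_boot, m), np.nan)      # preds[b, i]: prediction for sample i when left out of D_b
    for b in range(n_boot):
        idx = rng.integers(0, m, size=m)                      # sample with replacement -> D_b
        left_out = np.setdiff1d(np.arange(m), idx)            # V_b: samples never drawn into D_b
        model = fit(X[idx], y[idx])                           # estimate f_b on D_b
        if left_out.size > 0:
            preds[b, left_out] = predict(model, X[left_out])  # test f_b on V_b
    return preds

The per-sample spread of preds across bootstrap rounds (ignoring NaNs) estimates the variance term, and the gap between the per-sample mean prediction and y estimates the bias.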

SLIDE 33

Train-Validation-Test split

Divide the available samples into three sets:

1. Train set: Used to train the learning algorithm
2. Validation/Development set: Used for model selection and tuning hyperparameters
3. Test/Evaluation set: Used for final testing

TO BE DISCUSSED IN NEXT LAB SESSION
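A common way to realize this split in code, shown here as a sketch with an illustrative 70/15/15 ratio (the ratio and names are our assumptions, not from the slides):

import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then slice the samples into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_frac)
    n_val = int(len(y) * val_frac)
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])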

SLIDE 34

Cross-Validation

k-fold Cross-Validation
Given: a training set $D$ of $m$ examples, a set of parameter values $\Theta$, a learner $F$, and the number of folds $k$
Split $D$ into $k$ folds $D_1, \ldots, D_k$
For each $\theta \in \Theta$:
    for $i = 1 \ldots k$: estimate $f_{i,\theta} = F_\theta(D \setminus D_i)$
    $err_\theta = \frac{1}{k} \sum_{i=1}^{k} Loss(f_{i,\theta})$, with each $Loss(f_{i,\theta})$ measured on the held-out fold $D_i$
Output: $\theta^* = \arg\min_\theta err_\theta$ and $f_{\theta^*} = F_{\theta^*}(D)$

TO BE DISCUSSED IN NEXT LAB SESSION
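A compact sketch of this procedure, assuming the learner is exposed as a callable F(theta, X, y) that returns a fitted predictor and loss(f, X, y) returns its average loss on the given samples (all names are illustrative):

import numpy as np

def k_fold_cv(X, y, thetas, F, loss, k=5, seed=0):
    """Select the theta with the lowest average held-out loss across k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)             # D_1, ..., D_k
    errs = {}
    for theta in thetas:
        fold_losses = []
        for i in range(k):
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            f = F(theta, X[train_idx], y[train_idx])                # f_{i,theta} = F_theta(D \ D_i)
            fold_losses.append(loss(f, X[folds[i]], y[folds[i]]))   # loss on held-out fold D_i
        errs[theta] = np.mean(fold_losses)                          # err_theta
    best = min(errs, key=errs.get)                                  # theta* = argmin_theta err_theta
    return best, F(best, X, y)                                      # f_{theta*} = F_{theta*}(D)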