

  1. CS 337: Artificial Intelligence & Machine Learning
     Instructor: Prof. Ganesh Ramakrishnan
     Lecture 8: Regularization, Overfitting, Bias and Variance
     August 2019

  2. Recap: Regularization for Generalizability
     Recall: Complex models could lead to overfitting. How to counter?
     Regularization: The main idea is to modify the error function so that model complexity is also explicitly penalized:
         Loss_reg(w) = Loss_D(w) + λ · Reg(w)
     A squared penalty on the weights, i.e. Reg(w) = ||w||_2^2, is a popular penalty function and is known as L2 regularization.
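As a concrete illustration of an L2-penalized objective, here is a minimal NumPy sketch of regularized least squares for linear regression (the data and function names are illustrative, not from the lecture); with Reg(w) = ||w||_2^2, the minimizer has the closed form w = (ΦᵀΦ + λI)⁻¹ Φᵀ y:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi w - y||^2 + lam * ||w||^2.
    Closed form: w = (Phi^T Phi + lam * I)^(-1) Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))                   # design matrix: 50 samples, 5 features
w_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = Phi @ w_true + 0.1 * rng.normal(size=50)
w_hat = ridge_fit(Phi, y, lam=1.0)               # weights shrink toward 0, but none become exactly 0
```

Larger λ shrinks the weights more strongly toward zero, trading training error for lower model complexity.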

  3. Recap: MAP objective and regularization
     Bayesian view of regularization: regularization can be achieved using different types of priors on the parameters:
         w_MAP = arg min_w  (1 / (2σ²)) Σ_j (y_j − wᵀ x_j)²  +  (λ/2) ||w||_2^2
     We get an L2-regularized solution for the linear regression problem using a Gaussian prior on the weights.
     What happens when ||w||_2^2 is replaced with ||w||_1? Contrast their level curves!
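To actually contrast the level curves, one can plot contours of the two penalties in two dimensions; a minimal matplotlib sketch (illustrative only, not part of the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate both penalties on a 2-D grid of weight vectors (w1, w2)
w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
l2 = w1**2 + w2**2            # ||w||_2^2: level curves are circles
l1 = np.abs(w1) + np.abs(w2)  # ||w||_1:   level curves are diamonds with corners on the axes

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].contour(w1, w2, l2, levels=[0.5, 1.0, 1.5])
axes[0].set_title("L2 penalty")
axes[1].contour(w1, w2, l1, levels=[0.5, 1.0, 1.5])
axes[1].set_title("L1 penalty")
plt.show()
```

The corners of the L1 level curves lie on the coordinate axes, which is why an L1-penalized solution tends to have some weights exactly equal to zero.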

  4. Number of zero weights for different values of lambda (L2 penalty):
         lambda:  1e-15  1e-10  1e-08  0.0001  0.001  0.01   1   5  10  20
         zeros:       0      0      0       0      0     0   0   0   0   0
     With the L2 penalty, none of the weights are driven exactly to zero.

  5. Contrasting Level Curves

  6. Recap: Lasso Regularized Least Squares Regression
     The general penalized (regularized) least-squares problem:
         w_Reg = arg min_w  ||Φw − y||_2^2 + λ Ω(w)
     Ω(w) = ||w||_1  ⇒  Lasso
     Lasso Regression:
         w_lasso = arg min_w  ||Φw − y||_2^2 + λ ||w||_1
     Lasso is the MAP estimate of linear regression subject to a Laplace prior on w ~ Laplace(0, θ):
         Laplace(w_i | μ, b) = (1 / (2b)) exp(−|w_i − μ| / b)
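For reference, a short sketch of the Lasso objective and the Laplace density as written above (NumPy, names are illustrative):

```python
import numpy as np

def lasso_objective(w, Phi, y, lam):
    """E_Lasso(w) = ||Phi w - y||_2^2 + lam * ||w||_1"""
    r = Phi @ w - y
    return r @ r + lam * np.sum(np.abs(w))

def laplace_pdf(w, mu=0.0, b=1.0):
    """Laplace(w | mu, b) = 1/(2b) * exp(-|w - mu| / b)"""
    return np.exp(-np.abs(w - mu) / b) / (2.0 * b)
```

Minimizing the negative log-posterior under this prior reproduces the λ||w||_1 term, just as the Gaussian prior reproduces the L2 term.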

  7. Gaussian Hare vs. Laplacian Tortoise
     Gaussian: easier to estimate. Laplacian: yields more sparsity.

  8. Lasso: Iterative Soft Thresholding Algorithm (ISTA)
     The Lasso-regularized least-squares problem:
         w_Lasso = arg min_w E_Lasso(w) = arg min_w E_LS(w) + λ ||w||_1,   where E_LS(w) = ||Φw − y||_2^2
     While the relative drop in E_Lasso(w^t) between t = k and t = k+1 is significant, iterate:
         Gradient (LS) step:   w_LS^{k+1} = w_Lasso^k − η ∇E_LS(w_Lasso^k)
         Proximal step [1]:    w_Lasso,i^{k+1} =  w_LS,i^{k+1} − λη   if w_LS,i^{k+1} > λη
                                                  w_LS,i^{k+1} + λη   if w_LS,i^{k+1} < −λη
                                                  0                   otherwise
     [1] See slide 1 of https://www.cse.iitb.ac.in/~cs709/notes/enotes/24-23-10-2018-generalized-proximal-projected-gradientdescent-examples-geometry-convergence-accelerated-annotated.pdf
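Below is a runnable sketch of ISTA as stated above; the step size η and the stopping tolerance are assumptions chosen for illustration (η is set from the Lipschitz constant of ∇E_LS):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal step: shrink each coordinate toward 0 by tau, zeroing small entries."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(Phi, y, lam, eta=None, tol=1e-6, max_iter=10_000):
    """Minimize ||Phi w - y||_2^2 + lam * ||w||_1 by iterative soft thresholding."""
    if eta is None:
        # 1 / Lipschitz constant of grad E_LS (the gradient is 2 Phi^T (Phi w - y))
        eta = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)
    w = np.zeros(Phi.shape[1])
    prev = np.inf
    for _ in range(max_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - y)                  # gradient (LS) step ...
        w = soft_threshold(w - eta * grad, lam * eta)       # ... followed by the proximal step
        obj = np.sum((Phi @ w - y) ** 2) + lam * np.sum(np.abs(w))
        if prev - obj < tol * max(prev, 1.0):               # relative drop no longer significant
            break
        prev = obj
    return w
```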

  9. Note how Lasso yields greater sparsity
     Number of zero weights for different values of lambda (L1 penalty):
         lambda:  1e-15  1e-10  1e-08  1e-05  0.0001  0.001  0.01   1   5  10
         zeros:       0      0      0      8      10     12    13  15  15  15
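Counts like these can be reproduced in spirit with an off-the-shelf Lasso solver; a sketch using scikit-learn on synthetic placeholder data (note that scikit-learn scales the squared-error term by 1/(2m), so its alpha is not numerically identical to the λ values above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
w_true = np.zeros(16)
w_true[:4] = [3.0, -2.0, 1.5, 0.5]            # only a few informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

for lam in [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10]:
    model = Lasso(alpha=lam, max_iter=100_000).fit(X, y)
    print(f"lambda={lam:g}: {np.sum(model.coef_ == 0)} zero weights")
```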

  10. CS 337: Artificial Intelligence & Machine Learning
      Instructor: Prof. Ganesh Ramakrishnan
      Lecture: Understanding Generalization and Overfitting through Bias & Variance
      August 2019

  11. Evaluating model performance We saw in the last class how to estimate linear predictors by minimizing a squared loss objective function. How do we evaluate whether or not our estimated predictor is good? Measure 1: Training error

  12. Evaluating model performance We saw in the last class how to estimate linear predictors by minimizing a squared loss objective function. How do we evaluate whether or not our estimated predictor is good? Measure 1: Training error Measure 2: Test error

  13. Error vs. Model Complexity
      [Figure: prediction error plotted against model complexity]

  14. Sources of error
      Three main sources of test error: (1) bias, (2) variance, (3) noise.

  15. Example: function [figure]

  16. Fitting 50 lines after slight perturbation of points

  17. Variance after slight perturbation of points

  18. Bias (with respect to non-linear fit)

  19. Noise

  20. Overfitting
      Overfitting: when the proposed hypothesis fits the training data too well

  21. Underfitting
      Underfitting: when the hypothesis is insufficient to fit the training data

  22. Bias/Variance Decomposition for Regression

  23. Bias-Variance Analysis in Regression
      Say the true underlying function is y = g(x) + ε, where ε is a random variable with mean 0 and variance σ².
      Given a dataset of m samples, D = {(x_i, y_i)}, i = 1 … m, we fit a linear hypothesis parameterized by w, f_D(x) = wᵀx, to minimize the sum of squared errors Σ_i (y_i − f_D(x_i))².
      Given a new test point x̂, whose corresponding ŷ = g(x̂) + ε̂, what is the expected test error for x̂, Err(x̂) = E_{D, ε̂}[(f_D(x̂) − ŷ)²]?

  24. Decomposing expected test error
      Writing f(x̂) for f_D(x̂) and f̄(x̂) for its average over training datasets, E_D[f(x̂)]:
          E[(f(x̂) − ŷ)²] = E[f(x̂)² + ŷ² − 2 f(x̂) ŷ]
                          = E[f(x̂)²] + E[ŷ²] − 2 E[f(x̂)] E[ŷ]
                          = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[ŷ²] − 2 E[f(x̂)] E[ŷ]
                          = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[ŷ²] − 2 f̄(x̂) g(x̂)        (1)
      where we have used the fact that E[x²] = E[(x − E[x])²] + (E[x])².

  25. Decomposing expected test error
      Applying the same trick used in Equation (1) to E[ŷ²], we get
          E[(f(x̂) − ŷ)²] = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[(ŷ − g(x̂))²] + g(x̂)² − 2 f̄(x̂) g(x̂)

  26. Bias-variance decomposition
          E[(f(x̂) − ŷ)²] = E[(f(x̂) − f̄(x̂))²] + (f̄(x̂) − g(x̂))² + E[(ŷ − g(x̂))²]
                          = Variance(f(x̂)) + Bias(f(x̂))² + σ²

  27. Each error term
      Bias: f̄(x̂) − g(x̂), the average error of f(x̂)
      Variance: E[(f(x̂) − f̄(x̂))²], the variance of f(x̂) across different training datasets
      Noise: E[(ŷ − g(x̂))²] = E[ε²] = σ², the irreducible noise
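Each of these terms can be estimated empirically with a Monte-Carlo simulation: repeatedly draw a training set D, fit f_D, and average. A sketch follows; the choice of g, σ, and the sample sizes are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
g = lambda x: np.sin(3 * x)            # assumed true function g (not from the slides)
x_hat = 0.7                            # fixed test point

preds = []
for _ in range(2000):                  # many independent training datasets D
    x = rng.uniform(-1, 1, size=20)
    y = g(x) + sigma * rng.normal(size=20)
    Phi = np.column_stack([np.ones_like(x), x])        # linear hypothesis f_D(x) = w0 + w1*x
    w = np.linalg.lstsq(Phi, y, rcond=None)[0]
    preds.append(w[0] + w[1] * x_hat)
preds = np.array(preds)

bias2 = (preds.mean() - g(x_hat)) ** 2                 # (f_bar(x_hat) - g(x_hat))^2
variance = preds.var()                                 # E[(f(x_hat) - f_bar(x_hat))^2]
noise = sigma ** 2
y_new = g(x_hat) + sigma * rng.normal(size=preds.size) # fresh test labels y_hat
measured = np.mean((preds - y_new) ** 2)               # direct estimate of E[(f(x_hat) - y_hat)^2]
print(f"bias^2 + variance + noise = {bias2 + variance + noise:.3f}  vs  measured = {measured:.3f}")
```

The two printed numbers should agree up to Monte-Carlo error, illustrating the decomposition.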

  28. Illustrating bias and variance Image from http://scott.fortmann-roe.com/docs/BiasVariance.html

  29. Model Selection (TO BE DISCUSSED IN NEXT LAB SESSION)
      Given the bias-variance tradeoff, how do we choose the best predictor for the problem at hand? How do we set the model's parameters?

  30. Measuring bias/variance (TO BE DISCUSSED IN NEXT LAB SESSION)
      Bootstrap sampling: repeatedly sample observations from a dataset with replacement.
      For each bootstrap dataset D_b, let V_b refer to the left-out samples, which will be used for validation.
      Train on D_b to estimate f_b and test on each sample in V_b.

  31. Measuring bias/variance (TO BE DISCUSSED IN NEXT LAB SESSION)
      Bootstrap sampling: repeatedly sample observations from a dataset with replacement.
      For each bootstrap dataset D_b, let V_b refer to the left-out samples, which will be used for validation.
      Train on D_b to estimate f_b and test on each sample in V_b.
      Compute bias and variance.
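A minimal sketch of this bootstrap procedure, using ordinary least squares as the learner purely for illustration; since g is unknown, the observed y_i stands in for g(x_i) when estimating the bias:

```python
import numpy as np

def bootstrap_bias_variance(X, y, n_boot=200, rng=None):
    """For each bootstrap dataset D_b (sampled with replacement), train f_b on D_b
    and evaluate it on the left-out (out-of-bag) samples V_b."""
    rng = rng or np.random.default_rng(0)
    m = len(y)
    preds = [[] for _ in range(m)]                     # out-of-bag predictions per sample
    for _ in range(n_boot):
        idx = rng.integers(0, m, size=m)               # D_b: sample indices with replacement
        oob = np.setdiff1d(np.arange(m), idx)          # V_b: samples left out of D_b
        w = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]   # f_b: least-squares fit on D_b
        for i in oob:
            preds[i].append(X[i] @ w)
    variance = np.mean([np.var(p) for p in preds if p])
    bias2 = np.mean([(np.mean(p) - y[i]) ** 2 for i, p in enumerate(preds) if p])
    return bias2, variance
```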

  32. Train-Validation-Test split (TO BE DISCUSSED IN NEXT LAB SESSION)
      Divide the available samples into three sets:
      (1) Train set: used to train the learning algorithm
      (2) Validation/Development set: used for model selection and tuning hyperparameters
      (3) Test/Evaluation set: used for final testing
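A simple way to make this three-way split (the 60/20/20 ratios below are just one common choice, not prescribed by the slides):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, rng=None):
    """Shuffle the samples and split them into train / validation / test sets."""
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    n_val = int(val_frac * len(y))
    test, val, train = perm[:n_test], perm[n_test:n_test + n_val], perm[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```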

  33. Cross-Validation (TO BE DISCUSSED IN NEXT LAB SESSION)
      k-fold Cross-Validation
      Given: training set D of m examples, a set of parameters Θ, a learner F, and the number of folds k
      Split D into k folds, D_1, …, D_k
      For each θ ∈ Θ:
          for i = 1 … k: estimate f_{i,θ} = F_θ(D \ D_i)
          err_θ = (1/k) Σ_{i=1}^{k} Loss(f_{i,θ}), with the loss evaluated on the held-out fold D_i
      Output: θ* = arg min_θ err_θ, and f_{θ*} = F_{θ*}(D)
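A sketch of the procedure above; here `F_theta` stands for any learner that maps (training data, θ) to a fitted predictor and `loss` for the chosen evaluation metric, both assumed rather than specified on the slide:

```python
import numpy as np

def k_fold_cv(X, y, thetas, F_theta, loss, k=5, rng=None):
    """Select theta* by k-fold cross-validation, then retrain on all of D."""
    rng = rng or np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(y)), k)      # D_1, ..., D_k
    errs = {}
    for theta in thetas:
        fold_losses = []
        for i in range(k):
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])   # D \ D_i
            f = F_theta(X[train_idx], y[train_idx], theta)                       # f_{i,theta}
            fold_losses.append(loss(f, X[folds[i]], y[folds[i]]))                # loss on D_i
        errs[theta] = np.mean(fold_losses)                   # err_theta
    theta_star = min(errs, key=errs.get)                     # arg min over theta
    return theta_star, F_theta(X, y, theta_star)             # f_{theta*} = F_{theta*}(D)
```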
