CS 337: Artificial Intelligence & Machine Learning Instructor: Prof. Ganesh Ramakrishnan Lecture 8: Regularization, Overfitting, Bias and Variance August 2019
Recap: Regularization for Generalizability
Recall: Complex models could lead to overfitting. How to counter? Regularization: the main idea is to modify the error function so that model complexity is also explicitly penalized:

Loss_reg(w) = Loss_D(w) + λ · Reg(w)

A squared penalty on the weights, i.e. Reg(w) = ||w||_2^2, is a popular penalty function and is known as L2 regularization.
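As a concrete sketch (not from the slides; the name `ridge_fit` and the toy data are illustrative), the L2-regularized least-squares objective has the closed-form minimizer w = (Φ^T Φ + λI)^(-1) Φ^T y, and larger λ visibly shrinks the weights:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi w - y||_2^2 + lam * ||w||_2^2 via the closed form
    w = (Phi^T Phi + lam * I)^(-1) Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Larger lam penalizes ||w||_2^2 more heavily and shrinks the weights toward 0.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
y = Phi @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
w_small_lam = ridge_fit(Phi, y, lam=0.01)
w_large_lam = ridge_fit(Phi, y, lam=100.0)
```

Note that the weights shrink toward zero but, unlike lasso, do not become exactly zero.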
Recap: MAP objective and regularization
Bayesian view of regularization: regularization can be achieved using different types of priors on the parameters.

w_MAP = arg min_w  (1/(2σ²)) Σ_j (y_j − w^T x_j)² + (λ/2) ||w||_2^2

We get an L2-regularized solution for the linear regression problem using a Gaussian prior on the weights. What happens when ||w||_2^2 is replaced with ||w||_1? Contrast their level curves!
Number of zero w's for different values of λ (ridge):
λ:      1e-15  1e-10  1e-08  0.0001  0.001  0.01  1  5  10  20
zeros:  0      0      0      0       0      0     0  0  0   0
Contrasting Level Curves
Recap: Lasso Regularized Least Squares Regression
The general penalized (regularized) least-squares problem:

w_Reg = arg min_w ||Φw − y||_2^2 + λ Ω(w)

Ω(w) = ||w||_1 ⇒ Lasso
Lasso regression:

w_lasso = arg min_w ||Φw − y||_2^2 + λ ||w||_1

Lasso is the MAP estimate of linear regression under a Laplace prior on the weights, w ∼ Laplace(0, θ):

Laplace(w_i | µ, b) = (1/(2b)) exp(−|w_i − µ| / b)
Gaussian Hare vs. Laplacian Tortoise
Gaussian: easier to estimate. Laplacian: yields more sparsity.
Lasso: Iterative Soft Thresholding Algorithm (ISTA)
The LASSO-regularized least-squares problem:

w_Lasso = arg min_w E_Lasso(w) = arg min_w E_LS(w) + λ ||w||_1,  where E_LS(w) = ||Φw − y||_2^2
while the relative drop in E_Lasso(w^t) across t = k and t = k+1 is significant:
    LS iterate:  w_LS^{k+1} = w_Lasso^k − η ∇E_LS(w_Lasso^k)
    Proximal¹ step (componentwise soft thresholding):
        [w_Lasso^{k+1}]_i =  [w_LS^{k+1}]_i − λη   if [w_LS^{k+1}]_i > λη
                             [w_LS^{k+1}]_i + λη   if [w_LS^{k+1}]_i < −λη
                             0                     otherwise
¹See slide 1 of https://www.cse.iitb.ac.in/~cs709/notes/enotes/24-23-10-2018-generalized-proximal-projected-gradientdescent-examples-geometry-convergence-accelerated-annotated.pdf
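The two-step iteration above can be sketched in NumPy as follows (a minimal sketch, not the course's code; it assumes a step size η < 1/(2·λ_max(Φ^T Φ)) so that the gradient step is stable):

```python
import numpy as np

def soft_threshold(v, t):
    # Componentwise soft thresholding: shrink each entry toward 0 by t;
    # entries inside [-t, t] become exactly 0, which is the source of sparsity.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(Phi, y, lam, eta, tol=1e-10, max_iter=10000):
    """ISTA for min_w ||Phi w - y||_2^2 + lam * ||w||_1.
    eta should satisfy eta < 1 / (2 * largest eigenvalue of Phi^T Phi)."""
    w = np.zeros(Phi.shape[1])
    prev = np.inf
    for _ in range(max_iter):
        grad = 2 * Phi.T @ (Phi @ w - y)               # gradient of the LS term
        w = soft_threshold(w - eta * grad, lam * eta)  # proximal step
        obj = np.sum((Phi @ w - y) ** 2) + lam * np.sum(np.abs(w))
        if prev - obj < tol * max(prev, 1.0):          # relative drop no longer significant
            break
        prev = obj
    return w
```

As a sanity check, when Φ = I the objective separates per coordinate and the minimizer is exactly soft_threshold(y, λ/2).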
Note how LASSO yields greater sparsity
Number of w's that are zero for different values of λ (lasso):
λ:      1e-15  1e-10  1e-08  1e-05  0.0001  0.001  0.01  1   5   10
zeros:  0      0      0      8      10      12     13    15  15  15
CS 337: Artificial Intelligence & Machine Learning Instructor: Prof. Ganesh Ramakrishnan Lecture: Understanding Generalization and Overfitting through bias & variance August 2019
Evaluating model performance
We saw in the last class how to estimate linear predictors by minimizing a squared-loss objective function. How do we evaluate whether or not our estimated predictor is good? Measure 1: training error. Measure 2: test error.
Error vs. Model Complexity
[Plot: prediction error vs. model complexity]
Sources of error
Three main sources of test error:
1. Bias
2. Variance
3. Noise
Example: function
Fitting 50 lines after slight perturbation of points
Variance after slight perturbation of points
Bias (with respect to non-linear fit)
Noise
Overfitting
Overfitting: When the proposed hypothesis fits the training data too well
Underfitting
Underfitting: When the hypothesis is insufficient to fit the training data
Bias/Variance Decomposition for Regression
Bias-Variance Analysis in Regression
Say the true underlying function is y = g(x) + ε, where ε is a random variable with mean 0 and variance σ². Given a dataset of m samples, D = {(x_i, y_i)}, i = 1 … m, we fit a linear hypothesis parameterized by w, f_D(x) = w^T x, to minimize the sum of squared errors Σ_i (y_i − f_D(x_i))². Given a new test point x̂, whose corresponding ŷ = g(x̂) + ε̂, what is the expected test error for x̂, Err(x̂) = E_{D,ε̂}[(f_D(x̂) − ŷ)²]?
Decomposing expected test error
Writing f(x̂) for f_D(x̂) and f̄(x̂) = E_D[f(x̂)] for its average over training sets:

E[(f(x̂) − ŷ)²] = E[f(x̂)² + ŷ² − 2 f(x̂) ŷ]
              = E[f(x̂)²] + E[ŷ²] − 2 E[f(x̂)] E[ŷ]                         (f and ŷ are independent)
              = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[ŷ²] − 2 E[f(x̂)] E[ŷ]
              = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[ŷ²] − 2 f̄(x̂) g(x̂)        (1)

where we have used the fact that E[(x − E[x])²] + (E[x])² = E[x²], together with E[f(x̂)] = f̄(x̂) and E[ŷ] = g(x̂).
Decomposing expected test error
Applying the same trick used in Equation (1) to E[ŷ²], we get

E[(f(x̂) − ŷ)²] = E[(f(x̂) − f̄(x̂))²] + f̄(x̂)² + E[(ŷ − g(x̂))²] + g(x̂)² − 2 f̄(x̂) g(x̂)
Bias-variance decomposition
E[(f(x̂) − ŷ)²] = E[(f(x̂) − f̄(x̂))²] + (f̄(x̂) − g(x̂))² + E[(ŷ − g(x̂))²]
              = Variance(f(x̂)) + Bias(f(x̂))² + σ²
Each error term
Bias: f̄(x̂) − g(x̂) — the average error of f(x̂).
Variance: E[(f(x̂) − f̄(x̂))²] — the variance of f(x̂) across different training datasets.
Noise: E[(ŷ − g(x̂))²] = E[ε̂²] = σ² — irreducible noise.
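The three terms can be estimated by simulation. The sketch below is my own toy setup (g(x) = sin 2x, a degree-1 polynomial fit, σ = 0.3, none of which come from the slides): it repeatedly redraws the training set D and checks that the measured test error at a fixed x̂ approximately equals variance + bias² + σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * x)   # assumed true underlying function
sigma = 0.3                   # noise standard deviation
x_hat = 1.0                   # fixed test point

def fit_linear(m=30):
    """Draw a fresh dataset D of m points and fit f_D(x) = w1*x + w0 by least squares."""
    x = rng.uniform(-2, 2, size=m)
    y = g(x) + sigma * rng.normal(size=m)
    w = np.polyfit(x, y, deg=1)      # degree-1 polynomial = linear hypothesis
    return np.polyval(w, x_hat)      # prediction f_D(x_hat)

preds = np.array([fit_linear() for _ in range(2000)])  # f_D(x_hat) across many datasets
variance = preds.var()                                 # E[(f_D - E[f_D])^2]
bias_sq = (preds.mean() - g(x_hat)) ** 2               # (E[f_D] - g(x_hat))^2
noise = sigma ** 2                                     # irreducible noise

# Monte Carlo estimate of the expected test error E[(f_D(x_hat) - y_hat)^2];
# it should be close to variance + bias_sq + noise.
y_hat = g(x_hat) + sigma * rng.normal(size=2000)
err = np.mean((preds - y_hat) ** 2)
```

Because a line cannot track sin 2x, the bias term dominates here: increasing model complexity would reduce it while increasing the variance term.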
Illustrating bias and variance
Image from http://scott.fortmann-roe.com/docs/BiasVariance.html
Model Selection
Given the bias-variance tradeoff, how do we choose the best predictor for the problem at hand? How do we set the model’s parameters?
TO BE DISCUSSED IN NEXT LAB SESSION
Measuring bias/variance
Bootstrap sampling: repeatedly sample observations from a dataset with replacement. For each bootstrap dataset D_b, let V_b refer to the left-out samples, which will be used for validation. Train on D_b to estimate f_b and test on each sample in V_b. Compute bias and variance.
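A sketch of one bootstrap resample (the helper name is mine): drawing m indices with replacement leaves out roughly a 1/e ≈ 36.8% fraction of the samples, which serves as the validation set V_b:

```python
import numpy as np

def bootstrap_indices(m, rng):
    """One bootstrap sample: draw m indices with replacement (D_b);
    the indices never drawn form the held-out validation set V_b."""
    train = rng.integers(0, m, size=m)
    val = np.setdiff1d(np.arange(m), train)
    return train, val

rng = np.random.default_rng(0)
m = 1000
train, val = bootstrap_indices(m, rng)
# P(a given index is never drawn) = (1 - 1/m)^m -> 1/e, so |V_b|/m is about 0.368.
frac_left_out = len(val) / m
```

Training f_b on each D_b and evaluating on the corresponding V_b gives the spread of predictions needed to estimate bias and variance.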
TO BE DISCUSSED IN NEXT LAB SESSION
Train-Validation-Test split
Divide the available samples into three sets:
1. Train set: used to train the learning algorithm
2. Validation/Development set: used for model selection and tuning hyperparameters
3. Test/Evaluation set: used for final testing
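A minimal index-splitting sketch of the three-way division (the 60/20/20 fractions and the function name are illustrative, not prescribed by the slides):

```python
import numpy as np

def train_val_test_split(n, frac_val=0.2, frac_test=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into disjoint train/val/test parts."""
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * frac_test)
    n_val = int(n * frac_val)
    test = idx[:n_test]                  # final evaluation only
    val = idx[n_test:n_test + n_val]     # model selection / hyperparameter tuning
    train = idx[n_test + n_val:]         # fitting the learner
    return train, val, test

train, val, test = train_val_test_split(100)
```

The key discipline is that the test set is touched only once, after all tuning on the validation set is done.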
TO BE DISCUSSED IN NEXT LAB SESSION
Cross-Validation
k-fold Cross-Validation
Given: a training set D of m examples, a set of hyperparameter values Θ, a learner F, and the number of folds k.
Split D into k folds D_1, …, D_k.
For each θ ∈ Θ:
    for i = 1 … k: estimate f_{i,θ} = F_θ(D \ D_i)
    err_θ = (1/k) Σ_{i=1}^{k} Loss(f_{i,θ}), with the loss measured on the held-out fold D_i
Output: θ* = arg min_θ err_θ, and the final model f_{θ*} = F_{θ*}(D) trained on all of D.
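The procedure can be sketched as follows, using ridge regression as a hypothetical learner F_θ with θ = λ (the function names and toy data are mine, not the course's):

```python
import numpy as np

def k_fold_cv(X, y, thetas, fit, loss, k=5, seed=0):
    """k-fold CV: for each theta, train on D \\ D_i and average the
    validation loss over the k held-out folds; return the best theta."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = {}
    for theta in thetas:
        total = 0.0
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            f = fit(X[trn], y[trn], theta)       # f_{i,theta} = F_theta(D \ D_i)
            total += loss(f, X[val], y[val])     # loss on the held-out fold D_i
        errs[theta] = total / k
    best = min(errs, key=errs.get)               # theta* = arg min err_theta
    return best, errs

# Hypothetical learner: ridge regression, theta = lambda.
def fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)
```

After selecting θ*, the final model would be retrained on the full dataset with that hyperparameter.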