Linear Regression, Regularization, Bias-Variance Tradeoff


SLIDE 1
  • Linear Regression, Regularization, Bias-Variance Tradeoff

HTF: Ch 3, 7; B: Ch 3

Thanks to C Guestrin, T Dietterich, R Parr, N Ray

SLIDE 2
  • Outline

Linear Regression
  MLE = Least Squares!
  Basis functions

Evaluating Predictors
  Training set error vs Test set error
  Cross Validation

Model Selection
  Bias-Variance analysis
  Regularization, Bayesian Model

SLIDE 3
  • What is the best choice of polynomial?

Noisy Source Data

SLIDE 4
  • Fit using degrees 0, 1, 3, 9
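A minimal numpy sketch of this experiment (assumptions: the noisy sinusoid that appears later in the deck, Slide 9, as the source, and plain least-squares polynomial fitting). Note that training error alone keeps shrinking with degree, which is exactly the trap the next slides discuss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy source data: 20 samples of the sinusoid used later in the deck
x = rng.uniform(0, 10, size=20)
t = x + 2 * np.sin(1.5 * x) + rng.normal(0, 0.2, size=20)

for degree in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=degree)   # least-squares polynomial fit
    train_mse = np.mean((t - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")
```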
SLIDE 5
  • Comparison

Degree 9 is the best match to the samples (over-fitting)

Degree 3 is the best match to the source

Performance on test data:

SLIDE 6
  • What went wrong?

A bad choice of polynomial? Not enough data?

Yes

SLIDE 7
  • Terms

x – input variable

x* – new input variable

h(x) – “truth” – underlying response function

t = h(x) + ε – actual observed response

y(x; D) – predicted response, based on model learned from dataset D

ȳ(x) = ED[ y(x; D) ] – expected response, averaged over (models based on) all datasets

SLIDE 8
  • Bias-Variance Analysis in Regression

Observed value is t(x) = h(x) + ε

ε ~ N(0, σ²)

normally distributed: mean 0, variance σ²

Note: h(x) = E[ t(x) | x ]

Given training examples, D = {(xi, ti)},

let y(·) = y(·; D) be the predicted function, based on a model learned using D

Eg, linear model yw(x) = w ⋅ x + w0, using w = MLE(D)

SLIDE 9
  • Example: 20 points

t = x + 2 sin(1.5x) + N(0, 0.2)

SLIDE 10
  • Bias-Variance Analysis

Given a new data point x*,

predicted response: y(x*)

observed response: t* = h(x*) + ε

The expected prediction error is …

SLIDE 11
  • Expected Loss

[y(x) – t]² = [y(x) – h(x) + h(x) – t]²
= [y(x) – h(x)]² + 2 [y(x) – h(x)] [h(x) – t] + [h(x) – t]²

First term: mismatch between OUR hypothesis y(·) & target h(·) … we can influence this
Cross term: expected value is 0, as h(x) = E[t|x]
Last term: noise in distribution of target … nothing we can do

Eerr = ∫∫ [y(x) – t]² p(x,t) dx dt
= ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x,t) dx dt

SLIDE 12
  • Relevant Part of Loss

Really, y(x) = y(x; D) is fit to data D … so consider the expectation over datasets D

Let ȳ(x) = ED[ y(x; D) ]

ED[ {h(x) – y(x; D)}² ]
= ED[ {h(x) – ȳ(x) + ȳ(x) – y(x; D)}² ]
= ED[ {h(x) – ȳ(x)}² ] + 2 ED[ {h(x) – ȳ(x)} {ȳ(x) – y(x; D)} ] + ED[ {y(x; D) – ȳ(x)}² ]
= {h(x) – ȳ(x)}² + ED[ {y(x; D) – ȳ(x)}² ]
(the cross term vanishes, since ED[ y(x; D) ] = ȳ(x))

Bias²   Variance

Recall Eerr = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x,t) dx dt; this decomposes the first term.
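The decomposition is easy to check numerically. A sketch under the deck's running example (the source function and noise level come from Slide 9; the polynomial degrees are my choice for illustration): fit many datasets, then estimate bias² and variance of the predictions at a fixed query point x*:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):                        # "truth": underlying response function
    return x + 2 * np.sin(1.5 * x)

x_star, sigma = 5.0, 0.2         # query point and noise level
n_datasets, n_points = 1000, 20

for degree in (1, 3, 9):
    preds = []
    for _ in range(n_datasets):              # many datasets D
        x = rng.uniform(0, 10, n_points)
        t = h(x) + rng.normal(0, sigma, n_points)
        w = np.polyfit(x, t, degree)         # y(.; D)
        preds.append(np.polyval(w, x_star))  # y(x*; D)
    preds = np.array(preds)
    y_bar = preds.mean()                     # ȳ(x*) = ED[ y(x*; D) ]
    bias2 = (y_bar - h(x_star)) ** 2         # {h(x*) – ȳ(x*)}²
    variance = preds.var()                   # ED[ {y(x*; D) – ȳ(x*)}² ]
    print(f"degree {degree}: bias^2 = {bias2:.4f}  variance = {variance:.4f}")
```

Low degrees show large bias² and small variance; high degrees the reverse, matching the slides that follow.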

SLIDE 13
  • 50 fits (20 examples each)
SLIDE 14
  • Bias, Variance, Noise
SLIDE 15
  • Understanding Bias

Measures how well our approximation architecture can fit the data

Weak approximators (e.g. low degree polynomials) will have high bias

Strong approximators (e.g. high degree polynomials) will have lower bias

Bias² at x: { ȳ(x) – h(x) }²

SLIDE 16
  • Understanding Variance

No direct dependence on target values. For a fixed size of D:

Strong approximators tend to have more variance
… different datasets will lead to DIFFERENT predictors

Weak approximators tend to have less variance
… slightly different datasets may lead to SIMILAR predictors

Variance will typically disappear as |D| → ∞

Variance at x: ED[ { y(x; D) – ȳ(x) }² ]

SLIDE 17
  • Summary of Bias, Variance, Noise

Eerr = E[ (t* – y(x*))² ]
= E[ (y(x*) – ȳ(x*))² ] + (ȳ(x*) – h(x*))² + E[ (t* – h(x*))² ]
= Var( y(x*) ) + Bias( y(x*) )² + Noise

Expected prediction error = Variance + Bias² + Noise

SLIDE 18
  • Bias, Variance, and Noise

Bias: ȳ(x*) – h(x*)
the error of the average model ȳ(x*) [averaged over datasets]

Variance: ED[ ( yD(x*) – ȳ(x*) )² ]
How much yD(x*) varies from one training set D to another

Noise: E[ (t* – h(x*))² ] = E[ε²] = σ²
How much t* varies from h(x*), as t* = h(x*) + ε
Error even given a PERFECT model and ∞ data

SLIDE 19
  • 50 fits (20 examples each)
SLIDE 20
  • Predictions at x=2.0
SLIDE 21
  • 50 fits (20 examples each)
SLIDE 22
  • Predictions at x=5.0

[Figure: spread of predictions, with Bias and Variance marked relative to the true value]

SLIDE 23
  • Observed Responses at x=5.0

[Figure: spread of observed responses, illustrating Noise]

SLIDE 24
  • Model Selection: Bias-Variance

C1 “more expressive than” C2
iff everything representable in C2 is representable in C1: “C2 ⊂ C1”

Eg, LinearFns ⊂ QuadraticFns
0-HiddenLayerNNs ⊂ 1-HiddenLayerNNs

Can ALWAYS get a better fit using C1 over C2

But … sometimes better to look for y ∊ C2

SLIDE 25
  • Standard Plots…
SLIDE 26
  • Why?

C2 ⊂ C1
∀ y ∊ C2, ∃ y′ ∊ C1 that is at-least-as-good-as y

But given a limited sample, we might not find this best y′

Approach: consider Bias² + Variance!!

SLIDE 27
  • Bias-Variance Tradeoff – Intuition

Model too “simple”
  does not fit the data well
  … a biased solution

Model too complex
  small changes to the data change the predictor a lot
  … a high-variance solution

SLIDE 28
  • Bias-Variance Tradeoff

Choice of hypothesis class introduces learning bias

More complex class ⇒ less bias
More complex class ⇒ more variance

SLIDE 29
  • [Figure: error vs model complexity, with curves labeled ~Bias² and ~Variance]

SLIDE 30
  • Behavior of test-sample and training-sample error as a function of model complexity

Light blue curves show the training error err; light red curves show the conditional test error ErrT, for 100 training sets of size 50 each.

Solid curves = expected test error Err and expected training error E[err].

SLIDE 31
  • Empirical Study…

Based on different regularizers

SLIDE 32
  • Effect of Algorithm Parameters on Bias and Variance

k-nearest neighbor:

increasing k typically

increases bias and reduces variance

decision trees of depth D:

increasing D typically

increases variance and reduces bias

RBF SVM with parameter σ:

increasing σ typically

increases bias and reduces variance
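The k-nearest-neighbor row is easy to see directly. A sketch (scikit-learn is an assumption; the deck names no library, and the data is the running sinusoid): small k tracks the noise, large k smooths it away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (200, 1))
t = x.ravel() + 2 * np.sin(1.5 * x.ravel()) + rng.normal(0, 0.2, 200)

x_grid = np.linspace(0, 10, 5).reshape(-1, 1)
for k in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(x, t)
    print(f"k={k:2d}:", np.round(knn.predict(x_grid), 2))
# Small k: jagged, high-variance predictor; large k: smooth, higher-bias predictor.
```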

SLIDE 33
  • Least Squares Estimator

Truth: f(x) = xᵀβ
Observed: y = f(x) + ε, E[ε] = 0

Least squares estimator: f̂(x0) = x0ᵀβ̂, where β̂ = (XᵀX)⁻¹Xᵀy
(X is the design matrix whose rows x1, …, xk are the datapoints)

Unbiased: f(x0) = E[ f̂(x0) ]:
f(x0) – E[ f̂(x0) ]
= x0ᵀβ − E[ x0ᵀ(XᵀX)⁻¹Xᵀy ]
= x0ᵀβ − E[ x0ᵀ(XᵀX)⁻¹Xᵀ(Xβ + ε) ]
= x0ᵀβ − E[ x0ᵀβ + x0ᵀ(XᵀX)⁻¹Xᵀε ]
= x0ᵀβ − x0ᵀβ − x0ᵀ(XᵀX)⁻¹Xᵀ E[ε]
= 0
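A direct numerical check of the estimator (a minimal sketch; the data-generating β and dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
beta = np.array([2.0, -1.0, 0.5])             # true coefficients (hypothetical)

X = rng.normal(size=(N, p))                   # rows x_i are the datapoints
y = X @ beta + rng.normal(0, 0.2, N)          # y = f(x) + eps, E[eps] = 0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y, via a solve
x0 = rng.normal(size=p)
print("f_hat(x0) =", x0 @ beta_hat, "  f(x0) =", x0 @ beta)
```

Solving the normal equations with `np.linalg.solve` is the idiomatic route; explicitly inverting XᵀX is numerically worse.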

SLIDE 34
  • Gauss-Markov Theorem

Least squares estimator f̂(x0) = x0ᵀ(XᵀX)⁻¹Xᵀy
… is unbiased: f(x0) = E[ f̂(x0) ]
… is linear in y: f̂(x0) = c0ᵀy, where c0ᵀ = x0ᵀ(XᵀX)⁻¹Xᵀ

Gauss-Markov Theorem:
The least squares estimate has the minimum variance among all linear unbiased estimators.

BLUE: Best Linear Unbiased Estimator

Interpretation: Let g(x0) be any other estimator of f(x0) that is
unbiased … ie, E[ g(x0) ] = f(x0)
and linear in y … ie, g(x0) = cᵀy;
then Var[ f̂(x0) ] ≤ Var[ g(x0) ]

SLIDE 35
  • Variance of Least Squares Estimator

Least squares estimator: f̂(x0) = x0ᵀβ̂, β̂ = (XᵀX)⁻¹Xᵀy
Model: y = f(x) + ε, E[ε] = 0, var(ε) = σε²

Variance:
E[ ( f̂(x0) – E[ f̂(x0) ] )² ] = E[ ( f̂(x0) – f(x0) )² ]
= E[ ( x0ᵀ(XᵀX)⁻¹Xᵀy − x0ᵀβ )² ]
= E[ ( x0ᵀ(XᵀX)⁻¹Xᵀ(Xβ + ε) − x0ᵀβ )² ]
= E[ ( x0ᵀβ + x0ᵀ(XᵀX)⁻¹Xᵀε − x0ᵀβ )² ]
= E[ ( x0ᵀ(XᵀX)⁻¹Xᵀε )² ]
= σε² p/N, on average over the training inputs … in the “in-sample error” model …
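The σε² p/N rate can be verified by simulation (a sketch of the in-sample model: the design X is held fixed, only the noise is redrawn, and the prediction variance is averaged over the training inputs; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 50, 5, 0.5
beta = rng.normal(size=p)
X = rng.normal(size=(N, p))                      # fixed design ("in-sample" model)

preds = []
for _ in range(2000):                            # many noise draws
    y = X @ beta + rng.normal(0, sigma, N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(X @ beta_hat)                   # predictions at the training inputs

var_avg = np.array(preds).var(axis=0).mean()     # average variance over the x_i
print(f"simulated: {var_avg:.4f}   theory sigma^2 p/N: {sigma**2 * p / N:.4f}")
```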

SLIDE 36
  • Trading off Bias for Variance

What is the best estimator for the given linear additive model?

Least squares estimator f̂(x0) = x0ᵀβ̂, β̂ = (XᵀX)⁻¹Xᵀy
is BLUE: Best Linear Unbiased Estimator

Optimal variance among linear unbiased estimators. But variance is O( p / N ) …

So with FEWER features, smaller variance …
… albeit with some bias??

SLIDE 37
  • Feature Selection

LS solution can have large variance
variance ∝ p (#features)

Decrease p ⇒ decrease variance …
but increase bias

If it decreases test error, do it!
Feature selection

A small #features also means: easy to interpret

SLIDE 38
  • Statistical Significance Test

Y = β0 + Σj βj Xj.  Q: Which Xj are relevant?
A: Use statistical hypothesis testing!

Use simple model: Y = β0 + Σj βj Xj + ε, ε ~ N(0, σε²)

Here: β̂ ~ N( β, (XᵀX)⁻¹ σε² )

z-score: zj = β̂j / ( σ̂ √vj )
where vj is the jth diagonal element of (XᵀX)⁻¹
and σ̂² = 1/(N − p − 1) Σi=1..N ( yi − ŷi )²

  • Keep variable Xj if zj is large …
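A sketch computing the z-scores (the design, coefficients, and noise level are invented for illustration; intercept handling is omitted for brevity, but the slide's N − p − 1 degrees of freedom are kept):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 4
beta = np.array([1.5, 0.0, -2.0, 0.0])        # two irrelevant features (hypothetical)
X = rng.normal(size=(N, p))
y = X @ beta + rng.normal(0, 1.0, N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))   # sigma-hat from the slide
v = np.diag(XtX_inv)                               # v_j = [(X^T X)^{-1}]_{jj}
z = beta_hat / (sigma_hat * np.sqrt(v))            # z_j = beta_hat_j / (sigma-hat sqrt(v_j))
print(np.round(z, 2))                              # large |z_j|  =>  keep X_j
```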

SLIDE 39
  • Measuring Bias and Variance

In practice (unlike in theory), we have only ONE training set D

Simulate multiple training sets by bootstrap replicates:

D’ = { x | x is drawn at random with replacement from D }, with |D’| = |D|

SLIDE 40
  • Estimating Bias / Variance

SLIDE 41
  • Estimating Bias / Variance

  • Each Si is a bootstrap replicate
  • Ti = S \ Si
  • hi = hypothesis, based on Si

SLIDE 42
  • Average Response for each xi

For each point xr, average the predictions of the hypotheses hi for which xr is out-of-bag (xr ∈ Ti):

h̄(xr) = 1/kr Σ{i: xr ∈ Ti} hi(xr), where kr = #{ i : xr ∈ Ti }

SLIDE 43
  • Procedure for Measuring

Bias and Variance

Construct B bootstrap replicates of S

S1, …, SB

Apply learning alg to each replicate Sb

to obtain hypothesis hb

Let Tb = S \ Sb = data points not in Sb

(out of bag points)

Compute predicted value

hb(x) for each x ∈ Tb

SLIDE 44
  • Estimating Bias and Variance

For each x ∈ S,
observed response y
predictions y1, …, yk (from the hypotheses whose out-of-bag set contains x)

Compute average prediction: h̄(x) = avei { yi }
Estimate bias: h̄(x) – y
Estimate variance: Σ{i: x ∈ Ti} ( hi(x) – h̄(x) )² / (k − 1)

Assume noise is 0
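A sketch of the whole procedure from Slides 43-44 (the decision-tree learner from scikit-learn and the dataset are assumptions; noise is taken to be 0, as the slide states):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
S_x = rng.uniform(0, 10, (50, 1))
S_y = S_x.ravel() + 2 * np.sin(1.5 * S_x.ravel()) + rng.normal(0, 0.2, 50)

B, n = 200, len(S_x)
preds = [[] for _ in range(n)]               # out-of-bag predictions per point

for _ in range(B):
    idx = rng.integers(0, n, n)              # bootstrap replicate S_b (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)    # T_b = S \ S_b (out-of-bag points)
    h_b = DecisionTreeRegressor(max_depth=3).fit(S_x[idx], S_y[idx])
    for i, y_hat in zip(oob, h_b.predict(S_x[oob])):
        preds[i].append(y_hat)

for i in range(3):                           # first few points, for illustration
    y_i = np.array(preds[i])
    h_bar = y_i.mean()                       # average prediction h̄(x)
    bias = h_bar - S_y[i]                    # bias estimate (noise assumed 0)
    var = y_i.var(ddof=1)                    # sum (h_i - h̄)^2 / (k-1)
    print(f"x_{i}: bias = {bias:+.3f}  variance = {var:.3f}")
```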

SLIDE 45
  • Outline

Linear Regression
  MLE = Least Squares!
  Basis functions

Evaluating Predictors
  Training set error vs Test set error
  Cross Validation

Model Selection
  Bias-Variance analysis
  Regularization, Bayesian Model

SLIDE 46
  • Regularization

Idea: Penalize overly-complicated answers

Regular regression minimizes:
Σi ( y(x(i); w) − t(i) )²

Regularized regression minimizes:
Σi ( y(x(i); w) − t(i) )² + λ wᵀw

Note: May exclude constants from the norm

SLIDE 47
  • Regularization: Why?

For polynomials, extreme curves typically require extreme values

In general, encourages use of few features
… only features that lead to a substantial increase in performance

Problem: How to choose λ

SLIDE 48
  • Solving Regularized Form

Solving w* = argminw Σi [ ti − Σj wj xij ]²
gives w* = (XᵀX)⁻¹Xᵀt

Solving w* = argminw Σi [ ti − Σj wj xij ]² + λ Σj wj²
gives w* = (XᵀX + λI)⁻¹Xᵀt
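Both closed forms in one sketch (the data and λ value are arbitrary; note the regularized system XᵀX + λI is always invertible for λ > 0, which is a practical side benefit):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
t = X @ rng.normal(size=5) + rng.normal(0, 0.3, 30)
lam = 1.0                                              # the "magic constant" lambda

w_ls = np.linalg.solve(X.T @ X, X.T @ t)               # w* = (X^T X)^{-1} X^T t
I = np.eye(X.shape[1])
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ t)  # w* = (X^T X + lam I)^{-1} X^T t

print("||w_ls|| =", np.linalg.norm(w_ls), "  ||w_ridge|| =", np.linalg.norm(w_ridge))
```

The regularized solution has a smaller norm, as the penalty intends.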

SLIDE 49
  • Regularization: Empirical Approach

Problem: magic constant λ, trading off complexity vs. fit

Solution 1:
Generate multiple models. Use lots of test data to discover and discard bad models

Solution 2: k-fold cross validation:
Divide data S into k subsets { S1, …, Sk }
Create training set S−i = S − Si
Produces k training groups, each of size |S| · (k − 1)/k
For i = 1..k: Train on S−i, Test on Si
Combine results … mean? median? …
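Solution 2 in code (a sketch of k-fold cross-validation over a grid of λ values for the ridge solution above; the grid and the mean-combination rule are illustrative choices, not prescriptions from the slide):

```python
import numpy as np

def ridge_fit(X, t, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

def cv_error(X, t, lam, k=5):
    folds = np.array_split(np.random.default_rng(0).permutation(len(t)), k)
    errs = []
    for i in range(k):
        test = folds[i]                                  # S_i
        train = np.concatenate(folds[:i] + folds[i+1:])  # S_{-i} = S - S_i
        w = ridge_fit(X[train], t[train], lam)           # train on S_{-i}
        errs.append(np.mean((X[test] @ w - t[test]) ** 2))  # test on S_i
    return np.mean(errs)                                 # combine by the mean

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
t = X @ rng.normal(size=8) + rng.normal(0, 0.5, 60)

lambdas = [0.01, 0.1, 1.0, 10.0]
best = min(lambdas, key=lambda lam: cv_error(X, t, lam))
print("chosen lambda:", best)
```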

SLIDE 50
  • A Bayesian Perspective

Given a space of possible hypotheses H = {hj},
which hypothesis has the highest posterior?

P(h | D) = P(D | h) P(h) / P(D)

As P(D) does not depend on h:
argmax P(h|D) = argmax P(D|h) P(h)

“Uniform P(h)” ⇒ Maximum Likelihood Estimate
(model for which data has highest prob.)

… can use P(h) for regularization …

SLIDE 51
  • Bayesian Regression

Assume that, given x, noise is iid Gaussian

Homoscedastic noise model (same σ for each position)

SLIDE 52
  • Maximum Likelihood Solution

P(D | h) = P( t1, …, tm | y(x; w), σ² )
= ∏i (2πσ²)^(−1/2) exp( −( t(i) − y(x(i); w) )² / (2σ²) )

MLE fit for the mean is just the linear regression fit;
it does not depend upon σ²

SLIDE 53
  • Bayesian Learning of Gaussian Parameters

Conjugate priors:
Mean: Gaussian prior
Variance: Wishart Distribution

Prior for mean: P(µ | η, λ)  [Figure: Gaussian centered at η, width 2λ]

Remember this??

SLIDE 54
  • Bayesian Solution

Introduce a prior distribution over weights:
p(h) = p(w | λ) = N( 0, λ² I )

Posterior now becomes:
P(D | h) P(h) = P( t1, …, tm | y(x; w), σ² ) P(w)
= ∏i (2πσ²)^(−1/2) exp( −( t(i) − y(x(i); w) )² / (2σ²) ) · (2πλ²)^(−k/2) exp( −wᵀw / (2λ²) )

SLIDE 55
  • Regularized Regression vs Bayesian Regression

Regularized Regression minimizes:
Σi ( y(x(i); w) − t(i) )² + κ wᵀw

Bayesian Regression maximizes:
∏i (2πσ²)^(−1/2) exp( −( t(i) − y(x(i); w) )² / (2σ²) ) · (2πλ²)^(−k/2) exp( −wᵀw / (2λ²) )

These are identical (up to constants)
… take the log of the Bayesian regression criterion:
−Σi ( t(i) − y(x(i); w) )² / (2σ²) − wᵀw / (2λ²) + const
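To make “identical up to constants” concrete: maximizing the log-criterion is unchanged by multiplying through by −2σ², which turns it into the regularized objective (a short worked step; κ here is the regularization constant of the first criterion):

```latex
\arg\max_{w}\;\Bigl[-\tfrac{1}{2\sigma^2}\sum_i \bigl(t^{(i)}-y(x^{(i)};w)\bigr)^2
                    -\tfrac{1}{2\lambda^2}\,w^\top w\Bigr]
 \;=\; \arg\min_{w}\;\sum_i \bigl(t^{(i)}-y(x^{(i)};w)\bigr)^2 + \kappa\,w^\top w,
 \qquad \kappa = \sigma^2/\lambda^2
```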

SLIDE 56
  • Viewing L2 Regularization

Using a Lagrange multiplier …

w* = argminw Σi [ ti − Σj wj xij ]² + λ Σj wj²

is equivalent to

w* = argminw Σi [ ti − Σj wj xij ]²   s.t.   Σj wj² ≤ ω

SLIDE 57
  • Use L2 vs L1 Regularization

w* = argminw Σi [ ti − Σj wj xij ]² + λ Σj |wj|^q

is equivalent to

w* = argminw Σi [ ti − Σj wj xij ]²   s.t.   Σj |wj|^q ≤ ω

For q = 1, the intersections of the error contours with the constraint region are often on an axis! … so some wi = 0 !!

LASSO!
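A sketch of the sparsity effect (scikit-learn's Lasso and Ridge are assumptions, and their penalty scaling differs from the slide's λ by a constant factor; the mostly-zero true weight vector is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]                  # only 3 relevant features (hypothetical)
t = X @ w_true + rng.normal(0, 0.3, 100)

w_l2 = Ridge(alpha=1.0).fit(X, t).coef_       # L2: q = 2
w_l1 = Lasso(alpha=0.1).fit(X, t).coef_       # L1: q = 1
print("L2 exact zeros:", np.sum(w_l2 == 0), "  L1 exact zeros:", np.sum(w_l1 == 0))
# L1 drives many weights exactly to 0; L2 only shrinks them toward 0.
```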

SLIDE 58
  • What you need to know

Regression
  Optimizing sum squared error == MLE!
  Basis functions = features
  Relationship between regression and Gaussians

Evaluating Predictors
  Test-set error estimates prediction error
  Cross Validation

Bias-Variance trade-off
  Model complexity …

Regularization ≈ Bayesian modeling
  L1 regularization – prefers 0 weights!

Play with the Applet