Linear Regression, Regularization, Bias-Variance Tradeoff


  1. HTF: Ch 3, 7; B: Ch 3. Linear Regression, Regularization, Bias-Variance Tradeoff. Thanks to C. Guestrin, T. Dietterich, R. Parr, N. Ray

  2. Outline
     - Linear Regression
     - MLE = Least Squares!
     - Basis functions
     - Evaluating Predictors
     - Training set error vs test set error
     - Cross Validation
     - Model Selection
     - Bias-Variance analysis
     - Regularization, Bayesian Model

  3. What is the best choice of polynomial? Noisy source data

  4. Fits using degree 0, 1, 3, 9 polynomials

  5. Comparison
     - Degree 9 is the best match to the samples (over-fitting)
     - Degree 3 is the best match to the source
     - Performance on test data:

  6. What went wrong?
     - A bad choice of polynomial?
     - Not enough data? Yes

  7. Terms
     - x – input variable
     - x* – new input variable
     - h(x) – "truth": the underlying response function
     - t = h(x) + ε – actual observed response
     - y(x; D) – predicted response, based on a model learned from dataset D
     - ȳ(x) = E_D[ y(x; D) ] – expected response, averaged over (models based on) all datasets

  8. Bias-Variance Analysis in Regression
     - Observed value is t(x) = h(x) + ε
       - ε ~ N(0, σ²): normally distributed with mean 0 and variance σ²
       - Note: h(x) = E[ t(x) | x ]
     - Given training examples D = {(x_i, t_i)}, let y(.) = y(.; D) be the predicted function, based on a model learned using D
       - E.g., the linear model y_w(x) = w·x + w_0 with w = MLE(D)

  9. Example: 20 points, t = x + 2 sin(1.5x) + N(0, 0.2)
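To make the toy setup concrete, here is a minimal sketch (not from the slides; the 0-10 input range, the random seed, and treating 0.2 as the noise standard deviation are assumptions) that draws 20 points from t = x + 2 sin(1.5x) + noise and fits polynomials of degree 0, 1, 3, and 9 by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_h(x):
    # underlying response function h(x) from the slide
    return x + 2.0 * np.sin(1.5 * x)

def sample_dataset(n=20, x_max=10.0, noise_sd=0.2):
    """Draw n inputs uniformly and add Gaussian noise to h(x)."""
    x = rng.uniform(0.0, x_max, size=n)
    t = true_h(x) + rng.normal(0.0, noise_sd, size=n)
    return x, t

x, t = sample_dataset()
for degree in (0, 1, 3, 9):
    w = np.polyfit(x, t, degree)             # least-squares polynomial fit
    pred = np.polyval(w, x)
    print(degree, np.mean((pred - t) ** 2))  # training-set MSE per degree
```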

  10. Bias-Variance Analysis
     - Given a new data point x*:
       - predicted response: y(x*)
       - observed response: t* = h(x*) + ε
     - The expected prediction error is E[ (t* – y(x*))² ]

  11. Expected Loss
     [y(x) – t]² = [y(x) – h(x) + h(x) – t]²
                 = [y(x) – h(x)]² + 2 [y(x) – h(x)] [h(x) – t] + [h(x) – t]²
     The expected value of the cross term is 0, since h(x) = E[t | x].
     E_err = ∫∫ [y(x) – t]² p(x, t) dx dt
           = ∫ {y(x) − h(x)}² p(x) dx  +  ∫∫ {h(x) − t}² p(x, t) dx dt
     - First term: mismatch between OUR hypothesis y(·) and the target h(·) … we can influence this
     - Second term: noise in the distribution of the target … nothing we can do

  12. Relevant Part of Loss
     E_err = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt
     Really y(x) = y(x; D), fit to data D … so consider the expectation over datasets D.
     Let ȳ(x) = E_D[ y(x; D) ].
     E_D[ {h(x) – y(x; D)}² ] = E_D[ {h(x) – ȳ(x) + ȳ(x) – y(x; D)}² ]
       = E_D[ {h(x) – ȳ(x)}² ] + 2 E_D[ {h(x) – ȳ(x)} {ȳ(x) – y(x; D)} ] + E_D[ {y(x; D) – ȳ(x)}² ]
       = {h(x) – ȳ(x)}²  +  E_D[ {y(x; D) – ȳ(x)}² ]        (the cross term is 0)
       = Bias²  +  Variance

  13. 50 fits (20 examples each)

  14. Bias, Variance, Noise

  15. Understanding Bias: { ȳ(x) – h(x) }²
     - Measures how well our approximation architecture can fit the data
     - Weak approximators (e.g., low-degree polynomials) will have high bias
     - Strong approximators (e.g., high-degree polynomials) will have lower bias

  16. Understanding Variance: E_D[ { y(x; D) – ȳ(x) }² ]
     - No direct dependence on target values
     - For a fixed-size D:
       - Strong approximators tend to have more variance … different datasets lead to DIFFERENT predictors
       - Weak approximators tend to have less variance … slightly different datasets may lead to SIMILAR predictors
     - Variance will typically disappear as |D| → ∞

  17. Summary of Bias, Variance, Noise
     E_err = E[ (t* – y(x*))² ]
           = E[ (y(x*) – ȳ(x*))² ] + (ȳ(x*) – h(x*))² + E[ (t* – h(x*))² ]
           = Var( y(x*) ) + Bias( y(x*) )² + Noise
     Expected prediction error = Variance + Bias² + Noise

  18. Bias, Variance, and Noise
     - Bias: ȳ(x*) – h(x*)
       - the error of the averaged model ȳ(x*) [averaged over datasets]
     - Variance: E_D[ ( y_D(x*) – ȳ(x*) )² ]
       - how much y_D(x*) varies from one training set D to another
     - Noise: E[ (t* – h(x*))² ] = E[ ε² ] = σ²
       - how much t* = h(x*) + ε varies around h(x*)
       - error that remains even given the PERFECT model and ∞ data
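The three terms can be estimated numerically. Here is a sketch, assuming the same toy generator as above, degree-3 polynomial fits, and 50 training sets of 20 points each (matching the "50 fits" figures), evaluated at the query point x* = 5.0 used on later slides:

```python
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: x + 2.0 * np.sin(1.5 * x)
sigma, n_sets, n_pts, x_star = 0.2, 50, 20, 5.0

preds = []
for _ in range(n_sets):
    x = rng.uniform(0.0, 10.0, size=n_pts)
    t = h(x) + rng.normal(0.0, sigma, size=n_pts)
    w = np.polyfit(x, t, 3)                  # y(.; D) for this dataset D
    preds.append(np.polyval(w, x_star))
preds = np.array(preds)

y_bar = preds.mean()                         # ybar(x*) = E_D[ y(x*; D) ]
bias2 = (y_bar - h(x_star)) ** 2
variance = preds.var()
noise = sigma ** 2                           # E[ eps^2 ]
print(bias2, variance, noise)                # expected error ~ Bias² + Variance + Noise
```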

  19. 50 fits (20 examples each)

  20. Predictions at x = 2.0

  21. 50 fits (20 examples each)

  22. Predictions at x = 5.0 [figure labels: variance, bias, true value]

  23. Observed responses at x = 5.0 [figure label: noise]

  24. Model Selection: Bias-Variance
     - C1 is "more expressive than" C2 iff anything representable in C2 is representable in C1 ("C2 ⊂ C1")
       - E.g., LinearFns ⊂ QuadraticFns; 0-HiddenLayerNNs ⊂ 1-HiddenLayerNNs
     - Can ALWAYS get a better fit using C1 than using C2
     - But … sometimes it is better to look for y ∈ C2

  25. Standard Plots…

  26. Why?
     - C2 ⊂ C1 ⟹ for every y ∈ C2 there is a y′ ∈ C1 that is at least as good as y
     - But given a limited sample, we might not find this best y′
     - Approach: consider Bias² + Variance!

  27. Bias-Variance Tradeoff – Intuition
     - Model too "simple": does not fit the data well
       - a biased solution
     - Model too complex: small changes to the data change the predictor a lot
       - a high-variance solution

  28. Bias-Variance Tradeoff
     - The choice of hypothesis class introduces a learning bias
     - More complex class ⟹ less bias
     - More complex class ⟹ more variance

  29. [Figure: error curves labelled ~Bias² and ~Variance]

  30. Behavior of test-sample and training-sample error as a function of model complexity
     - light blue curves: training error err
     - light red curves: conditional test error Err_T, for 100 training sets of size 50 each
     - solid curves: expected test error Err and expected training error E[err]
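A rough way to reproduce such curves: fit on a training sample of size 50, measure error on a large held-out sample, and sweep model complexity. The sketch below uses polynomial degree as the complexity axis and the earlier toy generator; both choices are assumptions, not the setup behind the original figure:

```python
import numpy as np

rng = np.random.default_rng(2)
h = lambda x: x + 2.0 * np.sin(1.5 * x)

def make_set(n):
    x = rng.uniform(0.0, 10.0, size=n)
    return x, h(x) + rng.normal(0.0, 0.2, size=n)

x_tr, t_tr = make_set(50)        # training sample of size 50, as in the slide
x_te, t_te = make_set(1000)      # large test sample approximates the expected test error
for degree in range(10):
    w = np.polyfit(x_tr, t_tr, degree)
    err_train = np.mean((np.polyval(w, x_tr) - t_tr) ** 2)
    err_test = np.mean((np.polyval(w, x_te) - t_te) ** 2)
    print(degree, err_train, err_test)   # training error keeps falling; test error is U-shaped
```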

  31. Empirical Study… based on different regularizers

  32. Effect of Algorithm Parameters on Bias and Variance
     - k-nearest neighbor: increasing k typically increases bias and reduces variance (see the sketch below)
     - decision trees of depth D: increasing D typically increases variance and reduces bias
     - RBF SVM with parameter σ: increasing σ typically increases bias and reduces variance
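For the k-nearest-neighbor row, the effect is easy to see empirically. The sketch below (pure NumPy; the data generator, sample sizes, and the knn_predict helper are all assumptions for illustration) estimates bias² and variance of a 1-D k-NN regressor at x* = 5.0 for several values of k:

```python
import numpy as np

rng = np.random.default_rng(3)
h = lambda x: x + 2.0 * np.sin(1.5 * x)
x_star, sigma = 5.0, 0.2

def knn_predict(x_train, t_train, x0, k):
    """Average the targets of the k nearest training inputs to x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return t_train[idx].mean()

for k in (1, 5, 15):
    preds = []
    for _ in range(200):                      # 200 independent training sets
        x = rng.uniform(0.0, 10.0, size=40)
        t = h(x) + rng.normal(0.0, sigma, size=40)
        preds.append(knn_predict(x, t, x_star, k))
    preds = np.array(preds)
    bias2 = (preds.mean() - h(x_star)) ** 2
    print(k, bias2, preds.var())              # bias² tends to rise, variance to fall, as k grows
```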

  33. Least Squares Estimator
     Data points x_1, …, x_k are the rows of the design matrix X.
     Truth: f(x) = xᵀβ.  Observed: y = f(x) + ε, with E[ε] = 0.
     Least squares estimator: f̂(x_0) = x_0ᵀ β̂, where β̂ = (XᵀX)⁻¹ Xᵀ y.
     Unbiased: f(x_0) = E[ f̂(x_0) ]:
       f(x_0) – E[ f̂(x_0) ] = x_0ᵀβ − E[ x_0ᵀ (XᵀX)⁻¹ Xᵀ y ]
                             = x_0ᵀβ − E[ x_0ᵀ (XᵀX)⁻¹ Xᵀ (Xβ + ε) ]
                             = x_0ᵀβ − E[ x_0ᵀβ + x_0ᵀ (XᵀX)⁻¹ Xᵀ ε ]
                             = x_0ᵀβ − x_0ᵀβ − x_0ᵀ (XᵀX)⁻¹ Xᵀ E[ε] = 0
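The unbiasedness argument can be checked by simulation. A sketch, with an assumed design matrix, coefficient vector β, and query point x_0: averaging f̂(x_0) over many noise draws should recover x_0ᵀβ.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 3
X = rng.normal(size=(N, p))                  # fixed design matrix
beta = np.array([1.0, -2.0, 0.5])
x0 = np.array([0.3, 1.0, -0.7])

estimates = []
for _ in range(5000):
    eps = rng.normal(0.0, 1.0, size=N)       # E[eps] = 0
    y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (XᵀX)⁻¹ Xᵀ y
    estimates.append(x0 @ beta_hat)

print(np.mean(estimates), x0 @ beta)         # the two numbers should agree closely
```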

  34. Gauss-Markov Theorem
     - The least squares estimator f̂(x_0) = x_0ᵀ (XᵀX)⁻¹ Xᵀ y
       - … is unbiased: f(x_0) = E[ f̂(x_0) ]
       - … is linear in y: f̂(x_0) = c_0ᵀ y, where c_0ᵀ = x_0ᵀ (XᵀX)⁻¹ Xᵀ
     - Gauss-Markov Theorem: the least squares estimate has the minimum variance among all linear unbiased estimators
       - BLUE: Best Linear Unbiased Estimator
     - Interpretation: let g(x_0) be any other estimator that is
       - unbiased: E[ g(x_0) ] = f(x_0), and
       - linear in y: g(x_0) = cᵀ y,
       then Var[ f̂(x_0) ] ≤ Var[ g(x_0) ]

  35. Variance of Least Squares Estimator
     y = f(x) + ε, with E[ε] = 0 and var(ε) = σ_ε².
     Least squares estimator: f̂(x_0) = x_0ᵀ β̂, β̂ = (XᵀX)⁻¹ Xᵀ y.
     Variance:
       E[ ( f̂(x_0) – E[ f̂(x_0) ] )² ] = E[ ( f̂(x_0) – f(x_0) )² ]
         = E[ ( x_0ᵀ (XᵀX)⁻¹ Xᵀ y − x_0ᵀβ )² ]
         = E[ ( x_0ᵀ (XᵀX)⁻¹ Xᵀ (Xβ + ε) − x_0ᵀβ )² ]
         = E[ ( x_0ᵀβ + x_0ᵀ (XᵀX)⁻¹ Xᵀ ε − x_0ᵀβ )² ]
         = E[ ( x_0ᵀ (XᵀX)⁻¹ Xᵀ ε )² ]
         = σ_ε² p / N   … in the "in-sample error" model (averaged over the training inputs x_0 = x_i)
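A quick numerical check of the σ_ε² p/N result, under assumed values of N, p, and σ_ε: compute the in-sample predictions over many noise draws and average their variances across the training points.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, sigma = 100, 5, 1.0
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)

preds = []
for _ in range(2000):
    y = X @ beta + rng.normal(0.0, sigma, size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(X @ beta_hat)                # in-sample predictions fhat(x_i)
preds = np.array(preds)                       # shape (2000, N)

avg_var = preds.var(axis=0).mean()            # variance per training point, averaged over points
print(avg_var, sigma**2 * p / N)              # the two numbers should be close
```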

  36. Trading off Bias for Variance
     - What is the best estimator for the given linear additive model?
     - The least squares estimator f̂(x_0) = x_0ᵀ β̂, β̂ = (XᵀX)⁻¹ Xᵀ y, is BLUE: the Best Linear Unbiased Estimator
       - optimal variance among unbiased estimators
     - But the variance is O(p/N) …
     - So with FEWER features, smaller variance … albeit with some bias??

  37. Feature Selection
     - The LS solution can have large variance
       - variance ∝ p (#features)
     - Decreasing p decreases variance … but increases bias
       - if it decreases test error, do it!  ⟹ feature selection
     - A small #features also means the model is easier to interpret

  38. Statistical Significance Test
     Model: Y = β_0 + Σ_j β_j X_j
     Q: Which X_j are relevant?  A: Use statistical hypothesis testing!
     Use the simple model Y = β_0 + Σ_j β_j X_j + ε, with ε ~ N(0, σ_ε²).
     Here β̂ ~ N( β, (XᵀX)⁻¹ σ_ε² ).
     Use  z_j = β̂_j / ( σ̂ √v_j ),  where
       σ̂² = (1 / (N − p − 1)) Σ_{i=1..N} ( y_i − ŷ_i )²
       v_j is the j-th diagonal element of (XᵀX)⁻¹
     Keep variable X_j if z_j is large.
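Here is a sketch of this test on synthetic data (the feature dimensions, the true coefficients, and all variable names are assumptions); it computes z_j for each coefficient, so variables with small |z_j| become candidates to drop:

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 200, 4
X1 = rng.normal(size=(N, p))                 # candidate features
X = np.hstack([np.ones((N, 1)), X1])         # prepend the intercept column
beta_true = np.array([0.5, 2.0, 0.0, -1.0, 0.0])   # two features are irrelevant
y = X @ beta_true + rng.normal(0.0, 1.0, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat2 = resid @ resid / (N - p - 1)     # estimate of the noise variance
z = beta_hat / np.sqrt(sigma_hat2 * np.diag(XtX_inv))
print(z)                                     # keep X_j whose |z_j| is large (e.g. > 2)
```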
