SLIDE 1

Regression and the Bias-Variance Decomposition

William Cohen 10-601 April 2008

Readings: Bishop 3.1,3.2

SLIDE 2

Regression

  • Technically: learning a function f(x)=y where y is real-valued, rather than discrete.
    – Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
    – Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
    – …

SLIDE 3

Example: univariate linear regression

  • Example: predict age from number of publications

[Scatterplot: Age in Years (y-axis) vs. Number of Publications (x-axis)]

SLIDE 4

Linear regression

  • Model: yi = axi + b + εi where εi ~ N(0,σ)
  • Training Data: (x1,y1),…,(xn,yn)
  • Goal: estimate a,b with w=(a,b)

$$
\hat{\mathbf{w}}
= \arg\max_{\mathbf{w}} \Pr(\mathbf{w}\mid D)
= \arg\max_{\mathbf{w}} \Pr(D\mid \mathbf{w})\,\Pr(\mathbf{w})
= \arg\max_{\mathbf{w}} \log \Pr(D\mid \mathbf{w})
\quad\text{(assume MLE, i.e. a uniform prior)}
$$
$$
= \arg\max_{\mathbf{w}} \sum_i \log \Pr(y_i\mid x_i,\mathbf{w})
= \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2,
\qquad
\hat{\epsilon}_i(\mathbf{w}) = y_i - (\hat{a}\,x_i + \hat{b})
$$

SLIDE 5

Linear regression

  • Model: yi = axi + b + εi where εi ~ N(0,σ)
  • Training Data: (x1,y1),…,(xn,yn)
  • Goal: estimate a,b with w=(a,b)
  • Ways to estimate parameters

– Find derivative wrt parameters a,b
– Set to zero and solve

  • Or use gradient ascent to solve
  • Or ….

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \hat{\epsilon}_i^{\,2}
$$
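
A minimal sketch of the gradient-based alternative mentioned above: plain gradient descent on the squared error. The synthetic data, learning rate, and iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

# Synthetic data for illustration (assumed): y = 2x + 1 + Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=100)

# Gradient descent on sum_i (y_i - (a*x_i + b))^2
a, b = 0.0, 0.0
lr = 0.01                               # learning rate (arbitrary choice)
for _ in range(10000):
    resid = y - (a * x + b)             # epsilon_hat_i
    grad_a = -2.0 * np.sum(resid * x)   # d/da of the squared error
    grad_b = -2.0 * np.sum(resid)       # d/db of the squared error
    a -= lr * grad_a / len(x)           # use the mean gradient for stability
    b -= lr * grad_b / len(x)

print(a, b)  # should approach the least-squares fit (roughly a≈2, b≈1)
```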

SLIDE 6

Linear regression

[Figure: scatterplot of points (x1,y1), (x2,y2), … with a fitted line; d1, d2, d3 are the vertical residuals]

How to estimate the slope?

$$
\hat{a}
= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
= \frac{n\cdot\mathrm{cov}(X,Y)}{n\cdot\mathrm{var}(X)},
\qquad
\bar{x}=\frac{1}{n}\sum_i x_i,\quad \bar{y}=\frac{1}{n}\sum_i y_i
$$

SLIDE 7

Linear regression

How to estimate the intercept?

The fitted line passes through the mean point (x̄, ȳ), i.e. ȳ = â·x̄ + b̂, so

$$
\hat{b} = \bar{y} - \hat{a}\,\bar{x}
$$

[Figure: same scatterplot, fitted line, and residuals d1, d2, d3 as on the previous slide]
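
A small sketch of the closed-form estimates above (slope from cov/var, intercept from the means). The synthetic publications/age data is an assumption for illustration.

```python
import numpy as np

# Illustrative data (assumed): publications vs. age
rng = np.random.default_rng(1)
x = rng.uniform(5, 50, size=200)             # number of publications
y = 0.14 * x + 26 + rng.normal(0, 3, 200)    # age, with Gaussian noise

x_bar, y_bar = x.mean(), y.mean()
a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # cov(X,Y)/var(X)
b_hat = y_bar - a_hat * x_bar                                         # line goes through (x_bar, y_bar)

print(f"slope={a_hat:.3f}, intercept={b_hat:.3f}")
```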

SLIDE 8

Bias/Variance Decomposition of Error

SLIDE 9

  • Return to the simple regression problem f: X → Y

      y = f(x) + ε        (f is deterministic; ε is noise, ε ~ N(0,σ))

What is the expected error for a learned h?

Bias – Variance decomposition of error

SLIDE 10

Bias – Variance decomposition of error

$$
E_{D}\!\left[\int\!\!\int \big(f(x) + \epsilon - h_D(x)\big)^2 \,\Pr(\epsilon)\,\Pr(x)\; d\epsilon\, dx\right]
$$

(f is the true function, D the training dataset, h_D the hypothesis learned from D)

Experiment (the error of which I'd like to predict):

  • 1. Draw size n sample D=(x1,y1),…,(xn,yn)
  • 2. Train linear function hD using D
  • 3. Draw a test example (x,f(x)+ε)
  • 4. Measure squared error of hD on that example
SLIDE 11

Bias – Variance decomposition of error (2)

$$
E_{D,\epsilon}\!\left[\big(f(x) + \epsilon - h_D(x)\big)^2\right]
$$

(f is the true function, D the training dataset, h_D the hypothesis learned from D)

Fix x, then do this experiment:

  • 1. Draw size n sample D=(x1,y1),…,(xn,yn)
  • 2. Train linear function hD using D
  • 3. Draw the test example (x,f(x)+ε)
  • 4. Measure squared error of hD on that example
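
A minimal simulation of this experiment. The true function f, noise level σ, sample size n, and test point x are assumed choices, since the slide leaves them abstract.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):                        # assumed "true" function
    return 1.5 * x + 3.0

sigma, n, x_test = 2.0, 30, 5.0  # assumed noise level, sample size, fixed test point

def one_run():
    # 1. Draw a size-n sample D
    xs = rng.uniform(0, 10, n)
    ys = f(xs) + rng.normal(0, sigma, n)
    # 2. Train a linear function h_D on D (closed-form least squares)
    a = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b = ys.mean() - a * xs.mean()
    # 3. Draw the test example (x, f(x) + eps)
    t = f(x_test) + rng.normal(0, sigma)
    # 4. Measure squared error of h_D on that example
    return (t - (a * x_test + b)) ** 2

errors = [one_run() for _ in range(10000)]
print("estimated expected squared error at x:", np.mean(errors))
```
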
SLIDE 12

Bias – Variance decomposition of error

Notation: let t = f(x) + ε be the observed target value and ŷ = h_D(x) the learned prediction, so the quantity above is E[(t − ŷ)²].

Expand by adding and subtracting f:

$$
E\big[(t-\hat{y})^2\big]
= E\Big[\big((t-f)+(f-\hat{y})\big)^2\Big]
= E\big[(t-f)^2\big] + 2\,E\big[(t-f)(f-\hat{y})\big] + E\big[(f-\hat{y})^2\big]
$$

(Careful with the expectations: ŷ is really ŷ_D, i.e. it depends on the random training set D, so the expectation is over both D and ε.)

SLIDE 13

Bias – Variance decomposition of error

The middle (cross) term vanishes: t − f = ε has mean zero and is independent of ŷ, so E[(t−f)(f−ŷ)] = 0. That leaves

$$
E_{D,\epsilon}\big[(t-\hat{y})^2\big]
= \underbrace{E\big[(t-f)^2\big]}_{\text{intrinsic noise}}
\;+\;
\underbrace{E\big[(f-\hat{y})^2\big]}_{\text{depends on how well the learner approximates } f}
$$

SLIDE 14

Bias – Variance decomposition of error

Now decompose the second term, using the "long-term" average hypothesis h̄(x) ≡ E_D[h_D(x)] and ŷ = ŷ_D = h_D(x):

$$
E_D\big[(f-\hat{y})^2\big]
= E_D\Big[\big((f-\bar{h})+(\bar{h}-\hat{y})\big)^2\Big]
= \underbrace{(f-\bar{h})^2}_{\text{BIAS}^2}
\;+\;
\underbrace{E_D\big[(\bar{h}-\hat{y})^2\big]}_{\text{VARIANCE}}
$$

(the cross term again vanishes, since E_D[h̄ − ŷ] = 0)

  • BIAS²: squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, E_D[h_D(x)].
  • VARIANCE: squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (that is, ŷ).
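
A sketch that estimates the three terms by simulation for a univariate linear learner at a fixed x. The true function (deliberately nonlinear, so the linear learner has bias), noise level, and sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                                # assumed true function
    return np.sin(x)                     # nonlinear, so a linear h_D is biased

sigma, n, x_test, trials = 0.3, 20, 2.0, 20000

preds = np.empty(trials)
for i in range(trials):
    xs = rng.uniform(0, 3, n)                        # draw a dataset D
    ys = f(xs) + rng.normal(0, sigma, n)
    a = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b = ys.mean() - a * xs.mean()
    preds[i] = a * x_test + b                        # h_D(x)

h_bar = preds.mean()                                 # E_D[h_D(x)]
noise = sigma ** 2                                   # E[(t - f)^2]
bias2 = (f(x_test) - h_bar) ** 2                     # (f - h_bar)^2
variance = preds.var()                               # E_D[(h_bar - h_D(x))^2]
print(f"noise={noise:.4f}  bias^2={bias2:.4f}  variance={variance:.4f}")
print("sum =", noise + bias2 + variance)             # ≈ expected squared error at x_test
```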

SLIDE 15

Bias-variance decomposition

How can you reduce the bias of a learner?
  – Make the long-term average prediction a better approximation to the true function f(x).

How can you reduce the variance of a learner?
  – Make the learner less sensitive to variations in the data.

SLIDE 16

A generalization of bias-variance decomposition to other loss functions

  • "Arbitrary" real-valued loss L(t,y), with L(y,y′)=L(y′,y), L(y,y)=0, and L(y,y′)≠0 if y≠y′
  • Define the "optimal prediction":
      y* = argmin_y′ E_t[ L(t,y′) ]
  • Define the "main prediction of the learner" (y below denotes the learner's prediction, which depends on the training set D):
      y_m = y_m,D = argmin_y′ E_D[ L(y,y′) ]
  • Define the "bias of the learner":
      B(x) = L(y*, y_m)
  • Define the "variance of the learner":
      V(x) = E_D[ L(y_m, y) ]
  • Define the "noise for x":
      N(x) = E_t[ L(t, y*) ]

Claim: E_D,t[ L(t,y) ] = c1·N(x) + B(x) + c2·V(x), where c1 = 2·Pr_D[y=y*] − 1 and c2 = 1 if y_m = y*, −1 otherwise.

m=n=|D|
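
For intuition, the squared-loss special case (a standard fact from the Domingos paper in the readings, not spelled out on this slide) recovers the decomposition derived earlier, with c1 = c2 = 1:

$$
L(t,y)=(t-y)^2:\qquad y^* = E_t[t],\qquad y_m = E_D[y],
$$
$$
E_{D,t}\big[(t-y)^2\big]
= \underbrace{E_t\big[(t-y^*)^2\big]}_{N(x)}
+ \underbrace{(y^*-y_m)^2}_{B(x)}
+ \underbrace{E_D\big[(y_m-y)^2\big]}_{V(x)}
$$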

SLIDE 17

Other regression methods

SLIDE 18

Example: univariate linear regression

  • Example: predict age from number of publications

[Scatterplot: Age in Years (y-axis) vs. Number of Publications (x-axis), with fitted regression line]

Paul Erdős Hungarian mathematician, 1913-1996

$$
\hat{y} = \tfrac{1}{7}\,x + 26
$$

For Erdős, x ≈ 1500 publications, so the predicted age is about 1500/7 + 26 ≈ 240.

SLIDE 19

Linear regression

[Figure: scatterplot with fitted line and residuals d1, d2, d3, as before]

Summary:

$$
\hat{a} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2},
\qquad
\hat{b} = \bar{y} - \hat{a}\,\bar{x}
$$

To simplify:

  • assume zero-centered data, as we did for PCA
  • let x = (x1,…,xn) and y = (y1,…,yn)
  • then…

$$
\hat{a} = (\mathbf{x}^T\mathbf{x})^{-1}\,\mathbf{x}^T\mathbf{y},
\qquad
\hat{b} = 0
$$

SLIDE 20

Onward: multivariate linear regression

Univariate (zero-centered, from the previous slide):

$$
\hat{w} = (\mathbf{x}^T\mathbf{x})^{-1}\,\mathbf{x}^T\mathbf{y},
\qquad
\mathbf{x}=(x_1,\dots,x_n),\quad \mathbf{y}=(y_1,\dots,y_n)
$$

Multivariate:

$$
\hat{y} = \hat{w}_1 x_1 + \dots + \hat{w}_k x_k = \mathbf{x}\,\hat{\mathbf{w}},
\qquad
\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
$$

where (row = example, column = feature)

$$
X = \begin{pmatrix} x_{1,1} & \dots & x_{1,k} \\ \vdots & & \vdots \\ x_{n,1} & \dots & x_{n,k} \end{pmatrix},
\qquad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
$$

As before, this is the least-squares solution:

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2,
\qquad
\hat{\epsilon}_i(\mathbf{w}) = y_i - \mathbf{w}^T \mathbf{x}_i
$$
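
A minimal NumPy sketch of the closed form above. The random data and "true" weights are assumptions; np.linalg.solve / np.linalg.lstsq are used instead of forming the inverse explicitly, which gives the same estimate but is numerically safer.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = rng.normal(size=(n, k))                  # rows = examples, columns = features
w_true = np.array([2.0, -1.0, 0.5])          # assumed "true" weights for the demo
y = X @ w_true + rng.normal(0, 0.1, n)       # targets with Gaussian noise

# Closed form: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, more numerically stable:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat, w_lstsq)   # both ≈ w_true
```

The same call also covers the multiple-output case on a later slide: if y is an n×k′ matrix of outputs, the solve returns a weight matrix with one column per output.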

SLIDE 21

Onward: multivariate linear regression

(X and y as on the previous slide.)

$$
\hat{y} = \hat{w}_1 x_1 + \dots + \hat{w}_k x_k = \mathbf{x}\,\hat{\mathbf{w}},
\qquad
\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
$$

Regularized:

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w},
\qquad
\hat{\epsilon}_i(\mathbf{w}) = y_i - \mathbf{w}^T\mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

SLIDE 22

Onward: multivariate linear regression

Multivariate, multiple outputs (X as before; Y stacks the outputs, one row per example, one column per output):

$$
\hat{W} = (X^T X)^{-1} X^T Y,
\qquad
Y = \begin{pmatrix} y_{1,1} & \dots & y_{1,k'} \\ \vdots & & \vdots \\ y_{n,1} & \dots & y_{n,k'} \end{pmatrix}
$$

Each column of $\hat{W}$ holds the weights for one output; the prediction for an example $\mathbf{x}$ is $\hat{\mathbf{y}} = \mathbf{x}\,\hat{W}$.

SLIDE 23

Onward: multivariate linear regression

(Same setup as on the previous slides.)

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[\hat{\epsilon}_i(\mathbf{w})\big]^2 + \lambda\,\mathbf{w}^T\mathbf{w},
\qquad
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

What does increasing λ do?
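
A sketch of the regularized estimate from the previous slides; the data and the λ values are arbitrary illustrative choices. Increasing λ shrinks the weights toward zero, which in bias-variance terms lowers variance at the cost of some added bias.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """w_hat = (X^T X + lam*I)^{-1} X^T y  (regularized least squares)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Assumed demo data: watch the weights shrink as lam grows
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 50)
for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_fit(X, y, lam))
```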

SLIDE 24

Onward: multivariate linear regression

Add a constant feature to carry the intercept:

$$
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},
\qquad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\qquad
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

w=(w1,w2) What does fixing w2=0 do (if λ=0)?
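
A small sketch of the constant-feature trick, under the column ordering shown above (constant feature first, so w1 plays the role of the intercept and w2 is the slope); the data is assumed. With λ=0, fixing w2=0 leaves only the constant feature, so the least-squares fit reduces to the mean of y.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 5.0 + rng.normal(0, 1, 100)    # assumed data with a nonzero intercept

# Constant feature first, as on the slide: each row of X is (1, x_i)
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.lstsq(X, y, rcond=None)[0]
print("w1 (intercept), w2 (slope):", w)       # ≈ (5, 2)

# Fixing w2 = 0 (and lambda = 0) keeps only the constant feature,
# so the least-squares fit is just the mean of y.
w1_only = np.linalg.lstsq(X[:, :1], y, rcond=None)[0]
print("w1 with w2 fixed to 0:", w1_only, "vs. mean(y):", y.mean())
```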

SLIDE 25

Regression trees - summary

  • Growing tree:

– Split to optimize information gain

  • At each leaf node

– Predict the majority class

  • Pruning tree:

– Prune to reduce error on holdout

  • Prediction:

– Trace path to a leaf and predict associated majority class

Modifications for regression [Quinlan's M5]:
  – At each leaf, build a linear model, then greedily remove features
  – Prune using estimated error on the training data, with estimates adjusted by (n+k)/(n−k), where n = #cases and k = #features
  – For prediction, use a linear interpolation of every prediction made by every node on the path
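
A much-simplified regression-tree sketch, not Quinlan's M5: splits greedily minimize squared error, leaves predict the node mean, and there is no pruning or path smoothing. It is only meant to make the growing and prediction steps concrete; the data and hyperparameters are assumed.

```python
import numpy as np

class Node:
    def __init__(self, value, feature=None, threshold=None, left=None, right=None):
        self.value = value          # mean of y in this node (used if leaf)
        self.feature = feature      # split feature index (None for a leaf)
        self.threshold = threshold
        self.left, self.right = left, right

def build(X, y, min_leaf=5, depth=0, max_depth=4):
    """Greedy regression tree: pick the split that most reduces squared error."""
    node = Node(value=y.mean())
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return node
    best_err = ((y - y.mean()) ** 2).sum()
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            if mask.sum() < min_leaf or (~mask).sum() < min_leaf:
                continue
            err = ((y[mask] - y[mask].mean()) ** 2).sum() + \
                  ((y[~mask] - y[~mask].mean()) ** 2).sum()
            if err < best_err:
                best_err, node.feature, node.threshold = err, j, t
    if node.feature is None:        # no useful split found: stay a leaf
        return node
    mask = X[:, node.feature] <= node.threshold
    node.left = build(X[mask], y[mask], min_leaf, depth + 1, max_depth)
    node.right = build(X[~mask], y[~mask], min_leaf, depth + 1, max_depth)
    return node

def predict(node, x):
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Tiny demo on assumed data
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)
tree = build(X, y)
print(predict(tree, np.array([2.0])), np.sin(2.0))
```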

SLIDE 26

Regression trees – example 1

SLIDE 27

Regression trees – example 2

What does pruning do to bias and variance?

SLIDE 28

Kernel regression

  • aka locally weighted regression, locally linear regression, LOESS, …

SLIDE 29

Kernel regression

  • aka locally weighted regression, locally linear regression, …
  • Close approximation to kernel regression:
    – Pick a few values z1,…,zk up front
    – Preprocess: for each example (x,y), replace x with x′ = ⟨K(x,z1),…,K(x,zk)⟩, where K(x,z) = exp( −(x−z)² / 2σ² )
    – Use multivariate regression on the (x′,y) pairs
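
A sketch of this approximation (the centers z, kernel width σ, and data are assumed): map each x to RBF features against a few fixed centers, then reuse ordinary least squares on the transformed pairs.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)       # assumed nonlinear target

centers = np.linspace(0, 10, 8)                # the z_1,...,z_k picked up front
sigma = 1.0                                    # kernel width (assumed)

def rbf_features(x):
    # x' = <K(x,z_1),...,K(x,z_k)>, with K(x,z) = exp(-(x-z)^2 / (2 sigma^2))
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

Phi = rbf_features(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # multivariate regression on (x', y)

x_new = np.array([2.5])
print(rbf_features(x_new) @ w, np.sin(2.5))    # prediction vs. true value
```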

SLIDE 30

Kernel regression

  • aka locally weighted regression, locally linear regression, LOESS, …

What does making the kernel wider do to bias and variance?

SLIDE 31

Additional readings

  • P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231–238), 2000. Stanford, CA: Morgan Kaufmann.
  • J. R. Quinlan, Learning with Continuous Classes, 5th Australian Joint Conference on Artificial Intelligence, 1992.
  • Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes, 9th European Conference on Machine Learning, 1997.
  • D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models, JAIR, 1996.