
Regression and the Bias-Variance Decomposition - William Cohen (PowerPoint presentation transcript)



  1. Regression and the Bias-Variance Decomposition
     William Cohen, 10-601, April 2008. Readings: Bishop 3.1, 3.2

  2. Regression
     • Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
       – Replace livesInSquirrelHill(x1, x2, …, xn) with averageCommuteDistanceInMiles(x1, x2, …, xn)
       – Replace userLikesMovie(u, m) with usersRatingForMovie(u, m)
       – …

  3. Example: univariate linear regression
     • Example: predict age from number of publications
     [Scatter plot: Age in Years (0–50) vs. Number of Publications (0–160)]

  4. Linear regression
     • Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
     • Training data: $(x_1, y_1), \ldots, (x_n, y_n)$
     • Goal: estimate $a, b$ with $\hat{w} = (\hat{a}, \hat{b})$
     $$\hat{w} = \arg\max_w \Pr(w \mid D) = \arg\max_w \Pr(D \mid w)\,\Pr(w)$$
     Assuming MLE (a flat prior on $w$):
     $$\hat{w} = \arg\max_w \log \Pr(D \mid w)
              = \arg\max_w \sum_i \log \Pr(y_i \mid x_i, w)
              = \arg\min_w \sum_i [\epsilon_i(w)]^2,
       \qquad \epsilon_i(w) = y_i - (a x_i + b)$$
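The last line says that, under Gaussian noise, maximizing the likelihood is the same as minimizing squared error. A minimal numpy sketch of that equivalence; all data, names, and the value of sigma below are made-up illustrations, not from the slides:

    import numpy as np

    # Illustrative data for the model y = a*x + b + noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 150, size=40)            # e.g., number of publications
    y = 0.15 * x + 25 + rng.normal(0, 3, 40)    # e.g., age in years, sigma = 3

    def neg_log_likelihood(a, b, sigma=3.0):
        # Gaussian negative log-likelihood of the data under slope a, intercept b
        resid = y - (a * x + b)
        return 0.5 * np.sum(resid ** 2) / sigma ** 2 + len(x) * np.log(sigma * np.sqrt(2 * np.pi))

    def sse(a, b):
        # Sum of squared residuals, i.e. sum_i epsilon_i(w)^2
        return np.sum((y - (a * x + b)) ** 2)

    # The two objectives differ only by an additive constant and a positive scale
    # factor, so they are minimized by the same (a, b): differences match exactly.
    print(neg_log_likelihood(0.15, 25) - neg_log_likelihood(0.2, 20))
    print((sse(0.15, 25) - sse(0.2, 20)) / (2 * 3.0 ** 2))   # same number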

  5. Linear regression
     • Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
     • Training data: $(x_1, y_1), \ldots, (x_n, y_n)$
     • Goal: estimate $a, b$ with $\hat{w} = (\hat{a}, \hat{b}) = \arg\min_w \sum_i [\epsilon_i(w)]^2$
     • Ways to estimate the parameters:
       – Find the derivative with respect to the parameters a, b
       – Set it to zero and solve
       – Or use gradient ascent to solve (a sketch follows below)
       – Or …
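A gradient-based sketch of the second option. The data, step size, and iteration count are assumptions for illustration; this runs descent on the squared error, which is equivalent to ascent on the Gaussian log-likelihood. Centering x first makes the slope and intercept nearly independent, so a single plain step size works:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 150, 40)
    y = 0.15 * x + 25 + rng.normal(0, 3, 40)
    xc = x - x.mean()                    # center the feature

    a, b = 0.0, 0.0
    lr = 5e-4                            # assumed step size
    for _ in range(20_000):
        resid = y - (a * xc + b)         # epsilon_i(w) for the centered data
        # step in the negative-gradient direction of the mean squared error
        # (up to a constant factor)
        a += lr * np.mean(resid * xc)
        b += lr * np.mean(resid)

    # recover the intercept on the original (uncentered) scale
    print("slope:", a, "intercept:", b - a * x.mean())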

  6. Linear regression
     • How to estimate the slope?
     $$\text{slope} \approx \frac{\Delta y}{\Delta x}, \qquad
       \hat{a} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
              = \frac{n \cdot \mathrm{cov}(X, Y)}{n \cdot \mathrm{var}(X)}$$
     [Figure: data points (x1, y1), (x2, y2), … with vertical residuals d1, d2, d3 to the fitted line]

  7. Linear regression
     • How to estimate the intercept?
     $$\bar{y} = \hat{a}\,\bar{x} + \hat{b} \quad\Rightarrow\quad \hat{b} = \bar{y} - \hat{a}\,\bar{x}$$
     [Figure: same data points and residuals d1, d2, d3 as the previous slide]
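A quick numeric check of the slope and intercept formulas from these two slides, compared against numpy's own least-squares fit (the data below is made up):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 150, 40)
    y = 0.15 * x + 25 + rng.normal(0, 3, 40)

    # closed-form slope and intercept from the slides
    a_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b_hat = y.mean() - a_hat * x.mean()

    a_np, b_np = np.polyfit(x, y, deg=1)              # numpy's least-squares line
    print(np.allclose([a_hat, b_hat], [a_np, b_np]))  # True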

  8. Bias/Variance Decomposition of Error

  9. Bias-variance decomposition of error
     • Return to the simple regression problem $f: X \to Y$
     $$y = f(x) + \varepsilon$$
     where $f$ is deterministic and the noise is $\varepsilon \sim N(0, \sigma)$.
     • What is the expected error for a learned $h$?

  10. Bias-variance decomposition of error
     $$E_D\!\left[\iint \big(f(x) + \varepsilon - h_D(x)\big)^2 \,\Pr(\varepsilon)\,\Pr(x)\; d\varepsilon\, dx\right]$$
     where $f$ is the true function and $h_D$ is learned from the dataset $D$.
     Experiment (the error of which I'd like to predict):
     1. Draw a size-n sample $D = (x_1, y_1), \ldots, (x_n, y_n)$
     2. Train a linear function $h_D$ using $D$
     3. Draw a test example $(x, f(x) + \varepsilon)$
     4. Measure the squared error of $h_D$ on that example

  11. Bias-variance decomposition of error (2)
     $$E_{D,\varepsilon}\!\left[\big(f(x) + \varepsilon - h_D(x)\big)^2\right]$$
     where $f$ is the true function and $h_D$ is learned from the dataset $D$.
     Fix x, then do this experiment:
     1. Draw a size-n sample $D = (x_1, y_1), \ldots, (x_n, y_n)$
     2. Train a linear function $h_D$ using $D$
     3. Draw the test example $(x, f(x) + \varepsilon)$
     4. Measure the squared error of $h_D$ on that example
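A Monte-Carlo version of this experiment. The true function, noise level, sample size, and test point below are assumptions chosen only to make the sketch runnable:

    import numpy as np

    rng = np.random.default_rng(3)
    f = lambda x: np.sin(x)          # assumed "true" deterministic function
    sigma, n, x0 = 0.3, 20, 1.5      # assumed noise level, training size, fixed test point

    errors = []
    for _ in range(5000):
        xs = rng.uniform(0, 3, n)                   # 1. draw a size-n sample D
        ys = f(xs) + rng.normal(0, sigma, n)
        a, b = np.polyfit(xs, ys, deg=1)            # 2. train a linear h_D on D
        t = f(x0) + rng.normal(0, sigma)            # 3. draw the test example (x0, f(x0)+eps)
        errors.append((t - (a * x0 + b)) ** 2)      # 4. squared error of h_D at x0

    print("estimated expected squared error at x0:", np.mean(errors))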

  12. Bias-variance decomposition of error
     Write $t = f(x) + \varepsilon$ (the observed test value; really $y$) and $\hat{y} = y_D = h_D(x)$.
     Insert and subtract $f$ (why $f$ and not something else?):
     $$\begin{aligned}
     E_{D,\varepsilon}\big[(t - \hat{y})^2\big]
       &= E\big[([t - f] + [f - \hat{y}])^2\big] \\
       &= E\big[[t - f]^2 + [f - \hat{y}]^2 + 2[t - f][f - \hat{y}]\big] \\
       &= E\big[[t - f]^2 + [f - \hat{y}]^2 + 2(tf - t\hat{y} - f^2 + f\hat{y})\big]
     \end{aligned}$$

  13. Bias-variance decomposition of error
     $$\begin{aligned}
     E_{D,\varepsilon}\big[(t - \hat{y})^2\big]
       &= E\big[([t - f] + [f - \hat{y}])^2\big] \\
       &= E\big[\varepsilon^2\big] + E\big[(f - \hat{y})^2\big]
          + 2\big(E[tf] - E[t\hat{y}] - E[f^2] + E[f\hat{y}]\big)
     \end{aligned}$$
     The cross terms cancel, since $t - f = \varepsilon$ has zero mean and is independent of $D$, leaving
     $$E\big[\varepsilon^2\big] + E\big[(f - \hat{y})^2\big]$$
     • $E[\varepsilon^2]$: intrinsic noise
     • $E[(f - \hat{y})^2]$: depends on how well the learner approximates $f$

  14. Bias-variance decomposition of error
     Now decompose the second term. Let $\bar{h} = E_D\{h_D(x)\}$ and recall $\hat{y} = y_D = h_D(x)$:
     $$\begin{aligned}
     E\big[(f - \hat{y})^2\big]
       &= E\big[([f - \bar{h}] + [\bar{h} - \hat{y}])^2\big] \\
       &= E\big[[f - \bar{h}]^2 + [\bar{h} - \hat{y}]^2 + 2[f - \bar{h}][\bar{h} - \hat{y}]\big] \\
       &= E\big[(f - \bar{h})^2\big] + E\big[(\bar{h} - \hat{y})^2\big]
          + 2\big(E[f\bar{h}] - E[f\hat{y}] - E[\bar{h}^2] + E[\bar{h}\hat{y}]\big)
     \end{aligned}$$
     Again the cross terms cancel, since $E_D[\hat{y}] = \bar{h}$, leaving BIAS² + VARIANCE:
     • BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, $E_D[h_D(x)]$.
     • VARIANCE: the squared difference between our long-term expectation for the learner's performance, $E_D[h_D(x)]$, and what we expect in a representative run on a dataset D ($\hat{y}$).
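A numerical check of the full decomposition at a fixed test point: the estimated expected squared error should be approximately noise variance + bias² + variance. The true function, noise level, and sample size are the same made-up choices as in the sketch a few slides back:

    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.sin(x)
    sigma, n, x0, trials = 0.3, 20, 1.5, 20000

    preds, errors = [], []
    for _ in range(trials):
        xs = rng.uniform(0, 3, n)
        ys = f(xs) + rng.normal(0, sigma, n)
        a, b = np.polyfit(xs, ys, deg=1)
        y_hat = a * x0 + b                       # h_D(x0)
        preds.append(y_hat)
        t = f(x0) + rng.normal(0, sigma)
        errors.append((t - y_hat) ** 2)

    h_bar = np.mean(preds)                       # E_D[h_D(x0)]
    noise = sigma ** 2                           # E[(t - f)^2]
    bias2 = (f(x0) - h_bar) ** 2                 # (f(x0) - E_D[h_D(x0)])^2
    variance = np.var(preds)                     # E_D[(h_D(x0) - E_D[h_D(x0)])^2]

    print(np.mean(errors), noise + bias2 + variance)   # approximately equal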

  15. Bias-variance decomposition
     • How can you reduce the bias of a learner?
       – Make the long-term average better approximate the true function f(x).
     • How can you reduce the variance of a learner?
       – Make the learner less sensitive to variations in the data.

  16. A generalization of the bias-variance decomposition to other loss functions
     • "Arbitrary" real-valued loss $L(t, y)$, but with $L(y, y') = L(y', y)$, $L(y, y) = 0$, and $L(y, y') \neq 0$ if $y \neq y'$
     • Define the "optimal prediction": $y^* = \arg\min_{y'} L(t, y')$
     • Define the "main prediction of the learner": $y_m = y_{m,D} = \arg\min_{y'} E_D\{L(t, y')\}$
     • Define the "bias of the learner": $B(x) = L(y^*, y_m)$
     • Define the "variance of the learner": $V(x) = E_D[L(y_m, y)]$
     • Define the "noise for x": $N(x) = E_t[L(t, y^*)]$
     • Claim (a toy illustration follows below):
     $$E_{D,t}[L(t, y)] = c_1 N(x) + B(x) + c_2 V(x)$$
     where $c_1 = \Pr_D[y = y^*] - 1$ and $c_2 = 1$ if $y_m = y^*$, $-1$ otherwise.
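A toy sketch of how the definitions above play out for zero-one loss L(t, y) = [t ≠ y] at a single x. The label distribution and the learner's prediction distribution are made up, and the many training sets D are simulated directly as a list of learner outputs:

    import numpy as np

    rng = np.random.default_rng(5)
    labels = rng.random(100_000) < 0.8    # t at this x: True with probability 0.8
    preds = rng.random(500) < 0.6         # h_D(x) over 500 simulated datasets D

    y_star = labels.mean() > 0.5          # optimal prediction under 0/1 loss: majority label
    y_main = preds.mean() > 0.5           # main prediction: the learner's most common output

    noise = np.mean(labels != y_star)     # N(x) = E_t[L(t, y*)]
    bias = float(y_star != y_main)        # B(x) = L(y*, y_m), either 0 or 1
    variance = np.mean(preds != y_main)   # V(x) = E_D[L(y_m, h_D(x))]
    print(noise, bias, variance)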

  17. Other regression methods

  18. Example: univariate linear regression
     • Example: predict age from number of publications
     • Fitted line: $\hat{y} = \frac{1}{7} x + 26$
     • Paul Erdős, Hungarian mathematician, 1913-1996: x ≈ 1500 publications, so the predicted age is about 240 (≈ 1500/7 + 26).
     [Scatter plot: Age in Years (0–50) vs. Number of Publications (0–160), with the fitted line]

  19. Linear regression
     Summary:
     $$\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad
       \hat{b} = \bar{y} - \hat{a}\,\bar{x}$$
     To simplify:
     • assume zero-centered data, as we did for PCA
     • let $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$
     • then
     $$\hat{a} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}, \qquad \hat{b} = 0$$
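A quick numeric check of the simplification (made-up data): after centering both variables, the one-dimensional version of $(\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}$ gives the same slope as the covariance/variance formula from slide 6:

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 150, 40)
    y = 0.15 * x + 25 + rng.normal(0, 3, 40)
    xc, yc = x - x.mean(), y - y.mean()            # zero-center both variables

    a_centered = (xc @ yc) / (xc @ xc)             # (x^T x)^{-1} x^T y for 1-D x
    a_cov = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    print(np.isclose(a_centered, a_cov))           # True: same slope either way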

  20. Onward: multivariate linear regression
     Univariate:
     $$\hat{a} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$$
     Multivariate: each row of X is an example, each column is a feature:
     $$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
       X = \begin{pmatrix} x_1^1, \ldots, x_1^k \\ \vdots \\ x_n^1, \ldots, x_n^k \end{pmatrix}$$
     $$\hat{y} = \hat{w}^1 x^1 + \ldots + \hat{w}^k x^k, \qquad
       \hat{w} = (X^T X)^{-1} X^T \mathbf{y}$$
     which solves
     $$\hat{w} = \arg\min_w \sum_i [\epsilon_i(w)]^2, \qquad \epsilon_i(w) = y_i - w^T x_i$$
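A sketch of the normal equations with a random design matrix; all dimensions and numbers below are made up, and np.linalg.lstsq is used only as a cross-check. Using a linear solve instead of an explicit matrix inverse computes the same $\hat{w}$ more stably:

    import numpy as np

    rng = np.random.default_rng(7)
    n, k = 200, 3
    X = rng.normal(size=(n, k))                    # rows are examples, columns are features
    w_true = np.array([1.5, -2.0, 0.5])
    y = X @ w_true + rng.normal(0, 0.1, n)

    w_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X^T X)^{-1} X^T y, without an explicit inverse
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(w_hat, w_lstsq))             # True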

  21. Onward: regularized multivariate linear regression
     Unregularized (with y and X as on the previous slide):
     $$\hat{w} = \arg\min_w \sum_i [\epsilon_i(w)]^2, \qquad \epsilon_i(w) = y_i - w^T x_i, \qquad
       \hat{w} = (X^T X)^{-1} X^T \mathbf{y}$$
     Regularized:
     $$\hat{w} = \arg\min_w \sum_i [\epsilon_i(w)]^2 + \lambda\, w^T w, \qquad
       \hat{w} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$
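A sketch of the regularized closed form next to the unregularized one; the data and the value of lambda are illustrative. One visible consequence of the penalty is that the ridge weight vector is no larger than the ordinary least-squares one:

    import numpy as np

    rng = np.random.default_rng(8)
    n, k, lam = 200, 3, 10.0                       # lambda chosen arbitrarily
    X = rng.normal(size=(n, k))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, n)

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)                    # (X^T X)^{-1} X^T y
    w_ridge = np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)  # (lambda*I + X^T X)^{-1} X^T y
    print(np.linalg.norm(w_ridge) <= np.linalg.norm(w_ols))      # True: the penalty shrinks the weights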

  22. Onward: multivariate linear regression
     Multivariate, multiple outputs: each row of Y holds the m target values for one example:
     $$Y = \begin{pmatrix} y_1^1, \ldots, y_1^m \\ \vdots \\ y_n^1, \ldots, y_n^m \end{pmatrix}, \qquad
       X = \begin{pmatrix} x_1^1, \ldots, x_1^k \\ \vdots \\ x_n^1, \ldots, x_n^k \end{pmatrix}$$
     $$\text{Single output: } \hat{y} = \hat{w}^1 x^1 + \ldots + \hat{w}^k x^k, \quad \hat{w} = (X^T X)^{-1} X^T \mathbf{y}
       \qquad \text{Multiple outputs: } \hat{\mathbf{y}} = \hat{W} \mathbf{x}, \quad \hat{W} = (X^T X)^{-1} X^T Y$$
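A sketch of the multiple-output case (all sizes and numbers made up): stacking the targets as columns of an n × m matrix Y lets one solve for all m weight vectors at once, as columns of W:

    import numpy as np

    rng = np.random.default_rng(9)
    n, k, m = 200, 3, 2
    X = rng.normal(size=(n, k))
    W_true = rng.normal(size=(k, m))
    Y = X @ W_true + rng.normal(0, 0.1, (n, m))

    W_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # k x m: one column of weights per output
    print(W_hat.shape, np.allclose(W_hat, W_true, atol=0.05))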

  23. Onward: regularized multivariate linear regression
     Same setup as slide 21:
     $$\hat{w} = \arg\min_w \sum_i [\epsilon_i(w)]^2 + \lambda\, w^T w, \qquad
       \hat{w} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$
     • What does increasing λ do? (An empirical illustration follows below.)
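One way to see the answer empirically; the data and the λ values are toy choices of the same flavor as the earlier sketches:

    import numpy as np

    rng = np.random.default_rng(10)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

    for lam in [0.0, 1.0, 100.0, 10000.0]:
        w = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ y)
        print(lam, np.round(w, 3))    # the weights shrink toward zero as lambda grows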

  24. Onward: regularized multivariate linear regression
     Now each example has a constant plus a single feature, so each row of X is $(1, x_i)$ and $w = (w_1, w_2)$:
     $$X = \begin{pmatrix} 1, x_1 \\ \vdots \\ 1, x_n \end{pmatrix}, \qquad
       \hat{w} = \arg\min_w \sum_i [\epsilon_i(w)]^2 + \lambda\, w^T w, \qquad
       \hat{w} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$
     • What does fixing w2 = 0 do (if λ = 0)?
