  1. Linear Regression Models Based on Chapter 3 of Hastie, Tibshirani and Friedman

  2. Linear Regression Models
     f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j
     Here the X's might be:
     • Raw predictor variables (continuous or coded-categorical)
     • Transformed predictors (X_4 = log X_3)
     • Basis expansions (X_4 = X_3^2, X_5 = X_3^3, etc.)
     • Interactions (X_4 = X_2 X_3)
     A popular choice for estimation is least squares:
     RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
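     A minimal R sketch (not from the slides) of fitting such a model by least squares with a transformed predictor, a basis-expansion term, and an interaction; the simulated data frame d and its variable names are made up for illustration:

         # Made-up data: three raw predictors and a response
         set.seed(1)
         d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rexp(100))
         d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * log(d$x3) + rnorm(100, sd = 0.3)

         # Raw predictors, a log transform, a quadratic term, and an interaction
         fit <- lm(y ~ x1 + x2 + log(x3) + I(x1^2) + x1:x2, data = d)
         summary(fit)   # least-squares estimates, standard errors, t-tests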

  3. Least Squares
     RSS(\beta) = (y - X\beta)^T (y - X\beta)
     \hat{\beta} = (X^T X)^{-1} X^T y
     \hat{y} = X \hat{\beta} = X (X^T X)^{-1} X^T y,  where H = X (X^T X)^{-1} X^T is the "hat matrix"
     Often assume that the Y's are independent and normally distributed, leading to various classical statistical tests and confidence intervals.
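     A small R sketch (not from the slides) of these matrix formulas on made-up data, checked against lm():

         set.seed(2)
         N <- 50; p <- 3
         X <- cbind(1, matrix(rnorm(N * p), N, p))   # design matrix with an intercept column
         y <- X %*% c(1, 2, -1, 0.5) + rnorm(N)

         beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
         H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix
         y_hat <- H %*% y                            # fitted values

         fit <- lm(y ~ X - 1)                        # same fit via lm()
         max(abs(coef(fit) - beta_hat))              # agreement up to numerical error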

  4. Gauss-Markov Theorem
     Consider any linear combination of the β's:  \theta = a^T \beta
     The least squares estimate of θ is:  \hat{\theta} = a^T \hat{\beta} = a^T (X^T X)^{-1} X^T y
     If the linear model is correct, this estimate is unbiased (X fixed):
     E(\hat{\theta}) = E\big( a^T (X^T X)^{-1} X^T y \big) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta
     Gauss-Markov states that for any other linear unbiased estimator \tilde{\theta} = c^T y, i.e. with E(c^T y) = a^T \beta:
     Var(a^T \hat{\beta}) \le Var(c^T y)
     Of course, there might be a biased estimator with lower MSE…

  5. Bias-Variance
     For any estimator \tilde{\theta}:
     MSE(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2
                         = E\big(\tilde{\theta} - E(\tilde{\theta}) + E(\tilde{\theta}) - \theta\big)^2
                         = E\big(\tilde{\theta} - E(\tilde{\theta})\big)^2 + \big(E(\tilde{\theta}) - \theta\big)^2
                         = Var(\tilde{\theta}) + \big(E(\tilde{\theta}) - \theta\big)^2    (variance + squared bias)
     Note that MSE is closely related to prediction error:
     E(Y_0 - x_0^T \tilde{\theta})^2 = E(Y_0 - x_0^T \theta)^2 + E(x_0^T \theta - x_0^T \tilde{\theta})^2 = \sigma^2 + MSE(x_0^T \tilde{\theta})
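     To make the last remark on the Gauss-Markov slide concrete, here is a small simulation (not from the slides): shrinking the sample mean toward zero introduces bias but can lower the MSE. The true value 0.5 and the shrinkage factor 0.8 are arbitrary choices for illustration:

         set.seed(3)
         theta <- 0.5                        # true parameter
         n <- 10                             # observations per replication
         est_unbiased <- replicate(10000, mean(rnorm(n, mean = theta)))
         est_shrunk   <- 0.8 * est_unbiased  # biased toward zero

         mean((est_unbiased - theta)^2)      # ≈ Var = 1/n = 0.10
         mean((est_shrunk   - theta)^2)      # ≈ 0.64 * 0.10 + (0.2 * 0.5)^2 = 0.074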

  6. Too Many Predictors?
     When there are lots of X's, we get models with high variance and prediction suffers. Three "solutions":
     1. Subset selection (all-subsets + leaps-and-bounds, stepwise methods; subset size scored by AIC, BIC, etc.)
     2. Shrinkage / ridge regression
     3. Derived inputs

  7. Subset Selection
     • Standard "all-subsets" selection finds, for each k = 1, …, p, the subset of size k that minimizes the RSS
     • The choice of subset size requires a tradeoff – AIC, BIC, marginal likelihood, cross-validation, etc.
     • "Leaps and bounds" is an efficient algorithm for the all-subsets search (an R sketch follows)
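     A minimal sketch (not from the slides) using the leaps package, which implements the leaps-and-bounds search; the simulated data are made up for illustration:

         library(leaps)
         set.seed(4)
         d <- data.frame(matrix(rnorm(100 * 5), 100, 5))   # made-up predictors
         names(d) <- paste0("x", 1:5)
         d$y <- 2 * d$x1 - d$x2 + rnorm(100)

         best <- regsubsets(y ~ ., data = d, nvmax = 5)    # best subset of each size k = 1..5
         summary(best)$which                               # which variables enter at each size
         summary(best)$bic                                 # BIC, one way to choose the subset size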

  8. Cross-Validation
     e.g. 10-fold cross-validation:
      Randomly divide the data into ten parts
      Train the model on 9 tenths and compute the prediction error on the remaining tenth
      Do this for each tenth of the data
      Average the 10 prediction-error estimates
     "One standard error rule": pick the simplest model within one standard error of the minimum
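     A hand-rolled 10-fold cross-validation sketch in R (not from the slides), estimating squared prediction error for a linear model on made-up data:

         set.seed(5)
         n <- 100
         d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
         d$y <- 1 + 2 * d$x1 + rnorm(n)

         K <- 10
         fold <- sample(rep(1:K, length.out = n))           # random assignment to 10 folds
         cv_err <- numeric(K)
         for (k in 1:K) {
           fit  <- lm(y ~ x1 + x2, data = d[fold != k, ])   # train on 9 tenths
           pred <- predict(fit, newdata = d[fold == k, ])   # predict the held-out tenth
           cv_err[k] <- mean((d$y[fold == k] - pred)^2)
         }
         mean(cv_err)                                       # cross-validated error estimate
         sd(cv_err) / sqrt(K)                               # its standard error, for the 1-SE rule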

  9. Shrinkage Methods
     • Subset selection is a discrete process – individual variables are either in or out
     • This method can have high variance – a different dataset from the same source can result in a totally different model
     • Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included, but with a shrunken coefficient.

  10. Ridge Regression
     \hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2   subject to   \sum_{j=1}^{p} \beta_j^2 \le s
     Equivalently:
     \hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}
     This leads to:
     \hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y   – works even when X^T X is singular
     Choose λ by cross-validation. Predictors should be centered.
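     A minimal R sketch of the closed-form ridge solution (not from the slides), with predictors and response centered so the intercept drops out; λ = 1 is an arbitrary value for illustration and would normally be chosen by cross-validation (e.g. via MASS::lm.ridge or glmnet):

         set.seed(6)
         N <- 50; p <- 5
         X <- matrix(rnorm(N * p), N, p)
         y <- X[, 1] - 2 * X[, 2] + rnorm(N)

         Xc <- scale(X, center = TRUE, scale = FALSE)   # center the predictors
         yc <- y - mean(y)                              # center the response
         lambda <- 1                                    # illustrative; choose by CV in practice

         beta_ridge <- solve(t(Xc) %*% Xc + lambda * diag(p), t(Xc) %*% yc)
         beta_ridge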

  11. [Figure slide: the effective number of X's (effective degrees of freedom) in ridge regression]
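     For reference, drawn from HTF Chapter 3 rather than the slide itself: the "effective number of X's" used by ridge regression is usually quantified as the effective degrees of freedom

         df(\lambda) = tr\big[ X (X^T X + \lambda I)^{-1} X^T \big] = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},

     where the d_j are the singular values of the centered X; df(0) = p, and df(\lambda) shrinks toward 0 as λ grows.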

  12. Ridge Regression = Bayesian Regression
     y_i \sim N(\beta_0 + x_i^T \beta, \sigma^2)
     \beta_j \sim N(0, \tau^2)
     This is the same as ridge with \lambda = \sigma^2 / \tau^2

  13. The Lasso
     \hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2   subject to   \sum_{j=1}^{p} |\beta_j| \le s
     A quadratic programming algorithm is needed to solve for the parameter estimates. Choose s via cross-validation.
     More generally, with penalty \sum_j |\beta_j|^q:
     \tilde{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}
     q = 0: variable selection;  q = 1: lasso;  q = 2: ridge.  Learn q?
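     As an illustration (not part of the slides), the lasso is nowadays usually fit with the glmnet package, which also handles the cross-validated choice of penalty; a minimal sketch on made-up data:

         library(glmnet)
         set.seed(7)
         N <- 100; p <- 10
         X <- matrix(rnorm(N * p), N, p)
         y <- 3 * X[, 1] + X[, 2] + rnorm(N)

         cvfit <- cv.glmnet(X, y, alpha = 1)    # alpha = 1 gives the lasso penalty
         coef(cvfit, s = "lambda.min")          # coefficients at the CV-minimizing penalty
         coef(cvfit, s = "lambda.1se")          # "one standard error rule" choice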

  14. [Figure slide: coefficient paths plotted as a function of 1/λ]

  15. Principal Component Regression
     Consider an eigen-decomposition of X^T X (and hence of the covariance matrix of X), where X is first centered and is N x p:
     X^T X = V D^2 V^T
     The eigenvectors v_j are called the principal components of X. D is diagonal with entries d_1 ≥ d_2 ≥ … ≥ d_p.
     Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X:  Var(Xv_1) = d_1^2 / N
     Xv_k has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.

  16. Principal Component Regression
     PC regression regresses y on the first M principal components, where M < p.
     Similar to ridge regression in some respects – see HTF, p. 66.
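     A minimal R sketch of principal component regression (not from the slides), using prcomp() and regressing on the first M component scores; M = 2 and the simulated data are illustrative only (the pls package's pcr() function provides the same idea with built-in cross-validation):

         set.seed(8)
         N <- 100; p <- 6
         X <- matrix(rnorm(N * p), N, p)
         y <- X[, 1] + 0.5 * X[, 2] + rnorm(N)

         pc <- prcomp(X, center = TRUE, scale. = FALSE)   # principal components of centered X
         M <- 2                                           # number of components kept (illustrative)
         Z <- pc$x[, 1:M]                                 # first M component scores (Z = X V)
         fit_pcr <- lm(y ~ Z)                             # regress y on the derived inputs
         coef(fit_pcr)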

  17. www.r-project.org/user-2006/Slides/Hesterberg+Fraley.pdf

  18. R demo: a simple forward-stagewise fit with two simulated predictors

     # Simulate two predictors and a response y = 3*x1 + x2 + noise
     x1 <- rnorm(10)
     x2 <- rnorm(10)
     y  <- (3 * x1) + x2 + rnorm(10, 0.1)

     # Plot y against each predictor, with the single-variable (no-intercept) fits
     par(mfrow = c(1, 2))
     plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
     abline(lm(y ~ -1 + x1))
     plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
     abline(lm(y ~ -1 + x2))

     # Forward stagewise: repeatedly nudge the coefficient of the predictor most
     # correlated with the current residual by a small step epsilon
     epsilon <- 0.1
     r <- y                # residual starts at y (both coefficients are zero)
     beta <- c(0, 0)
     numIter <- 25
     for (i in 1:numIter) {
       cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
       if (cor(x1, r) > cor(x2, r)) {
         delta <- epsilon * (2 * (sum(r * x1) > 0) - 1)   # +/- epsilon, sign of x1'r (scalar)
         beta[1] <- beta[1] + delta
         r <- r - (delta * x1)
         par(mfg = c(1, 1))                               # draw on the left panel
         abline(0, beta[1], col = "red")
       }
       if (cor(x1, r) <= cor(x2, r)) {
         delta <- epsilon * (2 * (sum(r * x2) > 0) - 1)   # +/- epsilon, sign of x2'r (scalar)
         beta[2] <- beta[2] + delta
         r <- r - (delta * x2)
         par(mfg = c(1, 2))                               # draw on the right panel
         abline(0, beta[2], col = "green")
       }
     }

  19. LARS
     ► Start with all coefficients b_j = 0
     ► Find the predictor x_j most correlated with y
     ► Increase b_j in the direction of the sign of its correlation with y, taking residuals r = y - ŷ along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
     ► Increase (b_j, b_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
     ► Continue until all predictors are in the model
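     As an illustration (not part of the slides), the lars package implements this algorithm; a minimal sketch on made-up data:

         library(lars)
         set.seed(9)
         N <- 100; p <- 5
         X <- matrix(rnorm(N * p), N, p)
         y <- 3 * X[, 1] + X[, 2] + rnorm(N)

         fit <- lars(X, y, type = "lar")   # least angle regression path
         plot(fit)                         # coefficient paths as predictors enter the model
         coef(fit)                         # coefficients at each step along the path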

  20. Fused Lasso
     • If there are many correlated features, the lasso gives non-zero weight to only one of them
     • Maybe correlated features (e.g. time-ordered) should have similar coefficients?
     Tibshirani et al. (2005)
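     The penalty is not reproduced in the extracted slide text; as given in Tibshirani et al. (2005), the fused lasso constrains both the coefficients and their successive differences:

         \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
         subject to   \sum_{j=1}^{p} |\beta_j| \le s_1   and   \sum_{j=2}^{p} |\beta_j - \beta_{j-1}| \le s_2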

  21. Group Lasso
     • Suppose you represent a categorical predictor with indicator variables
     • Might want the set of indicators to be in or out
     regular lasso:
     group lasso:
     Yuan and Lin (2006)
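     The two penalty formulas are not reproduced in the extracted slide text; stating the standard comparison (following Yuan and Lin, 2006), with groups g = 1, …, G of sizes p_g and \beta_{(g)} the coefficients of group g:

         regular lasso penalty:   \lambda \sum_{j=1}^{p} |\beta_j|
         group lasso penalty:     \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \lVert \beta_{(g)} \rVert_2

     The unsquared 2-norm sets whole groups of coefficients exactly to zero, just as |\beta_j| does for individual coefficients.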
