SLIDE 1

Linear Regression Models

Based on Chapter 3 of Hastie, Tibshirani and Friedman

SLIDE 2

Linear Regression Models

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j

  • Here the X’s might be:
  • Raw predictor variables (continuous or coded-categorical)
  • Transformed predictors (X4 = log X3)
  • Basis expansions (X4 = X3^2, X5 = X3^3, etc.)
  • Interactions (X4 = X2 · X3)

Popular choice for estimation is least squares:

RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
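
To make the criterion concrete, here is a minimal R sketch (data and variable names invented for illustration) that evaluates RSS for a candidate coefficient vector and confirms that lm() minimizes it:

set.seed(1)
N <- 100; p <- 3
X <- matrix(rnorm(N * p), N, p)                       # raw predictors
y <- as.numeric(1 + X %*% c(2, -1, 0.5) + rnorm(N))   # beta0 = 1 plus noise

# RSS(beta) = sum_i ( y_i - beta0 - sum_j x_ij beta_j )^2
rss <- function(beta0, beta) sum((y - beta0 - X %*% beta)^2)

rss(0, c(0, 0, 0))                   # RSS of the all-zero fit
fit <- lm(y ~ X)                     # least squares minimizes this criterion
rss(coef(fit)[1], coef(fit)[-1])     # RSS at the least squares solution
sum(resid(fit)^2)                    # same number, as reported by lm()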

SLIDE 3

SLIDE 4

Least Squares

RSS(\beta) = (y - X\beta)^T (y - X\beta)

\hat{\beta} = (X^T X)^{-1} X^T y

\hat{y} = X \hat{\beta} = X (X^T X)^{-1} X^T y

  • X (X^T X)^{-1} X^T is the “hat matrix”

Often assume that the Y’s are independent and normally distributed, leading to various classical statistical tests and confidence intervals
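
A small R sketch of these formulas (simulated data, names invented for illustration), checking the closed-form solution and the hat matrix against lm():

set.seed(2)
N <- 50
X <- cbind(1, rnorm(N), rnorm(N))               # design matrix with intercept column
y <- as.numeric(X %*% c(1, 2, -1) + rnorm(N))

beta.hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)           # hat matrix
y.hat <- H %*% y                                # fitted values: y-hat = H y

max(abs(beta.hat - coef(lm(y ~ X - 1))))        # matches lm() up to rounding
sum(diag(H))                                    # trace(H) = number of parameters (3)
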
SLIDE 5

Gauss-Markov Theorem

Consider any linear combination of the β’s: \theta = a^T \beta

The least squares estimate of θ is:

\hat{\theta} = a^T \hat{\beta} = a^T (X^T X)^{-1} X^T y

If the linear model is correct, this estimate is unbiased (X fixed):

E(\hat{\theta}) = E(a^T (X^T X)^{-1} X^T y) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta

Gauss-Markov states that for any other linear unbiased estimator \tilde{\theta} = c^T y (i.e., E(c^T y) = a^T \beta):

Var(a^T \hat{\beta}) \le Var(c^T y)

Of course, there might be a biased estimator with lower MSE…
SLIDE 6

Bias-Variance

For any estimator \tilde{\theta}:

MSE(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2
                    = E(\tilde{\theta} - E(\tilde{\theta}) + E(\tilde{\theta}) - \theta)^2
                    = E(\tilde{\theta} - E(\tilde{\theta}))^2 + (E(\tilde{\theta}) - \theta)^2
                    = Var(\tilde{\theta}) + bias^2

Note MSE closely related to prediction error:

E(Y - x^T \tilde{\theta})^2 = \sigma^2 + E(x^T \theta - x^T \tilde{\theta})^2 = \sigma^2 + MSE(x^T \tilde{\theta})
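
A quick Monte Carlo check of the decomposition (the shrunken-mean estimator and the numbers below are invented purely for illustration):

# Check MSE = Var + bias^2 by simulation for a deliberately biased estimator.
set.seed(3)
theta <- 5                                                     # true parameter
est <- replicate(10000, 0.8 * mean(rnorm(20, mean = theta)))   # shrink the mean toward 0

mse  <- mean((est - theta)^2)
bias <- mean(est) - theta
c(MSE = mse, Var.plus.bias2 = var(est) + bias^2)   # the two agree up to Monte Carlo error
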
SLIDE 7

Too Many Predictors?

When there are lots of X’s, we get models with high variance and prediction suffers. Three “solutions”:

1. Subset selection (score by AIC, BIC, etc.; search via all-subsets with leaps-and-bounds, or stepwise methods)
2. Shrinkage/Ridge Regression
3. Derived Inputs

SLIDE 8

Subset Selection

  • Standard “all-subsets” finds the subset of size k, k = 1, …, p, that minimizes RSS
  • Choice of subset size requires a tradeoff – AIC, BIC, marginal likelihood, cross-validation, etc.
  • “Leaps and bounds” is an efficient algorithm to do all-subsets
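
For reference, the leaps package in R implements the leaps-and-bounds search; a minimal sketch on simulated data (dataset and variable names are invented here):

# Best-subset selection via leaps-and-bounds (requires the 'leaps' package).
library(leaps)

set.seed(4)
d <- data.frame(matrix(rnorm(100 * 6), 100, 6))
names(d) <- paste0("x", 1:6)
d$y <- 2 * d$x1 - d$x3 + rnorm(100)

fit <- regsubsets(y ~ ., data = d, nvmax = 6)   # best subset of each size 1..6
s <- summary(fit)
s$which                                         # variables chosen at each size
which.min(s$bic)                                # subset size favored by BIC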

SLIDE 9

Cross-Validation

  • e.g. 10-fold cross-validation:
  • Randomly divide the data into ten parts
  • Train the model using nine tenths and compute the prediction error on the remaining tenth
  • Do this for each tenth of the data
  • Average the 10 prediction error estimates

“One standard error rule”: pick the simplest model within one standard error of the minimum
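
A hand-rolled version of this recipe for a linear model (simulated data; names invented for illustration):

# 10-fold cross-validation of a linear model, written out by hand.
set.seed(5)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

K <- 10
fold <- sample(rep(1:K, length.out = n))         # random fold assignment

cv.err <- sapply(1:K, function(k) {
  fit  <- lm(y ~ x1 + x2, data = d[fold != k, ]) # train on nine tenths
  pred <- predict(fit, newdata = d[fold == k, ]) # predict the held-out tenth
  mean((d$y[fold == k] - pred)^2)                # prediction error for this fold
})

mean(cv.err)                                     # CV estimate of prediction error
sd(cv.err) / sqrt(K)                             # its standard error (for the 1-SE rule)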

SLIDE 10

Shrinkage Methods

  • Subset selection is a discrete process – individual variables are either in or out
  • This method can have high variance – a different dataset from the same source can result in a totally different model
  • Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.

SLIDE 11

Ridge Regression

\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2

subject to: \sum_{j=1}^{p} \beta_j^2 \le s

  • Equivalently:

\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}

  • This leads to:

\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y

  • works even when X^T X is singular

Choose λ by cross-validation. Predictors should be centered.
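
The closed form is easy to verify directly in R (simulated, centered data; names invented for illustration):

# Ridge regression via the closed form (X'X + lambda I)^{-1} X'y.
set.seed(6)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))           # centered (and scaled) predictors
y <- as.numeric(X %*% c(3, 0, 0, -2, 0) + rnorm(n))
y <- y - mean(y)                                 # center the response as well

ridge <- function(X, y, lambda)
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

ridge(X, y, lambda = 0)        # lambda = 0 recovers ordinary least squares
ridge(X, y, lambda = 10)       # coefficients are shrunk toward zero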

SLIDE 12

[Figure: effective number of X’s]

SLIDE 13

Ridge Regression = Bayesian Regression

y_i \sim N(\beta_0 + x_i^T \beta, \sigma^2), \qquad \beta_j \sim N(0, \tau^2)

  • same as ridge with \lambda = \sigma^2 / \tau^2
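
To spell out the correspondence: up to additive constants,

-\log p(\beta \mid y) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \frac{1}{2\tau^2} \sum_{j=1}^{p} \beta_j^2 ,

and multiplying through by 2\sigma^2 shows that maximizing the posterior is the same as minimizing

\sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \frac{\sigma^2}{\tau^2} \sum_{j=1}^{p} \beta_j^2 ,

i.e. the ridge criterion with \lambda = \sigma^2/\tau^2; since the posterior is Gaussian, its mode also equals its mean.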

SLIDE 14

The Lasso

\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2

subject to: \sum_{j=1}^{p} |\beta_j| \le s

  • Quadratic programming algorithm needed to solve for the parameter estimates. Choose s via cross-validation.

  • More generally, with an |\beta_j|^q penalty:

\tilde{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}

  • q=0: var. sel.   q=1: lasso   q=2: ridge   Learn q?
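
In current practice the lasso path is usually computed with coordinate descent; a minimal sketch using the glmnet package on simulated data (the package choice and variable names are illustrative, not from the slides):

# Lasso fit with a cross-validated penalty (requires the 'glmnet' package).
library(glmnet)

set.seed(7)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(3, -2, rep(0, p - 2)) + rnorm(n))

cvfit <- cv.glmnet(X, y, alpha = 1)     # alpha = 1: lasso penalty; alpha = 0: ridge
coef(cvfit, s = "lambda.min")           # sparse coefficient vector at the chosen lambda
coef(cvfit, s = "lambda.1se")           # simpler model from the one-standard-error rule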

SLIDE 15

SLIDE 16

[Figure: plotted as a function of 1/lambda]

SLIDE 17

Principal Component Regression

Consider an eigen-decomposition of X^T X (and hence of the covariance matrix of X):

X^T X = V D^2 V^T

(X is first centered; X is N x p)

The eigenvectors v_j are called the principal components of X. D is diagonal with entries d_1 ≥ d_2 ≥ … ≥ d_p.

Xv_1 has largest sample variance amongst all normalized linear combinations of the columns of X:

var(Xv_1) = d_1^2 / N

Xv_k has largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.
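
A small R check of these facts on simulated data (names invented for illustration):

# Eigen-decomposition of X'X for centered X, and the variance of the first PC.
set.seed(8)
N <- 100; p <- 4
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)

e  <- eigen(crossprod(X))        # X'X = V D^2 V'
V  <- e$vectors                  # principal component directions v_1, ..., v_p
d2 <- e$values                   # d_1^2 >= ... >= d_p^2

z1 <- X %*% V[, 1]               # first principal component Xv_1 (mean zero)
c(sum(z1^2) / N, d2[1] / N)      # sample variance of Xv_1 equals d_1^2 / N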

SLIDE 18

SLIDE 19

Principal Component Regression

PC Regression regresses y on the first M principal components, where M < p. Similar to ridge regression in some respects – see HTF, p. 66.
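
Continuing the sketch above, principal component regression is just least squares on the first M component scores (M and the data below are invented for illustration):

# Principal component regression: regress y on the first M principal components.
set.seed(9)
N <- 100; p <- 6; M <- 2
X <- matrix(rnorm(N * p), N, p)
y <- as.numeric(X %*% c(2, -1, rep(0, p - 2)) + rnorm(N))

pc <- prcomp(X, center = TRUE, scale. = FALSE)   # principal components of X
Z  <- pc$x[, 1:M]                                # scores on the first M components

pcr.fit  <- lm(y ~ Z)                            # least squares on the scores
beta.pcr <- pc$rotation[, 1:M] %*% coef(pcr.fit)[-1]   # back on the original X scale
beta.pcr                                         # coefficients implied by M components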

SLIDE 20

www.r-project.org/user-2006/Slides/Hesterberg+Fraley.pdf

SLIDE 21

SLIDE 22

# Incremental forward-stagewise regression demo with two predictors:
# repeatedly take a small step (epsilon) on whichever predictor is more
# correlated with the current residual, plotting the evolving fits.
x1 <- rnorm(10)
x2 <- rnorm(10)
y  <- (3 * x1) + x2 + rnorm(10, 0.1)

par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))

epsilon <- 0.1      # step size
r       <- y        # current residual
beta    <- c(0, 0)  # coefficients, both starting at zero
numIter <- 25

for (i in 1:numIter) {
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  if (cor(x1, r) > cor(x2, r)) {
    delta   <- epsilon * ((2 * (sum(r * x1) > 0)) - 1)  # +/- epsilon, sign of <r, x1>
    beta[1] <- beta[1] + delta
    r       <- r - (delta * x1)
    par(mfg = c(1, 1))
    abline(0, beta[1], col = "red")
  }
  if (cor(x1, r) <= cor(x2, r)) {
    delta   <- epsilon * ((2 * (sum(r * x2) > 0)) - 1)  # +/- epsilon, sign of <r, x2>
    beta[2] <- beta[2] + delta
    r       <- r - (delta * x2)
    par(mfg = c(1, 2))
    abline(0, beta[2], col = "green")
  }
}

SLIDE 23

SLIDE 24

SLIDE 25

LARS

► Start with all coefficients bj = 0
► Find the predictor xj most correlated with y
► Increase bj in the direction of the sign of its correlation with y. Take residuals r = y - yhat along the way. Stop when some other predictor xk has as much correlation with r as xj has
► Increase (bj, bk) in their joint least squares direction until some other predictor xm has as much correlation with the residual r
► Continue until all predictors are in the model
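
For reference, this algorithm is implemented in the lars R package; a minimal sketch on simulated data (the package call is an assumption here, not part of the slides):

# Least angle regression paths via the 'lars' package.
library(lars)

set.seed(10)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(4, 0, -3, rep(0, p - 3)) + rnorm(n))

fit <- lars(X, y, type = "lar")   # type = "lasso" gives the lasso path instead
plot(fit)                         # coefficient paths as predictors enter the model
coef(fit)                         # coefficients at each step along the path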

SLIDE 26

SLIDE 27
Fused Lasso

Tibshirani et al. (2005)

  • If there are many correlated features, the lasso gives non-zero weight to only one of them
  • Maybe correlated features (e.g. time-ordered) should have similar coefficients?

SLIDE 28

SLIDE 29
Group Lasso

Yuan and Lin (2006)

  • Suppose you represent a categorical predictor with indicator variables
  • Might want the set of indicators to be in or out together

regular lasso penalty: \lambda \sum_{j=1}^{p} |\beta_j|
group lasso penalty (Yuan and Lin): \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \lVert \beta_{(g)} \rVert_2

SLIDE 30

SLIDE 31