Linear Regression Models
Based on Chapter 3 of Hastie, Tibshirani and Friedman
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j

Here the X's might be:
- Raw predictor variables (continuous or coded-categorical)
- Transformed predictors (X_4 = \log X_3)
- Basis expansions (X_4 = X_3^2, X_5 = X_3^3, etc.)
- Interactions (X_4 = X_2 \cdot X_3)
Popular choice for estimation is least squares:
RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2
Least Squares
RSS(\beta) = (y - X\beta)^T (y - X\beta)
Often assume that the Y’s are independent and normally distributed, leading to various classical statistical tests and confidence intervals
\hat{\beta} = (X^T X)^{-1} X^T y

\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y

H = X (X^T X)^{-1} X^T is the "hat matrix"
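A minimal R sketch of these formulas on simulated data; solving the normal equations directly is for illustration only (lm() uses a QR decomposition, which is numerically preferable):

set.seed(1)
N <- 50; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))  # design matrix with an intercept column
beta.true <- c(1, 2, 0, -1)
y <- X %*% beta.true + rnorm(N)

beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y
H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix
y.hat <- H %*% y                           # same as X %*% beta.hat

coef(lm(y ~ X - 1))                        # agrees with beta.hat up to numerical error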
Gauss-Markov Theorem

Consider any linear combination of the β's: \theta = a^T \beta. The least squares estimate of \theta is:

\hat{\theta} = a^T \hat{\beta} = a^T (X^T X)^{-1} X^T y

If the linear model is correct, this estimate is unbiased (X fixed):

E(\hat{\theta}) = E(a^T (X^T X)^{-1} X^T y) = a^T (X^T X)^{-1} X^T X \beta = a^T \beta

Gauss-Markov states that for any other linear unbiased estimator \tilde{\theta} = c^T y (i.e., E(c^T y) = a^T \beta):

Var(a^T \hat{\beta}) \le Var(c^T y)

Of course, there might be a biased estimator with lower MSE…
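A quick R simulation illustrating that last point: a deliberately shrunken (biased) slope estimate can beat OLS in MSE when the true coefficient is small relative to its standard error. The true slope, shrinkage factor, and sample size below are arbitrary choices that make the effect visible:

set.seed(1)
beta <- 0.2; n <- 20; reps <- 5000
b.ols <- b.shrunk <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n)
  y <- beta * x + rnorm(n)
  b <- coef(lm(y ~ -1 + x))  # OLS slope, unbiased
  b.ols[r] <- b
  b.shrunk[r] <- 0.5 * b     # biased toward zero
}
mse <- function(est) mean((est - beta)^2)
c(ols = mse(b.ols), shrunk = mse(b.shrunk))  # shrunk comes out smaller here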
Bias-Variance

For any estimator \tilde{\theta}:

MSE(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2
                    = E(\tilde{\theta} - E(\tilde{\theta}) + E(\tilde{\theta}) - \theta)^2
                    = E(\tilde{\theta} - E(\tilde{\theta}))^2 + (E(\tilde{\theta}) - \theta)^2
                    = Var(\tilde{\theta}) + bias^2

Note MSE is closely related to prediction error at a new input x_0:

E(Y_0 - x_0^T \tilde{\theta})^2 = \sigma^2 + E(x_0^T \tilde{\theta} - x_0^T \theta)^2 = \sigma^2 + MSE(x_0^T \tilde{\theta})
Too Many Predictors?
When there are lots of X's, we get models with high variance and prediction suffers. Three "solutions":
1. Subset selection (score by AIC, BIC, etc.; all-subsets + leaps-and-bounds, stepwise methods)
2. Shrinkage/ridge regression
3. Derived inputs
Subset Selection
- Standard "all-subsets" finds the subset of size k, k = 1, …, p, that minimizes RSS
- Choice of subset size requires a tradeoff: AIC, BIC, marginal likelihood, cross-validation, etc.
- "Leaps and bounds" is an efficient algorithm to do all-subsets
Cross-Validation
- e.g., 10-fold cross-validation (a minimal sketch follows below):
- Randomly divide the data into ten parts
- Train the model using nine tenths and compute the prediction error on the remaining tenth
- Do this for each tenth of the data
- Average the 10 prediction error estimates
- "One standard error rule": pick the simplest model within one standard error of the minimum
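A minimal R sketch of 10-fold cross-validation for a single linear model (the simulated data and variable names are illustrative):

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 3 * x + rnorm(n)
folds <- sample(rep(1:10, length.out = n))  # random fold labels 1..10
cv.err <- numeric(10)
for (k in 1:10) {
  train <- folds != k
  fit <- lm(y ~ x, subset = train)
  pred <- predict(fit, newdata = data.frame(x = x[!train]))
  cv.err[k] <- mean((y[!train] - pred)^2)
}
mean(cv.err)           # CV estimate of prediction error
sd(cv.err) / sqrt(10)  # its standard error, for the one-standard-error rule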
Shrinkage Methods
- Subset selection is a discrete process: individual variables are either in or out
- This method can have high variance: a different dataset from the same source can result in a totally different model
- Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient.
Ridge Regression
\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2

subject to: \sum_{j=1}^{p} \beta_j^2 \le s
- Equivalently:

\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}
- This leads to:

\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y

which works even when X^T X is singular. Choose λ by cross-validation; λ controls the effective number of X's (the effective degrees of freedom). Predictors should be centered. A minimal sketch follows below.
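A minimal R sketch of the closed-form ridge solution on centered, simulated data (the value of lambda is arbitrary here; in practice choose it by cross-validation):

set.seed(1)
N <- 50; p <- 5
X <- scale(matrix(rnorm(N * p), N, p), center = TRUE, scale = FALSE)  # centered predictors
y <- X %*% c(2, -1, 0, 0, 1) + rnorm(N)
y <- y - mean(y)  # centering y removes the intercept from the problem

lambda <- 1
ridge <- drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))
ols   <- drop(solve(t(X) %*% X, t(X) %*% y))
cbind(ols, ridge)  # ridge coefficients are shrunk toward zero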
Ridge Regression = Bayesian Regression
y_i \sim N(\beta_0 + x_i^T \beta, \sigma^2), \quad \beta_j \sim N(0, \tau^2)

same as ridge with \lambda = \sigma^2 / \tau^2
The Lasso
\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2

subject to: \sum_{j=1}^{p} |\beta_j| \le s
- A quadratic programming algorithm is needed to solve for the parameter estimates. Choose s via cross-validation.
- More generally:

\tilde{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}

- q = 0: variable selection
- q = 1: lasso
- q = 2: ridge
- Learn q?

[Figure: coefficient paths plotted as a function of 1/λ]
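A minimal R sketch of the lasso using the glmnet package (alpha = 1 selects the lasso penalty; cv.glmnet chooses λ by 10-fold cross-validation):

library(glmnet)
set.seed(1)
N <- 100; p <- 10
X <- matrix(rnorm(N * p), N, p)
y <- X %*% c(3, -2, rep(0, p - 2)) + rnorm(N)  # only two active predictors

cv.fit <- cv.glmnet(X, y, alpha = 1)
coef(cv.fit, s = "lambda.1se")            # one-standard-error rule; most coefficients exactly 0
plot(cv.fit$glmnet.fit, xvar = "lambda")  # coefficient paths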
Principal Component Regression
Consider an eigen-decomposition of X^T X (and hence of the covariance matrix of X):

X^T X = V D^2 V^T

(X is first centered; X is N × p.) The eigenvectors v_j are called the principal components of X. D is diagonal with entries d_1 \ge d_2 \ge \dots \ge d_p.

Xv_1 has the largest sample variance amongst all normalized linear combinations of the columns of X:

Var(X v_1) = d_1^2 / N

Xv_k has the largest sample variance amongst all normalized linear combinations of the columns of X, subject to being orthogonal to all the earlier ones.
Principal Component Regression
PC regression regresses y on the first M principal components, where M < p. Similar to ridge regression in some respects (see HTF, p. 66). A minimal sketch follows below.
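A minimal R sketch of principal component regression with prcomp (M is fixed arbitrarily here; in practice choose it by cross-validation):

set.seed(1)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), N, p)
y <- X %*% c(2, -1, 0, 0, 1) + rnorm(N)

pc <- prcomp(X, center = TRUE)  # principal components of X
M <- 2
Z <- pc$x[, 1:M]                # scores on the first M components
fit <- lm(y ~ Z)                # regress y on those components

# map component coefficients back to the original predictors
beta.pcr <- pc$rotation[, 1:M] %*% coef(fit)[-1]
beta.pcr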
www.r-project.org/user-2006/Slides/Hesterberg+Fraley.pdf
# Demo: incremental forward stagewise regression with two predictors
x1 <- rnorm(10)
x2 <- rnorm(10)
y <- (3 * x1) + x2 + rnorm(10, 0.1)

# scatterplots of y against each predictor, with marginal LS fits
par(mfrow = c(1, 2))
plot(x1, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x1))
plot(x2, y, xlim = range(c(x1, x2)), ylim = range(y))
abline(lm(y ~ -1 + x2))

epsilon <- 0.1  # step size
r <- y          # current residual
beta <- c(0, 0)
numIter <- 25
for (i in 1:numIter) {
  cat(cor(x1, r), "\t", cor(x2, r), "\t", beta[1], "\t", beta[2], "\n")
  # pick the predictor most correlated (in absolute value) with the residual
  if (abs(cor(x1, r)) > abs(cor(x2, r))) {
    delta <- epsilon * sign(sum(r * x1))  # small step in the direction of the correlation
    beta[1] <- beta[1] + delta
    r <- r - (delta * x1)                 # update the residual
    par(mfg = c(1, 1))
    abline(0, beta[1], col = "red")
  } else {
    delta <- epsilon * sign(sum(r * x2))
    beta[2] <- beta[2] + delta
    r <- r - (delta * x2)
    par(mfg = c(1, 2))
    abline(0, beta[2], col = "green")
  }
}
LARS

- Start with all coefficients b_j = 0
- Find the predictor x_j most correlated with y
- Increase b_j in the direction of the sign of its correlation with y, taking residuals r = y − ŷ along the way. Stop when some other predictor x_k has as much correlation with r as x_j has
- Increase (b_j, b_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r
- Continue until all predictors are in the model
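The lars package implements this procedure; a minimal sketch on simulated data:

library(lars)
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X %*% c(3, -2, 0, 0, 1) + rnorm(100)

fit <- lars(X, y, type = "lar")  # least angle regression path
plot(fit)                        # coefficients entering one at a time
coef(fit)                        # coefficient matrix, one row per step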
- If there are many correlated features, the lasso gives non-zero weight to only one of them
- Maybe correlated features (e.g. time-ordered) should have similar coefficients?
Fused Lasso
Tibshirani et al. (2005)
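For reference, the fused lasso of Tibshirani et al. (2005) penalizes both the coefficients and their successive differences, so that ordered coefficients are encouraged to be similar:

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|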
- Suppose you represent a categorical predictor with indicator variables
- Might want the whole set of indicators to be in or out together
Group Lasso
Yuan and Lin (2006)
regular lasso penalty: \lambda \sum_{j=1}^{p} |\beta_j|

group lasso penalty: \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \|\beta_g\|_2, where \beta_g is the coefficient sub-vector for group g and p_g its size; whole groups are zeroed out together