SLIDE 1
Linear Models for Statistical Learning, Regression
David Dalpiaz
STAT 430, Fall 2017
SLIDE 2 Announcements
- Homework 01 due today.
- Homework 02 released later today. (Hopefully.)
SLIDE 3 Statistical Learning
- Supervised Learning
  - Regression
  - Classification
- Unsupervised Learning
SLIDE 4 Regression Setup
$$Y = f(x_1, x_2, x_3, \ldots, x_p) + \epsilon$$

numeric response = signal + noise
- Want to learn the signal
- Want to be very careful not to “learn noise”
SLIDE 5
Using a Linear Model
Setup: $Y = f(x_1, x_2, x_3, \ldots, x_p) + \epsilon$

Assume: $f(x_1, x_2, x_3, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$
SLIDE 6 The Linear Model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

$$Y \mid X \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p, \ \sigma^2)$$

There are a total of p + 2 parameters in this model:

- The p + 1 β parameters, or coefficients, control the signal
- The σ² controls the noise
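As a quick illustration (not from the slides), here is a minimal R sketch that simulates data from this model with p = 2, using made-up parameter values:

    # simulate from Y = beta_0 + beta_1 x_1 + beta_2 x_2 + epsilon, epsilon ~ N(0, sigma^2)
    set.seed(42)
    n      <- 100
    beta_0 <- 2; beta_1 <- -1; beta_2 <- 3   # hypothetical coefficients (the signal)
    sigma  <- 1.5                            # hypothetical noise standard deviation

    x_1 <- runif(n)
    x_2 <- runif(n)
    eps <- rnorm(n, mean = 0, sd = sigma)
    y   <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + eps

    sim_data <- data.frame(y, x_1, x_2)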
SLIDE 7 Fitting a Linear Model
This is a parametric model, meaning to fit the model, we need to estimate the parameters. For the sake of making predictions, we only need to estimate the β parameters, since

$$\hat{f}(x_1, x_2, x_3, \ldots, x_p) = \hat{y}(x_1, x_2, x_3, \ldots, x_p) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots + \hat{\beta}_p x_p$$

Using either least squares or maximum likelihood, this becomes the same optimization problem:

$$\underset{\beta_0, \beta_1, \ldots, \beta_p}{\operatorname{argmin}} \sum_{i = 1}^{n} \left( y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \right)^2$$
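For concreteness, a minimal R sketch of this fit using lm(), applied to the sim_data simulated in the earlier sketch (the object names are assumptions, not from the slides):

    # lm() solves the least squares problem above
    fit <- lm(y ~ x_1 + x_2, data = sim_data)
    coef(fit)                                                  # beta-hat estimates
    predict(fit, newdata = data.frame(x_1 = 0.5, x_2 = 0.5))   # y-hat at a new point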
SLIDE 8 Estimating σ²
While it is not needed to make predictions, to fully estimate the model we would also need to estimate σ².

$$s_e^2 = \frac{1}{n - (p + 1)} \sum_{i = 1}^{n} (y_i - \hat{y}_i)^2 \quad \text{(Least Squares)}$$

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i = 1}^{n} (y_i - \hat{y}_i)^2 \quad \text{(MLE)}$$

Both are estimates of σ². What is the difference?
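A short sketch of computing both estimates from the fit above (again using the hypothetical fit and sim_data objects):

    n   <- nrow(sim_data)
    p   <- length(coef(fit)) - 1        # number of predictors
    rss <- sum(resid(fit) ^ 2)          # sum of squared residuals

    s_e_2     <- rss / (n - (p + 1))    # least squares (unbiased) estimate of sigma^2
    sigma_mle <- rss / n                # MLE (biased) estimate of sigma^2

    # s_e_2 matches summary(fit)$sigma^2; the MLE divides by n rather than n - (p + 1)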
SLIDE 9
Model “Size”
Consider two models:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon$$

Which is bigger?
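If it helps, here is a minimal R sketch that fits both models to hypothetical data and counts their estimated coefficients (all names and data are made up for illustration):

    # hypothetical data with four predictors, only two of which carry signal
    set.seed(42)
    some_data <- data.frame(x_1 = runif(100), x_2 = runif(100),
                            x_3 = runif(100), x_4 = runif(100))
    some_data$y <- 2 - 1 * some_data$x_1 + 3 * some_data$x_2 + rnorm(100, sd = 1.5)

    fit_small <- lm(y ~ x_1 + x_2, data = some_data)
    fit_big   <- lm(y ~ x_1 + x_2 + x_3 + x_4, data = some_data)

    length(coef(fit_small))   # 3 beta parameters (p + 2 = 4 parameters including sigma^2)
    length(coef(fit_big))     # 5 beta parameters (p + 2 = 6 parameters including sigma^2)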
SLIDE 10
Model Complexity
In general, we are interested in the complexity or flexibility of a model. For nested linear models, the more parameters a model has, the bigger, and thus more complex, it is. Models that are more complex will be more wiggly.
SLIDE 11
Pictures of Complexity
Go to ISL Slides
SLIDE 12 Test-Train Split
We’ve already discussed the Test-Train Split and RMSE.

$$\text{RMSE}_{\text{Train}} = \text{RMSE}(\hat{f}, \text{Train Data}) = \sqrt{\frac{1}{n_{Tr}} \sum_{i \in \text{Train}} \left( y_i - \hat{f}(x_i) \right)^2}$$

$$\text{RMSE}_{\text{Test}} = \text{RMSE}(\hat{f}, \text{Test Data}) = \sqrt{\frac{1}{n_{Te}} \sum_{i \in \text{Test}} \left( y_i - \hat{f}(x_i) \right)^2}$$
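A minimal R sketch of performing the split and computing both quantities, continuing with the hypothetical sim_data from before:

    # RMSE helper (an assumption; not provided by the slides)
    rmse <- function(actual, predicted) {
      sqrt(mean((actual - predicted) ^ 2))
    }

    set.seed(42)
    train_idx  <- sample(nrow(sim_data), size = 0.5 * nrow(sim_data))
    train_data <- sim_data[train_idx, ]
    test_data  <- sim_data[-train_idx, ]

    fit <- lm(y ~ x_1 + x_2, data = train_data)

    rmse(train_data$y, predict(fit, train_data))   # Train RMSE
    rmse(test_data$y,  predict(fit, test_data))    # Test RMSE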
SLIDE 13 Overfitting
- Overfitting occurs when a model is too complex (too flexible) for the data
- Underfitting occurs when a model is not complex enough (too inflexible) for the data
SLIDE 14 Train RMSE
[Figure: Prediction Error vs Model Complexity. x-axis: Complexity (Parameters), y-axis: Error (RMSE). Train RMSE curve.]
SLIDE 15 (Expected) Test RMSE
[Figure: Prediction Error vs Model Complexity. x-axis: Complexity (Parameters), y-axis: Error (RMSE). (Expected) Test and Train RMSE curves.]
SLIDE 16 The “Best” Model
- Pick the model with the lowest Test RMSE (see the sketch after this list)
- Compared to this model...
  - More complex models with higher Test RMSE are Overfitting
  - Less complex models with higher Test RMSE are Underfitting
- This is only a “guess” of the “best” model based on available information
- In practice, Test RMSE might not be such a nice curve
  - This is due to the randomness of the split
  - You could get lucky, or unlucky
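Here is a rough R sketch of that selection process, where polynomial degree stands in for model complexity (it reuses the hypothetical train_data, test_data, and rmse() from the earlier sketch and is not the slides' own code):

    # candidate models of increasing complexity: polynomials in x_1
    degrees   <- 1:9
    test_rmse <- sapply(degrees, function(d) {
      fit_d <- lm(y ~ poly(x_1, d), data = train_data)
      rmse(test_data$y, predict(fit_d, test_data))
    })

    best_degree <- degrees[which.min(test_rmse)]
    best_degree   # the "best" of these candidates, for this particular split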
SLIDE 17 Explanation vs Prediction
- Sometimes we check model assumptions directly
- When predicting, we make assumptions and check them indirectly
- If we assume a correct (or close to correct) form of the model, the Test RMSE will be low
SLIDE 18 If Time. . .
- rmarkdown Tables
- Using code from the Internet
- Back to Test-Train Split Lab
- What would be a good Test RMSE?
- Overfitting: n vs p
- Randomness of Split
- Pseudo RNG
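For the last two items, a brief sketch of how setting the pseudo RNG seed makes a split reproducible (an assumption about the intended demo, not code from the slides):

    set.seed(430)                 # fix the pseudo RNG state
    idx_a <- sample(10, 5)
    set.seed(430)                 # same seed again
    idx_b <- sample(10, 5)
    identical(idx_a, idx_b)       # TRUE: same seed gives the same "random" split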