Regression
Practical Machine Learning Fabian Wauthier 09/10/2009
Adapted from slides by Kurt Miller and Romain Thibaux
Outline

Ordinary Least Squares Regression
– Online version
– Normal equations
– Probabilistic interpretation
Regression vs. Classification

Input x: anything.

Classification: output y is discrete
– {0,1}: binary
– {1,…,k}: multi-class
– tree, etc.: structured
Methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Kernel trick

Regression: output y is continuous
– ℜ, ℜ^d
Examples
Linear regression

Given examples (x_i, y_i), i = 1, …, n, predict ŷ_{n+1} given a new point x_{n+1}.

[Figure: a 1-D example (y vs. x) and a 2-D example (y vs. x_1, x_2).]

We wish to estimate ŷ by a linear function of our data x:

ŷ_{n+1} = w_0 + w_1 x_{n+1,1} + w_2 x_{n+1,2} = w⊤x_{n+1},

where w is a parameter to be estimated and we have used the standard convention of letting the first component of x be 1.
Choosing the regressor

Of the many regression fits that approximate the data, which should we choose?

[Figure: several candidate fits drawn through the observations (x_i, y_i).]
LMS Algorithm (Least Mean Squares)

In order to clarify what we mean by a good choice of w, we will define a cost function for how well we are doing on the training data:

Cost = (1/2) Σ_{i=1}^n (w⊤x_i − y_i)²

[Figure: at each x_i, the error or "residual" is the gap between the prediction w⊤x_i and the observation y_i.]
LMS Algorithm (Least Mean Squares)

The best choice of w is the one that minimizes our cost function

E = (1/2) Σ_{i=1}^n (w⊤x_i − y_i)² = Σ_{i=1}^n E_i

In order to optimize this, we use standard gradient descent,

w_{t+1} := w_t − α ∂E/∂w,

where ∂E/∂w = Σ_{i=1}^n ∂E_i/∂w and ∂E_i/∂w = (1/2) ∂/∂w (w⊤x_i − y_i)² = (w⊤x_i − y_i) x_i.
LMS Algorithm (Least Mean Squares)

The LMS algorithm is an online method that performs the following update for each new data point (x_i, y_i):

w_{t+1} := w_t − α ∂E_i/∂w = w_t + α (y_i − x_i⊤w) x_i
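As a concrete illustration, here is a minimal NumPy sketch of this online update (the lecture's demos are in Matlab; the function name lms_step, the toy data, and the step size α = 0.01 are illustrative choices, not from the slides):

```python
import numpy as np

def lms_step(w, x_i, y_i, alpha=0.01):
    """One online LMS update: w <- w + alpha * (y_i - x_i^T w) * x_i."""
    return w + alpha * (y_i - x_i @ w) * x_i

# Toy data: y = 2 + 3*x plus noise, with a constant 1 as the first feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, size=100)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

w = np.zeros(2)
for _ in range(50):                  # a few passes over the data stream
    for x_i, y_i in zip(X, y):
        w = lms_step(w, x_i, y_i)
print(w)                             # close to the true parameters (2, 3)
```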
LMS, Logistic regression, and Perceptron updates

LMS:                 w_{t+1} := w_t + α (y_i − x_i⊤w) x_i
Logistic regression: w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
Perceptron:          w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i

All three updates have the same form; only the prediction f_w(x_i) differs.
Ordinary Least Squares (OLS)

Minimize the sum squared error (the "residual" between prediction and observation):

E = (1/2) Σ_{i=1}^n (w⊤x_i − y_i)²
  = (1/2) (Xw − y)⊤(Xw − y)
  = (1/2) (w⊤X⊤Xw − 2y⊤Xw + y⊤y),

where X is the n × d matrix whose rows are the x_i⊤ and y is the vector of targets. Then

∂E/∂w = X⊤Xw − X⊤y.

Setting the derivative equal to zero gives us the Normal Equations:

X⊤Xw = X⊤y
w = (X⊤X)⁻¹X⊤y
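A minimal NumPy sketch of this closed-form solution (the helper names are mine; np.linalg.solve is used rather than forming the inverse explicitly):

```python
import numpy as np

def ols_fit(X, y):
    """Solve the Normal Equations X^T X w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ols_predict(X_new, w):
    """Predict y_hat = X_new w for new rows of features."""
    return X_new @ w
```

On the toy data from the LMS sketch above, ols_fit(X, y) returns the same parameters in one step instead of by iterative updates.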
A geometric interpretation

We solved

∂E/∂w = X⊤(Xw − y) = 0

⇒ the residuals are orthogonal to the columns of X
⇒ ŷ = Xw gives the best reconstruction of y in the range of X.

[Figure: y′ = ŷ is the orthogonal projection of y onto the subspace S spanned by the columns [X]_1, [X]_2 of X; the residual vector y − y′ is orthogonal to S.]
Computing the solution

We compute w = (X⊤X)⁻¹X⊤y.

If X⊤X is invertible, then (X⊤X)⁻¹X⊤ coincides with the pseudoinverse X⁺ of X.

If X⊤X is not invertible, there is no unique solution. In that case w = X⁺y chooses the solution w with the smallest Euclidean norm, and that solution is unique.

An alternative way to deal with a non-invertible X⊤X is to add a small portion of the identity matrix (= Ridge regression).
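A small sketch of the pseudoinverse route for a rank-deficient design (the particular X and y are made up for illustration; np.linalg.lstsq computes the same minimum-norm solution):

```python
import numpy as np

# Rank-deficient design: the second column duplicates the first, so X^T X is singular.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

w_pinv = np.linalg.pinv(X) @ y                    # minimum-Euclidean-norm solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # same solution via lstsq
print(w_pinv, w_lstsq)                            # both give [0.5, 0.5]
```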
Beyond lines and planes

Linear models become powerful function approximators when we consider non-linear feature transformations. All the math is the same! Predictions are still linear in the (transformed) features X.
Geometric interpretation

The projection picture is unchanged when the columns of X hold transformed features, e.g.

ŷ = w_0 + w_1 x + w_2 x²

[Matlab demo]
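A short sketch of a non-linear feature expansion followed by the same normal-equation solve (the data and the poly_features helper are illustrative):

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D array of inputs to the features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.1, 5.2, 10.1, 16.9])         # roughly 1 + x^2

Phi = poly_features(x, degree=2)                   # columns: 1, x, x^2
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)        # y_hat = w0 + w1*x + w2*x^2
```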
Ordinary Least Squares [summary]

Given examples (x_i, y_i), i = 1, …, n. Let X be the n × d matrix with rows x_i⊤ (for example, rows of non-linear features such as (1, x, x²)), and let y be the vector of targets. Minimize E = (1/2) ||Xw − y||² by solving X⊤Xw = X⊤y. Predict ŷ = w⊤x for a new point x.
Probabilistic interpretation

Likelihood: model each observation as a conditional Gaussian whose mean is linear in x,

p(y_i | x_i, w) ∝ exp(−(x_i⊤w − y_i)² / (2σ²)).

Maximizing the likelihood of the data over w recovers the least-squares solution.

[Figure: conditional Gaussians p(y|x) with mean µ increasing linearly in x (µ = 3, 5, 8 shown).]
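To spell out the equivalence, here is a short derivation (a sketch assuming the conditional Gaussian model above with a fixed noise variance σ²):

```latex
\begin{aligned}
\hat{w}_{\mathrm{ML}}
  &= \arg\max_w \; \prod_{i=1}^{n}
     \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\Big(-\frac{(x_i^\top w - y_i)^2}{2\sigma^2}\Big) \\
  &= \arg\max_w \; -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i^\top w - y_i)^2
     + \text{const} \\
  &= \arg\min_w \; \frac{1}{2}\sum_{i=1}^{n}(x_i^\top w - y_i)^2 ,
\end{aligned}
```

so maximum likelihood under this model is exactly ordinary least squares.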
BREAK
Outline
Overfitting

A sufficiently flexible model can fit the training data almost perfectly, yet this can hurt model accuracy on new data. How to choose the right model complexity (model selection) is the topic of a separate lecture.

[Matlab demo: degree 15 polynomial fit.]
Ridge Regression (Regularization)

Minimize

(1/2) Σ_{i=1}^n (w⊤x_i − y_i)² + (ε/2) ||w||²

with "small" ε, by solving

(X⊤X + εI)w = X⊤y

[Continue Matlab demo]
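A minimal ridge sketch in NumPy (the helper name and the default ε are illustrative):

```python
import numpy as np

def ridge_fit(X, y, eps=1e-2):
    """Solve the regularized normal equations (X^T X + eps*I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(d), X.T @ y)
```

Even when X⊤X is singular, X⊤X + εI is invertible for any ε > 0, which is why the same trick also fixes the non-invertible case mentioned earlier.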
Probabilistic interpretation

Likelihood × Prior → Posterior:

P(w | X, y) = P(w, x_1, …, x_n, y_1, …, y_n) / P(x_1, …, x_n, y_1, …, y_n)
            ∝ P(w, x_1, …, x_n, y_1, …, y_n)
            ∝ exp(−||w||² / (2σ_2²)) · ∏_{i=1}^n exp(−(x_i⊤w − y_i)² / (2σ²))
            = exp( −(1/(2σ²)) [ (σ²/σ_2²) ||w||² + Σ_{i=1}^n (x_i⊤w − y_i)² ] )
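To connect this posterior to ridge regression explicitly, here is a short derivation (a sketch; σ_2 denotes the prior standard deviation used above, and ε is the resulting regularization constant):

```latex
\begin{aligned}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_w \; \log P(w \mid X, y) \\
  &= \arg\min_w \; \frac{1}{2\sigma^2}
     \Big[\frac{\sigma^2}{\sigma_2^2}\,\lVert w\rVert^2
       + \sum_{i=1}^{n} (x_i^\top w - y_i)^2 \Big] \\
  &= \arg\min_w \; \frac{1}{2}\sum_{i=1}^{n}(x_i^\top w - y_i)^2
     + \frac{\epsilon}{2}\,\lVert w\rVert^2 ,
  \qquad \epsilon = \frac{\sigma^2}{\sigma_2^2},
\end{aligned}
```

which is the ridge objective, so the MAP estimate solves (X⊤X + εI)w = X⊤y.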
Outline
Errors in Variables (Total Least Squares)
Sensitivity to outliers

High weight is given to outliers: their squared residuals dominate the cost.

[Figure: temperature-at-noon data with an outlier and the resulting least-squares fit. Inset: influence function.]
L1 Regression

Minimize the sum of absolute errors, Σ_{i=1}^n |w⊤x_i − y_i|.
This can be solved as a linear program.
Influence function: bounded, so outliers have limited effect. [Matlab demo]
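A sketch of the linear-program formulation using SciPy (the encoding with auxiliary variables t_i and the function name are mine, not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(X, y):
    """Minimize sum_i |x_i^T w - y_i| as an LP over z = [w, t]:
    minimize sum(t) subject to -t_i <= x_i^T w - y_i <= t_i."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])    # objective: sum of residual bounds t
    A_ub = np.block([[ X, -np.eye(n)],               #  Xw - t <= y
                     [-X, -np.eye(n)]])              # -Xw - t <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * n    # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]
```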
Quantile Regression

[Figure: CPU utilization [MHz] vs. workload (ViewItem.php) [req/s], with fits to the mean CPU and to the 95th percentile of CPU.]
Slide courtesy of Peter Bodik
Generalized Linear Models

Probabilistic interpretation of OLS: y_i | x_i is Gaussian and its mean is linear in x_i.

OLS: linearly predict the mean of a Gaussian conditional.
GLM: predict the mean of some other conditional density. The linear prediction may need to be transformed by a function f to produce a valid parameter:

y_i | x_i ∼ p( f(x_i⊤w) )
Example: "Poisson regression"

Suppose the y_i are event counts. Typical distribution for count data: Poisson,

Poisson(y | λ) = e^{−λ} λ^y / y!

The mean parameter is λ > 0. Say we predict λ = f(x⊤w) = exp(x⊤w).

GLM: y_i | x_i ∼ Poisson(exp(x_i⊤w))
[Figure: conditional Poissons p(y|x) whose mean λ grows with x (λ = 3, 5, 8 shown).]
Poisson regression: learning

As for OLS: optimize w by maximizing the likelihood of the data. Equivalently, maximize the log likelihood.

Likelihood:     L = ∏_{i=1}^n Poisson(y_i | exp(x_i⊤w))

Log likelihood: l = Σ_{i=1}^n [ x_i⊤w · y_i − exp(x_i⊤w) ] + const

Batch gradient: ∂l/∂w = Σ_{i=1}^n [ y_i x_i − exp(x_i⊤w) x_i ] = Σ_{i=1}^n ( y_i − exp(x_i⊤w) ) x_i
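A minimal batch gradient-ascent sketch for this model (the step size and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def poisson_regression(X, y, lr=1e-3, n_iters=5000):
    """Fit y_i | x_i ~ Poisson(exp(x_i^T w)) by gradient ascent on the
    log likelihood; the gradient is sum_i (y_i - exp(x_i^T w)) x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        rate = np.exp(X @ w)          # predicted means lambda_i
        w += lr * (X.T @ (y - rate))  # ascend the log-likelihood gradient
    return w
```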
LMS, Logistic regression, Perceptron and GLM updates
LMS:                 w_{t+1} := w_t + α (y_i − x_i⊤w) x_i
Logistic regression: w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
Perceptron:          w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
GLM:                 w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
Kernel Regression and Locally Weighted Linear Regression

Kernel regression: take a very very conservative function approximator called AVERAGING and weight it locally.
Locally weighted regression: take a conservative function approximator called LINEAR REGRESSION and weight it locally.
Slide from Paul Viola 2003
Kernel Regression

[Figure: kernel regression fit (sigma = 1).]
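The slide only shows the fitted curve; a common way to compute it is the kernel-weighted average below (a sketch with a Gaussian kernel, which is an assumption since the slides do not name the kernel):

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, sigma=1.0):
    """Predict at each query point as a weighted average of the training y_i,
    with weights k(x - x_i) = exp(-(x - x_i)^2 / (2 sigma^2))."""
    w = np.exp(-(x_query[:, None] - x_train[None, :]) ** 2 / (2 * sigma ** 2))
    return (w @ y_train) / w.sum(axis=1)
```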
Locally Weighted Linear Regression (LWR)

OLS cost function:  E = (1/2) Σ_{i=1}^n (w⊤x_i − y_i)²
LWR cost function:  E′ = Σ_{i=1}^n k(x_i − x) (w⊤x_i − y_i)²

The weights k(x_i − x) make the fit local to the query point x. [Matlab demo]
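A sketch of the corresponding weighted least-squares solve for a single query point (Gaussian kernel assumed, as above; the helper name is mine):

```python
import numpy as np

def lwr_predict(X, y, x_query, sigma=1.0):
    """Locally weighted linear regression: minimize E' by solving the weighted
    normal equations (X^T K X) w = X^T K y with K_ii = k(x_i - x_query),
    then predict x_query^T w."""
    k = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * sigma ** 2))
    XtK = X.T * k                      # X^T @ diag(k) without building diag(k)
    w = np.linalg.solve(XtK @ X, XtK @ y)
    return x_query @ w
```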
Heteroscedasticity

[Figure: #requests per minute over time (days); the spread of the noise changes over time.]
What we covered