10-601 Machine Learning: Regression
Outline
- Regression vs Classification
- Linear regression – another discriminative learning method
  – As optimization → gradient descent
  – As matrix inversion (Ordinary Least Squares)
- Overfitting and bias-variance
- Bias-variance decomposition for classification
What is regression?
Where we are
[Course map: Inputs → Classifier → predict category (done √); Inputs → Density Estimator → probability (done √); Inputs → Regressor → predict real # (today)]
Regression examples
Prediction of menu prices
Chahuneau, Gimpel, … and Smith, EMNLP 2012
A decision tree: classification
[Tree diagram with class labels at the leaves: Play / Play / Don't Play / Don't Play]
A regression tree
[Same tree, but each leaf holds the observed play times and predicts their mean:
 Play = {30m, 45m} → Play ≈ 37; Play = {0m, 0m, 15m} → Play ≈ 5; Play = {0m, 0m} → Play ≈ 0; Play = {20m, 30m, 45m} → Play ≈ 32]
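The rule at each leaf is just the mean of the training targets that land there. A minimal sketch of that leaf rule (the groupings mirror the slide; the names are mine):

```python
import numpy as np

# Observed play times (minutes) for the examples reaching each leaf.
leaves = {
    "leaf_1": [30, 45],       # -> Play ~= 37
    "leaf_2": [0, 0, 15],     # -> Play ~= 5
    "leaf_3": [0, 0],         # -> Play ~= 0
    "leaf_4": [20, 30, 45],   # -> Play ~= 32
}

# A regression tree predicts the mean of the targets in the leaf.
predictions = {name: np.mean(times) for name, times in leaves.items()}
print(predictions)  # {'leaf_1': 37.5, 'leaf_2': 5.0, 'leaf_3': 0.0, 'leaf_4': 31.67}
```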
Theme for the week: learning as optimization
Types of learners
- Two types of learners:
- 1. Generative: make assumptions about how the data is generated (given the class)
  – e.g., naïve Bayes
- 2. Discriminative: directly estimate a decision rule/boundary
  – e.g., logistic regression
Today: another discriminative learner, but for regression tasks
Regression: least mean squares (LMS) as optimization
Toy problem #2
Least Mean Squares
Linear regression
- Given an input x we would like to compute an output y
- For example:
  - Predict height from age
  - Predict Google's price from Yahoo's price
  - Predict distance from wall from sensors
[Scatter plot of Y vs. X]
Linear regression
- Given an input x we would like to compute an output y
- In linear regression we assume that y and x are related by the equation
  y = wx + ε
  where w is a parameter and ε represents measurement or other noise
[Plot: observed values (X, Y) scattered around the line we are trying to predict]
- Our goal is to estimate w from training data of ⟨xi, yi⟩ pairs
- Optimization goal: minimize the squared error (least squares):
- Why least squares?
  - minimizes the squared distance between measurements and the predicted line
  - has a nice probabilistic interpretation
  - the math is pretty
Linear regression
$$\arg\min_w \sum_i (y_i - w x_i)^2$$
[Plot: data (X, Y) around the fitted line y = wx + ε]
(probabilistic interpretation: see HW)
Solving linear regression
- To optimize:
- We just take the derivative w.r.t. w:
$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$$
(the terms $w x_i$ are the predictions; compare to logistic regression, where the gradient also has the form (observed − prediction) × input…)
Solving linear regression
- To optimize – closed form:
- We just take the derivative w.r.t. w and set it to 0:
$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$$
$$\Rightarrow\; 2\sum_i x_i(y_i - w x_i) = 0 \;\Leftrightarrow\; 2\sum_i x_i y_i - 2w\sum_i x_i x_i = 0$$
$$\Rightarrow\; \sum_i x_i y_i = w\sum_i x_i^2 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
(this is covar(X,Y)/var(X) if mean(X) = mean(Y) = 0)
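Since the closed form is just two sums, it is a one-liner on data. A minimal sketch on synthetic data, mirroring the examples that follow (the seed, sample size, and range are arbitrary choices; the recovered value varies with the draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_w, noise_std = 100, 2.0, 1.0

x = rng.uniform(-3, 3, size=n)
y = true_w * x + rng.normal(0, noise_std, size=n)   # y = wx + noise

# Closed-form least squares for a line through the origin:
# w = sum_i x_i y_i / sum_i x_i^2
w_hat = np.sum(x * y) / np.sum(x * x)
print(f"recovered w = {w_hat:.2f}")                 # close to 2
```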
Regression example
- Data generated with true w = 2; least squares recovers w well, degrading gracefully as noise grows:
  - noise std = 1 → recovered w = 2.03
  - noise std = 2 → recovered w = 2.05
  - noise std = 4 → recovered w = 2.08
Bias term
- So far we assumed that the line passes through the origin
- What if the line does not?
- No problem, simply extend the model to y = w0 + w1x + ε
- Can use least squares to determine w0, w1:
$$w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n} \qquad\qquad w_1 = \frac{\sum_i x_i\,(y_i - w_0)}{\sum_i x_i^2}$$
[Plot: fitted line with intercept w0]
Simpler solution is coming soon…
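For the curious, one standard shortcut (not shown on the slides) is to center the data, fit the slope as before, and read the intercept off the means; this is equivalent to solving the two coupled equations above:

```python
import numpy as np

def fit_line(x, y):
    """Least squares for y = w0 + w1*x, via centering."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope on centered data: covar(X, Y) / var(X)
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    w0 = y_mean - w1 * x_mean   # intercept recovered from the means
    return w0, w1
```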
Multivariate regression
- What if we have several inputs?
  - Stock prices for Yahoo, Microsoft and Ebay for the Google prediction task
- This becomes a multivariate regression problem
- Again, it's easy to model:
  y = w0 + w1x1 + … + wkxk + ε
- Example with derived inputs: y = 10 + 3x1² − 2x2² + ε
- In some cases we would like to use polynomial or other terms based on the input data. Are these still linear regression problems?
- Yes. As long as the equation is linear in the coefficients, it is still a linear regression problem! (Not all functions can be approximated by a line/hyperplane…)
Non-Linear basis function
- So far we only used the observed values x1, x2, …
- However, linear regression can be applied in the same way to functions of these values
  – E.g.: to add a term w·x1x2, add a new variable z = x1x2, so each example becomes: x1, x2, …, z
- As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k + \varepsilon$$
Non-Linear basis function
- How can we use this to add an intercept term?
Add a new “variable” z=1 and weight w0
Non-linear basis functions
- What type of functions can we use?
- A few common examples:
  - Polynomial: $\phi_j(x) = x^j$ for j = 0 … n
  - Gaussian: $\phi_j(x) = \dfrac{(x-\mu_j)^2}{2\sigma_j^2}$
  - Sigmoid: $\phi_j(x) = \dfrac{1}{1+\exp(-s_j x)}$
  - Logs: $\phi_j(x) = \log(x+1)$
Any function of the input values can be used; the solution for the parameters of the regression remains the same.
General linear regression problem
- Using our new notation for the basis functions, linear regression can be written as
$$y = \sum_{j=0}^{n} w_j\,\phi_j(x)$$
- where φj(x) can be either xj for multivariate regression or one of the non-linear basis functions we defined
- … and φ0(x) = 1 for the intercept term
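A sketch of the basis-expansion idea in NumPy: build the columns φj(x) explicitly, and the problem becomes ordinary multivariate linear regression (the polynomial basis and degree here are illustrative choices):

```python
import numpy as np

def design_matrix(x, degree=3):
    """Polynomial basis: columns are phi_j(x) = x**j,
    with phi_0(x) = 1 supplying the intercept term."""
    return np.column_stack([x ** j for j in range(degree + 1)])

x = np.linspace(0, 1, 5)
Phi = design_matrix(x)   # shape (n, k+1)
print(Phi.shape)         # (5, 4)
```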
Learning/Optimizing Multivariate Least Squares
Approach 1: Gradient Descent
Gradient descent
Gradient Descent for Linear Regression
Predict with:
$$\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i)$$
Goal: minimize the following loss function:
$$J_{X,y}(w) = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2$$
(the outer sum runs over the n examples; the inner sum over the k+1 basis vectors)
Gradient Descent for Linear Regression
(Setup as before: predict with $\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i)$; minimize $J_{X,y}(w) = \sum_i (y_i - \hat{y}_i)^2$.)
Take the partial derivative with respect to each weight wj:
$$\frac{\partial}{\partial w_j} J(w) = \frac{\partial}{\partial w_j}\sum_i (y_i - \hat{y}_i)^2 = 2\sum_i (y_i - \hat{y}_i)\,\frac{\partial}{\partial w_j}(y_i - \hat{y}_i) = -2\sum_i (y_i - \hat{y}_i)\,\frac{\partial}{\partial w_j}\hat{y}_i$$
and since $\frac{\partial}{\partial w_j}\hat{y}_i = \frac{\partial}{\partial w_j}\sum_{j'} w_{j'}\,\phi_{j'}(x_i) = \phi_j(x_i)$:
$$\frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)$$
Gradient Descent for Linear Regression
$$\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i) \qquad\qquad \frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)$$
Learning algorithm:
- Initialize weights w=0
- For t=1,… until convergence:
- Predict for each example xi using w:
- Compute gradient of loss:
- This is a vector g
- Update: w = w – λg
- λ is the learning rate.
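A minimal NumPy sketch of this loop (a fixed iteration count stands in for a convergence test; the learning rate is an illustrative choice):

```python
import numpy as np

def gradient_descent(Phi, y, lr=0.01, iters=1000):
    """Batch gradient descent for least squares.
    Phi: (n, k+1) design matrix; y: (n,) targets."""
    w = np.zeros(Phi.shape[1])           # initialize weights w = 0
    for _ in range(iters):
        y_hat = Phi @ w                  # predict for each example
        g = -2 * Phi.T @ (y - y_hat)     # gradient of the squared-error loss
        w = w - lr * g                   # update step
    return w
```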
Gradient Descent for Linear Regression
- We can use any of the tricks we used for logistic regression:
  – stochastic gradient descent (if the data is too big to fit in memory)
  – regularization
  – …
Linear regression is a convex optimization problem
proof: differentiate again to get the second derivative
so again gradient descent will reach a global optimum
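Spelling out that step in the notation above: the second derivative is a sum of squares, hence non-negative:

```latex
\frac{\partial^2}{\partial w_j^2} J(w)
  = \frac{\partial}{\partial w_j}\Big(-2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)\Big)
  = 2\sum_i \phi_j(x_i)^2 \;\ge\; 0
% In matrix form the Hessian is 2\Phi^\top\Phi, which is positive
% semi-definite, so J(w) is convex.
```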
Multivariate Least Squares
Approach 2: Matrix Inversion
OLS (Ordinary Least Squares Solution)
Predict with $\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j(x_i)$; goal: minimize the loss function
$$J_{X,y}(w) = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2, \qquad \frac{\partial}{\partial w_j} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i)$$
Notation: with n examples and k+1 basis vectors, collect the basis values into a matrix
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_k(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_k(x_2) \\ \vdots & & & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_k(x_n) \end{pmatrix}$$
and stack the targets and the weights as vectors:
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \qquad w = \begin{pmatrix} w_0 \\ \vdots \\ w_k \end{pmatrix}$$
Writing out the gradient one coordinate at a time:
$$\frac{\partial}{\partial w_0} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_0(x_i), \;\;\dots,\;\; \frac{\partial}{\partial w_k} J(w) = -2\sum_i (y_i - \hat{y}_i)\,\phi_k(x_i)$$
Write Φ as stacked rows $\phi^1, \dots, \phi^n$, with the shorthand $\phi_j^i \equiv \phi_j(x_i)$, and recall
$$\hat{y}_i = \sum_{j=0}^{k} w_j\,\phi_j^i = \phi^i w$$
Substituting into each gradient coordinate:
$$\frac{\partial}{\partial w_j} J(w) = -2\sum_i \big( y_i\,\phi_j^i - \hat{y}_i\,\phi_j^i \big) = -2\sum_i \big( \phi_j^i\,y_i - \phi_j^i\,\phi^i w \big)$$
Stacking the k+1 coordinates into one vector: the first terms collect into $\Phi^T y$ (with $\Phi^T$ a (k+1) × n matrix) and the second into $\Phi^T\Phi\,w$, so
$$\nabla_w J(w) = -2\,\Phi^T y + 2\,\Phi^T\Phi\,w$$
Setting the gradient to 0:
$$\Phi^T\Phi\,w = \Phi^T y \;\;\Rightarrow\;\; w = (\Phi^T\Phi)^{-1}\Phi^T y$$
recap: Solving linear regression
- To optimize – closed form:
- We just take the derivative w.r.t. w and set it to 0:
$$\frac{\partial}{\partial w}\sum_i (y_i - w x_i)^2 = \sum_i 2(-x_i)(y_i - w x_i)$$
$$\Rightarrow\; 2\sum_i x_i y_i - 2w\sum_i x_i^2 = 0 \;\Rightarrow\; w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
(covar(X,Y)/var(X) if mean(X) = mean(Y) = 0)
And the same pattern in matrix form: setting the gradient to zero,
$$-2\,\Phi^T y + 2\,\Phi^T\Phi\,w = 0 \;\;\Rightarrow\;\; w = (\Phi^T\Phi)^{-1}\Phi^T y$$
LMS for general linear regression problem
$$J(w) = \sum_i \big( y_i - w^T\phi(x_i) \big)^2$$
Deriving w we get:
$$w = (\Phi^T\Phi)^{-1}\Phi^T y$$
(Φ is an n × (k+1) matrix; y is a vector with n entries; w is a vector with k+1 entries.) This solution is also known as the 'pseudo-inverse'.
Another reason to start with an objective function: you can see when two learning methods are the same!
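A sketch of the closed-form solution in NumPy; np.linalg.lstsq solves the same normal equations but is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def ols(Phi, y):
    """Ordinary least squares: solves Phi^T Phi w = Phi^T y."""
    # Equivalent to np.linalg.pinv(Phi) @ y, but more stable than
    # computing (Phi.T @ Phi)^{-1} @ Phi.T @ y directly.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```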
LMS versus gradient descent
$$J(w) = \sum_i \big( y_i - w^T\phi(x_i) \big)^2 \qquad\qquad w = (\Phi^T\Phi)^{-1}\Phi^T y$$
LMS (closed-form) solution:
+ Very simple in Matlab or something similar
− Requires a matrix inverse, which is expensive for a large matrix
Gradient descent:
+ Fast for large matrices
+ Stochastic GD is very memory efficient
+ Easily extended to other cases
− Parameters to tweak (how to decide convergence? what is the learning rate? …)
Regression and Overfitting
An example: polynomial basis vectors on a small dataset
– From Bishop Ch 1
[Plots from Bishop Ch. 1, fitting n = 10 points: 0th-, 1st-, 3rd-, and 9th-order polynomial fits; the 9th-order fit overfits. Companion plots: root-mean-square (RMS) error on training vs. test data, the exploding magnitudes of the 9th-order polynomial's coefficients, and the effect of data set size on the 9th-order fit.]
Regularization
- Penalize large coefficient values:
$$J_{X,y}(w) = \frac{1}{2}\sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2 + \frac{\lambda}{2}\,\lVert w\rVert^2$$
[Bishop's plots: the regularized 9th-order fit and the table of its coefficients (huge with no regularization; tamed by a moderate λ; too large a λ over-regularizes and the fit goes flat)]
Regularized Gradient Descent for LR
Predict with: $\hat{y}_i = \sum_j w_j\,\phi_j(x_i)$
Goal: minimize the following loss function:
$$J_{X,y}(w) = \frac{1}{2}\sum_i \Big( y_i - \sum_j w_j\,\phi_j(x_i) \Big)^2 + \frac{\lambda}{2}\sum_j w_j^2$$
$$\frac{\partial}{\partial w_j} J(w) = -\sum_i (y_i - \hat{y}_i)\,\phi_j(x_i) + \lambda\,w_j$$
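A sketch of the regularized version, a small change to the earlier gradient loop (the value of λ here is illustrative; in practice it is tuned, e.g. on held-out data):

```python
import numpy as np

def ridge_gradient_descent(Phi, y, lam=0.1, lr=0.01, iters=1000):
    """Gradient descent on the L2-regularized least squares loss."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y_hat = Phi @ w
        # Gradient of 0.5*sum((y - y_hat)^2) + 0.5*lam*||w||^2
        g = -Phi.T @ (y - y_hat) + lam * w
        w = w - lr * g
    return w
```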
Probabilistic Interpretation of Least Squares
A probabilistic interpretation
Our least squares minimization solution can also be motivated by a probabilistic interpretation of the regression problem:
$$y = w^T\phi(x) + \varepsilon$$
where ε is Gaussian noise. The MLE for w in this model is the same as the solution we derived for the least squares criterion:
$$w = (\Phi^T\Phi)^{-1}\Phi^T y$$
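Filling in the step the slide leaves implicit (a standard derivation, assuming ε ~ N(0, σ²)):

```latex
% With y_i = w^T \phi(x_i) + \varepsilon_i,  \varepsilon_i \sim N(0, \sigma^2):
\log p(y \mid X, w)
  = \sum_i \log \mathcal{N}\!\big(y_i \,;\, w^T\phi(x_i),\, \sigma^2\big)
  = -\frac{1}{2\sigma^2}\sum_i \big(y_i - w^T\phi(x_i)\big)^2 + \text{const}
% Maximizing over w is therefore the same as minimizing
% \sum_i (y_i - w^T\phi(x_i))^2, the least squares objective.
```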
Understanding Overfitting: Bias-Variance
Example
(Tom Dietterich, Oregon St)
[Plots: a linear fit to one sample; then the same experiment repeated with 50 samples of 20 points each]
- The true function f can't be fit perfectly with hypotheses from our class H (lines) → Error 1. Fix: a more expressive set of hypotheses H.
- We don't get the best hypothesis from H because of noise/small sample size → Error 2. Fix: a less expressive set of hypotheses H.
- (Noise behaves similarly to Error 1.)
Bias-Variance Decomposition: Regression
Bias and variance for regression
- For regression, we can easily decompose the error of the learned model into two parts: bias (Error 1) and variance (Error 2)
  – Bias: the class of models can't fit the data
    - Fix: a more expressive model class
  – Variance: the class of models could fit the data, but doesn't because it's hard to fit
    - Fix: a less expressive model class
Bias – Variance decomposition of error
Fix a test case x, then do this experiment:
- 1. Draw a size-n sample D = (x1,y1), …, (xn,yn)
- 2. Train a linear regressor hD using D
- 3. Draw one test example (x, f(x)+ε)
- 4. Measure the squared error of hD on that one example x
What's the expected error?
$$E_{D,\varepsilon}\Big\{ \big( f(x) + \varepsilon - h_D(x) \big)^2 \Big\}$$
(f is the true function; the expectation is over the dataset D and the noise ε; hD is learned from D)
Bias – Variance decomposition of error
$$E_{D,\varepsilon}\Big\{ \big( f(x) + \varepsilon - h_D(x) \big)^2 \Big\}$$
Notation, to simplify this:
$$f \equiv f(x) + \varepsilon \qquad \hat{y} \equiv \hat{y}_D = h_D(x) \qquad h \equiv E_D\{h_D(x)\}$$
where h is the long-term expectation of the learner's prediction on this x, averaged over many data sets D.
Bias – Variance decomposition of error
$$E_{D,\varepsilon}\{(f-\hat{y})^2\} = E\big\{([f-h]+[h-\hat{y}])^2\big\}$$
$$= E\big\{[f-h]^2 + [h-\hat{y}]^2 + 2[f-h][h-\hat{y}]\big\}$$
$$= E\big\{[f-h]^2 + [h-\hat{y}]^2 + 2[fh - f\hat{y} - h^2 + h\hat{y}]\big\}$$
$$= E[(f-h)^2] + E[(h-\hat{y})^2] + 2\big( E[fh] - E[f\hat{y}] - E[h^2] + E[h\hat{y}] \big)$$
The cross terms cancel, because the expectations factor (the noise ε is independent of D, and h is a constant with respect to D):
$$E_{D,\varepsilon}\big\{ (f(x)+\varepsilon)\cdot E_D\{h_D(x)\} \big\} = E_{D,\varepsilon}\big\{ (f(x)+\varepsilon)\cdot h_D(x) \big\}, \;\text{ i.e. } E[fh] = E[f\hat{y}]$$
$$E_{D,\varepsilon}\big\{ E_D\{h_D(x)\}\cdot E_D\{h_D(x)\} \big\} = E_{D,\varepsilon}\big\{ E_D\{h_D(x)\}\cdot h_D(x) \big\}, \;\text{ i.e. } E[h^2] = E[h\hat{y}]$$
Bias – Variance decomposition of error
$$E_{D,\varepsilon}\{(f-\hat{y})^2\} = E\big\{([f-h]+[h-\hat{y}])^2\big\} = \underbrace{E[(f-h)^2]}_{\text{BIAS}^2} + \underbrace{E[(h-\hat{y})^2]}_{\text{VARIANCE}}$$
- BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation of what the learner will do if we averaged over many datasets D, ED[hD(x)]
- VARIANCE: the squared difference between our long-term expectation of the learner's performance, ED[hD(x)], and what we expect in a representative run on a dataset D (ŷ)
[Plot at x = 5 illustrating the bias and variance components]
Bias-variance decomposition
- This is something real that you can (approximately) measure experimentally – if you have synthetic data
- Different learners and model classes have different tradeoffs
  – large bias/small variance: few features, highly regularized, highly pruned decision trees, large-k k-NN, …
  – small bias/high variance: many features, less regularization, unpruned trees, small-k k-NN, …
Bias and variance
- For classification, we can also decompose the error of a learned classifier into two terms: bias and variance
  – Bias: the class of models can't fit the data. Fix: a more expressive model class.
  – Variance: the class of models could fit the data, but doesn't because it's hard to fit. Fix: a less expressive model class.
Another view of a decision tree
[Plots: a decision tree on iris data drawn as axis-parallel splits of the (sepal_length, sepal_width) plane, e.g. sepal_length > 5.7 and sepal_width > 2.8; deeper trees (adding tests like length > 5.1, width > 3.1, length > 4.6) carve the plane into finer and finer rectangles]
Bias-Variance Decomposition: Measuring
Bias-variance decomposition
- This is something real that you can (approximately) measure experimentally
  – if you have synthetic data… or if you're clever
  – You need to somehow approximate ED{hD(x)}
  – I.e., construct many variants of the dataset D
Background: “Bootstrap” sampling
- Input: dataset D
- Output: many variants of D: D1,…,DT
- For t = 1, …, T:
  – Dt = { }
  – For i = 1 … |D|:
    - Pick (x,y) uniformly at random from D (i.e., with replacement) and add it to Dt
- Some examples never get picked (~37%); some are picked 2x, 3x, … (see the sketch below)
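A minimal sketch of that loop (the function and variable names are mine):

```python
import numpy as np

def bootstrap_variants(D, T, seed=0):
    """Return T bootstrap resamples of dataset D (a list of (x, y) pairs)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    variants = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        variants.append([D[i] for i in idx])
    return variants
# On average a fraction (1 - 1/n)^n -> 1/e ~ 37% of the examples
# are left out of each variant ("out of bag").
```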
Measuring Bias-Variance with “Bootstrap” sampling
- Create B bootstrap variants of D (approximating many draws of D)
- For each bootstrap dataset:
  – Tb is the dataset; Ub are the "out of bag" examples
  – Train a hypothesis hb on Tb
  – Test hb on each x in Ub
- Now for each (x,y) example we have many predictions h1(x), h2(x), …, so we can estimate (ignoring noise; a sketch follows below):
  – variance: ordinary variance of h1(x), …, hn(x)
  – bias: average(h1(x), …, hn(x)) − y
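A sketch of the per-example estimate, assuming the out-of-bag predictions for one example (x, y) have been collected into an array (the numbers here are made up for illustration):

```python
import numpy as np

# preds: out-of-bag predictions h_1(x), ..., h_B(x) for one example (x, y)
preds = np.array([4.8, 5.3, 5.1, 4.6, 5.4])
y = 5.0

bias = preds.mean() - y    # average(h_1(x), ..., h_B(x)) - y
variance = preds.var()     # ordinary variance of the predictions
print(f"bias={bias:.3f}, variance={variance:.3f}")
```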
Applying Bias-Variance Analysis
- By measuring the bias and variance on a problem, we can determine how to improve our model
  – If bias is high, we need to allow our model to be more complex
  – If variance is high, we need to reduce the complexity of the model
- Bias-variance analysis also suggests a way to reduce variance: bagging
Bagging
Bootstrap Aggregation (Bagging)
- Use the bootstrap to create B variants of D
- Learn a classifier from each variant
- Vote the learned classifiers to predict on a test example
Bagging (bootstrap aggregation)
- Breaking it down:
  – input: dataset D and YFCL (your favorite classifier learner)
  – output: a classifier hD-BAG
  – use the bootstrap to construct variants D1, …, DT
  – for t = 1, …, T: train YFCL on Dt to get ht
  – to classify x with hD-BAG: classify x with h1, …, hT and predict the most frequently predicted class for x (majority vote; a sketch follows below)
Note that you can use any learner you like! You can also test ht on the "out of bag" examples.
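A minimal sketch of hD-BAG, with scikit-learn's DecisionTreeClassifier standing in for YFCL (any learner with fit/predict would do; assumes non-negative integer class labels for the vote count):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, T=25, seed=0):
    """Train T trees on bootstrap variants of D; predict by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.empty((T, len(X_test)), dtype=y_train.dtype)
    for t in range(T):
        idx = rng.integers(0, n, size=n)   # bootstrap variant D_t
        h_t = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes[t] = h_t.predict(X_test)
    # Majority vote over the T classifiers, per test example.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```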
Experiments
(Freund and Schapire)
[Learning-curve plots: solid = naïve Bayes, dashed = logistic regression, compared against bagged, minimally pruned decision trees]
Generally, bagged decision trees eventually outperform the linear classifiers, if the data is large enough and clean enough.
Bagging (bootstrap aggregation)
- Experimentally:
– especially with minimal pruning: decision trees have low bias but high variance. – bagging usually improves performance for decision trees and similar methods – It reduces variance without increasing the bias (much).