Introduction to Machine Learning - CS725
Instructor: Prof. Ganesh Ramakrishnan
Lecture 4 - Linear Regression - Probabilistic Interpretation and Regularization
Recap: Linear Regression is not Naively Linear

Need to determine w for the linear function f(x, w) = \sum_{i=1}^{n} w_i \phi_i(x) (in matrix form over the dataset, \Phi w) which minimizes our error function E(f(x, w), D).

Owing to the basis functions \phi_i, “Linear Regression” is linear in w but NOT in x (which could be arbitrarily non-linear)!

\Phi = \begin{bmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_n(x_1) \\ \vdots & & & \vdots \\ \phi_1(x_m) & \phi_2(x_m) & \cdots & \phi_n(x_m) \end{bmatrix} \quad (1)

Least squares error and the corresponding estimates:

E^* = \min_{w} E(w, D) = \min_{w} \left( w^T \Phi^T \Phi w - 2 y^T \Phi w + y^T y \right) \quad (2)

w^* = \arg\min_{w} E(w, D) = \arg\min_{w} \sum_{j=1}^{m} \left( \sum_{i=1}^{n} w_i \phi_i(x_j) - y_j \right)^2 \quad (3)
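As a concrete illustration (not from the lecture), here is a minimal Python sketch of how the design matrix \Phi of equation (1) could be built, assuming numpy is available and a simple polynomial basis \phi_i(x) = x^{i-1} is chosen:

```python
import numpy as np

def design_matrix(x, n):
    """m x n matrix Phi whose column i holds phi_{i+1}(x_j) = x_j ** i."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x ** i for i in range(n)]).T

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])   # m = 5 training inputs (made up)
Phi = design_matrix(x, n=3)               # columns: 1, x, x^2
print(Phi.shape)                          # (5, 3)
```

Any other basis (Gaussian bumps, sigmoids, ...) would slot into the same construction; only the column functions change.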
Recap: Geometric Interpretation of Least Square Solution
Let y* be a solution in the column space of Φ. The least squares solution is such that the distance between y* and y is minimized. Therefore, the line joining y* to y should be orthogonal to the column space of Φ, giving

w = (\Phi^T \Phi)^{-1} \Phi^T y \quad (4)

Here \Phi^T \Phi is invertible only if \Phi has full column rank.
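A small numerical sketch of equation (4), with an illustrative (made-up) \Phi and y; the normal-equation solve is only valid when \Phi has full column rank, so the numerically safer np.linalg.lstsq is shown alongside for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
Phi = np.vstack([np.ones_like(x), x, x ** 2]).T          # full column rank here
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + 0.05 * rng.standard_normal(x.size)

# Closed form of equation (4); valid only when Phi^T Phi is invertible.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# lstsq handles the rank-deficient case as well.
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))                    # True
```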
Building on questions on Least Squares Linear Regression
1 Is there a probabilistic interpretation?
Gaussian Error, Maximum Likelihood Estimate
2 Addressing overfitting
Bayesian and Maximum Aposteriori Estimates, Regularization
3 How to minimize the resultant and more complex error functions?
Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality
Probabilistic Modeling of Linear Regression
Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is ‘mostly’ bounded by some threshold σ:

Y = w^T \phi(x) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)

Motivation: N(µ, σ²) has maximum entropy among all real-valued distributions with a specified variance σ².

3-σ rule: About 68% of values drawn from N(µ, σ²) are within one standard deviation σ away from the mean µ; about 95% of the values lie within 2σ; and about 99.7% are within 3σ.
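The 3-σ rule is easy to check empirically; the following sketch (arbitrary µ, σ and sample size, chosen only for illustration) estimates the three fractions by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0                         # arbitrary illustrative values
eps = rng.normal(mu, sigma, size=1_000_000)  # draws from N(mu, sigma^2)
for k in (1, 2, 3):
    frac = np.mean(np.abs(eps - mu) <= k * sigma)
    print(f"within {k} sigma: {frac:.3f}")   # roughly 0.683, 0.954, 0.997
```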
Figure 1: The 3-σ rule for the normal distribution N(µ, σ²). Source: https://en.wikipedia.org/wiki/Normal_distribution
Probabilistic Modeling of Linear Regression
Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is ‘mostly’ bounded by some threshold σ:

Y = w^T \phi(x) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)

This allows for the probabilistic model

P(y_j \mid w, x_j, \sigma^2) = N(w^T \phi(x_j), \sigma^2)

P(y \mid w, x, \sigma^2) = \prod_{j=1}^{m} P(y_j \mid w, x_j, \sigma^2)

Another motivation: E[Y(w, x_j)] = w^T \phi(x_j) = w_0 + w_1 \phi_1(x_j) + \ldots + w_n \phi_n(x_j)
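To make the model concrete, here is a minimal sketch (all numbers made up) that evaluates the per-point densities P(y_j | w, x_j, σ²) and their product P(y | w, x, σ²) for a candidate w:

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

Phi = np.array([[1.0, 0.0],
                [1.0, 0.5],
                [1.0, 1.0]])              # rows are phi(x_j)
y = np.array([0.9, 2.1, 2.9])             # observed targets
w = np.array([1.0, 2.0])                  # candidate parameters
sigma = 0.3                               # assumed noise level

per_point = gaussian_pdf(y, Phi @ w, sigma)   # P(y_j | w, x_j, sigma^2)
likelihood = per_point.prod()                 # P(y | w, x, sigma^2): product over j
print(per_point, likelihood)
```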
Estimating w: Maximum Likelihood
If ε ∼ N(0, σ²) and y = w^T φ(x) + ε where w, φ(x) ∈ R^n (one weight per basis function), then, given dataset D, find the most likely ŵ_ML.

Recall:

Pr(y_j \mid x_j, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_j - w^T \phi(x_j))^2}{2\sigma^2} \right)

From probability of data to likelihood of parameters:

Pr(D \mid w) = Pr(y \mid x, w) = \prod_{j=1}^{m} Pr(y_j \mid x_j, w) = \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_j - w^T \phi(x_j))^2}{2\sigma^2} \right)

Maximum Likelihood Estimate:

\hat{w}_{ML} = \arg\max_{w} Pr(D \mid w) = \arg\max_{w} L(w \mid D)
Optimization Trick
Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

\log L(w \mid D) = LL(w \mid D) = -\frac{m}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{j=1}^{m} (w^T \phi(x_j) - y_j)^2

For a fixed σ²:

\hat{w}_{ML} = \arg\max_{w} LL(y_1 \ldots y_m \mid x_1 \ldots x_m, w, \sigma^2) = \arg\min_{w} \sum_{j=1}^{m} (w^T \phi(x_j) - y_j)^2

Note that this is the same as the least squares solution!!
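A quick numerical sanity check of this equivalence on synthetic data, with a coarse brute-force grid standing in for a proper optimizer: maximizing LL(w|D) for fixed σ² lands on the same w as the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
Phi = np.vstack([np.ones_like(x), x]).T                   # basis: 1, x
y = 0.5 + 1.5 * x + 0.1 * rng.standard_normal(x.size)     # synthetic targets
sigma = 0.1                                               # fixed noise level

def log_likelihood(w):
    sq = np.sum((Phi @ w - y) ** 2)
    return -0.5 * len(y) * np.log(2 * np.pi * sigma ** 2) - sq / (2 * sigma ** 2)

w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)            # least squares estimate

# Brute-force search for the log-likelihood maximizer over a coarse grid.
grid = np.linspace(-2, 3, 201)
candidates = np.array([[a, b] for a in grid for b in grid])
w_ml = candidates[np.argmax([log_likelihood(w) for w in candidates])]
print(w_ls, w_ml)   # the two estimates agree up to the grid resolution
```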
Building on questions on Least Squares Linear Regression
1 Is there a probabilistic interpretation?
Gaussian Error, Maximum Likelihood Estimate
2 Addressing overfitting
Bayesian and Maximum Aposteriori Estimates, Regularization
3 How to minimize the resultant and more complex error functions?
Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality
Redundant Φ and Overfitting
Figure 2: Root Mean Squared (RMS) errors on sample train and test datasets as a function of the degree t of the polynomial being fit
Too many bends (t = 9 onwards) in the curve ≡ high values of some w_i's. Try plotting the values of the w_i's using the applet at
http://mste.illinois.edu/users/exner/java.f/leastsquares/#simulation
Train and test errors differ significantly
Example coefficients reported by the applet for a degree-9 fit (of X^0 through X^9):
0.133, 6.836, -10.199, 8.299, -3.767, 1.027, -0.172, 0.0173, -9.6e-4, 2.3e-5
and for a second fit (X^0 through X^5, truncated here):
-1.422, 14.756, -24.300, 20.636, -9.934, 2.898, ...
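The behaviour above can be reproduced on synthetic data; the sketch below (a noisy sine wave standing in for the applet's dataset) fits polynomials of degree t = 1, 3, 9 and prints train RMS, test RMS and the largest coefficient magnitude:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(m):
    """Noisy samples of sin(2*pi*x) on [0, 1] (synthetic, for illustration only)."""
    x = np.sort(rng.uniform(0, 1, m))
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(m)

def poly_design(x, t):
    return np.vstack([x ** i for i in range(t + 1)]).T

x_tr, y_tr = make_data(10)     # small training set
x_te, y_te = make_data(100)    # held-out test set

for t in (1, 3, 9):
    w, *_ = np.linalg.lstsq(poly_design(x_tr, t), y_tr, rcond=None)
    rms_tr = np.sqrt(np.mean((poly_design(x_tr, t) @ w - y_tr) ** 2))
    rms_te = np.sqrt(np.mean((poly_design(x_te, t) @ w - y_te) ** 2))
    print(t, round(rms_tr, 3), round(rms_te, 3), round(np.abs(w).max(), 1))
    # as t grows, train RMS shrinks while test RMS and max |w_i| typically blow up
```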
Bayesian Linear Regression
The Bayesian interpretation of probabilistic estimation is a logical extension that enables reasoning with uncertainty, but in the light of some background belief.
Bayesian linear regression: a Bayesian alternative to Maximum Likelihood least squares regression.
Continue with Normally distributed errors.
Model w using a prior distribution and use the posterior over w as the result.
Intuitive Prior:
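The slide cuts off at the prior. As a hedged illustration only, assuming the standard choice of a zero-mean isotropic Gaussian prior w ∼ N(0, τ²I) (not stated above), the MAP estimate solves an L2-regularized least squares problem with λ = σ²/τ²:

```python
import numpy as np

def map_estimate(Phi, y, lam):
    """MAP/ridge solution under a zero-mean Gaussian prior: (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 10)
Phi = np.vstack([x ** i for i in range(10)]).T            # degree-9 polynomial basis
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

w_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]             # unregularized (ML) fit
w_map = map_estimate(Phi, y, lam=1e-3)                    # lam plays the role of sigma^2 / tau^2
print(np.abs(w_ml).max(), np.abs(w_map).max())            # the prior shrinks the weights
```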