
Introduction to Machine Learning - CS725
Instructor: Prof. Ganesh Ramakrishnan
Lecture 4 - Linear Regression - Probabilistic Interpretation and Regularization




Recap: Linear Regression is not Naively Linear

Need to determine w for the linear function f(x, w) = \sum_{i=1}^{n} w_i \phi_i(x) (written over the whole dataset as \Phi w) which minimizes our error function E(f(x, w), D).

Owing to the basis functions \phi_i, "Linear Regression" is linear in w but NOT in x (which could be arbitrarily non-linear)!

\Phi =
\begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \dots & \phi_n(x_1) \\
\vdots & & & \vdots \\
\phi_1(x_m) & \phi_2(x_m) & \dots & \phi_n(x_m)
\end{pmatrix} \quad (1)

Least Squares error and corresponding estimates:

E^* = \min_{w} E(w, D) = \min_{w} \left( w^T \Phi^T \Phi w - 2 y^T \Phi w + y^T y \right) \quad (2)

w^* = \arg\min_{w} E(w, D) = \arg\min_{w} \sum_{j=1}^{m} \left( \sum_{i=1}^{n} w_i \phi_i(x_j) - y_j \right)^2 \quad (3)
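To make equations (1) and (3) concrete, here is a minimal numpy sketch that builds the design matrix \Phi and evaluates the squared error for a given w; the polynomial choice \phi_i(x) = x^i is an assumption made purely for illustration, not something fixed by the slides.

```python
import numpy as np

def design_matrix(x, n):
    """Phi[j, i-1] = phi_i(x_j), assuming a polynomial basis phi_i(x) = x**i."""
    return np.vstack([x**i for i in range(1, n + 1)]).T     # shape (m, n)

def squared_error(w, Phi, y):
    """E(w, D) = sum_j (sum_i w_i phi_i(x_j) - y_j)^2 = ||Phi w - y||^2, as in (3)."""
    r = Phi @ w - y
    return float(r @ r)

# Tiny usage example with made-up data points (x_j, y_j).
x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.array([0.1, 0.8, 1.1, 2.2])
Phi = design_matrix(x, n=3)
print(Phi.shape)                        # (4, 3): m = 4 points, n = 3 basis functions
print(squared_error(np.zeros(3), Phi, y))
```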


Recap: Geometric Interpretation of Least Square Solution

Let y* be a solution in the column space of \Phi. The least squares solution is such that the distance between y* and y is minimized. Therefore, the line joining y* to y should be orthogonal to the column space of \Phi, which gives

w = (\Phi^T \Phi)^{-1} \Phi^T y \quad (4)

Here \Phi^T \Phi is invertible only if \Phi has full column rank.
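A minimal numpy sketch of equation (4), cross-checked against numpy's built-in least-squares solver; the random \Phi and y here are placeholders, not the lecture's data.

```python
import numpy as np

def least_squares_w(Phi, y):
    """Equation (4): w = (Phi^T Phi)^{-1} Phi^T y.
    Valid only when Phi has full column rank, so that Phi^T Phi is invertible."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 4))                     # m = 20 points, n = 4 basis functions
y = rng.normal(size=20)
w_normal_eq = least_squares_w(Phi, y)
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # robust even if Phi is rank-deficient
print(np.allclose(w_normal_eq, w_lstsq))           # True
```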


Building on questions on Least Squares Linear Regression

1. Is there a probabilistic interpretation?
   Gaussian Error, Maximum Likelihood Estimate
2. Addressing overfitting
   Bayesian and Maximum Aposteriori Estimates, Regularization
3. How to minimize the resultant and more complex error functions?
   Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality


Probabilistic Modeling of Linear Regression

Linear Model: Y is a linear function of \phi(x), subject to a random noise variable \varepsilon which we believe is 'mostly' bounded by some threshold \sigma:

Y = w^T \phi(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)

Motivation: \mathcal{N}(\mu, \sigma^2) has maximum entropy among all real-valued distributions with a specified variance \sigma^2.

3\sigma rule: About 68% of values drawn from \mathcal{N}(\mu, \sigma^2) are within one standard deviation \sigma of the mean \mu; about 95% of the values lie within 2\sigma; and about 99.7% are within 3\sigma.
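A quick empirical sketch of this noise model and the 3\sigma rule; the particular weights and basis below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3
w = np.array([1.0, -2.0, 0.5])                      # arbitrary "true" weights (illustrative)
xs = rng.uniform(-1.0, 1.0, size=100_000)
Phi = np.vstack([np.ones_like(xs), xs, xs**2]).T    # assumed basis (1, x, x^2)

# Draw targets from the model: Y = w^T phi(x) + eps, eps ~ N(0, sigma^2)
eps = rng.normal(0.0, sigma, size=xs.shape)
ys = Phi @ w + eps

# Empirical check of the 3-sigma rule on the noise term.
for k in (1, 2, 3):
    print(f"within {k}*sigma: {np.mean(np.abs(eps) <= k * sigma):.3f}")
# Prints roughly 0.683, 0.954, 0.997
```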


Figure 1: 3\sigma rule: About 68% of values drawn from \mathcal{N}(\mu, \sigma^2) are within one standard deviation \sigma of the mean \mu; about 95% of the values lie within 2\sigma; and about 99.7% are within 3\sigma. Source: https://en.wikipedia.org/wiki/Normal_distribution



Probabilistic Modeling of Linear Regression

Linear Model: Y is a linear function of \phi(x), subject to a random noise variable \varepsilon which we believe is 'mostly' bounded by some threshold \sigma:

Y = w^T \phi(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)

This allows for the probabilistic model

P(y_j \mid w, x_j, \sigma^2) = \mathcal{N}(w^T \phi(x_j), \sigma^2), \qquad P(y \mid w, x_1, \dots, x_m, \sigma^2) = \prod_{j=1}^{m} P(y_j \mid w, x_j, \sigma^2)

Another motivation:

E[Y(w, x_j)] = w^T \phi(x_j) = w_0 + w_1 \phi_1(x_j) + \dots + w_n \phi_n(x_j)
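A short empirical check of the expectation above: averaging many draws of Y at a fixed x_j recovers w^T \phi(x_j). The weights and basis are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3
w = np.array([0.5, 1.0, -2.0])                 # illustrative weights (w_0, w_1, w_2)
phi = lambda x: np.array([1.0, x, x**2])       # assumed basis with phi_0(x) = 1

xj = 0.7
draws = w @ phi(xj) + rng.normal(0.0, sigma, size=200_000)   # samples of Y(w, x_j)
print(w @ phi(xj), draws.mean())   # empirical mean matches w^T phi(x_j)
```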



Estimating w: Maximum Likelihood

If \varepsilon \sim \mathcal{N}(0, \sigma^2) and y = w^T \phi(x) + \varepsilon, where w, \phi(x) \in \mathbb{R}^n, then, given dataset D, find the maximum likelihood estimate \hat{w}_{ML}.

Recall:

Pr(y_j \mid x_j, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_j - w^T \phi(x_j))^2}{2\sigma^2} \right)

From probability of data to likelihood of parameters:

Pr(D \mid w) = Pr(y \mid x, w) = \prod_{j=1}^{m} Pr(y_j \mid x_j, w) = \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_j - w^T \phi(x_j))^2}{2\sigma^2} \right)

Maximum Likelihood Estimate:

\hat{w}_{ML} = \arg\max_{w} Pr(D \mid w), \quad \text{where } Pr(D \mid w) = Pr(y \mid x, w) = L(w \mid D)
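A small sketch of the likelihood above: Pr(D|w) evaluated as a product of per-point Gaussian densities. The tiny dataset is invented purely to exercise the formula.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def likelihood(w, Phi, y, sigma):
    """L(w | D) = Pr(y | x, w) = prod_j N(y_j; w^T phi(x_j), sigma^2).
    For large m this product underflows in floating point, one more reason to work
    with the log-likelihood introduced on the next slide."""
    return float(np.prod(gaussian_pdf(y, Phi @ w, sigma)))

# Tiny usage: three points with basis (1, x).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.1, 1.2, 1.9])
print(likelihood(np.array([0.0, 1.0]), Phi, y, sigma=0.5))
```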


Optimization Trick

Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

\log L(w \mid D) = LL(w \mid D) = -\frac{m}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{j=1}^{m} \left( w^T \phi(x_j) - y_j \right)^2

For a fixed \sigma^2:

\hat{w}_{ML} = \arg\max_{w} LL(y_1 \dots y_m \mid x_1 \dots x_m, w, \sigma^2) = \arg\min_{w} \sum_{j=1}^{m} \left( w^T \phi(x_j) - y_j \right)^2

Note that this is the same as the Least Squares solution!
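A numerical sanity check of this equivalence: minimizing the negative log-likelihood with a generic optimizer recovers the same \hat{w} as the closed-form least squares solution. The synthetic \Phi, y, and \sigma below are assumptions used only for the check.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
m, n, sigma = 50, 3, 0.2
Phi = rng.normal(size=(m, n))
y = Phi @ np.array([1.0, -0.5, 2.0]) + rng.normal(0.0, sigma, size=m)

def neg_log_likelihood(w):
    # -LL(w|D) = (m/2) ln(2*pi*sigma^2) + (1/(2*sigma^2)) * sum_j (w^T phi(x_j) - y_j)^2
    r = Phi @ w - y
    return 0.5 * m * np.log(2 * np.pi * sigma ** 2) + (r @ r) / (2 * sigma ** 2)

w_ml = minimize(neg_log_likelihood, x0=np.zeros(n)).x   # maximize LL = minimize -LL
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)          # least squares (normal equations)
print(np.allclose(w_ml, w_ls, atol=1e-4))               # True: the two estimates coincide
```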


Building on questions on Least Squares Linear Regression

1. Is there a probabilistic interpretation?
   Gaussian Error, Maximum Likelihood Estimate
2. Addressing overfitting
   Bayesian and Maximum Aposteriori Estimates, Regularization
3. How to minimize the resultant and more complex error functions?
   Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality


Redundant Φ and Overfitting

Figure 2: Root Mean Squared (RMS) errors on sample train and test datasets as a function of the degree t of the polynomial being fit

Too many bends (t = 9 onwards) in the curve ≡ high values of some w_i's. Try plotting the values of the w_i's using the applet at http://mste.illinois.edu/users/exner/java.f/leastsquares/#simulation

Train and test errors differ significantly.


[Applet output: coefficients w_0 through w_9 of two high-degree polynomial fits, with several coefficients reaching magnitudes in the tens (e.g. about 10.2, 8.3, 14.8, 24.3, 20.6), illustrating the large w_i values that accompany overfitting.]
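The pattern in Figure 2 and the coefficient blow-up above can be reproduced with a short experiment; the sine ground truth, noise level, and dataset sizes below are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)                      # assumed underlying curve
x_tr = rng.uniform(0.0, 1.0, size=10)                    # small training set
x_te = rng.uniform(0.0, 1.0, size=200)                   # larger test set
y_tr = f(x_tr) + rng.normal(0.0, 0.2, size=x_tr.shape)
y_te = f(x_te) + rng.normal(0.0, 0.2, size=x_te.shape)

def poly_design(x, t):
    """Design matrix for a degree-t polynomial fit (columns 1, x, ..., x^t)."""
    return np.vstack([x ** i for i in range(t + 1)]).T

for t in (1, 3, 9):
    w, *_ = np.linalg.lstsq(poly_design(x_tr, t), y_tr, rcond=None)
    rms = lambda x, y: np.sqrt(np.mean((poly_design(x, t) @ w - y) ** 2))
    print(f"t={t}: train RMS={rms(x_tr, y_tr):.3f}  "
          f"test RMS={rms(x_te, y_te):.3f}  max|w_i|={np.abs(w).max():.1f}")
# Typically: train RMS keeps shrinking with t, while test RMS and the
# weight magnitudes blow up at t = 9.
```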


Bayesian Linear Regression

The Bayesian interpretation of probabilistic estimation is a logical extension that enables reasoning with uncertainty in the light of some background belief.

Bayesian linear regression: a Bayesian alternative to Maximum Likelihood least squares regression. Continue with Normally distributed errors. Model w using a prior distribution and use the posterior over w as the result.

Intuitive Prior: Components of w should not become too large!

Next: Illustration of Bayesian Estimation on a simple coin-tossing example
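One standard way this prior intuition gets operationalized (anticipating the regularization thread in the lecture title) is a squared-norm (L2) penalty on w, i.e. ridge / regularized least squares. A minimal sketch, with the penalty weight lam as an assumed hyperparameter rather than anything specified on the slide:

```python
import numpy as np

def ridge_w(Phi, y, lam):
    """argmin_w ||Phi w - y||^2 + lam * ||w||^2  =  (Phi^T Phi + lam I)^{-1} Phi^T y.
    The lam * I term discourages large weights and keeps the system solvable even
    when Phi does not have full column rank (the redundant-Phi case above)."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

# Usage sketch: larger lam shrinks the weights toward zero.
rng = np.random.default_rng(5)
Phi = rng.normal(size=(20, 5))
y = rng.normal(size=20)
for lam in (0.0, 0.1, 10.0):
    print(lam, np.abs(ridge_w(Phi, y, lam)).max())
```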