
Introduction to Machine Learning - CS725 | Instructor: Prof. Ganesh Ramakrishnan | Lecture 4: Linear Regression - Probabilistic Interpretation and Regularization


  1. Introduction to Machine Learning - CS725
     Instructor: Prof. Ganesh Ramakrishnan
     Lecture 4: Linear Regression - Probabilistic Interpretation and Regularization

  2. Recap: Linear Regression is not Naively Linear
     Need to determine w for the linear function f(x_j, w) = ∑_{i=1}^n w_i φ_i(x_j), i.e. f = Φw over the dataset, which minimizes our error function E(f(x, w), D).
     Owing to the basis functions φ, "Linear Regression" is linear in w but NOT in x (which could be arbitrarily non-linear)!

        Φ = [ φ_1(x_1)  φ_2(x_1)  ...  φ_n(x_1) ]
            [    ...       ...    ...     ...   ]
            [ φ_1(x_m)  φ_2(x_m)  ...  φ_n(x_m) ]        (1)

  3. Recap: Linear Regression is not Naively Linear (contd.)
     Need to determine w for the linear function f(x_j, w) = ∑_{i=1}^n w_i φ_i(x_j), i.e. f = Φw over the dataset, which minimizes our error function E(f(x, w), D).
     Owing to the basis functions φ, "Linear Regression" is linear in w but NOT in x (which could be arbitrarily non-linear)!

        Φ = [ φ_1(x_1)  φ_2(x_1)  ...  φ_n(x_1) ]
            [    ...       ...    ...     ...   ]
            [ φ_1(x_m)  φ_2(x_m)  ...  φ_n(x_m) ]        (1)

     Least Squares error and corresponding estimates:

        E* = min_w E(w, D) = min_w ( w^T Φ^T Φ w − 2 y^T Φ w + y^T y )                    (2)

        w* = argmin_w E(w, D) = argmin_w ∑_{j=1}^m ( ∑_{i=1}^n w_i φ_i(x_j) − y_j )^2     (3)
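
     As a concrete illustration of Eqs. (1)-(3), here is a minimal NumPy sketch that builds a design matrix Φ for an assumed polynomial basis φ_i(x) = x^(i-1) and checks that the sum-of-squares form of E(w, D) agrees with the matrix form of Eq. (2). The toy data and helper names (build_design_matrix, squared_error) are illustrative, not from the lecture.

        import numpy as np

        # Minimal sketch: assume a polynomial basis phi_i(x) = x**(i-1), i = 1..n.
        def build_design_matrix(x, n):
            """m x n design matrix Phi with Phi[j, i-1] = phi_i(x_j) = x_j**(i-1)."""
            return np.vander(x, N=n, increasing=True)

        def squared_error(w, Phi, y):
            """E(w, D) as a sum of squares and, equivalently, in the matrix form of Eq. (2)."""
            e_sum = np.sum((Phi @ w - y) ** 2)
            e_mat = w @ Phi.T @ Phi @ w - 2 * y @ Phi @ w + y @ y
            assert np.isclose(e_sum, e_mat)   # the two forms agree
            return e_sum

        x = np.linspace(0, 1, 10)                               # toy inputs
        y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)   # toy targets
        Phi = build_design_matrix(x, n=4)
        print(squared_error(np.zeros(4), Phi, y))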

  4. Recap: Geometric Interpretation of the Least Squares Solution
     Let y* be a solution in the column space of Φ.
     The least squares solution is such that the distance between y* and y is minimized.
     Therefore, the line joining y* to y should be orthogonal to the column space of Φ:

        ⇒ w = (Φ^T Φ)^(−1) Φ^T y        (4)

     Here Φ^T Φ is invertible only if Φ has full column rank.
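
     A short sketch of the closed form in Eq. (4), again on assumed toy data: it solves the normal equations and verifies the geometric claim that the residual y − y* is orthogonal to the column space of Φ. In practice np.linalg.lstsq is usually preferred numerically; this is only a sketch.

        import numpy as np

        # Sketch of the closed form w = (Phi^T Phi)^(-1) Phi^T y on assumed toy data.
        rng = np.random.default_rng(0)
        x = np.linspace(0, 1, 20)
        Phi = np.vander(x, N=4, increasing=True)                  # full column rank here
        y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

        w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)               # normal equations
        y_star = Phi @ w                                          # projection of y onto col(Phi)

        # Geometric check: the residual y - y* is orthogonal to every column of Phi.
        print(np.allclose(Phi.T @ (y - y_star), 0.0))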

  5. Building on Questions on Least Squares Linear Regression
     (1) Is there a probabilistic interpretation? Gaussian Error, Maximum Likelihood Estimate
     (2) Addressing overfitting: Bayesian and Maximum A Posteriori Estimates, Regularization
     (3) How to minimize the resultant and more complex error functions? Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality

  6. Probabilistic Modeling of Linear Regression
     Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is 'mostly' bounded by some threshold σ:

        Y = w^T φ(x) + ε,    ε ∼ N(0, σ^2)

     Motivation: N(µ, σ^2) has maximum entropy among all real-valued distributions with a specified variance σ^2.
     3-σ rule: About 68% of values drawn from N(µ, σ^2) are within one standard deviation σ of the mean µ; about 95% of the values lie within 2σ; and about 99.7% are within 3σ.

  7. Figure 1: The 3-σ rule: about 68% of values drawn from N(µ, σ^2) are within one standard deviation σ of the mean µ; about 95% lie within 2σ; and about 99.7% are within 3σ. Source: https://en.wikipedia.org/wiki/Normal_distribution
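
     The 68/95/99.7 percentages can be checked empirically. A small sketch, where the sample size and seed are arbitrary choices:

        import numpy as np

        # Quick empirical check of the 3-sigma rule by sampling from N(mu, sigma^2).
        rng = np.random.default_rng(42)
        mu, sigma = 0.0, 2.0
        samples = rng.normal(mu, sigma, size=1_000_000)

        for k in (1, 2, 3):
            frac = np.mean(np.abs(samples - mu) <= k * sigma)
            print(f"within {k} sigma: {frac:.4f}")   # approx. 0.6827, 0.9545, 0.9973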

  8. Probabilistic Modeling of Linear Regression (contd.)
     Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is 'mostly' bounded by some threshold σ:

        Y = w^T φ(x) + ε,    ε ∼ N(0, σ^2)

     This allows for the probabilistic model:

        P(y_j | w, x_j, σ^2) = N(w^T φ(x_j), σ^2)

        P(y | w, x_1, ..., x_m, σ^2) = ∏_{j=1}^m P(y_j | w, x_j, σ^2)

     Another motivation: E[Y(w, x_j)] = ?

  9. Probabilistic Modeling of Linear Regression (contd.)
     Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is 'mostly' bounded by some threshold σ:

        Y = w^T φ(x) + ε,    ε ∼ N(0, σ^2)

     This allows for the probabilistic model:

        P(y_j | w, x_j, σ^2) = N(w^T φ(x_j), σ^2)

        P(y | w, x_1, ..., x_m, σ^2) = ∏_{j=1}^m P(y_j | w, x_j, σ^2)

     Another motivation: E[Y(w, x_j)] = w^T φ(x_j) = w_0 + w_1 φ_1(x_j) + ... + w_n φ_n(x_j)
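
     A sketch of this generative view, confirming numerically that the sample mean of Y at a fixed x_j approaches E[Y(w, x_j)] = w^T φ(x_j); the basis and the 'true' w below are illustrative assumptions.

        import numpy as np

        # Sketch of the generative model Y = w^T phi(x) + eps, eps ~ N(0, sigma^2).
        rng = np.random.default_rng(1)
        w_true = np.array([0.5, -1.0, 2.0])
        sigma = 0.3

        def phi(x):
            return np.array([1.0, x, x ** 2])      # assumed basis: 1, x, x^2

        x_j = 0.7
        eps = rng.normal(0.0, sigma, size=100_000)
        y_samples = w_true @ phi(x_j) + eps

        # The sample mean of Y at x_j is close to E[Y(w, x_j)] = w^T phi(x_j).
        print(y_samples.mean(), w_true @ phi(x_j))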

  10. Estimating w: Maximum Likelihood
     If ε ∼ N(0, σ^2) and y = w^T φ(x) + ε, where w, φ(x) ∈ R^n, then, given a dataset D, find the most likely ŵ_ML.
     Recall:

        Pr(y_j | x_j, w) = (1 / √(2πσ^2)) exp( −(y_j − w^T φ(x_j))^2 / (2σ^2) )

     From Probability of the data to Likelihood of the parameters: Pr(D | w) = Pr(y | x, w) = ?

  11. Estimating w: Maximum Likelihood (contd.)
     If ε ∼ N(0, σ^2) and y = w^T φ(x) + ε, where w, φ(x) ∈ R^n, then, given a dataset D, find the most likely ŵ_ML.
     Recall:

        Pr(y_j | x_j, w) = (1 / √(2πσ^2)) exp( −(y_j − w^T φ(x_j))^2 / (2σ^2) )

     From Probability of the data to Likelihood of the parameters:

        Pr(D | w) = Pr(y | x, w) = ∏_{j=1}^m Pr(y_j | x_j, w) = ∏_{j=1}^m (1 / √(2πσ^2)) exp( −(y_j − w^T φ(x_j))^2 / (2σ^2) )

     Maximum Likelihood Estimate:

        ŵ_ML = argmax_w Pr(D | w),   where Pr(D | w) = Pr(y | x, w) = L(w | D)
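
     The likelihood L(w | D) can be evaluated directly as a product of Gaussian densities. The sketch below does this on assumed toy data and also hints at why the log is taken next: products of many small densities underflow.

        import numpy as np

        # Sketch: evaluating L(w | D) = prod_j N(y_j; w^T phi(x_j), sigma^2) on toy data.
        rng = np.random.default_rng(2)
        sigma = 0.3
        x = np.linspace(0, 1, 30)
        Phi = np.vander(x, N=3, increasing=True)        # assumed basis: 1, x, x^2
        w_true = np.array([0.5, -1.0, 2.0])
        y = Phi @ w_true + rng.normal(0.0, sigma, size=len(x))

        def likelihood(w, Phi, y, sigma):
            resid = y - Phi @ w
            dens = np.exp(-resid ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
            # Products of many small densities underflow for large m, which is one
            # reason to work with the log-likelihood instead.
            return np.prod(dens)

        print(likelihood(w_true, Phi, y, sigma))
        print(likelihood(np.zeros(3), Phi, y, sigma))   # a poor w has (much) lower likelihood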

  12. Optimization Trick
     Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

  13. Optimization Trick (contd.)
     Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

        log L(w | D) = LL(w | D) = −(m/2) ln(2πσ^2) − (1/(2σ^2)) ∑_{j=1}^m ( w^T φ(x_j) − y_j )^2

     For a fixed σ^2: ŵ_ML = ?

  14. Optimization Trick (contd.)
     Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

        log L(w | D) = LL(w | D) = −(m/2) ln(2πσ^2) − (1/(2σ^2)) ∑_{j=1}^m ( w^T φ(x_j) − y_j )^2

     For a fixed σ^2:

        ŵ_ML = argmax_w LL(y_1 ... y_m | x_1 ... x_m, w, σ^2) = argmin_w ∑_{j=1}^m ( w^T φ(x_j) − y_j )^2

     Note that this is the same as the least squares solution!
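
     A numerical sanity check of this equivalence, under the same kind of assumed toy setup: for fixed σ^2 the log-likelihood differs from the negative sum of squared errors only by constants, so it is maximized at the least-squares w and random perturbations of that w can only lower it.

        import numpy as np

        # Sketch: the Gaussian log-likelihood is maximized at the least-squares solution.
        rng = np.random.default_rng(3)
        sigma = 0.3
        x = np.linspace(0, 1, 30)
        Phi = np.vander(x, N=3, increasing=True)
        y = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, sigma, size=len(x))

        def log_likelihood(w):
            m = len(y)
            return (-m / 2 * np.log(2 * np.pi * sigma ** 2)
                    - np.sum((Phi @ w - y) ** 2) / (2 * sigma ** 2))

        w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # least-squares / normal equations

        # Any perturbation of w_ls can only decrease the log-likelihood.
        print(all(log_likelihood(w_ls) >= log_likelihood(w_ls + 0.05 * rng.standard_normal(3))
                  for _ in range(5)))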

  15. Building on Questions on Least Squares Linear Regression
     (1) Is there a probabilistic interpretation? Gaussian Error, Maximum Likelihood Estimate
     (2) Addressing overfitting: Bayesian and Maximum A Posteriori Estimates, Regularization
     (3) How to minimize the resultant and more complex error functions? Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality

  16. Redundant Φ and Overfitting
     Figure 2: Root Mean Squared (RMS) errors on sample train and test datasets as a function of the degree t of the polynomial being fit.
     Too many bends in the curve (t = 9 onwards) ≡ high values of some of the w_i's. Try plotting the values of the w_i's using the applet at http://mste.illinois.edu/users/exner/java.f/leastsquares/#simulation
     Train and test errors differ significantly.

  17. Coefficient values for two polynomial fits (note how large some of the w_i's become):
     Degree-9 fit:  X^0: 0.13252679175596802,  X^1: 6.836159339696569,  X^2: -10.198794083500966,  X^3: 8.298738913209064,  X^4: -3.766949862252123,  X^5: 1.0274981119277349,  X^6: -0.17218031550131038,  X^7: 0.017340835860554016,  X^8: -9.623065771393043E-4,  X^9: 2.2595409656184083E-5
     Degree-5 fit:  X^0: -1.4218758581602278,  X^1: 14.756472312089675,  X^2: -24.299789484296475,  X^3: 20.63606795357865,  X^4: -9.934453145766518,  X^5: 2.8975181063446613
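
     The same blow-up can be reproduced with a few lines of NumPy; the sin-based toy dataset below merely stands in for the applet's data (an assumption), so the exact numbers will differ.

        import numpy as np

        # Sketch reproducing the overfitting observation: as the polynomial degree grows,
        # some coefficients blow up and test RMS error diverges from train RMS error.
        rng = np.random.default_rng(4)
        def make_data(m):
            x = rng.uniform(0, 1, m)
            return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(m)

        x_tr, y_tr = make_data(12)
        x_te, y_te = make_data(100)

        for t in (1, 3, 9):
            Phi_tr = np.vander(x_tr, N=t + 1, increasing=True)
            Phi_te = np.vander(x_te, N=t + 1, increasing=True)
            w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
            rms = lambda Phi, y: np.sqrt(np.mean((Phi @ w - y) ** 2))
            print(f"t={t}: max|w_i|={np.max(np.abs(w)):.2e}, "
                  f"train RMS={rms(Phi_tr, y_tr):.3f}, test RMS={rms(Phi_te, y_te):.3f}")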

  18. Bayesian Linear Regression
     The Bayesian interpretation of probabilistic estimation is a logical extension that enables reasoning with uncertainty, in the light of some background belief.
     Bayesian linear regression: a Bayesian alternative to maximum likelihood (least squares) regression.
     Continue with normally distributed errors.
     Model w using a prior distribution and use the posterior over w as the result.
     Intuitive Prior:
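
     The slide breaks off before stating the prior, but one standard 'intuitive' choice is a zero-mean isotropic Gaussian, w ∼ N(0, α^(−1) I), whose MAP estimate is ridge regression: w_MAP = (Φ^T Φ + λI)^(−1) Φ^T y with λ = σ^2 α. The sketch below contrasts it with the maximum likelihood fit on assumed toy data; the specific prior and hyperparameters are assumptions, not necessarily what the remaining slides use.

        import numpy as np

        # Sketch of a MAP estimate under an assumed prior w ~ N(0, alpha^{-1} I):
        # with Gaussian noise this gives ridge regression,
        #     w_MAP = (Phi^T Phi + lambda I)^{-1} Phi^T y,   lambda = sigma^2 * alpha.
        rng = np.random.default_rng(5)
        x = rng.uniform(0, 1, 12)
        y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(12)
        Phi = np.vander(x, N=10, increasing=True)        # degree-9 basis, prone to overfitting

        sigma2, alpha = 0.04, 5.0                        # illustrative noise and prior precisions
        lam = sigma2 * alpha
        w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
        w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

        print("max|w_ML|  =", np.max(np.abs(w_ml)))      # typically large (overfit)
        print("max|w_MAP| =", np.max(np.abs(w_map)))     # shrunk towards zero by the prior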
