COMS 4721: Machine Learning for Data Science
Lecture 3, 1/24/2017


  1. COMS 4721: Machine Learning for Data Science, Lecture 3, 1/24/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University.

  2. REGRESSION: PROBLEM DEFINITION
Data: measured pairs $(x, y)$, where $x \in \mathbb{R}^{d+1}$ (input) and $y \in \mathbb{R}$ (output).
Goal: find a function $f : \mathbb{R}^{d+1} \to \mathbb{R}$ such that $y \approx f(x; w)$ for the data pair $(x, y)$. Here $f(x; w)$ is the regression function and the vector $w$ contains its parameters.
Definition of linear regression: a regression method is called linear if the prediction $f$ is a linear function of the unknown parameters $w$.

  3. LEAST SQUARES (CONTINUED)

  4. LEAST SQUARES LINEAR REGRESSION
Least squares solution
Least squares finds the $w$ that minimizes the sum of squared errors. The least squares objective in the most basic form, where $f(x; w) = x^T w$, is
$$\mathcal{L} = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = \|y - Xw\|^2 = (y - Xw)^T (y - Xw).$$
We defined $y = [y_1, \dots, y_n]^T$ and $X = [x_1, \dots, x_n]^T$. Taking the gradient with respect to $w$ and setting it to zero, we find that
$$\nabla_w \mathcal{L} = 2X^T X w - 2X^T y = 0 \;\Rightarrow\; w_{LS} = (X^T X)^{-1} X^T y.$$
In other words, $w_{LS}$ is the vector that minimizes $\mathcal{L}$.
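As a quick numerical companion to this slide, here is a minimal sketch (with simulated data and variable names of our own choosing, not from the lecture) that forms the closed-form solution $w_{LS} = (X^T X)^{-1} X^T y$ and checks it against NumPy's least squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))                  # rows are the inputs x_i^T
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy linear responses

# Closed-form least squares: solve (X^T X) w = X^T y rather than inverting X^T X.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_ls, w_lstsq)
```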

  5. PROBABILISTIC VIEW
◮ Last class, we discussed the geometric interpretation of least squares.
◮ Least squares also has an insightful probabilistic interpretation that allows us to analyze its properties.
◮ That is, given that we pick this model as reasonable for our problem, we can ask: what kinds of assumptions are we making?

  6. PROBABILISTIC VIEW
Recall: Gaussian density in $n$ dimensions
Assume a diagonal covariance matrix $\Sigma = \sigma^2 I$. The density is
$$p(y \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left( -\frac{1}{2\sigma^2} (y - \mu)^T (y - \mu) \right).$$
What if we restrict the mean to $\mu = Xw$ and find the maximum likelihood solution for $w$?
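To sanity-check the density above, a small sketch (arbitrary values, chosen only for illustration) compares the isotropic-Gaussian log-density written out by hand with SciPy's general multivariate normal using $\Sigma = \sigma^2 I$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, sigma2 = 5, 0.5
mu = rng.normal(size=n)
y = rng.normal(size=n)

# log p(y | mu, sigma^2) = -(n/2) log(2 pi sigma^2) - ||y - mu||^2 / (2 sigma^2)
log_p = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)
log_p_scipy = multivariate_normal(mean=mu, cov=sigma2 * np.eye(n)).logpdf(y)
assert np.isclose(log_p, log_p_scipy)
```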

  7. PROBABILISTIC VIEW
Maximum likelihood for Gaussian linear regression
Plug $\mu = Xw$ into the multivariate Gaussian distribution and solve for $w$ using maximum likelihood:
$$w_{ML} = \arg\max_w \ \ln p(y \mid \mu = Xw, \sigma^2) = \arg\max_w \ -\frac{1}{2\sigma^2}\|y - Xw\|^2 - \frac{n}{2}\ln(2\pi\sigma^2).$$
Least squares (LS) and maximum likelihood (ML) share the same solution:
$$\text{LS: } \arg\min_w \|y - Xw\|^2 \quad\Longleftrightarrow\quad \text{ML: } \arg\max_w -\frac{1}{2\sigma^2}\|y - Xw\|^2.$$
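The LS/ML equivalence can also be seen numerically. The sketch below (simulated data; the optimizer and setup are our own, not part of the slides) maximizes the Gaussian log-likelihood with a generic optimizer and recovers the closed-form least squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d, sigma2 = 100, 3, 0.25
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)     # closed-form (X^T X)^{-1} X^T y

# Negative Gaussian log-likelihood as a function of w (sigma^2 held fixed).
def neg_log_lik(w):
    return np.sum((y - X @ w) ** 2) / (2 * sigma2) + 0.5 * n * np.log(2 * np.pi * sigma2)

w_ml = minimize(neg_log_lik, x0=np.zeros(d)).x
assert np.allclose(w_ls, w_ml, atol=1e-4)
```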

  8. PROBABILISTIC VIEW
◮ Therefore, in a sense we are making an independent Gaussian noise assumption about the error, $\epsilon_i = y_i - x_i^T w$.
◮ Other ways of saying this:
1) $y_i = x_i^T w + \epsilon_i$, with $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$, for $i = 1, \dots, n$,
2) $y_i \overset{\text{ind}}{\sim} N(x_i^T w, \sigma^2)$, for $i = 1, \dots, n$,
3) $y \sim N(Xw, \sigma^2 I)$, as on the previous slides.
◮ Can we use this probabilistic line of analysis to better understand the maximum likelihood (i.e., least squares) solution?

  9. PROBABILISTIC VIEW
Expected solution
Given: the modeling assumption that $y \sim N(Xw, \sigma^2 I)$. We can calculate the expectation of the ML solution under this distribution:
$$E[w_{ML}] = E[(X^T X)^{-1} X^T y] = \int (X^T X)^{-1} X^T y \; p(y \mid X, w)\, dy = (X^T X)^{-1} X^T E[y] = (X^T X)^{-1} X^T X w = w.$$
Therefore $w_{ML}$ is an unbiased estimate of $w$, i.e., $E[w_{ML}] = w$.
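A Monte Carlo check of this unbiasedness claim under the same modeling assumption (the design matrix, true $w$, and trial count below are made up for illustration): averaging $w_{ML}$ over many simulated datasets $y \sim N(Xw, \sigma^2 I)$ should come close to $w$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 30, 3, 1.0
X = rng.normal(size=(n, d))
w = np.array([2.0, -1.0, 0.5])
A = np.linalg.solve(X.T @ X, X.T)            # (X^T X)^{-1} X^T, fixed across trials

trials = 20000
w_ml_samples = np.array([A @ (X @ w + sigma * rng.normal(size=n))
                         for _ in range(trials)])
print(w)                                     # true parameters
print(w_ml_samples.mean(axis=0))             # Monte Carlo estimate of E[w_ML]
```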

  10-12. REVIEW: AN EQUALITY FROM PROBABILITY (built up over three slides)
◮ Even though the “expected” maximum likelihood solution is the correct one, should we actually expect to get something near it?
◮ We should also look at the covariance. Recall that if $y \sim N(\mu, \Sigma)$, then
$$\mathrm{Var}[y] = E[(y - E[y])(y - E[y])^T] = \Sigma.$$
◮ Plugging in $E[y] = \mu$, this is equivalently written as
$$\mathrm{Var}[y] = E[(y - \mu)(y - \mu)^T] = E[yy^T - y\mu^T - \mu y^T + \mu\mu^T] = E[yy^T] - \mu\mu^T.$$
◮ Immediately we also get $E[yy^T] = \Sigma + \mu\mu^T$.
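The identity $E[yy^T] = \Sigma + \mu\mu^T$ is easy to verify by simulation; the small example below (arbitrary $\mu$ and $\Sigma$, chosen here for illustration) averages $yy^T$ over many draws.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=200000)           # rows are samples of y
second_moment = (Y[:, :, None] * Y[:, None, :]).mean(axis=0)  # average of y y^T
print(second_moment)
print(Sigma + np.outer(mu, mu))                               # should be close
```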

  13-18. PROBABILISTIC VIEW
Variance of the solution (built up over these slides)
Returning to least squares linear regression, we wish to find
$$\mathrm{Var}[w_{ML}] = E[(w_{ML} - E[w_{ML}])(w_{ML} - E[w_{ML}])^T] = E[w_{ML} w_{ML}^T] - E[w_{ML}]\,E[w_{ML}]^T.$$
The sequence of equalities follows:
$$\begin{aligned}
\mathrm{Var}[w_{ML}] &= E[(X^T X)^{-1} X^T y y^T X (X^T X)^{-1}] - ww^T \\
&= (X^T X)^{-1} X^T E[yy^T]\, X (X^T X)^{-1} - ww^T \\
&= (X^T X)^{-1} X^T (\sigma^2 I + X ww^T X^T) X (X^T X)^{-1} - ww^T \\
&= (X^T X)^{-1} X^T \sigma^2 I\, X (X^T X)^{-1} + (X^T X)^{-1} X^T X ww^T X^T X (X^T X)^{-1} - ww^T \\
&= \sigma^2 (X^T X)^{-1}.
\end{aligned}$$
Aside: for matrices $A$, $B$ and vector $c$, recall that $(ABc)^T = c^T B^T A^T$.
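The closed form $\mathrm{Var}[w_{ML}] = \sigma^2 (X^T X)^{-1}$ can likewise be checked by simulation. This sketch (same kind of made-up setup as above, not from the lecture) compares the empirical covariance of $w_{ML}$ across repeated datasets with the theoretical expression.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 30, 3, 1.0
X = rng.normal(size=(n, d))
w = np.array([2.0, -1.0, 0.5])
A = np.linalg.solve(X.T @ X, X.T)                 # (X^T X)^{-1} X^T

trials = 50000
W = np.array([A @ (X @ w + sigma * rng.normal(size=n)) for _ in range(trials)])
print(np.cov(W.T))                                # empirical covariance of w_ML
print(sigma**2 * np.linalg.inv(X.T @ X))          # sigma^2 (X^T X)^{-1}
```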

  19. PROBABILISTIC VIEW
◮ We’ve shown that, under the Gaussian assumption $y \sim N(Xw, \sigma^2 I)$,
$$E[w_{ML}] = w, \qquad \mathrm{Var}[w_{ML}] = \sigma^2 (X^T X)^{-1}.$$
◮ When there are very large values in $\sigma^2 (X^T X)^{-1}$, the values of $w_{ML}$ are very sensitive to the measured data $y$ (more analysis later).
◮ This is bad if we want to analyze and predict using $w_{ML}$.
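To see this sensitivity concretely, here is a toy illustration (not from the lecture) in which two columns of $X$ are nearly collinear, so $\sigma^2 (X^T X)^{-1}$ has huge entries and the least squares estimates swing wildly from one simulated dataset to the next.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma = 50, 1.0
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)          # nearly a copy of x1
X = np.column_stack([x1, x2])
w = np.array([1.0, 1.0])

print(sigma**2 * np.diag(np.linalg.inv(X.T @ X)))   # very large per-coefficient variances
for _ in range(3):
    y = X @ w + sigma * rng.normal(size=n)
    print(np.linalg.solve(X.T @ X, X.T @ y))        # estimates far from (1, 1)
```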

  20. RIDGE REGRESSION

  21. REGULARIZED LEAST SQUARES
◮ We saw how, with least squares, the values in $w_{ML}$ may be huge.
◮ In general, when developing a model for data we often wish to constrain the model parameters in some way.
◮ There are many models of the form
$$w_{OPT} = \arg\min_w \|y - Xw\|^2 + \lambda\, g(w).$$
◮ The added terms are:
1. $\lambda > 0$: a regularization parameter,
2. $g(w) > 0$: a penalty function that encourages desired properties of $w$.

  22. RIDGE REGRESSION
Ridge regression is one choice of $g(w)$ that addresses the variance issues with $w_{ML}$. It places a squared penalty on the regression coefficient vector $w$:
$$w_{RR} = \arg\min_w \|y - Xw\|^2 + \lambda \|w\|^2.$$
The term $g(w) = \|w\|^2$ penalizes large values in $w$. However, there is a tradeoff between the first and second terms that is controlled by $\lambda$.
◮ Case $\lambda \to 0$: $w_{RR} \to w_{LS}$
◮ Case $\lambda \to \infty$: $w_{RR} \to \vec{0}$

  23. RIDGE REGRESSION SOLUTION
Objective: we can solve the ridge regression problem using exactly the same procedure as for least squares,
$$\mathcal{L} = \|y - Xw\|^2 + \lambda \|w\|^2 = (y - Xw)^T (y - Xw) + \lambda w^T w.$$
Solution: first, take the gradient of $\mathcal{L}$ with respect to $w$ and set it to zero,
$$\nabla_w \mathcal{L} = -2X^T y + 2X^T X w + 2\lambda w = 0.$$
Then solve for $w$ to find that
$$w_{RR} = (\lambda I + X^T X)^{-1} X^T y.$$
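A short sketch of this closed form (simulated data; the helper name `ridge` is ours): as $\lambda \to 0$ the solution matches $w_{LS}$, and as $\lambda$ grows the coefficient norm shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # w_RR = (lam * I + X^T X)^{-1} X^T y
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(ridge(X, y, 1e-10), w_ls)          # lambda -> 0 recovers w_LS
for lam in [0.1, 10.0, 1000.0]:
    print(lam, np.linalg.norm(ridge(X, y, lam)))      # norm decreases as lambda grows
```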

  24. RIDGE REGRESSION GEOMETRY
[Figure: level sets in the $(w_1, w_2)$ plane; the ellipses $(w - w_{LS})^T (X^T X)(w - w_{LS})$ are centered at $w_{LS}$ and the circles $\lambda w^T w$ are centered at the origin.]
There is a tradeoff between the squared error and the penalty on $w$. We can write both in terms of level sets: curves along which the function evaluates to the same number. The sum of these gives a new set of level sets with a unique minimum.
You can check that we can write
$$\|y - Xw\|^2 + \lambda \|w\|^2 = (w - w_{LS})^T (X^T X)(w - w_{LS}) + \lambda w^T w + (\text{const. w.r.t. } w).$$
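The identity on this slide can be verified numerically: the difference between the ridge objective and $(w - w_{LS})^T (X^T X)(w - w_{LS}) + \lambda w^T w$ should be the same number for every $w$. A quick check with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, lam = 40, 3, 2.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

def gap(w):
    # ridge objective minus the recentered quadratic; should not depend on w
    lhs = np.sum((y - X @ w) ** 2) + lam * (w @ w)
    rhs = (w - w_ls) @ (X.T @ X) @ (w - w_ls) + lam * (w @ w)
    return lhs - rhs

print([round(gap(rng.normal(size=d)), 6) for _ in range(5)])  # five identical values
```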
