Regression Methods


0. Regression Methods

1. Linear Regression with only one parameter, and without offset; MLE and MAP estimation (CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3)

2. Consider real-valued variables $X$ and $Y$. The $Y$ variable is generated, conditional on $X$, from the following process:

$$Y = aX + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2),$$

where every $\varepsilon$ is an independent variable, called a noise term, drawn from a Gaussian distribution with mean $0$ and standard deviation $\sigma$. This is a one-feature linear regression model, where $a$ is the only weight parameter. The conditional probability of $Y$ has the distribution $p(Y \mid X, a) \sim \mathcal{N}(aX, \sigma^2)$, so it can be written as

$$p(Y \mid X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y - aX)^2\right).$$
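
A minimal simulation sketch (not part of the original exam problem) of the generative process just described. The values of a_true, sigma, and n are arbitrary illustrative choices; later sketches in this section reuse the X and Y arrays defined here.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, n = 2.0, 1.0, 50                    # assumed parameters, not from the problem
X = rng.uniform(-3.0, 3.0, size=n)                 # fixed inputs X_i
Y = a_true * X + rng.normal(0.0, sigma, size=n)    # Y_i = a * X_i + eps_i, eps_i ~ N(0, sigma^2)
```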

3. MLE estimation

a. Assume we have a training dataset of $n$ pairs $(X_i, Y_i)$ for $i = 1, \ldots, n$, and $\sigma$ is known. Which ones of the following equations correctly represent the maximum likelihood problem for estimating $a$? Say yes or no to each one. More than one of them should have the answer yes.

i. $\displaystyle \arg\max_a \sum_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

ii. $\displaystyle \arg\max_a \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

iii. $\displaystyle \arg\max_a \sum_i \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

iv. $\displaystyle \arg\max_a \prod_i \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

v. $\displaystyle \arg\max_a \sum_i (Y_i - aX_i)^2$

vi. $\displaystyle \arg\min_a \sum_i (Y_i - aX_i)^2$

4. Answer:

$$L_D(a) \overset{\text{def.}}{=} p(Y_1, \ldots, Y_n \mid a) = p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) \overset{\text{i.i.d.}}{=} \prod_{i=1}^n p(Y_i \mid X_i, a) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$$

Therefore

$$a_{\text{MLE}} \overset{\text{def.}}{=} \arg\max_a L_D(a) = \arg\max_a \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \quad (ii.)$$

$$= \arg\max_a \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) = \arg\max_a \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \quad (iv.)$$

$$= \arg\max_a \ln \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) = \arg\max_a \sum_{i=1}^n -\frac{1}{2\sigma^2}(Y_i - aX_i)^2$$

$$= \arg\max_a \left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - aX_i)^2\right) = \arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2 \quad (vi.)$$

5. b. Derive the maximum likelihood estimate of the parameter $a$ in terms of the training examples $X_i$ and $Y_i$. We recommend you start with the simplest form of the problem you found above.

Answer:

$$a_{\text{MLE}} = \arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2 = \arg\min_a \left( a^2 \sum_{i=1}^n X_i^2 - 2a \sum_{i=1}^n X_i Y_i + \sum_{i=1}^n Y_i^2 \right)$$

This is a quadratic in $a$ with positive leading coefficient, so it is minimized at its vertex:

$$a_{\text{MLE}} = -\frac{-2\sum_{i=1}^n X_i Y_i}{2\sum_{i=1}^n X_i^2} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}.$$
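
A short sketch of this closed form, assuming the X and Y arrays from the earlier simulation snippet. The np.linalg.lstsq call is only a cross-check: a no-intercept least-squares fit should recover the same slope.

```python
import numpy as np

# Closed-form MLE: a_hat = sum_i X_i * Y_i / sum_i X_i^2
a_mle = np.sum(X * Y) / np.sum(X ** 2)

# Cross-check against a no-intercept least-squares solve.
a_lstsq, *_ = np.linalg.lstsq(X[:, None], Y, rcond=None)
assert np.isclose(a_mle, a_lstsq[0])
```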

6. MAP estimation

Let's put a prior on $a$. Assume $a \sim \mathcal{N}(0, \lambda^2)$, so

$$p(a \mid \lambda) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{1}{2\lambda^2} a^2\right).$$

The posterior probability of $a$ is

$$p(a \mid Y_1, \ldots, Y_n, X_1, \ldots, X_n, \lambda) = \frac{p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)\, p(a \mid \lambda)}{\int_{a'} p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a')\, p(a' \mid \lambda)\, da'}.$$

We can ignore the denominator when doing MAP estimation.

c. Assume $\sigma = 1$ and a fixed prior parameter $\lambda$. Solve for the MAP estimate of $a$,

$$\arg\max_a \left[ \ln p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) + \ln p(a \mid \lambda) \right].$$

Your solution should be in terms of the $X_i$'s, $Y_i$'s, and $\lambda$.

7. Answer:

$$p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) \cdot p(a \mid \lambda) = \left[ \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{a^2}{2\lambda^2}\right)$$

$$\overset{\sigma = 1}{=} \left[ \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}(Y_i - aX_i)^2\right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{a^2}{2\lambda^2}\right)$$

Therefore the MAP optimization problem is

$$\arg\max_a \left[ n \ln \frac{1}{\sqrt{2\pi}} - \frac{1}{2} \sum_{i=1}^n (Y_i - aX_i)^2 + \ln \frac{1}{\sqrt{2\pi}\,\lambda} - \frac{1}{2\lambda^2} a^2 \right]$$

$$= \arg\max_a \left[ -\frac{1}{2} \sum_{i=1}^n (Y_i - aX_i)^2 - \frac{1}{2\lambda^2} a^2 \right]$$

$$= \arg\min_a \left[ \sum_{i=1}^n (Y_i - aX_i)^2 + \frac{a^2}{\lambda^2} \right] = \arg\min_a \left[ a^2 \left( \sum_{i=1}^n X_i^2 + \frac{1}{\lambda^2} \right) - 2a \sum_{i=1}^n X_i Y_i + \sum_{i=1}^n Y_i^2 \right]$$

$$\Rightarrow a_{\text{MAP}} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2 + \frac{1}{\lambda^2}}.$$
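
A sketch of this MAP closed form with $\sigma = 1$, again assuming the X and Y arrays from the earlier simulation snippet; the prior width lam is an arbitrary illustrative value.

```python
import numpy as np

lam = 0.5  # assumed prior standard deviation lambda, chosen for illustration
a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
# Compared with a_mle = sum(X*Y)/sum(X**2), the extra 1/lam**2 in the
# denominator shrinks the estimate toward the prior mean 0.
```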

8. d. Under the following conditions, how do the prior and conditional likelihood curves change? Do $a_{\text{MLE}}$ and $a_{\text{MAP}}$ become closer together, or further apart?

| Condition | conditional likelihood $p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)$: wider, narrower, or same? | prior $p(a \mid \lambda)$: wider, narrower, or same? | $\lvert a_{\text{MLE}} - a_{\text{MAP}} \rvert$: increase or decrease? |
|---|---|---|---|
| As $\lambda \to \infty$ | | | |
| As $\lambda \to 0$ | | | |
| More data: as $n \to \infty$ (fixed $\lambda$) | | | |

9. Answer:

| Condition | conditional likelihood $p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)$ | prior $p(a \mid \lambda)$ | $\lvert a_{\text{MLE}} - a_{\text{MAP}} \rvert$ |
|---|---|---|---|
| As $\lambda \to \infty$ | same | wider | decrease |
| As $\lambda \to 0$ | same | narrower | increase |
| More data: as $n \to \infty$ (fixed $\lambda$) | narrower | same | decrease |
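
A self-contained numerical sketch of the two trends involving $\lvert a_{\text{MLE}} - a_{\text{MAP}} \rvert$. The helper gap and all concrete values (a_true, the grids of lambda and n) are illustrative assumptions, not part of the original problem.

```python
import numpy as np

def gap(X, Y, lam):
    """|a_MLE - a_MAP| for the one-parameter model with sigma = 1."""
    a_mle = np.sum(X * Y) / np.sum(X ** 2)
    a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
    return abs(a_mle - a_map)

rng = np.random.default_rng(1)
a_true = 2.0

# Fixed dataset, varying prior width: a wider prior (larger lambda) pulls
# a_MAP toward a_MLE, so the gap shrinks; a narrow prior pulls a_MAP toward 0.
X = rng.uniform(-3.0, 3.0, size=50)
Y = a_true * X + rng.normal(0.0, 1.0, size=50)
for lam in (0.1, 1.0, 10.0, 100.0):
    print(f"lam={lam:6.1f}  gap={gap(X, Y, lam):.4f}")

# Growing dataset, fixed lambda: the likelihood dominates the prior,
# so the gap again shrinks.
for n in (10, 100, 1000, 10000):
    X = rng.uniform(-3.0, 3.0, size=n)
    Y = a_true * X + rng.normal(0.0, 1.0, size=n)
    print(f"n={n:6d}      gap={gap(X, Y, 1.0):.4f}")
```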

10. Linear Regression – the general case:
• MLE and the Least Squares Loss function
  ◦ Nonlinear Regression
• [L2/Ridge] Regularization; MAP
(CMU, 2015 spring, Alex Smola, HW1, pr. 2)

11. The objective of this problem is to gain knowledge of linear regression, Maximum Likelihood Estimation (MLE), Maximum a Posteriori (MAP) estimation, and the variants of regression problems obtained by introducing regularization terms.

Part I: Linear Regression — MLE and Least Squares

Consider a linear model with some Gaussian noise:

$$Y_i = X_i \cdot w + b + \varepsilon_i, \quad \text{where } \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \ldots, n, \tag{1}$$

where $Y_i \in \mathbb{R}$ is a scalar, $X_i \in \mathbb{R}^d$ is a $d$-dimensional vector, $b \in \mathbb{R}$ is a constant, $w \in \mathbb{R}^d$ is a $d$-dimensional weight on $X_i$, and $\varepsilon_i$ is i.i.d. Gaussian noise with variance $\sigma^2$. Given the data $X_i$, $i = 1, \ldots, n$, our goal is to estimate $w$ and $b$, which specify the model. We will show that solving the linear model (1) with the MLE method is the same as solving the following Least Squares problem:

$$\arg\min_\beta (Y - X'\beta)^\top (Y - X'\beta), \tag{2}$$

where $Y = (Y_1, \ldots, Y_n)^\top$, $X_i' = (1, X_i)^\top$, $X' = (X_1', \ldots, X_n')^\top$ and $\beta = (b, w)^\top$.

12. a. From the model (1), derive the conditional distribution of $Y_i \mid X_i, w, b$. Remember that $X_i$ is a fixed data point.

Answer: Note that $Y_i \mid X_i; w, b \sim \mathcal{N}(X_i \cdot w + b, \sigma^2)$, thus we can write the p.d.f. of $Y_i \mid X_i, w, b$ in the following form:

$$f(Y_i = y_i \mid X_i; w, b) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2}\right).$$

13. b. Assuming the $\varepsilon_i$, $i = 1, \ldots, n$, are i.i.d., give an explicit expression for the log-likelihood $\ell(Y \mid \beta)$ of the data. Note: The notation for $Y$ and $\beta$ was given at (2). Given that the $\varepsilon_i$'s are i.i.d., it follows that $P(Y \mid \beta) = \prod_i P(Y_i \mid w, b)$. Remark that we are just omitting $X_i$ for convenience, as the problem explicitly tells us that the $X_i$ are fixed points.

Answer: Given $y = (y_1, \ldots, y_n)^\top$, since the $Y_i$ are independent (the $\varepsilon_i$'s are i.i.d. and the $X_i$'s are given), the likelihood of $Y \mid \beta$ is as follows:

$$f(Y = y \mid \beta) = \prod_{i=1}^n f(y_i \mid w, b) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2}\right) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \exp\!\left(-\frac{\sum_{i=1}^n (y_i - X_i \cdot w - b)^2}{2\sigma^2}\right).$$

Now, taking the $\ln$, the log-likelihood of $Y \mid \beta$ is as follows:

$$\ell(Y = y \mid \beta) = -n \ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - X_i \cdot w - b)^2. \tag{3}$$
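
A minimal sketch evaluating the log-likelihood in equation (3) on synthetic data. All concrete values (n, d, w, b, sigma) are illustrative assumptions, and the cross-check against per-point Gaussian log-densities assumes SciPy is installed.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d = 100, 3                                   # assumed sizes for illustration
X = rng.normal(size=(n, d))                     # rows are the fixed inputs X_i
w, b, sigma = rng.normal(size=d), 0.5, 1.5      # assumed model parameters
y = X @ w + b + rng.normal(0.0, sigma, size=n)  # Y_i = X_i . w + b + eps_i

# Equation (3): -n ln(sqrt(2 pi) sigma) - (1 / (2 sigma^2)) * sum of squared residuals.
resid = y - X @ w - b
loglik = -n * np.log(np.sqrt(2 * np.pi) * sigma) - np.sum(resid ** 2) / (2 * sigma ** 2)

# Cross-check: summing the Gaussian log-densities gives the same number.
assert np.isclose(loglik, norm.logpdf(y, loc=X @ w + b, scale=sigma).sum())
```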

14. c. Now show that solving for the $\beta$ that maximizes the log-likelihood (i.e., MLE) is the same as solving the Least Squares problem of (2).

Answer: To maximize the log-likelihood $\ell(Y = y \mid \beta)$, we want to focus on the second term, since the first term of the log-likelihood (3) is a constant. In short, to maximize the second term of (3), we want to minimize $\sum_{i=1}^n (y_i - X_i \cdot w - b)^2$. Writing it in matrix-vector form, we get:

$$\arg\max_\beta \ell(Y = y \mid \beta) = \arg\min_\beta \sum_{i=1}^n (y_i - X_i \cdot w - b)^2 = \arg\min_\beta (y - X'\beta)^\top (y - X'\beta),$$

where again $Y = (Y_1, \ldots, Y_n)^\top$, $X_i' = (1, X_i)^\top$, $X' = (X_1', \ldots, X_n')^\top$ and $\beta = (b, w)^\top$.

15. d. Derive the $\beta$ that maximizes the log-likelihood. Hint: You may find useful the following formulas: [a]

$$\frac{\partial}{\partial z}(XY) = X\frac{\partial Y}{\partial z} + \frac{\partial X}{\partial z}Y \quad (6c) \qquad \frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a \quad (5a) \qquad \frac{\partial}{\partial X} X^\top A X = (A + A^\top) X \quad (5b)$$

Answer: Setting the objective function $J(\beta)$ as $J(\beta) = (y - X'\beta)^\top (y - X'\beta)$ implies (see rules (6c), (5a) and (5b) in the document mentioned in the Hint):

$$\nabla_\beta J(\beta) = 2 X'^\top (X'\beta - y).$$

The log-likelihood maximizer $\hat\beta$ can be found by solving the following optimality condition:

$$\nabla_\beta J(\hat\beta) = 0 \;\Leftrightarrow\; X'^\top (X'\hat\beta - y) = 0 \;\Leftrightarrow\; X'^\top X' \hat\beta = X'^\top y \;\Leftrightarrow\; \hat\beta = (X'^\top X')^{-1} X'^\top y. \tag{4}$$

Note that taking $(X'^\top X')^{-1}$ is possible because it was assumed that $X'$ has full column rank.

[a] From Matrix Identities, Sam Roweis, 1999, http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf
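
A sketch of the closed-form solution (4) via the normal equations, assuming the X and y arrays from the previous snippet. The np.linalg.lstsq call is only a cross-check of the same solution.

```python
import numpy as np

Xp = np.hstack([np.ones((X.shape[0], 1)), X])     # rows X'_i = (1, X_i)
beta_hat = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)   # solve X'^T X' beta = X'^T y
b_hat, w_hat = beta_hat[0], beta_hat[1:]          # beta = (b, w)

# Numerically, a least-squares solver is usually preferred over forming the
# normal equations; with full column rank it gives the same estimate.
beta_lstsq, *_ = np.linalg.lstsq(Xp, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```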
