Regression Methods


0. Regression Methods

1. Linear Regression and Logistic Regression: definitions, and a common property (CMU, 2004 fall, Andrew Moore, HW2, pr. 4)

2. Linear Regression and Logistic Regression: Definitions

Given an input vector $X$, linear regression models a real-valued output $Y$ as $Y \mid X \sim \text{Normal}(\mu(X), \sigma^2)$, where
$$\mu(X) = \beta^\top X = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p.$$
Given an input vector $X$, logistic regression models a binary output $Y$ as $Y \mid X \sim \text{Bernoulli}(\theta(X))$, where the Bernoulli parameter is related to $\beta^\top X$ by the logit transformation
$$\text{logit}(\theta(X)) \stackrel{\text{def.}}{=} \log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X.$$
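
A minimal numerical sketch of the two model forms (the coefficients and the input vector below are made up; the sigmoid form of $\theta(X)$ is the inverse of the logit, derived two slides below):

```python
import numpy as np

# Hypothetical parameter vector beta = (beta_0, beta_1, beta_2) and one input X with X_0 = 1.
beta = np.array([0.5, -1.0, 2.0])
X = np.array([1.0, 0.3, -0.7])

eta = beta @ X                        # beta^T X

mu = eta                              # linear regression:   E[Y | X] = mu(X) = beta^T X
theta = 1.0 / (1.0 + np.exp(-eta))    # logistic regression: P(Y = 1 | X) = theta(X), logit(theta(X)) = beta^T X

print(mu, theta)
```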

3. a. For each of the two regression models defined above, write the log likelihood function and its gradient with respect to the parameter vector $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$.

Answer: For linear regression, we can write the log likelihood function as:
$$\begin{aligned}
LL(\beta) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{-(y_i - \mu(x_i))^2}{2\sigma^2} \right)
= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{-(y_i - \beta^\top x_i)^2}{2\sigma^2} \right) \\
&= -n \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta^\top x_i)^2.
\end{aligned}$$
Therefore, its gradient is:
$$\nabla_\beta LL(\beta) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \beta^\top x_i)\, x_i.$$
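
As a quick sanity check (a sketch on synthetic data, not part of the original exercise), the log likelihood and its gradient can be coded directly and the gradient compared against a finite-difference approximation:

```python
import numpy as np

def linreg_loglik(beta, X, y, sigma=1.0):
    """Log likelihood: -n*log(sqrt(2*pi)*sigma) - sum((y_i - beta^T x_i)^2) / (2*sigma^2)."""
    resid = y - X @ beta
    return -len(y) * np.log(np.sqrt(2 * np.pi) * sigma) - np.sum(resid ** 2) / (2 * sigma ** 2)

def linreg_grad(beta, X, y, sigma=1.0):
    """Gradient: (1/sigma^2) * sum_i (y_i - beta^T x_i) * x_i."""
    return X.T @ (y - X @ beta) / sigma ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
beta = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (linreg_loglik(beta + eps * e, X, y) - linreg_loglik(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, linreg_grad(beta, X, y), atol=1e-4))  # expected: True
```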

4. For logistic regression:
$$\log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X
\;\Leftrightarrow\; \frac{\theta(X)}{1 - \theta(X)} = e^{\beta^\top X}
\;\Leftrightarrow\; e^{\beta^\top X} = \theta(X)\,(1 + e^{\beta^\top X}).$$
Therefore,
$$\theta(X) = \frac{e^{\beta^\top X}}{1 + e^{\beta^\top X}} = \frac{1}{1 + e^{-\beta^\top X}}
\qquad\text{and}\qquad
1 - \theta(X) = \frac{1}{1 + e^{\beta^\top X}}.$$
Note that $Y \mid X \sim \text{Bernoulli}(\theta(X))$ means that $P(Y = 1 \mid X) = \theta(X)$ and $P(Y = 0 \mid X) = 1 - \theta(X)$, which can be equivalently written as $P(Y = y \mid X) = \theta(X)^y (1 - \theta(X))^{1 - y}$ for all $y \in \{0, 1\}$.
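
The equivalences above are easy to verify numerically; a small sketch (the function names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    # theta(X) as a function of z = beta^T X: e^z / (1 + e^z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # log(p / (1 - p)), the inverse of the sigmoid
    return np.log(p / (1.0 - p))

z = np.linspace(-5.0, 5.0, 11)
theta = sigmoid(z)
print(np.allclose(theta, np.exp(z) / (1.0 + np.exp(z))))  # the two forms of theta agree: True
print(np.allclose(logit(theta), z))                       # logit(theta(X)) recovers beta^T X: True
```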

5. So, in this case the log likelihood function is:
$$\begin{aligned}
LL(\beta) &= \log \prod_{i=1}^{n} \theta(x_i)^{y_i} (1 - \theta(x_i))^{1 - y_i} \\
&= \sum_{i=1}^{n} \left\{ y_i \log \theta(x_i) + (1 - y_i) \log(1 - \theta(x_i)) \right\} \\
&= \sum_{i=1}^{n} \left\{ y_i \left( \beta^\top x_i + \log(1 - \theta(x_i)) \right) + (1 - y_i) \log(1 - \theta(x_i)) \right\} \\
&= \sum_{i=1}^{n} \left\{ y_i\, \beta^\top x_i - \log(1 + e^{\beta^\top x_i}) \right\}.
\end{aligned}$$
And therefore,
$$\nabla_\beta LL(\beta) = \sum_{i=1}^{n} \left( y_i\, x_i - \frac{e^{\beta^\top x_i}}{1 + e^{\beta^\top x_i}}\, x_i \right) = \sum_{i=1}^{n} (y_i - \theta(x_i))\, x_i.$$
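
A corresponding sketch for the logistic case (synthetic data, generic names), again checking the gradient against finite differences; $\log(1 + e^{z})$ is computed with `np.logaddexp` for numerical stability:

```python
import numpy as np

def logreg_loglik(beta, X, y):
    """LL(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def logreg_grad(beta, X, y):
    """Gradient: sum_i (y_i - theta(x_i)) * x_i."""
    theta = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - theta)

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = (rng.uniform(size=80) < 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]))))).astype(float)
beta = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (logreg_loglik(beta + eps * e, X, y) - logreg_loglik(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, logreg_grad(beta, X, y), atol=1e-4))  # expected: True
```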

6. Remark: Actually, in the above solutions the full log likelihood function should first look like the following:
$$\begin{aligned}
\text{log-likelihood} &= \log \prod_{i=1}^{n} p(x_i, y_i)
= \log \prod_{i=1}^{n} \left( p_{Y|X}(y_i \mid x_i)\, p_X(x_i) \right)
= \log \left( \prod_{i=1}^{n} p_{Y|X}(y_i \mid x_i) \cdot \prod_{i=1}^{n} p_X(x_i) \right) \\
&= \sum_{i=1}^{n} \log p_{Y|X}(y_i \mid x_i) + \sum_{i=1}^{n} \log p_X(x_i)
= LL + LL_x.
\end{aligned}$$
Because $LL_x$ does not depend on the parameter $\beta$, when doing MLE we can just consider maximizing $LL$.

7. b. Show that for each of the two regression models above, the MLE $\hat{\beta}$ has the following property:
$$\sum_{i=1}^{n} y_i\, x_i = \sum_{i=1}^{n} E[Y \mid X = x_i, \beta = \hat{\beta}]\, x_i.$$
Answer:

For linear regression:
$$\nabla_\beta LL(\beta) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i\, x_i = \sum_{i=1}^{n} (\hat{\beta}^\top x_i)\, x_i.$$
Since $Y \mid X \sim \text{Normal}(\mu(X), \sigma^2)$, we have $E[Y \mid X = x_i, \beta = \hat{\beta}] = \mu(x_i) = \hat{\beta}^\top x_i$. So $\sum_{i=1}^{n} y_i\, x_i = \sum_{i=1}^{n} E[Y \mid X = x_i, \beta = \hat{\beta}]\, x_i$.

For logistic regression:
$$\nabla_\beta LL(\beta) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i\, x_i = \sum_{i=1}^{n} \theta(x_i)\, x_i.$$
Since $Y \mid X \sim \text{Bernoulli}(\theta(X))$, we have $E[Y \mid X = x_i, \beta = \hat{\beta}] = \theta(x_i) = \dfrac{e^{\hat{\beta}^\top x_i}}{1 + e^{\hat{\beta}^\top x_i}}$. So $\sum_{i=1}^{n} y_i\, x_i = \sum_{i=1}^{n} E[Y \mid X = x_i, \beta = \hat{\beta}]\, x_i$.
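
The property can also be checked empirically. Below is a sketch on synthetic data (the fitting routines are generic, not taken from the homework): the linear MLE comes from the normal equations and the logistic MLE from a few Newton steps, after which both sums agree to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # intercept column plus two features
beta_true = np.array([0.5, 1.0, -2.0])

# Linear regression: MLE via the normal equations.
y_lin = X @ beta_true + rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_lin)
print(np.allclose(X.T @ y_lin, X.T @ (X @ beta_hat)))            # sum y_i x_i == sum mu(x_i) x_i

# Logistic regression: MLE via Newton's method on LL(beta).
y_log = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
beta_hat = np.zeros(p)
for _ in range(25):
    theta = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    grad = X.T @ (y_log - theta)                                 # gradient of LL
    hess = X.T @ (X * (theta * (1.0 - theta))[:, None])          # minus the Hessian of LL
    beta_hat += np.linalg.solve(hess, grad)
theta = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
print(np.allclose(X.T @ y_log, X.T @ theta))                     # sum y_i x_i == sum theta(x_i) x_i
```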

8. Linear Regression with only one parameter; MLE and MAP estimation (CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3)

9. Consider real-valued variables $X$ and $Y$. The $Y$ variable is generated, conditional on $X$, by the following process:
$$Y = aX + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$
where each $\varepsilon$ is an independent noise term drawn from a Gaussian distribution with mean $0$ and standard deviation $\sigma$. This is a one-feature linear regression model, where $a$ is the only weight parameter. The conditional probability of $Y$ has the distribution $p(Y \mid X, a) \sim N(aX, \sigma^2)$, so it can be written as
$$p(Y \mid X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y - aX)^2 \right).$$
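
For concreteness, data from this one-parameter model can be simulated as follows (the particular values of $a$, $\sigma$, and $n$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, n = 1.5, 2.0, 100                     # illustrative values only

X = rng.uniform(-3.0, 3.0, size=n)                   # any fixed inputs will do
Y = a_true * X + rng.normal(0.0, sigma, size=n)      # Y = a X + eps, eps ~ N(0, sigma^2)
```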

10. MLE estimation

a. Assume we have a training dataset of $n$ pairs $(X_i, Y_i)$ for $i = 1, \ldots, n$, and $\sigma$ is known. Which of the following equations correctly represent the maximum likelihood problem for estimating $a$? Say yes or no to each one. More than one of them should have the answer yes.

i. $\displaystyle \arg\max_a \sum_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

ii. $\displaystyle \arg\max_a \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

iii. $\displaystyle \arg\max_a \sum_i \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

iv. $\displaystyle \arg\max_a \prod_i \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

v. $\displaystyle \arg\max_a \sum_i (Y_i - aX_i)^2$

vi. $\displaystyle \arg\min_a \sum_i (Y_i - aX_i)^2$

11. Answer:
$$L_D(a) \stackrel{\text{def.}}{=} p(Y_1, \ldots, Y_n \mid a) = p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)
\stackrel{\text{i.i.d.}}{=} \prod_{i=1}^{n} p(Y_i \mid X_i, a)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$$
Therefore
$$\begin{aligned}
a_{MLE} &\stackrel{\text{def.}}{=} \arg\max_a L_D(a)
= \arg\max_a \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) && (ii.) \\
&= \arg\max_a \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \prod_{i=1}^{n} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)
= \arg\max_a \prod_{i=1}^{n} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) && (iv.) \\
&= \arg\max_a \ln \prod_{i=1}^{n} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)
= \arg\max_a \sum_{i=1}^{n} -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \\
&= \arg\min_a \sum_{i=1}^{n} (Y_i - aX_i)^2 && (vi.)
\end{aligned}$$

12. b. Derive the maximum likelihood estimate of the parameter $a$ in terms of the training examples $X_i$ and $Y_i$. We recommend you start with the simplest form of the problem you found above.

Answer:
$$a_{MLE} = \arg\min_a \sum_{i=1}^{n} (Y_i - aX_i)^2
= \arg\min_a \left( a^2 \sum_{i=1}^{n} X_i^2 - 2a \sum_{i=1}^{n} X_i Y_i + \sum_{i=1}^{n} Y_i^2 \right).$$
This is a convex quadratic in $a$, so its minimum is attained at the vertex:
$$a_{MLE} = \frac{-\left( -2 \sum_{i=1}^{n} X_i Y_i \right)}{2 \sum_{i=1}^{n} X_i^2} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}.$$
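
A one-line implementation of this estimator, with a quick check on simulated data (the true slope and noise level are made up):

```python
import numpy as np

def a_mle(X, Y):
    """Closed-form MLE for the one-parameter model: sum(X_i * Y_i) / sum(X_i^2)."""
    return np.sum(X * Y) / np.sum(X ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=1000)
Y = 1.5 * X + rng.normal(scale=2.0, size=1000)
print(a_mle(X, Y))   # should be close to the true slope 1.5
```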

13. MAP estimation

Let's put a prior on $a$. Assume $a \sim N(0, \lambda^2)$, so
$$p(a \mid \lambda) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left( -\frac{1}{2\lambda^2} a^2 \right).$$
The posterior probability of $a$ is
$$p(a \mid Y_1, \ldots, Y_n, X_1, \ldots, X_n, \lambda)
= \frac{p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)\, p(a \mid \lambda)}
{\int_{a'} p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a')\, p(a' \mid \lambda)\, da'}.$$
We can ignore the denominator when doing MAP estimation.

c. Assume $\sigma = 1$ and a fixed prior parameter $\lambda$. Solve for the MAP estimate of $a$,
$$\arg\max_a \left[ \ln p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) + \ln p(a \mid \lambda) \right].$$
Your solution should be in terms of the $X_i$'s, $Y_i$'s, and $\lambda$.

14. Answer:
$$\begin{aligned}
p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) \cdot p(a \mid \lambda)
&= \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \right]
\cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left( -\frac{a^2}{2\lambda^2} \right) \\
&\stackrel{\sigma = 1}{=} \left[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (Y_i - aX_i)^2 \right) \right]
\cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\left( -\frac{a^2}{2\lambda^2} \right)
\end{aligned}$$
Therefore the MAP optimization problem is
$$\begin{aligned}
&\arg\max_a \left[ n \ln \frac{1}{\sqrt{2\pi}} - \frac{1}{2} \sum_{i=1}^{n} (Y_i - aX_i)^2 + \ln \frac{1}{\sqrt{2\pi}\,\lambda} - \frac{1}{2\lambda^2} a^2 \right]
= \arg\max_a \left[ -\frac{1}{2} \sum_{i=1}^{n} (Y_i - aX_i)^2 - \frac{1}{2\lambda^2} a^2 \right] \\
&= \arg\min_a \left[ \sum_{i=1}^{n} (Y_i - aX_i)^2 + \frac{a^2}{\lambda^2} \right]
= \arg\min_a \left[ a^2 \left( \sum_{i=1}^{n} X_i^2 + \frac{1}{\lambda^2} \right) - 2a \sum_{i=1}^{n} X_i Y_i + \sum_{i=1}^{n} Y_i^2 \right]
\end{aligned}$$
$$\Rightarrow\; a_{MAP} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2 + \dfrac{1}{\lambda^2}}.$$
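
The corresponding estimator in code. Compared with $a_{MLE}$, the extra $1/\lambda^2$ in the denominator shrinks the estimate toward the prior mean $0$ (the data and the value of $\lambda$ below are illustrative):

```python
import numpy as np

def a_map(X, Y, lam):
    """MAP estimate under the N(0, lambda^2) prior, with sigma = 1."""
    return np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=30)
Y = 1.5 * X + rng.normal(size=30)
print(np.sum(X * Y) / np.sum(X ** 2))   # a_MLE
print(a_map(X, Y, lam=0.1))             # a_MAP, pulled toward the prior mean 0
```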

15. d. Under the following conditions, how do the prior and conditional likelihood curves change? Do $a_{MLE}$ and $a_{MAP}$ become closer together, or further apart?

| Condition | Conditional likelihood $p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)$: wider, narrower, or same? | Prior $p(a \mid \lambda)$: wider, narrower, or same? | $\lvert a_{MLE} - a_{MAP} \rvert$: increase or decrease? |
|---|---|---|---|
| As $\lambda \to \infty$ | | | |
| As $\lambda \to 0$ | | | |
| More data: as $n \to \infty$ (fixed $\lambda$) | | | |

16. Answer:

| Condition | Conditional likelihood $p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a)$ | Prior $p(a \mid \lambda)$ | $\lvert a_{MLE} - a_{MAP} \rvert$ |
|---|---|---|---|
| As $\lambda \to \infty$ | same | wider | decrease |
| As $\lambda \to 0$ | same | narrower | increase |
| More data: as $n \to \infty$ (fixed $\lambda$) | narrower | same | decrease |
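
The three rows of the table can be reproduced numerically; a rough sketch with made-up values, where only the trend in $\lvert a_{MLE} - a_{MAP} \rvert$ matters:

```python
import numpy as np

def a_mle(X, Y):
    return np.sum(X * Y) / np.sum(X ** 2)

def a_map(X, Y, lam):
    return np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)

rng = np.random.default_rng(0)

def gap(n, lam):
    """|a_MLE - a_MAP| on one simulated dataset of size n with prior width lam."""
    X = rng.uniform(-3.0, 3.0, size=n)
    Y = 1.5 * X + rng.normal(size=n)
    return abs(a_mle(X, Y) - a_map(X, Y, lam))

print(gap(n=30, lam=100.0))    # lambda large: nearly flat prior, tiny gap
print(gap(n=30, lam=0.01))     # lambda small: strong prior toward 0, large gap
print(gap(n=30000, lam=0.01))  # much more data (same lambda): the likelihood dominates, gap shrinks
```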

17. Linear Regression in $\mathbb{R}^2$ [without "intercept" term] with either Gaussian or Laplace noise (CMU, 2009 fall, Carlos Guestrin, HW3, pr. 1.5.2; CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 2)
