Regression Methods
1. Linear Regression with only one parameter, and without offset; MLE and MAP estimation
CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, midterm, pr. 3
Consider real-valued variables $X$ and $Y$. The $Y$ variable is generated, conditional on $X$, from the following process:

$$\varepsilon \sim N(0, \sigma^2), \qquad Y = aX + \varepsilon,$$

where every $\varepsilon$ is an independent variable, called a noise term, which is drawn from a Gaussian distribution with mean 0 and standard deviation $\sigma$. This is a one-feature linear regression model, where $a$ is the only weight parameter. The conditional probability of $Y$ has the distribution $p(Y|X, a) \sim N(aX, \sigma^2)$, so it can be written as

$$p(Y|X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y - aX)^2\right).$$
MLE estimation

Which of the following equations correctly represent the maximum likelihood problem for estimating $a$? Say yes or no to each one. More than one of them should have the answer yes.

i. $\arg\max_a \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

ii. $\arg\max_a \sum_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

iii. $\arg\max_a \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

iv. $\arg\max_a \sum_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$

v. $\arg\max_a -\sum_{i=1}^n (Y_i - aX_i)^2$

vi. $\arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2$
Answer:

$$L_D(a) \stackrel{\text{def.}}{=} p(Y_1, \ldots, Y_n|a) = p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a) \stackrel{\text{i.i.d.}}{=} \prod_{i=1}^n p(Y_i|X_i, a) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right)$$

$$a_{MLE} \stackrel{\text{def.}}{=} \arg\max_a L_D(a) = \arg\max_a \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \quad \text{(i.)}$$
$$= \arg\max_a \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) = \arg\max_a \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \quad \text{(iii.)}$$
$$= \arg\max_a \ln \prod_{i=1}^n \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) = \arg\max_a \sum_{i=1}^n -\frac{1}{2\sigma^2}(Y_i - aX_i)^2 = \arg\max_a -\sum_{i=1}^n (Y_i - aX_i)^2 \quad \text{(v.)}$$
$$= \arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2 \quad \text{(vi.)}$$

The chain of equalities shows that formulations i, iii, v and vi all describe the same maximization problem (answer: yes), while the sum-of-densities forms ii and iv do not (answer: no).
Derive a closed-form expression for $a_{MLE}$ by solving the simplest form of the problem you found above.

Answer:

$$a_{MLE} = \arg\min_a \sum_{i=1}^n (Y_i - aX_i)^2 = \arg\min_a \left( a^2 \sum_{i=1}^n X_i^2 - 2a \sum_{i=1}^n X_i Y_i + \sum_{i=1}^n Y_i^2 \right).$$

Setting the derivative with respect to $a$ to zero, $2a \sum_{i=1}^n X_i^2 - 2 \sum_{i=1}^n X_i Y_i = 0$, hence

$$a_{MLE} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}.$$
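As a quick numerical sanity check (not part of the original exam solution; the data, true slope and seed below are invented), this sketch compares the closed-form $a_{MLE} = \sum_i X_i Y_i / \sum_i X_i^2$ with a brute-force grid minimization of the squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, sigma, n = 2.5, 1.0, 200                 # hypothetical ground truth
X = rng.uniform(-3, 3, size=n)
Y = a_true * X + rng.normal(0, sigma, size=n)    # Y = aX + eps

# Closed-form MLE: a = sum(X_i Y_i) / sum(X_i^2)
a_mle = np.sum(X * Y) / np.sum(X ** 2)

# Brute-force check: minimize sum_i (Y_i - a X_i)^2 over a grid of candidate a values
grid = np.linspace(0, 5, 10001)
sse = ((Y[None, :] - grid[:, None] * X[None, :]) ** 2).sum(axis=1)
a_grid = grid[np.argmin(sse)]

print(a_mle, a_grid)   # the two estimates should agree closely
```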
MAP estimation

Let's put a prior on $a$. Assume $a \sim N(0, \lambda^2)$, so

$$p(a|\lambda) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{1}{2\lambda^2} a^2\right).$$

By Bayes' rule,

$$p(a|Y_1, \ldots, Y_n, X_1, \ldots, X_n, \lambda) = \frac{p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a)\, p(a|\lambda)}{p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, \lambda)}.$$

We can ignore the denominator when doing MAP estimation, so

$$a_{MAP} = \arg\max_a \left[ \ln p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a) + \ln p(a|\lambda) \right].$$

Derive a closed-form expression for $a_{MAP}$. Your solution should be in terms of the $X_i$'s, the $Y_i$'s, and $\lambda$.
Answer:

$$p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a) \cdot p(a|\lambda) = \left[ \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\right) \right] \cdot \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{1}{2\lambda^2} a^2\right)$$

$$a_{MAP} = \arg\max_a \left[ \sum_{i=1}^n \left( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}(Y_i - aX_i)^2 \right) + \ln \frac{1}{\sqrt{2\pi}\,\lambda} - \frac{1}{2\lambda^2} a^2 \right]$$
$$= \arg\max_a \left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - aX_i)^2 - \frac{1}{2\lambda^2} a^2 \right] = \arg\min_a \left[ \sum_{i=1}^n (Y_i - aX_i)^2 + \frac{\sigma^2}{\lambda^2} a^2 \right].$$

Setting the derivative with respect to $a$ to zero, $2a \sum_{i=1}^n X_i^2 + \frac{2\sigma^2}{\lambda^2} a - 2\sum_{i=1}^n X_i Y_i = 0$, hence

$$a_{MAP} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2 + \sigma^2/\lambda^2},$$

which, taking $\sigma = 1$ as the question's phrasing ("in terms of the $X_i$'s, $Y_i$'s, and $\lambda$") suggests, reduces to $a_{MAP} = \dfrac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2 + 1/\lambda^2}$.
For each of the scenarios below, how do the following quantities change? Do $a_{MLE}$ and $a_{MAP}$ become closer together, or further apart?

Answer:

As $\lambda \to \infty$: the prior $p(a|\lambda)$ becomes wider; the conditional likelihood $p(Y_1, \ldots, Y_n|X_1, \ldots, X_n, a)$ stays the same; $|a_{MLE} - a_{MAP}|$ decreases.

As $\lambda \to 0$: the prior becomes narrower; the conditional likelihood stays the same; $|a_{MLE} - a_{MAP}|$ increases.

More data, $n \to \infty$ (fixed $\lambda$): the prior stays the same; the conditional likelihood becomes narrower; $|a_{MLE} - a_{MAP}|$ decreases.
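The behaviour summarized above can be illustrated with a small simulation (my own sketch, assuming $\sigma = 1$ and invented data): $a_{MAP}$ is a shrunk version of $a_{MLE}$, the gap $|a_{MLE} - a_{MAP}|$ shrinks as $\lambda$ grows, and it also shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimates(n, lam, a_true=2.0, sigma=1.0):
    X = rng.uniform(-2, 2, size=n)
    Y = a_true * X + rng.normal(0, sigma, size=n)
    a_mle = np.sum(X * Y) / np.sum(X ** 2)
    # MAP with prior a ~ N(0, lam^2), taking sigma = 1:
    a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
    return a_mle, a_map

for lam in (0.01, 0.1, 1.0, 10.0):                 # wider prior -> smaller gap
    a_mle, a_map = estimates(n=50, lam=lam)
    print(f"lam={lam:5.2f}  |a_MLE - a_MAP| = {abs(a_mle - a_map):.4f}")

for n in (10, 100, 10000):                         # more data -> smaller gap
    a_mle, a_map = estimates(n=n, lam=0.1)
    print(f"n={n:6d}   |a_MLE - a_MAP| = {abs(a_mle - a_map):.4f}")
```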
The objective of this problem is to gain knowledge of linear regression, Maximum Likelihood Estimation (MLE), Maximum-a-Posteriori (MAP) estimation, and some of their variants.

Consider a linear model with Gaussian noise:

$$Y_i = X_i \cdot w + b + \varepsilon_i, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2),\ i = 1, \ldots, n, \qquad (1)$$

where $Y_i \in \mathbb{R}$ is a scalar, $X_i \in \mathbb{R}^d$ is a $d$-dimensional vector, $b \in \mathbb{R}$ is a constant, $w \in \mathbb{R}^d$ is a $d$-dimensional weight vector on $X_i$, and the $\varepsilon_i$ are i.i.d. Gaussian noise terms with variance $\sigma^2$. Given the data $X_i$, $i = 1, \ldots, n$, our goal is to estimate the $w$ and $b$ which specify the model. We will show that solving the linear model (1) with the MLE method is the same as solving the following Least Squares problem:

$$\arg\min_\beta\ (Y - X'\beta)^\top (Y - X'\beta), \qquad (2)$$

where $Y = (Y_1, \ldots, Y_n)^\top$, $X'_i = (1, X_i)^\top$, $X' = (X'_1, \ldots, X'_n)^\top$ and $\beta = (b, w)^\top$.
a. Write the conditional probability density function of $Y_i$ given $X_i$, $w$ and $b$. Note that $X_i$ is a fixed data point.

Answer:

Note that $Y_i|X_i; w, b \sim N(X_i \cdot w + b, \sigma^2)$, thus we can write the p.d.f. of $Y_i|X_i, w, b$ in the following form:

$$f(Y_i = y_i|X_i; w, b) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2}\right).$$
b. Assuming the $\varepsilon_i$, $i = 1, \ldots, n$ are i.i.d., give an explicit expression for the log-likelihood $\ell(Y|\beta)$ of the data. Note: the notation for $Y$ and $\beta$ was given at (2). Given that the $\varepsilon_i$'s are i.i.d., it follows that $P(Y|\beta) = \prod_i P(Y_i|w, b)$. Remark that we are just omitting $X_i$ for convenience, as the problem explicitly tells us that the $X_i$ are fixed points.

Answer:

Given $y = (y_1, \ldots, y_n)^\top$, since the $Y_i$ are independent (the $\varepsilon_i$ are i.i.d. and the $X_i$ are given), the likelihood of $Y|\beta$ is as follows:

$$f(Y = y|\beta) = \prod_{i=1}^n f(y_i|w, b) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - X_i \cdot w - b)^2}{2\sigma^2}\right) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\!\left(-\frac{\sum_{i=1}^n (y_i - X_i \cdot w - b)^2}{2\sigma^2}\right).$$

Now, taking the $\ln$, the log-likelihood of $Y|\beta$ is as follows:

$$\ell(Y = y|\beta) = -n \ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - X_i \cdot w - b)^2. \qquad (3)$$
c. Show that maximizing the log-likelihood is the same as solving the Least Squares problem of (2).

Answer:

To maximize the log-likelihood $\ell(Y = y|\beta)$, we want to focus on the second term, since the first term of the log-likelihood (3) is a constant. In short, to maximize the second term of (3), we want to minimize $\sum_{i=1}^n (y_i - X_i \cdot w - b)^2$. Writing it in matrix-vector form, we get:

$$\arg\max_\beta\ \ell(Y = y|\beta) = \arg\min_\beta \sum_{i=1}^n (y_i - X_i \cdot w - b)^2 = \arg\min_\beta\ (y - X'\beta)^\top (y - X'\beta),$$

where again, $Y = (Y_1, \ldots, Y_n)^\top$, $X'_i = (1, X_i)^\top$, $X' = (X'_1, \ldots, X'_n)^\top$ and $\beta = (b, w)^\top$.
d. Assuming that $X'$ has full column rank, derive the closed-form expression for the $\beta$ that solves (2).

Hint: You may find useful the following formulas:ᵃ (5a) $\frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a$; (5b) $\frac{\partial}{\partial X} X^\top A X = (A + A^\top) X$; (6c) $\frac{\partial (XY)}{\partial z} = X \frac{\partial Y}{\partial z} + \frac{\partial X}{\partial z} Y$.

Answer:

Setting the objective function $J(\beta) = (y - X'\beta)^\top (y - X'\beta)$ and applying rules (6c), (5a) and (5b) from the document mentioned in the Hint gives $\nabla_\beta J(\beta) = 2 X'^\top (X'\beta - y)$. The log-likelihood maximizer $\hat\beta$ can be found by solving the following optimality condition:

$$\nabla_\beta J(\hat\beta) = 0 \iff X'^\top (X'\hat\beta - y) = 0 \iff X'^\top X' \hat\beta = X'^\top y \iff \hat\beta = (X'^\top X')^{-1} X'^\top y. \qquad (4)$$

Note that $(X'^\top X')^{-1}$ exists because it was assumed that $X'$ has full column rank.

ᵃFrom Matrix Identities, Sam Roweis, 1999, http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf.
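A minimal numpy sketch of the closed-form solution (4) (illustration only; the dimensions, noise level and true parameters are invented): it builds $X'$ by prepending a column of ones and solves the normal equations $X'^\top X' \beta = X'^\top y$ rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true, b_true, sigma = np.array([1.0, -2.0, 0.5]), 0.7, 0.3
y = X @ w_true + b_true + rng.normal(0, sigma, size=n)

# X' = (1, X_i) for every example; beta = (b, w)
Xp = np.hstack([np.ones((n, 1)), X])

# Solve the normal equations X'^T X' beta = X'^T y (numerically safer than inverting)
beta_hat = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)
print("b_hat =", beta_hat[0], " w_hat =", beta_hat[1:])
```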
Part II. Suppose we map each $X_i$ (with $d = 1$, so $X_i \in \mathbb{R}$) to the feature vector $\varphi(X_i) = (1, X_i, X_i^2, \ldots, X_i^k)^\top$; then we can model $Y$ in the $k$th-order model as follows:

$$Y_i = \varphi(X_i) \cdot \beta' + \varepsilon_i \quad \text{with } \varepsilon_i \sim N(0, \sigma^2),\ i = 1, \ldots, n, \qquad (5)$$

where all the definitions are those from equation (1), except $\beta' \in \mathbb{R}^{k+1}$. As we did in Part I, show that if we use MLE and assume that $\varphi(X) \stackrel{\text{not.}}{=} (\varphi(X_1), \varphi(X_2), \ldots, \varphi(X_n))^\top$ has full column rank, then the optimal value for $\beta'$ in (5) is $(\varphi(X)^\top \varphi(X))^{-1} \varphi(X)^\top Y$. Hint: You are not expected to write out all the steps again. Focus on the change from the log-likelihood expressions of Part I, and derive the optimization problem.

Answer:

As the relation between $Y_i$ and $X_i$ has changed, the conditional p.d.f. and the likelihood function change as well. The expressions can be modified by simply replacing $X_i \cdot w + b$ with $\varphi(X_i) \cdot \beta'$, as follows:

$$f(Y_i = y_i|X_i; \beta') = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \varphi(X_i) \cdot \beta')^2}{2\sigma^2}\right)$$

$$\ell(Y = y|\beta') = -n \ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \varphi(X_i) \cdot \beta')^2.$$

As we did before, the maximum likelihood method can be expressed as

$$\arg\max_{\beta'}\ \ell(Y = y|\beta') = \arg\min_{\beta'} \sum_{i=1}^n (y_i - \varphi(X_i) \cdot \beta')^2 = \arg\min_{\beta'}\ (y - \varphi(X)\beta')^\top (y - \varphi(X)\beta'),$$

and thus the maximum likelihood estimator is $\hat\beta' = (\varphi(X)^\top \varphi(X))^{-1} \varphi(X)^\top y$, through similar steps as in Part I.
Part III. Important Note: The results from this Part hold for both linear (Part I) and nonlinear (Part II) regression. For the sake of simplicity, here we will write them in terms of $X$; to obtain the nonlinear versions, it is enough to replace $X$ by $\varphi(X)$, where $\varphi$ was defined in Part II.

a. The matrix $X^\top X$ may not be invertible.ᵃ Adding a regularization term ensures that $X^\top X + \lambda I$ becomes invertible, with $I$ the identity matrix of size $d$, and $\lambda > 0$.ᵇ Show that $\hat\beta' = (X^\top X + \lambda I)^{-1} X^\top Y$ is the solution of the optimization problem

$$\arg\min_{\beta'}\ \|Y - X\beta'\|_2^2 + \lambda \|\beta'\|_2^2. \qquad (6)$$

ᵃThis is the case when, notably, the number of features ($d$ in Part I, or $k+1$ in Part II) is larger than the number of instances $n$ (i.e., $d > n$). In this case, $\mathrm{rank}(X^\top X) = \mathrm{rank}(X)$, cf. Matrix Identities, by Sam Roweis, formula (2f). So $\mathrm{rank}(X^\top X)$ is less than or equal to $\min(n, d) = n$, which is smaller than $d$. Therefore the matrix $X^\top X$, which is $d \times d$, is not full rank and thus cannot be inverted.

ᵇFor any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is full rank and can be inverted. To see this, note that if two columns in $X^\top X$ were linearly dependent, then $\lambda I$ adds the same value ($\lambda$) to two different components of these columns, so they become linearly independent in $X^\top X + \lambda I$.
Answer:

The procedure is almost the same as the one in part d of Part I. First, set the objective

$$J'(\beta') = (Y - X\beta')^\top (Y - X\beta') + \lambda \|\beta'\|_2^2,$$

which, by rules (5b) and (5c), gives $\nabla_{\beta'} J'(\beta') = 2(X^\top X \beta' - X^\top Y) + 2\lambda \beta'$. The minimizer $\hat\beta'$ can be found by solving the following optimality condition:

$$\nabla_{\beta'} J'(\hat\beta') = 0 \iff X^\top X \hat\beta' - X^\top Y + \lambda \hat\beta' = 0 \iff (X^\top X + \lambda I)\hat\beta' = X^\top Y \iff \hat\beta' = (X^\top X + \lambda I)^{-1} X^\top Y.$$
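A short sketch of the Ridge estimator $\hat\beta' = (X^\top X + \lambda I)^{-1} X^\top Y$ (illustration only; $\lambda$ and the data are arbitrary). Note that it remains well defined even when $d > n$, where plain least squares fails.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 20, 50, 0.5          # more features than examples: X^T X is singular
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ridge: solve (X^T X + lam*I) beta = X^T y  (invertible for any lam > 0)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("ridge solution norm:", np.linalg.norm(beta_ridge))
```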
b. Assume now a Gaussian prior on the weights, $\beta' \sim N(0, \eta^2 I)$. Write the posterior distribution of $\beta'|Y_i$ given the $i$-th sample, and of $\beta'|Y$ given the whole data, respectively. Assume independence between $\beta'$ and the noise $\varepsilon_i \sim N(0, \sigma^2)$, for $i = 1, \ldots, n$.

Hint 1: Use Bayes' rule: $\Pr(\beta'|Y_i) = \dfrac{\Pr(Y_i|\beta') \Pr(\beta')}{\Pr(Y_i)}$. Then follow steps similar to Part I.

Hint 2: Make use of the fact that the sum of two independent Normal variables follows the Normal distribution, i.e., $X \sim N(a, b^2)$, $Y \sim N(c, d^2)$, $X \perp Y \Rightarrow X + Y \sim N(a + c, b^2 + d^2)$.

Answer:

We know that $Y_i|X_i, \beta' \sim N(X_i \cdot \beta', \sigma^2)$ and $\beta' \sim N(0, \eta^2 I)$. Using Bayes' rule, it follows that

$$h(\beta'|Y_i) \propto f(Y_i|\beta')\, g(\beta') \propto \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - X_i \cdot \beta')^2\right) \exp\!\left(-\frac{1}{2\eta^2} \beta'^\top \beta'\right),$$

where $g(\cdot)$ is the density function of $\beta'$ and $h(\cdot)$ is the density function of $\beta'|Y_i$. The normalizer for the p.d.f. $h$ is defined as $Z = \int_{-\infty}^{\infty} f(Y_i|\beta')\, g(\beta')\, d\beta'$, and rewriting the p.d.f.:

$$h(\beta'|Y_i) = \frac{1}{Z} \exp\!\left(-\frac{1}{2\sigma^2}(Y_i - X_i \cdot \beta')^2 - \frac{1}{2\eta^2} \beta'^\top \beta'\right).$$

Likewise, the p.d.f. of $\beta'|Y$ is

$$h(\beta'|Y) = \frac{1}{Z_n} \exp\!\left(-\frac{1}{2\sigma^2}(Y - X\beta')^\top (Y - X\beta') - \frac{1}{2\eta^2} \beta'^\top \beta'\right) = \frac{1}{Z_n} \exp\!\left(-\frac{1}{2\sigma^2}\|Y - X\beta'\|_2^2 - \frac{1}{2\eta^2}\|\beta'\|_2^2\right),$$

where $Z_n$ is the normalization factor, defined as $Z_n = \int_{-\infty}^{\infty} f(Y|\beta')\, g(\beta')\, d\beta'$.
c. $\hat\beta'_{MAP}$, the MAP estimate of $\beta'$, is defined as the value of $\beta'$ that establishes the mode of the a posteriori probability $P(\beta'|X, Y)$. Show that solving for the MAP estimate leads to the problem (6) if $\lambda$ can be expressed in terms of $\sigma$ and $\eta$, i.e., $\lambda = g(\sigma, \eta)$. Find the explicit expression for $g(\sigma, \eta)$.

Answer:

From the previous subproblem, it is clear that

$$\arg\max_{\beta'}\ h(\beta'|Y) = \arg\min_{\beta'}\ \frac{1}{2\sigma^2}\|Y - X\beta'\|_2^2 + \frac{1}{2\eta^2}\|\beta'\|_2^2. \qquad (7)$$

If $\eta^2$ equals $\frac{\sigma^2}{\lambda}$, then the minimization problem becomes equivalent to (6), $\arg\min_{\beta'} \|Y - X\beta'\|_2^2 + \lambda\|\beta'\|_2^2$. In other words, rearranging the terms, $\lambda = g(\sigma, \eta) = \frac{\sigma^2}{\eta^2}$ for (7) to become (6).
d. Explain how the regularization term can alleviate the potential problem. Note: Let's skip the case of the non-invertible matrix, as we already covered that case in Part II.

Answer:

The regularization term penalizes large components in $\beta$, which shrinks $\beta$ toward a smaller norm. As a consequence, the penalty term encourages the model to avoid letting individual (possibly noisy) training points influence $\beta$ drastically.
Many modern regression problems can have hundreds or thousands of predictor variables. If these variables are correlated, standard techniques will lead to overly complex models and poor generalization error. Another issue is that problems may have more predictor variables than examples. When this happens, standard regression models will fail. One technique that addresses both these issues is called Ridge regression. The idea is to modify the loss function by adding a penalty term on the weights to encourage them to be small. This penalty term is known as a regularizer and controls model complexity by putting large weight on only the most important predictor variables in the model. Ridge regression penalizes the squared length of the weight vector $\beta$. This is sometimes known as an L2 penalty because it is the square of the L2 norm (a.k.a. the Euclidean norm).

Let $D = \{y_i, x_i\}_{1 \ldots N}$ be a collection of $N$ training examples. Let $y_i$ be the response variable for example $i$, and let $y \in \mathbb{R}^N$ be the column vector of all response variables. Each training example also has $k$ predictor variables: let $x_i \in \mathbb{R}^k$ be the vector of predictor variables for example $i$, let $X \in \mathbb{R}^{N \times k}$ be the matrix of all predictor variables whose row $i$ is $x_i^\top$, and let $\beta \in \mathbb{R}^{k \times 1}$ be the column vector of weight parameters that we're trying to learn. Note: At CMU, 2015 spring, Alex Smola, HW1, pr. 2 we have shown (by using the vector derivative method) that the analytic solution to

$$\arg\min_\beta\ \|y - X\beta\|^2 + \lambda\|\beta\|^2$$

is given by $\hat\beta_{Ridge} = (X^\top X + \lambda I)^{-1} X^\top y$.
While the analytic solution is easy to implement, it can become intractable when there are very large numbers of predictor variables, because we need to compute $X^\top X$ and its inverse. One way to deal with this problem is to use an optimization technique called steepest (a.k.a. batch) descent. This is an iterative procedure that updates the weights at the $t$-th iteration as follows:

$$\beta_t = \beta_{t-1} - \eta \nabla_\beta L(\beta; D),$$

where $\eta$ is a learning rate constant and $\nabla_\beta L(\beta; D)$ is the gradient of the loss function $L(\beta; D)$. Compute the steepest descent update rule when $L(\beta; D)$ is:

$$L(\beta; D) = \frac{1}{2} \sum_{i=1}^N (y_i - x_i^\top \beta)^2 + \frac{\lambda}{2}\, \beta^\top \beta.$$

Hint: You may find useful the following formulas (from Matrix Identities, by Sam Roweis, 1999): (5a) $\frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a$; (5b) $\frac{\partial}{\partial X} X^\top A X = (A + A^\top) X$.
Answer:

$$L(\beta; D) = \frac{1}{2} \sum_{i=1}^N (y_i - x_i^\top \beta)^2 + \frac{\lambda}{2}\, \beta^\top \beta = \frac{1}{2} \sum_{i=1}^N \left( y_i^2 - 2 y_i x_i^\top \beta + \beta^\top x_i x_i^\top \beta \right) + \frac{\lambda}{2}\, \beta^\top \beta.$$

And now take the gradient, using (5a) and (5b):

$$\nabla_\beta L(\beta; D) = \frac{1}{2} \sum_{i=1}^N \left\{ -2 y_i x_i + 2 x_i x_i^\top \beta \right\} + \lambda \beta = \sum_{i=1}^N \left\{ -y_i x_i + x_i (x_i^\top \beta) \right\} + \lambda \beta.$$

So the final update becomes:

$$\beta_t = \beta_{t-1} - \eta \left( \sum_{i=1}^N \left\{ -y_i x_i + x_i (x_i^\top \beta_{t-1}) \right\} + \lambda \beta_{t-1} \right).$$
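The update rule above translates directly into code. The sketch below (my own, with arbitrary data, learning rate and $\lambda$) runs the batch update and compares the result against the analytic Ridge solution it should converge to.

```python
import numpy as np

rng = np.random.default_rng(4)
N, k, lam, eta = 200, 5, 1.0, 1e-3
X = rng.normal(size=(N, k))
y = X @ rng.normal(size=k) + 0.1 * rng.normal(size=N)

beta = np.zeros(k)
for _ in range(5000):
    grad = X.T @ (X @ beta - y) + lam * beta   # sum_i (x_i^T beta - y_i) x_i + lam * beta
    beta -= eta * grad

beta_exact = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
print(np.max(np.abs(beta - beta_exact)))       # should be small
```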
Another type of update is called stochastic (a.k.a. incremental) gradient descent, which updates the weights using the gradient computed at each single example. This is extremely useful for very large datasets and also when data examples stream in over time; it is sometimes called an online learning technique. The update rule is given by:

$$\beta_t = \beta_{t-1} - \eta \nabla_\beta L_i(\beta_{t-1}; x_i, y_i),$$

where $\eta$ is a learning rate constant and $\nabla_\beta L_i(\beta; x_i, y_i)$ is the gradient of the loss function at a particular example $i$. Compute the stochastic descent update rule when $L_i(\beta; x_i, y_i)$ is:

$$L_i(\beta; x_i, y_i) = \frac{1}{2}(y_i - x_i^\top \beta)^2 + \frac{\lambda}{2}\, \beta^\top \beta.$$

Answer:

In a similar fashion to the batch method we arrive at:

$$\beta_t = \beta_{t-1} - \eta \left( -y_i x_i + x_i (x_i^\top \beta_{t-1}) + \lambda \beta_{t-1} \right).$$
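A matching sketch of the stochastic update, processing one $(x_i, y_i)$ at a time (again an illustration with invented data; the decaying learning rate is an arbitrary choice that keeps the iterates stable).

```python
import numpy as np

rng = np.random.default_rng(5)
N, k, lam = 500, 5, 0.1
X = rng.normal(size=(N, k))
y = X @ rng.normal(size=k) + 0.1 * rng.normal(size=N)

beta = np.zeros(k)
for epoch in range(20):
    eta = 0.02 / (1.0 + epoch)          # slowly decaying learning rate
    for i in rng.permutation(N):
        # per-example update: beta <- beta - eta * (-y_i x_i + x_i (x_i^T beta) + lam * beta)
        beta -= eta * (-y[i] * X[i] + X[i] * (X[i] @ beta) + lam * beta)
print(beta)
```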
From CMU, 2008 spring, Tom Mitchell, HW2, pr. 1 and Stanford, Andrew Ng, lecture notes #1, http://cs229.stanford.edu/notes/cs229-notes1.pdf, section 7.

While MLEs can sometimes be found analytically, for complicated likelihood functions they may need to be computed using numerical methods. One such method is the Newton algorithm, which iteratively finds a sequence $\theta_0, \theta_1, \ldots$ that (under ideal conditions) converges to the MLE $\hat\theta$. The idea here is that we are trying to find the value of $\theta$ that maximizes the likelihood function. Newton's method does this by efficiently finding a root of the likelihood function's first derivative, by following successively closer tangents of that first-derivative function. Note that one may expand the derivative of the log-likelihood function around $\theta_j$:

$$0 = l'(\hat\theta) \approx l'(\theta_j) + (\hat\theta - \theta_j)\, l''(\theta_j).$$

Solving for $\hat\theta$ gives $\hat\theta \approx \theta_j - \frac{l'(\theta_j)}{l''(\theta_j)}$. This leads to an iterative scheme where

$$\theta_{j+1} = \theta_j - \frac{l'(\theta_j)}{l''(\theta_j)}.$$

The generalization of Newton's method to a multi-dimensional setting (also called the Newton-Raphson method) is given by

$$\theta_{j+1} = \theta_j - H^{-1} \nabla_\theta l(\theta).$$

Here, $\nabla_\theta l(\theta)$ is, as usual, the vector of partial derivatives of $l(\theta)$ with respect to the $\theta_i$'s; and $H$ is an $n$-by-$n$ matrix (actually, $(n+1)$-by-$(n+1)$, assuming that we include the intercept term) called the Hessian, whose entries are given by

$$H_{ij} = \frac{\partial^2 l(\theta)}{\partial \theta_i\, \partial \theta_j}.$$

Newton's method typically enjoys faster convergence than (batch) gradient descent and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an $n$-by-$n$ Hessian; but so long as $n$ is not too large, it is usually much faster overall.
In this problem we will prove that if we use Newton's method to solve the least squares problem, then we only need one iteration to converge to $\hat\beta$.

a. Find the Hessian of the cost function $J(\beta) = \frac{1}{2} \sum_{i=1}^m (\beta^\top x^{(i)} - y^{(i)})^2$.

b. Show that the first iteration of Newton's method gives us $\hat\beta = (X^\top X)^{-1} X^\top y$, the solution to our least squares problem.

Solution

a.

$$\frac{\partial J(\beta)}{\partial \beta_j} = \sum_{i=1}^m (\beta^\top x^{(i)} - y^{(i)})\, x_j^{(i)}.$$

So,

$$\frac{\partial^2 J(\beta)}{\partial \beta_j\, \partial \beta_k} = \sum_{i=1}^m \frac{\partial}{\partial \beta_k} \left[ (\beta^\top x^{(i)} - y^{(i)})\, x_j^{(i)} \right] = \sum_{i=1}^m x_j^{(i)} x_k^{(i)} = (X^\top X)_{jk}.$$

Therefore, the Hessian of $J(\beta)$ is $X^\top X$. (This can also be derived by simply applying the rules from the lecture notes on Linear Algebra.)

b.

$$\beta^{(1)} = \beta^{(0)} - H^{-1} \nabla_\beta J(\beta^{(0)}) = \beta^{(0)} - (X^\top X)^{-1} (X^\top X \beta^{(0)} - X^\top y) = \beta^{(0)} - \beta^{(0)} + (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top y.$$

Therefore, no matter what $\beta^{(0)}$ we pick, Newton's method always finds $\hat\beta$ after one iteration.
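Because the Hessian of $J(\beta)$ is the constant matrix $X^\top X$, a single Newton step from any starting point lands exactly on the least squares solution. The sketch below (illustrative data only) verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(6)
m, d = 50, 4
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

beta0 = rng.normal(size=d)                   # arbitrary starting point
grad = X.T @ (X @ beta0 - y)                 # gradient of J at beta0
H = X.T @ X                                  # Hessian of J (constant)
beta1 = beta0 - np.linalg.solve(H, grad)     # one Newton step

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)  # least squares solution
print(np.allclose(beta1, beta_ls))           # True: one step suffices
```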
In linear regression, we are given training data of the form $D = (X, y) = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, where $x_i = (x_{i,1}, \ldots, x_{i,d})^\top \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $X \in \mathbb{R}^{n \times d}$ with row $i$ of $X$ being $x_i^\top$, and $y = (y_1, \ldots, y_n)^\top$. Assuming a parametric model of the form $y_i = x_i^\top \beta + \varepsilon_i$, where the $\varepsilon_i$ are noise terms from a given distribution, linear regression seeks to find the parameter vector $\beta$ that provides the best fit of the above regression model. One criterion to measure fitness is to find the $\beta$ that minimizes a given loss function $J(\beta)$. At problem CMU, 2015 spring, A. Smola, HW1, pr. 2 (part I), we have shown that if we take the loss function to be the squared error, i.e.

$$J_1(\beta) = \sum_i (y_i - x_i^\top \beta)^2 = (X\beta - y)^\top (X\beta - y), \quad \text{then } \hat\beta = (X^\top X)^{-1} X^\top y. \qquad (8)$$

Moreover, we have also shown that if we assume that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. and sampled from the same zero-mean Gaussian, that is, $\varepsilon_i \sim N(0, \sigma^2)$, then the least squares estimate is also the MLE estimate for $p(y|X; \beta)$. In this problem we will explore two extensions to this basic regression model.

a. Assume that $\varepsilon_1, \ldots, \varepsilon_n$ are independent but each $\varepsilon_i \sim N(0, \sigma_i^2)$. Derive the MLE estimate of $\beta$ under this noise model.

Hint: You may find useful the following formulas (from Matrix Identities, by Sam Roweis, 1999):ᵃ (5a) $\frac{\partial}{\partial X} a^\top X = \frac{\partial}{\partial X} X^\top a = a$; (5b) $\frac{\partial}{\partial X} X^\top A X = (A + A^\top) X$.

ᵃhttp://www.cs.nyu.edu/~roweis/notes/matrixid.pdf.
Answer:

$y_i = x_i^\top \beta + \varepsilon_i$ and $p(y_i|x_i; \beta) = N(x_i^\top \beta, \sigma_i^2)$. Thus the formula for the MLE of $\beta$ is:

$$\beta_{MLE} = \arg\max_\beta \ln \prod_i p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln \left[ \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2}\right) \right]$$
$$= \arg\max_\beta \sum_i \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma_i} - \frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2} \right] = \arg\max_\beta \sum_i -\frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2} = \arg\min_\beta \sum_i \frac{(y_i - x_i^\top \beta)^2}{2\sigma_i^2} = \arg\min_\beta \sum_i \frac{1}{\sigma_i^2} (y_i - x_i^\top \beta)^2. \qquad (9)$$

Now we write the expression (9) in matrix notation. If we let $W$ be the diagonal matrix with diagonal entries $w_{ii} = \frac{1}{\sigma_i^2}$, we get:

$$\beta_{MLE} = \arg\min_\beta\ (y - X\beta)^\top W (y - X\beta).$$

In order to get $\beta_{MLE}$ we just take derivatives of $(y - X\beta)^\top W (y - X\beta)$. Expanding,

$$(y - X\beta)^\top W (y - X\beta) = y^\top W y - y^\top W X \beta - \beta^\top X^\top W y + \beta^\top X^\top W X \beta. \qquad (10)$$

For any scalar $z$, $z^\top = z$, therefore $(\beta^\top X^\top W y)^\top = y^\top W^\top X \beta = y^\top W X \beta$, since $W^\top = W$ as $W$ is diagonal. Now, putting this back in (10) and taking the derivatives (rules (5a) and (5b)), we get:

$$0 = \frac{\partial}{\partial \beta} \left[ y^\top W y - 2 y^\top W X \beta + \beta^\top X^\top W X \beta \right] = -2 X^\top W y + 2 X^\top W X \beta,$$

which means that $\beta_{MLE} = (X^\top W X)^{-1} (X^\top W y)$.
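A brief numpy illustration (hypothetical heteroscedastic data) of $\beta_{MLE} = (X^\top W X)^{-1} X^\top W y$ with $W = \mathrm{diag}(1/\sigma_i^2)$, compared against the unweighted estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 300, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -1.0, 2.0])
sigma_i = rng.uniform(0.1, 3.0, size=n)               # per-example noise levels
y = X @ beta_true + sigma_i * rng.normal(size=n)

W = np.diag(1.0 / sigma_i ** 2)                       # weights w_ii = 1 / sigma_i^2
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted MLE
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)          # unweighted, for comparison
print("weighted:", beta_wls)
print("ordinary:", beta_ols)
```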
b. Show that the MLE you derived in part a is also the minimizer of a weighted least squares loss function $J_2(\beta) = \sum_i a_i (y_i - x_i^\top \beta)^2$. Express each $a_i$ in terms of the variance of each example.

Answer:

We have already shown in (9) that the MLE we just calculated is the minimizer of $\sum_i a_i (y_i - x_i^\top \beta)^2$. From the same relationship (9) we have $a_i = \frac{1}{\sigma_i^2}$.
c. Give an intuitive explanation of why it makes sense to use this weighted version. (Hint: Consider the case when $\sigma_i^2$ is large and when it is small.)

Answer:

When the variance $\sigma_i^2$ is high, the data point $(x_i, y_i)$ might be an outlier, as the noise term $\varepsilon_i$ can be arbitrarily large. In this case, we don't want $\beta_{MLE}$ to be biased to accommodate such outliers, especially when using the squared error loss. The weighted least squares formulation in this problem achieves that by weighting the contribution of each data point to the objective function by the inverse of its variance term. Therefore, points with large variance won't contribute much to the loss function and can be safely ignored, or at least given less importance, when estimating $\beta$.
Now assume that $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed according to a Laplace distribution with density $p(\varepsilon) = \frac{1}{2\theta} \exp\!\left(-\frac{|\varepsilon|}{\theta}\right)$, $\theta > 0$. Derive the MLE of $\beta$ and the corresponding loss function $J_3(\beta)$ under this noise model.

Answer:

The formula for the MLE of $\beta$ is:

$$\beta_{MLE} = \arg\max_\beta \ln \prod_i p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln p(y_i|x_i; \beta) = \arg\max_\beta \sum_i \ln \left[ \frac{1}{2\theta} \exp\!\left(-\frac{|y_i - x_i^\top \beta|}{\theta}\right) \right]$$
$$= \arg\max_\beta \sum_i \left[ \ln \frac{1}{2\theta} - \frac{|y_i - x_i^\top \beta|}{\theta} \right] = \arg\max_\beta \sum_i -\frac{|y_i - x_i^\top \beta|}{\theta} \stackrel{\theta > 0}{=} \arg\min_\beta \sum_i |y_i - x_i^\top \beta|.$$

Thus $J_3(\beta) = \sum_i |y_i - x_i^\top \beta|$.
Why might this noise model be preferable to the Gaussian assumption? (Hint: Think about outliers.)

Answer:

If a point is an outlier, then the error in predicting this point given the correct $\beta$ is much larger under the Gaussian assumption (as it is squared) than in this model (as it is not squared). Therefore, outliers will affect the estimation of $\beta$ in the Gaussian model more than in the Laplace model. From a modeling point of view, since $y_i = x_i^\top \beta + \varepsilon_i$, if $y_i$ is an outlier, then the model can explain that by making $\varepsilon_i$ large to accommodate for the difference. This is possible in the Laplace model, since the Laplace distribution has heavier tails than the Gaussian distribution. To relate this to the previous part (the weighted least squares model), to achieve the same effect there we assumed that every example has a different variance. However, those variances have to be estimated (using an EM-like algorithm), since they affect the optimization problem, while in the Laplace model we don't have to do that. On the other hand, optimizing the L1 loss is harder than optimizing the squared loss, since it is not differentiable everywhere and has no closed-form solution.
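To see the robustness argument numerically, the sketch below fits the same toy data once with the squared loss (closed form) and once with the L1 loss, using plain subgradient descent as one simple (if slow) way around the non-differentiability; the data, outliers and step sizes are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 200, 2
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
y[:10] += 100.0                                  # a few gross outliers

beta_l2 = np.linalg.solve(X.T @ X, X.T @ y)      # squared loss, closed form

beta_l1 = np.zeros(d)                            # L1 loss via subgradient descent
for t in range(1, 5001):
    r = y - X @ beta_l1
    subgrad = -X.T @ np.sign(r)                  # subgradient of sum_i |y_i - x_i^T beta|
    beta_l1 -= (0.01 / np.sqrt(t)) * subgrad

print("L2 fit:", beta_l2)                        # typically pulled noticeably off beta_true
print("L1 fit:", beta_l1)                        # typically lands much closer to beta_true
```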
Given the training feature vectors and output values $\{x_t, y_t\}_{t=1,\ldots,n}$, a test input vector $x_{test}$, and scaling factors $\{\alpha_i\}_{i=1,\ldots,d}$ (each $\alpha_i \neq 0$), prove that the prediction of the test output value would be the same if we trained a linear regression on $\{\tilde{x}_t, y_t\}_{t=1,\ldots,n}$, where $\tilde{x}_{t,i} = \alpha_i x_{t,i}$, and predicted on $\tilde{x}_{test}$, where $\tilde{x}_{test,i} = \alpha_i x_{test,i}$.

Solution

Let us consider the original features $x_t$ and the corresponding scaled versions $\hat{x}_{t,i} = \alpha_i x_{t,i}$. We can represent the relationship between the new and the old design matrices using the diagonal matrix $A$ with the scaling factors $\alpha_i$ on its diagonal: $\hat{X} = XA$. Plugging this into the optimal weight equation (where $X$, $y$ refer to the training set), we get:

$$\hat{w} = (\hat{X}^\top \hat{X})^{-1} \hat{X}^\top y = ((XA)^\top XA)^{-1} (XA)^\top y = (A^\top X^\top X A)^{-1} A^\top X^\top y = A^{-1} (X^\top X)^{-1} (A^\top)^{-1} A^\top X^\top y = A^{-1} (X^\top X)^{-1} X^\top y = A^{-1} w.$$

And so, the predicted output is:

$$\text{predicted } \hat{y}_{test} = \hat{X}_{test}\, \hat{w} = X_{test}\, A A^{-1} w = X_{test}\, w = \text{predicted } y_{test}.$$
The linear regression model has the form $y = x^\top \beta + \varepsilon$, where $x = (x_1, \ldots, x_d)^\top$, $\beta = (\beta_1, \ldots, \beta_d)^\top$, $y \in \mathbb{R}$, and $\varepsilon$ is an additive noise with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. We observe a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ from which to estimate the parameters $\beta$. Each $x_i = (x_{i1}, \ldots, x_{id})^\top$ is a vector of inputs for the $i$-th case. Denote by $X$ the $n \times d$ matrix with each row an input vector, and similarly let $Y$ be the $n$-vector of outputs in the training set. We have shown at CMU, 2015 spring, A. Smola, HW1, pr. 2 that minimizing the residual sum of squares $\|Y - X\beta\|^2$ leads to the following estimate of $\beta$: $\hat\beta = (X^\top X)^{-1} X^\top Y$. The least squares prediction of $y$ is given by $\hat{y} = X\hat\beta = X(X^\top X)^{-1} X^\top Y$. Note that $Y$ is a random variable, because given any input vector $x$, the corresponding output is corrupted by the noise $\varepsilon$; hence $\hat\beta$ is a function of $Y$ and is therefore also a random quantity.

a. Show that the least squares estimator is unbiased, that is, $E(\hat\beta) = \beta$.

Answer:

$$E[\hat\beta] = E[(X^\top X)^{-1} X^\top Y] = E[(X^\top X)^{-1} X^\top (X\beta + \varepsilon)] = E[\beta + (X^\top X)^{-1} X^\top \varepsilon] \qquad (11)$$
$$= \beta + (X^\top X)^{-1} X^\top E[\varepsilon] = \beta,$$

since $E[\varepsilon] = 0$.
b. Show that the covariance matrix of $\hat\beta$ equals $\sigma^2 (X^\top X)^{-1}$.

Answer:

$$\mathrm{Cov}[\hat\beta] \stackrel{\text{def.}}{=} \left[ \mathrm{Cov}(\hat\beta_i, \hat\beta_j) \right]_{i,j \in \{1,\ldots,d\}} \stackrel{\text{def.}}{=} \left[ E[(\hat\beta_i - E[\hat\beta_i])(\hat\beta_j - E[\hat\beta_j])] \right]_{i,j} = \left[ E[\hat\beta_i \hat\beta_j] - E[\hat\beta_i] E[\hat\beta_j] \right]_{i,j} = E[\hat\beta \hat\beta^\top] - E[\hat\beta]\, E[\hat\beta]^\top.$$

Answer (cont'd):

Using (11) and $E[\hat\beta] = \beta$:

$$\mathrm{Cov}[\hat\beta] = E[(\beta + (X^\top X)^{-1} X^\top \varepsilon)(\beta + (X^\top X)^{-1} X^\top \varepsilon)^\top] - \beta\beta^\top = E[(\beta + (X^\top X)^{-1} X^\top \varepsilon)(\beta^\top + \varepsilon^\top X (X^\top X)^{-1})] - \beta\beta^\top$$
$$= \underbrace{E[\beta\beta^\top]}_{\beta\beta^\top} + \underbrace{E[\beta \varepsilon^\top X (X^\top X)^{-1} + (X^\top X)^{-1} X^\top \varepsilon\, \beta^\top]}_{=\,0,\ \text{since } E[\varepsilon]\,=\,0} + E[(X^\top X)^{-1} X^\top \varepsilon\, \varepsilon^\top X (X^\top X)^{-1}] - \beta\beta^\top$$
$$= E[(X^\top X)^{-1} X^\top \varepsilon\, \varepsilon^\top X (X^\top X)^{-1}] = (X^\top X)^{-1} X^\top \underbrace{E[\varepsilon \varepsilon^\top]}_{\sigma^2 I_{n \times n}} X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1} (X^\top X)(X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}.$$

(We used that $((X^\top X)^\top)^{-1} = (X^\top X)^{-1}$, since $X^\top X$ is symmetric.)
c. We have seen that an extension of the least squares estimator can be obtained by imposing a penalty on the size of the regression coefficients:

$$\|X\beta - y\|_2^2 + \lambda \beta^\top \beta,$$

where the parameter $\lambda$ controls the contribution of the regularization term. Minimizing this cost function gives the following estimator of the regression coefficients, $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$, which is known as the Ridge estimator in statistics. The Ridge prediction is $\hat{y} = X\hat\beta = X(X^\top X + \lambda I)^{-1} X^\top y$. Show that the Ridge estimator is biased.

Answer:

$$E[\hat\beta] - \beta = E[(X^\top X + \lambda I)^{-1} X^\top y] - \beta = E[(X^\top X + \lambda I)^{-1} X^\top (X\beta + \varepsilon)] - \beta = (X^\top X + \lambda I)^{-1} X^\top X \beta + (X^\top X + \lambda I)^{-1} X^\top \underbrace{E[\varepsilon]}_{=\,0} - \beta$$
$$= (X^\top X + \lambda I)^{-1} X^\top X \beta - \beta = (X^\top X + \lambda I)^{-1} (X^\top X + \lambda I - \lambda I)\beta - \beta = \beta - \lambda (X^\top X + \lambda I)^{-1} \beta - \beta = -\lambda (X^\top X + \lambda I)^{-1} \beta,$$

which is nonzero in general (whenever $\lambda > 0$ and $\beta \neq 0$), so the Ridge estimator is biased.
[without "intercept" term]

[Figure: a system S takes the input $x = (x_1, x_2)$ and produces the output $y = c_1 x_1 + c_2 x_2$, to which additive noise $\varepsilon$ is introduced by the measuring device.]
This figure shows a system S which takes two inputs $x_1, x_2$ and outputs a linear combination of those two inputs, $c_1 x_1 + c_2 x_2$, where $c_1$ and $c_2$ are two unknown real numbers. The device you use to measure the output of S, i.e., $c_1 x_1 + c_2 x_2$, introduces an additive error $\varepsilon$, which is a random variable following some distribution. Thus, the output $y$ that you observe is given by equation (12):

$$y = c_1 x_1 + c_2 x_2 + \varepsilon. \qquad (12)$$

Assume that you have $n > 2$ instances $\{x_{j1}, x_{j2}, y_j\}_{j=1,\ldots,n}$, or equivalently $\{x_j, y_j\}_{j=1,\ldots,n}$, where $x_j \stackrel{\text{not.}}{=} [x_{j1}, x_{j2}]$. In other words, having $n$ measurements in your hands is equivalent to having $n$ equations of the following form: $y_j = c_1 x_{j1} + c_2 x_{j2} + \varepsilon_j$, $j = 1, \ldots, n$. The goal is to estimate $c_1$ and $c_2$ from those measurements using maximum likelihood.
a. Assume that the $\varepsilon_i$ for $i = 1, \ldots, n$ are i.i.d. Gaussian random variables with zero mean and variance $\sigma^2$. Compute the log-likelihood function and use it to prove that the maximum likelihood estimate $c^* = [c_1^*, c_2^*]$ is the solution of a least squares approximation problem. Find the solution of the least squares problem.

Answer:

$\varepsilon_i = y_i - (c_1 x_{i1} + c_2 x_{i2}) \sim N(0, \sigma^2)$, therefore $y_i \sim N(c_1 x_{i1} + c_2 x_{i2}, \sigma^2)$. Since the noise terms are i.i.d., the likelihood function is given by

$$L(c_1, c_2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - c_1 x_{i1} - c_2 x_{i2})^2}{2\sigma^2}\right).$$

Taking the logarithm, we get the log-likelihood function:

$$l(c_1, c_2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - c_1 x_{i1} - c_2 x_{i2})^2.$$

Let $y \in \mathbb{R}^n$ be the vector containing the measurements, $X$ the $n \times 2$ matrix with $X_{ij} = x_{ij}$, and $c = [c_1, c_2]^\top$; then we are trying to minimize $\|y - Xc\|_2^2$, resulting in the solution $c^* = (X^\top X)^{-1} X^\top y$.
b. Assume that the $\varepsilon_i$ for $i = 1, \ldots, n$ are independent Gaussian random variables with zero mean and variance $\mathrm{Var}(\varepsilon_i) = \sigma_i^2$. Compute the log-likelihood function and find the $c^* = [c_1^*, c_2^*]$ which maximizes it, i.e., the MLE.

Answer:

$\varepsilon_i = y_i - (c_1 x_{i1} + c_2 x_{i2}) \sim N(0, \sigma_i^2)$. Similarly to before,

$$l(c_1, c_2) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n \log \sigma_i - \sum_{i=1}^n \frac{(y_i - c_1 x_{i1} - c_2 x_{i2})^2}{2\sigma_i^2}.$$

Now we are trying to minimize $\|W(y - Xc)\|_2^2$, where $W$ is a diagonal matrix with $w_{ii} = \frac{1}{\sigma_i}$, resulting in the solution $c^* = (X^\top W^\top W X)^{-1} X^\top W^\top W y$. (See the correspondence with the formula obtained at CMU, 2015 spring, Alex Smola, HW1, pr. 2.d, or directly the formula obtained at CMU, 2010 spring, E. Xing, T. Mitchell, A. Singh, HW2, pr. 3.1-2.a.)
c. Assume now that the noise terms are i.i.d. with density $p(x) = \frac{1}{2\theta} \exp\!\left(-\frac{|x|}{\theta}\right)$, $\theta > 0$. In other words, our noise is i.i.d. following a Laplace distribution with location parameter $\mu = 0$ and scale parameter $\theta$. Compute the log-likelihood function under this noise model and explain why this model leads to more robust solutions.

Answer:

$$l(c_1, c_2) = -n \log(2\theta) - \frac{1}{\theta} \|y - Xc\|_1.$$

This model is prepared to see larger residual values because the Laplace distribution has heavier tails than the Gaussian; thus it is more robust to noise and outliers.
[Figure: the Laplace p.d.f. $f(x)$ plotted for $(\mu = 0, \theta = 1)$, $(\mu = 0, \theta = 2)$, $(\mu = 0, \theta = 4)$ and $(\mu = -5, \theta = 4)$.]
You are given $n$ i.i.d. training datapoints $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$, where each vector $x^{(i)}$ has $d$ features / attributes. Here we will assume that $y^{(i)} \in \{0, 1\}$ for $i = 1, \ldots, n$. Logistic Regression is a classification algorithm that works by trying to learn a function that approximates $P(Y|X)$. It makes the central assumption that $P(Y|X)$ can be approximated as a sigmoid function (also called a logistic function) applied to a linear combination of the input features. Mathematically, for a single training datapoint $(x, y)$, Logistic Regression assumes:

$$P(Y = 1|X = x) = \sigma(z) \quad \text{and, equivalently,} \quad P(Y = 0|X = x) = 1 - \sigma(z),$$

where

$$\sigma(z) \stackrel{\text{def.}}{=} \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}, \quad \text{with } z \stackrel{\text{not.}}{=} w_0 + \sum_{i=1}^d w_i x_i = w \cdot x \quad \text{and} \quad w \stackrel{\text{not.}}{=} (w_0, w_1, \ldots, w_d) \in \mathbb{R}^{d+1}, \text{ assuming } x_0 = 1.$$
[Figure: plot of the sigmoid function $\sigma(x)$.]
Starting from the above formulas for the probability of $Y|X$, we can create an algorithm that selects values of $w$ that maximize that probability for all the training data.
Show that the conditional log-likelihood of the training data (under the Logistic Regression assumption) is:

$$L(w) = \sum_{i=1}^n \left\{ y^{(i)} \ln \sigma(w \cdot x^{(i)}) + (1 - y^{(i)}) \ln [1 - \sigma(w \cdot x^{(i)})] \right\}. \qquad (13)$$

Solution:

To start, here is a super slick way of writing the conditional probability of one datapoint:

$$P(Y = y|X = x) = \sigma(w \cdot x)^y \, [1 - \sigma(w \cdot x)]^{1-y}, \quad \text{assuming } y \in \{0, 1\}.$$

Since each datapoint is independent, the conditional probability of all the data is:

$$\prod_{i=1}^n P(Y = y^{(i)}|X = x^{(i)}) = \prod_{i=1}^n \sigma(w \cdot x^{(i)})^{y^{(i)}} \, [1 - \sigma(w \cdot x^{(i)})]^{1-y^{(i)}}. \qquad (14)$$

If you apply $\ln$ to this function, you get the reported conditional log-likelihood for Logistic Regression.
Note [from CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 4]: Actually, the full log-likelihood function is:

$$\text{log-likelihood} = \ln \prod_{i=1}^n P(x^{(i)}, y^{(i)}) = \ln \prod_{i=1}^n \left( P_{Y|X}(y^{(i)}|x^{(i)})\, P_X(x^{(i)}) \right) = \ln \left[ \prod_{i=1}^n P_{Y|X}(y^{(i)}|x^{(i)}) \cdot \prod_{i=1}^n P_X(x^{(i)}) \right]$$
$$= \ln \prod_{i=1}^n P_{Y|X}(y^{(i)}|x^{(i)}) + \ln \prod_{i=1}^n P_X(x^{(i)}) = L(w) + L_x.$$

Because $L_x$ does not depend on the parameter $w$, when doing MLE we can just consider maximizing $L(w)$.
Starting from the expression (13) for the conditional log-likelihood function, we simply need to choose the values of $w$ that maximize it. Unlike in other situations, here there is no closed-form way to calculate $w$. Instead we will choose it using optimization; we will employ an algorithm called gradient ascent. That algorithm claims that if you continuously take small steps in the direction of your gradient, you will eventually reach a local maximum. In the case of Logistic Regression it can be proven that the result will always be a global maximum.ᵃ The small step that we continually take, given the training dataset, can be calculated as:

$$w_j^{new} = w_j^{old} + \eta\, \frac{\partial}{\partial w_j^{old}} L(w^{old}),$$

where $\eta$ is the magnitude of the step size ("learning rate") that we take.

ᵃSee Stanford, 2008 fall, Andrew Ng, HW1, pr. 1.a.
Show that the partial derivative of the conditional log-likelihood with respect to each parameter $w_j$ is:

$$\frac{\partial}{\partial w_j} L(w) = \sum_{i=1}^n \left[ y^{(i)} - \sigma(w \cdot x^{(i)}) \right] x_j^{(i)}, \quad \text{for } j = 0, 1, \ldots, d. \qquad (15)$$

Hint: You may use the following property for the derivative of $\sigma$ with respect to its input: $\frac{\partial}{\partial z} \sigma(z) = \sigma(z)[1 - \sigma(z)]$ for all $z \in \mathbb{R}$.

Solution

The partial derivative of the conditional log-likelihood for only one datapoint $(x, y)$ w.r.t. the $w_j$ component is computed as follows:

$$\frac{\partial}{\partial w_j} \ln \left[ \sigma(w \cdot x)^y [1 - \sigma(w \cdot x)]^{1-y} \right] = \frac{\partial}{\partial w_j}\, y \ln \sigma(w \cdot x) + \frac{\partial}{\partial w_j}\, (1 - y) \ln[1 - \sigma(w \cdot x)]$$
$$= \left[ \frac{y}{\sigma(w \cdot x)} - \frac{1 - y}{1 - \sigma(w \cdot x)} \right] \frac{\partial}{\partial w_j} \sigma(w \cdot x) = \frac{y - \sigma(w \cdot x)}{\sigma(w \cdot x)[1 - \sigma(w \cdot x)]}\, \sigma(w \cdot x)[1 - \sigma(w \cdot x)]\, x_j = [y - \sigma(w \cdot x)]\, x_j. \qquad (16)$$

Because the derivative of a sum is the sum of the derivatives, the partial derivative of $L(w)$ is the sum of this term over the training datapoints. More exactly, after applying the $\ln$ function to the (14) equality and then calculating its partial derivative w.r.t. $w_j$ (for $j \in \{0, 1, \ldots, d\}$), we get the (15) result, due to the (16) relation proven above.
[based on CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 4] Starting from (13), the conditional log-likelihood function can be further written as:

$$L(w) = \sum_{i=1}^n \left\{ y_i \ln \sigma(w \cdot x_i) + (1 - y_i) \ln(1 - \sigma(w \cdot x_i)) \right\} = \sum_{i=1}^n \left\{ y_i \ln \frac{\sigma(w \cdot x_i)}{1 - \sigma(w \cdot x_i)} + \ln(1 - \sigma(w \cdot x_i)) \right\} \stackrel{(*)}{=} \sum_{i=1}^n \left\{ y_i (w \cdot x_i) - \ln(1 + e^{w \cdot x_i}) \right\}. \qquad (17)$$

And therefore, the gradient vector of $L(w)$ can be written as:

$$\nabla_w L(w) = \sum_{i=1}^n \left( y_i - \frac{e^{w \cdot x_i}}{1 + e^{w \cdot x_i}} \right) x_i = \sum_{i=1}^n \left( y_i - \sigma(w \cdot x_i) \right) x_i.$$

(*): We used the fact that the sigmoid function $\sigma: \mathbb{R} \to (0, 1)$ is bijective, and its inverse function is $\sigma^{-1}: (0, 1) \to \mathbb{R}$, defined by $\sigma^{-1}(z) = \ln \frac{z}{1 - z}$ (some people call $\sigma^{-1}$ the logit function); hence $\ln \frac{\sigma(w \cdot x_i)}{1 - \sigma(w \cdot x_i)} = w \cdot x_i$ and $1 - \sigma(w \cdot x_i) = \frac{1}{1 + e^{w \cdot x_i}}$.
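Putting the gradient ascent update $w \leftarrow w + \eta \sum_i \left( y^{(i)} - \sigma(w \cdot x^{(i)}) \right) x^{(i)}$ into code (a minimal sketch with synthetic labels; the learning rate, iteration count and true weights are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(10)
n, d = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])      # x_0 = 1 carries the intercept w_0
w_true = np.array([-0.5, 2.0, -1.0, 0.5])
y = (rng.uniform(size=n) < sigmoid(X @ w_true)).astype(float)  # Bernoulli(sigma(w.x)) labels

w, eta = np.zeros(d + 1), 0.005
for _ in range(5000):
    grad = X.T @ (y - sigmoid(X @ w))   # gradient of the conditional log-likelihood, as in (15)
    w += eta * grad                     # gradient *ascent* step
print(w)                                # roughly recovers w_true
```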
Consider the log-likelihood function for logistic regression:

$$\ell(\theta) = \sum_{i=1}^m \left[ y^{(i)} \ln h(x^{(i)}) + (1 - y^{(i)}) \ln(1 - h(x^{(i)})) \right], \quad \text{where } x = (x_1, \ldots, x_k)^\top \text{ and } h(x) \stackrel{\text{def.}}{=} \sigma(\theta \cdot x) = \frac{1}{1 + \exp(-\theta \cdot x)}.$$

Find the Hessian $H$ of this function, and show that for any vector $z = (z_1, \ldots, z_k)^\top$ it holds true that $z^\top H z \leq 0$.

Hint: You might want to start by showing the fact that $\sum_i \sum_j z_i x_i x_j z_j = (x^\top z)^2 \geq 0$.

Remark: This is one of the standard ways of showing that the matrix $H$ is negative semi-definite, written "$H \leq 0$". This implies that $\ell$ is concave, and has no local maxima other than the global one.
Note: we do things in a shorter way here; this solution does not use the Hint. Recall that we have $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and thus for $h(x) = \sigma(\theta \cdot x)$ we have $\frac{\partial h(x)}{\partial \theta_k} = h(x)(1 - h(x))\, x_k$. This latter fact is very useful in the following derivations. Now

$$\frac{\partial \ell(\theta)}{\partial \theta_k} = \frac{\partial}{\partial \theta_k} \sum_{i=1}^m \left[ y^{(i)} \ln h(x^{(i)}) + (1 - y^{(i)}) \ln(1 - h(x^{(i)})) \right] = \sum_{i=1}^m \left[ y^{(i)} \frac{h(x^{(i)})(1 - h(x^{(i)}))\, x_k^{(i)}}{h(x^{(i)})} - (1 - y^{(i)}) \frac{h(x^{(i)})(1 - h(x^{(i)}))\, x_k^{(i)}}{1 - h(x^{(i)})} \right]$$
$$= \sum_{i=1}^m \left[ y^{(i)}(1 - h(x^{(i)})) - (1 - y^{(i)}) h(x^{(i)}) \right] x_k^{(i)} = \sum_{i=1}^m \left( y^{(i)} - h(x^{(i)}) \right) x_k^{(i)},$$

and

$$H_{kl} = \frac{\partial^2 \ell(\theta)}{\partial \theta_k\, \partial \theta_l} = \sum_{i=1}^m -\frac{\partial h(x^{(i)})}{\partial \theta_l}\, x_k^{(i)} = \sum_{i=1}^m -h(x^{(i)})(1 - h(x^{(i)}))\, x_l^{(i)} x_k^{(i)}.$$

So we have for the Hessian matrix $H$ (using that $X = x x^\top$ if and only if $X_{ij} = x_i x_j$):

$$H = -\sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, x^{(i)} x^{(i)\top}.$$

To prove that $H$ is negative semi-definite, we show that $z^\top H z \leq 0$ for all $z$:

$$z^\top H z = -z^\top \left[ \sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, x^{(i)} x^{(i)\top} \right] z = -\sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, z^\top x^{(i)} x^{(i)\top} z = -\sum_{i=1}^m h(x^{(i)})(1 - h(x^{(i)}))\, (z^\top x^{(i)})^2 \leq 0,$$

with the last inequality holding since $0 \leq h(x^{(i)}) \leq 1$ implies $h(x^{(i)})(1 - h(x^{(i)})) \geq 0$, and $(z^\top x^{(i)})^2 \geq 0$.
Consider a binary classification problem with variable $X_1 \in \{0, 1\}$ and label $Y \in \{0, 1\}$. We have a training set $D_1$ made of $n$ examples: $D_1 = \{(x_1^1, y^1), \ldots, (x_1^n, y^n)\}$. Suppose we generate another training set $D_2$ of $n$ examples, $D_2 = \{(x_1^1, x_2^1, y^1), \ldots, (x_1^n, x_2^n, y^n)\}$, where in each example $x_1$ and $y$ are the same as in $D_1$, and $x_2$ is a duplicate of $x_1$. Now we learn a logistic regression from $D_1$, which should contain two parameters, $w_0$ and $w_1$; we also learn another logistic regression from $D_2$, which should have three parameters, $w_0$, $w_1$ and $w_2$. First, write down the training rule (maximum conditional likelihood estimation) we use to estimate the parameters $(w_0, w_1)$ and $(w_0, w_1, w_2)$ from data. Then, given the training rule, what is the relationship between the $(w_0, w_1)$ and $(w_0, w_1, w_2)$ estimated from $D_1$ and $D_2$? Use this fact to argue whether logistic regression is affected by the duplicated variable $X_2$.
Answer:

The training rule for $(w_0, w_1)$ aims to maximize the following log-likelihood function:

$$\ln \prod_{l=1}^n P(Y^l|X_1^l, w_0, w_1) \stackrel{(17)}{=} \sum_{l=1}^n \left[ Y^l (w_0 + w_1 X_1^l) - \ln(1 + \exp(w_0 + w_1 X_1^l)) \right].$$

Similarly, the training rule for $(w'_0, w'_1, w'_2)$ aims to maximize

$$\ln \prod_{l=1}^n P(Y^l|X_1^l, X_2^l, w'_0, w'_1, w'_2) \stackrel{(17)}{=} \sum_{l=1}^n \left[ Y^l (w'_0 + w'_1 X_1^l + w'_2 X_2^l) - \ln(1 + \exp(w'_0 + w'_1 X_1^l + w'_2 X_2^l)) \right]$$
$$= \sum_{l=1}^n \left[ Y^l (w'_0 + (w'_1 + w'_2) X_1^l) - \ln(1 + \exp(w'_0 + (w'_1 + w'_2) X_1^l)) \right],$$

which is basically the same as the log-likelihood function for deriving the training rule for $(w_0, w_1)$, with the substitutions $w_0 = w'_0$ and $w_1 = w'_1 + w'_2$. These substitutions express the relationship between the sets of parameters $(w_0, w_1)$ and $(w'_0, w'_1, w'_2)$ that we estimate from the training sets $D_1$ and $D_2$, respectively. Therefore, logistic regression will simply split the weight $w_1$ into $w'_1 + w'_2 = w_1$ when facing the duplicated variable $X_2 = X_1$.
CMU, 2012 fall, Tom Mitchell, Ziv Bar-Joseph, HW2, pr. 2
We can easily extend the binary Logistic Regression model to handle multi-class classification. Let's assume that we have $K$ different classes, and the posterior probability for class $k$ is given by:

$$P(Y = k|X = x) = \frac{\exp(w_k \cdot x)}{1 + \sum_{t=1}^{K-1} \exp(w_t \cdot x)} \quad \text{for } k = 1, \ldots, K-1, \qquad P(Y = K|X = x) = \frac{1}{1 + \sum_{t=1}^{K-1} \exp(w_t \cdot x)},$$

where $x$ and the $w_t$, $t = 1, \ldots, K$, are $d$-dimensional vectors. Notice that we ignored the components $w_{t0}$ in order to simplify the expression. Our goal is to estimate the weights $w_t$ using the gradient ascent optimization method. We will also define priors on the parameters to avoid overfitting and very large weights.

a. Assume we are given $n$ training examples $(x^1, y^1), \ldots, (x^n, y^n)$, where $n$ is the number of training examples and $d$ is the number of attributes / dimensions. Please explicitly write down the log-likelihood function $L(w_1, \ldots, w_K)$ with L2 regularization on the weights. Show your steps. Hint: You can simplify the multi-class logistic regression expression above by introducing a fixed parameter vector $w_K = 0$ (the $d$-dimensional vector made entirely of $0$'s).

b. There is no closed-form solution that maximizes the log-conditional likelihood $L(w_1, \ldots, w_K)$ with respect to $w_k$. However, we can still find the solution with the gradient ascent method, by using partial derivatives. Derive the expression for the $i$-th component in the gradient of $L(w_1, \ldots, w_K)$, which is the partial derivative of $L(w_1, \ldots, w_K)$ with respect to $w_i$.

c. Write down the gradient ascent update rule for $w_i$, using $\nu$ for the step size. Will the solution converge to a global maximum?
Answer

a. Let $1\{l\,k\}$ be an indicator function, where $1\{l\,k\} = 1$ if $Y^l = k$, otherwise $1\{l\,k\} = 0$. Then, using the convention $w_K = 0$ from the Hint, we can write the likelihood as:

$$L(w_1, \ldots, w_K) = \prod_{l=1}^n \prod_{k=1}^K P(Y^l = k|X^l = x^l; w)^{1\{l\,k\}} = \prod_{l=1}^n \prod_{k=1}^K \left( \frac{\exp(w_k \cdot x^l)}{\sum_{r=1}^K \exp(w_r \cdot x^l)} \right)^{1\{l\,k\}}.$$

Taking $\ln$:

$$\ell(w_1, \ldots, w_K) = \sum_{l=1}^n \sum_{k=1}^K 1\{l\,k\} \left[ w_k \cdot x^l - \ln \sum_{r=1}^K \exp(w_r \cdot x^l) \right].$$

Adding the L2 regularization term:

$$\ell(w_1, \ldots, w_K) = \sum_{l=1}^n \sum_{k=1}^K 1\{l\,k\} \left[ w_k \cdot x^l - \ln \sum_{r=1}^K \exp(w_r \cdot x^l) \right] - \frac{\lambda}{2} \sum_{k=1}^K \|w_k\|^2.$$

b.

$$\frac{\partial}{\partial w_i} \ell(w_1, \ldots, w_K) = \sum_{l=1}^n \left[ 1\{l\,i\} - \frac{\exp(w_i \cdot x^l)}{\sum_{r=1}^K \exp(w_r \cdot x^l)} \right] x^l - \lambda w_i = \sum_{l=1}^n \left[ 1\{l\,i\} - P(Y^l = i|X^l = x^l; w) \right] x^l - \lambda w_i.$$

c. The gradient ascent update rule is

$$w_i \leftarrow w_i + \nu \left( \sum_{l=1}^n \left[ 1\{l\,i\} - P(Y^l = i|X^l = x^l; w) \right] x^l - \lambda w_i \right).$$

This will converge to a global maximum, since the regularized log-likelihood is a concave function.
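A compact sketch of the regularized gradient-ascent update for the multi-class model (synthetic data; $\lambda$, $\nu$ and the iteration count are arbitrary). It keeps $w_K$ fixed at the zero vector, as in the Hint, and updates the remaining weight vectors with $\nabla_{w_i} \ell = \sum_l \left( 1\{l\,i\} - P(Y^l = i|x^l) \right) x^l - \lambda w_i$.

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, K = 600, 4, 3
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(K - 1, d))

def posteriors(W, X):
    """P(Y=k|x) for k = 1..K (columns 0..K-1), with w_K fixed to the zero vector."""
    scores = np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])   # last column: w_K . x = 0
    scores -= scores.max(axis=1, keepdims=True)                # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# sample labels (0..K-1 stand for classes 1..K) from the assumed true model
y = np.array([rng.choice(K, p=p) for p in posteriors(W_true, X)])
Y_onehot = np.eye(K)[y]

lam, nu = 1.0, 0.002
W = np.zeros((K - 1, d))
for _ in range(3000):
    P = posteriors(W, X)
    # gradient for classes 1..K-1: sum_l (1{y_l = k} - P(k|x_l)) x_l  -  lam * w_k
    grad = (Y_onehot[:, :K - 1] - P[:, :K - 1]).T @ X - lam * W
    W += nu * grad
print(W)
```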
CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 4
Given an input vector $X$, linear regression models a real-valued output $Y$ as $Y|X \sim \mathrm{Normal}(\mu(X), \sigma^2)$, where $\mu(X) = \beta^\top X = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$. Given an input vector $X$, logistic regression models a binary output $Y$ by $Y|X \sim \mathrm{Bernoulli}(\theta(X))$, where the Bernoulli parameter is related to $\beta^\top X$ by the logit transformation:

$$\mathrm{logit}(\theta(X)) \stackrel{\text{def.}}{=} \log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X.$$
a. For each of the two regression models defined above, write the log-likelihood function and its gradient with respect to the parameter vector $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$.

Answer:

For linear regression, we can write the log-likelihood function as:

$$LL(\beta) = \log \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \beta^\top x_i)^2}{2\sigma^2}\right) = \sum_{i=1}^n \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \beta^\top x_i)^2}{2\sigma^2}\right) \right] = -n \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta^\top x_i)^2.$$

Therefore, its gradient is:

$$\nabla_\beta LL(\beta) = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \beta^\top x_i)\, x_i.$$

For logistic regression:

$$\log \frac{\theta(X)}{1 - \theta(X)} = \beta^\top X \iff e^{\beta^\top X} = \frac{\theta(X)}{1 - \theta(X)} \iff e^{\beta^\top X} = \theta(X)(1 + e^{\beta^\top X}).$$

Therefore,

$$\theta(X) = \frac{e^{\beta^\top X}}{1 + e^{\beta^\top X}} = \frac{1}{1 + e^{-\beta^\top X}} \quad \text{and} \quad 1 - \theta(X) = \frac{1}{1 + e^{\beta^\top X}}.$$

Note that $Y|X \sim \mathrm{Bernoulli}(\theta(X))$ means that $P(Y = 1|X) = \theta(X)$ and $P(Y = 0|X) = 1 - \theta(X)$, which can be equivalently written as $P(Y = y|X) = \theta(X)^y (1 - \theta(X))^{1-y}$ for all $y \in \{0, 1\}$.
So, in this case the log-likelihood function is:

$$LL(\beta) = \log \prod_{i=1}^n \left\{ \theta(x_i)^{y_i} (1 - \theta(x_i))^{1-y_i} \right\} = \sum_{i=1}^n \left\{ y_i \log \theta(x_i) + (1 - y_i) \log(1 - \theta(x_i)) \right\}$$
$$= \sum_{i=1}^n \left\{ y_i \left( \beta^\top x_i + \log(1 - \theta(x_i)) \right) + (1 - y_i) \log(1 - \theta(x_i)) \right\} = \sum_{i=1}^n \left\{ y_i (\beta^\top x_i) - \log(1 + e^{\beta^\top x_i}) \right\}.$$

And therefore,

$$\nabla_\beta LL(\beta) = \sum_{i=1}^n \left( y_i - \frac{e^{\beta^\top x_i}}{1 + e^{\beta^\top x_i}} \right) x_i = \sum_{i=1}^n \left( y_i - \theta(x_i) \right) x_i.$$
b. Show that the maximum likelihood estimate $\hat\beta$ has the following property:

$$\sum_{i=1}^n y_i x_i = \sum_{i=1}^n E[Y|X = x_i, \beta = \hat\beta]\, x_i.$$

Answer:

For linear regression: $\nabla_\beta LL(\hat\beta) = 0 \Rightarrow \sum_{i=1}^n y_i x_i = \sum_{i=1}^n (\hat\beta^\top x_i)\, x_i$. Since $Y|X \sim \mathrm{Normal}(\mu(X), \sigma^2)$, we have $E[Y|X = x_i, \beta = \hat\beta] = \mu(x_i) = \hat\beta^\top x_i$. So $\sum_{i=1}^n y_i x_i = \sum_{i=1}^n E[Y|X = x_i, \beta = \hat\beta]\, x_i$.

For logistic regression: $\nabla_\beta LL(\hat\beta) = 0 \Rightarrow \sum_{i=1}^n y_i x_i = \sum_{i=1}^n \theta(x_i)\, x_i$. Since $Y|X \sim \mathrm{Bernoulli}(\theta(X))$, we have $E[Y|X = x_i, \beta = \hat\beta] = \theta(x_i) = \dfrac{e^{\hat\beta^\top x_i}}{1 + e^{\hat\beta^\top x_i}}$. So $\sum_{i=1}^n y_i x_i = \sum_{i=1}^n E[Y|X = x_i, \beta = \hat\beta]\, x_i$.