Generalized Linear Models — David Rosenberg, New York University



  1. Generalized Linear Models — David Rosenberg, New York University — DS-GA 1003, April 12, 2015

  2. Conditional Gaussian Regression — Gaussian Regression
     Input space $\mathcal{X} = \mathbb{R}^d$, output space $\mathcal{Y} = \mathbb{R}$.
     Hypothesis space consists of functions $f : x \mapsto \mathcal{N}(w^T x, \sigma^2)$. For each $x$, $f(x)$ returns a particular Gaussian density with variance $\sigma^2$. The choice of $w$ determines the function.
     For some parameter $w \in \mathbb{R}^d$, can write our prediction function as
     $$[f_w(x)](y) = p_w(y \mid x) = \mathcal{N}(y \mid w^T x, \sigma^2), \qquad \text{where } \sigma^2 > 0.$$
     Given some i.i.d. data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, how to assess the fit?
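As an illustration of the idea that the prediction function returns a whole density, here is a minimal NumPy/SciPy sketch (the values of `w`, `x`, and `sigma` are made up for the example, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def predict_density(w, sigma, x):
    """Return the conditional density y -> p_w(y | x) = N(y | w^T x, sigma^2)."""
    mean = w @ x  # the prediction is a Gaussian centered at w^T x
    return lambda y: norm.pdf(y, loc=mean, scale=sigma)

# Illustrative values only
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
density = predict_density(w, sigma=1.0, x=x)
print(density(0.0))  # density assigned to the outcome y = 0 given this x
```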

  3. Conditional Gaussian Regression — Gaussian Regression: Likelihood Scoring
     Suppose we have data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
     Compute the model likelihood for $\mathcal{D}$:
     $$p_w(\mathcal{D}) = \prod_{i=1}^n p_w(y_i \mid x_i) \qquad \text{[by independence]}$$
     Maximum likelihood estimation (MLE) finds the $w$ maximizing $p_w(\mathcal{D})$.
     Equivalently, maximize the data log-likelihood:
     $$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d} \sum_{i=1}^n \log p_w(y_i \mid x_i)$$
     Let's start solving this!
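A minimal sketch of likelihood scoring for the Gaussian model, assuming NumPy/SciPy and a design matrix `X` whose rows are the $x_i$ (illustrative helper, not part of the slides):

```python
import numpy as np
from scipy.stats import norm

def gaussian_log_likelihood(w, sigma, X, y):
    """Data log-likelihood sum_i log p_w(y_i | x_i) for the Gaussian model."""
    means = X @ w  # row i of X is x_i^T, so this computes all w^T x_i at once
    return np.sum(norm.logpdf(y, loc=means, scale=sigma))
```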

  4. Conditional Gaussian Regression — Gaussian Regression: MLE
     The conditional log-likelihood is:
     $$\sum_{i=1}^n \log p_w(y_i \mid x_i)
       = \sum_{i=1}^n \log\left[ \frac{1}{\sigma\sqrt{2\pi}} \exp\left( \frac{-(y_i - w^T x_i)^2}{2\sigma^2} \right) \right]
       = \underbrace{\sum_{i=1}^n \log \frac{1}{\sigma\sqrt{2\pi}}}_{\text{independent of } w}
         + \sum_{i=1}^n \frac{-(y_i - w^T x_i)^2}{2\sigma^2}$$
     The MLE is the $w$ where this is maximized.
     Note that $\sigma^2$ is irrelevant to finding the maximizing $w$.
     Can drop the negative sign and make it a minimization problem.

  5. Conditional Gaussian Regression — Gaussian Regression: MLE
     The MLE is
     $$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^n (y_i - w^T x_i)^2.$$
     This is exactly the objective function for least squares.
     From here, can use the usual approaches to solve for $w^*$ (linear algebra, calculus, iterative methods, etc.).
     NOTE: the parameter vector $w$ only interacts with $x$ through an inner product.
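Since the MLE objective is exactly least squares, one standard way to compute $w^*$ is with an off-the-shelf least-squares solver; a minimal sketch assuming NumPy:

```python
import numpy as np

def gaussian_mle(X, y):
    """Least-squares solution: argmin_w sum_i (y_i - w^T x_i)^2."""
    # lstsq solves the least-squares problem directly and is numerically stable
    w_star, residuals, rank, svals = np.linalg.lstsq(X, y, rcond=None)
    return w_star
```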

  6. Poisson Regression — Setup
     Input space $\mathcal{X} = \mathbb{R}^d$, output space $\mathcal{Y} = \{0, 1, 2, 3, 4, \ldots\}$.
     Hypothesis space consists of functions $f : x \mapsto \text{Poisson}(\lambda(x))$. That is, for each $x$, $f(x)$ returns a Poisson distribution with mean $\lambda(x)$.
     What function? Recall $\lambda > 0$.
     GLMs (and Poisson regression is a special case) have a linear dependence on $x$.
     The standard approach is to take
     $$\lambda(x) = \exp(w^T x),$$
     for some parameter vector $w$.
     Note that the range of $\lambda(x)$ is $(0, \infty)$, appropriate for the Poisson parameter.
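A one-line sketch of this choice of mean function, assuming NumPy (illustrative only):

```python
import numpy as np

def poisson_mean(w, x):
    """lambda(x) = exp(w^T x): always in (0, inf), as a Poisson mean must be."""
    return np.exp(w @ x)
```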

  7. Poisson Regression — Likelihood Scoring
     Suppose we have data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
     Last time we found the log-likelihood for the Poisson was:
     $$\log p(\mathcal{D}; \lambda) = \sum_{i=1}^n \left[ y_i \log \lambda - \lambda - \log(y_i!) \right]$$
     Plugging in $\lambda(x_i) = \exp(w^T x_i)$, we get
     $$\log p(\mathcal{D}; w)
       = \sum_{i=1}^n \left[ y_i \log\!\left(\exp(w^T x_i)\right) - \exp(w^T x_i) - \log(y_i!) \right]
       = \sum_{i=1}^n \left[ y_i\, w^T x_i - \exp(w^T x_i) - \log(y_i!) \right]$$
     Maximize this w.r.t. $w$ to find the Poisson regression fit.
     No closed form for the optimum, but it's concave, so easy to optimize.
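One way to carry out this maximization is to hand the negative log-likelihood to a generic optimizer; a minimal sketch assuming SciPy, with the $\log(y_i!)$ term dropped since it does not depend on $w$:

```python
import numpy as np
from scipy.optimize import minimize

def poisson_regression(X, y):
    """Maximize sum_i [ y_i w^T x_i - exp(w^T x_i) ] by minimizing its negative."""
    def neg_log_lik(w):
        eta = X @ w  # linear predictor w^T x_i for each example
        return -np.sum(y * eta - np.exp(eta))  # log(y_i!) omitted: constant in w
    w0 = np.zeros(X.shape[1])
    result = minimize(neg_log_lik, w0, method="BFGS")
    return result.x
```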

  8. Bernoulli Regression — Linear Probabilistic Classifiers
     Setting: $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{0, 1\}$.
     For each $X = x$, $p(Y = 1 \mid x) = \theta$ (i.e. $Y$ has a Bernoulli($\theta$) distribution); $\theta$ may vary with $x$.
     For each $x \in \mathbb{R}^d$, just want to predict $\theta \in [0, 1]$.
     Two steps:
     $$\underbrace{x}_{\in \mathbb{R}^d} \;\mapsto\; \underbrace{w^T x}_{\in \mathbb{R}} \;\mapsto\; \underbrace{f(w^T x)}_{\in [0,1]},$$
     where $f : \mathbb{R} \to [0, 1]$ is called the transfer or inverse link function.
     The probability model is then $p(Y = 1 \mid x) = f(w^T x)$.
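The slide leaves $f$ unspecified at this point; as an assumed example, a common choice is the logistic (sigmoid) transfer function, sketched below in NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic transfer function: maps all of R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def prob_y_equals_1(w, x):
    """p(Y = 1 | x) = f(w^T x), with f chosen here to be the logistic function."""
    return sigmoid(w @ x)
```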
