

SLIDE 1

Generalized linear models

Søren Højsgaard
Department of Mathematical Sciences, Aalborg University, Denmark
October 29, 2012

Printed: October 29, 2012 File: glim-slides.tex

SLIDE 2

Contents

SLIDE 3

1 Densities for generalized linear models

Consider densities of the form

    fY(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) − c(y, φ) }    (1)

where φ is a fixed parameter (not part of the model).
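As a concrete worked example (added here for illustration; the slide itself stops at (1)), the Poisson distribution with mean λ is of this form:

    f(y; λ) = λ^y e^{−λ}/y! = exp{ y log λ − λ − log y! }

so that θ = log λ, b(θ) = e^θ, a(φ) = 1 and c(y, φ) = log y!.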

SLIDE 4

1.1 Mean and variance

For random variables with densities of the above form we have:

    (∗)  E(Y) = b′(θ)
    (∗∗) Var(Y) = b′′(θ)a(φ)    (2)

Notice that

    ∂/∂θ fY(y; θ, φ) = [y − b′(θ)]/a(φ) · fY(y; θ, φ).

Since ∫ fY dy = 1, its derivative with respect to θ is zero. Assuming that the order of integration and differentiation can be interchanged we get

    0 = d/dθ ∫ fY dy = ∫ ∂/∂θ fY dy = ∫ [y − b′(θ)]/a(φ) · fY dy    (3)
      = (1/a(φ)) [ ∫ y fY dy − b′(θ) ∫ fY dy ]
      = (1/a(φ)) [µ − b′(θ)]

from which (2)(∗) follows.
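A quick numerical sanity check of (2), added here and assuming the Poisson example above, where b′(θ) = b′′(θ) = e^θ and a(φ) = 1:

    # Check E(Y) = b'(theta) and Var(Y) = b''(theta)a(phi) for the Poisson case,
    # where theta = log(lambda), b(theta) = exp(theta), a(phi) = 1.
    set.seed(1)
    theta <- log(3)                     # so lambda = 3
    y <- rpois(1e6, lambda = exp(theta))
    c(mean(y), exp(theta))              # both approximately 3 = b'(theta)
    c(var(y),  exp(theta))              # both approximately 3 = b''(theta)a(phi)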

SLIDE 5

By differentiating twice we get

    0 = d²/dθ² ∫ fY dy = ∫ ∂/∂θ { [y − b′(θ)]/a(φ) · fY } dy
      = ∫ { −b′′(θ)/a(φ) + ([y − b′(θ)]/a(φ))² } fY dy

From this (2)(∗∗) follows since, using µ = b′(θ),

    0 = ∫ { ([y − µ]/a(φ))² − b′′(θ)/a(φ) } fY dy    (4)
      = (1/a(φ)²) ∫ (y − µ)² fY dy − (b′′(θ)/a(φ)) ∫ fY dy
      = (1/a(φ)²) Var(Y) − b′′(θ)/a(φ)

SLIDE 6

2 The components of a generalized linear model

Consider the situation where Y1, . . . , Yn are independent random variables and Yi has density fY (y; θi, φ) of the form (1). In the following it is assumed that a(φ) = φ/w where w is a known weight.

2.1 Linear predictor and link function

To each Y there is associated a vector x of explanatory variables and a fixed number ω. The linear predictor is

    η = η(β) = ω + x⊤β.

In generalized linear models (GLIMs) the linear predictor is related to the mean µ = E(Y) through the link function g(), η = g(µ), so that

    µ = g⁻¹(η) = g⁻¹(ω + x⊤β).
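In R, a link function, its inverse and the associated variance function are bundled in a family object; a small illustration (added here, not on the original slide):

    # The binomial family uses the logit link by default
    fam <- binomial()
    fam$linkfun(0.25)      # g(mu): log(0.25/0.75), approx. -1.0986
    fam$linkinv(-1.0986)   # g^{-1}(eta), approx. 0.25
    fam$variance(0.25)     # V(mu) = mu(1 - mu) = 0.1875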

SLIDE 7

The density (1) is parameterized by the parameter θ, so we need to establish the connection between θ and µ. From (2) we have that E(Y) = µ = b′(θ). Let the inverse function of b′ be denoted by H. Then

    θ = (b′)⁻¹(µ) = H(µ) = H(g⁻¹(ω + x⊤β))

which establishes the link between the parameter θ and x⊤β.
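For the Bernoulli example treated later, b′ is the inverse logit, so H = (b′)⁻¹ is the logit itself; in R (an added illustration):

    mu <- 0.7
    qlogis(mu)           # theta = H(mu), the logit of mu
    plogis(qlogis(mu))   # b'(H(mu)) recovers mu = 0.7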

SLIDE 8

2.2 Variance function

From (2) we also have that Var(Y) = b′′(θ)a(φ). Since θ = H(µ) we have Var(Y) = b′′(H(µ))a(φ). The function V(µ) = b′′(H(µ)) is called the variance function, as it expresses how the variance depends on the mean.

2.3 Canonical link

The connection between θ and the linear predictor η is given by θ = (b′)⁻¹(g⁻¹(η)). So if b′ = g⁻¹ then θ = η. In this case g is said to be the canonical link function. Using the canonical link function ensures that the log-likelihood is concave, so that it has a unique maximum.
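To illustrate (added here, not on the original slide): for the Poisson distribution b(θ) = e^θ, so b′(θ) = e^θ, which is exactly the inverse of the log link; hence the log link is canonical for the Poisson family.

    # g^{-1} = b' = exp for the Poisson family, so log is the canonical link
    fam <- poisson()
    eta <- seq(-2, 2, by = 0.5)
    all.equal(fam$linkinv(eta), exp(eta))   # TRUE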

SLIDE 9

2.4 Example: Bernoulli distribution and canonical link

Consider the Bernoulli distribution

    Pr(Y = y) = p^y (1 − p)^{1−y},  y ∈ {0, 1}

Rewrite as

    Pr(Y = y) = exp{ y log p + (1 − y) log(1 − p) }
              = exp{ y log[p/(1 − p)] + log(1 − p) }
              = exp{ yθ − log(1 + exp(θ)) }

where we have introduced the logit as θ = g(p) = log[p/(1 − p)]. Notice that p = e^θ/(1 + e^θ).

Hence b(θ) = log(1 + e^θ), w = 1 and φ = 1. Now

    b′(θ) = e^θ/(1 + e^θ) = p  and  b′′(θ) = e^θ/(1 + e^θ)² = p(1 − p)

SLIDE 10

3 Estimation: Iteratively reweighted least squares

Consider the situation where Y1, . . . , Yn are independent and the distribution of Yi has density of the form (1) with weight wi. We have

  • E(Yi) = µi
  • Var(Yi) = V(µi)φ/wi
  • g(µi) = ηi = ωi + xi⊤β

Then the log-likelihood becomes

    l(β) = (1/φ) Σi wi(yiθi − b(θi)) − Σi c(yi, φ)

where θi = H(µi) = H(g⁻¹(ηi)) = H(g⁻¹(ωi + xi⊤β)).

SLIDE 11

Imagine a Taylor expansion of g(Yi) around µi:

    g(Yi) ≈ g(µi) + g′(µi)(Yi − µi) = Zi

The key to the algorithm is to work with the adjusted dependent variables zi. To actually calculate these we need an initial guess of µi; more about this later. We have

    E(Zi) = g(µi) = ηi,  Var(Zi) = (g′(µi))²V(µi)φ/wi.

In vector form we get

    E(Z) = η = ω + Xβ,  Var(Z) = φ diag{(g′(µi))²V(µi)/wi} = φW(µ)

The weighted least squares estimate of β is

    β̂ = (X⊤W⁻¹X)⁻¹X⊤W⁻¹(z − ω)

Notice: both W and z depend on β, so some iterative scheme is needed.

SLIDE 12

Initialization: Since g(µi) = ωi + xi⊤β we must have g(yi) ≈ ωi + xi⊤β. So we may start by applying the link function to the data, obtaining an initial estimate of β as

    β^0 = (X⊤X)⁻¹X⊤(g(y) − ω)

Iteration: Given β^m from the mth iteration we calculate

    z^{m+1} = Xβ^m + diag{g′(µi^m)}(y − µ^m) = Xβ^m + r

Update W accordingly as W = W(µ^m) and calculate

    β^{m+1} = (X⊤W⁻¹X)⁻¹X⊤W⁻¹(z^{m+1} − ω)

Notice that

    β^{m+1} = β^m + (X⊤W⁻¹X)⁻¹X⊤W⁻¹(r − ω)

This means that α = β^{m+1} − β^m is determined as the MLE in the model r − ω ∼ N(Xα, W).
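The scheme above can be written down compactly in R. The following is a minimal sketch, not a full implementation: it assumes a family object fam, prior weights wi = 1, offset ω = 0, and a 0/1-type response so that g(y) can be computed after nudging y away from the boundary (the names my_irls, maxit and tol are illustrative).

    my_irls <- function(y, X, fam = binomial(), maxit = 25, tol = 1e-8) {
      # beta^0 = (X'X)^{-1} X' g(y); crude start for a 0/1 response,
      # nudged away from 0 and 1 so that g(y) is finite
      eta  <- fam$linkfun(pmin(pmax(y, 0.01), 0.99))
      beta <- qr.solve(X, eta)
      for (m in seq_len(maxit)) {
        eta  <- drop(X %*% beta)
        mu   <- fam$linkinv(eta)
        gp   <- 1 / fam$mu.eta(eta)            # g'(mu) = 1/(d mu / d eta)
        z    <- eta + gp * (y - mu)            # adjusted dependent variable
        winv <- 1 / (gp^2 * fam$variance(mu))  # diagonal of W^{-1} (phi omitted)
        beta_new <- solve(t(X) %*% (winv * X), t(X) %*% (winv * z))
        if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
        beta <- beta_new
      }
      drop(beta)
    }

For 0/1 data y and a model matrix X, the result should agree with coef(glm.fit(X, y, family = binomial())) up to the convergence tolerance.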

SLIDE 13

3.1 Saturated model, deviance and estimation of φ

Recall that the log-likelihood l has the form l(β) = l(µ, y) = l1(µ, y)/φ. The saturated model is obtained by making no restrictions on µ, so in this case µ̂ = y, and l(y, y) is the log-likelihood for the saturated model. The goodness of fit of a generalized linear model is investigated using the deviance:

    D(µ̂, y) = −2(l(µ̂, y) − l(y, y)) = −2(l1(µ̂, y) − l1(y, y))/φ = D1(µ̂, y)/φ

Under certain assumptions on the covariates X, it can be shown that D is asymptotically χ² distributed with n − p degrees of freedom, where p is the rank of X.

SLIDE 14

  • When φ is known, this result provides a goodness-of-fit test of the model.
  • When φ is unknown, this result provides a natural estimate of φ as

        φ̃ = D1(µ̂, y)/(n − p)

    (but in this case we do not have a goodness-of-fit test).
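In R these quantities can be read off a fitted model object; an added illustration with simulated Poisson data (where the true φ = 1):

    set.seed(2)
    x <- runif(100)
    y <- rpois(100, lambda = exp(1 + 2 * x))
    fit <- glm(y ~ x, family = poisson())
    deviance(fit)                      # D1(mu_hat, y)
    deviance(fit) / df.residual(fit)   # phi_tilde = D1/(n - p), close to 1 here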

SLIDE 15

3.2 Binomial distribution

Suppose Yi ∼ Bin(ni, µi) where g(µi) = ηi = xi⊤β for i = 1, . . . , N.

There are different ways of putting the binomial distribution into the GLIM framework. First notice that φ = 1 and wi = 1, so these terms can be ignored. For later use, let X = [x1 : · · · : xN]⊤ and vi = {g′(µi)}²µi(1 − µi).

We may regard Yi as a sum of ni independent Bernoulli variables:

    Yi = Σ_{j=1}^{ni} uij,  uij ∼ Bern(µi)

We may therefore fit the binomial distribution this way, but this is potentially very inefficient in terms of storage and computing time. An alternative approach is described below.

SLIDE 16

In the Bernoulli setup:

  • The row vector of covariates xi⊤ must then be repeated ni times: let X̃i = 1_{ni} xi⊤ and X̃ = [X̃1 : · · · : X̃N]⊤. With M = Σi ni, X̃ is an M × p matrix.
  • Let also Ṽi = vi I_{ni} and Ṽ = diag(Ṽ1, . . . , ṼN).
  • Finally, let zij = g(µi) + g′(µi)(uij − µi), z̃i = (zi1, . . . , zi,ni)⊤ and z̃ = (z̃1, . . . , z̃N)⊤.

We omit the subscript ni indicating the dimensions of 1 and I. We then have

    X̃⊤Ṽ⁻¹ = [(1/v1) x1 1⊤ : · · · : (1/vN) xN 1⊤]
    X̃⊤Ṽ⁻¹z̃ = (1/v1) x1 1⊤z̃1 + · · · + (1/vN) xN 1⊤z̃N
    X̃⊤Ṽ⁻¹X̃ = (n1/v1) x1 x1⊤ + · · · + (nN/vN) xN xN⊤

SLIDE 17

Recall that zij = g(µi) + g′(µi)(uij − µi). Let

    z̄i = (1/ni) Σj zij = (1/ni) 1⊤_{ni} z̃i = g(µi) + g′(µi)(yi/ni − µi)

and let z̄ = (z̄1, . . . , z̄N)⊤. Then

    E(z̄i) = g(µi),  Var(z̄i) = (1/ni){g′(µi)}²µi(1 − µi) = vi/ni

Let V = diag(v1/n1, . . . , vN/nN). Then

    X⊤V⁻¹ = [(n1/v1) x1 : · · · : (nN/vN) xN]
    X⊤V⁻¹z̄ = (n1/v1) x1 z̄1 + · · · + (nN/vN) xN z̄N
    X⊤V⁻¹X = (n1/v1) x1 x1⊤ + · · · + (nN/vN) xN xN⊤

SLIDE 18

Notice that 1⊤z̃i = Σj zij = ni z̄i. This means that

    X̃⊤Ṽ⁻¹z̃ = X⊤V⁻¹z̄,  X̃⊤Ṽ⁻¹X̃ = X⊤V⁻¹X

So iterating using (X̃, Ṽ) or (X, V) produces the same result.
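The equivalence can also be seen directly in R (an added illustration): fitting the expanded Bernoulli data and the grouped binomial data gives identical coefficient estimates.

    # Grouped binomial data: N = 3 covariate patterns with n_i trials each
    set.seed(3)
    ni <- c(10, 20, 30)
    x  <- c(-1, 0, 1)
    yi <- rbinom(3, size = ni, prob = plogis(0.5 + x))

    # Grouped fit: response is cbind(successes, failures)
    fit_grouped <- glm(cbind(yi, ni - yi) ~ x, family = binomial())

    # Ungrouped fit: expand pattern i into n_i Bernoulli observations
    u  <- unlist(mapply(function(y, n) rep(c(1, 0), c(y, n - y)), yi, ni))
    xx <- rep(x, ni)
    fit_bern <- glm(u ~ xx, family = binomial())

    cbind(coef(fit_grouped), coef(fit_bern))   # identical columns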

SLIDE 19

4 Resources

  • The standard reference on generalized linear models is McCullagh and Nelder: Generalized Linear Models.
  • http://www.commanster.eu/ms/glm.pdf
SLIDE 20

5 EXERCISES: Generalized linear models

The function model.matrix() can be used for generating a model matrix X from a given formula and a dataset. So in the following we shall assume that we have a vector of responses y, a model matrix X and possibly also a weight vector w.

  • 1. Implement your own function myglm() which takes (y, X, w) as input. In doing so, you may find it helpful to use that the various families are implemented in R. See e.g. ?binomial. You may seek inspiration from the function glm.fit.
  • 2. The result from your function could be of a certain class, and you could create interesting methods. For example, it would be nice to obtain the approximate covariance matrix of β̂, which is (X⊤W⁻¹X)⁻¹; a minimal sketch is given below.
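A minimal sketch for exercise 2 (a starting point, not a model solution): assume myglm() returns a list of class "myglm" containing the coefficients, the model matrix X and the diagonal winv of W⁻¹ at convergence; these element names are illustrative.

    # Inside myglm(), the return value could be constructed as:
    #   structure(list(coefficients = beta, X = X, winv = winv), class = "myglm")

    vcov.myglm <- function(object, ...) {
      X <- object$X
      solve(t(X) %*% (object$winv * X))   # (X' W^{-1} X)^{-1}
    }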