

SLIDE 1

Generalized linear models

Søren Højsgaard
Department of Mathematical Sciences, Aalborg University, Denmark
October 29, 2012

Printed: October 29, 2012 File: glim-slides.tex

SLIDE 2

Contents

SLIDE 3

1 Densities for generalized linear models

Consider densities of the form

    fY(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) − c(y, φ) }    (1)

where φ is a fixed parameter (not part of the model).
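As a concrete worked example (added here for illustration; the slide itself stops at (1)), the Poisson distribution with mean λ is of this form:

    f(y; λ) = λ^y e^{−λ}/y! = exp{ y log λ − λ − log y! }

so that θ = log λ, b(θ) = e^θ, a(φ) = 1 and c(y, φ) = log y!.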

SLIDE 4

1.1 Mean and variance

For random variables with densities of the above form we have:

    (∗)  E(Y) = b′(θ)
    (∗∗) Var(Y) = b′′(θ)a(φ)    (2)

Notice that

    ∂/∂θ fY(y; θ, φ) = [y − b′(θ)]/a(φ) · fY(y; θ, φ).

Since ∫ fY dy = 1, its derivative with respect to θ is zero. Assuming that the order of integration and differentiation can be interchanged we get

    0 = d/dθ ∫ fY dy = ∫ ∂/∂θ fY dy = ∫ [y − b′(θ)]/a(φ) · fY dy    (3)
      = (1/a(φ)) [ ∫ y fY dy − b′(θ) ∫ fY dy ]
      = (1/a(φ)) [µ − b′(θ)]

from which (2)(∗) follows.
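A quick numerical sanity check of (2), added here and assuming the Poisson example above, where b′(θ) = b′′(θ) = e^θ and a(φ) = 1:

    # Check E(Y) = b'(theta) and Var(Y) = b''(theta)a(phi) for the Poisson case,
    # where theta = log(lambda), b(theta) = exp(theta), a(phi) = 1.
    set.seed(1)
    theta <- log(3)                     # so lambda = 3
    y <- rpois(1e6, lambda = exp(theta))
    c(mean(y), exp(theta))              # both approximately 3 = b'(theta)
    c(var(y),  exp(theta))              # both approximately 3 = b''(theta)a(phi)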

SLIDE 5

By differentiating twice we get

    0 = d²/dθ² ∫ fY dy = ∫ ∂/∂θ { [y − b′(θ)]/a(φ) · fY } dy
      = ∫ { −b′′(θ)/a(φ) + ([y − b′(θ)]/a(φ))² } fY dy

From this (2)(∗∗) follows since, using µ = b′(θ),

    0 = ∫ { ([y − µ]/a(φ))² − b′′(θ)/a(φ) } fY dy    (4)
      = (1/a(φ)²) ∫ (y − µ)² fY dy − (b′′(θ)/a(φ)) ∫ fY dy
      = (1/a(φ)²) Var(Y) − b′′(θ)/a(φ)

SLIDE 6

2 The components of a generalized linear model

Consider the situation where Y1, . . . , Yn are independent random variables and Yi has density fY (y; θi, φ) of the form (1). In the following it is assumed that a(φ) = φ/w where w is a known weight.

2.1 Linear predictor and link function

To each Y there is associated a vector x of explanatory variables and a fixed number ω. The linear predictor is

    η = η(β) = ω + x⊤β.

In generalized linear models (GLIMs) the linear predictor is related to the mean µ = E(Y) through the link function g(), η = g(µ), so that

    µ = g⁻¹(η) = g⁻¹(ω + x⊤β).
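In R, a link function, its inverse and the associated variance function are bundled in a family object; a small illustration (added here, not on the original slide):

    # The binomial family uses the logit link by default
    fam <- binomial()
    fam$linkfun(0.25)      # g(mu): log(0.25/0.75), approx. -1.0986
    fam$linkinv(-1.0986)   # g^{-1}(eta), approx. 0.25
    fam$variance(0.25)     # V(mu) = mu(1 - mu) = 0.1875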

SLIDE 7

The density (1) is parameterized by the parameter θ, so we need to establish the connection between θ and µ. From (2) we have that E(Y) = µ = b′(θ). Let the inverse function of b′ be denoted by H. Then

    θ = (b′)⁻¹(µ) = H(µ) = H(g⁻¹(ω + x⊤β))

which establishes the link between the parameter θ and x⊤β.
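For the Bernoulli example treated later, b′ is the inverse logit, so H = (b′)⁻¹ is the logit itself; in R (an added illustration):

    mu <- 0.7
    qlogis(mu)           # theta = H(mu), the logit of mu
    plogis(qlogis(mu))   # b'(H(mu)) recovers mu = 0.7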

SLIDE 8

2.2 Variance function

From (2) we also have that Var(Y) = b′′(θ)a(φ). Since θ = H(µ) we have Var(Y) = b′′(H(µ))a(φ). The function V(µ) = b′′(H(µ)) is called the variance function, as it expresses how the variance depends on the mean.

2.3 Canonical link

The connection between θ and the linear predictor η is given by θ = (b′)⁻¹(g⁻¹(η)). So if b′ = g⁻¹ then θ = η. In this case g is said to be the canonical link function. Using the canonical link function ensures that the log-likelihood is concave, so that it has a unique maximum.
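To illustrate (added here, not on the original slide): for the Poisson distribution b(θ) = e^θ, so b′(θ) = e^θ, which is exactly the inverse of the log link; hence the log link is canonical for the Poisson family.

    # g^{-1} = b' = exp for the Poisson family, so log is the canonical link
    fam <- poisson()
    eta <- seq(-2, 2, by = 0.5)
    all.equal(fam$linkinv(eta), exp(eta))   # TRUE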

SLIDE 9

2.4 Example: Bernoulli distribution and canonical link

Consider the Bernoulli distribution

    Pr(Y = y) = p^y (1 − p)^{1−y},  y ∈ {0, 1}

Rewrite as

    Pr(Y = y) = exp{ y log p + (1 − y) log(1 − p) }
              = exp{ y log[p/(1 − p)] + log(1 − p) }
              = exp{ yθ − log(1 + exp(θ)) }

where we have introduced the logit as θ = g(p) = log[p/(1 − p)]. Notice that p = e^θ/(1 + e^θ).

Hence b(θ) = log(1 + e^θ), w = 1 and φ = 1. Now

    b′(θ) = e^θ/(1 + e^θ) = p  and  b′′(θ) = e^θ/(1 + e^θ)² = p(1 − p)

SLIDE 10

3 Estimation: Iteratively reweighted least squares

Consider the situation where Y1, . . . , Yn are independent and the distribution of Yi has density of the form (1) with weight wi. We have

  • E(Yi) = µi
  • Var(Yi) = V(µi)φ/wi
  • g(µi) = ηi = ωi + xi⊤β

Then the log-likelihood becomes

    l(β) = (1/φ) Σi wi(yiθi − b(θi)) − Σi c(yi, φ)

where θi = H(µi) = H(g⁻¹(ηi)) = H(g⁻¹(ωi + xi⊤β)).

SLIDE 11

Imagine a Taylor expansion of g(Yi) around µi:

    g(Yi) ≈ g(µi) + g′(µi)(Yi − µi) = Zi

The key to the algorithm is to work with the adjusted dependent variables zi. To actually calculate these we need an initial guess of µi; more about this later. We have

    E(Zi) = g(µi) = ηi,  Var(Zi) = (g′(µi))²V(µi)φ/wi.

In vector form we get

    E(Z) = η = ω + Xβ,  Var(Z) = φ diag{(g′(µi))²V(µi)/wi} = φW(µ)

The weighted least squares estimate of β is

    β̂ = (X⊤W⁻¹X)⁻¹X⊤W⁻¹(z − ω)

Notice: both W and z depend on β, so some iterative scheme is needed.

SLIDE 12

Initialization: Since g(µi) = ωi + xi⊤β we must have g(yi) ≈ ωi + xi⊤β. So we may start by applying the link function to the data, obtaining an initial estimate of β as

    β^0 = (X⊤X)⁻¹X⊤(g(y) − ω)

Iteration: Given β^m from the mth iteration we calculate

    z^{m+1} = Xβ^m + diag{g′(µi^m)}(y − µ^m) = Xβ^m + r

Update W accordingly as W = W(µ^m) and calculate

    β^{m+1} = (X⊤W⁻¹X)⁻¹X⊤W⁻¹(z^{m+1} − ω)

Notice that

    β^{m+1} = β^m + (X⊤W⁻¹X)⁻¹X⊤W⁻¹(r − ω)

This means that α = β^{m+1} − β^m is determined as the MLE in the model r − ω ∼ N(Xα, W).
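The scheme above can be written down compactly in R. The following is a minimal sketch, not a full implementation: it assumes a family object fam, prior weights wi = 1, offset ω = 0, and a 0/1-type response so that g(y) can be computed after nudging y away from the boundary (the names my_irls, maxit and tol are illustrative).

    my_irls <- function(y, X, fam = binomial(), maxit = 25, tol = 1e-8) {
      # beta^0 = (X'X)^{-1} X' g(y); crude start for a 0/1 response,
      # nudged away from 0 and 1 so that g(y) is finite
      eta  <- fam$linkfun(pmin(pmax(y, 0.01), 0.99))
      beta <- qr.solve(X, eta)
      for (m in seq_len(maxit)) {
        eta  <- drop(X %*% beta)
        mu   <- fam$linkinv(eta)
        gp   <- 1 / fam$mu.eta(eta)            # g'(mu) = 1/(d mu / d eta)
        z    <- eta + gp * (y - mu)            # adjusted dependent variable
        winv <- 1 / (gp^2 * fam$variance(mu))  # diagonal of W^{-1} (phi omitted)
        beta_new <- solve(t(X) %*% (winv * X), t(X) %*% (winv * z))
        if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
        beta <- beta_new
      }
      drop(beta)
    }

For 0/1 data y and a model matrix X, the result should agree with coef(glm.fit(X, y, family = binomial())) up to the convergence tolerance.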

SLIDE 13

3.1 Saturated model, deviance and estimation of φ

Recall that the log-likelihood l has the form l(β) = l(µ, y) = l1(µ, y)/φ. The saturated model is obtained by making no restrictions on µ, so in this case µ̂ = y, and l(y, y) is the log-likelihood for the saturated model. The goodness of fit of a generalized linear model is investigated using the deviance:

    D(µ̂, y) = −2(l(µ̂, y) − l(y, y)) = −2(l1(µ̂, y) − l1(y, y))/φ = D1(µ̂, y)/φ

Under certain assumptions on the covariates X, it can be shown that D is asymptotically χ² distributed with n − p degrees of freedom, where p is the rank of X.

SLIDE 14

  • When φ is known, this result provides a goodness-of-fit test of the model.
  • When φ is unknown, this result provides a natural estimate of φ as

        φ̃ = D1(µ̂, y)/(n − p)

    (but in this case we do not have a goodness-of-fit test).
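In R these quantities can be read off a fitted model object; an added illustration with simulated Poisson data (where the true φ = 1):

    set.seed(2)
    x <- runif(100)
    y <- rpois(100, lambda = exp(1 + 2 * x))
    fit <- glm(y ~ x, family = poisson())
    deviance(fit)                      # D1(mu_hat, y)
    deviance(fit) / df.residual(fit)   # phi_tilde = D1/(n - p), close to 1 here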

SLIDE 15

3.2 Binomial distribution

Suppose Yi ∼ Bin(ni, µi) where g(µi) = ηi = xi⊤β for i = 1, . . . , N.

There are different ways of putting the binomial distribution into the GLIM framework. First notice that φ = 1 and wi = 1, so these terms can be ignored. For later use, let X = [x1 : · · · : xN]⊤ and vi = {g′(µi)}²µi(1 − µi).

We may regard Yi as a sum of ni independent Bernoulli variables:

    Yi = Σ_{j=1}^{ni} uij,  uij ∼ Bern(µi)

We may therefore fit the binomial distribution this way, but this is potentially very inefficient in terms of storage and computing time. An alternative approach is described below.

SLIDE 16

In the Bernoulli setup:

  • The row vector of covariates xi⊤ must then be repeated ni times: let X̃i = 1_{ni} xi⊤ and X̃ = [X̃1 : · · · : X̃N]⊤. With M = Σi ni, X̃ is an M × p matrix.
  • Let also Ṽi = vi I_{ni} and Ṽ = diag(Ṽ1, . . . , ṼN).
  • Finally, let zij = g(µi) + g′(µi)(uij − µi), z̃i = (zi1, . . . , zi,ni)⊤ and z̃ = (z̃1, . . . , z̃N)⊤.

We omit the subscript ni indicating the dimensions of 1 and I. We then have

    X̃⊤Ṽ⁻¹ = [(1/v1) x1 1⊤ : · · · : (1/vN) xN 1⊤]
    X̃⊤Ṽ⁻¹z̃ = (1/v1) x1 1⊤z̃1 + · · · + (1/vN) xN 1⊤z̃N
    X̃⊤Ṽ⁻¹X̃ = (n1/v1) x1 x1⊤ + · · · + (nN/vN) xN xN⊤

SLIDE 17

Recall that zij = g(µi) + g′(µi)(uij − µi). Let

    z̄i = (1/ni) Σj zij = (1/ni) 1⊤_{ni} z̃i = g(µi) + g′(µi)(yi/ni − µi)

and let z̄ = (z̄1, . . . , z̄N)⊤. Then

    E(z̄i) = g(µi),  Var(z̄i) = (1/ni){g′(µi)}²µi(1 − µi) = vi/ni

Let V = diag(v1/n1, . . . , vN/nN). Then

    X⊤V⁻¹ = [(n1/v1) x1 : · · · : (nN/vN) xN]
    X⊤V⁻¹z̄ = (n1/v1) x1 z̄1 + · · · + (nN/vN) xN z̄N
    X⊤V⁻¹X = (n1/v1) x1 x1⊤ + · · · + (nN/vN) xN xN⊤

SLIDE 18

Notice that 1⊤z̃i = Σj zij = ni z̄i. This means that

    X̃⊤Ṽ⁻¹z̃ = X⊤V⁻¹z̄,  X̃⊤Ṽ⁻¹X̃ = X⊤V⁻¹X

So iterating using (X̃, Ṽ) or (X, V) produces the same result.
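The equivalence can also be seen directly in R (an added illustration): fitting the expanded Bernoulli data and the grouped binomial data gives identical coefficient estimates.

    # Grouped binomial data: N = 3 covariate patterns with n_i trials each
    set.seed(3)
    ni <- c(10, 20, 30)
    x  <- c(-1, 0, 1)
    yi <- rbinom(3, size = ni, prob = plogis(0.5 + x))

    # Grouped fit: response is cbind(successes, failures)
    fit_grouped <- glm(cbind(yi, ni - yi) ~ x, family = binomial())

    # Ungrouped fit: expand pattern i into n_i Bernoulli observations
    u  <- unlist(mapply(function(y, n) rep(c(1, 0), c(y, n - y)), yi, ni))
    xx <- rep(x, ni)
    fit_bern <- glm(u ~ xx, family = binomial())

    cbind(coef(fit_grouped), coef(fit_bern))   # identical columns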

SLIDE 19

4 Resources

  • The standard reference on generalized linear models is McCullagh and Nelder: Generalized Linear Models.
  • http://www.commanster.eu/ms/glm.pdf
SLIDE 20

5 EXERCISES: Generalized linear models

The function model.matrix() can be used for generating a model matrix X from a given formula and a dataset. So in the following we shall assume that we have a vector of responses y, a model matrix X and possibly also a weight vector w.

  • 1. Implement your own function myglm() which takes (y, X, w) as input. In doing so, you may find it helpful to use that the various families are implemented in R. See e.g. ?binomial. You may seek inspiration from the function glm.fit.
  • 2. The result from your function could be of a certain class, and you could create interesting methods. For example, it would be nice to obtain the approximate covariance matrix of β̂, which is (X⊤W⁻¹X)⁻¹; a minimal sketch is given below.
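A minimal sketch for exercise 2 (a starting point, not a model solution): assume myglm() returns a list of class "myglm" containing the coefficients, the model matrix X and the diagonal winv of W⁻¹ at convergence; these element names are illustrative.

    # Inside myglm(), the return value could be constructed as:
    #   structure(list(coefficients = beta, X = X, winv = winv), class = "myglm")

    vcov.myglm <- function(object, ...) {
      X <- object$X
      solve(t(X) %*% (object$winv * X))   # (X' W^{-1} X)^{-1}
    }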