SLIDE 1

Introduction to General and Generalized Linear Models

Mixed effects models - Part IV

Henrik Madsen and Poul Thyregod
Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Kgs. Lyngby

January 2011

SLIDE 2

This lecture

General mixed effects models
Laplace approximation

SLIDE 3

General mixed effects models

Let us now look at methods to deal with nonlinear and non-normal mixed effects models. In general it will be impossible to obtain closed form solutions and hence numerical methods must be used. Estimation and inference will be based on likelihood principles.

SLIDE 4

General mixed effects models

The general mixed effects model can be represented by its likelihood function:

L_M(θ; y) = ∫_{ℝ^q} L(θ; u, y) du

where y is the vector of observed random variables, θ is the vector of model parameters to be estimated, and u is the vector of the q unobserved random variables (random effects). The likelihood function L is the joint likelihood of both the observed and the unobserved random variables. The likelihood function for estimating θ is the marginal likelihood L_M, obtained by integrating out the unobserved random variables.
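To make the integral concrete, here is a minimal numerical sketch (not from the slides; the model, data and parameter values are hypothetical): a Gaussian model with a single random effect, q = 1, where L_M is evaluated by one-dimensional quadrature.

```python
# A minimal sketch: the marginal likelihood of a toy one-random-effect model,
# computed by numerically integrating the joint likelihood L(theta; u, y) over u.
import numpy as np
from scipy import stats
from scipy.integrate import quad

y = np.array([1.2, 0.8, 1.5])          # hypothetical observations

def joint_likelihood(u, mu, sigma, psi):
    """L(theta; u, y) = f(y | u) * f(u): Gaussian data given u, Gaussian u."""
    f_y_given_u = np.prod(stats.norm.pdf(y, loc=mu + u, scale=sigma))
    f_u = stats.norm.pdf(u, loc=0.0, scale=np.sqrt(psi))
    return f_y_given_u * f_u

def marginal_likelihood(mu, sigma, psi):
    """L_M(theta; y): integrate the joint likelihood over u (here q = 1)."""
    val, _ = quad(joint_likelihood, -np.inf, np.inf, args=(mu, sigma, psi))
    return val

print(marginal_likelihood(mu=1.0, sigma=0.5, psi=0.3))
```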

SLIDE 5

General mixed effects models

The integral shown on the previous slide is generally difficult to solve when the number of unobserved random variables is more than a few, i.e. for large values of q. A large value of q significantly increases the computational demand because of the product rule: if an integral is sampled at m points per dimension, the total number of samples needed is m^q, which rapidly becomes infeasible even for a limited number of random effects.

The likelihood function gives a very broad definition of mixed models: the only requirement for using mixed modeling is to define a joint likelihood function for the model of interest. In this way mixed modeling can be applied to any likelihood based statistical modeling. Examples of applications are linear mixed models (LMM), nonlinear mixed models (NLMM) and generalized linear mixed models (GLMM), but also models based on Markov chains, ODEs or SDEs.
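A small illustration of the product rule: even a modest per-dimension resolution makes the tensor-product grid astronomically large as q grows.

```python
# Product-rule cost of naive tensor-product integration: m points per
# dimension means m**q integrand evaluations in total.
m = 20
for q in (1, 2, 5, 10, 20):
    print(f"q = {q:2d}: {float(m**q):.2e} integrand evaluations")
```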

SLIDE 6

Hierarchical models

As for the Gaussian linear mixed models it is useful to formulate the model as a hierarchical model containing a first stage model

f_{Y|u}(y; u, β)

which is a model for the data given the random effects, and a second stage model

f_U(u; Ψ)

which is a model for the random effects. The total set of parameters is θ = (β, Ψ). Hence the joint likelihood is given as

L(β, Ψ; u, y) = f_{Y|u}(y; u, β) f_U(u; Ψ)
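As a sketch (reusing the hypothetical Gaussian toy model from slide 4), the two-stage formulation translates directly into code: the joint log-likelihood is the sum of the first stage and second stage log-densities.

```python
# Joint log-likelihood of the hierarchical model: first stage log f(y|u; beta)
# plus second stage log f(u; Psi), here both Gaussian (hypothetical toy model).
import numpy as np
from scipy import stats

y = np.array([1.2, 0.8, 1.5])

def joint_loglik(u, beta=1.0, sigma=0.5, psi=0.3):
    first_stage = np.sum(stats.norm.logpdf(y, loc=beta + u, scale=sigma))  # log f(y|u)
    second_stage = stats.norm.logpdf(u, loc=0.0, scale=np.sqrt(psi))       # log f(u)
    return first_stage + second_stage

print(joint_loglik(0.2))
```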

SLIDE 7

Hierarchical models

To obtain the likelihood for the model parameters (β, Ψ) the unobserved random effects are again integrated out. The likelihood function for estimating (β, Ψ) is, as before, the marginal likelihood

L_M(β, Ψ; y) = ∫_{ℝ^q} L(β, Ψ; u, y) du

where q is the number of random effects, and β and Ψ are the parameters to be estimated.

SLIDE 8

Grouping structures and nested effects

For nonlinear mixed models where no closed form solution to the likelihood function is available, it is necessary to invoke some form of numerical approximation to be able to estimate the model parameters. The complexity of this problem mainly depends on the dimensionality of the integration problem, which in turn depends on the dimension of U and in particular on the grouping structure of the random effects in the data. These structures include a single grouping, nested grouping, partially crossed and crossed random effects.

For problems with only one level of grouping the marginal likelihood can be simplified as

L_M(β, Ψ; y) = ∏_{i=1}^{M} ∫_{ℝ^{q_i}} f_{Y|u_i}(y_i; u_i, β) f_{U_i}(u_i; Ψ) du_i

where q_i is the number of random effects for group i and M is the number of groups.
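A sketch of this factorization for a hypothetical random-intercept model: each group contributes a one-dimensional integral (q_i = 1), and the marginal log-likelihood is the sum of their logs.

```python
# With a single grouping structure, the q-dimensional integral factors into
# M one-dimensional integrals, one per group (toy Gaussian model again).
import numpy as np
from scipy import stats
from scipy.integrate import quad

groups = [np.array([1.2, 0.8]), np.array([2.1, 1.9, 2.4]), np.array([0.3])]

def group_integrand(u_i, y_i, beta, sigma, psi):
    f_y = np.prod(stats.norm.pdf(y_i, loc=beta + u_i, scale=sigma))
    return f_y * stats.norm.pdf(u_i, loc=0.0, scale=np.sqrt(psi))

def marginal_loglik(beta, sigma, psi):
    # log L_M = sum over groups of the log of each one-dimensional integral
    return sum(np.log(quad(group_integrand, -np.inf, np.inf,
                           args=(y_i, beta, sigma, psi))[0])
               for y_i in groups)

print(marginal_loglik(beta=1.0, sigma=0.5, psi=0.3))
```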

SLIDE 9

Grouping structures and nested effects

Instead of having to solve an integral of dimension q it is only necessary to solve M smaller integrals of dimension q_i. In typical applications there are often just one or a few random effects per group, which greatly reduces the complexity of the integration problem.

If the data has a nested grouping structure, a reduction of the dimensionality of the integral similar to that shown on the previous slide can be performed. An example of a nested grouping structure is data collected from a number of schools, a number of classes within each school, and a number of students from each class.

SLIDE 10

Grouping structures and nested effects

If the nonlinear mixed model is extended to include an arbitrary structure of random effects, such as crossed or partially crossed random effects, it is required to evaluate the full multi-dimensional integral. Estimation in these models can be handled efficiently using the multivariate Laplace approximation, which only samples the integrand at one point common to all dimensions.

SLIDE 11

The Laplace approximation

For a given set of model parameters θ the joint log-likelihood ℓ(θ, u, y) = log L(θ, u, y) is approximated by a second order Taylor expansion around the optimum ũ = ũ_θ of the log-likelihood function w.r.t. the unobserved random variables u, i.e.

ℓ(θ, u, y) ≈ ℓ(θ, ũ, y) − ½ (u − ũ)ᵀ H(ũ) (u − ũ)

where the first order term of the Taylor expansion vanishes because the expansion is done around the optimum ũ, and

H(ũ) = −ℓ″_uu(θ, u, y)|_{u=ũ}

is the negative Hessian of the joint log-likelihood evaluated at ũ, which will simply be referred to as "the Hessian".

SLIDE 12

The Laplace approximation

Using this approximation, the Laplace approximation of the marginal log-likelihood becomes

ℓ_{M,LA}(θ, y) = log ∫_{ℝ^q} exp( ℓ(θ, ũ, y) − ½ (u − ũ)ᵀ H(ũ) (u − ũ) ) du
             = ℓ(θ, ũ, y) − ½ log |H(ũ)/(2π)|

The integral is eliminated by transforming it into an integration of a multivariate Gaussian density with mean ũ and covariance H⁻¹(ũ).
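A minimal one-dimensional sketch, reusing the hypothetical Gaussian toy model: find ũ by an inner optimization, get H(ũ) by a finite central difference, assemble ℓ_{M,LA}, and compare against quadrature. (For this linear-Gaussian toy model the Laplace approximation happens to be exact.)

```python
# 1-D Laplace approximation of the marginal log-likelihood for the toy model:
#   l_LA = l(theta, u_tilde, y) - 0.5 * log(H(u_tilde) / (2*pi))
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

y = np.array([1.2, 0.8, 1.5])

def loglik(u, beta=1.0, sigma=0.5, psi=0.3):   # joint log-likelihood l(theta, u, y)
    return (np.sum(stats.norm.logpdf(y, loc=beta + u, scale=sigma))
            + stats.norm.logpdf(u, loc=0.0, scale=np.sqrt(psi)))

u_tilde = minimize_scalar(lambda u: -loglik(u)).x   # inner optimization over u
h = 1e-5                                            # central second difference
H = -(loglik(u_tilde + h) - 2 * loglik(u_tilde) + loglik(u_tilde - h)) / h**2

l_LA = loglik(u_tilde) - 0.5 * np.log(H / (2 * np.pi))
l_exact = np.log(quad(lambda u: np.exp(loglik(u)), -np.inf, np.inf)[0])
print(l_LA, l_exact)   # agree here, since the toy model is linear-Gaussian
```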

SLIDE 13

The Laplace approximation

The Laplace likelihood only approximates the marginal likelihood for mixed models with nonlinear random effects, and thus maximizing the Laplace likelihood will result in some amount of error in the resulting estimates. It can be shown that the joint log-likelihood converges to a quadratic function of the random effects as the number of observations per random effect increases, and thus that the Laplace approximation is asymptotically exact.

In practical applications the accuracy of the Laplace approximation may still be of concern, but improved numerical approximations of the marginal likelihood (such as Gaussian quadrature) may easily be computationally infeasible to perform. Another option for improving the accuracy is importance sampling.

SLIDE 14

Two-level hierarchical model

For the two-level or hierarchical model it is readily seen that the joint log-likelihood is

ℓ(θ, u, y) = ℓ(β, Ψ, u, y) = log f_{Y|u}(y; u, β) + log f_U(u; Ψ)

which implies that the Laplace approximation becomes

ℓ_{M,LA}(θ, y) = log f_{Y|u}(y; ũ, β) + log f_U(ũ; Ψ) − ½ log |H(ũ)/(2π)|

It is clear that as long as a likelihood function of the random effects and model parameters can be defined, it is possible to use the Laplace likelihood for estimation in a mixed model framework.

SLIDE 15

Gaussian second stage model

Let us assume that the second stage model is zero-mean Gaussian, i.e.

u ∼ N(0, Ψ)

which means that the random effect distribution is completely described by its covariance matrix Ψ. In this case the Laplace likelihood becomes

ℓ_{M,LA}(θ, y) = log f_{Y|u}(y; ũ, β) − ½ log |Ψ| − ½ ũᵀ Ψ⁻¹ ũ − ½ log |H(ũ)|

where it is seen that we still make no assumptions on the first stage model f_{Y|u}(y; u, β).
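A sketch of this expression as a reusable function; the inputs (ũ, H(ũ) and the first stage log-likelihood evaluated at ũ) are assumed to come from an inner optimization as on the previous slides, and the example values are hypothetical.

```python
# Laplace likelihood with a zero-mean Gaussian second stage:
#   l_LA = log f(y|u~; beta) - 0.5 log|Psi| - 0.5 u~' Psi^{-1} u~ - 0.5 log|H(u~)|
import numpy as np

def laplace_loglik_gaussian_u(logf_y_given_u, u_tilde, Psi, H):
    _, logdet_Psi = np.linalg.slogdet(Psi)
    _, logdet_H = np.linalg.slogdet(H)
    quad_form = u_tilde @ np.linalg.solve(Psi, u_tilde)   # u~' Psi^{-1} u~
    return logf_y_given_u - 0.5 * (logdet_Psi + quad_form + logdet_H)

print(laplace_loglik_gaussian_u(-2.1, np.array([0.2]),
                                Psi=np.array([[0.3]]), H=np.array([[15.3]])))
```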

SLIDE 16

Gaussian second stage model

If we furthermore assume that the first stage model is Gaussian,

Y | U = u ∼ N(μ(β, u), Σ)

then the Laplace likelihood can be further specified. For the hierarchical Gaussian model it is rather easy to obtain a numerical approximation of the Hessian H at the optimum ũ:

H(ũ) ≈ μ′_u Σ⁻¹ (μ′_u)ᵀ + Ψ⁻¹

where μ′_u is the partial derivative of μ with respect to u.

This approximation is called the Gauss-Newton approximation. In some contexts estimation using this approximation is also called the First Order Conditional Estimation (FOCE) method.
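A sketch of the Gauss-Newton Hessian with μ′_u obtained by finite differences; the model function mu below is a hypothetical stand-in for the nonlinear mean of the first stage model.

```python
# Gauss-Newton (FOCE) approximation of the Hessian at u~:
#   H(u~) ~= J' Sigma^{-1} J + Psi^{-1},  J = dmu/du at u~ (n x q Jacobian)
import numpy as np

def gauss_newton_hessian(mu, u_tilde, Sigma, Psi, eps=1e-6):
    n, q = mu(u_tilde).shape[0], u_tilde.shape[0]
    J = np.empty((n, q))
    for j in range(q):                      # finite-difference Jacobian column
        du = np.zeros(q)
        du[j] = eps
        J[:, j] = (mu(u_tilde + du) - mu(u_tilde - du)) / (2 * eps)
    return J.T @ np.linalg.solve(Sigma, J) + np.linalg.inv(Psi)

# Hypothetical nonlinear mean with n = 3 observations and q = 2 random effects:
mu = lambda u: np.array([np.exp(u[0]), u[0] * u[1], np.sin(u[1])])
H = gauss_newton_hessian(mu, np.array([0.1, -0.2]),
                         Sigma=0.25 * np.eye(3), Psi=0.5 * np.eye(2))
print(H)
```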

SLIDE 17

Automatic differentiation

A simple and efficient way to use the Laplace approximation technique outlined here is via the open source software package AD Model Builder, which takes advantage of automatic differentiation.

Any calculation done by a computer program can be broken down into a long chain of simple operations like '+', '−', '·', '/', 'exp', 'log', 'sin', 'cos', 'tan', '√', and so on. It is simple to write down the analytical derivative of each of these operations by itself.

If our log-likelihood function ℓ consisted of only a few of these simple operations, then it would be tractable to use the chain rule

(f ∘ g)′(x) = f′(g(x)) g′(x)

to find the analytical gradient ∇ℓ of the log-likelihood function ℓ.
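As a general illustration (the slides use AD Model Builder; this sketch uses the JAX library instead), automatic differentiation composes the derivatives of exactly these elementary operations via the chain rule:

```python
# Reverse-mode automatic differentiation of a Gaussian negative log-likelihood
# with JAX; the model and parameter values are hypothetical.
import jax
import jax.numpy as jnp

y = jnp.array([1.2, 0.8, 1.5])

def negloglik(params):
    mu, log_sigma = params
    sigma = jnp.exp(log_sigma)             # exp, log, +, *, ... are the
    z = (y - mu) / sigma                   # elementary operations AD chains
    return jnp.sum(0.5 * z**2 + log_sigma + 0.5 * jnp.log(2 * jnp.pi))

grad_fn = jax.grad(negloglik)              # reverse-mode AD
print(grad_fn(jnp.array([1.0, 0.0])))      # exact gradient, no hand derivation
```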

SLIDE 18

Automatic differentiation

Automatic differentiation is a technique where the chain rule is used by the computer program itself. When the program evaluates the log-likelihood it keeps track of all the operations used along the way, and then runs the program backwards (reverse mode automatic differentiation) and uses the chain rule to update the derivatives one simple operation at a time. Automatic differentiation is accurate, and the computational cost of evaluating the gradient is surprisingly low.

SLIDE 19

Automatic differentiation

Theorem: The computational cost of evaluating the gradient ∇ℓ of the log-likelihood with reverse mode automatic differentiation is less than four times the computational cost of evaluating the log-likelihood function ℓ itself. This holds no matter how many parameters the model contains.

It is surprising that the computational cost does not depend on the number of parameters in the model. There is, however, a practical concern. The computational cost mentioned above is measured in the number of operations, but reverse mode automatic differentiation requires all the intermediate variables of the calculation of the negative log-likelihood to be stored in the computer's memory. If the calculation is lengthy, for instance consisting of a long iterative procedure, the memory requirements can therefore be enormous.

SLIDE 20

Automatic differentiation combined with the Laplace approximation

Finding the gradient of the Laplace approximation of the marginal log-likelihood is challenging, because the approximation itself includes the result of a function minimization, and not just a straightforward sequence of simple operations. It is however possible, but requires up to third order derivatives to be computed internally by clever successive application of automatic differentiation.

SLIDE 21

Importance sampling

Importance sampling is a re-weighting technique for approximating integrals w.r.t. a density f by simulation, in cases where it is not feasible to simulate from the distribution with density f. Instead it uses samples from a different distribution with density g, where the support of g includes the support of f.

For general mixed effects models it is possible to simulate from the distribution with density proportional to the second order Taylor approximation

L̃(θ, u, Y) = exp( ℓ(θ, û_θ, Y) − ½ (u − û_θ)ᵀ (−ℓ″_uu(θ, u, Y)|_{u=û_θ}) (u − û_θ) )

as, apart from a normalization constant, it is the density φ_{û_θ, V̂_θ}(u) of a multivariate normal with mean û_θ and covariance

V̂_θ = H⁻¹(û_θ) = (−ℓ″_uu(θ, u, Y)|_{u=û_θ})⁻¹.

SLIDE 22

Importance sampling

The integral to be approximated can be rewritten as

L_M(θ, Y) = ∫ L(θ, u, Y) du = ∫ [ L(θ, u, Y) / φ_{û_θ, V̂_θ}(u) ] φ_{û_θ, V̂_θ}(u) du.

So if u⁽ⁱ⁾, i = 1, …, N are simulated from the multivariate normal distribution with mean û_θ and covariance V̂_θ, then the integral can be approximated by the mean of the importance weights:

L_M(θ, Y) ≈ (1/N) ∑_{i=1}^{N} L(θ, u⁽ⁱ⁾, Y) / φ_{û_θ, V̂_θ}(u⁽ⁱ⁾)
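A one-dimensional sketch of the recipe on the hypothetical toy model from earlier: û_θ and V̂_θ come from the Laplace step, and the marginal likelihood is estimated by the mean of the importance weights.

```python
# Importance sampling for the marginal likelihood of the 1-D toy model:
# draw u(i) ~ N(u_hat, V_hat) and average L(theta, u(i), y) / phi(u(i)).
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

y = np.array([1.2, 0.8, 1.5])

def loglik(u, beta=1.0, sigma=0.5, psi=0.3):   # joint log-likelihood l(theta, u, y)
    return (np.sum(stats.norm.logpdf(y, loc=beta + u, scale=sigma))
            + stats.norm.logpdf(u, loc=0.0, scale=np.sqrt(psi)))

u_hat = minimize_scalar(lambda u: -loglik(u)).x     # mode of l w.r.t. u
h = 1e-5
H = -(loglik(u_hat + h) - 2 * loglik(u_hat) + loglik(u_hat - h)) / h**2
V_hat = 1.0 / H                                     # Laplace proposal variance

rng = np.random.default_rng(0)
u_samples = rng.normal(u_hat, np.sqrt(V_hat), size=10_000)
weights = np.exp([loglik(u) for u in u_samples]) / stats.norm.pdf(
    u_samples, loc=u_hat, scale=np.sqrt(V_hat))
print(np.log(np.mean(weights)))                     # estimate of log L_M
```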

SLIDE 23

AD Model Builder

AD Model Builder is a programming language that builds on C++. It includes helper functions for reading in data, defining model parameters, and implementing and optimizing the negative log-likelihood function. The central feature is automatic differentiation (AD), which is implemented in such a way that the user rarely has to think about it at all.

AD Model Builder can be used for fixed effects models, but in addition it includes the Laplace approximation and importance sampling for dealing with general mixed effects models. AD Model Builder was developed by Dr. Dave Fournier and was a commercial product for many years. It has since been placed in the public domain (see http://www.admb-project.org).
