Applied multilevel modelling: an introduction

James Carpenter
Email: jrc@imbi.uni-freiburg.de · www.missingdata.org.uk
London School of Hygiene & Tropical Medicine, and Institute for Medical Biometry, Freiburg, Germany
Friday, 16th


1. Random effects

U_j is a d-length vector of level-2 (e.g. school) specific random effects. Pre-multiplying by the d-length vector z_ij gives the impact of the level-2 specific effects on observation i: z′_ij U_j.

Example 1: random intercepts. U_j = u_0j ~ N(0, σ²_u0), z_ij = 1 for all i. Then

Var(z′_ij u_0j) = Var(u_0j) = σ²_u0;
Cov(z′_ij u_0j, z′_i′j u_0j) = Cov(u_0j, u_0j) = σ²_u0.

Thus

        ( σ²_u0  …  σ²_u0 )
Σ_u  =  (   ⋮    ⋱    ⋮   )
        ( σ²_u0  …  σ²_u0 )

2. Example 2: random intercepts and slopes

U_j = (u_0j, u_1j)′ ~ N( (0 ; 0), ( σ²_u0  σ_u0,u1 ; σ_u0,u1  σ²_u1 ) ).

Then z_ij = (1, x_ij)′, so the effect of the random terms on school j at student i is u_0j + x_ij u_1j. Now

Var(z′_ij U_j) = Var(u_0j + x_ij u_1j) = σ²_u0 + x²_ij σ²_u1 + 2 x_ij σ_u0,u1;
Cov(z′_ij U_j, z′_i′j U_j) = Cov(u_0j + x_ij u_1j, u_0j + x_i′j u_1j) = σ²_u0 + x_ij x_i′j σ²_u1 + (x_ij + x_i′j) σ_u0,u1.

Hence Σ_u, which is rather messy!

3. Putting this together

Putting the random intercepts and error model together gives

Y_ij = (β0 + u_0j) + β1 x_ij + e_ij,  u_0j ~ N(0, σ²_u0),  e_ij ~ N(0, σ²_e).

Var(Y_ij) = σ²_u0 + σ²_e (gives the diagonal elements of Σ);
Cov(Y_ij, Y_i′j) = σ²_u0 (gives the off-diagonal elements of Σ).

Hence the correlation between results of students in the same school is the same, no matter how far apart their initial literacy scores.

4. Random intercepts: schematic

[Figure: response against 11-year literacy, x. The average line Y = α + βx, with parallel school-specific lines Y = (α + u_j) + βx for schools j = 1, …, 5; the vertical offsets u_1, …, u_5 are the school random intercepts.]

5. And for random intercepts and slopes

Y_ij = (β0 + u_0j) + (β1 + u_1j) x_ij + e_ij,
(u_0j, u_1j)′ ~ N( (0 ; 0), ( σ²_u0  σ_u0u1 ; σ_u0u1  σ²_u1 ) ),  e_ij ~ N(0, σ²_e).

Var(Y_ij) = σ²_u0 + x²_ij σ²_u1 + 2 σ_u0u1 x_ij + σ²_e (gives the diagonal elements of Σ);
Cov(Y_ij, Y_i′j) = σ²_u0 + x_ij x_i′j σ²_u1 + (x_ij + x_i′j) σ_u0u1 (gives the off-diagonal elements of Σ).

Now the correlation between results of students in the same school can decline as the distance between their literacy scores increases.

6. Random intercepts & slopes: schematic

[Figure: response against 11-year literacy, x. The average line Y = α + βx, with school-specific lines Y = (α + u_j) + (β + v_j)x for schools j = 1, …, 5, which now differ in both intercept and slope. Note u_0j ↔ u_j; u_1j ↔ v_j.]

7. Fitting random intercepts model

[Screenshot: software output for the fitted random intercepts model.]
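As a minimal sketch of what this fit looks like in code (the slide itself shows package output), here is the random intercepts model in R with lme4; the data frame exam and the variables score16, lit11 and school are hypothetical stand-ins for the course data.

```r
# Random intercepts model: Y_ij = (beta0 + u_0j) + beta1 * x_ij + e_ij.
# Sketch only: 'exam', 'score16', 'lit11', 'school' are hypothetical names.
library(lme4)

ri <- lmer(score16 ~ lit11 + (1 | school), data = exam)
summary(ri)   # beta-hats, sigma^2_u0 (school) and sigma^2_e (residual)
```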

8. Plotting school level residuals

[Figure: estimated school-level residuals in rank order. The plot shows û_0j ± 2 √(Var û_0j).]
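Continuing the hypothetical lme4 fit above, a sketch of how such a 'caterpillar' plot can be produced; the ±2 standard error bars use the conditional variances lme4 stores alongside the predicted random effects.

```r
# Sketch: ranked school effects u_0j-hat with +/- 2 SE bars,
# continuing the hypothetical 'ri' fit above.
re <- ranef(ri, condVar = TRUE)$school
u0 <- re[["(Intercept)"]]
se <- sqrt(attr(re, "postVar")[1, 1, ])   # conditional SDs of u_0j-hat
o  <- order(u0)
plot(u0[o], pch = 16, xlab = "school (ranked)", ylab = "estimated u_0j")
arrows(seq_along(u0), u0[o] - 2 * se[o],
       seq_along(u0), u0[o] + 2 * se[o],
       angle = 90, code = 3, length = 0.02)
abline(h = 0, lty = 2)
```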

9. Interpretation

Can we deduce that schools at the right-hand end are better, i.e. that they give students with the same 11-year literacy a better education? (Jargon term: value added.)

The UK Department of Education does!

BUT our model assumed each school has the same slope: this is equivalent to the correlation between students' 16-year results within a school being the same, no matter how far apart their 11-year literacy. Let's fit a random intercepts and slopes model to test this.

10. Random intercept & slope model

[Screenshot: software output for the fitted random intercept and slope model.]

Likelihood ratio test: 9357.3 − 9316.9 ≈ 40; cf. χ²_2: p = 2.1 × 10⁻⁹.
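In the hypothetical lme4 setup sketched earlier, this comparison is one line; anova() refits both models by maximum likelihood before forming the likelihood ratio statistic.

```r
# Sketch: LR test of random intercepts vs random intercepts and slopes
# (2 extra parameters: slope variance and intercept-slope covariance).
ris <- lmer(score16 ~ lit11 + (1 + lit11 | school), data = exam)
anova(ri, ris)   # chi-squared on 2 df
```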

11. School specific slopes

[Figure: fitted school-specific lines.] (Note: the school lines do not extend beyond the range of their students' 11-year literacy at intake.)

12. Interpretation

The picture is complex! Some schools appear good for students with high literacy at 11, but less good for students with low literacy at 11. For some schools it is the reverse. This may reflect specific teaching strategies. Some schools are poor overall.

Use of a random intercepts model alone for these data is misleading for parents and a disservice to schools. See http://www.mlwin.com/hgpersonal/league_tables_England_2004_commentary.htm

13. Back to variance model

We decomposed Σ = Σ_u + Σ_w + Σ_e. Now consider Σ_w.

If Var(y_ij) increases (say with time), as in growth data, the variance/covariance is non-stationary. However, if the variance is constant over time and the covariance depends only on the time between observations, it is stationary.

We have seen that random effects (e.g. random intercepts and slopes) often give good models for non-stationary processes. However, they are not so good for stationary processes. Even if the variance is non-stationary, there may be a stationary component.

14. Stationary correlation structure

Again we have w′_j = (w_1j, w_2j, …, w_Ij). Recall the trials example: we see individuals repeatedly over times i = 1, 2, …, I. A stationary covariance process has the I × I matrix

          ( 1                 ρ(|1−2| = 1)       …  ρ(|1−I| = I−1) )
Σ_w = τ²  ( ρ(|2−1| = 1)      1                  …  ρ(|2−I| = I−2) )
          ( ⋮                                    ⋱  ⋮              )
          ( ρ(|I−1| = I−1)    ρ(|I−(I−1)| = 1)   …  1              )

where ρ(0) = 1 and ρ(|ω|) → 0 as |ω| → ∞. Forms for ρ(·) include AR(1), exp(−α|ω|), …

15. Drawing it together

Recall the trials example. For an individual with I measurements, their variance/covariance matrix is the I × I matrix Σ = Σ_u + Σ_w + Σ_e. Do we need all three components? It depends on the data set:
• In some problems (typically growth data), observations are sufficiently spaced out that Σ_w = 0.
• Or, perhaps because of a standard manufacturing process, the association will be dominated by Σ_w, and Σ_u = 0.
The remaining error is always present: Σ_e ≠ 0!

16. The variogram

The sample variogram gives a graphical guide to the correlation structure. It is less prone to local fluctuations than the sample correlation matrix. It is valid even for some non-stationary data, provided the increments Y(i+ω) − Y(i) do not depend on i.

If the variance process is stationary, let ω be the time between observations and ρ(ω) the correlation between them. Then the variogram is

γ(ω) = σ²(1 − ρ(ω)),

where σ² is the 'process' variance. Since ρ(ω) → 0 as ω → ∞, the variogram rises to the process variance.

17. Estimating the variogram

Fit a 'full' model to the data, i.e. a model with the most general mean and variance structure you are interested in. Calculate:
1. the residuals, e_ij;
2. the half-squared differences, ν_ijk = ½(e_ij − e_ik)²;
3. the corresponding time differences, δ_ijk = t_ij − t_ik.

The sample variogram γ̂(ω) is then the average of all the ν_ijk for which δ_ijk = ω. Finally, estimate the 'process variance' by the average of all ½(y_ij − y_lk)² for which i ≠ l. For more details see Diggle et al. (2002).
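A sketch of these steps in R, assuming residuals from the full model are held in long format with columns id, time and r (all hypothetical names); the process variance is computed from residual pairs across different subjects, as in Diggle et al. (2002).

```r
# Sketch: sample variogram from 'full model' residuals.
# d: data frame with columns id, time, r (residual); names hypothetical.
sample_variogram <- function(d) {
  pts <- do.call(rbind, lapply(split(d, d$id), function(s) {
    if (nrow(s) < 2) return(NULL)
    p <- t(combn(nrow(s), 2))                      # all within-subject pairs
    data.frame(omega = abs(s$time[p[, 1]] - s$time[p[, 2]]),
               nu    = 0.5 * (s$r[p[, 1]] - s$r[p[, 2]])^2)
  }))
  aggregate(nu ~ omega, data = pts, FUN = mean)    # gamma-hat(omega)
}

# 'Process variance': half-squared residual differences across subjects.
process_var <- function(d) {
  g <- expand.grid(a = seq_len(nrow(d)), b = seq_len(nrow(d)))
  g <- g[d$id[g$a] != d$id[g$b], ]
  mean(0.5 * (d$r[g$a] - d$r[g$b])^2)
}
```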

18. Interpreting the variogram

[Figure: the variogram γ(ω) plotted against ω. Its intercept at ω = 0 is the error variance σ²_e; it rises by the serial correlation component τ²; the remaining gap up to the horizontal 'process variance' line is the random intercept variance σ²_u.]

19. Application to asthma study

We estimate the variogram to guide the choice of variance model for the asthma study. In designed studies such as this, the choice of 'full' (unstructured) model is fairly clear. We fit a different mean for each treatment group at each time, and an unstructured covariance matrix. We also fit a random term at a third level: investigator.

20. Asthma study: model

Let φ_il = 1 if patient i has treatment l, l = 1, …, 5. We have observations at 2, 4, 8 & 12 weeks, corresponding to j = 1, 2, 3, 4. Let k index investigator. The model is

y_ijk = φ_il β_jl + u_j + ν_k,  ν_k ~ N(0, σ²_ν),
(u_1, u_2, u_3, u_4)′ ~ N(0, Σ)  (unstructured; 10 parameters).

21. Unstructured model for asthma data

[Screenshot: software output for the unstructured model.]

22. Estimated variogram

[Figure: estimated variogram γ̂(ω) against weeks between observations, ω = 2, …, 10.]

Correlation is virtually constant over time. Error variance ≈ 0.1. Random intercept variance ≈ 0.5.

23. Simpler variance model

The variogram suggests a random intercepts model is adequate. Fit this model and compare with the unstructured model:

Model          No. variance parameters   −2ℓ
Unstruc. var   10                        1831.2
Rand. ints     2                         1845.9

The difference in −2 log-likelihood is 14.7. Compare with χ²_8: p = 0.064.

The variogram can
• greatly simplify choosing the variance model, and
• be used to compare the sample and model correlation.
Non-stationary processes must first be made approximately stationary.

24. Asthma study: fitted vs observed means

[Figure: mean FEV1 (litres, range 1.9 to 2.3) against weeks since randomisation (0 to 12), for the Active 100mcg and Placebo arms; raw-data means and model means overlaid.]

Estimated means are valid if the data are MAR. The borderline investigator effect vanishes after baseline adjustment.

25. (Restricted) maximum likelihood

Our model is y ~ MN(Xβ, Σ), with log-likelihood

−½ { log|Σ| + (y − Xβ)′ Σ⁻¹ (y − Xβ) }.

Maximise by direct Newton-Raphson search (e.g. Raudenbush and Bryk (2002), ch. 14). Often, iterative generalised least squares is faster (see below). To correct the downward bias of the ML variance estimators, use the REML log-likelihood

−½ { log|Σ| + (y − Xβ)′ Σ⁻¹ (y − Xβ) + log|X′ Σ⁻¹ X| }.

See, e.g., Verbeke and Molenberghs (2000), p. 43. For moderately large data sets the results are similar, though in uncommon situations with many fixed parameters the two can give quite different answers (Verbeke and Molenberghs, 2000, p. 198).

26. Testing

Changes in REML log-likelihoods cannot generally be used to compare nested models unless X is unchanged. So maximum likelihood may be preferred for model building (Goldstein, 2003, p. 36). MLwiN always gives the non-REML likelihood.

Often Wald or F-tests are used for fixed effects, and likelihood ratio tests for random effects. Asymptotically, the fixed effects estimates (β̂) and the variance term estimates are independent; however, standard errors can be too small in small data sets. Kenward and Roger (1997) give an adjustment for this, implemented in SAS.
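A sketch of this workflow in the hypothetical lme4 setup used earlier: REML for the final fit, ML refits for a fixed-effects comparison; the added covariate girl is invented for illustration.

```r
# REML (lme4 default) for estimation; ML for comparing mean models.
ri_reml <- lmer(score16 ~ lit11 + (1 | school), data = exam)   # REML fit
ri_ml   <- update(ri_reml, REML = FALSE)                       # ML refit
ri_ml2  <- update(ri_ml, . ~ . + girl)                         # 'girl' is hypothetical
anova(ri_ml, ri_ml2)   # LR test on the fixed effect: valid under ML
```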

27. IGLS estimation

Goldstein (1986) showed how to maximise the multivariate normal likelihood iteratively using two regressions. Recall:

1. If y_i = x′_i β + ε_i with ε_i iid N(0, σ²), then β̂ = (X′X)⁻¹ X′Y.
2. If y_i = x′_i β + ε_i but the ε_i are not iid, with Cov(Y) = Σ, then β̂ = (X′Σ⁻¹X)⁻¹ X′Σ⁻¹ Y.

28. Model: Y_ij = β0 + β1 x_ij + u_0j + e_ij

E(Y) = Xβ, where

     ( 1  x_11 )
X =  ( 1  x_21 ),   β = ( β0 ; β1 ).
     ( ⋮   ⋮   )
     ( 1  x_IJ )

1. Guess Σ, i.e. guess Cov(Y); set β̂ = (X′Σ⁻¹X)⁻¹ X′Σ⁻¹ Y.
2. Calculate the residuals, R = Y − Xβ̂.
3. IDEA: since E(RR′) = Σ, the covariance matrix of Y, create a regression model to estimate Σ. After estimating Σ, update β̂ in step 1.

Iterate till convergence.

29. How do we estimate Σ?

           ( σ²_u0 + σ²_e   σ²_u0          … )
E(RR′)  =  ( σ²_u0          σ²_u0 + σ²_e   … )
           ( ⋮                             ⋱ )

So, stacking the elements,

               ( r²_11     )     ( 1  1 )
E vec(RR′) = E ( r_21 r_11 )  =  ( 0  1 )  ( σ²_e ; σ²_u0 )  =  Z Υ,
               ( ⋮         )     ( ⋮  ⋮ )

where the first column of Z marks the diagonal elements (which contain σ²_e) and the second marks the same-school elements (which contain σ²_u0).

30. Estimation of Σ (ctd)

This is a regression, BUT the r's are not iid; it turns out that

Cov(vec(RR′)) = Σ* = Σ ⊗ Σ.

Thus

Υ̂ = (Z′ Σ*⁻¹ Z)⁻¹ Z′ Σ*⁻¹ vec(RR′).

Full details in Goldstein and Rasbash (1992). For discussion of residuals in multilevel modelling see, e.g., Robinson (1991).
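The two-regression idea can be sketched in a few lines of R for the random-intercepts case. This toy version replaces the Σ*-weighted moment regression for Υ with simple unweighted averages of the elements of RR′, so it is illustrative only; all data names are hypothetical.

```r
# Toy IGLS sketch for Y_ij = beta0 + beta1 x_ij + u_0j + e_ij.
# Unweighted moment step stands in for the Sigma*-weighted regression.
igls <- function(y, X, cluster, iters = 20) {
  n    <- length(y)
  s2e  <- var(y); s2u <- 0                       # starting guesses
  same <- outer(cluster, cluster, "==")          # same-cluster indicator
  off  <- same & !diag(n)                        # same cluster, different unit
  beta <- NULL
  for (it in 1:iters) {
    Sigma <- s2u * same + s2e * diag(n)          # current guess of Cov(Y)
    W     <- solve(Sigma)
    beta  <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # GLS step for beta
    R     <- y - X %*% beta
    RR    <- R %*% t(R)                          # E(RR') = Sigma
    s2u   <- mean(RR[off])                       # off-diagonals: sigma^2_u0
    s2e   <- mean(diag(RR)) - s2u                # diagonals: sigma^2_u0 + sigma^2_e
  }
  list(beta = drop(beta), sigma2_u0 = s2u, sigma2_e = s2e)
}
```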

31. Software

Most packages offer facilities for continuous data. Some comments on packages I have tried to use (no implied judgment on other packages):

SAS, proc MIXED: the commercial airliner (expensive)
• Excellent for fitting a wide range of standard models; provides a wide range of (stationary) covariance models.
• Kenward-Roger adjustment to standard errors/degrees of freedom for small samples implemented (ddfm=kr).
• Not intuitive for beginners.
• Probably the best for regulatory pharma-industry analyses.

32. Software (ctd)

MLwiN: the light aircraft (free to UK academics)
• Fast, flexible multilevel modelling package, but you may crash!
• Fits an extremely wide range of models. Very strong on random effects models; weak on stationary covariance models.
• Intuitive for teaching.

Stata (GLLAMM): overland travel (~200 USD)
• Has a very general, but very slow, multilevel/latent variable modelling package.
• A faster package for continuous data comes with Stata 9.0.

33. Software (ctd)

WinBUGS (free)
• Very flexible package for model fitting using MCMC, with R/S+ like syntax.
• I like to have a good idea what the answer is before I use it.

R (free)
• Non-linear estimation package limited compared with SAS and MLwiN; Trellis graphics and data manipulation awkward.
• Syntax hard for teaching.

For up-to-date reviews, (re)visit www.mlwin.com

34. Summary

• Once you look, multilevel structure is everywhere.
• This is an important aspect of the data; ignoring it leads to misleading results.
• Distinguish between the mean model and the correlation model; understand the impact of random effects models on the correlation.
• Estimation: REML is usually preferable. Don't use changes in REML likelihood for fixed effects inference, though.
• Get started using the tutorial material at http://tramss.data-archive.ac.uk/Software/MLwiN.asp
• Further references: see Carpenter (2005).

35. References

Carpenter, J. R. (2005) Multilevel models. In Encyclopaedic Companion to Medical Statistics (eds B. Everitt and C. Palmer). London: Hodder and Stoughton.
Diggle, P. J., Heagerty, P., Liang, K.-Y. and Zeger, S. L. (2002) Analysis of Longitudinal Data (second edition). Oxford: Oxford University Press.
Goldstein, H. (1986) Multilevel mixed linear model analysis using iterative generalized least squares. Biometrika, 73, 43–56.
Goldstein, H. (2003) Multilevel Statistical Models (second edition). London: Arnold.

36. References

Goldstein, H. and Rasbash, J. (1992) Efficient computational procedures for the estimation of parameters in multilevel models based on iterative generalised least squares. Computational Statistics and Data Analysis, 13, 63–71.
Kenward, M. G. and Roger, J. H. (1997) Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983–997.
Raudenbush, S. W. and Bryk, A. S. (2002) Hierarchical Linear Models: Applications and Data Analysis Methods (second edition). London: Sage.
Robinson, G. K. (1991) That BLUP is a good thing: the estimation of random effects. Statistical Science, 6, 15–51.

37. References

Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data. New York: Springer Verlag.

38. II: Extensions

• Multilevel models for discrete data
• Marginal vs conditional models
• Example: clinical trial
• Estimation issues
• Software
• Models for cross-classified data
• Notation
• Example: Scottish educational data
• Models for multiple membership cross-classified data
• Notation
• Example: Danish poultry farming

39. Review: random intercepts

Suppose j indexes subjects (level 2) and i indexes repeated observations (level 1). Recall the random intercepts model:

y_ij = (β0 + u_j) + β1 x_ij + e_ij,  u_j ~ N(0, σ²_u),  e_ij ~ N(0, σ²_e).

If y_ij is now binary, a natural generalisation is

logit Pr(y_ij = 1) = (β0 + u_j) + β1 x_ij,  u_j ~ N(0, σ²_u).
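A minimal sketch of this binary random-intercepts model in R via lme4's glmer; the data frame dat and the variables y, x and subj are hypothetical.

```r
# Binary random intercepts: logit Pr(y_ij = 1) = (beta0 + u_j) + beta1 x_ij.
# 'dat', 'y', 'x', 'subj' are hypothetical names.
m <- glmer(y ~ x + (1 | subj), data = dat, family = binomial)
summary(m)   # fixed effects are subject-specific (SS) log-odds ratios
```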

40. Interpretation

In the previous model, logit Pr(y_ij = 1) is really logit Pr(y_ij = 1 | x_ij, u_j). Thus this is a Subject Specific (SS) model: β is the log-odds ratio of the effect of x for a specific u_j.

We can also construct Population Averaged (PA) models:

logit Pr(y_ij = 1 | x_ij) = β^p_0 + β^p_1 x_ij.

Define expit as the inverse logit function. Then

E_u expit(β0 + β1 x_ij + u_j) ≠ expit(β0 + β1 x_ij).

So PA estimates do not equal SS estimates. The exception is the normal model, where expit is replaced by the identity.

41. Interpretation (ctd)

Let β denote SS estimates, β^p PA estimates. In general:
• |β^p| < |β|, with equality if σ²_u = 0.
• If σ²_u is large, the two will be very different.

Example: y_ij indicates coronary heart disease (CHD) for subject j at time i; x_1ij indicates the subject smokes; x_2ij indicates one of their parents had CHD.
• β^p_1: effect of smoking on the log-odds of CHD in the population.
• β_1: effect of stopping smoking for a given subject.
• β^p_2: effect of parental CHD in the population.
• β_2: effect of a change in parental CHD for a given individual??

42. Choosing between PA and SS

It depends on the question. PA is more appropriate when population averaged effects are important:
• in epidemiology & trials, where we are interested in the average effect of treatment on a population;
• in education, when we are interested in the average effects of interventions across populations.
BUT: the same process in two populations with different heterogeneity will lead to different β^p.

SS is preferable for
• estimating effects on individuals, schools;
• modelling sources of variability.

43. Example: Clinical trial

241 patients randomised to receive either placebo or active treatment. Three treatment periods. In each period, each patient undergoes a series of tests (the number varying between patients). Each test has a binary response (1 = success).

Dropout:

             Placebo   Active
Period 1     0         0
Period 2     10        5
Period 3     25        12
Completers   82        107

44. No. of tests by period and intervention

[Figure: number of tests per patient (roughly 10 to 40) by period (1–3), shown separately for the placebo and active treatment groups.]

45. SS models

Let k index patient, j period and i test, and let δ_k = 1 if patient k has active treatment. A general model is

E(y_ijk) = µ_jk,  logit(µ_jk) = α_j + δ_k β_j + base_k γ_j + ω_jk,

where either:
(i) ω_jk = u_k, u_k ~ N(0, σ²_u) (random intercepts);
(ii) ω_jk = u_0k + u_1k t_j, (u_0k, u_1k)′ ~ N_2(0, Ω) (random intercepts and slopes);
(iii) ω_jk = u_jk, with (u_1k, u_2k, u_3k)′ ~ N(0, Σ_u), where Σ_u is unstructured, with entries σ²_u1, σ²_u2, σ²_u3 on the diagonal and σ_u1u2, σ_u1u3, σ_u2u3 off it.

46. Results: correlation matrices

We are able to fit these models because of the repeated observations on patients within periods. The estimated matrices (variances on the diagonal, correlations below) are:

(i) Random intercepts: −2ℓ = 4941; var: 5.9, corr: 1.

(ii) Random intercepts and slopes: −2ℓ = 4671 (+2 parameters)

( 9.99              )
( 0.86  6.76        )
( 0.46  0.83  8.29  )

(iii) Unstructured: −2ℓ = 4580 (+3 parameters)

( 9.97              )
( 0.72  8.40        )
( 0.45  0.66  7.26  )

47. Results: treatment effects

Baseline adjusted treatment effects (SE):

Model     Period 1     Period 2     Period 3
RI        1.0 (0.22)   0.5 (0.23)   1.2 (0.40)
RI+S      1.6 (0.65)   0.4 (0.41)   1.3 (0.53)
Unstruc   1.4 (0.68)   0.7 (0.58)   1.2 (0.54)

Note how these estimates depend on the random effects model. This is generally true. It's worth taking time over the random effects model.

48. Getting PA coefficients from SS ones

Sometimes it is useful to obtain PA estimates from SS ones (NB you can't go the other way!). This is particularly easy in a designed study, where the coefficients of the random terms in the model are the same for each patient at each time. For simplicity, we estimate the mean at each time, rather than fitting a slope across time. As before, the model is

E(y_ijk) = µ_jk,  h(µ_jk) = α_j + δ_k β_j + base_k γ_j + z′_j u_k,  u_k ~ N(0, Σ_u).

The table below, derived from Zeger et al. (1988), shows how to obtain the PA coefficients.

49. Table for obtaining PA from SS effects

Link h(·)   Random intercepts (z_j = 1)          General random structure z_j
log         α^p_j = α_j + σ²_u/2                 α^p_j = α_j + z′_j Σ_u z_j / 2
            β^p_j = β_j                          β^p_j = β_j
            γ^p_j = γ_j                          γ^p_j = γ_j
probit      α^p_j = α_j / √(1 + σ²_u)            α^p_j = α_j / √|I + Σ_u z_j z′_j|
            β^p_j = β_j / √(1 + σ²_u)            β^p_j = β_j / √|I + Σ_u z_j z′_j|
            γ^p_j = γ_j / √(1 + σ²_u)            γ^p_j = γ_j / √|I + Σ_u z_j z′_j|
logistic    α^p_j ≈ α_j / √(1 + 0.34584 σ²_u)    α^p_j ≈ α_j / √|I + 0.34584 Σ_u z_j z′_j|
            β^p_j ≈ β_j / √(1 + 0.34584 σ²_u)    β^p_j ≈ β_j / √|I + 0.34584 Σ_u z_j z′_j|
            γ^p_j ≈ γ_j / √(1 + 0.34584 σ²_u)    γ^p_j ≈ γ_j / √|I + 0.34584 Σ_u z_j z′_j|

50. Accuracy of logistic approximation

[Figure: η / (1 + 0.34584 σ²)^0.5 plotted against η^p for σ² = 1, 4, 8 and 16; both axes run from −4 to 4.]

51. Example

We use data from period 3 of the trial described above, and fit a random intercepts model with (i) probit link and (ii) logistic link. We compare using the transformation with fitting an independence GEE (= GLM here) with robust SEs.

Link     GLM estimates    Transformed SS estimates   SS estimates
probit   0.323 (0.190)    0.318 (0.167)              0.544 (0.285)
logit    0.636 (0.407)    0.577 (0.301)              1.100 (0.573)

For the logistic model σ²_u = 7.6. Agreement improves as sample size increases.

52. Using simulation to obtain PA estimates

Sometimes it is easier to use simulation. Consider the random intercepts model:
1. Fit the model. Draw M values of the SS random effect from N(0, σ̂²_u): u¹, …, u^M.
2. For m = 1, …, M and for a particular x, compute π_m = expit(β0 + β1 x + z′u^m).
3. The mean (population averaged) probability is π̂^p = (1/M) Σ_{m=1}^M π_m.

A sketch of this recipe follows.
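In R, continuing the hypothetical glmer fit m from the review slide, this is a few lines; the covariate value x0 is invented for illustration.

```r
# Sketch: population-averaged Pr(y = 1 | x = x0) by simulating the
# random intercept; 'm' is the hypothetical glmer fit above.
x0 <- 1                                     # invented covariate value
M  <- 10000
s2 <- as.numeric(VarCorr(m)$subj)           # sigma^2_u-hat
u  <- rnorm(M, 0, sqrt(s2))                 # step 1: draw u^1, ..., u^M
b  <- fixef(m)
pa <- mean(plogis(b[1] + b[2] * x0 + u))    # steps 2-3: average the expit's
```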

53. Estimation: PA models

If Var(y_j) were known, we could use the score equation for β when the data follow a log-linear model:

s(β) = Σ_{j=1}^J (∂µ_j/∂β)′ Var(y_j)⁻¹ (y_j − µ_j) = 0.

Liang and Zeger (1986) showed that if we write Var(y_j) as Var(y_j; β, α), then provided we use any √J-consistent estimator for α, the estimates of β obtained by solving this are asymptotically as efficient as those we would get were α known. In practice, an estimating equation for α is usually formed and the resulting Generalised Estimating Equations (GEEs) solved simultaneously (e.g. Prentice (1988)).

54. GEEs: notes on estimation

• β̂ is nearly efficient relative to ML estimates provided Var(y_j) is reasonably approximated.
• β̂ is consistent as J → ∞ even if the covariance structure of y_j is incorrect.
• Once the mean model is chosen, the robustness of inferences can be checked by trying various covariance structures and comparing the parameter estimates and their robust standard errors (sketched below).
• As GEEs are moment-based estimators, they are invalid with missing data, unless the data are missing completely at random.

Further details in Diggle et al. (2002).
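A sketch of that robustness check in R with geepack; the data names follow the earlier hypothetical examples.

```r
# Sketch: same mean model under two working correlation structures;
# similar beta-hats with robust SEs suggest inference is robust.
library(geepack)
g_ind <- geeglm(y ~ x, id = subj, data = dat, family = binomial,
                corstr = "independence")
g_exc <- geeglm(y ~ x, id = subj, data = dat, family = binomial,
                corstr = "exchangeable")
summary(g_ind); summary(g_exc)
```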

55. Estimation: SS models

For illustration, consider the random intercepts model:

logit Pr(y_ij = 1) = β0 + u_j,  u_j ~ N(0, σ²_u).

The likelihood is

∏_{j=1}^J ∫_{−∞}^{∞} { ∏_{i=1}^I [1/(1 + e^{−(β0+u_j)})]^{y_ij} [1/(1 + e^{(β0+u_j)})]^{1−y_ij} } × (1/√(2πσ²_u)) e^{−u_j²/(2σ²_u)} du_j
= ∏_{j=1}^J ∫ f(Y_j; β, u_j) g(u_j, Σ) du_j.

56. Obtaining parameter estimates

The likelihood for individual j is

L(β, σ²_u | Y_j) = ∫ f(Y_j | u_j, β) g(u_j, σ²_u) du_j.

When f, g are normal, this integral is the multivariate normal distribution of the data, maximised as described in session 1. Otherwise, the integral is intractable. Options are:
• numerical integration (e.g. SAS NLMIXED), which is slow for many random effects;
• quasi-likelihood methods.

57. Penalised Quasi-likelihood

Consider observation ij, and drop subscripts:

y = µ(η) + ε = expit(Xβ + Zu) + ε.

After update t, we have β_t, u_t, say. Expand about the true β, u:

y ≈ µ(Xβ_t + Zu_t) + (∂µ/∂η) X (β − β_t) + (∂µ/∂η) Z (u − u_t) + ε.

Re-arrange:

y − µ(Xβ_t + Zu_t) + (∂µ/∂η) X β_t + (∂µ/∂η) Z u_t = (∂µ/∂η) X β + (∂µ/∂η) Z u + ε,

i.e. y* = X* β + Z* u + ε. Constrain Var(ε) = µ(1 − µ), and obtain new estimates with one step of the fitting routine for normal data.

58. Comments on quasi-likelihood methods

• Estimation at almost the same speed as for normal models.
• No estimate of the log-likelihood.
• Estimates can be badly biased if
  – fitted values are close to 0 or 1, or
  – some individuals have few observations.
• The obvious solution is to try the 2nd-order Taylor expansion, called PQL(2): bias is substantially reduced, but it can't always be fitted.
• There have been various proposals for simulation-based bias correction. Ng et al. (2005) consider these, and conclude Simulated Maximum Likelihood (SML) is best.

59. SML method for bias correction

Recall the likelihood: L(β, Σ | y) = ∫ f(y | u, β) g(u, Σ) du.

Obtain initial estimates β̂, Σ̂ from PQL(2). Set Σ̃ = Σ̂. Then

L(β, Σ) = ∫ f(y | u, β) [g(u, Σ) / g(u, Σ̃)] g(u, Σ̃) du
        ≈ (1/H) Σ_{h=1}^H f(y | u_h, β) g(u_h, Σ) / g(u_h, Σ̃),

where u_1, …, u_H ~ iid g(u, Σ̃). Search for the β, Σ that maximise this Monte-Carlo likelihood estimate (keeping Σ̃ fixed).
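A sketch of this importance-sampled likelihood for the random-intercepts logistic model; s2_tilde would come from a PQL(2) fit, and all data names are hypothetical.

```r
# Sketch: Monte-Carlo log-likelihood via importance sampling.
# Proposal g(u, s2_tilde) is fixed; (beta, s2) are searched over,
# e.g. with optim(). Data: y (0/1), design matrix X, cluster labels.
mc_loglik <- function(beta, s2, y, X, cluster, s2_tilde, H = 500) {
  u  <- rnorm(H, 0, sqrt(s2_tilde))               # u_1, ..., u_H ~ g(u, s2_tilde)
  ll <- 0
  for (j in unique(cluster)) {
    yj <- y[cluster == j]
    Xj <- X[cluster == j, , drop = FALSE]
    fj <- sapply(u, function(uh) {                # f(y_j | u_h, beta)
      p <- plogis(drop(Xj %*% beta) + uh)
      prod(p^yj * (1 - p)^(1 - yj))
    })
    w  <- dnorm(u, 0, sqrt(s2)) / dnorm(u, 0, sqrt(s2_tilde))
    ll <- ll + log(mean(fj * w))                  # importance-weighted average
  }
  ll
}
```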

60. Simulation study

Compare PQL(2) followed by SML with SAS proc NLMIXED, using the example of Kuk (1995). Simulate 100 data sets from

Y_ij ~ Bin(6, π_ij),  logit(π_ij) = 0.2 + u_j + 0.1 x_ij,  u_j ~ N(0, 1),

with J = 15 level-2 units of I = 2 level-1 units each, and x = −15, −14, …, 14.

61. Results

Values are average estimates (mean squared error):

Parameter      β0              β1                σ²
True values    0.2             0.1               1
SML (H=500)    0.203 (0.136)   0.099 (0.00134)   0.939 (0.561)
SAS NLMIXED    0.190 (0.135)   0.097 (0.00134)   0.927 (0.478)

NLMIXED needs starting values: we gave 0 for β0, β1 and 1 for σ². NLMIXED had convergence problems on 17/100 data sets; these are excluded.

Conclude: (i) SAS NLMIXED has larger bias, smaller variance, BUT we only used H = 500; (ii) PQL(2) + SML seems to work well.

62. Example: Bangladesh data

A sub-sample of the 1988 Bangladesh fertility survey (Huq and Cleland, 1990). All the women had been married at some time. i = 1, …, 1934 women (level 1) from j = 1, …, 60 districts (level 2); 2–118 women in each district.

Response: y_ij = 1 if the woman reported contraceptive use, 0 otherwise.

Covariates: urban resident; age; number of children.

63. Model

Let 1[·] be an indicator for the event in brackets. The model is:

logit{Pr(Y_ij = 1)} = (β0 + u_0j) + (β1 + u_1j) × 1[urban resident] + β2 × (centred age)
                      + β3 × 1[1 child] + β4 × 1[2 children] + β5 × 1[≥ 3 children],

(u_0j, u_1j)′ ~ N( (0 ; 0), ( σ²_u0  σ_u0u1 ; σ_u0u1  σ²_u1 ) ).

64. Results

Broadly, the odds of contraceptive use are higher for urban residents and increase with the number of children. Six different estimation routines were used; most variation is in the variance estimates (SEs in brackets):

Method (Package)                     σ²_u0           σ_u0u1           σ²_u1           −2ℓ: RI+S       RI only         Change
2nd order PQL + SML, H=3000
  (MLwiN + matlab)                   0.396 (0.118)   −0.414 (0.160)   0.686 (0.284)   NA              NA              NA
2nd order PQL (MLwiN)                0.398 (0.132)   −0.407 (0.177)   0.661 (0.339)   2398.9          2413.0          14.1
EM-Laplace2 (HLM)                    0.379 (NA)      −0.391 (NA)      0.627 (NA)      5953.1          5968.1          15.0
MCMC (MLwiN)                         0.435 (0.144)   −0.455 (0.191)   0.770 (0.340)   2386.4 (DIC)    2408.7 (DIC)    (21.3)
NLMIXED (SAS)                        0.388 (0.129)   −0.404 (0.175)   0.664 (0.322)   2398.7          2413.7          15.0
GLLAMM (Stata)                       0.390 (0.129)   −0.406 (0.176)   0.666 (0.322)   2398.6          2413.7          15.0

65. Implications: software

• HLM v6 (EM-Laplace2): no SEs estimated for random components. In another problem, the estimated random components are smaller than estimates from SML/NLMIXED.
• HLM v5 (Laplace6): convergence problems.
• GLLAMM is slow.
• MCMC (with diffuse Gamma priors on the variances) gives inflated estimates relative to ML.
• NLMIXED needs starting values (here from PQL(2)).
• SML appears to work well.

Conclude: PQL(2) plus SML or NLMIXED for 'bias correction' usually gives the best answers.

66. Cross classified data

So far we have considered only hierarchical structures. However, social structures are not always hierarchical. People often belong to more than one grouping at a given hierarchical level. E.g. neighbourhood and school may both have effects on educational outcomes:
– a school may contain children from several neighbourhoods;
– children from one neighbourhood may attend different schools.
Children are nested in a cross classification of neighbourhood and school.

67. Classification diagrams

[Figure: two classification diagrams. Left, nested structure: pupil within school within neighbourhood. Right, cross-classified structure: children (C1–C10) within a cross-classification of neighbourhoods (N1–N3) and schools (S1–S4), with lines showing children from one neighbourhood attending different schools.]

68. Standard notation

Y_i(j1 j2) is the response for pupil i in neighbourhood j1 and school j2. The subscripts j1 and j2 are bracketed together to indicate that these classifications are at the same level, i.e. pupils are nested within a cross-classification of neighbourhoods and schools. A basic cross-classified model may be written:

y_i(j1 j2) = β′ x_i(j1 j2) + u_1j1 + u_2j2 + e_i(j1 j2),

where u_1j1 is the random neighbourhood effect and u_2j2 is the random school effect.

69. Alternative notation

For hierarchical models we have one subscript per level, and nesting is implied by reading from left to right. E.g. ijk denotes the i-th level 1 unit within the j-th level 2 unit within the k-th level 3 unit.

For cross-classified models, we can group together indices for classifications at the same level using parentheses (see previous slide). However, having one subscript per classification becomes cumbersome. Using an alternative notation, we have a single subscript, no matter how many classifications there are.

70. Data matrix for cross-classification

Single subscript notation; let i index children.

i    Neighbourhood(i)   School(i)
1    1                  1
2    2                  1
3    1                  1
4    2                  2
5    1                  2
6    2                  3
7    2                  3
8    3                  3
9    3                  4
10   2                  4

71. Cross classified model

Using the single-subscript notation, y_i is the outcome for child i. Classification 1 is child, 2 is neighbourhood and 3 is school.

y_i = β′ x_i + u^(2)_nbhd(i) + u^(3)_sch(i) + e_i,
u^(2)_nbhd(i) ~ N(0, Ω^(2)_u):  random departure due to neighbourhood;
u^(3)_sch(i) ~ N(0, Ω^(3)_u):  random departure due to school;
e_i ~ N(0, Ω_e):  individual-level residual.

Covariates may be defined for any of the 3 classifications. Coefficients may be allowed to vary across neighbourhood or school.

72. Cross-classified model: notation

From the previous slide, the model is:

y_i = β′ x_i + u^(2)_nbhd(i) + u^(3)_sch(i) + e_i.

Thus, for pupils 1 and 10 in the data set two slides back,

y_1 = β′ x_1 + u^(2)_1 + u^(3)_1 + e_1;
y_10 = β′ x_10 + u^(2)_2 + u^(3)_4 + e_10.

73. Other cross-classified structures

• Pupils within primary schools by secondary schools
• Patients within GPs by hospitals
• Survey respondents within sampling clusters by interviewers
• Repeated measures within raters by individuals (e.g. patients by nurses)
• Students following modular degree courses, e.g. Simonite and Browne (2003)

74. Example: Scottish school children

3435 children who attended 148 primary schools and 19 secondary schools in Fife, Scotland.

Classifications: 1 – student; 2 – primary school; 3 – secondary school.
y_i – overall achievement at age 16 for student i;
x_i – verbal reasoning at age 12 (mean centred).

2-level cross classified model:

y_i = β0 + β1 x_i + u^(2)_prim(i) + u^(3)_sec(i) + e_i.

A sketch of this fit in code follows.
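In lme4, crossed random effects need no special syntax: one random-intercept term per classification. A minimal sketch, with the data frame fife and the variables attain16, vr12, primary and secondary all hypothetical:

```r
# Sketch: cross-classified (crossed) random intercepts in lme4.
# 'fife', 'attain16', 'vr12', 'primary', 'secondary' are hypothetical.
cc <- lmer(attain16 ~ vr12 + (1 | primary) + (1 | secondary), data = fife)
VarCorr(cc)   # separate variance components for primary and secondary
```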

75. Results

                               Hierarchical model   Cross-classified model
Fixed:   β0                    5.99                 5.98
         β1                    0.16 (0.003)         0.16 (0.003)
Random:  σ^(2)_u (primary)     –                    0.27 (0.06)
         σ^(3)_u (secondary)   0.28 (0.06)          0.01 (0.02)
         σ_e (student)         4.26 (0.10)          4.25 (0.10)

Most of the variation in results at 16 years can be attributed to primary schools: an intriguing result for educational researchers, which would be overlooked without a cross-classified model.

76. Cross-classification: covariance matrix

Recall that in our hierarchical model, the covariance matrix was block diagonal:

           ( Σ  0  …  0 )
Σ_full  =  ( 0  Σ  …  0 )
           ( ⋮     ⋱  ⋮ )
           ( 0  0  …  Σ )

With a cross-classified model this is no longer true. E.g. for our example, suppose we first classify pupils within secondary schools. For every student who shares a primary school with a student from another secondary school, there is an off-diagonal term, Ω^(2)_u (primary), outside the blocks.
