
Longitudinal and clustered data and multilevel models Goodness-of-fit and regression trees Performance of tree-based lack-of-fit tests Application to real data Conclusion

Regression tree-based diagnostics for linear multilevel models

Jeffrey S. Simonoff New York University May 11, 2011

The Fourth Erich L. Lehmann Symposium Regression tree-based diagnostics for linear multilevel models


Longitudinal and clustered data

Panel or longitudinal data, in which we observe many individuals over multiple periods, offer a particularly rich opportunity for understanding and prediction, as we observe the different paths that a variable might take across individuals. Clustered data, where observations have a nested structure, also reflect this hierarchical character. Such data, often on a large scale, are seen in many applications:

◮ test scores of students over time
◮ test scores of students across classes, teachers, or schools
◮ blood levels of patients over time
◮ transactions by individual customers over time
◮ tracking of purchases of individual products over time


Longitudinal data

I will refer to such data as longitudinal data here, but all of the content applies equally to other clustered data. The analysis of longitudinal data is especially rewarding with large amounts of data, as this allows the fitting of complex or highly structured functional forms to the data.

We observe a panel of individuals $i = 1, \ldots, I$ at times $t = 1, \ldots, T_i$. A single observation period for an individual, $(i, t)$, is termed an observation; for each observation, we observe a vector of covariates, $x_{it} = (x_{it1}, \ldots, x_{itK})'$, and a response, $y_{it}$.


Longitudinal data models

Because we observe each individual multiple times, we may find that the individuals differ in systematic ways; e.g., y may tend to be higher for all observation periods for individual i than for other individuals with the same covariate values because of characteristics of that individual that do not depend on the covariates. This pattern can be represented by an “effect” specific to each individual (for example, an individual-specific intercept) that shifts all predicted values for individual i up by a fixed amount:

$$y_{it} = Z_{it} b_i + f(x_{it1}, \ldots, x_{itK}) + \varepsilon_{it}.$$
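A minimal sketch of what the individual-specific effect does, assuming a random intercept only ($Z_{it} = 1$) and an arbitrary illustrative choice $f(x) = 3x$; the function name, the sample sizes, and the variance values are hypothetical and not taken from the talk. Once f is removed, the per-individual mean of what is left recovers each $b_i$.

```python
import random
import statistics

def simulate_random_intercepts(n_individuals=5, n_times=200,
                               sd_b=2.0, sd_eps=1.0, seed=0):
    """Simulate y_it = b_i + f(x_it) + eps_it with f(x) = 3x (illustrative only)."""
    rng = random.Random(seed)
    data = []      # rows of (individual, x, y)
    effects = []   # the true b_i, kept for comparison
    for i in range(n_individuals):
        b_i = rng.gauss(0.0, sd_b)          # individual-specific shift
        effects.append(b_i)
        for _ in range(n_times):
            x = rng.gauss(0.0, 1.0)
            y = b_i + 3.0 * x + rng.gauss(0.0, sd_eps)
            data.append((i, x, y))
    return data, effects

data, effects = simulate_random_intercepts()

# After subtracting f(x), the mean of the remainder for individual i estimates b_i.
for i, b_i in enumerate(effects):
    leftover = [y - 3.0 * x for (j, x, y) in data if j == i]
    print(i, round(b_i, 2), round(statistics.mean(leftover), 2))
```

Each printed row pairs a true effect with its estimate; the two columns should agree up to sampling noise of roughly $\sigma_\varepsilon / \sqrt{T_i}$.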


Mixed effects models

◮ If f is linear in the parameters and the $b_i$ are taken as fixed or potentially correlated with the predictors, then this is a linear fixed effects model.
◮ If f is linear in the parameters and the $b_i$ are assumed to be random (often Gaussian) and uncorrelated with the predictors, then the model is a linear mixed effects model.

Conceptually, random effects are appropriate when the observed set of individuals can be viewed as a sample from a large population of individuals, while fixed effects are appropriate when the observed set of individuals represents the only ones about which there is interest.


Linear model and goodness-of-fit

The most commonly used choice of f is, unsurprisingly, the linear model

$$y_{it} = Z_{it} b_i + X_{it}\beta + \varepsilon_{it},$$

assuming errors $\varepsilon$ that are normally distributed with constant variance. This model has the advantage of simplicity of interpretation, but as is always the case, if the assumptions of the model do not hold, inferences drawn can be misleading. Such model violations include nonlinearity and heteroscedasticity. If specific violations are assumed, tests such as likelihood ratio tests can be constructed, but omnibus goodness-of-fit tests would be useful to help identify unspecified model violations.


Regression trees and goodness-of-fit

The idea discussed here is a simple one that has (perhaps) been underutilized through the years: since the errors are supposed to be unstructured if the model assumptions hold, examining the residuals using a method that looks for unspecified structure can be used to identify model violations. A natural method for this is a regression tree.

Miller (1996) proposed using a CART regression tree (Breiman, Friedman, Olshen, and Stone, 1984) for this purpose in the context of identifying unmodeled nonlinearity in linear least squares regression, terming it a diagnostic tree. They note that evidence for a signal left in the residuals (and hence a violation of assumptions) comes from a final tree that splits in the growing phase and is not ultimately pruned back to its root node.


Proposed method

Su, Tsai, and Wang (2009) altered this idea slightly by simultaneously including both linear and tree-based terms in one model, terming it an augmented tree, assessing whether the tree-based terms are deemed necessary in the joint model. They also note that building a diagnostic tree using squared residuals as a response can be used to test for heteroscedasticity. We propose adapting the diagnostic tree idea to longitudinal/clustered data.


Proposed method

◮ Fit the linear mixed effects model.
◮ Fit an appropriate regression tree to the residuals from this model to explore nonlinearity.
◮ Fit an appropriate regression tree to the absolute residuals from the model to explore heteroscedasticity (squared residuals are more non-Gaussian and lead to poorer performance).

A final tree that splits from the root node rejects the null model.
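The nonlinearity step can be illustrated with a heavily simplified sketch. It makes two loud assumptions that depart from the talk: the linear mixed effects fit is replaced by a fixed-effects (within-individual) straight-line fit, and the pruned regression tree is replaced by a search for the single best split, reported as the share of residual sum of squares it removes. All function names and simulation settings are hypothetical.

```python
import random

def within_residuals(groups, x, y):
    """Residuals from a straight-line fit after removing individual means.

    A fixed-effects stand-in for the linear mixed effects model (assumption).
    """
    sums = {}
    for g, xv, yv in zip(groups, x, y):
        s = sums.setdefault(g, [0.0, 0.0, 0])
        s[0] += xv
        s[1] += yv
        s[2] += 1
    xd = [xv - sums[g][0] / sums[g][2] for g, xv in zip(groups, x)]
    yd = [yv - sums[g][1] / sums[g][2] for g, yv in zip(groups, y)]
    slope = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)
    return [b - slope * a for a, b in zip(xd, yd)]

def best_split_gain(x, r, min_leaf=20):
    """Largest share of the residual sum of squares removed by one split on x."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    rs = [r[k] for k in order]
    n, total = len(rs), sum(rs)
    sst = sum((v - total / n) ** 2 for v in rs)
    best, left = 0.0, 0.0
    for k in range(1, n):
        left += rs[k - 1]                     # sum of the k smallest-x residuals
        if k < min_leaf or n - k < min_leaf:
            continue
        right = total - left
        between = left ** 2 / k + right ** 2 / (n - k) - total ** 2 / n
        best = max(best, between / sst)
    return best

def simulate(curvature, n_ind=30, n_times=40, seed=1):
    """Random-intercept data whose mean is linear in x plus curvature * x^2."""
    rng = random.Random(seed)
    groups, x, y = [], [], []
    for i in range(n_ind):
        b_i = rng.gauss(0.0, 1.0)
        for _ in range(n_times):
            xv = rng.gauss(0.0, 1.0)
            groups.append(i)
            x.append(xv)
            y.append(b_i + 2.0 * xv + curvature * xv ** 2 + rng.gauss(0.0, 1.0))
    return groups, x, y

gains = {}
for curvature in (0.0, 1.0):
    g, x, y = simulate(curvature)
    gains[curvature] = best_split_gain(x, within_residuals(g, x, y))
    print(curvature, round(gains[curvature], 3))
```

With no curvature the best split removes very little residual variation, while an unmodeled quadratic term leaves structure that a split picks up. The talk's actual decision rule is whether a grown-and-pruned tree keeps any split; applying the same search to the absolute residuals would mimic the heteroscedasticity step.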


Trees for longitudinal and clustered data

There has been a limited amount of work on adapting regression trees to longitudinal/clustered data. Segal (1992) and De’Ath (2002) proposed the use of multivariate regression trees in which the response variable was the vector $y_i = (y_{i1}, \ldots, y_{iT})$. At each node, a vector of means, $\mu(g)$, is produced, where $\mu_t(g)$ is the estimated value for $y_{it}$ at node g. Galimberti and Montanari (2002) and Lee (2005, 2006) proposed similar types of tree models.

Unfortunately, these tree estimators have several weaknesses, including the inability to be used for the prediction of future periods for the same individuals. Sela and Simonoff (2009) proposed a tree-based method that accounts for the longitudinal structure of the data while avoiding these difficulties.


“EM”-type algorithm

Consider again a general mixed effects model

$$y_{it} = Z_{it} b_i + f(x_{it1}, \ldots, x_{itK}) + \varepsilon_{it}.$$

If the random effects, $b_i$, were known, the model implies that we could fit a regression tree to $y_{it} - Z_{it} b_i$ to estimate f via a tree structure. If the fixed effects, f, were known, then we could estimate the random effects using a traditional random effects linear model with fixed effects corresponding to the fitted values, $f(x_i)$.

This alternation between the estimation of different parameters is reminiscent of (although is not) the EM algorithm, as used by Laird and Ware (1982); for this reason, we call the resulting estimator a Random Effects/EM Tree, or RE-EM Tree.


Estimation of a RE-EM Tree

1. Initialize the estimated random effects, $\hat{b}_i$, to zero.
2. Iterate through the following steps until the estimated random effects, $\hat{b}_i$, converge:
   2.1 Estimate a regression tree approximating f, based on the target variable, $y_{it} - Z_{it}\hat{b}_i$, and predictors, $x_{it\cdot} = (x_{it1}, \ldots, x_{itK})$, for $i = 1, \ldots, I$ and $t = 1, \ldots, T_i$. The tree is originally overgrown, and then pruned back using the one-SE rule of Breiman et al. (1984). Use this regression tree to create a set of indicator variables, $I(x_{it\cdot} \in g_p)$, where $g_p$ ranges over all of the terminal nodes in the tree.
   2.2 Fit the linear random effects model, $y_{it} = Z_{it} b_i + I(x_{it\cdot} \in g_p)\mu_p + \varepsilon_{it}$, using ML or REML. Extract $\hat{b}_i$ from the estimated model using the Empirical Bayes estimates.
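The alternation in steps 1 and 2 can be sketched in a toy setting. This is not the REEMtree implementation: it assumes a random intercept only, replaces the grown-and-pruned tree of step 2.1 with a single-split stump, and replaces the ML/REML fit of step 2.2 with the shrinkage (Empirical Bayes) formula for a random intercept under variance components that are treated as known. The names `fit_stump` and `reem_stump` are hypothetical.

```python
import random

def fit_stump(x, target, min_leaf=10):
    """Best single split on x; returns (threshold, left_mean, right_mean)."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    n, total = len(x), sum(target)
    best, left = None, 0.0
    for k in range(1, n):
        left += target[order[k - 1]]
        if k < min_leaf or n - k < min_leaf:
            continue
        score = left ** 2 / k + (total - left) ** 2 / (n - k)
        if best is None or score > best[0]:
            thr = 0.5 * (x[order[k - 1]] + x[order[k]])
            best = (score, thr, left / k, (total - left) / (n - k))
    return best[1:]

def reem_stump(groups, x, y, var_b=1.0, var_eps=1.0, tol=1e-6, max_iter=100):
    """Alternate between a stump for f and shrunken random intercepts."""
    ids = sorted(set(groups))
    b = {g: 0.0 for g in ids}                          # step 1: b_i = 0
    for _ in range(max_iter):
        # Step 2.1 (simplified): fit the stump to y_it - b_i.
        target = [yv - b[g] for g, yv in zip(groups, y)]
        thr, mu_left, mu_right = fit_stump(x, target)
        fitted = [mu_left if xv < thr else mu_right for xv in x]
        # Step 2.2 (simplified): shrink each individual's mean residual,
        # with the variance components assumed known rather than estimated.
        sums = {g: [0.0, 0] for g in ids}
        for g, yv, fv in zip(groups, y, fitted):
            sums[g][0] += yv - fv
            sums[g][1] += 1
        new_b = {g: var_b * s / (n * var_b + var_eps) for g, (s, n) in sums.items()}
        change = max(abs(new_b[g] - b[g]) for g in ids)
        b = new_b
        if change < tol:
            break
    return thr, mu_left, mu_right, b

# Toy data: f jumps from 0 to 3 at x = 0; 40 individuals, 25 observations each.
rng = random.Random(2)
groups, x, y, true_b = [], [], [], {}
for i in range(40):
    true_b[i] = rng.gauss(0.0, 1.0)
    for _ in range(25):
        xv = rng.uniform(-1.0, 1.0)
        groups.append(i)
        x.append(xv)
        y.append(true_b[i] + (3.0 if xv >= 0 else 0.0) + rng.gauss(0.0, 1.0))

thr, mu_left, mu_right, b_hat = reem_stump(groups, x, y)
print(round(thr, 3), round(mu_right - mu_left, 2))
```

The split should land near the true jump at 0 and the gap between leaf means near 3. The overall level is shared between the leaf means and the intercepts, so the estimated $\hat{b}_i$ are best compared with the true effects after centering both.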


Estimation of a RE-EM Tree

This algorithm has several advantages over other approaches.

◮ The fitting of the regression tree uses built-in methods for missing data.
◮ Different numbers of time points for different individuals are easily handled, as is prediction of response values for future time points.
◮ The fixed effects portion of the model can be based on time-varying or nonvarying predictors.
◮ The fitting of the random effects portion of the model can be based on either independence within individuals, or a specified autocorrelation structure, thus allowing for complex correlation structure within individuals.
◮ Multilevel hierarchies are easily handled.


Structure of simulations

We use limited Monte Carlo simulations to investigate the properties of the method. We examine number of individuals I ranging from 50 to 200 with number of time points T ranging from 10 to 100 (implying number of observations I × T ranging from 500 to 20,000). Simulations show that properties are driven by the number of observations, not I or T separately.

The null linear model is based on 5 normally-distributed predictors with mean 10 and standard deviation 1, $\beta' = (1, 2, -3, 4, -5)$, with the null model including 5 additional predictors with zero slopes; $\sigma^2_\varepsilon = \sigma^2_b = 1$.
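A minimal sketch of how data from this null design could be generated. Three details are assumptions, since the slide does not state them: the random effect is a single random intercept per individual, there is no fixed intercept, and the five zero-slope predictors share the N(10, 1) distribution of the first five. The function name `simulate_null` is hypothetical.

```python
import random

# Five real slopes followed by five zero slopes.
BETA = [1.0, 2.0, -3.0, 4.0, -5.0] + [0.0] * 5

def simulate_null(n_individuals, n_times, seed=0):
    """Rows of (individual, time, predictors, response) under the null model."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_individuals):
        b_i = rng.gauss(0.0, 1.0)                       # sigma_b^2 = 1
        for t in range(n_times):
            x = [rng.gauss(10.0, 1.0) for _ in BETA]    # mean 10, sd 1 (assumed for all 10)
            mean = b_i + sum(bk * xk for bk, xk in zip(BETA, x))
            rows.append((i, t, x, mean + rng.gauss(0.0, 1.0)))   # sigma_eps^2 = 1
    return rows

rows = simulate_null(50, 10)
print(len(rows))   # 500 = 50 individuals x 10 time points
```

The smallest design in the talk, I = 50 and T = 10, gives the 500 observations printed above; the largest, I = 200 and T = 100, gives 20,000.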


Size of tests

Even though the growing/pruning rules for the tree are not designed to directly control Type I error, it turns out that they do at a roughly .05 level, resulting in a generally conservative test.

[Figure: test size (vertical axis, 0.01 to 0.05) versus total number of observations (2000 to 10000).]


Different slopes

$$E(y) = E_0(y) \pm_{x_{10}} \alpha x_6, \quad \alpha = .25(.25)1$$

[Figure: power versus total number of observations (2000 to 10000), with an example diagnostic tree that splits on x6 and x10.]


Quadratic term

$$E(y) = E_0(y) \pm \alpha x_6^2, \quad \alpha = .05(.05).2, \quad E(x_6) = 0$$

[Figure: power versus total number of observations (2000 to 10000), with an example diagnostic tree that splits twice on x6.]


Product term

$$E(y) = E_0(y) \pm \alpha x_6 x_7, \quad \alpha = .25(.25)1, \quad E(x_6) = E(x_7) = 0$$

[Figure: power versus total number of observations (2000 to 10000), with an example diagnostic tree that splits repeatedly on x6 and x7.]
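The three departures from the null mean used in the nonlinearity simulations (different slopes, quadratic term, product term) can be written as small functions. This is a sketch under assumptions: x10 is taken to be a binary subgroup indicator whose positive values select the "+" branch, the plain ± in the quadratic and product cases is exposed as a `sign` argument, and the grid helper expands the a(b)c shorthand as "from a to c in steps of b". All function names are hypothetical.

```python
def different_slopes(e0, x6, x10, alpha):
    """E0(y) + alpha * x6 in one subgroup, E0(y) - alpha * x6 in the other."""
    sign = 1.0 if x10 > 0 else -1.0      # assumed coding of the subgroup indicator
    return e0 + sign * alpha * x6

def quadratic_term(e0, x6, alpha, sign=1.0):
    """E0(y) +/- alpha * x6^2, with x6 centered at zero."""
    return e0 + sign * alpha * x6 ** 2

def product_term(e0, x6, x7, alpha, sign=1.0):
    """E0(y) +/- alpha * x6 * x7, with x6 and x7 centered at zero."""
    return e0 + sign * alpha * x6 * x7

def grid(start, step, stop):
    """Expand the a(b)c shorthand into the list of alpha values."""
    n = round((stop - start) / step) + 1
    return [round(start + k * step, 10) for k in range(n)]

print(grid(0.25, 0.25, 1.0))   # [0.25, 0.5, 0.75, 1.0]
print(grid(0.05, 0.05, 0.2))   # [0.05, 0.1, 0.15, 0.2]
print(different_slopes(10.0, 2.0, 1, 0.5), different_slopes(10.0, 2.0, -1, 0.5))
```

Setting alpha to 0 in any of the three functions returns the null mean $E_0(y)$, which is why power is traced out over increasing alpha.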


Heteroscedasticity related to nonpredictor

$$\sigma^2_y = |x_6|^{\alpha}, \quad \alpha = .125(.125).5$$

[Figure: power versus total number of observations (2000 to 10000), with an example diagnostic tree that splits twice on x6.]


Heteroscedasticity related to subgroups

$$\sigma^2_y = 1 \pm_{x_{10}} \alpha, \quad \alpha = .0625(.0625).25$$

[Figure: power versus total number of observations (2000 to 10000), with an example diagnostic tree that has a single split on x10.]


Heteroscedasticity related to predictor

$$\sigma^2_y = |x_5|^{\alpha}, \quad \alpha = .125(.125).5, \quad E(x_5) = 0$$

[Figure: power versus total number of observations (2000 to 10000), with an example diagnostic tree that splits three times on x5.]


Heteroscedasticity related to expected response

$$\sigma^2_y = |E(y)|^{\alpha}, \quad \alpha = .0625(.0625).25$$

[Figure: power versus total number of observations (2000 to 10000), with an example diagnostic tree that splits on x5, x4, and x3.]
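The four heteroscedastic alternatives share two variance forms, which can be sketched as follows. The reading of alpha as an exponent in the power form is an assumption based on alpha = 0 giving a constant variance of 1, and the coding of x10 (positive values select the "+" branch) is likewise assumed; the function names are hypothetical.

```python
import random

def var_power(z, alpha):
    """sigma_y^2 = |z|^alpha, where z is x6, x5, or E(y) depending on the scenario."""
    return abs(z) ** alpha

def var_subgroup(x10, alpha):
    """sigma_y^2 = 1 + alpha in one subgroup and 1 - alpha in the other."""
    sign = 1.0 if x10 > 0 else -1.0      # assumed coding of the subgroup indicator
    return 1.0 + sign * alpha

def draw_error(variance, rng):
    """One Gaussian error with the requested variance."""
    return rng.gauss(0.0, variance ** 0.5)

rng = random.Random(0)
print(var_power(-4.0, 0.5))        # 2.0
print(var_subgroup(1, 0.25))       # 1.25
print(var_subgroup(0, 0.25))       # 0.75
error = draw_error(var_power(-4.0, 0.5), rng)
```

Because the error variance now depends on a covariate or on the mean, the absolute residuals from the null fit carry structure that a tree can split on.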


Spruce tree growth

Diggle, Liang, and Zeger (1994) and Venables and Ripley (2002) discuss a longitudinal growth study. The response is the log-size of 79 Sitka spruce trees, two-thirds of which were grown in ozone-enriched chambers, measured at five time points.

First, a linear model based on treatment status and time is fit, but the tree-based nonlinearity test indicates lack of fit related to time.


Test of fit of linear model

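[Figure: diagnostic tree for the residuals of the linear model, with splits at Time >= 242.5 and Time < 163 and terminal node means −0.125, −0.108, and 0.0777.]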


Test of fit of different slopes model

A natural alternative model is one allowing for different slopes for the treatment and control groups, but that does not correct the lack of fit.

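[Figure: diagnostic tree for the residuals of the different slopes model, with splits at Time >= 242.5 and Time < 163 and terminal node means −0.125, −0.108, and 0.0777.]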


Treating time as categorical

As the diagnostic trees suggest, the problem is in the linear formulation of the effect of time. If time is treated as a categorical predictor, the apparent lack of fit disappears, as the diagnostic tree has no splits. An additional interaction of the treatment and (categorical) time effects is statistically significant, but has higher AIC and BIC values than the additive model, reinforcing that from a practical point of view the fit of the simpler model is adequate. Heteroscedasticity diagnostic trees for all models do not split.


Transaction data set

We also apply the diagnostic trees to a dataset on third-party sellers on Amazon Web Services aiming to predict the prices at which software titles are sold based on the characteristics of the competing sellers (Ghose, 2005; Sela and Simonoff, 2009). The data consist of 9484 transactions for 250 distinct software titles; thus, there are I = 250 individuals in the panel with a varying number of observations Ti per individual.


Transaction data set variables

◮ Target variable: the price premium that a seller can command (the difference between the price at which the good is sold and the average price of all of the competing goods in the marketplace). We also examine the log of this variable.
◮ Predictor variables
  ◮ The seller’s own reputation (total number of comments, the number of positive and negative comments received from buyers, the length of time that the seller has been in the marketplace)
  ◮ The characteristics of its competitors (the number of competitors, the quality of competing products, the average reputation of the competitors, and the average prices of the competing products).


Test of fit of linear model

A linear model is clearly inadequate.

[Figure: diagnostic tree for the residuals of the linear model, heavily branched, with splits on COUNTYR, AvgCompPrice, PostHours, SellerLife, PLIFE, Competitors, AvgCompRating, NLIFE, AvgCompCondition, SellerRating, PTHRTY, AvgCompLife, and NNINTY.]


Test of fit of log-linear model

A log-linear model is also clearly inadequate.

[Figure: diagnostic tree for the residuals of the log-linear model, heavily branched, with splits on COUNTYR, PLIFE, SellerRating, PostHours, NLIFE, AvgCompLife, AvgCompPrice, Competitors, SellerLife, NYEAR, COUNTTH, AvgCompRating, PYEAR, and NGNINTY.]


Heteroscedasticity test of log-linear model

There is apparent heteroscedasticity in the log-linear model, although we recognize that the lack of fit can affect this test.

[Figure: heteroscedasticity diagnostic tree for the absolute residuals of the log-linear model, with splits on COUNTYR, PYEAR, Competitors, PLIFE, AvgCompLife, AvgCompPrice, SellerLife, and NLIFE.]


RE-EM tree for log price premium

[Figure: RE-EM tree for the log price premium, heavily branched, with a first split at COUNTYR < 62.5 and further splits on PostHours, SellerCond, AvgCompPrice, AvgCompLife, AvgCompRating, Competitors, COUNTTH, SellerLife, AvgCompCondition, NGNINTY, NYEAR, and COUNTNY.]


Heteroscedasticity test of log price premium RE-EM tree

Heteroscedasticity is apparently much reduced when a tree model is used.

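[Figure: heteroscedasticity diagnostic tree for the RE-EM tree fit, with splits at PLIFE < 80.5 and COUNTYR >= 61 and terminal node means 0.13, 0.307, and 0.253.]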


Conclusion

◮ Goodness-of-fit diagnostic trees can be constructed for longitudinal and clustered data based on the RE-EM tree idea.
◮ Versions to assess potential nonlinearity (based on residuals) and heteroscedasticity (based on absolute residuals) correspond to roughly .05 level tests, and demonstrate effective power for identifying different types of model violations.
◮ The diagnostic trees are not meant to replace examination of residuals or more focused (and powerful) tests of specific model violations; rather, they are an omnibus tool to add to the data analyst’s toolkit to try to help identify unspecified mixed effects model violations.


Background information and R code

A paper describing the RE-EM tree method is available at http://archive.nyu.edu/handle/2451/28094. The R package REEMtree used to construct RE-EM trees is available from CRAN (for versions of R starting with 2.12.2).
