Sam Stewart 1 , MMath, AStat Mohammed Abdolell 2 , MSc, PStat Michael - - PowerPoint PPT Presentation

sam stewart 1 mmath astat mohammed abdolell 2 msc pstat
SMART_READER_LITE
LIVE PREVIEW

Sam Stewart 1 , MMath, AStat Mohammed Abdolell 2 , MSc, PStat Michael - - PowerPoint PPT Presentation

Customizing the rpart library for multivariate gaussian outcomes: the longRPart library Sam Stewart 1 , MMath, AStat Mohammed Abdolell 2 , MSc, PStat Michael Leblanc 3 , PhD 1 IDPhD Candidate (Health Informatics), Faculty of Comp Sci Statistical


slide-1
SLIDE 1

Customizing the rpart library for multivariate gaussian outcomes: the longRPart library

Sam Stewart1, MMath, AStat Mohammed Abdolell2, MSc, PStat Michael Leblanc3, PhD

1 IDPhD Candidate (Health Informatics), Faculty of Comp Sci

Statistical Consulting Service, Faculty of Math and Stats Dalhousie University

2 Dept of Diagnostic Radiology and Division of Medical

Education, Dalhousie University Dalla Lana School of Public Health, University of Toronto

3 Fred Hutchinson Cancer Research Center

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 1 / 30

slide-2
SLIDE 2

Introduction In a paper written in 2002, Abdolell et al. [1] outlined a method for creating classification trees for multivariate normal data. This project implemented and extended this algorithm in an R package. The outline of the presentation:

◮ Problem description ◮ Solution in R ◮ Example data Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 2 / 30

slide-3
SLIDE 3

Problem

Problem Consider the situation where we have n patients with p

  • bservations per patient, along with a vector of patient

attributes. Yn×p, Xn×1 Assume that the outcomes are multivariate normal, that is f (yi; µi, Σ) = (2π)−p/2|Σ|−1/2exp

  • −1

2

  • (yi − µi)′Σ−1(yi − µi)
  • Abdolell et. al [1] proposed a method for building classification

trees for longitudinal data, based on likelihood ratio statistics

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 3 / 30

slide-4
SLIDE 4

Problem

Building Classification Trees Using maximum likelihood estimates we get µi = yi, and the maximum log-likelihood fuction l(yi; yi) = −1 2log [(2π)p|Σ|] (1) The deviance function for a single observation, therefore, is D(µi; yi) = l(yi; yi)−l(µi; yi) = . . . = (yi −µi)′Σ−1(yi −µi) (2) The deviance function is the Mahalanobis distance [2] for the

  • bservations. In practice this can be obtained from a mixed

model of the data (more on this later) The deviance for a set of observations will be the sum of deviances for those observations D(ˆ µ; y) =

n

  • i=1

D(ˆ µi; yi) (3)

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 4 / 30

slide-5
SLIDE 5

Problem

Partitioning The partitioning step is the same for multivariate outcomes is the same as it is for univariate outcomes. The data is partitioned by the values of the patient attribute (Xn×1) and the deviance is calculated for each child node. The

  • ptimal split is the one that creates the greatest reduction in

deviance between the parent node and the two child nodes. let s be a particular split in the set of all possible splits S, and let N be the dataset, split into NR, NL

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 5 / 30

slide-6
SLIDE 6

Problem

Partitioning We then define the splitting criterion as ∆D(s, N) = DN(ˆ µ; , y) − DNL,NR(ˆ µL, ˆ µR; y) (4) =

n

  • i=1

D(ˆ µ; yi) − nL

  • i=1

D(ˆ µ; yi) +

nR

  • i=1

D(ˆ µ; yi)

  • (5)

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 6 / 30

slide-7
SLIDE 7

Problem

Partitioning The best split, therefore, is the split that maximizes ∆D(s, N) Along with this partitioning function, Abdolell et al. provide the theory for a permutation test to obtain a p-value and a bootstrapping function to obtain a confidence interval on the values of the first split. Though not covered in the original paper, the extension to multiple splits and multiple patient attributes is implied As well v-fold cross-validation can be implemented

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 7 / 30

slide-8
SLIDE 8

Solution in R

Solution in R

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 8 / 30

slide-9
SLIDE 9

Solution in R

Providing the algorithm The original paper explained the implementation of the algorithm in SAS using the MIXED procedure. The algorithm has since been adapted to R, and is available from CRAN as the package longRPart The longRPart library has added to the partitioning function by providing plotting functions, multiple splits and predictors, and cross-validation.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 9 / 30

slide-10
SLIDE 10

Solution in R

longRPart The library is heavily dependant on two other R libraries: rpart is the traditional method of creating classification and regression trees. nlm provides the methods for fitting the data and

  • btaining the Mahalanobis distance.

The library uses the algorithms in rpart to extend its partitioning to multivariate normal outcomes.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 10 / 30

slide-11
SLIDE 11

Solution in R Customizing rpart

Customizing rpart Customizing rpart requires three functions: evaluation, split and init. These three functions are passed as a list to the rpart function through the method option. rpart(y∼x, data=dat, method=list(eval=evaluation, split=split, init=initialize))

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 11 / 30

slide-12
SLIDE 12

Solution in R Customizing rpart

evaluation The evaluation function evaluates the deviance of a node. The function is called once per node. The function returns a list of two variables: a label for the node and a measure of deviance at the node For the longRPart library, the label at the node is the estimate of ˆ µ for the observations at the node, and the deviance is calculated as the −2×log-likelihood of the model.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 12 / 30

slide-13
SLIDE 13

Solution in R Customizing rpart

evaluation evaluation <- function(y, wt, parms){ model = lme(lmeFormula,data=parms[groupingFactor%in%y,],random=randomFormula,correlation=R,na.action=n if(continuous){ slope = model$coefficients$fixed[2] } else{ levels = length(levels(data[,names(data)==terms[1]])) y=model$coefficients$fixed[1:levels] x=1:levels slope = lm(y~x)$coefficients[2] } list(label=slope,deviance=-2*(model$logLik)) }

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 13 / 30

slide-14
SLIDE 14

Solution in R Customizing rpart

split The function is called once per node per covariate. The split function evaluates the potential splitting values to determine what the best split is. The idea is to check every candidate value for a covariate and provide an ordered list of the results. This function is very computationally intensive in longRPart: For every potential split value two mixed models need to be fit, one for the children on the left and one for the children on the right.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 14 / 30

slide-15
SLIDE 15

Solution in R Customizing rpart

split The function returns two vectors: goodness is a measure of the split, with larger values being better, and a value of 0 indicating no split. This vector is ∆D(s, N). direction is the applied ordering for categorical data, or a vector of -1/1 indicating in which direction the split should be directed for continuous data.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 15 / 30

slide-16
SLIDE 16

Solution in R Customizing rpart

split

split <- function(){ calculate root deviance if(categorical){ establish the ordering of the categorical variables } for(i in xUnique){ 1. partition data by observation i 2. calculate deviance for each dataset 3. Save the sum of the child deviances in a vector } return (goodness=rootDev - node deviance, direction=dir) }

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 16 / 30

slide-17
SLIDE 17

Solution in R Customizing rpart

init Is used to initialize the process Allows the passing of auxiliary variables Includes two functions: summary helps summarize the model in text, while text to add text to the plot of the tree.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 17 / 30

slide-18
SLIDE 18

Solution in R Customizing rpart

init

initialize <- function(y,offset,parms=0,wt){ list( y=y, parms=parms, numresp=1, numy=1, summary=function(yval,dev,wt,ylevel,digits){ paste("deviance (-2logLik)", format(signif(dev),3), "slope", signif(yval,2)) }, text= function(yval,dev,wt,ylevel,digits,n,use.n){ if(!use.n){paste("m:",format(signif(yval,1)))} else{paste("n:",n)} } ) }

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 18 / 30

slide-19
SLIDE 19

Solution in R Customizing rpart

Other functions

Working with longRPart models

lrpPVal This function uses permutations of the covariates to determine the p-value of the initial split lrpCI This function takes bootstrap samples of the data to calculates a confidence interval for the initial split lrpCV This function performs v-fold cross validation of the model to assess its fit lrpPlot, lrpTreePlot These are the two plotting functions for the data

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 19 / 30

slide-20
SLIDE 20

Application

Application

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 20 / 30

slide-21
SLIDE 21

Application

Data Data is on cochlear implants Question is the effectiveness of the cochlear implants on improving speech and language skills Independant variable of interest is age: is the effect of the implant varried by age. Dependent variable of interest is speech perception scores taken repeatedly over time.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 21 / 30

slide-22
SLIDE 22

Application

Analysis data(pbkphData) model = longRPart(pbkph∼Time, ∼age+gender, ∼1|Subject, pbkphData, R=corExp(form=∼time), control=rpart.control(minbucket=80,xval=10))

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 22 / 30

slide-23
SLIDE 23

Application

Model Summary

summary(model) Call: rpart(formula = paste(groupingName, c(rPartFormula)), data = data, method = list(eval = evaluation, split = split, init = initialize), parms = data, control = control) n= 552 CP nsplit rel error 1 0.04025716 0 1.0000000 2 0.02922562 1 0.9597428 3 0.02802864 2 0.9305172 4 0.01000000 3 0.9024886 Node number 1: 552 observations, complexity param=0.04025716 deviance (-2logLik) 2205.360 slope 5 left son=2 (416 obs) right son=3 (136 obs) Primary splits: age < 8.35 to the left, improve=1147.072, (0 missing) gender splits as LR, improve=1124.425, (0 missing)

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 23 / 30

slide-24
SLIDE 24

Application

lrpTreePlot

| age< 8.35 age< 4.15 age< 5.55 m: 9 m: 9 m: 8 m: 3 Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 24 / 30

slide-25
SLIDE 25

Application

lrpPlot

1 2 3 4 5 6 7 8 20 30 40 50 60 70 80

  • node 3

node 5 node 6 node 7 Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 25 / 30

slide-26
SLIDE 26

Application

P-Values and Confidence Intervals

  • 200

400 600 800 1000 50 55 60 65 70 75 80

P−Value Plot

permutation Deviance Improvement

90% Confidence Interval Plot

  • ptimal split

Frequency 4 6 8 10 12 14 100 200 300 400 500

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 26 / 30

slide-27
SLIDE 27

Application

CV

  • 10

20 30 40 50 60 70 10 20 30 40 50 60

Cross−Validated Deviances

Subjects Deviance

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 27 / 30

slide-28
SLIDE 28

Application

Conclusions Additional work is required to address some issues

◮ Cross-Validation ◮ S3 Compatibility ◮ Improving the speed of the Mahlanobis computation (any

ideas?) It is possible to use the existing algorithms in R to extend classification trees to non-traditional outcomes.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 28 / 30

slide-29
SLIDE 29

Application

Thank you Merci Beaucoup!

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 29 / 30

slide-30
SLIDE 30

Application

  • M. Abdolell, M. LeBlanc, D. Stephens, and R. V. Harrison.

Binary partitioning for continuous longitudinal data: categorizing a prognostic variable. Statistics in Medicine, 21:3395–2409, 2002. BFJ Manley. Multivariate Statistical Methods: A Primer. Chapman and Hall: London, 1986.

Sam Stewart (Dal) longRPart Wednesday, June 3, 2009 30 / 30