1 Sequential data analysis Sequential data analysis Objects and - - PDF document

1
SMART_READER_LITE
LIVE PREVIEW

1 Sequential data analysis Sequential data analysis Objects and - - PDF document

Sequential data analysis Sequential data analysis Outline Sequential data analysis Introduction 1 An introduction to R Installing and launching R 2 Gilbert Ritschard Objects and operators 3 Department of Econometrics and Laboratory of


slide-1
SLIDE 1

Sequential data analysis

Sequential data analysis An introduction to R

Gilbert Ritschard

Department of Econometrics and Laboratory of Demography, University of Geneva http://mephisto.unige.ch/biomining

APA-ATI Workshop on Exploratory Data Mining University of Southern California, Los Angeles, CA, July 2009

23/7/2009gr 1/64 Sequential data analysis

Outline

1

Introduction

2

Installing and launching R

3

Objects and operators

4

Elements of statistical modeling

5

Growing trees: rpart and party

6

Custom functions and programming

23/7/2009gr 2/64 Sequential data analysis Introduction

R

R is: Software environment for statistical computing and graphics Based on the S language (as is S-PLUS) Freely distributed under GPL licence Available for any platform: Windows/Mac/Linux/Unix Easily extensible with numerous contributed modules

23/7/2009gr 4/64 Sequential data analysis Installing and launching R

Installation

R and the modules can be downloaded from the CRAN http://cran.r-project.org By default, no GUI is proposed under Linux. Under Windows and MacOSX, the basic GUI remains limited. ... but try Rcmdr (can be download from the CRAN)

23/7/2009gr 6/64 Sequential data analysis Installing and launching R

First steps in R

Four possibilities to send commands to R

1 Type commands in the R Console. 2 The script editor -> File/New script (only Windows/Mac) 3 The Rcmd module 4 Use a text editor with R support (Tinn-R, WinEdt, etc.)

In addition, you can also use your preferred text editor and copy-paste the commands into the R Console,

23/7/2009gr 7/64 Sequential data analysis Objects and operators Introduction to R objects

Objects

R works with objects Assigning a value to an object ‘a’

R> a <- 50

Operation on an object

R> a/50 [1] 1

Case-sensitive: a = A

R> A/50 Error: object "A" not found

23/7/2009gr 10/64

1

slide-2
SLIDE 2

Sequential data analysis Objects and operators Introduction to R objects

Types of objects

Different types of objects vector: 4 5 1

  • r in R c(4,5,1)

” D”” E”” A”

  • r in R c("D","E","A")

factor: categorical variable matrix: table of numerical data data frame: general data table (columns can be of different types) ...

23/7/2009gr 11/64 Sequential data analysis Objects and operators Introduction to R objects

Factors I

A factor is defined by“levels”(possible values) and an indicator of whether it is ordinal or not. Vector of“strings”

R> sex <- c("man", "woman", "woman", "man", "woman") R> sex [1] "man" "woman" "woman" "man" "woman"

Creation of a factor

R> sex.fac <- factor(sex) R> sex.fac [1] man woman woman man woman Levels: man woman R> attributes(sex.fac)

23/7/2009gr 12/64 Sequential data analysis Objects and operators Introduction to R objects

Factors II

$levels [1] "man" "woman" $class [1] "factor" R> table(sex.fac) sex.fac man woman 2 3

To change the order of the“levels”

R> sex.fac2 <- factor(sex, levels = c("woman", "man")) R> sex.fac2n <- as.numeric(sex.fac2) R> table(sex.fac2, sex.fac2n) sex.fac2n sex.fac2 1 2 woman 3 0 man 0 2

23/7/2009gr 13/64 Sequential data analysis Objects and operators Introduction to R objects

Objects (continued) I

Results can always be stored in a new object Example:

R> library(TraMineR) R> data(mvad) R> tab.male.gcse <- table(mvad$male, mvad$gcse5eq) R> tab.male.gcse no yes no 186 156 yes 266 104

23/7/2009gr 14/64 Sequential data analysis Objects and operators Introduction to R objects

Objects (continued)

Depending of its class, methods can be directly applied to it

R> plot(tab.male.gcse, cex.axis = 1.5)

tab.male.gcse

no yes no yes 23/7/2009gr 15/64 Sequential data analysis Objects and operators Introduction to R objects

Row and marginal distributions

Row and column distributions

R> prop.table(tab.male.gcse, 1) no yes no 0.5438596 0.4561404 yes 0.7189189 0.2810811 R> prop.table(tab.male.gcse, 2) no yes no 0.4115044 0.6000000 yes 0.5884956 0.4000000

Margins

R> margin.table(tab.male.gcse, 1) no yes 342 370 R> margin.table(tab.male.gcse, 2) no yes 452 260

23/7/2009gr 16/64

2

slide-3
SLIDE 3

Sequential data analysis Objects and operators Acting on subsets of objects

Indexes

Indexing vectors x[n] nth element x[-n] all but the nth element x[1:n] first n elements x[-(1:n)] elements from n+1 to the end x[c(1,4,2)] specific elements x["name"] element named "name" x[x > 3] all elements greater than 3 x[x > 3 & x < 5] all elements between 3 and 5 x[x %in% c("a","and","the")] elements in the given set Indexing matrices x[i,j] element at row i, column j x[i,] row i x[,j] column j x[,c(1,3)] columns 1 and 3 x["name",] row named "name" Indexing data frames (matrix indexing plus the following) x[["name"]] column named "name" x$name idem

23/7/2009gr 18/64 Sequential data analysis Objects and operators Acting on subsets of objects

Crosstable on data subsets

Cross tables for catholic and non catholic

R> table(mvad$male[mvad$catholic == "yes"], mvad$gcse5eq[mvad$catholic == + "yes"]) no yes no 82 77 yes 133 52 R> table(mvad$male[mvad$catholic == "no"], mvad$gcse5eq[mvad$catholic == + "no"]) no yes no 104 79 yes 133 52

23/7/2009gr 19/64 Sequential data analysis Objects and operators Acting on subsets of objects

3-dimensional crosstables

Alternatively

R> table(mvad$male, mvad$gcse5eq, mvad$catholic) , , = no no yes no 104 79 yes 133 52 , , = yes no yes no 82 77 yes 133 52

23/7/2009gr 20/64 Sequential data analysis Objects and operators Importation/exportation

Opening and closing R

R saves the working environment in the .RData file of the current directory.

getwd() provides the current directory setwd("C:/introR/") sets the current directory save.image() saves the working directory in .RData load("example.RData") loads working directory example.RData

On line help command: help(subject), or ?sujet

23/7/2009gr 22/64 Sequential data analysis Objects and operators Importation/exportation

Object Management

List of objects in the“Workingspace”

R> ls() [1] "a" "datadir" "filename" "graphdir" [5] "mvad" "pngdir" "sex" "sex.fac" [9] "sex.fac2" "sex.fac2n" "tab.male.gcse"

Removing objects

R> rm(sex, sex.fac2) R> ls() [1] "a" "datadir" "filename" "graphdir" [5] "mvad" "pngdir" "sex.fac" "sex.fac2n" [9] "tab.male.gcse"

23/7/2009gr 23/64 Sequential data analysis Objects and operators Importation/exportation

Importing text files

R can import text files (tab-delimited, CSV, ...) with read.table()

read.table(file, header = FALSE, sep = "", quote = "\"✬", dec = ".", row.names, col.names, as.is = FALSE, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#") Ex: importing a tab-delimited file with variables names in first row: R> example <- read.table(file = "example.dat", header = TRUE, + sep = "\t") R> example age revenu sexe 1 25 100 homme 2 45 200 femme 3 30 50 homme

23/7/2009gr 24/64

3

slide-4
SLIDE 4

Sequential data analysis Objects and operators Importation/exportation

Importing data from other formats

R can import SPSS, Stata, SAS, minitab, ... files with the foreign library Loading the library

R> library(foreign)

Reading the SPSS file

R> mydata <- read.spss("example.sav", to.data.frame = TRUE)

Same principle for other formats See help on the foreign library

R> library(help = "foreign")

23/7/2009gr 25/64 Sequential data analysis Objects and operators Importation/exportation

Exportation

Exporting in text file

R> write.table(mydata, file = "export.txt", sep = "\t")

Labels are lost, factors are saved as strings

Alternatively with foreign

R> write.foreign(mydata, datafile = "export.txt", + codefile = "export.sps", package = "SPSS")

23/7/2009gr 26/64 Sequential data analysis Elements of statistical modeling

Statistical modeling: Regression

We use the mvad data of TraMineR Regression of longitudinal entropies on

male, catholic, ...

23/7/2009gr 28/64 Sequential data analysis Elements of statistical modeling

Loading the data

R> mvad.lab <- seqstatl(mvad[, 17:86]) R> mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR") R> mvad.seq <- seqdef(mvad[, 17:86], labels = mvad.lab, + states = mvad.shortlab) R> summary(mvad.seq) [>] dimensionality of the sequence space: 350 [>] 712 sequences in the data set [>] 490 unique sequences in the data set [>] min/max sequence length: 70 / 70 [>] alphabet: 1=EM 2=FE 3=HE 4=JL 5=SC 6=TR [>] colors: 1=#7FC97F 2=#BEAED4 3=#FDC086 4=#FFFF99 5=#386CB0 6=#F0027F [>] labels: 1=employment 2=FE 3=HE 4=joblessness 5=school 6=training [>] code for missing statuses: * [>] code for void state: *

23/7/2009gr 29/64 Sequential data analysis Elements of statistical modeling

Computing longitudinal entropies

Computing entropies

R> entrop <- seqient(mvad.seq)

Boxplot of their distribution

R> boxplot(entrop, col = "lightblue", cex.axis = 1.5)

  • 0.0

0.2 0.4 0.6 0.8 23/7/2009gr 30/64 Sequential data analysis Elements of statistical modeling

Linear regression: lm() I

Creating the regression object

R> lm.entrop <- lm(entrop ~ male + catholic + gcse5eq, data = mvad)

23/7/2009gr 31/64

4

slide-5
SLIDE 5

Sequential data analysis Elements of statistical modeling

Linear regression: results

Displaying the results

R> summary(lm.entrop) Call: lm(formula = entrop ~ male + catholic + gcse5eq, data = mvad) Residuals: Min 1Q Median 3Q Max

  • 4.713e-01 -9.506e-02

2.750e-05 1.286e-01 4.479e-01 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.38916 0.01277 30.482 < 2e-16 *** maleyes

  • 0.04177

0.01329

  • 3.143

0.00174 ** catholicyes 0.01768 0.01307 1.353 0.17643 gcse5eqyes 0.06451 0.01378 4.680 3.43e-06 ***

  • Signif. codes:

0 ✬***✬ 0.001 ✬**✬ 0.01 ✬*✬ 0.05 ✬.✬ 0.1 ✬ ✬ 1 Residual standard error: 0.1741 on 708 degrees of freedom Multiple R-squared: 0.05376, Adjusted R-squared: 0.04975 F-statistic: 13.41 on 3 and 708 DF, p-value: 1.615e-08

23/7/2009gr 32/64 Sequential data analysis Elements of statistical modeling

Plotting a regression object

R> plot(lm.entrop, which = 2)

  • −3
−2 −1 1 2 3 −2 −1 1 2 3 Theoretical Quantiles Standardized residuals lm(entrop ~ male + catholic + gcse5eq) Normal Q−Q 193 421 310

23/7/2009gr 33/64 Sequential data analysis Elements of statistical modeling

Logistic regression I

Logistic regression: specific case of the generalized linear model glm() with family = binomial

R> lg.gr <- glm(gcse5eq ~ male + catholic, family = binomial, + data = mvad) R> summary(lg.gr) Call: glm(formula = gcse5eq ~ male + catholic, family = binomial, data = mvad) Deviance Residuals: Min 1Q Median 3Q Max

  • 1.1286
  • 1.0821
  • 0.7929

1.2758 1.6189 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept)

  • 0.2283

0.1315

  • 1.736

0.0825 . maleyes

  • 0.7677

0.1588

  • 4.833 1.34e-06 ***

catholicyes 0.1124 0.1586 0.709 0.4783

  • Signif. codes:

0 ✬***✬ 0.001 ✬**✬ 0.01 ✬*✬ 0.05 ✬.✬ 0.1 ✬ ✬ 1

23/7/2009gr 34/64 Sequential data analysis Elements of statistical modeling

Logistic regression II

(Dispersion parameter for binomial family taken to be 1) Null deviance: 934.62

  • n 711

degrees of freedom Residual deviance: 910.51

  • n 709

degrees of freedom AIC: 916.51 Number of Fisher Scoring iterations: 4

23/7/2009gr 35/64 Sequential data analysis Elements of statistical modeling

computing the“odds ratios”

Retrieve coefficients and compute their exp()

R> exp(lg.gr$coefficients) (Intercept) maleyes catholicyes 0.795889 0.464072 1.118991

Completing the table of coefficients, standard errors and significativity with exp(β)

R> lg.gr.coeff <- as.data.frame(summary(lg.gr)$coefficients) R> lg.gr.coeff <- cbind(lg.gr.coeff, ❵Exp Estim.❵ = exp(lg.gr.coeff[, + "Estimate"])) R> lg.gr.coeff Estimate Std. Error z value Pr(>|z|) Exp Estim. (Intercept) -0.2282955 0.1314786 -1.7363697 8.249849e-02 0.795889 maleyes

  • 0.7677155

0.1588396 -4.8332757 1.343046e-06 0.464072 catholicyes 0.1124274 0.1585663 0.7090246 4.783092e-01 1.118991

23/7/2009gr 36/64 Sequential data analysis Elements of statistical modeling

ANOVA

ANOVA

R> summary(aov(entrop ~ male, data = mvad)) Df Sum Sq Mean Sq F value Pr(>F) male 1 0.4888 0.4888 15.645 8.406e-05 *** Residuals 710 22.1807 0.0312

  • Signif. codes:

0 ✬***✬ 0.001 ✬**✬ 0.01 ✬*✬ 0.05 ✬.✬ 0.1 ✬ ✬ 1 R> summary(aov(lm.entrop)) Df Sum Sq Mean Sq F value Pr(>F) male 1 0.4888 0.4888 16.1322 6.539e-05 *** catholic 1 0.0662 0.0662 2.1847 0.1398 gcse5eq 1 0.6637 0.6637 21.9046 3.434e-06 *** Residuals 708 21.4508 0.0303

  • Signif. codes:

0 ✬***✬ 0.001 ✬**✬ 0.01 ✬*✬ 0.05 ✬.✬ 0.1 ✬ ✬ 1

23/7/2009gr 37/64

5

slide-6
SLIDE 6

Sequential data analysis Growing trees: rpart and party

rpart and party

At least two R-packages for growing (binary) trees:

rpart (Therneau and Atkinson, 1997): recursive partitioning CART, Relative risk trees, party (Hothorn et al., 2006): conditional partitioning Based on a statistical conditional inference method (permutation tests)

We propose here a short introduction to these packages

rpart Essentially Cart + extension for relative risk trees party much more powerful and flexible. better visual rendering (Plots distributions inside the nodes)

23/7/2009gr 39/64 Sequential data analysis Growing trees: rpart and party

rpart: methods

rpart proposes several methods:

"poisson", rate tree, for binary responses. "class", classification tree, when response is a factor. "anova", regression tree, when response is quantitative. "exp", risk tree, for a survival response.

If method is not specified, rpart tries a guess from the response type.

23/7/2009gr 40/64 Sequential data analysis Growing trees: rpart and party rpart

Growing a rate tree rpart

R> library(rpart) R> cart.mvad.gcse <- rpart(gcse5eq ~ male + catholic, data = mvad, + method = "poisson", control = list(minsplit = 20, + minbucket = 10, cp = 1e-04)) R> cart.mvad.gcse n= 712 node), split, n, deviance, yval * denotes terminal node 1) root 712 115.74880 1.365169 2) male=yes 370 53.52554 1.281247 * 3) male=no 342 58.23767 1.455946 6) catholic=no 183 30.99274 1.431429 * 7) catholic=yes 159 27.08354 1.483731 *

23/7/2009gr 42/64 Sequential data analysis Growing trees: rpart and party rpart

Plotting the tree

R> par(xpd = NA) R> plot(cart.mvad.gcse) R> text(cart.mvad.gcse, use.n = T, cex = 0.9, fancy = F)

| male=b catholic=a 1.281 474/370 1.431 262/183 1.484 236/159

23/7/2009gr 43/64 Sequential data analysis Growing trees: rpart and party rpart

The printcp() function

R> printcp(cart.mvad.gcse) Rates regression tree: rpart(formula = gcse5eq ~ male + catholic, data = mvad, method = "poisson", control = list(minsplit = 20, minbucket = 10, cp = 1e-04)) Variables actually used in tree construction: [1] catholic male Root node error: 115.75/712 = 0.16257 n= 712 CP nsplit rel error xerror xstd 1 0.0344335 1.00000 1.00210 0.016719 2 0.0013943 1 0.96557 0.97028 0.021159 3 0.0001000 2 0.96417 0.97543 0.021656

23/7/2009gr 44/64 Sequential data analysis Growing trees: rpart and party rpart

A classification tree

We build a tree for the activity status in October 1996, i.e. 3 years after end of compulsory school.

R> cart.mvad.oct96 <- rpart(Oct.96 ~ male + catholic + gcse5eq, + data = mvad, method = "class", control = list(minsplit = 20, + minbucket = 10, cp = 0)) R> cart.mvad.oct96 n= 712 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 712 333 employment (0 0.11 0.53 0.072 0.079 0.21) 2) gcse5eq=no 452 158 employment (0 0.1 0.65 0.082 0.11 0.058) * 3) gcse5eq=yes 260 139 HE (0 0.12 0.33 0.054 0.031 0.47) *

23/7/2009gr 45/64

6

slide-7
SLIDE 7

Sequential data analysis Growing trees: rpart and party rpart

survival risk tree

We build a survival data object for time to marriage from the biofam data set provided with TraMineR States of interest are:

2: married without leaving home 3: married and left home 6: married with child 7: divorced

If divorce occurs before any marriage, we assume marriage and divorce the same year.

23/7/2009gr 46/64 Sequential data analysis Growing trees: rpart and party rpart

Creating the time to event variable

R> data(biofam) R> svar <- 10:25 R> durmax <- length(svar) R> biofam.seq <- seqdef(biofam, svar) R> fmar <- data.frame(s2 = seqfpos(biofam.seq, state = 2), + s3 = seqfpos(biofam.seq, state = 3), s6 = seqfpos(biofam.seq, + state = 6), s7 = seqfpos(biofam.seq, state = 7)) R> fmar <- data.frame(fmar, fpos = apply(fmar, 1, min, na.rm = TRUE)) R> fmar <- data.frame(fmar, mar = (fmar$fpos != Inf)) R> fmar$fpos[fmar$fpos == "Inf"] <- durmax R> head(fmar) s2 s3 s6 s7 fpos mar [1] NA 10 11 NA 10 TRUE [2] NA 12 13 NA 12 TRUE [3] NA 13 14 NA 13 TRUE [4] NA NA NA NA 16 FALSE [5] NA NA 14 NA 14 TRUE [6] NA NA NA NA 16 FALSE

23/7/2009gr 47/64 Sequential data analysis Growing trees: rpart and party rpart

Survival analysis

R> library(survival) R> surv.fmar <- Surv(time = fmar$fpos, event = fmar$mar) R> sf.fmar.fh <- survfit(surv.fmar ~ biofam$sex, type = "kaplan-meier") R> plot(sf.fmar.fh, main = "Kaplan-Meier Survival Curves, Time to Marriage", + xlab = "Time from 15 to marriage", col = c("red", + "blue")) R> legend("topright", legend = c("men", "women"), lwd = 2, + col = c("red", "blue"))

23/7/2009gr 48/64 Sequential data analysis Growing trees: rpart and party rpart

KM Survival curves

5 10 15 0.0 0.2 0.4 0.6 0.8 1.0

Kaplan−Meier Survival Curves, Time to Marriage

Time from 15 to marriage men women

23/7/2009gr 49/64 Sequential data analysis Growing trees: rpart and party rpart

Covariates

Preparing covariates

R> coho1 <- factor(biofam$birthyr < 1940) R> coho2 <- factor(biofam$birthyr >= 1940 & biofam$birthyr < + 1950) R> coho3 <- factor(biofam$birthyr >= 1950) R> lang <- biofam$plingu02 R> sex <- biofam$sex R> covariates <- data.frame(sex, lang, coho1, coho2, coho3) R> head(covariates) sex lang coho1 coho2 coho3 1 man german FALSE TRUE FALSE 2 man german TRUE FALSE FALSE 3 woman french FALSE TRUE FALSE 4 man german TRUE FALSE FALSE 5 man german FALSE TRUE FALSE 6 man italian TRUE FALSE FALSE

23/7/2009gr 50/64 Sequential data analysis Growing trees: rpart and party rpart

Survival risk tree (rpart)

Grow and plot the tree

R> stree.risk <- rpart(surv.fmar ~ sex + coho1 + coho2 + + coho3 + lang, data = covariates, method = "exp", + control = list(minsplit = 20, minbucket = 10, cp = 0.001)) R> stree.risk n= 2000 node), split, n, deviance, yval * denotes terminal node 1) root 2000 2681.7330 1.0000000 2) sex=man 908 978.1932 0.8333197 4) coho3=TRUE 286 315.0478 0.7084977 * 5) coho3=FALSE 622 655.1129 0.8975413 * 3) sex=woman 1092 1656.4480 1.1828020 6) coho2=FALSE 750 1128.3620 1.1009180 12) lang=german,italian 580 861.0025 1.0439180 * 13) lang=french 170 261.3472 1.3270480 * 7) coho2=TRUE 342 517.5828 1.3944170 * R> par(xpd = NA) R> plot(stree.risk) R> text(stree.risk, use.n = T, cex = 0.9, fancy = F)

23/7/2009gr 51/64

7

slide-8
SLIDE 8

Sequential data analysis Growing trees: rpart and party rpart

Plot of survival risk tree (rpart)

| sex=a coho3=b coho2=a lang=bc 0.7085 193/286 0.8975 479/622 1.044 444/580 1.327 141/170 1.394 285/342

23/7/2009gr 52/64 Sequential data analysis Growing trees: rpart and party party

party principle

party selects each split in two steps (to avoid bias in favor of predictors with many different values):

First, selects the predictor with strongest association with target, Then, selects the best binary split for selected predictor.

23/7/2009gr 54/64 Sequential data analysis Growing trees: rpart and party party

Linear statistic and permutation test

Both steps are based on the conditional distribution of linear statistics in a permutation test framework.

Linear statistic is: Tj = vec

  • n
  • i=1

wigj(Xji)h

  • Yi, (Y1, . . . , Yn)

T ∈ Rpjq where gj(Xji) is a transformation of Xji, and h() an influence function. Tj is computed for each permutation of the Y values among cases, and results characterize its conditional independence distribution. the variable and split selection is then based on the p-value of the observed t under this conditional independence distribution.

23/7/2009gr 55/64 Sequential data analysis Growing trees: rpart and party party

A R script for generating a tree

You grow the tree with the ctree command

R> library(party) R> ctree.mvad.gcse <- ctree(gcse5eq ~ male + catholic, data = mvad, + controls = ctree_control(mincriterion = 0.3, minsplit = 0), + ) R> plot(ctree.mvad.gcse, drop_terminal = F, inner_panel = node_barplot)

23/7/2009gr 56/64 Sequential data analysis Growing trees: rpart and party party

Classification tree, party

Node 1 (n = 712) yes no 0.2 0.4 0.6 0.8 1 yes no Node 2 (n = 370) yes no 0.2 0.4 0.6 0.8 1 Node 3 (n = 342) yes no 0.2 0.4 0.6 0.8 1 no yes Node 4 (n = 183) yes no 0.2 0.4 0.6 0.8 1 Node 5 (n = 159) yes no 0.2 0.4 0.6 0.8 1

23/7/2009gr 57/64 Sequential data analysis Growing trees: rpart and party party

Classification tree, text output, party

R> ctree.mvad.gcse Conditional inference tree with 3 terminal nodes Response: gcse5eq Inputs: male, catholic Number of observations: 712 1) male == {yes}; criterion = 1, statistic = 23.462 2)* weights = 370 1) male == {no} 3) catholic == {no}; criterion = 0.448, statistic = 0.945 4)* weights = 183 3) catholic == {yes} 5)* weights = 159

23/7/2009gr 58/64

8

slide-9
SLIDE 9

Sequential data analysis Growing trees: rpart and party party

survival tree, party

Grow the tree with ctree and a survival response object. Just plot the result to get the tree with survival curves.

R> marrtree <- ctree(surv.fmar ~ sex + coho1 + coho2 + coho3 + + lang, data = covariates, controls = ctree_control(mincriterion = 0.5, + minsplit = 0), ) R> plot(marrtree)

23/7/2009gr 59/64 Sequential data analysis Growing trees: rpart and party party

Survival tree, text output, party

R> marrtree Conditional inference tree with 5 terminal nodes Response: surv.fmar Inputs: sex, coho1, coho2, coho3, lang Number of observations: 2000 1) sex == {woman}; criterion = 1, statistic = 54.008 2) coho2 == {TRUE}; criterion = 0.989, statistic = 9.395 3)* weights = 342 2) coho2 == {FALSE} 4) lang == {french}; criterion = 0.862, statistic = 7.067 5)* weights = 170 4) lang == {german, italian} 6)* weights = 580 1) sex == {man} 7) coho3 == {FALSE}; criterion = 0.996, statistic = 11.284 8)* weights = 622 7) coho3 == {TRUE} 9)* weights = 286

23/7/2009gr 60/64 Sequential data analysis Growing trees: rpart and party party

Plotted survival tree, party

sex p < 0.001 1 woman man coho2 p = 0.011 2 TRUE FALSE Node 3 (n = 342) 5 10 15 0.2 0.4 0.6 0.8 1 lang p = 0.138 4 french{german, italian} Node 5 (n = 170) 5 10 15 0.2 0.4 0.6 0.8 1 Node 6 (n = 580) 5 10 15 0.2 0.4 0.6 0.8 1 coho3 p = 0.004 7 FALSE TRUE Node 8 (n = 622) 5 10 15 0.2 0.4 0.6 0.8 1 Node 9 (n = 286) 5 10 15 0.2 0.4 0.6 0.8 1

23/7/2009gr 61/64 Sequential data analysis Custom functions and programming

Functions

R> discretize <- function(a) { + if (a < 0.4) { + return(1) + } + else { + if (a < 0.6) { + return(2) + } + else { + return(3) + } + } + } R> discretize(0.33) [1] 1 R> table(apply(entrop, 1, discretize)) 1 2 3 385 243 84

23/7/2009gr 63/64 Sequential data analysis Custom functions and programming

References I

Gabadinho, A., G. Ritschard, M. Studer, and N. S. M¨ uller (2008). Mining sequence data in R with TraMineR: A user’s guide. Technical report, Department of Econometrics and Laboratory of Demography, University of Geneva, Geneva. (TraMineR is on CRAN the Comprehensive R Archive Network). Hothorn, T., K. Hornik, and A. Zeileis (2006). party: A laboratory for recursive part(y)itioning. User’s manual. Maindonald, J. and J. Brown (2006). Data Analysis and Graphics Using R: An Example-based Approach. Cambridge Series in Statistical and Probabilistic

  • Mathematics. Cambridge: Cambridge University Press.

Paradis, E. (2006). R for beginners. Manual, Institut des Sciences de l’ Evolution, Universit´ e Montpellier II. R-Development-Core-Team (2008). An introduction to R (v 2.8.0). Manual, R-project. Spector, P. (2008). Data Manipulation with R. New York: Springer. Therneau, T. M. and E. J. Atkinson (1997). An introduction to recursive partitioning using the rpart routines. Technical Report Series 61, Mayo Clinic, Section of Statistics, Rochester, Minnesota.

23/7/2009gr 64/64

9