Categorical data: Modelling and Independence, R.W. Oldford



slide-1
SLIDE 1

Categorical data

Modelling and Independence R.W. Oldford

slide-2
SLIDE 2

Eikosograms - Dependence/independence

As with continuous data, interest lies in assessing whether the values of categorical variates might have been generated independently. Independence occurs between X and Y if, and only if, Pr(Y = y | X = x) = Pr(Y = y) or, equivalently, Pr(X = x | Y = y) = Pr(X = x) for all possible values x and y. This condition shows up in an eikosogram when the eikosogram is flat, e.g.

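[Figure: two eikosograms of the same variates. One has Y (values y_1, y_2, y_3) on the vertical axis and X (values x_1, x_2, x_3, x_4) on the horizontal axis; the other has X on the vertical axis and Y on the horizontal axis.]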
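The definition can be checked numerically on a small table. The sketch below uses made-up counts (not taken from the slides) in which every column is a multiple of the same vector, so the conditional distribution of Y is the same for every value of X.

```r
# Hypothetical 3 x 4 table of counts (rows: Y, columns: X).
# Every column is a multiple of (1, 2, 3), so Y | X = x is the same for all x.
counts <- matrix(c(1, 2, 3,
                   2, 4, 6,
                   3, 6, 9,
                   4, 8, 12),
                 nrow = 3,
                 dimnames = list(Y = c("y_1", "y_2", "y_3"),
                                 X = c("x_1", "x_2", "x_3", "x_4")))

# Pr(Y = y | X = x): each column divided by its column total
cond_y_given_x <- prop.table(counts, margin = 2)

# Pr(Y = y): row totals divided by the grand total
marg_y <- rowSums(counts) / sum(counts)

# Under independence every column of cond_y_given_x equals marg_y,
# which is what a "flat" eikosogram displays.
is_flat <- all(abs(cond_y_given_x - marg_y) < 1e-12)
is_flat
```

Changing any single count in the table will generally break the equality, and the corresponding eikosogram would no longer be flat.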

slide-3
SLIDE 3

Eikosograms - Dependence/independence

This is simplest to see when each variate is only binary. e.g.

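[Figure: two eikosograms of binary variates. One has Y (values y_1, y_2) on the vertical axis and X (values x_1, x_2) on the horizontal axis; the other has X on the vertical axis and Y on the horizontal axis.]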

These two variates appear to be independent. Why?

slide-4
SLIDE 4

Eikosograms - Dependence/independence

We could use our lineup test to get a visual test of independence. We set this up mathematically as follows. Let n_{ij} denote the number of observations for which X = x_j and Y = y_i.

                 X = x_1      X = x_2      ...   X = x_{J-1}      X = x_J      Row totals
Y = y_I          n_{I1}       n_{I2}       ...   n_{I(J-1)}       n_{IJ}       n_{I+}
Y = y_{I-1}      n_{(I-1)1}   n_{(I-1)2}   ...   n_{(I-1)(J-1)}   n_{(I-1)J}   n_{(I-1)+}
...              ...          ...          ...   ...              ...          ...
Y = y_2          n_{21}       n_{22}       ...   n_{2(J-1)}       n_{2J}       n_{2+}
Y = y_1          n_{11}       n_{12}       ...   n_{1(J-1)}       n_{1J}       n_{1+}
Column totals    n_{+1}       n_{+2}       ...   n_{+(J-1)}       n_{+J}       n_{++}

We could estimate the marginal probabilities for X as p_{+j} = Pr(X = x_j) = n_{+j}/n_{++} and the conditional probabilities for Y | X as p_{i|j} = Pr(Y = y_i | X = x_j) = n_{ij}/n_{+j}. Similarly we could estimate the marginal probabilities for Y and the conditional probabilities for X | Y.
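As a rough illustration of these estimates, the sketch below computes them for a small table of counts chosen only for illustration; the variable names are not from the slides.

```r
# Hypothetical 2 x 4 table of counts n_{ij} (rows: Y, columns: X)
n <- matrix(c(3, 4,
              3, 3,
              3, 2,
              3, 1),
            nrow = 2,
            dimnames = list(Y = c("A", "B"), X = c("w", "x", "y", "z")))

n_pp <- sum(n)        # n_{++}, the grand total
n_pj <- colSums(n)    # n_{+j}, the column totals
n_ip <- rowSums(n)    # n_{i+}, the row totals

# Marginal estimates for X and conditional estimates for Y | X
p_x <- n_pj / n_pp                        # Pr(X = x_j) = n_{+j} / n_{++}
p_y_given_x <- sweep(n, 2, n_pj, "/")     # Pr(Y = y_i | X = x_j) = n_{ij} / n_{+j}

# The same for Y and X | Y
p_y <- n_ip / n_pp                        # Pr(Y = y_i) = n_{i+} / n_{++}
p_x_given_y <- sweep(n, 1, n_ip, "/")     # Pr(X = x_j | Y = y_i) = n_{ij} / n_{i+}

round(p_y_given_x, 2)
```

Each column of p_y_given_x sums to one, as does each row of p_x_given_y; base R's prop.table() would give the same conditional estimates.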

slide-5
SLIDE 5

Eikosograms - Dependence/independence

Under independence p_{i|j} = Pr(Y = y_i | X = x_j) = Pr(Y = y_i) and p_{ij} = Pr(Y = y_i) × Pr(X = x_j). We now use these estimates (under the hypothesis of independence) to generate values of the categorical variates when the hypothesis holds. For example, suppose we are looking only at the eikosogram for Y | X, i.e. X on the horizontal, Y on the vertical. We could choose to condition on the known values of n_{+j} and then, for each j, sample a vector of n_{ij} values drawn from a multinomial distribution Mult(n_{+j}, p_j) where

p_j = (n_{1+}/n_{++}, n_{2+}/n_{++}, ..., n_{I+}/n_{++}) = p

Notes

This p_j does not depend on j; it is our marginal estimate of (Pr(Y = y_1), Pr(Y = y_2), ..., Pr(Y = y_I)). Doing this for every j = 1, ..., J will produce a new table, one that has the same column totals as the original, but whose column entries have been generated according to the hypothesis of independence.

We note also that we could have first generated column totals using a Mult(n_{++}, q) where

q = (n_{+1}/n_{++}, n_{+2}/n_{++}, ..., n_{+J}/n_{++})

and then used these generated column totals in place of the observed n_{+j} to generate the columns of the new table. Following this methodology, the column widths of the corresponding eikosogram would change with each table. The second approach is an unconditional approach, the first a conditional one (conditioned on holding the column totals fixed).
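The two sampling schemes can be sketched in a few lines of R. This is a minimal illustration with made-up counts and an arbitrary seed, not the deck's own function; only rmultinom() and base R are used.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical observed table (rows: Y, columns: X)
n <- matrix(c(3, 4, 3, 3, 3, 2, 3, 1), nrow = 2,
            dimnames = list(Y = c("A", "B"), X = c("w", "x", "y", "z")))

p <- rowSums(n) / sum(n)   # marginal estimate for Y (the common p_j)
q <- colSums(n) / sum(n)   # marginal estimate for X

# Conditional approach: keep the observed column totals n_{+j} and fill
# each column with a draw from Mult(n_{+j}, p)
cond_totals <- colSums(n)
cond_table <- sapply(cond_totals, function(m) rmultinom(1, size = m, prob = p))

# Unconditional approach: first draw new column totals from Mult(n_{++}, q),
# then fill each column as before
uncond_totals <- as.vector(rmultinom(1, size = sum(n), prob = q))
uncond_table <- sapply(uncond_totals, function(m) rmultinom(1, size = m, prob = p))

colSums(cond_table)    # always equal to the observed column totals
colSums(uncond_table)  # varies from draw to draw, but sums to n_{++}
```

In the conditional version only the bar heights of the eikosogram vary between generated tables; in the unconditional version the column widths vary as well.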

slide-8
SLIDE 8

Eikosograms - Dependence/independence

In R, the apply() function allows us to get the sums we want. For example, consider the table x

    x
    ##      lower
    ## upper w x y z
    ##     A 3 3 3 3
    ##     B 4 3 2 1

We now get the sums we want using apply(X, MARGIN, FUN, ...)

    # row sums (sum across other dimensions for each value of the first)
    apply(x, 1, sum)
    ##  A  B
    ## 12 10
    # col sums (sum across other dimensions for each value of dimension 2)
    apply(x, 2, sum)
    ## w x y z
    ## 7 6 5 4

This works for multiway tables (MARGIN can be a vector of dimensions).
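For instance, with a vector-valued MARGIN a four-way table can be collapsed to any two-way margin. The sketch below uses the Titanic table from base R's datasets package, whose dimensions are Class, Sex, Age and Survived.

```r
# Collapse the four-way Titanic table to Class (dimension 1) by Survived
# (dimension 4), summing over Sex and Age
class_by_survived <- apply(Titanic, c(1, 4), sum)
class_by_survived

# margin.table() computes the same two-way table of counts
same <- all(class_by_survived == margin.table(Titanic, c(1, 4)))
same
```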

slide-9
SLIDE 9

Eikosograms - Dependence/independence

An R function which generates a new table under the hypothesis of independence, in the first (conditional) way by default and in the second (unconditional) way when conditionalTest = FALSE, would be

    generateTable <- function(TwoWayTable, y, x, conditionalTest = TRUE) {
      varnames <- names(dimnames(TwoWayTable))
      nvars <- length(varnames)
      respID <- (1:nvars)[varnames == y]
      condID <- (1:nvars)[varnames == x]
      new_table <- TwoWayTable
      n <- sum(TwoWayTable)
      # Get the marginal probabilities for the response variate
      p <- apply(TwoWayTable, respID, sum) / n
      # Get the marginal probabilities for the conditioning variate
      q <- apply(TwoWayTable, condID, sum) / n
      n_resp <- length(p)
      n_cond <- length(q)
      if (conditionalTest) {
        # Preserve the conditional totals
        m <- apply(TwoWayTable, condID, sum)
      } else {
        # Generate new totals for the conditioning variate
        m <- rmultinom(1, n, q)
      }
      for (c in 1:n_cond) {
        # Generate new counts from a multinomial
        newRespVals <- rmultinom(1, m[c], p)
        if (respID < condID) {
          # Response is the row variate: fill column c
          for (i in 1:length(newRespVals)) {
            new_table[i, c] <- newRespVals[i]
          }
        } else {
          # Response is the column variate: fill row c
          for (i in 1:length(newRespVals)) {
            new_table[c, i] <- newRespVals[i]
          }
        }
      }
      new_table
    }

slide-10
SLIDE 10

Eikosograms - Dependence/independence

The graphic of choice to assess independence is an eikosogram. Because the eikos function is implemented using grid, we need to update the lineup function:

    lineup <- function(data,
                       showSuspect = NULL,
                       generateSuspect = NULL,
                       trueLoc = NULL,
                       layout = c(5, 4),
                       # We add the "pkg" argument
                       pkg = c("graphics", "grid", "ggplot2")) {
      #
      # Get the number of suspects in total
      nSuspects <- layout[1] * layout[2]
      if (is.null(trueLoc)) {trueLoc <- sample(1:nSuspects, 1)}
      if (is.null(showSuspect)) {stop("need a plot function for the suspect")}
      if (is.null(generateSuspect)) {stop("need a function to generate suspect")}
      # Need to decide which subject to present
      presentSuspect <- function(suspectNo) {
        if (suspectNo != trueLoc) {data <- generateSuspect(data)}
        showSuspect(data, suspectNo)
      }
      # Up to here, there is no change beyond the additional "pkg" argument.
      # CONTINUED ON NEXT SLIDE

slide-11
SLIDE 11

Eikosograms - Dependence/independence

      # CONTINUED FROM PREVIOUS SLIDE
      # Plotting depends on the plotting package
      # (marrangeGrob() is from the gridExtra package, which must be loaded)
      pkg <- match.arg(pkg)
      switch(pkg,
             "graphics" = {
               savePar <- par(mfrow = layout,
                              mar = c(2.5, 0.1, 3, 0.1),
                              oma = rep(0, 4))
               sapply(1:nSuspects, FUN = presentSuspect)
               # The plotLayout here is of no value for graphics
               plotLayout <- layout
               par(savePar)
             },
             "grid" = {
               grobs <- lapply(1:nSuspects, FUN = presentSuspect)
               plotLayout <- marrangeGrob(grobs = grobs,
                                          nrow = layout[1], ncol = layout[2])
             },
             "ggplot2" = {
               ## ggplot2
               plots <- lapply(1:nSuspects, FUN = presentSuspect)
               plotLayout <- marrangeGrob(grobs = plots,
                                          nrow = layout[1], ncol = layout[2])
             },
             stop("Wrong 'pkg'")
      )
      # CONTINUED ON NEXT SLIDE

slide-12
SLIDE 12

Eikosograms - Dependence/independence

      # CONTINUED FROM PREVIOUS SLIDE
      # Obfuscate location to keep us honest
      possibleBaseVals <- 3:min(2 * nSuspects, 50)
      # remove easy base values
      possibleBaseVals <- possibleBaseVals[possibleBaseVals != 10 &
                                             possibleBaseVals != 5]
      base <- sample(possibleBaseVals, 1)
      offset <- sample(5:min(5 * nSuspects, 125), 1)
      # return obfuscated location and plot (if not "graphics")
      list(trueLoc = paste0("log(", base^(trueLoc + offset),
                            ", base=", base, ") - ", offset),
           plotLayout = plotLayout
      )
    }

We’ll work with the SAheart data from the package ElemStatLearn and assess whether coronary heart disease is independent of family history.

    library(ElemStatLearn)
    # First, get the chd events,
    # ordered so that chd = 1 appears as the bottom bar of an eikosogram
    chd <- c("None", "CHD")[1 + SAheart$chd]
    # Create the two way table
    heart <- table(SAheart$famhist, chd, dnn = c("famhist", "coronary"))

slide-13
SLIDE 13

Eikosograms - Dependence/independence

To test independence for two way tables, the data structure passed to lineup needs to be a little richer than the one we had before when, say, comparing distributions.

    # Here is the data structure that we will use for two way tables
    data <- list(table = heart, y = "coronary", x = "famhist")

We need to adapt the function for generating a new table to this data structure. We’ll introduce a function for each type of test (conditional or unconditional):

    generateTableDataCond <- function(data) {
      newtable <- generateTable(data$table, data$y, data$x)
      list(table = newtable, y = data$y, x = data$x)
    }
    generateTableDataUncond <- function(data) {
      newtable <- generateTable(data$table, data$y, data$x, FALSE)
      list(table = newtable, y = data$y, x = data$x)
    }

slide-14
SLIDE 14

Eikosograms - Dependence/independence

We also need a function of this data structure that will show an eikosogram

    showTable <- function(data, Suspect) {
      result <- eikos(y = data$y, x = data$x, data = data$table,
                      legend = FALSE,
                      xlabs = FALSE, ylabs = FALSE,
                      xaxs = FALSE, yaxs = FALSE,
                      main = paste("Suspect", Suspect),
                      draw = FALSE)
      invisible(result)
    }

Now the lineup test can be called on our data. Here’s the call for the conditional test (for the unconditional test, pass generateTableDataUncond as generateSuspect instead):

    library(eikosograms)
    results <- lineup(data,
                      generateSuspect = generateTableDataCond,
                      showSuspect = showTable,
                      layout = c(4, 5),
                      pkg = "grid")
    results$plotLayout
    results$trueLoc

NOTE: The results now have to be saved, and plotLayout and trueLoc evaluated, for the lineup and the obfuscated true location to be seen.

slide-16
SLIDE 16

Eikosograms - Dependence/independence

The conditionally generated test:

[Figure: lineup of 20 eikosograms, titled Suspect 1 through Suspect 20.]

True Location: log(2.41186503225706e+63, base=7) - 66 = 9
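The obfuscated location is just an R expression, built by lineup() as log(base^(trueLoc + offset), base) - offset, so it can be decoded by evaluating it. A small sketch using the string shown on this slide:

```r
# The obfuscated location, copied from the slide, evaluated as ordinary R code
obfuscated <- "log(2.41186503225706e+63, base=7) - 66"
location <- eval(parse(text = obfuscated))

# Floating point arithmetic may leave a tiny error, so round to an integer
round(location)   # 9
```

Because the large number is printed to only 15 significant digits, the unrounded value may differ very slightly from an exact integer.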

slide-18
SLIDE 18

Eikosograms - Dependence/independence

The UNconditionally generated test:

[Figure: lineup of 20 eikosograms, titled Suspect 1 through Suspect 20.]

True Location: log(9.28446791485507e+99, base=15) - 73 = 12

slide-19
SLIDE 19

Eikosograms - Dependence/independence

Dependence between a pair of ordered categorical variates can be seen in an eikosogram. This can also be seen with binary variates. Suppose we have two binary categorical variates X and Y that each take values of either “yes” or “no”. If the value is “yes”, then the event associated with that variate occurred; if “no”, then it did not. Many important relationships between two such events (binary variates) have obvious visual signatures when displayed as eikosograms. These include:

◮ independence (already seen)
◮ coincident events
◮ mutually exclusive events
◮ positive association (when one event occurs the other does with high probability)
◮ negative association (when one event occurs the other does not with high probability)
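These relationships can be made concrete with small joint distributions. The sketch below is illustrative only: make_joint() is a hypothetical helper (not from any package) and the probabilities are chosen for illustration. The bar heights of an eikosogram of Y given X are the conditional probabilities Pr(Y = yes | X = x).

```r
# Hypothetical helper: joint probabilities for binary X and Y, built from
# Pr(X = yes), Pr(Y = yes | X = yes) and Pr(Y = yes | X = no)
make_joint <- function(p_x, p_y_given_x_yes, p_y_given_x_no) {
  matrix(c(p_x * p_y_given_x_yes,      p_x * (1 - p_y_given_x_yes),
           (1 - p_x) * p_y_given_x_no, (1 - p_x) * (1 - p_y_given_x_no)),
         nrow = 2,
         dimnames = list(Y = c("yes", "no"), X = c("yes", "no")))
}

examples <- list(
  independent        = make_joint(1/3, 1/3, 1/3),  # same height in both bars
  coincident         = make_joint(1/3, 1,   0),    # Y occurs exactly when X does
  mutually_exclusive = make_joint(1/3, 0,   1/2),  # X and Y never occur together
  positive           = make_joint(1/3, 0.8, 0.2),  # Y likely when X occurs
  negative           = make_joint(1/3, 0.2, 0.8)   # Y unlikely when X occurs
)

# Pr(Y = yes | X = x) for each example; rows are indexed by the value of X
heights <- sapply(examples, function(j) prop.table(j, margin = 2)["yes", ])
round(heights, 2)
```

Reading across the columns of heights: independence gives equal heights, coincidence gives heights 1 and 0, and mutual exclusivity gives height 0 when X = yes.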

slide-20
SLIDE 20

Eikosograms - Coincident versus mutually exclusive events

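[Figure: two eikosograms of Y (vertical axis; values Yes, No) against X (horizontal axis; values Yes, No), with panel titles "Coincident" and "Mutually exclusive". Each panel shows the axis value 0.33.]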

slide-21
SLIDE 21

Eikosograms - Association

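[Figure: two eikosograms of Y (vertical axis; values Yes, No) against X (horizontal axis; values Yes, No). Panel "Negative Association": vertical-axis values 0.2 and 0.8, horizontal-axis value 0.33. Panel "Positive Association": vertical-axis values 0.8 and 0.2, horizontal-axis value 0.33.]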

slide-22
SLIDE 22

Eikosograms - Dependence/independence

You can see the visual transition as dependence changes

[Figure: five eikosograms of Y (vertical axis; values Yes, No) against X (horizontal axis; values Yes, No), each with horizontal-axis value 0.33. Panels in order: "Mutually exclusive"; "Negative Association" (vertical-axis values 0.2, 0.8); "Independent" (vertical-axis value 0.33); "Positive Association" (vertical-axis values 0.8, 0.2); "Coincident".]

mutually exclusive → negative association → independence → positive association → coincidence

slide-23
SLIDE 23

The Titanic

Recall: The data set Titanic provides “information on the fate of passengers on the fatal maiden voyage of the ocean liner Titanic, summarized according to economic status (class), sex, age and survival.”

The Titanic data records the number of passengers in various categories for four different categorical variates:

No.  Variate   Values
1    Class     1st, 2nd, 3rd, Crew
2    Sex       Male, Female
3    Age       Child, Adult
4    Survived  No, Yes

slide-24
SLIDE 24

Eikosograms - Dependence/independence

Suppose we are interested in whether:

◮ Survival depends on sex of passenger

eikos(y ="Survived", x = "Sex", data = Titanic, legend = TRUE, main="Titanic")

◮ Survival depends on age of passenger

eikos(y ="Survived", x = "Age", data = Titanic, legend = TRUE, main="Titanic")

◮ Survival depends on class of passenger

eikos(y ="Survived", x = "Class", data = Titanic, legend = TRUE, main="Titanic")
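The proportions that such an eikosogram displays can also be computed directly in base R. A sketch for survival given sex, using only the built-in Titanic table:

```r
# Two-way table of counts: Survived (rows) by Sex (columns)
surv_by_sex <- margin.table(Titanic, c(4, 2))

# Pr(Survived | Sex): divide each column by its total (the bar heights)
p_surv_given_sex <- prop.table(surv_by_sex, margin = 2)
round(p_surv_given_sex, 2)

# Pr(Sex): the marginal proportions (the bar widths)
p_sex <- margin.table(Titanic, 2) / sum(Titanic)
round(p_sex, 2)
```

The "No" row gives about 0.79 for males and 0.27 for females, and the proportion of males is about 0.79. Replacing dimension 2 (Sex) by 3 (Age) or 1 (Class) gives the corresponding proportions for the other two questions.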

slide-25
SLIDE 25

Eikosograms - Dependence/independence

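[Figure: eikosogram titled "Titanic" of Survived (vertical axis; No, Yes) against Sex (horizontal axis; Male, Female). Vertical-axis values 0.79 and 0.27; horizontal-axis value 0.79. Legend: No, Yes.]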

slide-26
SLIDE 26

Eikosograms - Dependence/independence

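[Figure: eikosogram titled "Titanic" of Survived (vertical axis; No, Yes) against Age (horizontal axis; Child, Adult). Vertical-axis values 0.48 and 0.69; horizontal-axis value 0.05. Legend: No, Yes.]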

slide-27
SLIDE 27

Eikosograms - Dependence/independence

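[Figure: eikosogram titled "Titanic" of Survived (vertical axis; No, Yes) against Class (horizontal axis; 1st, 2nd, 3rd, Crew). Vertical-axis values 0.38, 0.59, 0.75 and 0.76; horizontal-axis values 0.15, 0.28 and 0.6. Legend: No, Yes.]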

slide-28
SLIDE 28

Eikosograms - conditional independence:

Conditional independence corresponds to flat regions. The joint distribution of X, Y, and Z: Y ⊥⊥ X | Z
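To see what "flat regions" means numerically, the sketch below builds a hypothetical three-way joint distribution (all probabilities invented for illustration) in which Y ⊥⊥ X | Z holds by construction, and then checks the conditional distribution of Y.

```r
# Hypothetical distributions for three binary variates
p_z <- c(z1 = 0.4, z2 = 0.6)                       # Pr(Z = z)
p_x_given_z <- rbind(z1 = c(x1 = 0.3, x2 = 0.7),   # Pr(X = x | Z = z)
                     z2 = c(x1 = 0.8, x2 = 0.2))
p_y_given_z <- rbind(z1 = c(y1 = 0.9, y2 = 0.1),   # Pr(Y = y | Z = z)
                     z2 = c(y1 = 0.5, y2 = 0.5))

# Joint distribution with Y and X independent given Z:
# Pr(x, y, z) = Pr(z) * Pr(x | z) * Pr(y | z)
joint <- array(0, dim = c(2, 2, 2),
               dimnames = list(Y = c("y1", "y2"),
                               X = c("x1", "x2"),
                               Z = c("z1", "z2")))
for (z in 1:2) for (x in 1:2) for (y in 1:2) {
  joint[y, x, z] <- p_z[z] * p_x_given_z[z, x] * p_y_given_z[z, y]
}

# Pr(Y | X, Z): normalise over Y within each (X, Z) combination
p_y_given_xz <- prop.table(joint, margin = c(2, 3))

# Within each value of Z the height is the same for every X (a flat region),
# although the height differs between the two values of Z
p_y_given_xz["y1", , ]

# Marginally (ignoring Z), Y and X are not independent in this example
p_y_given_x <- prop.table(apply(joint, c(1, 2), sum), margin = 2)
p_y_given_x["y1", ]
```

In an eikosogram of Y given X and Z, the bars for x1 and x2 would have equal heights inside each Z group.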

slide-29
SLIDE 29

Eikosograms - conditional independence:

Example of a three-way joint distribution:

Y ⊥⊥ X | Z;  Y ⊥̸⊥ Z | X        Y ⊥⊥ X;  Y ⊥̸⊥ Z
Z ⊥⊥ X | Y;  Z ⊥̸⊥ Y | X        Z ⊥⊥ X;  Z ⊥̸⊥ Y
X ⊥⊥ Z | Y;  X ⊥⊥ Y | Z        X ⊥⊥ Y;  X ⊥⊥ Z

slide-30
SLIDE 30

Eikosograms - conditional independence:

Example of a three-way joint distribution:

Y ⊥̸⊥ X | Z;  Y ⊥̸⊥ Z | X        Y ⊥⊥ X;  Y ⊥⊥ Z
Z ⊥̸⊥ X | Y;  Z ⊥̸⊥ Y | X        Z ⊥⊥ X;  Z ⊥⊥ Y
X ⊥̸⊥ Z | Y;  X ⊥̸⊥ Y | Z        X ⊥⊥ Y;  X ⊥⊥ Z

slide-31
SLIDE 31

Eikosograms - conditional independence:

Example of a three-way joint distribution:

Y ⊥⊥ X | Z;  Y ⊥̸⊥ Z | X        Y ⊥̸⊥ X;  Y ⊥̸⊥ Z
Z ⊥̸⊥ X | Y;  Z ⊥̸⊥ Y | X        Z ⊥̸⊥ X;  Z ⊥̸⊥ Y
X ⊥̸⊥ Z | Y;  X ⊥⊥ Y | Z        X ⊥̸⊥ Y;  X ⊥̸⊥ Z

slide-32
SLIDE 32

Eikosograms - display equivalence

Eikosograms date back at least to Halley in 1693. Note that the name eikosogram is used here to denote a particular display based on a particular model.

◮ Only one variate appears on the vertical axis (the response); all other variates appear on the horizontal axis (the conditioning variates).
◮ The model is a response model: modelling the conditional distribution of Y | X together with the marginal distribution(s) of the conditioning variate(s) X.

Two other named displays are identical to the eikosogram only when there is a single conditioning variate (and single response variate).

◮ the mosaic plot, which is used for categorical data, but differs from an eikosogram for more than one X variate
◮ the spine plot, which takes a categorical Y and either a categorical or continuous variate X, and so shares some properties with a histogram as well

slide-33
SLIDE 33

Eikosograms - display equivalence

First create a two-way table:

    TitanicSurvClass <- margin.table(Titanic, margin = c(1, 4))

The mosaic plot:

    mosaicplot(TitanicSurvClass, main = "Mosaic plot (two way table)")

The spine plot:

    spineplot(TitanicSurvClass, main = "Spine plot (two way table)")

slide-34
SLIDE 34

Eikosograms - display equivalence (to a mosaic plot)

[Figure: mosaic plot titled "Mosaic plot (two way table)" with axes Class (1st, 2nd, 3rd, Crew) and Survived (No, Yes).]

slide-35
SLIDE 35

Eikosograms - display equivalence (to a spine plot)

[Figure: spine plot titled "Spine plot (two way table)" with axes Class (1st, 2nd, 3rd, Crew) and Survived (No, Yes), and a proportion scale from 0.0 to 1.0.]

slide-36
SLIDE 36

Eikosograms - the equivalent eikosogram

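[Figure: eikosogram titled "Eikosogram (two way table)" of Survived (vertical axis; No, Yes) against Class (horizontal axis; 1st, 2nd, 3rd, Crew). Vertical-axis values 0.38, 0.59, 0.75 and 0.76; horizontal-axis values 0.15, 0.28 and 0.6.]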

slide-37
SLIDE 37

More than two variates - display differences

Eikosograms and mosaic plots differ when there are more than two variates:

    eikos(y = "Survived", x = c("Class", "Sex"), data = Titanic,
          xaxs = FALSE, yaxs = FALSE,
          main = "Eikosogram (three way table)")
    # Mosaic plot, need to construct the marginal table first
    # Note that the table is ordered so that Survived is first,
    # then Class and Sex (to match the eikosogram order above)
    TitanicSurvClassSex <- margin.table(Titanic, c(4, 1, 2))
    # The mosaic plot
    mosaicplot(TitanicSurvClassSex, main = "Mosaic plot (three way table)")

slide-38
SLIDE 38

More than two variates - display differences (eikosogram)

[Figure: eikosogram titled "Eikosogram (three way table)". Vertical axis: Survived (No, Yes). Horizontal axis: Class (1st, 2nd, 3rd, Crew) and Sex (Male, Female).]

slide-39
SLIDE 39

More than two variates - display differences (mosaic plot)

[Figure: mosaic plot titled "Mosaic plot (three way table)" with axis labels Sex and Survived. Categories shown: Male, Female; No, Yes; 1st, 2nd, 3rd, Crew.]

slide-40
SLIDE 40

More than two variates - display differences (eikosogram)

[Figure: eikosogram titled "Eikosogram (three way table)". Vertical axis: Sex (Male, Female). Horizontal axis: Class (1st, 2nd, 3rd, Crew) and Survived (No, Yes).]

slide-41
SLIDE 41

More than two variates - display differences (mosaic plot)

[Figure: mosaic plot titled "Mosaic plot (three way table)" with axis labels Survived and Sex. Categories shown: No, Yes; Male, Female; 1st, 2nd, 3rd, Crew.]

slide-42
SLIDE 42

More than two variates - display differences (eikosogram)

[Figure: eikosogram titled "Eikosogram (three way table)". Vertical axis: Class (1st, 2nd, 3rd, Crew). Horizontal axis: Sex (Male, Female) and Survived (No, Yes).]

slide-43
SLIDE 43

More than two variates - display differences (mosaic plot)

[Figure: mosaic plot titled "Mosaic plot (three way table)" with axis labels Survived and Class. Categories shown: No, Yes; 1st, 2nd, 3rd, Crew; Male, Female.]

slide-44
SLIDE 44

Different models produce different displays

Eikosogram: shows proportions Y | (X&Z), X&Z, and Z as lengths. Makes it easier to see conditional independence, e.g. Y ⊥⊥ X | Z.

Mosaic plot: shows proportions X, Y | X, Z | (X&Y) as lengths.
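One way to read the two lists is as different factorizations of the same joint probability, with the area of each rectangle equal to the joint probability of its cell. The mapping of factors to heights and widths below is an interpretation of the lists above, not a statement taken from the slides.

```latex
% Eikosogram with response Y and conditioning variates X and Z:
% bar height is Pr(Y | X, Z), bar width is Pr(X, Z)
\Pr(X = x, Y = y, Z = z) = \Pr(Y = y \mid X = x, Z = z)\,\Pr(X = x, Z = z)

% Mosaic plot splitting first on X, then on Y, then on Z:
\Pr(X = x, Y = y, Z = z) = \Pr(X = x)\,\Pr(Y = y \mid X = x)\,\Pr(Z = z \mid X = x, Y = y)
```

In the first form, Y ⊥⊥ X | Z means the height does not change with x inside a Z group, which is why it shows up as a flat region.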

slide-45
SLIDE 45

Different models produce different displays

Differences become greater the greater the number of conditioning variates in the eikosogram:

    eikos(y = "Survived", x = c("Class", "Sex", "Age"), data = Titanic,
          xaxs = FALSE, yaxs = FALSE,
          main = "Eikosogram (four way table)")
    mosaicplot(aperm(Titanic, perm = c(3, 4, 1, 2)),
               main = "Mosaic plot (four way table)")

slide-46
SLIDE 46

Different models produce different displays (eikosogram)

[Figure: eikosogram titled "Eikosogram (four way table)". Vertical axis: Survived (No, Yes). Horizontal axis: Class (1st, 2nd, 3rd, Crew), Sex (Male, Female) and Age (Child, Adult).]

slide-47
SLIDE 47

Different models produce different displays (mosaic plot)

[Figure: mosaic plot titled "Mosaic plot (four way table)" with axis labels Age and Survived. Categories shown: Child, Adult; No, Yes; 1st, 2nd, 3rd, Crew; Male, Female.]