Categorical data: Modelling and Independence. R.W. Oldford
Eikosograms - Dependence/independence
As with continuous data, interest lies in assessing whether the values of categorical variates might have been generated independently. Independence occurs between X and Y if, and only if, $\Pr(Y = y \mid X = x) = \Pr(Y = y)$ or, equivalently, $\Pr(X = x \mid Y = y) = \Pr(X = x)$ for all possible values x and y. This condition shows up in an eikosogram when the eikosogram is flat, e.g.
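[Figure: two flat eikosograms of the same variates, one showing Y (values y_1, y_2, y_3) against X (values x_1, x_2, x_3, x_4) and one showing X against Y.]
This is simplest to see when each variate is binary, e.g.
[Figure: two eikosograms of binary variates, one showing Y (values y_1, y_2) against X (values x_1, x_2) and one showing X against Y.]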
These two variates appear to be independent. Why?
We could use our lineup test to get a visual test of independence. We set this up mathematically as follows. Let $n_{ij}$ denote the number of observations for which $X = x_j$ and $Y = y_i$.
                 X = x_1      X = x_2      ...   X = x_{J-1}       X = x_J      Row totals
Y = y_I          n_{I1}       n_{I2}       ...   n_{I(J-1)}        n_{IJ}       n_{I+}
Y = y_{I-1}      n_{(I-1)1}   n_{(I-1)2}   ...   n_{(I-1)(J-1)}    n_{(I-1)J}   n_{(I-1)+}
...              ...          ...          ...   ...               ...          ...
Y = y_2          n_{21}       n_{22}       ...   n_{2(J-1)}        n_{2J}       n_{2+}
Y = y_1          n_{11}       n_{12}       ...   n_{1(J-1)}        n_{1J}       n_{1+}
Column totals    n_{+1}       n_{+2}       ...   n_{+(J-1)}        n_{+J}       n_{++}
We could estimate the marginal probabilities for X as $p_{+j} = \Pr(X = x_j) = n_{+j}/n_{++}$ and the conditional probabilities for Y | X as $p_{i|j} = \Pr(Y = y_i \mid X = x_j) = n_{ij}/n_{+j}$. Similarly we could estimate the marginal probabilities for Y and the conditional probabilities for X | Y.
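The estimates above can be computed directly from a table of counts. The following is a minimal sketch in base R; the matrix `counts` and its labels are made-up values used only for illustration, with rows indexing Y and columns indexing X.

```r
# Hypothetical counts n_ij: rows are values of Y, columns are values of X
counts <- matrix(c(3, 4, 3, 3, 3, 2, 3, 1), nrow = 2,
                 dimnames = list(Y = c("y1", "y2"),
                                 X = c("x1", "x2", "x3", "x4")))

n_total <- sum(counts)                       # n_++
p_x <- colSums(counts) / n_total             # n_+j / n_++ : marginal estimates for X
p_y <- rowSums(counts) / n_total             # n_i+ / n_++ : marginal estimates for Y
p_y_given_x <- prop.table(counts, margin = 2)  # n_ij / n_+j : each column sums to 1
p_x_given_y <- prop.table(counts, margin = 1)  # n_ij / n_i+ : each row sums to 1

print(round(p_x, 3))
print(round(p_y_given_x, 3))
```

Here `prop.table()` with `margin = 2` divides every column by its total, which is exactly the conditional estimate for Y | X.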
Under independence, $p_{i|j} = \Pr(Y = y_i \mid X = x_j) = \Pr(Y = y_i)$ and $p_{ij} = \Pr(Y = y_i) \times \Pr(X = x_j)$.
We now use these estimates (under the hypothesis of independence) to generate values of the categorical variates when the hypothesis holds.
For example, suppose we are looking only at the eikosogram for Y | X, i.e. X on the horizontal, Y on the vertical. We could choose to condition on the known values of $n_{+j}$ and then, for each j, sample a vector of $n_{ij}$ values drawn from a multinomial distribution $\mathrm{Mult}(n_{+j}, \mathbf{p}_j)$ where
$$\mathbf{p}_j = \left( \frac{n_{1+}}{n_{++}}, \frac{n_{2+}}{n_{++}}, \ldots, \frac{n_{I+}}{n_{++}} \right) = \mathbf{p}.$$
Notes:
This $\mathbf{p}_j$ does not depend on j; it is our marginal estimate of $(\Pr(Y = y_1), \Pr(Y = y_2), \ldots, \Pr(Y = y_I))$. Doing this for every j = 1, ..., J will produce a new table, one that has the same column totals as the original, but whose column entries have been generated according to the hypothesis of independence.
We note also that we could have first generated column totals using a $\mathrm{Mult}(n_{++}, \mathbf{q})$ where
$$\mathbf{q} = \left( \frac{n_{+1}}{n_{++}}, \frac{n_{+2}}{n_{++}}, \ldots, \frac{n_{+J}}{n_{++}} \right)$$
and use these generated totals (written $\tilde{n}_{+j}$ here) in place of $n_{+j}$ to generate the columns of the new table. Following this methodology, the column widths of the corresponding eikosogram would change with each table. The second approach is an unconditional approach, the first a conditional one (here being conditioned on holding the column totals fixed).
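As a stripped-down illustration of the two sampling schemes, the sketch below generates one table each way with `rmultinom()`. The matrix `counts` is a made-up example (rows index Y, columns index X), and the names `cond_table` and `uncond_table` are introduced here only for illustration.

```r
set.seed(314)  # arbitrary seed so the sketch is repeatable

# Hypothetical counts: rows are values of Y, columns are values of X
counts <- matrix(c(3, 4, 3, 3, 3, 2, 3, 1), nrow = 2,
                 dimnames = list(Y = c("y1", "y2"),
                                 X = c("x1", "x2", "x3", "x4")))
n_total <- sum(counts)
p <- rowSums(counts) / n_total   # marginal estimate for Y (the vector p)
q <- colSums(counts) / n_total   # marginal estimate for X (the vector q)

# Conditional approach: keep the observed column totals n_+j fixed
cond_table <- sapply(colSums(counts),
                     function(m) rmultinom(1, size = m, prob = p))

# Unconditional approach: first generate new column totals, then the columns
new_totals <- rmultinom(1, size = n_total, prob = q)[, 1]
uncond_table <- sapply(new_totals,
                       function(m) rmultinom(1, size = m, prob = p))

print(cond_table)
print(uncond_table)
```

In the conditional table the column totals match the original exactly; in the unconditional table only the grand total is preserved.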
In R, the apply() function allows us to get the sums we want. For example, consider the table x

x
##      lower
## upper w x y z
##     A 3 3 3 3
##     B 4 3 2 1

We now get the sums we want using apply(X, MARGIN, FUN, ...)

# row sums (sum across other dimensions for each value of the first)
apply(x, 1, sum)
##  A  B
## 12 10

# col sums (sum across other dimensions for each value of dimension 2)
apply(x, 2, sum)
## w x y z
## 7 6 5 4

This works for multiway tables (MARGIN can be a vector of dimensions).
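To illustrate a vector-valued MARGIN, here is a short sketch using the four-way `Titanic` table that ships with R (dimensions Class, Sex, Age, Survived). The variable names are chosen here for illustration.

```r
# Titanic is a 4-way table: Class (1), Sex (2), Age (3), Survived (4)
names(dimnames(Titanic))

# Keep dimensions 1 and 4, summing over Sex and Age
class_by_survival <- apply(Titanic, c(1, 4), sum)
print(class_by_survival)

# margin.table() gives the same counts for a table object
same_counts <- all(class_by_survival == margin.table(Titanic, c(1, 4)))
print(same_counts)
```

The result is a two-way table of Class by Survived whose entries still add up to the total number of people recorded.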
An R function which would generate a new table following the hypothesis of independence, according to the first (conditional) approach by default or the second (unconditional) approach when conditionalTest = FALSE, would be

generateTable <- function(TwoWayTable, y, x, conditionalTest = TRUE) {
  varnames <- names(dimnames(TwoWayTable))
  # Positions of the response and conditioning variates in the table
  respID <- which(varnames == y)
  condID <- which(varnames == x)
  new_table <- TwoWayTable
  n <- sum(TwoWayTable)
  # Get the marginal probabilities for the response variate
  p <- apply(TwoWayTable, respID, sum) / n
  # Get the marginal probabilities for the conditioning variate
  q <- apply(TwoWayTable, condID, sum) / n
  n_cond <- length(q)
  if (conditionalTest) {
    # Preserve the conditional totals
    m <- apply(TwoWayTable, condID, sum)
  } else {
    # Generate new conditional totals as well
    m <- rmultinom(1, n, q)
  }
  for (j in seq_len(n_cond)) {
    # Generate new counts from a multinomial
    newRespVals <- rmultinom(1, m[j], p)
    if (respID < condID) {
      # Response indexes the rows, conditioning variate the columns
      new_table[, j] <- newRespVals
    } else {
      # Conditioning variate indexes the rows, response the columns
      new_table[j, ] <- newRespVals
    }
  }
  new_table
}
The graphic of choice to assess independence is an eikosogram. Because the eikos function is implemented using grid, we need to update the lineup function:

library(gridExtra)  # provides marrangeGrob(), used below

lineup <- function(data, showSuspect = NULL, generateSuspect = NULL,
                   trueLoc = NULL, layout = c(5, 4),
                   # We add the "pkg" argument
                   pkg = c("graphics", "grid", "ggplot2")) {
  # Get the number of suspects in total
  nSuspects <- layout[1] * layout[2]
  if (is.null(trueLoc)) {trueLoc <- sample(1:nSuspects, 1)}
  if (is.null(showSuspect)) {stop("need a plot function for the suspect")}
  if (is.null(generateSuspect)) {stop("need a function to generate suspect")}
  # Need to decide which suspect to present
  presentSuspect <- function(suspectNo) {
    if (suspectNo != trueLoc) {data <- generateSuspect(data)}
    showSuspect(data, suspectNo)
  }
  # Up to here, there is no change beyond the additional "pkg" argument.

  # Plotting depends on the plotting package
  pkg <- match.arg(pkg)
  switch(pkg,
         "graphics" = {
           savePar <- par(mfrow = layout,
                          mar = c(2.5, 0.1, 3, 0.1), oma = rep(0, 4))
           sapply(1:nSuspects, FUN = presentSuspect)
           # The plotLayout here is of no value for graphics
           plotLayout <- layout
           par(savePar)
         },
         "grid" = {
           grobs <- lapply(1:nSuspects, FUN = presentSuspect)
           plotLayout <- marrangeGrob(grobs = grobs,
                                      nrow = layout[1], ncol = layout[2])
         },
         "ggplot2" = {
           plots <- lapply(1:nSuspects, FUN = presentSuspect)
           plotLayout <- marrangeGrob(grobs = plots,
                                      nrow = layout[1], ncol = layout[2])
         },
         stop("Wrong 'pkg'")
  )

  # Obfuscate location to keep us honest
  possibleBaseVals <- 3:min(2 * nSuspects, 50)
  # remove easy base values
  possibleBaseVals <- possibleBaseVals[possibleBaseVals != 10 &
                                         possibleBaseVals != 5]
  base <- sample(possibleBaseVals, 1)
  offset <- sample(5:min(5 * nSuspects, 125), 1)
  # return obfuscated location and plot (if not "graphics")
  list(trueLoc = paste0("log(", base^(trueLoc + offset),
                        ", base=", base, ") - ", offset),
       plotLayout = plotLayout)
}
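The obfuscated location is just an R expression stored as a string, so it can be decoded by evaluating it. The sketch below rebuilds such a string by hand and recovers the location; the particular values of `trueLoc`, `base` and `offset` are chosen here for illustration (inside lineup() the base and offset are drawn at random).

```r
# Illustrative values only
trueLoc <- 9
base <- 7
offset <- 66

# Same construction as the trueLoc element returned by lineup()
obfuscated <- paste0("log(", base^(trueLoc + offset),
                     ", base=", base, ") - ", offset)
print(obfuscated)

# Decode by parsing and evaluating the string; round() removes
# the small floating point error left by the logarithm
recovered <- round(eval(parse(text = obfuscated)))
print(recovered)
```

Because the large power is printed with limited precision, the evaluated expression is only approximately an integer, which is why the result is rounded.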
We'll work with the SAheart data from the package ElemStatLearn and assess whether coronary heart disease is independent of family history.

library(ElemStatLearn)
# First, get the chd events,
# ordered so that chd = 1 appears as the bottom bar of an eikosogram
chd <- c("None", "CHD")[1 + SAheart$chd]
# Create the two way table
heart <- table(SAheart$famhist, chd, dnn = c("famhist", "coronary"))
To test independence for two way tables, the data structure passed to lineup needs to be a little richer than the one we had before when, say, comparing distributions.

# Here is the data structure that we will use for two way tables
data <- list(table = heart, y = "coronary", x = "famhist")

We need to adapt the function for generating a new table to this data structure. We'll introduce a function for each type of test (conditional or unconditional):

generateTableDataCond <- function(data) {
  newtable <- generateTable(data$table, data$y, data$x)
  list(table = newtable, y = data$y, x = data$x)
}

generateTableDataUncond <- function(data) {
  newtable <- generateTable(data$table, data$y, data$x, FALSE)
  list(table = newtable, y = data$y, x = data$x)
}
We also need a function of this data structure that will show an eikosogram:

showTable <- function(data, Suspect) {
  result <- eikos(y = data$y, x = data$x, data = data$table,
                  legend = FALSE,
                  xlabs = FALSE, ylabs = FALSE,
                  xaxs = FALSE, yaxs = FALSE,
                  main = paste("Suspect", Suspect),
                  draw = FALSE)
  invisible(result)
}

Now the lineup test can be called on our data. Here's the call for the conditional test (it uses generateTableDataCond; pass generateTableDataUncond instead for the unconditional test):

library(eikosograms)
results <- lineup(data,
                  generateSuspect = generateTableDataCond,
                  showSuspect = showTable,
                  layout = c(4, 5), pkg = "grid")
results$plotLayout
results$trueLoc

NOTE: The results have to be saved now, and the plotLayout and trueLoc evaluated to be seen.
The conditionally generated test:
[Figure: lineup of 20 eikosograms, labelled Suspect 1 to Suspect 20.]
True Location: log(2.41186503225706e+63, base=7) - 66 = 9
The unconditionally generated test:
[Figure: lineup of 20 eikosograms, labelled Suspect 1 to Suspect 20.]
True Location: log(9.28446791485507e+99, base=15) - 73 = 12
Dependence between a pair of ordered categorical variates can be seen in an eikosogram. This can also be seen with binary variates. Suppose we have two binary categorical variates X and Y that each take values of either "yes" or "no". If the value is "yes", then the event associated with that variate occurred; if "no", then it did not. Many important relationships between two such events (binary variates) have obvious visual signatures when displayed as eikosograms. These include:
◮ independence (already seen)
◮ coincident events
◮ mutually exclusive events
◮ positive association (when one event occurs the other does with high probability)
◮ negative association (when one event occurs the other does not with high probability)
Eikosograms - Coincident versus mutually exclusive events
[Figure: two eikosograms of Y (Yes/No) against X (Yes/No), one titled "Coincident" and one titled "Mutually exclusive", each with the horizontal axis marked at 0.33.]
Eikosograms - Association
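[Figure: two eikosograms of Y (Yes/No) against X (Yes/No), each with the horizontal axis marked at 0.33. "Negative Association": vertical axis marked at 0.2 and 0.8. "Positive Association": vertical axis marked at 0.8 and 0.2.]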
You can see the visual transition as dependence changes:
[Figure: five eikosograms of Y (Yes/No) against X (Yes/No), each with the horizontal axis marked at 0.33, in the order "Mutually exclusive", "Negative Association" (vertical axis marked at 0.2 and 0.8), "Independent" (vertical axis marked at 0.33), "Positive Association" (vertical axis marked at 0.8 and 0.2), "Coincident".]
mutually exclusive → negative association → independence → positive association → coincidence
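The same progression can be written down numerically. The sketch below builds the joint distribution of two binary variates from Pr(X = Yes) and the two conditional probabilities of Y = Yes, then checks for independence. The helper names `joint_from_conditionals` and `is_independent`, and the specific probabilities, are assumptions made here for illustration.

```r
# Build a 2 x 2 joint distribution from Pr(X = Yes), Pr(Y = Yes | X = Yes)
# and Pr(Y = Yes | X = No). Hypothetical helper, not part of any package.
joint_from_conditionals <- function(p_x, p_y_if_x, p_y_if_not_x) {
  matrix(c(p_x * p_y_if_x,       (1 - p_x) * p_y_if_not_x,
           p_x * (1 - p_y_if_x), (1 - p_x) * (1 - p_y_if_not_x)),
         nrow = 2, byrow = TRUE,
         dimnames = list(Y = c("Yes", "No"), X = c("Yes", "No")))
}

# Independent if and only if every cell equals the product of its margins
is_independent <- function(joint) {
  expected <- outer(rowSums(joint), colSums(joint))
  max(abs(joint - expected)) < 1e-12
}

p_x <- 1 / 3  # illustrative value for Pr(X = Yes)
cases <- list(
  mutually_exclusive   = joint_from_conditionals(p_x, 0,     0.5),
  negative_association = joint_from_conditionals(p_x, 0.2,   0.8),
  independent          = joint_from_conditionals(p_x, 1 / 3, 1 / 3),
  positive_association = joint_from_conditionals(p_x, 0.8,   0.2),
  coincident           = joint_from_conditionals(p_x, 1,     0)
)

print(sapply(cases, is_independent))
```

Only the case with equal conditional probabilities passes the check, which corresponds to the flat eikosogram; mutually exclusive events have a zero in the (Yes, Yes) cell and coincident events have zeros off the diagonal.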
The Titanic
Recall: The data set Titanic provides “information on the fate of passengers on the fatal maiden voyage of the ocean liner Titanic, summarized according to economic status (class), sex, age and survival.”
The Titanic data records the number of passengers in various categories for four different categorical variates:

No.  Variate   Values
1    Class     1st, 2nd, 3rd, Crew
2    Sex       Male, Female
3    Age       Child, Adult
4    Survived  No, Yes
Suppose we are interested in whether:
◮ Survival depends on sex of passenger
eikos(y ="Survived", x = "Sex", data = Titanic, legend = TRUE, main="Titanic")
◮ Survival depends on age of passenger
eikos(y ="Survived", x = "Age", data = Titanic, legend = TRUE, main="Titanic")
◮ Survival depends on class of passenger
eikos(y ="Survived", x = "Class", data = Titanic, legend = TRUE, main="Titanic")
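The proportions that such an eikosogram displays can also be computed directly in base R. This is a minimal sketch for the Survived given Sex case; the variable names are introduced here for illustration.

```r
# Two-way table of Survived (rows) by Sex (columns), summing over Class and Age
surv_by_sex <- margin.table(Titanic, c(4, 2))

# Conditional proportions of Survived given Sex: each column sums to 1
surv_given_sex <- prop.table(surv_by_sex, margin = 2)
print(round(surv_given_sex, 2))

# Marginal proportions of Sex
p_sex <- margin.table(Titanic, 2) / sum(Titanic)
print(round(p_sex, 2))
```

If survival were independent of sex, the two columns of `surv_given_sex` would be (nearly) identical. Replacing dimension 2 by 3 or 1 gives the corresponding proportions for Age or Class.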
[Figure: eikosogram "Titanic" of Survived (No/Yes) against Sex (Male, Female); vertical axis marked at 0.79 and 0.27, horizontal axis marked at 0.79.]
[Figure: eikosogram "Titanic" of Survived (No/Yes) against Age (Child, Adult); vertical axis marked at 0.48 and 0.69, horizontal axis marked at 0.05.]
[Figure: eikosogram "Titanic" of Survived (No/Yes) against Class (1st, 2nd, 3rd, Crew); vertical axis marked at 0.38, 0.59, 0.75 and 0.76, horizontal axis marked at 0.15, 0.28 and 0.6.]
Eikosograms - conditional independence:
Conditional independence corresponds to flat regions. The joint distribution of X, Y, and Z: $Y \perp\!\!\!\perp X \mid Z$
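Example of a three-way joint distribution:
$Y \perp\!\!\!\perp X \mid Z$; $Y \not\perp\!\!\!\perp Z \mid X$
$Z \perp\!\!\!\perp X \mid Y$; $Z \not\perp\!\!\!\perp Y \mid X$
$X \perp\!\!\!\perp Z \mid Y$; $X \perp\!\!\!\perp Y \mid Z$
$Y \perp\!\!\!\perp X$; $Y \not\perp\!\!\!\perp Z$
$Z \perp\!\!\!\perp X$; $Z \not\perp\!\!\!\perp Y$
$X \perp\!\!\!\perp Y$; $X \perp\!\!\!\perp Z$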
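Example of a three-way joint distribution:
$Y \not\perp\!\!\!\perp X \mid Z$; $Y \not\perp\!\!\!\perp Z \mid X$
$Z \not\perp\!\!\!\perp X \mid Y$; $Z \not\perp\!\!\!\perp Y \mid X$
$X \not\perp\!\!\!\perp Z \mid Y$; $X \not\perp\!\!\!\perp Y \mid Z$
$Y \perp\!\!\!\perp X$; $Y \perp\!\!\!\perp Z$
$Z \perp\!\!\!\perp X$; $Z \perp\!\!\!\perp Y$
$X \perp\!\!\!\perp Y$; $X \perp\!\!\!\perp Z$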
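Example of a three-way joint distribution:
$Y \perp\!\!\!\perp X \mid Z$; $Y \not\perp\!\!\!\perp Z \mid X$
$Z \not\perp\!\!\!\perp X \mid Y$; $Z \not\perp\!\!\!\perp Y \mid X$
$X \not\perp\!\!\!\perp Z \mid Y$; $X \perp\!\!\!\perp Y \mid Z$
$Y \not\perp\!\!\!\perp X$; $Y \not\perp\!\!\!\perp Z$
$Z \not\perp\!\!\!\perp X$; $Z \not\perp\!\!\!\perp Y$
$X \not\perp\!\!\!\perp Y$; $X \not\perp\!\!\!\perp Z$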
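A case where X and Y are independent given Z, yet dependent marginally, can be checked numerically. The sketch below constructs such a joint distribution; all probabilities and object names are hypothetical and chosen only for illustration.

```r
# Hypothetical ingredients: Pr(Z), Pr(X | Z) and Pr(Y | Z)
p_z <- c(z1 = 0.4, z2 = 0.6)
p_x_given_z <- rbind(z1 = c(x1 = 0.3, x2 = 0.7),
                     z2 = c(x1 = 0.8, x2 = 0.2))
p_y_given_z <- rbind(z1 = c(y1 = 0.5, y2 = 0.5),
                     z2 = c(y1 = 0.1, y2 = 0.9))

# Joint distribution built so that X and Y are independent given Z:
# Pr(x, y, z) = Pr(z) * Pr(x | z) * Pr(y | z)
joint <- array(0, dim = c(2, 2, 2),
               dimnames = list(X = c("x1", "x2"),
                               Y = c("y1", "y2"),
                               Z = c("z1", "z2")))
for (z in names(p_z)) {
  joint[, , z] <- p_z[z] * outer(p_x_given_z[z, ], p_y_given_z[z, ])
}

# Pr(Y | X, Z): normalise over Y within each (X, Z) combination
p_y_given_xz <- prop.table(joint, margin = c(1, 3))

# "Flat" within each value of Z: the conditional does not change with X
flat_given_z <- all(abs(p_y_given_xz["x1", , ] - p_y_given_xz["x2", , ]) < 1e-12)

# Marginally (summing over Z) X and Y need not be independent
joint_xy <- apply(joint, c(1, 2), sum)
marginally_independent <-
  all(abs(joint_xy - outer(rowSums(joint_xy), colSums(joint_xy))) < 1e-12)

print(flat_given_z)
print(marginally_independent)
```

The flat check within each value of Z is the numerical counterpart of the flat regions in the eikosogram.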
Eikosograms - display equivalence
Eikosograms date back at least to Halley in 1693. Note that the name eikosogram is used here to denote a particular display based on a particular model.
◮ Only one variate appears on the vertical axis (the response); all other variates appear on the horizontal axis (the conditioning variates).
◮ The model is a response model: modelling the conditional distribution of Y | X together with the marginal distribution(s) of the conditioning variate(s) X.
Two other named displays are identical to the eikosogram only when there is a single conditioning variate (and single response variate).
◮ the mosaic plot, which is used for categorical data, but differs from an eikosogram for more than one X variate
◮ the spine plot, which takes a categorical Y and either a categorical or continuous variate X, and so shares some properties with a histogram as well
First create a two-way table:

TitanicSurvClass <- margin.table(Titanic, margin = c(1, 4))

The mosaic plot:

mosaicplot(TitanicSurvClass, main = "Mosaic plot (two way table)")

The spine plot:

spineplot(TitanicSurvClass, main = "Spine plot (two way table)")
Eikosograms - display equivalence (to a mosaic plot)
[Figure: "Mosaic plot (two way table)" with axes Class (1st, 2nd, 3rd, Crew) and Survived (No, Yes).]
Eikosograms - display equivalence (to a spine plot)
[Figure: "Spine plot (two way table)" with axes Class (1st, 2nd, 3rd, Crew) and Survived (No, Yes), and a proportion scale from 0.0 to 1.0.]
Eikosograms - the equivalent eikosogram
[Figure: "Eikosogram (two way table)" of Survived (No/Yes) against Class (1st, 2nd, 3rd, Crew); vertical axis marked at 0.38, 0.59, 0.75 and 0.76, horizontal axis marked at 0.15, 0.28 and 0.6.]
More than two variates - display differences
Eikosograms and mosaic plots differ when there are more than two variates:

eikos(y = "Survived", x = c("Class", "Sex"), data = Titanic,
      xaxs = FALSE, yaxs = FALSE,
      main = "Eikosogram (three way table)")

# Mosaic plot, need to construct the marginal table first
# Note that the table is ordered so that Survived is first,
# then Class and Sex (to match the eikosogram order above)
TitanicSurvClassSex <- margin.table(Titanic, c(4, 1, 2))
# The mosaic plot
mosaicplot(TitanicSurvClassSex, main = "Mosaic plot (three way table)")
More than two variates - display differences (eikosogram)
[Figure: "Eikosogram (three way table)" with response Survived (No/Yes) and conditioning variates Class (1st, 2nd, 3rd, Crew) and Sex (Male, Female).]
More than two variates - display differences (mosaic plot)
[Figure: "Mosaic plot (three way table)" with axis labels Sex and Survived; category labels Male, Female, No, Yes, and 1st, 2nd, 3rd, Crew.]
[Figure: "Eikosogram (three way table)" with response Sex (Male/Female) and conditioning variates Class (1st, 2nd, 3rd, Crew) and Survived (No, Yes).]
[Figure: "Mosaic plot (three way table)" with axis labels Survived and Sex; category labels No, Yes, Male, Female, and 1st, 2nd, 3rd, Crew.]
[Figure: "Eikosogram (three way table)" with response Class (1st, 2nd, 3rd, Crew) and conditioning variates Sex (Male, Female) and Survived (No, Yes).]
[Figure: "Mosaic plot (three way table)" with axis labels Survived and Class; category labels No, Yes, 1st, 2nd, 3rd, Crew, and Male, Female.]
Different models produce different displays
Eikosogram: shows proportions Y | (X&Z), X&Z, and Z as lengths. Makes it easier to see conditional independence, e.g. $Y \perp\!\!\!\perp X \mid Z$.
Mosaic plot: shows proportions X, Y | X, Z | (X&Y) as lengths.
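To make the two sets of lengths concrete, the sketch below computes them from the Titanic data in base R, assuming (for illustration only) the mapping Y = Survived, X = Class, Z = Sex. The variable names are introduced here and are not part of any package.

```r
# Assumed mapping for illustration: Y = Survived, X = Class, Z = Sex
tab <- margin.table(Titanic, c(4, 1, 2))   # dimensions: Survived, Class, Sex
n_total <- sum(tab)

# Lengths shown by the eikosogram
eikos_y_given_xz <- prop.table(tab, margin = c(2, 3))    # Y | (X & Z)
eikos_xz         <- margin.table(tab, c(2, 3)) / n_total # X & Z
eikos_z          <- margin.table(tab, 3) / n_total       # Z

# Lengths shown by the mosaic plot
mosaic_x          <- margin.table(tab, 2) / n_total                   # X
mosaic_y_given_x  <- prop.table(margin.table(tab, c(1, 2)), margin = 2)  # Y | X
mosaic_z_given_xy <- prop.table(tab, margin = c(1, 2))                # Z | (X & Y)

print(round(eikos_y_given_xz, 2))
print(round(mosaic_z_given_xy, 2))
```

Conditional independence of Y and X given Z would show up in `eikos_y_given_xz` as equal columns across Class within each Sex, which is why the eikosogram factorisation makes it easier to see.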
Differences become greater the greater the number of conditioning variates in the eikosogram:

eikos(y = "Survived", x = c("Class", "Sex", "Age"), data = Titanic,
      xaxs = FALSE, yaxs = FALSE,
      main = "Eikosogram (four way table)")

mosaicplot(aperm(Titanic, perm = c(3, 4, 1, 2)),
           main = "Mosaic plot (four way table)")
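The call to `aperm()` only reorders the dimensions of the table; the counts themselves are unchanged. A small sketch of what `perm = c(3, 4, 1, 2)` does to `Titanic` (the name `reordered` is introduced here for illustration):

```r
# Original dimension order of the Titanic table
original_order <- names(dimnames(Titanic))
print(original_order)

# Reorder so that dimension 3 comes first, then 4, then 1, then 2
reordered <- aperm(Titanic, perm = c(3, 4, 1, 2))
new_order <- names(dimnames(reordered))
print(new_order)

# The same cell, addressed in each ordering, holds the same count
same_cell <- reordered["Child", "No", "3rd", "Male"] ==
  Titanic["3rd", "Male", "Child", "No"]
print(same_cell)
```

Since `mosaicplot()` splits on the variates in the order the dimensions appear, permuting the table is how the splitting order is controlled.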
Different models produce different displays (eikosogram)
[Figure: "Eikosogram (four way table)" with response Survived (No/Yes) and conditioning variates Class (1st, 2nd, 3rd, Crew), Sex (Male, Female) and Age (Child, Adult).]
Different models produce different displays (mosaic plot)
[Figure: "Mosaic plot (four way table)" with axis labels Age and Survived; category labels Child, Adult, No, Yes, 1st, 2nd, 3rd, Crew, and Male, Female.]