Categorical data Reasoning by diagrams R.W. Oldford Crossed data - - - PowerPoint PPT Presentation
Crossed data - tables
The main data structure for crossed categorical data is a table.
Each variate has a finite number of values (categories)

    city <- c("Kitchener", "Waterloo")
    housing <- c("House", "Apartment", "Residence")

All combinations of one value from each variate are possible (crossed) and we have the number of times each combination occurs

    # fake data
    counts <- rpois(6, lambda = 50)

Arranged in a rectangular array:

    vacancy <- matrix(counts, nrow = length(city), ncol = length(housing),
                      byrow = TRUE,
                      dimnames = list(city = city, housing = housing))

And now coerced to be an object of class table

    vacancy <- as.table(vacancy)
    vacancy
    ##            housing
    ## city        House Apartment Residence
    ##   Kitchener    52        53        46
    ##   Waterloo     47        64        43
Crossed data - tables
The table can be a many-way array from crossing many categorical variates
    term <- c("Fall", "Winter", "Spring")
    # more fake counts
    counts <- seq(from = 10, to = 180, by = 10)
    vacancy <- array(counts,
                     dim = c(length(city), length(housing), length(term)),
                     dimnames = list(city = city, housing = housing, term = term))
    vacancy <- as.table(vacancy)
    vacancy
    ## , , term = Fall
    ##
    ##            housing
    ## city        House Apartment Residence
    ##   Kitchener    10        30        50
    ##   Waterloo     20        40        60
    ##
    ## , , term = Winter
    ##
    ##            housing
    ## city        House Apartment Residence
    ##   Kitchener    70        90       110
    ##   Waterloo     80       100       120
    ##
    ## , , term = Spring
    ##
    ##            housing
    ## city        House Apartment Residence
    ##   Kitchener   130       150       170
    ##   Waterloo    140       160       180

Note that when filling the array, the earlier indices change more quickly than the later indices.
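The fill order can be checked directly. The sketch below rebuilds the same 2 x 3 x 3 array of counts and uses a small helper, `fill_position` (a hypothetical name introduced here for illustration), to locate a cell in the fill vector.

```r
# Same counts and dimensions as the vacancy array (city x housing x term)
counts <- seq(from = 10, to = 180, by = 10)
a <- array(counts, dim = c(2, 3, 3))

# Hypothetical helper: position of cell (i, j, k) in the fill vector.
# The first index moves fastest, the last index slowest.
fill_position <- function(i, j, k, dims) {
  i + (j - 1) * dims[1] + (k - 1) * dims[1] * dims[2]
}

pos <- fill_position(2, 3, 1, dim(a))   # Waterloo, Residence, Fall
pos                                     # 6
a[2, 3, 1]                              # 60, the sixth count
counts[pos]                             # 60 again
```

Changing only the last index, as in `a[1, 1, 2]`, jumps a whole 2 x 3 slice ahead in the fill vector.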
Crossed data - tables
The order of dimensions can be rearranged with the R function aperm(...)

    aperm(vacancy, perm=c(3,2,1))
    ## , , city = Kitchener
    ##
    ##         housing
    ## term     House Apartment Residence
    ##   Fall      10        30        50
    ##   Winter    70        90       110
    ##   Spring   130       150       170
    ##
    ## , , city = Waterloo
    ##
    ##         housing
    ## term     House Apartment Residence
    ##   Fall      20        40        60
    ##   Winter    80       100       120
    ##   Spring   140       160       180
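A quick way to see what the permutation does is to look up the same cell before and after. This sketch rebuilds the array under the name `x` (an illustrative name) so that it runs on its own.

```r
x <- array(seq(from = 10, to = 180, by = 10), dim = c(2, 3, 3),
           dimnames = list(city    = c("Kitchener", "Waterloo"),
                           housing = c("House", "Apartment", "Residence"),
                           term    = c("Fall", "Winter", "Spring")))

# perm = c(3, 2, 1): the old third dimension becomes the first, and so on
y <- aperm(x, perm = c(3, 2, 1))
dim(y)                 # 3 3 2
names(dimnames(y))     # "term" "housing" "city"

# The cell is unchanged; only the order of its indices differs
x["Waterloo", "Apartment", "Winter"]   # 100
y["Winter", "Apartment", "Waterloo"]   # 100

# This particular permutation is its own inverse
all(aperm(y, perm = c(3, 2, 1)) == x)
```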
Crossed data - constructing tables from data
Suppose we have an existing data frame with categorical variates

    SAheart[1:3,]
    ##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
    ## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
    ## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
    ## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0

Create the table directly from individual factors (like famhist) or unique values (like chd):

    table(SAheart$chd, SAheart$famhist, dnn = c("chd", "famhist"))
    ##    famhist
    ## chd Absent Present
    ##   0    206      96
    ##   1     64      96

Or, by cross-tabulation (“cross tabs” or xtabs)

    xtabs( ~ chd + famhist, data = SAheart) # Note formula
    ##    famhist
    ## chd Absent Present
    ##   0    206      96
    ##   1     64      96
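The two constructions can be compared on a tiny made-up data frame. The `toy` data below are hypothetical and are not taken from SAheart; the sketch also shows `xtabs` rebuilding a table from counts that have already been aggregated.

```r
# Hypothetical toy data: six cases, two categorical variates
toy <- data.frame(
  chd     = c(0, 0, 1, 1, 1, 0),
  famhist = factor(c("Absent", "Present", "Present",
                     "Absent", "Present", "Absent"))
)

t1 <- table(toy$chd, toy$famhist, dnn = c("chd", "famhist"))
t2 <- xtabs(~ chd + famhist, data = toy)
all(t1 == t2)          # same counts either way

# With a left-hand side in the formula, xtabs sums that column instead
agg <- as.data.frame(t1)          # columns: chd, famhist, Freq
t3  <- xtabs(Freq ~ chd + famhist, data = agg)
all(t3 == t1)
```

The formula interface is convenient when the variates already live in one data frame, since the dimension names come from the formula.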
Crossed data - working with tables
Consider the three-way table (a 4 x 4 x 2 array) HairEyeColor:
    ## , , Sex = Male
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black    32   11    10     3
    ##   Brown    53   50    25    15
    ##   Red      10   10     7     7
    ##   Blond     3   30     5     8
    ##
    ## , , Sex = Female
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black    36    9     5     2
    ##   Brown    66   34    29    14
    ##   Red      16    7     7     7
    ##   Blond     4   64     5     8
The names of its variates (dimnames), in order, are:

    names(dimnames(HairEyeColor))
    ## [1] "Hair" "Eye"  "Sex"

These names are used to create interesting sub-tables or alternative tables.
Crossed data - working with tables
Selecting slices (conditioning)
    HairEyeColor["Black",,]
    ##        Sex
    ## Eye     Male Female
    ##   Brown   32     36
    ##   Blue    11      9
    ##   Hazel   10      5
    ##   Green    3      2
    HairEyeColor[,"Green",]
    ##        Sex
    ## Hair    Male Female
    ##   Black    3      2
    ##   Brown   15     14
    ##   Red      7      7
    ##   Blond    8      8
    HairEyeColor["Black","Blue",]
    ##   Male Female
    ##     11      9
    HairEyeColor["Black","Green","Male"]
    ## [1] 3
Crossed data - working with tables
Collapsing dimensions (marginalizing, projecting)
    # Zero dimensional
    margin.table(HairEyeColor)
    ## [1] 592
    # 1 dimensional -- here margin 1 ("Hair") is preserved
    margin.table(HairEyeColor, margin=1)
    ## Hair
    ## Black Brown   Red Blond
    ##   108   286    71   127
    # 2 dimensional -- here margins 1 and 2 ("Hair", "Eye") are preserved
    margin.table(HairEyeColor, margin=c(1,2))
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black    68   20    15     5
    ##   Brown   119   84    54    29
    ##   Red      26   17    14    14
    ##   Blond     7   94    10    16

    # Note: except for the 0-dimensional case, these are the same as using "apply" with "sum"
    apply(HairEyeColor, MARGIN=1, FUN=sum)
    ## Black Brown   Red Blond
    ##   108   286    71   127
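The same equivalence holds for the two-dimensional margin. As a small check, using only base R and the built-in HairEyeColor table:

```r
m12 <- margin.table(HairEyeColor, margin = c(1, 2))
a12 <- apply(HairEyeColor, MARGIN = c(1, 2), FUN = sum)
all(m12 == a12)                                   # TRUE

# The zero-dimensional margin is just the grand total
margin.table(HairEyeColor) == sum(HairEyeColor)   # TRUE

# apply() also accepts the variate names when the dimnames are named
m_named <- apply(HairEyeColor, MARGIN = c("Hair", "Eye"), FUN = sum)
all(m_named == m12)                               # TRUE
```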
Crossed data - working with tables
Summing along every margin (new variate value Sum for each variate)
    # Every margin is summed
    addmargins(HairEyeColor)
    ## , , Sex = Male
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green Sum
    ##   Black    32   11    10     3  56
    ##   Brown    53   50    25    15 143
    ##   Red      10   10     7     7  34
    ##   Blond     3   30     5     8  46
    ##   Sum      98  101    47    33 279
    ##
    ## , , Sex = Female
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green Sum
    ##   Black    36    9     5     2  52
    ##   Brown    66   34    29    14 143
    ##   Red      16    7     7     7  37
    ##   Blond     4   64     5     8  81
    ##   Sum     122  114    46    31 313
    ##
    ## , , Sex = Sum
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green Sum
    ##   Black    68   20    15     5 108
    ##   Brown   119   84    54    29 286
    ##   Red      26   17    14    14  71
    ##   Blond     7   94    10    16 127
    ##   Sum     220  215    93    64 592
Crossed data - working with tables
Summing along a single margin
    # Just produce marginal sums over dimension 2 ("Eye") values
    # for each pair (i, k) of remaining variates "Hair" and "Sex"
    addmargins(HairEyeColor, margin=2)
    ## , , Sex = Male
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green Sum
    ##   Black    32   11    10     3  56
    ##   Brown    53   50    25    15 143
    ##   Red      10   10     7     7  34
    ##   Blond     3   30     5     8  46
    ##
    ## , , Sex = Female
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green Sum
    ##   Black    36    9     5     2  52
    ##   Brown    66   34    29    14 143
    ##   Red      16    7     7     7  37
    ##   Blond     4   64     5     8  81
Crossed data - working with tables
Summing along two margins
    # Produce marginal sums over both dimensions 1 and 2 ("Hair" and "Eye")
    # for each value of "Sex"
    addmargins(HairEyeColor, margin=c(1,2))
    ## , , Sex = Male
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green Sum
    ##   Black    32   11    10     3  56
    ##   Brown    53   50    25    15 143
    ##   Red      10   10     7     7  34
    ##   Blond     3   30     5     8  46
    ##   Sum      98  101    47    33 279
    ##
    ## , , Sex = Female
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green Sum
    ##   Black    36    9     5     2  52
    ##   Brown    66   34    29    14 143
    ##   Red      16    7     7     7  37
    ##   Blond     4   64     5     8  81
    ##   Sum     122  114    46    31 313
Crossed data - working with tables
Proportions (depends on which margin is fixed)
    # No margins fixed, just total ... single multinomial
    round(prop.table(HairEyeColor), 3)
    ## , , Sex = Male
    ##
    ##        Eye
    ## Hair    Brown  Blue Hazel Green
    ##   Black 0.054 0.019 0.017 0.005
    ##   Brown 0.090 0.084 0.042 0.025
    ##   Red   0.017 0.017 0.012 0.012
    ##   Blond 0.005 0.051 0.008 0.014
    ##
    ## , , Sex = Female
    ##
    ##        Eye
    ## Hair    Brown  Blue Hazel Green
    ##   Black 0.061 0.015 0.008 0.003
    ##   Brown 0.111 0.057 0.049 0.024
    ##   Red   0.027 0.012 0.012 0.012
    ##   Blond 0.007 0.108 0.008 0.014
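Possible generative model: multinomial. Here the counts $n_{ijk}$ have fixed total $n = n_{+++} = \sum_{ijk} n_{ijk} = 592$.

$$\Pr(\text{Data}) = \binom{n}{n_{111}\; n_{211}\; \cdots\; n_{442}} \, p_{111}^{n_{111}} \cdots p_{442}^{n_{442}}$$

with $p_{+++} = \sum_{i=1}^{4} \sum_{j=1}^{4} \sum_{k=1}^{2} p_{ijk} = 1$.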
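To make the single-multinomial formula concrete, the sketch below evaluates its logarithm for HairEyeColor. Plugging in the observed proportions for the $p_{ijk}$ is an assumption made here for illustration; the model itself does not fix their values.

```r
counts <- as.vector(HairEyeColor)   # the 32 cell counts n_ijk
n      <- sum(counts)               # fixed total, 592
p_hat  <- counts / n                # assumed cell probabilities (sum to 1)

# log Pr(Data) under one multinomial over all 32 cells
loglik <- dmultinom(counts, size = n, prob = p_hat, log = TRUE)

# The same quantity written out term by term from the formula:
# log multinomial coefficient + sum of n_ijk * log(p_ijk)
by_hand <- lgamma(n + 1) - sum(lgamma(counts + 1)) + sum(counts * log(p_hat))

all.equal(loglik, by_hand)          # TRUE
```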
Crossed data - working with tables
Proportions (depends on which margin is fixed)
    # One margin (the third here, i.e. Sex) is fixed ... as many multinomials as there are values of that margin
    round(prop.table(HairEyeColor, margin=3), 2)
    ## , , Sex = Male
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black  0.11 0.04  0.04  0.01
    ##   Brown  0.19 0.18  0.09  0.05
    ##   Red    0.04 0.04  0.03  0.03
    ##   Blond  0.01 0.11  0.02  0.03
    ##
    ## , , Sex = Female
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black  0.12 0.03  0.02  0.01
    ##   Brown  0.21 0.11  0.09  0.04
    ##   Red    0.05 0.02  0.02  0.02
    ##   Blond  0.01 0.20  0.02  0.03

Possible generative model: product multinomial. Fixed sums are $(n_{++1}, n_{++2}) = \sum_{ij} (n_{ij1}, n_{ij2}) = (279, 313)$.

$$\Pr(\text{Data}) = \binom{n_{++1}}{n_{111}\; n_{211}\; \cdots\; n_{441}} \, p_{111}^{n_{111}} \cdots p_{441}^{n_{441}} \times \binom{n_{++2}}{n_{112}\; n_{212}\; \cdots\; n_{442}} \, p_{112}^{n_{112}} \cdots p_{442}^{n_{442}}$$

with $p_{++k} = \sum_{i=1}^{4} \sum_{j=1}^{4} p_{ijk} = 1$ for each $k = 1, 2$.
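Under the product multinomial, the log-probability is a sum of one multinomial term per value of Sex. A minimal sketch, again assuming the observed within-sex proportions stand in for the $p_{ijk}$:

```r
sexes <- dimnames(HairEyeColor)$Sex

loglik_by_sex <- sapply(sexes, function(s) {
  counts <- as.vector(HairEyeColor[, , s])  # the 16 counts n_ijk for this k
  p_hat  <- counts / sum(counts)            # assumed probabilities, sum to 1
  dmultinom(counts, prob = p_hat, log = TRUE)
})

# The product of the two multinomials becomes a sum on the log scale
loglik_product <- sum(loglik_by_sex)

# The totals held fixed by this model
margin.table(HairEyeColor, margin = 3)      # Male 279, Female 313
```

Each term uses its own total, so the overall count of 592 never enters except through 279 and 313.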
Crossed data - working with tables
Proportions (depends on which margin is fixed)
    # Easier to see with column sums for a two way table
    HairEye <- margin.table(HairEyeColor, margin = c(1,2))
    # Sum of each column's proportions must be one
    round(prop.table(HairEye, margin=2), 2)
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black  0.31 0.09  0.16  0.08
    ##   Brown  0.54 0.39  0.58  0.45
    ##   Red    0.12 0.08  0.15  0.22
    ##   Blond  0.03 0.44  0.11  0.25

Possible generative model: again a product multinomial. But now begin with sums over $k$ (i.e. Sex) so that the relevant counts are $n_{ij+} = \sum_{k=1}^{2} n_{ijk}$. Fixed sums for this table (i.e. HairEye) are $n_{+j+} = \sum_{i=1}^{4} n_{ij+}$ for all $j = 1, \ldots, 4$. The values are $(n_{+1+}, n_{+2+}, n_{+3+}, n_{+4+}) = (220, 215, 93, 64)$.

$$\Pr(\text{Data}) = \prod_{j=1}^{4} \binom{n_{+j+}}{n_{1j+}\; n_{2j+}\; n_{3j+}\; n_{4j+}} \, p_{1j+}^{n_{1j+}} \cdots p_{4j+}^{n_{4j+}}$$

with $p_{+j+} = \sum_{i=1}^{4} p_{ij+} = 1$ for each $j = 1, 2, 3, 4$. (That is, each column of the proportions table above sums to 1.)
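One way to see what "fixed column sums" means is to simulate a table from this model. The sketch below draws one multinomial per eye colour; the seed and the use of the observed column proportions as probabilities are assumptions for illustration only.

```r
HairEye    <- margin.table(HairEyeColor, margin = c(1, 2))
col_totals <- as.vector(margin.table(HairEye, margin = 2))  # 220, 215, 93, 64
col_props  <- prop.table(HairEye, margin = 2)               # columns sum to 1

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
sim <- sapply(seq_along(col_totals), function(j) {
  # one multinomial draw of hair colours for eye colour j
  rmultinom(1, size = col_totals[j], prob = col_props[, j])[, 1]
})
dimnames(sim) <- dimnames(HairEye)

colSums(col_props)   # every column of proportions sums to 1
colSums(sim)         # equals the fixed column totals by construction
```

The row totals of `sim` vary from draw to draw; only the column totals are held fixed.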
Crossed data - tidy tables
On the course website, there is another package called tidytable which provides an implementation of the rules we developed for table analysis. To install the package:
1. Download it (tidytable_0.0-1.tar.gz) from the course website.
2. Place it somewhere in your file system, say in “SomeDirectoryYouPicked”.
3. Then install it in R as follows:
3.1 EITHER from RStudio’s “Install Packages . . .” menu:
◮ select “Install from:” “Package archive file . . .”
◮ browse to your directory “SomeDirectoryYouPicked”
◮ leave the default “install to Library”
◮ select “Install”
3.2 OR
◮ get a terminal (or shell, or console) window
◮ change directories (cd) to “SomeDirectoryYouPicked”
◮ in that directory type

    R CMD INSTALL tidytable_0.0-1.tar.gz

  (the downloaded file name may lack the .gz suffix)
◮ if you have a problem with permissions, you might need administrator privileges
Crossed data - tidy tables
For example, recall the arrangement of HairEyeColor:
    HairEyeColor
    ## , , Sex = Male
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black    32   11    10     3
    ##   Brown    53   50    25    15
    ##   Red      10   10     7     7
    ##   Blond     3   30     5     8
    ##
    ## , , Sex = Female
    ##
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black    36    9     5     2
    ##   Brown    66   34    29    14
    ##   Red      16    7     7     7
    ##   Blond     4   64     5     8
Crossed data - tidy tables
tidytable rearranges the dimensions of the table (plus other functionality) to give

    library(tidytable)
    tidytable(HairEyeColor)$table
    ## , , Eye = Brown
    ##
    ##         Hair
    ## Sex      Brown Blond Black Red
    ##   Female    66     4    36  16
    ##   Male      53     3    32  10
    ##
    ## , , Eye = Blue
    ##
    ##         Hair
    ## Sex      Brown Blond Black Red
    ##   Female    34    64     9   7
    ##   Male      50    30    11  10
    ##
    ## , , Eye = Hazel
    ##
    ##         Hair
    ## Sex      Brown Blond Black Red
    ##   Female    29     5     5   7
    ##   Male      25     5    10   7
    ##
    ## , , Eye = Green
    ##
    ##         Hair
    ## Sex      Brown Blond Black Red
    ##   Female    14     8     2   7
    ##   Male      15     8     3   7

This arrangement more easily reveals patterns. E.g. consider whether hair colour and sex are independent after conditioning on eye colour.
Crossed data - tidy tables
Or, recalling the proportions, compare

    proportions <- round(prop.table(HairEye, margin=2), 2)
    proportions
    ##        Eye
    ## Hair    Brown Blue Hazel Green
    ##   Black  0.31 0.09  0.16  0.08
    ##   Brown  0.54 0.39  0.58  0.45
    ##   Red    0.12 0.08  0.15  0.22
    ##   Blond  0.03 0.44  0.11  0.25

to

    tidyproportions <- tidytable(proportions)
    tidyproportions$table
    ##        Hair
    ## Eye     Brown Blond Black Red
    ##   Brown    54     3    31  12
    ##   Blue     39    44     9   8
    ##   Hazel    58    11    16  15
    ##   Green    45    25     8  22

Note that tidyproportions contains more information, such as

    tidyproportions$units
    ## [1] 0.01
Eikosogram - picture of probability
Eikosograms are modelled directly on conditional probability and so can be used to reveal important patterns in this probability. To see this, let’s begin with an unconditional probability. Suppose we only have a response variate Y which can take only one of two possible values, either Y = y1 or Y = y2. An eikosogram representing this information could look like:
[Figure: eikosogram for Y with bands Y = y1 and Y = y2; vertical scale marked at 1/3 and 1.]
◮ Probability framed within a rectangle.
◮ Frame provides conditions
◮ Horizontal and vertical scales are [0,1]
◮ Horizontal (and vertical) lines only.
◮ Colours are in horizontal bands.
◮ Probability = Area
Actually represents: $\Pr(Y = y_1 \mid \text{Frame}) = \frac{1}{3}$
We drop the condition “Frame”, but it is understood to be present.
Eikosogram - picture of probability
Suppose we consider tossing two coins simultaneously.
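[Figure: eikosogram for Y with bands Y = {H,H}, Y = {H,T} and Y = {T,T}; vertical scale marked at 1/4, 3/4 and 1.]
◮ Two coins tossed simultaneously
◮ Three events, matching outcomes of Coin1, Coin2
◮ Events: {H, H}, {T, T}, {H, T}
◮ Probabilities = areas
We have:
$$\Pr(Y = \{H, H\} \mid \text{Frame}) = \tfrac{1}{4},$$
$$\Pr(Y = \{H, T\} \mid \text{Frame}) = \tfrac{3}{4} - \tfrac{1}{4} = \tfrac{1}{2}$$
and
$$\Pr(Y = \{T, T\} \mid \text{Frame}) = 1 - \tfrac{3}{4} = \tfrac{1}{4}$$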
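These three probabilities can be recovered by enumerating the four equally likely ordered tosses and then ignoring order. The sketch assumes fair, independent coins, which is what the values above imply.

```r
tosses <- expand.grid(coin1 = c("H", "T"), coin2 = c("H", "T"),
                      stringsAsFactors = FALSE)
tosses$prob <- 1 / 4    # assumption: fair, independent coins

# Ignore order: label each toss by its sorted pair of faces
tosses$event <- apply(tosses[, c("coin1", "coin2")], 1,
                      function(faces) paste(sort(faces), collapse = ","))

probs <- tapply(tosses$prob, tosses$event, sum)
probs                                     # H,H 0.25   H,T 0.50   T,T 0.25

# Heights at which the horizontal bands of the eikosogram end
cumsum(probs[c("H,H", "H,T", "T,T")])     # 0.25 0.75 1.00
```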
Eikosogram - picture of probability
Now imagine instead that we toss one coin, observe the outcome, then toss the next. We could sketch this in an outcome tree. Action proceeds from left to right, which is often a natural way to model a process. Key features:
◮ Single root
◮ Multiple branches at each node
◮ Typically finite, though not necessarily
◮ Interest lies in different subsets of the tree from the root (paths, partial paths, subtrees) called events. There are still only three of interest here.
Typically time runs from left to right. This nicely matches the model for an eikosogram!
Eikosogram - picture of probability
First toss one coin, observe the outcome, then toss the next. Let X take the value of the first coin’s outcome, Y the value of the second’s. The outcome tree is binary with two layers, the first for X and the second for Y.
[Figure: eikosogram for the frame X = heads, with bands Y = heads and Y = tails; vertical scale marked at 1/2 and 1.]
◮ Suppose first coin lands heads.
◮ This frame has X = heads
◮ Probabilities = areas
$$\Pr(Y = \text{heads} \mid X = \text{heads}, \text{Frame}) = \tfrac{1}{2},$$
$$\Pr(Y = \text{tails} \mid X = \text{heads}, \text{Frame}) = 1 - \tfrac{1}{2} = \tfrac{1}{2}$$
Eikosogram - picture of probability
Similarly, the eikosogram for the case when X = tails can be produced.
[Figure: two separate eikosograms, one for the frame X = heads and one for the frame X = tails, each with bands Y = heads and Y = tails and vertical scale marked at 1/2 and 1.]
These two separate eikosograms can be put together in a common frame to tell the whole story in a single eikosogram.
Eikosogram - picture of probability
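Putting the two separate frames together:
[Figure: combined eikosogram with columns X = heads and X = tails (horizontal scale marked at 1/2) and bands Y = heads and Y = tails (vertical scale marked at 1/2 and 1).]
◮ Probabilities are still areas.
◮ Marginal of X read off horizontal scale
◮ Conditional of Y | X read off vertical
◮ Note X and Y are independent here (flat)
We have:
$$\Pr(X = \text{heads} \mid \text{Frame}) = \tfrac{1}{2},$$
$$\Pr(Y = \text{heads} \mid X = \text{tails}, \text{Frame}) = \tfrac{1}{2}$$
$$\Pr(Y = \text{heads} \;\&\; X = \text{tails} \mid \text{Frame}) = \text{Area of rectangle} = \Pr(Y = \text{heads} \mid X = \text{tails}, \text{Frame}) \times \Pr(X = \text{tails} \mid \text{Frame})$$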
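The geometry of the combined eikosogram can be written down as rectangle coordinates: widths from the marginal of X, heights from the conditional of Y given X. The helper `eiko_rects` below is a hypothetical name introduced for this sketch, and the joint distribution assumes two fair, independent coins.

```r
# Joint distribution: rows are Y values, columns are X values
joint <- matrix(1 / 4, nrow = 2, ncol = 2,
                dimnames = list(Y = c("heads", "tails"),
                                X = c("heads", "tails")))

# Hypothetical helper: one rectangle per (Y, X) cell of the eikosogram
eiko_rects <- function(joint) {
  px     <- unname(colSums(joint))       # marginal Pr(X = x): column widths
  cond   <- sweep(joint, 2, px, "/")     # conditional Pr(Y = y | X = x)
  right  <- cumsum(px)                   # right edge of each column
  left   <- right - px                   # left edge of each column
  top    <- apply(cond, 2, cumsum)       # top edge of each band, per column
  bottom <- top - cond                   # bottom edge of each band
  data.frame(
    Y       = rep(rownames(joint), times = ncol(joint)),
    X       = rep(colnames(joint), each = nrow(joint)),
    xleft   = rep(left,  each = nrow(joint)),
    xright  = rep(right, each = nrow(joint)),
    ybottom = as.vector(bottom),
    ytop    = as.vector(top)
  )
}

r <- eiko_rects(joint)
r$area <- (r$xright - r$xleft) * (r$ytop - r$ybottom)

# Area = conditional height x marginal width = joint probability
all.equal(r$area, as.vector(joint))      # TRUE
```

The four coordinate columns are in the order expected by base R's `rect(xleft, ybottom, xright, ytop)`, should a drawing be wanted.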
Eikosogram - picture of probability
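Note: Comparing the two yields the same results (as it should).
[Figure: two eikosograms side by side. Left, “Two coins at a time”: bands Y = {H,H}, Y = {H,T} and Y = {T,T}, vertical scale marked at 1/4, 3/4 and 1. Right, “One after another”: columns X = heads and X = tails, bands Y = heads and Y = tails, both scales marked at 1/2.]
The left eikosogram shows only joint outcomes; the right shows marginal, conditional, and joint.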
Eikosogram - picture of probability
Consider a different example, one that still has two variates X and Y, each with binary outcomes: X = x with x ∈ {1, 2} and Y = y with y ∈ {1, 2}. But now X and Y have probabilities as given below:
◮ Probabilities are still areas.
◮ Marginal of X read off horizontal scale
◮ Conditional of Y | X read off vertical
◮ Note X and Y are not independent here (not flat)
Eikosogram - picture of probability
Which variate, where?
The choice of which variate, X or Y, appears on the horizontal and which on the vertical depends on which conditional probabilities are of interest. Below are two views of the same probability distribution. (Check areas!)
(a) Y | X and X (b) X | Y and Y
Eikosogram - picture of probability
Rules of probability follow from calculating and equating rectangular areas.
In both views, the bottom left yellow area is $\Pr(X = 1, Y = 1)$, so
$$\Pr(Y = 1 \mid X = 1) \times \Pr(X = 1) = \Pr(X = 1, Y = 1) = \Pr(X = 1 \mid Y = 1) \times \Pr(Y = 1)$$
and in general
$$\Pr(Y \mid X) \times \Pr(X) = \Pr(X \mid Y) \times \Pr(Y) \quad \ldots \text{Bayes “theorem”}$$
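The area argument is easy to check numerically. The joint probabilities below are hypothetical values chosen only for illustration; they are not the ones behind the slide's figure.

```r
# Hypothetical joint probabilities Pr(X = x, Y = y); rows are X, columns are Y
joint <- matrix(c(0.10, 0.30,
                  0.40, 0.20), nrow = 2, byrow = TRUE,
                dimnames = list(X = c("1", "2"), Y = c("1", "2")))

px <- rowSums(joint)                     # marginal of X
py <- colSums(joint)                     # marginal of Y
y_given_x <- joint / px                  # row i divided by Pr(X = i)
x_given_y <- sweep(joint, 2, py, "/")    # column j divided by Pr(Y = j)

# The same rectangle area computed from the two views
lhs <- y_given_x["1", "1"] * px["1"]     # Pr(Y = 1 | X = 1) * Pr(X = 1)
rhs <- x_given_y["1", "1"] * py["1"]     # Pr(X = 1 | Y = 1) * Pr(Y = 1)
c(lhs, rhs, joint["1", "1"])             # all 0.1

# Rearranged: Pr(X = 1 | Y = 1) from Pr(Y = 1 | X = 1)
bayes <- y_given_x["1", "1"] * px["1"] / py["1"]
all.equal(unname(bayes), x_given_y["1", "1"])   # TRUE
```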
Monty Hall problem
You are on a TV show called ‘Let’s make a Deal’ and the host, Monty Hall, shows you three doors.
◮ Behind one of these doors is a brand new car!
◮ Behind each of the other two doors is a goat!
◮ You get to choose one of the three doors and take home the prize hidden behind it.
Monty Hall problem
You are on ‘Let’s make a Deal’ and the host, Monty Hall, shows you three doors.
◮ You choose a door.
◮ Before the prize is revealed, Monty opens one of the two other doors to reveal . . . a goat!
◮ Monty then offers you the opportunity to change your mind and either keep what’s behind the door you have already selected, or take whatever’s behind the other unopened door.
◮ Is it better to stay with your original choice? Or switch? Does it matter?
Monty Hall problem
You select door C, and then Monty opens door B:
◮ Should you switch? Or does it matter?
◮ Reasoning often goes as follows:
    ◮ You always knew that at least one of doors A and B hides a goat. Knowing which one doesn’t change anything.
    ◮ Or, two doors remain. It doesn’t matter which you choose.
◮ Both seem reasonable.
Monty Hall problem
Let’s frame the possibilities in an outcome tree. Levels are:
1. Monty places the car behind one of three doors.
2. You choose a door.
3. Monty reveals a goat.
Highlighted is the event we have observed: Monty placed the car behind one of A or C, you chose C, and Monty revealed a goat behind B.
Monty Hall problem
We want to determine:
$$\Pr(\text{Car is behind door C} \mid \text{Contestant selects door C and Monty reveals a goat behind door B})$$
Our Frame is that you have already chosen door C.
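The outcome tree within this frame can be enumerated directly. The sketch below rests on assumptions the slides so far have not stated: the car is equally likely to be behind each door, Monty never opens your door or the car's door, and when he has two goat doors to choose from he picks one at random.

```r
doors  <- c("A", "B", "C")
chosen <- "C"                       # the Frame: you have already chosen C

# Every path (car position, door Monty opens) through the tree
paths <- expand.grid(car = doors, opened = doors, stringsAsFactors = FALSE)

paths$prob <- mapply(function(car, opened) {
  allowed <- setdiff(doors, c(chosen, car))   # doors Monty may open
  # Pr(car) * Pr(opened | car, chosen), under the assumptions above
  if (opened %in% allowed) (1 / 3) * (1 / length(allowed)) else 0
}, paths$car, paths$opened)

observed  <- paths$opened == "B"    # Monty reveals a goat behind B
pr_stay   <- sum(paths$prob[observed & paths$car == "C"]) / sum(paths$prob[observed])
pr_switch <- sum(paths$prob[observed & paths$car == "A"]) / sum(paths$prob[observed])

c(stay = pr_stay, switch = pr_switch)   # 1/3 and 2/3
```

Under these assumptions the two remaining doors are not equally likely. Changing how Monty chooses between two goat doors would change the path probabilities and hence the answer.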