Exploring models Categorical data R.W. Oldford 1974 Motor trend - - PowerPoint PPT Presentation
Exploring models Categorical data R.W. Oldford 1974 Motor trend - - PowerPoint PPT Presentation
Exploring models Categorical data R.W. Oldford 1974 Motor trend magazine data Recall the R data set called mtcars . head (mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
1974 Motor trend magazine data
Recall the R data set called mtcars.
head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1
1974 Motor trend magazine data
Recall the R data set called mtcars.
head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1
Some of the variables are binary, for example
◮ am (= 0 if transmission is automatic, = 1 if manual) ◮ vs (= 0 if engine is V-shaped, = 1 if straight)
1974 Motor trend magazine data
Recall the R data set called mtcars.
head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1
Some of the variables are binary, for example
◮ am (= 0 if transmission is automatic, = 1 if manual) ◮ vs (= 0 if engine is V-shaped, = 1 if straight)
What if we thought of either of these as a response variate?
1974 Motor trend magazine data
Recall the R data set called mtcars.
head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1
Some of the variables are binary, for example
◮ am (= 0 if transmission is automatic, = 1 if manual) ◮ vs (= 0 if engine is V-shaped, = 1 if straight)
What if we thought of either of these as a response variate? And perhaps modelled this as a function of, say, the weight of the car.
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission
weight (0 = automatic, 1 = manual)
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission
weight (0 = automatic, 1 = manual)
What is the blue curve an “estimate” of?
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission
weight (0 = automatic, 1 = manual)
What is the blue curve an “estimate” of? The mean of the am variate.
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission
weight (0 = automatic, 1 = manual)
What is the blue curve an “estimate” of? The mean of the am variate. Or, perhaps Pr(Transmission = manual)?
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission
weight (0 = automatic, 1 = manual)
What is the blue curve an “estimate” of? The mean of the am variate. Or, perhaps Pr(Transmission = manual)? Which is E(am) after all (given the way am is recorded).
1974 Motor trend magazine data
We could plot the data and fit a loess function to it.
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission
weight (0 = automatic, 1 = manual)
What is the blue curve an “estimate” of? The mean of the am variate. Or, perhaps Pr(Transmission = manual)? Which is E(am) after all (given the way am is recorded). Note, however, that the curve was not constrained to be between 0 and 1.
1974 Motor trend magazine data
Could do the same for engine type vs (v-shaped or straight)
with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
1974 Motor trend magazine data
Could do the same for engine type vs (v-shaped or straight)
with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Engine type
weight (0 = V−shaped, 1 = straight)
which is a bit weirder.
1974 Motor trend magazine data
Could do the same for engine type vs (v-shaped or straight)
with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Engine type
weight (0 = V−shaped, 1 = straight)
which is a bit weirder. Note also that it goes below 0 and above 1 in places.
1974 Motor trend magazine data
Could do the same for engine type vs (v-shaped or straight)
with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Engine type
weight (0 = V−shaped, 1 = straight)
which is a bit weirder. Note also that it goes below 0 and above 1 in places. What does the blue curve estimate now?
1974 Motor trend magazine data
Could do the same for engine type vs (v-shaped or straight)
with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Engine type
weight (0 = V−shaped, 1 = straight)
which is a bit weirder. Note also that it goes below 0 and above 1 in places. What does the blue curve estimate now? Again. this is mean of the vs variate. Or, given the way vs is recorded, E(vs) = Pr(Engine = straight).
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model.
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before.
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y )
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1)
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1)
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1) = p, say, The natural link function for Bernoulli random variables is the logit, where g(µ) = ln
- µ
1 − µ
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1) = p, say, The natural link function for Bernoulli random variables is the logit, where g(µ) = ln
- µ
1 − µ
- = ln
- p
1 − p
- .
Generalized linear model (glm)
We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1) = p, say, The natural link function for Bernoulli random variables is the logit, where g(µ) = ln
- µ
1 − µ
- = ln
- p
1 − p
- .
This particular response modelling is called logistic regression. Other link functions are also possible. Binomial random variables (sums of Bernoullis) have the same model.
1974 Motor trend magazine data - glm fit
A logistic model is fitted to mtcars (here with am as response) as follows:
fit_am_glm <- glm(am ~ wt, data = mtcars, family = "binomial")
1974 Motor trend magazine data - glm fit
A logistic model is fitted to mtcars (here with am as response) as follows:
fit_am_glm <- glm(am ~ wt, data = mtcars, family = "binomial")
which has a summary much like lm()
1974 Motor trend magazine data - glm fit
A logistic model is fitted to mtcars (here with am as response) as follows:
fit_am_glm <- glm(am ~ wt, data = mtcars, family = "binomial")
which has a summary much like lm()
summary(fit_am_glm) ## ## Call: ## glm(formula = am ~ wt, family = "binomial", data = mtcars) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.11400
- 0.53738
- 0.08811
0.26055 2.19931 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 12.040 4.510 2.670 0.00759 ** ## wt
- 4.024
1.436
- 2.801
0.00509 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 43.230
- n 31
degrees of freedom ## Residual deviance: 19.176
- n 30
degrees of freedom ## AIC: 23.176 ## ## Number of Fisher Scoring iterations: 6
1974 Motor trend magazine data - glm fit
To display the fitted model, we need predictions as before.
x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_am_glm, newdata = data.frame(wt = x), type = "response")
Note the argument type = "response" ensures the prediction is on the original scale (namely that of µ, or p).
1974 Motor trend magazine data - glm fit
To display the fitted model, we need predictions as before.
x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_am_glm, newdata = data.frame(wt = x), type = "response")
Note the argument type = "response" ensures the prediction is on the original scale (namely that of µ, or p).
with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) lines(x, y, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission
weight (0 = automatic, 1 = manual)
The predicted values are probabilities as a function of wt.
1974 Motor trend magazine data - glm fit
Could have had a more complex model, say a polynomial (on the logit scale):
fit_am3_glm <- glm(am ~ poly(wt, 3), data = mtcars, family = "binomial") x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y2 <- predict(fit_am3_glm, newdata = data.frame(wt = x), type = "response") with(mtcars, plot(wt, am, main = "Transmission (cubic fit)", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) lines(x, y2, col = "steelblue", lwd = 2)
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Transmission (cubic fit)
weight (0 = automatic, 1 = manual)
Which is slightly different (higher order terms not significant here).
1974 Motor trend magazine data - glm fit
Fitting the other binary response vs now using a logistic model:
fit_vs_glm <- glm(vs ~ wt, data = mtcars, family = "binomial")
1974 Motor trend magazine data - glm fit
Fitting the other binary response vs now using a logistic model:
fit_vs_glm <- glm(vs ~ wt, data = mtcars, family = "binomial")
and plotting
1974 Motor trend magazine data - glm fit
Fitting the other binary response vs now using a logistic model:
fit_vs_glm <- glm(vs ~ wt, data = mtcars, family = "binomial")
and plotting
2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0
Engine
weight (0 = V−shaped, 1 = straight)
which at least conforms to producing probability estimates between 0 and 1.
1974 Motor trend magazine data - glm fit with factors
Note that our binary variates were represented as numeric vectors and not as factors. We can fix that by coercion and adding these as new variates to mtcars as follows:
mtcars$Transmission <- factor(mtcars$am, labels = c("automatic", "manual")) mtcars$Engine <- factor(mtcars$vs, labels = c("V-shaped", "straight"))
1974 Motor trend magazine data - glm fit with factors
Note that our binary variates were represented as numeric vectors and not as factors. We can fix that by coercion and adding these as new variates to mtcars as follows:
mtcars$Transmission <- factor(mtcars$am, labels = c("automatic", "manual")) mtcars$Engine <- factor(mtcars$vs, labels = c("V-shaped", "straight"))
These can be fit via glm() as before:
fit_trans_glm <- glm(Transmission ~ wt, data = mtcars, family = "binomial") fit_engine_glm <- glm(Engine ~ wt, data = mtcars, family = "binomial")
1974 Motor trend magazine data - glm fit with factors
Note that our binary variates were represented as numeric vectors and not as factors. We can fix that by coercion and adding these as new variates to mtcars as follows:
mtcars$Transmission <- factor(mtcars$am, labels = c("automatic", "manual")) mtcars$Engine <- factor(mtcars$vs, labels = c("V-shaped", "straight"))
These can be fit via glm() as before:
fit_trans_glm <- glm(Transmission ~ wt, data = mtcars, family = "binomial") fit_engine_glm <- glm(Engine ~ wt, data = mtcars, family = "binomial")
The first of the two levels denotes “failure”, the second (or all others for multiple categories) denotes “success”. Here
levels(mtcars$Transmission) ## [1] "automatic" "manual" levels(mtcars$Engine) ## [1] "V-shaped" "straight"
1974 Motor trend magazine data - glm fit with factors
Plotting takes a little more care though. We need to add one to the prediction since levels correspond to values 1 and 2.
1974 Motor trend magazine data - glm fit with factors
Plotting takes a little more care though. We need to add one to the prediction since levels correspond to values 1 and 2.
with(mtcars, plot(wt, Transmission, main = "Transmission as factor", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight")) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_trans_glm, newdata = data.frame(wt = x), type = "response") lines(x, y + 1, col = "steelblue", lwd = 2)
2 3 4 5 1.0 1.2 1.4 1.6 1.8 2.0
Transmission as factor
weight Transmission
Similarly for the factor Engine.
1974 Motor trend magazine data - glm fit with factors
Plotting takes a little more care though. We need to add one to the prediction since levels correspond to values 1 and 2.
with(mtcars, plot(wt, Transmission, main = "Transmission as factor", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight")) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_trans_glm, newdata = data.frame(wt = x), type = "response") lines(x, y + 1, col = "steelblue", lwd = 2)
2 3 4 5 1.0 1.2 1.4 1.6 1.8 2.0
Transmission as factor
weight Transmission
Similarly for the factor Engine. See help(glm) for more variations.
Interactive exploration
Don’t forget that we have already seen how to explore the relationship between a binary response and a continuous explanatory variate interactively in loon . For example, library(loon) h_wt <- with(mtcars, l_hist(wt, linkingGroup = "mtcars")) h_engine <- with(mtcars, l_hist(Engine, linkingGroup = "mtcars")) h_trans <- with(mtcars, l_hist(Transmission, linkingGroup = "mtcars"))
Interactive exploration
For example, effect of different wt groups on either Transmission or Engine
library(gridExtra) wt_grps <- cut(mtcars$wt, 5) h_wt["color"] <- wt_grps grid.arrange(plot(h_wt, draw = FALSE), plot(h_trans, draw = FALSE), plot(h_engine, draw = FALSE), nrow = 1)
wt Frequency Transmission Frequency
automatic manual
Engine Frequency
V−shaped straight
Interactive exploration
Or, condition on Transmission choice:
library(gridExtra) h_wt["color"] <- mtcars$Transmission grid.arrange(plot(h_wt, draw = FALSE), plot(h_trans, draw = FALSE), plot(h_engine, draw = FALSE), nrow = 1)
wt Frequency Transmission Frequency
automatic manual
Engine Frequency
V−shaped straight
Interactive exploration
Or, alternatively, condition on Engine choice:
library(gridExtra) h_wt["color"] <- mtcars$Engine grid.arrange(plot(h_wt, draw = FALSE), plot(h_trans, draw = FALSE), plot(h_engine, draw = FALSE), nrow = 1)
wt Frequency Transmission Frequency
automatic manual
Engine Frequency
V−shaped straight
Spine plots
Another way to display binary responses as a function of a continuous response is via spineplot(). It requires the response to be a factor.
spineplot(Transmission ~ wt, data = mtcars)
Spine plots
Another way to display binary responses as a function of a continuous response is via spineplot(). It requires the response to be a factor.
spineplot(Transmission ~ wt, data = mtcars) wt Transmission 1.5 2 2.5 3 3.5 4 5.5 automatic manual 0.0 0.2 0.4 0.6 0.8 1.0
Spine plots
Another way to display binary responses as a function of a continuous response is via spineplot(). It requires the response to be a factor.
spineplot(Transmission ~ wt, data = mtcars) wt Transmission 1.5 2 2.5 3 3.5 4 5.5 automatic manual 0.0 0.2 0.4 0.6 0.8 1.0 Note:
◮ each interval has identical range ◮ each interval width is proportional to the number of observations, and ◮ the bar heights are coloured by the proportion of each level in that interval. ◮ Conceptually, this is a plot of Pr(y|x) versus P(x).
Spine plots
And for the factor Engine
spineplot(Engine ~ wt, data = mtcars) wt Engine 1.5 2 2.5 3 3.5 4 5.5 V−shaped straight 0.0 0.2 0.4 0.6 0.8 1.0
Again, conceptually, spine plots plot of Pr(y|x) versus P(x)
Eikosograms - the probability picture
This plot is very similar to the spine plot with the following differences:
◮ both response and explanatory variates are categorical ◮ there can be as many explanatory variates as the size of the display will
support
◮ probabilities can be shown on both horizontal and vertical axes
The display is available from the R package eikosograms available on CRAN.
Eikosograms - the probability picture
This plot is very similar to the spine plot with the following differences:
◮ both response and explanatory variates are categorical ◮ there can be as many explanatory variates as the size of the display will
support
◮ probabilities can be shown on both horizontal and vertical axes
The display is available from the R package eikosograms available on CRAN.
library(eikosograms) eikos(Transmission ~ Engine, data = mtcars)
Transmission automatic manual
0.67 0.5
V−shaped straight Engine
0.56
Eikosograms - the probability picture
The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X)
Eikosograms - the probability picture
The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X) We could have just as easily written the joint probability as Pr(X, Y ) = Pr(X | Y ) × Pr(Y )
Eikosograms - the probability picture
The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X) We could have just as easily written the joint probability as Pr(X, Y ) = Pr(X | Y ) × Pr(Y ) These would be two different but equivalent (in the sense that one can be derived from the other) eikosograms.
Eikosograms - the probability picture
The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X) We could have just as easily written the joint probability as Pr(X, Y ) = Pr(X | Y ) × Pr(Y ) These would be two different but equivalent (in the sense that one can be derived from the other) eikosograms.
e1 <- eikos(Transmission ~ Engine, data = mtcars, draw = FALSE) e2 <- eikos(Engine ~ Transmission, data = mtcars, draw = FALSE)
Eikosograms - the probability picture
The two eikosograms are
grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))
Transmission automatic manual
0.67 0.5
V−shaped straight Engine
0.56
Engine V−shaped straight
0.63 0.46
automatic manual Transmission
0.59
One can be derived from the other.
Eikosograms - the probability picture
The two eikosograms are
grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))
Transmission automatic manual
0.67 0.5
V−shaped straight Engine
0.56
Engine V−shaped straight
0.63 0.46
automatic manual Transmission
0.59
One can be derived from the other. Matching (X, Y ) areas in both diagrams
Eikosograms - the probability picture
The two eikosograms are
grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))
Transmission automatic manual
0.67 0.5
V−shaped straight Engine
0.56
Engine V−shaped straight
0.63 0.46
automatic manual Transmission
0.59
One can be derived from the other. Matching (X, Y ) areas in both diagrams
area rectangle in left eikosogram = area of corresponding rectangle in right eikosogram
Eikosograms - the probability picture
The two eikosograms are
grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))
Transmission automatic manual
0.67 0.5
V−shaped straight Engine
0.56
Engine V−shaped straight
0.63 0.46
automatic manual Transmission
0.59
One can be derived from the other. Matching (X, Y ) areas in both diagrams
area rectangle in left eikosogram = area of corresponding rectangle in right eikosogram height × width = height × width of corresponding rectangle
Eikosograms - the probability picture
The two eikosograms are
grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))
Transmission automatic manual
0.67 0.5
V−shaped straight Engine
0.56
Engine V−shaped straight
0.63 0.46
automatic manual Transmission
0.59
One can be derived from the other. Matching (X, Y ) areas in both diagrams
area rectangle in left eikosogram = area of corresponding rectangle in right eikosogram height × width = height × width of corresponding rectangle Pr(Y | X) × Pr(X) = Pr(X | Y ) × Pr(Y )
- r Bayes’s theorem.
Eikosograms - tables
A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.
table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors
Eikosograms - tables
A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.
table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7
Eikosograms - tables
A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.
table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response
Eikosograms - tables
A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.
table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7
Eikosograms - tables
A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.
table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7 Cross tabulation does a little more
Eikosograms - tables
A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.
table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7 Cross tabulation does a little more table3 <- xtabs(wt ~ Engine + Transmission, data = mtcars) # response summed
Eikosograms - tables
A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.
table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7 Cross tabulation does a little more table3 <- xtabs(wt ~ Engine + Transmission, data = mtcars) # response summed round(table3/table2, 2) # Average weights ## Transmission ## Engine automatic manual ## V-shaped 4.10 2.86 ## straight 3.19 2.03 See help("table"), help("xtabs") and also help(tabulate)
Tabulating admissions - sex discrimination at grad school?
A well known table of counts is UCBAdmissions which records the number of applications and admissions to several large graduate programmes at the University of California (Berkeley) in 1973. This is a table data structure of counts cross classified by three different factors:
dim(UCBAdmissions) ## [1] 2 2 6 dimnames(UCBAdmissions) ## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F"
Tabulating admissions - sex discrimination at grad school?
A well known table of counts is UCBAdmissions which records the number of applications and admissions to several large graduate programmes at the University of California (Berkeley) in 1973. This is a table data structure of counts cross classified by three different factors:
dim(UCBAdmissions) ## [1] 2 2 6 dimnames(UCBAdmissions) ## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F" Based on the data in this table, the question of whether sexism played a role in admission to graduate school at Berkeley.
Tabulating admissions - sex discrimination at grad school?
A well known table of counts is UCBAdmissions which records the number of applications and admissions to several large graduate programmes at the University of California (Berkeley) in 1973. This is a table data structure of counts cross classified by three different factors:
dim(UCBAdmissions) ## [1] 2 2 6 dimnames(UCBAdmissions) ## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F" Based on the data in this table, the question of whether sexism played a role in admission to graduate school at Berkeley. This claim can be explored using eikosgrams.
Eikosograms - sex discrimination at grad school?
First, to look at the admission rate we use eikos("Admit", data = UCBAdmissions)
Eikosograms - sex discrimination at grad school?
First, to look at the admission rate we use eikos("Admit", data = UCBAdmissions)
Admit Admitted Rejected
0.39
Eikosograms - sex discrimination at grad school?
First, to look at the admission rate we use eikos("Admit", data = UCBAdmissions)
Admit Admitted Rejected
0.39
Comments?
Eikosograms - sex discrimination at grad school?
But, what if we break this down by the sex of the applicant?
Eikosograms - sex discrimination at grad school?
But, what if we break this down by the sex of the applicant? eikos(Admit ~ Gender , data = UCBAdmissions)
Eikosograms - sex discrimination at grad school?
But, what if we break this down by the sex of the applicant? eikos(Admit ~ Gender , data = UCBAdmissions)
Admit Admitted Rejected
0.45 0.3
Male Female Gender
0.59
Comments?
Eikosograms - sex discrimination at grad school?
Recall that there are several different departments
Eikosograms - sex discrimination at grad school?
Recall that there are several different departments eikos("Dept" , data = UCBAdmissions)
Eikosograms - sex discrimination at grad school?
Recall that there are several different departments eikos("Dept" , data = UCBAdmissions)
Dept A B C D E F
0.21 0.34 0.54 0.71 0.84
Comments?
Eikosograms - sex discrimination at grad school?
How do admissions for the two sexes fare across departments?
Eikosograms - sex discrimination at grad school?
How do admissions for the two sexes fare across departments? Males: eikos(Admit ~ Dept, data = UCBAdmissions[,"Male",])
Eikosograms - sex discrimination at grad school?
How do admissions for the two sexes fare across departments? Males: eikos(Admit ~ Dept, data = UCBAdmissions[,"Male",])
Admit Admitted Rejected
0.62 0.63 0.37 0.33 0.28 0.06
A B C D E F Dept
0.31 0.51 0.64 0.79 0.86
Comments?
Eikosograms - sex discrimination at grad school?
How do admissions for the two sexes fare across departments?
Eikosograms - sex discrimination at grad school?
How do admissions for the two sexes fare across departments? Females: eikos(Admit ~ Dept, data = UCBAdmissions[,"Female",])
Eikosograms - sex discrimination at grad school?
How do admissions for the two sexes fare across departments? Females: eikos(Admit ~ Dept, data = UCBAdmissions[,"Female",])
Admit Admitted Rejected
0.82 0.68 0.34 0.35 0.24 0.07
A B C D E F Dept
0.06 0.07 0.4 0.6 0.81
Comments?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department A: eikos(Admit ~ Gender, data = UCBAdmissions[,,"A"])
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department A: eikos(Admit ~ Gender, data = UCBAdmissions[,,"A"])
Admit Admitted Rejected
0.62 0.82
Male Female Gender
0.88
Comments?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department B: eikos(Admit ~ Gender, data = UCBAdmissions[,,"B"])
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department B: eikos(Admit ~ Gender, data = UCBAdmissions[,,"B"])
Admit Admitted Rejected
0.63 0.68
Male Female Gender
0.96
Comments?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department C: eikos(Admit ~ Gender, data = UCBAdmissions[,,"C"])
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department C: eikos(Admit ~ Gender, data = UCBAdmissions[,,"C"])
Admit Admitted Rejected
0.37 0.34
Male Female Gender
0.35
Comments?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department D: eikos(Admit ~ Gender, data = UCBAdmissions[,,"D"])
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department D: eikos(Admit ~ Gender, data = UCBAdmissions[,,"D"])
Admit Admitted Rejected
0.33 0.35
Male Female Gender
0.53
Comments?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department E: eikos(Admit ~ Gender, data = UCBAdmissions[,,"E"])
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department E: eikos(Admit ~ Gender, data = UCBAdmissions[,,"E"])
Admit Admitted Rejected
0.28 0.24
Male Female Gender
0.33
Comments?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department?
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department F: eikos(Admit ~ Gender, data = UCBAdmissions[,,"F"])
Eikosograms - sex discrimination at grad school?
How above comparing sexes by department? Department F: eikos(Admit ~ Gender, data = UCBAdmissions[,,"F"])
Admit Admitted Rejected
0.06 0.07
Male Female Gender
0.52
Comments?
Eikosograms - sex discrimination at grad school?
Admission rates by department
Eikosograms - sex discrimination at grad school?
Admission rates by department eikos(Admit ~ Dept, data = UCBAdmissions)
Eikosograms - sex discrimination at grad school?
Admission rates by department eikos(Admit ~ Dept, data = UCBAdmissions)
Admit Admitted Rejected
0.64 0.63 0.35 0.34 0.25 0.06
A B C D E F Dept
0.21 0.34 0.54 0.71 0.84
Comments?
Eikosograms - sex discrimination at grad school?
Sex distribution by department
Eikosograms - sex discrimination at grad school?
Sex distribution by department eikos(Gender ~ Dept, data = UCBAdmissions)
Eikosograms - sex discrimination at grad school?
Sex distribution by department eikos(Gender ~ Dept, data = UCBAdmissions)
Gender Male Female
0.88 0.96 0.35 0.53 0.33 0.52
A B C D E F Dept
0.21 0.34 0.54 0.71 0.84
Comments?
Eikosograms - sex discrimination at grad school?
Sex distribution by department and admission by department side by side
Gender Male Female
0.88 0.96 0.35 0.53 0.33 0.52
A B C D E F Dept
0.21 0.34 0.54 0.71 0.84
Admit Admitted Rejected
0.64 0.63 0.35 0.34 0.25 0.06
A B C D E F Dept
0.21 0.34 0.54 0.71 0.84
Comments?
Eikosograms - sex discrimination at grad school?
We can put this together in a single eikosogram: eikos(Admit ~ Dept + Gender, data = UCBAdmissions, xlab_rot = 45)
Eikosograms - sex discrimination at grad school?
We can put this together in a single eikosogram: eikos(Admit ~ Dept + Gender, data = UCBAdmissions, xlab_rot = 45)
Admit Admitted Rejected
0.62 0.63 0.37 0.33 0.28 0.06 0.82 0.68 0.34 0.35 0.24 0.07
A A B B C C D D E E F F M a l e F e m a l e Dept Gender
0.18 0.31 0.38 0.47 0.51 0.59 0.62 0.62 0.75 0.84 0.92
Comments?
Eikosograms - sex discrimination at grad school?
Or equivalently, making admission rates more easily comparable: eikos(Admit ~ Gender + Dept, data = UCBAdmissions, xlab_rot = 45)
Eikosograms - sex discrimination at grad school?
Or equivalently, making admission rates more easily comparable: eikos(Admit ~ Gender + Dept, data = UCBAdmissions, xlab_rot = 45)
Admit Admitted Rejected
0.62 0.82 0.63 0.68 0.37 0.34 0.33 0.35 0.28 0.24 0.06 0.07
M a l e M a l e M a l e M a l e M a l e M a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e A B C D E F Gender Dept
0.18 0.21 0.33 0.34 0.41 0.54 0.63 0.71 0.76 0.84 0.92
Comments?
Eikosograms - sex discrimination at grad school?
Or equivalently, making admission rates more easily comparable: eikos(Admit ~ Gender + Dept, data = UCBAdmissions, xlab_rot = 45)
Admit Admitted Rejected
0.62 0.82 0.63 0.68 0.37 0.34 0.33 0.35 0.28 0.24 0.06 0.07
M a l e M a l e M a l e M a l e M a l e M a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e A B C D E F Gender Dept
0.18 0.21 0.33 0.34 0.41 0.54 0.63 0.71 0.76 0.84 0.92
Comments?
Females applied to the more difficult programmes and hence fewer were admitted overall. A greater proportion of females were admitted in most programmes. No evidence of sex discrimation in favour
- f males (possibly the opposite but this would require a statistical test).