Exploring models Categorical data R.W. Oldford 1974 Motor trend - - PowerPoint PPT Presentation

exploring models
SMART_READER_LITE
LIVE PREVIEW

Exploring models Categorical data R.W. Oldford 1974 Motor trend - - PowerPoint PPT Presentation

Exploring models Categorical data R.W. Oldford 1974 Motor trend magazine data Recall the R data set called mtcars . head (mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4


slide-1
SLIDE 1

Exploring models

Categorical data R.W. Oldford

slide-2
SLIDE 2

1974 Motor trend magazine data

Recall the R data set called mtcars.

head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1

slide-3
SLIDE 3

1974 Motor trend magazine data

Recall the R data set called mtcars.

head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1

Some of the variables are binary, for example

◮ am (= 0 if transmission is automatic, = 1 if manual) ◮ vs (= 0 if engine is V-shaped, = 1 if straight)

slide-4
SLIDE 4

1974 Motor trend magazine data

Recall the R data set called mtcars.

head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1

Some of the variables are binary, for example

◮ am (= 0 if transmission is automatic, = 1 if manual) ◮ vs (= 0 if engine is V-shaped, = 1 if straight)

What if we thought of either of these as a response variate?

slide-5
SLIDE 5

1974 Motor trend magazine data

Recall the R data set called mtcars.

head(mtcars) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1

Some of the variables are binary, for example

◮ am (= 0 if transmission is automatic, = 1 if manual) ◮ vs (= 0 if engine is V-shaped, = 1 if straight)

What if we thought of either of these as a response variate? And perhaps modelled this as a function of, say, the weight of the car.

slide-6
SLIDE 6

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

slide-7
SLIDE 7

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

slide-8
SLIDE 8

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission

weight (0 = automatic, 1 = manual)

slide-9
SLIDE 9

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission

weight (0 = automatic, 1 = manual)

What is the blue curve an “estimate” of?

slide-10
SLIDE 10

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission

weight (0 = automatic, 1 = manual)

What is the blue curve an “estimate” of? The mean of the am variate.

slide-11
SLIDE 11

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission

weight (0 = automatic, 1 = manual)

What is the blue curve an “estimate” of? The mean of the am variate. Or, perhaps Pr(Transmission = manual)?

slide-12
SLIDE 12

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission

weight (0 = automatic, 1 = manual)

What is the blue curve an “estimate” of? The mean of the am variate. Or, perhaps Pr(Transmission = manual)? Which is E(am) after all (given the way am is recorded).

slide-13
SLIDE 13

1974 Motor trend magazine data

We could plot the data and fit a loess function to it.

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) fit1 <- loess(am ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit1, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission

weight (0 = automatic, 1 = manual)

What is the blue curve an “estimate” of? The mean of the am variate. Or, perhaps Pr(Transmission = manual)? Which is E(am) after all (given the way am is recorded). Note, however, that the curve was not constrained to be between 0 and 1.

slide-14
SLIDE 14

1974 Motor trend magazine data

Could do the same for engine type vs (v-shaped or straight)

with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

slide-15
SLIDE 15

1974 Motor trend magazine data

Could do the same for engine type vs (v-shaped or straight)

with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Engine type

weight (0 = V−shaped, 1 = straight)

which is a bit weirder.

slide-16
SLIDE 16

1974 Motor trend magazine data

Could do the same for engine type vs (v-shaped or straight)

with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Engine type

weight (0 = V−shaped, 1 = straight)

which is a bit weirder. Note also that it goes below 0 and above 1 in places.

slide-17
SLIDE 17

1974 Motor trend magazine data

Could do the same for engine type vs (v-shaped or straight)

with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Engine type

weight (0 = V−shaped, 1 = straight)

which is a bit weirder. Note also that it goes below 0 and above 1 in places. What does the blue curve estimate now?

slide-18
SLIDE 18

1974 Motor trend magazine data

Could do the same for engine type vs (v-shaped or straight)

with(mtcars, plot(wt, vs, main = "Engine type", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = V-shaped, 1 = straight)")) fit2 <- loess(vs ~ wt, data = mtcars) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit2, newdata = data.frame(wt = x)) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Engine type

weight (0 = V−shaped, 1 = straight)

which is a bit weirder. Note also that it goes below 0 and above 1 in places. What does the blue curve estimate now? Again. this is mean of the vs variate. Or, given the way vs is recorded, E(vs) = Pr(Engine = straight).

slide-19
SLIDE 19

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model.

slide-20
SLIDE 20

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before.

slide-21
SLIDE 21

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y )

slide-22
SLIDE 22

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1)

slide-23
SLIDE 23

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1)

slide-24
SLIDE 24

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1) = p, say, The natural link function for Bernoulli random variables is the logit, where g(µ) = ln

  • µ

1 − µ

slide-25
SLIDE 25

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1) = p, say, The natural link function for Bernoulli random variables is the logit, where g(µ) = ln

  • µ

1 − µ

  • = ln
  • p

1 − p

  • .
slide-26
SLIDE 26

Generalized linear model (glm)

We could also use a parametric model, such as a generalized linear model. Recall a generalized linear model models a known function of the mean µ, say g(µ), called the link function and we model g(µ(x1, . . . , xq)) = θ0 + θ1x1 + · · · + θpxp with everything else as before. If Y ∈ {0, 1} is a Bernoulli random variable then µ = E(Y ) = 0 × Pr(Y = 0) + 1 × Pr(Y = 1) = Pr(Y = 1) = p, say, The natural link function for Bernoulli random variables is the logit, where g(µ) = ln

  • µ

1 − µ

  • = ln
  • p

1 − p

  • .

This particular response modelling is called logistic regression. Other link functions are also possible. Binomial random variables (sums of Bernoullis) have the same model.

slide-27
SLIDE 27

1974 Motor trend magazine data - glm fit

A logistic model is fitted to mtcars (here with am as response) as follows:

fit_am_glm <- glm(am ~ wt, data = mtcars, family = "binomial")

slide-28
SLIDE 28

1974 Motor trend magazine data - glm fit

A logistic model is fitted to mtcars (here with am as response) as follows:

fit_am_glm <- glm(am ~ wt, data = mtcars, family = "binomial")

which has a summary much like lm()

slide-29
SLIDE 29

1974 Motor trend magazine data - glm fit

A logistic model is fitted to mtcars (here with am as response) as follows:

fit_am_glm <- glm(am ~ wt, data = mtcars, family = "binomial")

which has a summary much like lm()

summary(fit_am_glm) ## ## Call: ## glm(formula = am ~ wt, family = "binomial", data = mtcars) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.11400

  • 0.53738
  • 0.08811

0.26055 2.19931 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 12.040 4.510 2.670 0.00759 ** ## wt

  • 4.024

1.436

  • 2.801

0.00509 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 43.230

  • n 31

degrees of freedom ## Residual deviance: 19.176

  • n 30

degrees of freedom ## AIC: 23.176 ## ## Number of Fisher Scoring iterations: 6

slide-30
SLIDE 30

1974 Motor trend magazine data - glm fit

To display the fitted model, we need predictions as before.

x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_am_glm, newdata = data.frame(wt = x), type = "response")

Note the argument type = "response" ensures the prediction is on the original scale (namely that of µ, or p).

slide-31
SLIDE 31

1974 Motor trend magazine data - glm fit

To display the fitted model, we need predictions as before.

x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_am_glm, newdata = data.frame(wt = x), type = "response")

Note the argument type = "response" ensures the prediction is on the original scale (namely that of µ, or p).

with(mtcars, plot(wt, am, main = "Transmission", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) lines(x, y, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission

weight (0 = automatic, 1 = manual)

The predicted values are probabilities as a function of wt.

slide-32
SLIDE 32

1974 Motor trend magazine data - glm fit

Could have had a more complex model, say a polynomial (on the logit scale):

fit_am3_glm <- glm(am ~ poly(wt, 3), data = mtcars, family = "binomial") x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y2 <- predict(fit_am3_glm, newdata = data.frame(wt = x), type = "response") with(mtcars, plot(wt, am, main = "Transmission (cubic fit)", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight", ylab = "(0 = automatic, 1 = manual)")) lines(x, y2, col = "steelblue", lwd = 2)

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Transmission (cubic fit)

weight (0 = automatic, 1 = manual)

Which is slightly different (higher order terms not significant here).

slide-33
SLIDE 33

1974 Motor trend magazine data - glm fit

Fitting the other binary response vs now using a logistic model:

fit_vs_glm <- glm(vs ~ wt, data = mtcars, family = "binomial")

slide-34
SLIDE 34

1974 Motor trend magazine data - glm fit

Fitting the other binary response vs now using a logistic model:

fit_vs_glm <- glm(vs ~ wt, data = mtcars, family = "binomial")

and plotting

slide-35
SLIDE 35

1974 Motor trend magazine data - glm fit

Fitting the other binary response vs now using a logistic model:

fit_vs_glm <- glm(vs ~ wt, data = mtcars, family = "binomial")

and plotting

2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0

Engine

weight (0 = V−shaped, 1 = straight)

which at least conforms to producing probability estimates between 0 and 1.

slide-36
SLIDE 36

1974 Motor trend magazine data - glm fit with factors

Note that our binary variates were represented as numeric vectors and not as factors. We can fix that by coercion and adding these as new variates to mtcars as follows:

mtcars$Transmission <- factor(mtcars$am, labels = c("automatic", "manual")) mtcars$Engine <- factor(mtcars$vs, labels = c("V-shaped", "straight"))

slide-37
SLIDE 37

1974 Motor trend magazine data - glm fit with factors

Note that our binary variates were represented as numeric vectors and not as factors. We can fix that by coercion and adding these as new variates to mtcars as follows:

mtcars$Transmission <- factor(mtcars$am, labels = c("automatic", "manual")) mtcars$Engine <- factor(mtcars$vs, labels = c("V-shaped", "straight"))

These can be fit via glm() as before:

fit_trans_glm <- glm(Transmission ~ wt, data = mtcars, family = "binomial") fit_engine_glm <- glm(Engine ~ wt, data = mtcars, family = "binomial")

slide-38
SLIDE 38

1974 Motor trend magazine data - glm fit with factors

Note that our binary variates were represented as numeric vectors and not as factors. We can fix that by coercion and adding these as new variates to mtcars as follows:

mtcars$Transmission <- factor(mtcars$am, labels = c("automatic", "manual")) mtcars$Engine <- factor(mtcars$vs, labels = c("V-shaped", "straight"))

These can be fit via glm() as before:

fit_trans_glm <- glm(Transmission ~ wt, data = mtcars, family = "binomial") fit_engine_glm <- glm(Engine ~ wt, data = mtcars, family = "binomial")

The first of the two levels denotes “failure”, the second (or all others for multiple categories) denotes “success”. Here

levels(mtcars$Transmission) ## [1] "automatic" "manual" levels(mtcars$Engine) ## [1] "V-shaped" "straight"

slide-39
SLIDE 39

1974 Motor trend magazine data - glm fit with factors

Plotting takes a little more care though. We need to add one to the prediction since levels correspond to values 1 and 2.

slide-40
SLIDE 40

1974 Motor trend magazine data - glm fit with factors

Plotting takes a little more care though. We need to add one to the prediction since levels correspond to values 1 and 2.

with(mtcars, plot(wt, Transmission, main = "Transmission as factor", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight")) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_trans_glm, newdata = data.frame(wt = x), type = "response") lines(x, y + 1, col = "steelblue", lwd = 2)

2 3 4 5 1.0 1.2 1.4 1.6 1.8 2.0

Transmission as factor

weight Transmission

Similarly for the factor Engine.

slide-41
SLIDE 41

1974 Motor trend magazine data - glm fit with factors

Plotting takes a little more care though. We need to add one to the prediction since levels correspond to values 1 and 2.

with(mtcars, plot(wt, Transmission, main = "Transmission as factor", col = adjustcolor("black", alpha.f = 0.7), pch = 21, cex = 2, xlab = "weight")) x <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200) y <- predict(fit_trans_glm, newdata = data.frame(wt = x), type = "response") lines(x, y + 1, col = "steelblue", lwd = 2)

2 3 4 5 1.0 1.2 1.4 1.6 1.8 2.0

Transmission as factor

weight Transmission

Similarly for the factor Engine. See help(glm) for more variations.

slide-42
SLIDE 42

Interactive exploration

Don’t forget that we have already seen how to explore the relationship between a binary response and a continuous explanatory variate interactively in loon . For example, library(loon) h_wt <- with(mtcars, l_hist(wt, linkingGroup = "mtcars")) h_engine <- with(mtcars, l_hist(Engine, linkingGroup = "mtcars")) h_trans <- with(mtcars, l_hist(Transmission, linkingGroup = "mtcars"))

slide-43
SLIDE 43

Interactive exploration

For example, effect of different wt groups on either Transmission or Engine

library(gridExtra) wt_grps <- cut(mtcars$wt, 5) h_wt["color"] <- wt_grps grid.arrange(plot(h_wt, draw = FALSE), plot(h_trans, draw = FALSE), plot(h_engine, draw = FALSE), nrow = 1)

wt Frequency Transmission Frequency

automatic manual

Engine Frequency

V−shaped straight

slide-44
SLIDE 44

Interactive exploration

Or, condition on Transmission choice:

library(gridExtra) h_wt["color"] <- mtcars$Transmission grid.arrange(plot(h_wt, draw = FALSE), plot(h_trans, draw = FALSE), plot(h_engine, draw = FALSE), nrow = 1)

wt Frequency Transmission Frequency

automatic manual

Engine Frequency

V−shaped straight

slide-45
SLIDE 45

Interactive exploration

Or, alternatively, condition on Engine choice:

library(gridExtra) h_wt["color"] <- mtcars$Engine grid.arrange(plot(h_wt, draw = FALSE), plot(h_trans, draw = FALSE), plot(h_engine, draw = FALSE), nrow = 1)

wt Frequency Transmission Frequency

automatic manual

Engine Frequency

V−shaped straight

slide-46
SLIDE 46

Spine plots

Another way to display binary responses as a function of a continuous response is via spineplot(). It requires the response to be a factor.

spineplot(Transmission ~ wt, data = mtcars)

slide-47
SLIDE 47

Spine plots

Another way to display binary responses as a function of a continuous response is via spineplot(). It requires the response to be a factor.

spineplot(Transmission ~ wt, data = mtcars) wt Transmission 1.5 2 2.5 3 3.5 4 5.5 automatic manual 0.0 0.2 0.4 0.6 0.8 1.0

slide-48
SLIDE 48

Spine plots

Another way to display binary responses as a function of a continuous response is via spineplot(). It requires the response to be a factor.

spineplot(Transmission ~ wt, data = mtcars) wt Transmission 1.5 2 2.5 3 3.5 4 5.5 automatic manual 0.0 0.2 0.4 0.6 0.8 1.0 Note:

◮ each interval has identical range ◮ each interval width is proportional to the number of observations, and ◮ the bar heights are coloured by the proportion of each level in that interval. ◮ Conceptually, this is a plot of Pr(y|x) versus P(x).

slide-49
SLIDE 49

Spine plots

And for the factor Engine

spineplot(Engine ~ wt, data = mtcars) wt Engine 1.5 2 2.5 3 3.5 4 5.5 V−shaped straight 0.0 0.2 0.4 0.6 0.8 1.0

Again, conceptually, spine plots plot of Pr(y|x) versus P(x)

slide-50
SLIDE 50

Eikosograms - the probability picture

This plot is very similar to the spine plot with the following differences:

◮ both response and explanatory variates are categorical ◮ there can be as many explanatory variates as the size of the display will

support

◮ probabilities can be shown on both horizontal and vertical axes

The display is available from the R package eikosograms available on CRAN.

slide-51
SLIDE 51

Eikosograms - the probability picture

This plot is very similar to the spine plot with the following differences:

◮ both response and explanatory variates are categorical ◮ there can be as many explanatory variates as the size of the display will

support

◮ probabilities can be shown on both horizontal and vertical axes

The display is available from the R package eikosograms available on CRAN.

library(eikosograms) eikos(Transmission ~ Engine, data = mtcars)

Transmission automatic manual

0.67 0.5

V−shaped straight Engine

0.56

slide-52
SLIDE 52

Eikosograms - the probability picture

The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X)

slide-53
SLIDE 53

Eikosograms - the probability picture

The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X) We could have just as easily written the joint probability as Pr(X, Y ) = Pr(X | Y ) × Pr(Y )

slide-54
SLIDE 54

Eikosograms - the probability picture

The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X) We could have just as easily written the joint probability as Pr(X, Y ) = Pr(X | Y ) × Pr(Y ) These would be two different but equivalent (in the sense that one can be derived from the other) eikosograms.

slide-55
SLIDE 55

Eikosograms - the probability picture

The eikosogram shows the joint distribution of X and Y by showing Pr(X, Y ) = Pr(Y | X) × Pr(X) We could have just as easily written the joint probability as Pr(X, Y ) = Pr(X | Y ) × Pr(Y ) These would be two different but equivalent (in the sense that one can be derived from the other) eikosograms.

e1 <- eikos(Transmission ~ Engine, data = mtcars, draw = FALSE) e2 <- eikos(Engine ~ Transmission, data = mtcars, draw = FALSE)

slide-56
SLIDE 56

Eikosograms - the probability picture

The two eikosograms are

grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))

Transmission automatic manual

0.67 0.5

V−shaped straight Engine

0.56

Engine V−shaped straight

0.63 0.46

automatic manual Transmission

0.59

One can be derived from the other.

slide-57
SLIDE 57

Eikosograms - the probability picture

The two eikosograms are

grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))

Transmission automatic manual

0.67 0.5

V−shaped straight Engine

0.56

Engine V−shaped straight

0.63 0.46

automatic manual Transmission

0.59

One can be derived from the other. Matching (X, Y ) areas in both diagrams

slide-58
SLIDE 58

Eikosograms - the probability picture

The two eikosograms are

grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))

Transmission automatic manual

0.67 0.5

V−shaped straight Engine

0.56

Engine V−shaped straight

0.63 0.46

automatic manual Transmission

0.59

One can be derived from the other. Matching (X, Y ) areas in both diagrams

area rectangle in left eikosogram = area of corresponding rectangle in right eikosogram

slide-59
SLIDE 59

Eikosograms - the probability picture

The two eikosograms are

grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))

Transmission automatic manual

0.67 0.5

V−shaped straight Engine

0.56

Engine V−shaped straight

0.63 0.46

automatic manual Transmission

0.59

One can be derived from the other. Matching (X, Y ) areas in both diagrams

area rectangle in left eikosogram = area of corresponding rectangle in right eikosogram height × width = height × width of corresponding rectangle

slide-60
SLIDE 60

Eikosograms - the probability picture

The two eikosograms are

grid.arrange(e1, e2, nrow = 1, widths = c(0.75, 0.825))

Transmission automatic manual

0.67 0.5

V−shaped straight Engine

0.56

Engine V−shaped straight

0.63 0.46

automatic manual Transmission

0.59

One can be derived from the other. Matching (X, Y ) areas in both diagrams

area rectangle in left eikosogram = area of corresponding rectangle in right eikosogram height × width = height × width of corresponding rectangle Pr(Y | X) × Pr(X) = Pr(X | Y ) × Pr(Y )

  • r Bayes’s theorem.
slide-61
SLIDE 61

Eikosograms - tables

A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.

table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors

slide-62
SLIDE 62

Eikosograms - tables

A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.

table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7

slide-63
SLIDE 63

Eikosograms - tables

A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.

table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response

slide-64
SLIDE 64

Eikosograms - tables

A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.

table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7

slide-65
SLIDE 65

Eikosograms - tables

A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.

table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7 Cross tabulation does a little more

slide-66
SLIDE 66

Eikosograms - tables

A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.

table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7 Cross tabulation does a little more table3 <- xtabs(wt ~ Engine + Transmission, data = mtcars) # response summed

slide-67
SLIDE 67

Eikosograms - tables

A table ia a data structure that summarizes (e.g. counts) of cross-classified factors.

table1 <- table(mtcars$vs, mtcars$am) # vectors intepretable as factors table1 ## ## 1 ## 0 12 6 ## 1 7 7 # Or with a formula via cross-tabulation table2 <- xtabs( ~ Engine + am, data = mtcars) # Note no response table2 ## am ## Engine 1 ## V-shaped 12 6 ## straight 7 7 Cross tabulation does a little more table3 <- xtabs(wt ~ Engine + Transmission, data = mtcars) # response summed round(table3/table2, 2) # Average weights ## Transmission ## Engine automatic manual ## V-shaped 4.10 2.86 ## straight 3.19 2.03 See help("table"), help("xtabs") and also help(tabulate)

slide-68
SLIDE 68

Tabulating admissions - sex discrimination at grad school?

A well known table of counts is UCBAdmissions which records the number of applications and admissions to several large graduate programmes at the University of California (Berkeley) in 1973. This is a table data structure of counts cross classified by three different factors:

dim(UCBAdmissions) ## [1] 2 2 6 dimnames(UCBAdmissions) ## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F"

slide-69
SLIDE 69

Tabulating admissions - sex discrimination at grad school?

A well known table of counts is UCBAdmissions which records the number of applications and admissions to several large graduate programmes at the University of California (Berkeley) in 1973. This is a table data structure of counts cross classified by three different factors:

dim(UCBAdmissions) ## [1] 2 2 6 dimnames(UCBAdmissions) ## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F" Based on the data in this table, the question of whether sexism played a role in admission to graduate school at Berkeley.

slide-70
SLIDE 70

Tabulating admissions - sex discrimination at grad school?

A well known table of counts is UCBAdmissions which records the number of applications and admissions to several large graduate programmes at the University of California (Berkeley) in 1973. This is a table data structure of counts cross classified by three different factors:

dim(UCBAdmissions) ## [1] 2 2 6 dimnames(UCBAdmissions) ## $Admit ## [1] "Admitted" "Rejected" ## ## $Gender ## [1] "Male" "Female" ## ## $Dept ## [1] "A" "B" "C" "D" "E" "F" Based on the data in this table, the question of whether sexism played a role in admission to graduate school at Berkeley. This claim can be explored using eikosgrams.

slide-71
SLIDE 71

Eikosograms - sex discrimination at grad school?

First, to look at the admission rate we use eikos("Admit", data = UCBAdmissions)

slide-72
SLIDE 72

Eikosograms - sex discrimination at grad school?

First, to look at the admission rate we use eikos("Admit", data = UCBAdmissions)

Admit Admitted Rejected

0.39

slide-73
SLIDE 73

Eikosograms - sex discrimination at grad school?

First, to look at the admission rate we use eikos("Admit", data = UCBAdmissions)

Admit Admitted Rejected

0.39

Comments?

slide-74
SLIDE 74

Eikosograms - sex discrimination at grad school?

But, what if we break this down by the sex of the applicant?

slide-75
SLIDE 75

Eikosograms - sex discrimination at grad school?

But, what if we break this down by the sex of the applicant? eikos(Admit ~ Gender , data = UCBAdmissions)

slide-76
SLIDE 76

Eikosograms - sex discrimination at grad school?

But, what if we break this down by the sex of the applicant? eikos(Admit ~ Gender , data = UCBAdmissions)

Admit Admitted Rejected

0.45 0.3

Male Female Gender

0.59

Comments?

slide-77
SLIDE 77

Eikosograms - sex discrimination at grad school?

Recall that there are several different departments

slide-78
SLIDE 78

Eikosograms - sex discrimination at grad school?

Recall that there are several different departments eikos("Dept" , data = UCBAdmissions)

slide-79
SLIDE 79

Eikosograms - sex discrimination at grad school?

Recall that there are several different departments eikos("Dept" , data = UCBAdmissions)

Dept A B C D E F

0.21 0.34 0.54 0.71 0.84

Comments?

slide-80
SLIDE 80

Eikosograms - sex discrimination at grad school?

How do admissions for the two sexes fare across departments?

slide-81
SLIDE 81

Eikosograms - sex discrimination at grad school?

How do admissions for the two sexes fare across departments? Males: eikos(Admit ~ Dept, data = UCBAdmissions[,"Male",])

slide-82
SLIDE 82

Eikosograms - sex discrimination at grad school?

How do admissions for the two sexes fare across departments? Males: eikos(Admit ~ Dept, data = UCBAdmissions[,"Male",])

Admit Admitted Rejected

0.62 0.63 0.37 0.33 0.28 0.06

A B C D E F Dept

0.31 0.51 0.64 0.79 0.86

Comments?

slide-83
SLIDE 83

Eikosograms - sex discrimination at grad school?

How do admissions for the two sexes fare across departments?

slide-84
SLIDE 84

Eikosograms - sex discrimination at grad school?

How do admissions for the two sexes fare across departments? Females: eikos(Admit ~ Dept, data = UCBAdmissions[,"Female",])

slide-85
SLIDE 85

Eikosograms - sex discrimination at grad school?

How do admissions for the two sexes fare across departments? Females: eikos(Admit ~ Dept, data = UCBAdmissions[,"Female",])

Admit Admitted Rejected

0.82 0.68 0.34 0.35 0.24 0.07

A B C D E F Dept

0.06 0.07 0.4 0.6 0.81

Comments?

slide-86
SLIDE 86

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department?

slide-87
SLIDE 87

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department A: eikos(Admit ~ Gender, data = UCBAdmissions[,,"A"])

slide-88
SLIDE 88

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department A: eikos(Admit ~ Gender, data = UCBAdmissions[,,"A"])

Admit Admitted Rejected

0.62 0.82

Male Female Gender

0.88

Comments?

slide-89
SLIDE 89

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department?

slide-90
SLIDE 90

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department B: eikos(Admit ~ Gender, data = UCBAdmissions[,,"B"])

slide-91
SLIDE 91

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department B: eikos(Admit ~ Gender, data = UCBAdmissions[,,"B"])

Admit Admitted Rejected

0.63 0.68

Male Female Gender

0.96

Comments?

slide-92
SLIDE 92

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department?

slide-93
SLIDE 93

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department C: eikos(Admit ~ Gender, data = UCBAdmissions[,,"C"])

slide-94
SLIDE 94

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department C: eikos(Admit ~ Gender, data = UCBAdmissions[,,"C"])

Admit Admitted Rejected

0.37 0.34

Male Female Gender

0.35

Comments?

slide-95
SLIDE 95

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department?

slide-96
SLIDE 96

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department D: eikos(Admit ~ Gender, data = UCBAdmissions[,,"D"])

slide-97
SLIDE 97

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department D: eikos(Admit ~ Gender, data = UCBAdmissions[,,"D"])

Admit Admitted Rejected

0.33 0.35

Male Female Gender

0.53

Comments?

slide-98
SLIDE 98

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department?

slide-99
SLIDE 99

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department E: eikos(Admit ~ Gender, data = UCBAdmissions[,,"E"])

slide-100
SLIDE 100

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department E: eikos(Admit ~ Gender, data = UCBAdmissions[,,"E"])

Admit Admitted Rejected

0.28 0.24

Male Female Gender

0.33

Comments?

slide-101
SLIDE 101

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department?

slide-102
SLIDE 102

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department F: eikos(Admit ~ Gender, data = UCBAdmissions[,,"F"])

slide-103
SLIDE 103

Eikosograms - sex discrimination at grad school?

How above comparing sexes by department? Department F: eikos(Admit ~ Gender, data = UCBAdmissions[,,"F"])

Admit Admitted Rejected

0.06 0.07

Male Female Gender

0.52

Comments?

slide-104
SLIDE 104

Eikosograms - sex discrimination at grad school?

Admission rates by department

slide-105
SLIDE 105

Eikosograms - sex discrimination at grad school?

Admission rates by department eikos(Admit ~ Dept, data = UCBAdmissions)

slide-106
SLIDE 106

Eikosograms - sex discrimination at grad school?

Admission rates by department eikos(Admit ~ Dept, data = UCBAdmissions)

Admit Admitted Rejected

0.64 0.63 0.35 0.34 0.25 0.06

A B C D E F Dept

0.21 0.34 0.54 0.71 0.84

Comments?

slide-107
SLIDE 107

Eikosograms - sex discrimination at grad school?

Sex distribution by department

slide-108
SLIDE 108

Eikosograms - sex discrimination at grad school?

Sex distribution by department eikos(Gender ~ Dept, data = UCBAdmissions)

slide-109
SLIDE 109

Eikosograms - sex discrimination at grad school?

Sex distribution by department eikos(Gender ~ Dept, data = UCBAdmissions)

Gender Male Female

0.88 0.96 0.35 0.53 0.33 0.52

A B C D E F Dept

0.21 0.34 0.54 0.71 0.84

Comments?

slide-110
SLIDE 110

Eikosograms - sex discrimination at grad school?

Sex distribution by department and admission by department side by side

Gender Male Female

0.88 0.96 0.35 0.53 0.33 0.52

A B C D E F Dept

0.21 0.34 0.54 0.71 0.84

Admit Admitted Rejected

0.64 0.63 0.35 0.34 0.25 0.06

A B C D E F Dept

0.21 0.34 0.54 0.71 0.84

Comments?

slide-111
SLIDE 111

Eikosograms - sex discrimination at grad school?

We can put this together in a single eikosogram: eikos(Admit ~ Dept + Gender, data = UCBAdmissions, xlab_rot = 45)

slide-112
SLIDE 112

Eikosograms - sex discrimination at grad school?

We can put this together in a single eikosogram: eikos(Admit ~ Dept + Gender, data = UCBAdmissions, xlab_rot = 45)

Admit Admitted Rejected

0.62 0.63 0.37 0.33 0.28 0.06 0.82 0.68 0.34 0.35 0.24 0.07

A A B B C C D D E E F F M a l e F e m a l e Dept Gender

0.18 0.31 0.38 0.47 0.51 0.59 0.62 0.62 0.75 0.84 0.92

Comments?

slide-113
SLIDE 113

Eikosograms - sex discrimination at grad school?

Or equivalently, making admission rates more easily comparable: eikos(Admit ~ Gender + Dept, data = UCBAdmissions, xlab_rot = 45)

slide-114
SLIDE 114

Eikosograms - sex discrimination at grad school?

Or equivalently, making admission rates more easily comparable: eikos(Admit ~ Gender + Dept, data = UCBAdmissions, xlab_rot = 45)

Admit Admitted Rejected

0.62 0.82 0.63 0.68 0.37 0.34 0.33 0.35 0.28 0.24 0.06 0.07

M a l e M a l e M a l e M a l e M a l e M a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e A B C D E F Gender Dept

0.18 0.21 0.33 0.34 0.41 0.54 0.63 0.71 0.76 0.84 0.92

Comments?

slide-115
SLIDE 115

Eikosograms - sex discrimination at grad school?

Or equivalently, making admission rates more easily comparable: eikos(Admit ~ Gender + Dept, data = UCBAdmissions, xlab_rot = 45)

Admit Admitted Rejected

0.62 0.82 0.63 0.68 0.37 0.34 0.33 0.35 0.28 0.24 0.06 0.07

M a l e M a l e M a l e M a l e M a l e M a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e F e m a l e A B C D E F Gender Dept

0.18 0.21 0.33 0.34 0.41 0.54 0.63 0.71 0.76 0.84 0.92

Comments?

Females applied to the more difficult programmes and hence fewer were admitted overall. A greater proportion of females were admitted in most programmes. No evidence of sex discrimation in favour

  • f males (possibly the opposite but this would require a statistical test).