SLIDE 1

The General Social Survey

INFERENCE FOR CATEGORICAL DATA IN R

Andrew Bray

Assistant Professor of Statistics at Reed College

SLIDE 7

Exploring GSS

library(dplyr)
glimpse(gss)

Observations: 3,300
Variables: 25
$ id <dbl> 518, 1092, 2094, 229, 979, 554, 491, 319, 3143, 1...
$ year <dbl> 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1...
$ age <fct> 49, 22, 26, 75, 71, 33, 56, 33, 69, 40, 44, 42, 5...
$ class <fct> WORKING CLASS, WORKING CLASS, WORKING CLASS, LOWE...
$ degree <fct> HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL, LT HIGH SC...
$ sex <fct> MALE, MALE, MALE, MALE, FEMALE, FEMALE, MALE, FEM...
$ happy <fct> HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, ...

SLIDE 8

Exploring GSS

library(ggplot2)
gss2016 <- filter(gss, year == 2016)
ggplot(gss2016, aes(x = happy)) +
  geom_bar()

SLIDE 10

Exploring GSS

p_hat <- gss2016 %>%
  summarize(prop_happy = mean(happy == "HAPPY")) %>%
  pull()
p_hat

0.7733333
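
The pipeline above works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical comparison is a proportion. A minimal base-R sketch of that idea, using a made-up vector (the values in `responses`, including the "UNHAPPY" label, are hypothetical and not taken from the GSS):

```r
# Hypothetical responses, for illustration only (not GSS data).
responses <- c("HAPPY", "HAPPY", "UNHAPPY", "HAPPY")

# The comparison yields a logical vector; mean() counts TRUE as 1, FALSE as 0.
is_happy <- responses == "HAPPY"
prop_happy <- mean(is_happy)
prop_happy  # 3 of 4 responses match, so this is 0.75
```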

SLIDE 11

General 95% confidence interval

$(\hat{p} - 2 \times SE,\ \hat{p} + 2 \times SE)$

Sample proportion plus or minus two standard errors
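
The multiplier 2 is commonly understood as a rounded version of the 97.5th percentile of the standard normal distribution, which applies when the sampling distribution is roughly bell-shaped. A quick check in base R (the variable name `z` is just an illustrative choice):

```r
# 97.5th percentile of the standard normal: leaves 2.5% in each tail.
z <- qnorm(0.975)
z  # approximately 1.96, which the slides round to 2
```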

SLIDE 12

Bootstrap

SLIDE 26

Bootstrap Confidence Interval

library(infer)
boot <- gss2016 %>%
  specify(response = happy, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
boot

Response: happy (factor)
# A tibble: 500 x 2
  replicate  stat
      <int> <dbl>
1         1 0.827
2         2 0.740
3         3 0.780
4         4 0.773
5         5 0.747
6         6 0.753

SLIDE 27

Bootstrap Confidence Interval

ggplot(boot, aes(x = stat)) +
  geom_density()

SLIDE 28

Bootstrap Confidence Interval

SE <- boot %>%
  summarize(sd(stat)) %>%
  pull()
SE

0.03482251

$(\hat{p} - 2 \times SE,\ \hat{p} + 2 \times SE)$

c(p_hat - 2 * SE, p_hat + 2 * SE)

0.7051883 0.8412784
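
To see conceptually what `generate()` and `calculate()` are doing, the sketch below runs the same kind of bootstrap in base R. It is an illustration under assumptions: the vector `happy` is simulated (116 "HAPPY" out of 150, chosen so that the sample proportion equals the 0.7733333 printed earlier), and `boot_stat`, `ci`, and the "UNHAPPY" label are names introduced here rather than part of the course code.

```r
set.seed(2016)  # arbitrary seed so the resampling is reproducible

# Simulated stand-in for the 2016 responses (an assumption, not the real data).
happy <- rep(c("HAPPY", "UNHAPPY"), times = c(116, 34))
p_hat <- mean(happy == "HAPPY")

# Resample the data with replacement 500 times and record each proportion.
boot_stat <- replicate(500, {
  resample <- sample(happy, size = length(happy), replace = TRUE)
  mean(resample == "HAPPY")
})

# The bootstrap standard error is the SD of the resampled proportions.
SE <- sd(boot_stat)

# Interval: sample proportion plus or minus two standard errors.
ci <- c(lower = p_hat - 2 * SE, upper = p_hat + 2 * SE)
ci
```

Because resampling is random, `SE` and the interval endpoints shift slightly from run to run unless a seed is set, which is why intervals printed on different slides for the same year do not agree to every digit.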

SLIDE 29

Let's practice!

SLIDE 30

Interpreting a Confidence Interval

SLIDE 31

Confidence intervals

Conclusion: the true proportion of Americans that are happy is between 0.705 and 0.841.

What do we mean by confident?

SLIDE 32

Dataset 1

ds1 <- filter(gss, year == 2016)
p_hat <- ds1 %>%
  summarize(mean(happy == "HAPPY")) %>%
  pull()
SE <- ds1 %>%
  specify(response = happy, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  summarize(sd(stat)) %>%
  pull()
c(p_hat - 2 * SE, p_hat + 2 * SE)

0.7073114 0.8393553

SLIDE 43

Dataset 2

ds2 <- filter(gss, year == 2014)
p_hat <- ds2 %>%
  summarize(mean(happy == "HAPPY")) %>%
  pull()
SE <- ds2 %>%
  specify(response = happy, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  summarize(sd(stat)) %>%
  pull()
c(p_hat - 2 * SE, p_hat + 2 * SE)

0.8348831 0.9384503

SLIDE 44

Dataset 3

ds3 <- filter(gss, year == 2012)
p_hat <- ds3 %>%
  summarize(mean(happy == "HAPPY")) %>%
  pull()
SE <- ds3 %>%
  specify(response = happy, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  summarize(sd(stat)) %>%
  pull()
c(p_hat - 2 * SE, p_hat + 2 * SE)

0.7626359 0.8906974


SLIDE 50

Confidence Intervals

Interpretation: “We’re 95% confident that the true proportion of Americans that are happy is between 0.705 and 0.841.”

Width of the interval affected by

  • n
  • confidence level
  • p
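
One way to make "95% confident" concrete is a simulation in which the true proportion is known. The sketch below is base R and rests on assumptions introduced here: a true proportion `p_true = 0.8`, a sample size of 150, and a hypothetical helper `ci_2se()`. It draws many samples, builds a two-standard-error bootstrap interval from each, and reports the fraction of intervals that capture `p_true`.

```r
set.seed(1)  # arbitrary seed for reproducibility

p_true <- 0.8  # assumed population proportion (known only because we simulate)
n <- 150       # assumed sample size

# Hypothetical helper: bootstrap interval p_hat +/- 2 * SE for a logical vector.
ci_2se <- function(x, reps = 200) {
  p_hat <- mean(x)
  boot_stat <- replicate(reps, mean(sample(x, replace = TRUE)))
  SE <- sd(boot_stat)
  c(lower = p_hat - 2 * SE, upper = p_hat + 2 * SE)
}

# Draw 200 independent samples and compute one interval per sample.
# The result is a 2 x 200 matrix with rows "lower" and "upper".
intervals <- replicate(200, ci_2se(runif(n) < p_true))

# Fraction of intervals that contain the true proportion.
covered <- intervals["lower", ] <= p_true & p_true <= intervals["upper", ]
coverage <- mean(covered)
coverage
```

The result should land near 0.95, though not exactly, since it is itself an estimate from 200 simulated samples. Any single interval either contains the true proportion or does not; the 95% describes how often the procedure succeeds across repeated samples.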

SLIDE 51

Let's practice!

SLIDE 52

The approximation shortcut

SLIDE 53

Confidence Intervals

SE
0.009998905

SE_small_n
0.03809731

SE_low_p
0.00547912

Standard errors increase when

  • n is small
  • p is close to 0.5

SLIDE 56

The normal distribution

A.K.A. the "bell curve".

If

  • observations are independent
  • n is large

then $\hat{p}$ follows a normal distribution.

SLIDE 57

Standard deviation

$\sqrt{\dfrac{\hat{p} \times (1 - \hat{p})}{n}}$
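
The formula can be wrapped in a small function to see how the standard error responds to n and to the sample proportion. The function name `se_prop` and the input values below are illustrative choices, not taken from the slides.

```r
# Approximate standard error of a sample proportion:
# sqrt(p_hat * (1 - p_hat) / n)
se_prop <- function(p_hat, n) {
  sqrt(p_hat * (1 - p_hat) / n)
}

se_large_n <- se_prop(0.5, 2500)  # large n, proportion at 0.5
se_small_n <- se_prop(0.5, 100)   # smaller n gives a larger standard error
se_low_p   <- se_prop(0.1, 100)   # proportion far from 0.5 gives a smaller one

c(se_large_n, se_small_n, se_low_p)
```

These three calls evaluate to 0.01, 0.05, and 0.03, matching the pattern that standard errors grow when n is small and when the proportion is close to 0.5.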

SLIDE 58

Assessing model assumptions

How do I check "observations are independent"?

This depends upon the data collection method.

What does "n is large" mean?

$n \times \hat{p} > 10$

$n \times (1 - \hat{p}) > 10$
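
The "n is large" rule of thumb is easy to encode as a check. This is a minimal sketch; the helper name `n_is_large`, its `threshold` argument, and the example inputs are hypothetical and introduced only for illustration.

```r
# Hypothetical helper: TRUE when both expected counts exceed the threshold.
n_is_large <- function(n, p_hat, threshold = 10) {
  successes <- n * p_hat
  failures <- n * (1 - p_hat)
  successes > threshold && failures > threshold
}

ok_example <- n_is_large(200, 0.8)   # 160 successes, 40 failures: condition met
bad_example <- n_is_large(30, 0.9)   # 27 successes, about 3 failures: not met
c(ok_example, bad_example)
```

When the check fails, the normal approximation is questionable and the bootstrap approach is the safer route.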

SLIDE 59

Calculating standard error: approximation

p_hat <- gss2016 %>%
  summarize(mean(happy == "HAPPY")) %>%
  pull()
n <- nrow(gss2016)
c(n * p_hat, n * (1 - p_hat))

116 34

SE_approx <- sqrt(p_hat * (1 - p_hat) / n)
SE_approx

0.03418468

SLIDE 60

Calculating standard error: computation

boot <- gss2016 %>%
  specify(response = happy, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
SE_boot <- boot %>%
  summarize(sd(stat)) %>%
  pull()
SE_boot

0.03176741

SLIDE 61

Sampling distributions

ggplot(boot, aes(x = stat)) +
  geom_density()

SLIDE 62

Sampling distributions

ggplot(boot, aes(x = stat)) +
  geom_density() +
  stat_function(fun = dnorm, color = "purple",
                args = list(mean = p_hat, sd = SE_approx))

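
The agreement between the bootstrap density and the normal curve can also be checked numerically. The sketch below is base R with simulated data standing in for `gss2016` (116 TRUE out of 150, an assumption chosen to match the printed sample proportion). It compares the two standard errors and lines up a few quantiles of the two distributions. The names `boot_stat`, `probs`, and `compare` are introduced here.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the 2016 responses (assumption, not the real data).
happy <- rep(c(TRUE, FALSE), times = c(116, 34))
n <- length(happy)
p_hat <- mean(happy)

# Computational route: SD of bootstrap proportions.
boot_stat <- replicate(2000, mean(sample(happy, replace = TRUE)))
SE_boot <- sd(boot_stat)

# Approximation route: normal model with the formula-based standard error.
SE_approx <- sqrt(p_hat * (1 - p_hat) / n)

# Compare selected quantiles of the bootstrap distribution and the normal curve.
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
compare <- data.frame(
  prob = probs,
  bootstrap = unname(quantile(boot_stat, probs)),
  normal = qnorm(probs, mean = p_hat, sd = SE_approx)
)

c(SE_boot = SE_boot, SE_approx = SE_approx)
compare
```

If the two columns of `compare` track each other closely, the normal approximation is doing a reasonable job for this sample size and proportion.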

SLIDE 64

Let's practice!