Vocabulary score vs. self identified social class


SLIDE 1

DataCamp Inference for Numerical Data in R

Vocabulary score vs. self identified social class

INFERENCE FOR NUMERICAL DATA IN R

Mine Cetinkaya-Rundel

Associate Professor of the Practice, Duke University

SLIDE 2

Vocabulary score and self identified social class

  • wordsum: 10 question vocabulary test (scores range from 0 to 10)
  • class: self identified social class (lower, working, middle, upper)

       wordsum  class
1      6        MIDDLE
2      9        WORKING
3      6        WORKING
4      5        WORKING
5      6        WORKING
6      6        WORKING
...    ...      ...
795    9        MIDDLE

SLIDE 3

  • 1. SPACE (school, noon, captain, room, board, don't know)
  • 2. BROADEN (efface, make level, elapse, embroider, widen, don't know)
  • 3. EMANATE (populate, free, prominent, rival, come, don't know)
  • 4. EDIBLE (auspicious, eligible, fit to eat, sagacious, able to speak, don't know)
  • 5. ANIMOSITY (hatred, animation, disobedience, diversity, friendship, don't know)
  • 6. PACT (puissance, remonstrance, agreement, skillet, pressure, don't know)
  • 7. CLOISTERED (miniature, bunched, arched, malady, secluded, don't know)
  • 8. CAPRICE (value, a star, grimace, whim, inducement, don't know)
  • 9. ACCUSTOM (disappoint, customary, encounter, get used to, business, don't know)

SLIDE 4

Distribution of vocabulary score

library(ggplot2)

ggplot(data = gss, aes(x = wordsum)) +
  geom_histogram(binwidth = 1)

SLIDE 5

Self identified social class: class

If you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class, or the upper class?

ggplot(data = gss, aes(x = class)) +
  geom_bar()

SLIDE 6

Let's practice!

SLIDE 7

ANOVA

SLIDE 9

ANOVA for vocabulary scores vs. self identified social class

$H_0$: The average vocabulary score is the same across all social classes, $\mu_{\text{lower}} = \mu_{\text{working}} = \mu_{\text{middle}} = \mu_{\text{upper}}$.

$H_A$: The average vocabulary scores differ between at least one pair of social classes.

SLIDE 10

Variability partitioning

Total variability in vocabulary score:
  • Variability that can be attributed to differences in social class: between-group variability
  • Variability attributed to all other factors: within-group variability
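The partitioning can be checked by hand on a small example. The sketch below uses hypothetical toy scores in three made-up groups (not the GSS data) and base R only; the names `sst`, `ssg` and `sse` are illustrative labels for the total, between-group and within-group sums of squares.

```r
# Hypothetical toy data: three groups of three scores (not the GSS sample).
scores <- c(4, 5, 6, 6, 7, 8, 8, 9, 10)
group  <- factor(rep(c("a", "b", "c"), each = 3))

grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)    # one mean per group
group_sizes <- tapply(scores, group, length)  # one count per group

# Total variability: squared deviations from the grand mean.
sst <- sum((scores - grand_mean)^2)
# Between-group variability: group means compared with the grand mean.
ssg <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group variability: each score compared with its own group mean.
sse <- sum((scores - ave(scores, group))^2)

# The two parts add up to the total.
all.equal(sst, ssg + sse)
```

Here the total of 30 splits into 24 between groups and 6 within groups.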

SLIDE 11

ANOVA output

library(broom)
library(dplyr)  # provides the %>% pipe

aov(wordsum ~ class, data = gss) %>%
  tidy()

term        df      sumsq     meansq  statistic  p.value
class        3   236.5644  78.854810   21.73467
Residuals  791  2869.8003   3.628066         NA       NA

SLIDE 12

Sum of squares

term        df      sumsq     meansq  statistic  p.value
class        3   236.5644  78.854810   21.73467
Residuals  791  2869.8003   3.628066         NA       NA

  • SST = 236.5644 + 2869.8003 = 3106.365: measures the total variability in the response variable
  • Calculated very similarly to variance (except not scaled by the sample size)
  • Percentage of explained variability = $\frac{236.5644}{3106.365} = 7.6\%$
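The arithmetic on this slide can be reproduced directly from the two `sumsq` values in the ANOVA output. This is only a check of the numbers shown; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA output above.
ss_class     <- 236.5644   # between-group (class)
ss_residuals <- 2869.8003  # within-group (Residuals)

# Total sum of squares and the share attributed to class.
sst           <- ss_class + ss_residuals
pct_explained <- 100 * ss_class / sst

round(sst, 3)            # 3106.365
round(pct_explained, 1)  # 7.6
```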

SLIDE 13

F-statistic

term        df      sumsq     meansq  statistic  p.value
class        3   236.5644  78.854810   21.73467
Residuals  791  2869.8003   3.628066         NA       NA

F-statistic = 21.73467 = $\frac{\text{between group var}}{\text{within group var}}$
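As a sketch of where the F-statistic comes from: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The p-value step below uses `pf()` from base R with the degrees of freedom in the table; it is an illustration of the calculation, not the course's own code.

```r
# Values copied from the ANOVA output above.
ss_class <- 236.5644;  df_class <- 3
ss_resid <- 2869.8003; df_resid <- 791

# Mean squares: sum of squares divided by degrees of freedom.
ms_class <- ss_class / df_class  # between-group variability
ms_resid <- ss_resid / df_resid  # within-group variability

# F-statistic: ratio of between- to within-group variability.
f_stat <- ms_class / ms_resid

# Upper-tail area of the F distribution gives the p-value.
p_value <- pf(f_stat, df1 = df_class, df2 = df_resid, lower.tail = FALSE)

round(f_stat, 4)
p_value
```

The computed ratio agrees with the `statistic` column up to rounding, and the p-value is very small.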

SLIDE 14

Let's practice!

SLIDE 15

Conditions for ANOVA

SLIDE 16

Conditions for ANOVA

  • Independence:
      • within groups: sampled observations must be independent
      • between groups: the groups must be independent of each other (non-paired)
  • Approximate normality: distribution of the response variable should be nearly normal within each group
  • Equal variance: groups should have roughly equal variability

SLIDE 17

Independence

  • Within groups: sampled observations must be independent of each other
      • Random sample / assignment
      • Each $n_j$ less than 10% of respective population
      • Always important, but sometimes difficult to check
  • Between groups: groups must be independent of each other
      • Carefully consider whether the groups may be dependent

SLIDE 18

Approximately normal

  • Distribution of response variable within each group should be approximately normal
  • Especially important when sample sizes are small
  • Check with visuals
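One possible visual check is a normal Q-Q plot per group. The sketch below uses simulated scores in three hypothetical groups (the data frame `sim` and its columns are made up for illustration, not the GSS sample) and base R graphics rather than ggplot2.

```r
set.seed(1)

# Hypothetical data: 3 groups of 50 simulated scores (not the GSS sample).
sim <- data.frame(
  group = rep(c("a", "b", "c"), each = 50),
  score = rnorm(150, mean = rep(c(5, 6, 7), each = 50), sd = 2)
)

# One normal Q-Q plot per group; points close to the line suggest
# that the group's distribution is nearly normal.
op <- par(mfrow = c(1, 3))
for (g in unique(sim$group)) {
  x <- sim$score[sim$group == g]
  qqnorm(x, main = paste("Group", g))
  qqline(x)
}
par(op)
```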

SLIDE 19

Constant variance

  • Variability should be consistent across groups (homoscedasticity)
  • Especially important when sample sizes differ between groups
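A rough numerical companion to a visual check is to compare group standard deviations alongside group sizes. The sketch below uses simulated data with made-up group sizes (the data frame `sim` is hypothetical, not the GSS sample) and base R only.

```r
set.seed(42)

# Hypothetical data: four groups of unequal, made-up sizes.
sim <- data.frame(
  class = factor(rep(c("LOWER", "WORKING", "MIDDLE", "UPPER"),
                     times = c(40, 400, 320, 35)))
)
sim$wordsum <- rnorm(nrow(sim), mean = 6, sd = 2)

# Standard deviation and sample size per group.
group_sds   <- tapply(sim$wordsum, sim$class, sd)
group_sizes <- table(sim$class)

group_sds
group_sizes

# Ratio of largest to smallest SD; values near 1 suggest similar variability.
sd_ratio <- max(group_sds) / min(group_sds)
sd_ratio
```

Side-by-side box plots, for example `boxplot(wordsum ~ class, data = sim)`, show the same comparison visually.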

SLIDE 20

Let's practice!

SLIDE 21

Post-hoc testing

SLIDE 22

Which means differ?

  • Two-sample t-tests for differences in each possible pair of groups
  • Multiple tests → inflated Type 1 error rate
  • Solution: use a modified significance level
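To see how quickly the error rate inflates, the sketch below computes the chance of at least one false positive across all pairwise tests for the four social classes. It assumes a per-test significance level of 0.05 (not stated on the slide) and treats the tests as independent, which pairwise tests on shared groups are not, so the result is only a rough approximation.

```r
k       <- 4             # number of groups (social classes)
n_pairs <- choose(k, 2)  # number of pairwise comparisons: 6
alpha   <- 0.05          # assumed per-test significance level

# Probability of at least one Type 1 error across all tests,
# under the simplifying assumption that the tests are independent.
familywise_error <- 1 - (1 - alpha)^n_pairs

round(familywise_error, 3)  # 0.265
```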

SLIDE 23

Multiple comparisons

  • Testing many pairs of groups is called multiple comparisons
  • The Bonferroni correction suggests that a more stringent significance level is more appropriate for these tests
  • Adjust $\alpha$ by the number of comparisons being considered: $\alpha^\star = \frac{\alpha}{K}$, where $K = \frac{k(k-1)}{2}$
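For the four social classes, the correction works out as follows, with k taken as the number of groups. The original significance level of 0.05 is an assumption for illustration; the slide does not fix a value.

```r
k     <- 4     # number of groups (lower, working, middle, upper)
alpha <- 0.05  # assumed original significance level

# Number of pairwise comparisons, K = k(k - 1) / 2.
K <- k * (k - 1) / 2

# Bonferroni-corrected significance level, alpha* = alpha / K.
alpha_star <- alpha / K

K                     # 6
round(alpha_star, 4)  # 0.0083
```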

SLIDE 24

Pairwise comparisons

  • Constant variance → re-think standard error and degrees of freedom: use consistent standard error and degrees of freedom for all tests
  • Compare the p-values from each test to the modified significance level
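One way to run such comparisons in base R is `pairwise.t.test()` with `pool.sd = TRUE`, which uses a single pooled standard deviation (and its degrees of freedom) for every pair. The sketch below runs it on simulated data (the data frame `sim` is hypothetical, not the GSS sample), leaves the p-values unadjusted, and compares them to a Bonferroni-modified level that assumes an original significance level of 0.05.

```r
set.seed(7)

# Hypothetical data: four groups of 30 simulated scores each.
sim <- data.frame(
  class   = factor(rep(c("LOWER", "WORKING", "MIDDLE", "UPPER"), each = 30)),
  wordsum = rnorm(120, mean = rep(c(5, 6, 7, 7), each = 30), sd = 2)
)

# Pairwise t-tests with a pooled SD shared by all comparisons.
# p.adjust.method = "none" keeps the raw p-values.
res <- pairwise.t.test(sim$wordsum, sim$class,
                       p.adjust.method = "none", pool.sd = TRUE)

# Modified significance level (assumes alpha = 0.05).
alpha_star <- 0.05 / choose(nlevels(sim$class), 2)

res$p.value               # lower-triangular matrix of p-values
res$p.value < alpha_star  # TRUE where a pair differs at the modified level
```

Setting `p.adjust.method = "bonferroni"` instead multiplies each p-value by the number of comparisons, which leads to the same decisions when compared to the original level.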

SLIDE 25

Let's practice!

SLIDE 26

Congratulations!