Unit 5: Inference for categorical data 3. Chi-square testing

Sta 101 - Spring 2015
Duke University, Department of Statistical Science
March 24, 2015
Dr. Windle
Slides posted at http://bitly.com/windle2

Announcements
▶ Tomorrow in lab: work on Project 1; attendance is still mandatory.
▶ Project 1 due Monday at noon.
▶ RA6 Monday (all videos, unit is shorter)

Inference for categorical data
If sample size related conditions are met:
▶ Categorical data with 2 levels → Z
  – one variable: Z HT / CI for a single proportion
  – two variables: Z HT / CI comparing two proportions
▶ Categorical data with more than 2 levels → χ²
  – one variable: χ² test of goodness of fit, no CI
  – two variables: χ² test of independence, no CI
If sample size related conditions are not met: simulation based inference (randomization for HT / bootstrapping for CI, when appropriate)

Clicker question
You and a friend are playing craps, which relies on two dice. Your friend brought the dice. You have recorded the previously rolled totals as data. Which test is most appropriate to check that the dice are fair?
(a) Z test for a single proportion
(b) Z test for comparing two proportions
(c) χ² test of goodness of fit
(d) χ² test of independence

$H_0$: $p_2 = p_{12} = 1/36$; $p_3 = p_{11} = 1/18$; $p_4 = p_{10} = 1/12$; $p_5 = p_9 = 1/9$; $p_6 = p_8 = 5/36$; $p_7 = 1/6$.
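The null probabilities for the craps totals follow from counting the 36 equally likely outcomes of two fair dice. A minimal sketch of that count, together with a goodness of fit statistic of the form sum of (O − E)² / E, is given here. The names `null_probs`, `gof_statistic`, and `observed_rolls` are illustrative, and the recorded rolls are made up for the example rather than taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die 1, die 2) outcomes give each total.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1

# Null distribution of the total under the hypothesis that both dice are fair.
null_probs = {total: Fraction(c, 36) for total, c in sorted(counts.items())}


def gof_statistic(observed, probs):
    """Chi-square goodness of fit statistic: sum over cells of (O - E)^2 / E."""
    n = sum(observed.values())
    stat = 0.0
    for total, p in probs.items():
        expected = n * float(p)  # expected count assuming the null is true
        stat += (observed.get(total, 0) - expected) ** 2 / expected
    return stat


# Hypothetical record of 36 rolls that matches the fair-dice expectation exactly.
observed_rolls = {total: int(p * 36) for total, p in null_probs.items()}

print(null_probs[7])                              # 1/6
print(gof_statistic(observed_rolls, null_probs))  # 0.0
```

With 11 possible totals, a goodness of fit test on real roll data would use df = 11 − 1 = 10.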

Clicker question
A Gallup poll asked whether or not respondents identify as Tea Party Republican (yes / no) and whether or not they are motivated to vote in the upcoming midterm election (yes / no). We want to find out whether being a Tea Party Republican is associated with motivation to vote. Which test is most appropriate?
(a) Z test for a single proportion
(b) Z test for comparing two proportions
(c) χ² test of goodness of fit
(d) χ² test of independence

$H_0$: $p_{TPR} = p_{Other}$, where $p$ = probability of being motivated to vote

Clicker question
Suppose the Gallup poll instead asked about
▶ party affiliation (Tea Party Republican, Other Republican, and Non-Republican), and
▶ motivation to vote (extremely unmotivated, very unmotivated, unmotivated, motivated, very motivated, extremely motivated)
We want to find out whether party affiliation is associated with motivation to vote. Which test is most appropriate?
(a) Z test for a single proportion
(b) Z test for comparing two proportions
(c) χ² test of goodness of fit
(d) χ² test of independence

$H_0$: Party affiliation and motivation to vote are independent

The χ² statistic
χ² statistic: When dealing with counts and investigating how far the observed counts are from the expected counts, we use a new test statistic called the chi-square (χ²) statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O - E)^2}{E}, \quad \text{where } k = \text{total number of cells}$$

Important points:
▶ Use counts (not proportions) in the calculation of the test statistic, even though we're truly interested in the proportions for inference
▶ Expected counts are calculated assuming the null hypothesis is true

Expected Counts
Example: does survival on the Titanic depend on cabin class?

Observed counts:
       1st    2nd    3rd
no     123    158    528
yes    200    119    181

Column props.:
       1st    2nd    3rd
no     0.38   0.57   0.74
yes    0.62   0.43   0.26

Intuition:

$$E_{\text{no, 1st}} = \hat{p}_{\text{no}} \times \hat{p}_{\text{1st}} \times (\text{total \# obs})$$

Simplification:

$$E_{\text{no, 1st}} = \frac{\text{row ``no'' total} \times \text{col ``1st'' total}}{\text{table total}} = \frac{809 \times 323}{1309} = 199.6$$

$$\chi^2_{\text{titanic}} = 127.9$$
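The expected-count rule (row total × column total / table total) and the χ² statistic for the Titanic table can be reproduced with a short sketch. The counts are the observed counts from the slide. The function names `expected_counts` and `chi_square` are illustrative and are not part of the course materials.

```python
# Observed counts from the Titanic example (rows: survived no/yes, columns: cabin class).
observed = {
    "no":  {"1st": 123, "2nd": 158, "3rd": 528},
    "yes": {"1st": 200, "2nd": 119, "3rd": 181},
}


def expected_counts(table):
    """Expected count per cell under independence: row total * column total / table total."""
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, v in cols.items():
            col_totals[c] = col_totals.get(c, 0) + v
    n = sum(row_totals.values())
    return {r: {c: row_totals[r] * col_totals[c] / n for c in col_totals} for r in table}


def chi_square(table):
    """Chi-square statistic: sum over all cells of (O - E)^2 / E, computed from counts."""
    exp = expected_counts(table)
    return sum((table[r][c] - exp[r][c]) ** 2 / exp[r][c] for r in table for c in table[r])


expected = expected_counts(observed)
chi2_titanic = chi_square(observed)

print(round(expected["no"]["1st"], 1))  # 199.6
print(round(chi2_titanic, 1))           # 127.9
```

The table has R = 2 rows and C = 3 columns, so the independence test uses df = (2 − 1) × (3 − 1) = 2.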

The χ² distribution
The χ² distribution has just one parameter, degrees of freedom (df), which influences the shape, center, and spread of the distribution.
▶ For χ² GOF test: df = k − 1
▶ For χ² independence test: df = (R − 1) × (C − 1)

[Figure: χ² density curves for 2, 4, and 9 degrees of freedom; horizontal axis from 0 to 25]

Finding areas under the chi-square curve
p-value = tail area under the chi-square distribution (as usual)
▶ Using the applet: http://bit.ly/dist_calc
▶ Using R: pchisq()
▶ Using the table: works a lot like the t table, but only provides upper tail values.

[Figure: χ² curve; horizontal axis from 0 to 25]

Upper tail    0.3     0.2     0.1     0.05    0.02    0.01    0.005   0.001
df   1        1.07    1.64    2.71    3.84    5.41    6.63    7.88    10.83
     2        2.41    3.22    4.61    5.99    7.82    9.21    10.60   13.82
     3        3.66    4.64    6.25    7.81    9.84    11.34   12.84   16.27
     4        4.88    5.99    7.78    9.49    11.67   13.28   14.86   18.47
     5        6.06    7.29    9.24    11.07   13.39   15.09   16.75   20.52
     6        7.23    8.56    10.64   12.59   15.03   16.81   18.55   22.46
     · · ·

Computing a p-value using the table
Clicker question
In the Titanic example, $\chi^2_{\text{titanic}} = 127.9$ and df = 2. Based on the table from the previous slide, which of the following is correct? (Hint: draw a picture!)
(a) The p-value for this data set will be in the interval (0.02, 0.05].
(b) The p-value for this data set will be in the interval (0.01, 0.02].
(c) The p-value for this data set will be in the interval (0.005, 0.01].
(d) The p-value for this data set will be in the interval (0.001, 0.005].
(e) The p-value for this data set will be at or below 0.001.

Interpretation
Clicker question
What is the best interpretation of the hypothesis test?
(a) There is not convincing evidence that survival and cabin class are dependent.
(b) There is convincing evidence that survival and cabin class are dependent.
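For the Titanic statistic (127.9 with df = 2), the upper tail area can also be checked numerically. The sketch here uses a closed form that holds only for even degrees of freedom, P(X > x) = exp(−x/2) · Σ_{i=0}^{df/2 − 1} (x/2)^i / i!, so it is not a general replacement for R's pchisq(). The function name `chi2_upper_tail_even_df` is made up for this illustration.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper tail area P(X > x) for a chi-square distribution with even df.

    Uses exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!, which is valid only
    when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0   # current term (x/2)^i / i!, starting at i = 0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Sanity check against the table row df = 2: the cutoff 5.99 has upper tail about 0.05.
print(round(chi2_upper_tail_even_df(5.99, 2), 3))  # 0.05

# Titanic example: chi-square = 127.9, df = 2.
p_titanic = chi2_upper_tail_even_df(127.9, 2)
print(p_titanic < 0.001)  # True
```

In R, the same tail area would typically come from pchisq(127.9, df = 2, lower.tail = FALSE), which also handles odd df.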

Conditions for χ² testing
1. Independence: In addition to what we previously discussed for independence, each case that contributes a count to the table must be independent of all the other cases in the table.
2. Sample size / distribution: Each cell must have at least 5 expected cases.

Application exercise: 5.3 Chi-square tests
See course website for details.

Recap: Does smoking habit depend on exercise habit?
Does smoking habit depend on exercise habit?

              Freq.Exer   Not.Freq.Exer   Total
Non.Smoker    87          102             189
Smoker        28          19              47
Total         115         121             236

▶ $H_0$: $p_{\text{freq. exer}} = p_{\text{not freq. exer}}$
▶ $H_A$: $p_{\text{freq. exer}} \neq p_{\text{not freq. exer}}$

Randomization test for the difference of two proportions
1. Use 236 index cards, where each card represents an observation.
2. Mark 189 of the cards as "Non Smoker" and the remaining 47 as "Smoker."
3. Shuffle the cards and split into two groups of size 115 and 121 corresponding to "Freq. Exer" and "Not Freq. Exer" respectively.
4. Calculate the difference between the proportions of "Non Smoker" in the frequent exercise and not frequent exercise groups, and record this number.
5. Repeat steps (3) and (4) many times to build a randomization distribution of differences in simulated proportions.
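The index-card procedure for the difference of two proportions can be sketched in code. The counts (189 non smokers, 47 smokers, groups of 115 and 121) come from the table. The number of repetitions, the seed, and the names `randomization_diffs` and `p_value` are assumptions made for this illustration, and the two-sided p-value rule used here is one common choice rather than something the slides specify.

```python
import random


def randomization_diffs(n_nonsmoker=189, n_smoker=47, n_freq=115, reps=2000, seed=1):
    """Simulate differences in the proportion of non smokers between the two exercise groups."""
    rng = random.Random(seed)
    cards = [1] * n_nonsmoker + [0] * n_smoker  # 1 = "Non Smoker", 0 = "Smoker"
    n_not_freq = len(cards) - n_freq            # 121 cards in the second group
    diffs = []
    for _ in range(reps):
        rng.shuffle(cards)                           # step 3: shuffle and split
        p_freq = sum(cards[:n_freq]) / n_freq        # proportion in "Freq. Exer"
        p_not = sum(cards[n_freq:]) / n_not_freq     # proportion in "Not Freq. Exer"
        diffs.append(p_freq - p_not)                 # step 4: record the difference
    return diffs


# Observed difference in the proportion of non smokers from the table.
observed_diff = 87 / 115 - 102 / 121

diffs = randomization_diffs()

# Two-sided p-value: share of simulated differences at least as extreme as the observed one.
p_value = sum(abs(d) >= abs(observed_diff) for d in diffs) / len(diffs)

print(round(observed_diff, 3))  # -0.086
print(p_value)
```

Shuffling the cards breaks any link between smoking and exercise, so the simulated differences show what chance alone would produce under the null hypothesis.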

Recap: Does smoking habit depend on exercise habit?
Does smoking habit depend on exercise habit?

         Freq   None   Some   Total
Heavy    7      1      3      11
Never    87     18     84     189
Occas    12     3      4      19
Regul    9      1      7      17
Total    115    23     98     236

▶ $H_0$: smoking habits are not dependent on exercise habits.
▶ $H_A$: smoking habits are dependent on exercise habits.

Randomization test for the dependence of two categorical variables
1. Use 236 index cards, where each card represents an observation.
2. Mark 11 of the cards "Heavy", 189 of the cards "Never", 19 of the cards "Occas", 17 of the cards "Regul".
3. Shuffle the cards and split into 3 groups of size 115, 23, and 98 corresponding to "Freq", "None", and "Some" respectively.
4. Calculate the χ² test statistic for the shuffled data.
5. Repeat steps (3) and (4) many times to build a randomization distribution of many χ² test statistics.

Randomization test for the dependence of two categorical variables
Check out chisq-randomization.R on the course website.

Summary of main ideas
1. Categorical data: 2 levels → Z, > 2 levels → χ²
2. The χ² statistic is never negative, and its distribution is right skewed
3. At least 5 expected cases in each cell for χ² testing
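The card-shuffling procedure for the smoking-by-exercise table can be sketched as follows. This is not the course's chisq-randomization.R script; it is an independent illustration built from the counts in the table, and the names, the number of repetitions, and the seed are all assumptions.

```python
import random

# Observed counts, organised by exercise group (columns of the table).
observed = {
    "Freq": {"Heavy": 7, "Never": 87, "Occas": 12, "Regul": 9},
    "None": {"Heavy": 1, "Never": 18, "Occas": 3, "Regul": 1},
    "Some": {"Heavy": 3, "Never": 84, "Occas": 4, "Regul": 7},
}
group_sizes = {g: sum(col.values()) for g, col in observed.items()}  # 115, 23, 98

# One card per observation, laid out in group order so the unshuffled deck is the observed data.
observed_cards = [lvl for col in observed.values() for lvl, k in col.items() for _ in range(k)]


def chi_square_from_cards(cards, sizes):
    """Split the deck into consecutive groups and compute sum of (O - E)^2 / E."""
    n = len(cards)
    row_totals = {}
    for level in cards:
        row_totals[level] = row_totals.get(level, 0) + 1
    stat, start = 0.0, 0
    for size in sizes.values():
        chunk = cards[start:start + size]
        start += size
        for level, row_total in row_totals.items():
            expected = row_total * size / n
            stat += (chunk.count(level) - expected) ** 2 / expected
    return stat


def randomization_chi2(reps=1000, seed=1):
    """Steps 3 to 5: shuffle, recompute the statistic, repeat."""
    rng = random.Random(seed)
    cards = list(observed_cards)
    stats = []
    for _ in range(reps):
        rng.shuffle(cards)
        stats.append(chi_square_from_cards(cards, group_sizes))
    return stats


observed_stat = chi_square_from_cards(observed_cards, group_sizes)
stats = randomization_chi2()
p_value = sum(s >= observed_stat for s in stats) / len(stats)

print(round(observed_stat, 2))  # 5.49
print(p_value)
```

Several cells in this table have expected counts below 5 (for example, Heavy and None has about 1.07), which is why simulation is used here instead of the χ² distribution.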
