ACMS 20340 Statistics for Life Sciences Chapter 21: The Chi-Square - PowerPoint PPT Presentation

ACMS 20340 Statistics for Life Sciences Chapter 21: The Chi-Square Test for Goodness of Fit

Proportions
When studying population proportions, we consider categorical variables that have only two possible classes, which we call “successes” and “failures”. Of course, other names are possible, such as “Yes/No”, “Male/Female”, or “Up/Down”.

Categorical Variables with Multiple Categories
But categorical variables may have more than two outcomes. For example, a variable with possible outcomes A, B, and C has three.

How well does our data fit?
Suppose we expect that the world looks a certain way, for example with categories A, B, and C occurring in particular proportions, but our sample looks different. This difference may be due to random sampling variation, or it may be that the world does not look the way we expected. A $\chi^2$ test measures how much the sample fails to match the way we expected the world to be. ($\chi$ is the Greek letter “chi”, which rhymes with “eye”.)

Births
Suppose we consider the day of the week on which a birth occurs for a random sample of 700 births. We hypothesize that each day of the week is equally likely. That is,
$$H_0: p_1 = \tfrac{1}{7},\ p_2 = \tfrac{1}{7},\ \ldots,\ p_7 = \tfrac{1}{7},$$
where $p_1$ is the proportion of births on Sunday, $p_2$ is the proportion on Monday, etc. This is our null hypothesis. Notice that we could have chosen any hypothesized values for $p_1, \ldots, p_7$, as long as they sum to 1. The alternative hypothesis is that at least one of the $p_i$ is not equal to its hypothesized value $p_{i0}$ ($= \tfrac{1}{7}$).

Birth Data
We then collect the following birth data (from a sample of size $n = 700$):

          S     M     T     W     Th    F     S
Births    84    110   124   104   94    112   72
Percent   12%   16%   18%   15%   13%   16%   10%

The hypothesized value is that the overall proportion of births on each day is $1/7 \approx 14\%$. The question is whether the observed data are significantly different from what we expected.

Birth Data
[Bar chart: number of births on each day of the week (S, M, T, W, Th, F, S); vertical axis from 0 to 120.]

How significant is the difference?
Just how many births did we expect to see on Sunday? (This is the expected count of births on Sunday.)
$$\text{Sunday proportion} \times 700 = p_1 n = \tfrac{1}{7} \cdot 700 = 100$$
Doing the same for the other days gives:

           S     M     T     W     Th    F     S
Expected   100   100   100   100   100   100   100

Note: the expected counts will not always be whole numbers.
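As a quick check of the arithmetic above, here is a minimal Python sketch of the expected-count calculation. The variable names are illustrative assumptions, and exact fractions are used only to sidestep floating-point round-off.

```python
from fractions import Fraction

days = ["S", "M", "T", "W", "Th", "F", "S"]
n = 700                                      # sample size from the birth example
null_props = [Fraction(1, 7)] * len(days)    # H0: every day is equally likely

# Expected count for each category = hypothesized proportion x sample size.
expected = [p * n for p in null_props]

for day, count in zip(days, expected):
    print(day, float(count))
```

With other hypothesized proportions or another sample size, the same product generally gives non-integer expected counts, which is fine for the test.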

Computing the $\chi^2$ Statistic

             S      M     T      W      Th     F      S
Births       84     110   124    104    94     112    72
Expected     100    100   100    100    100    100    100
Difference   -16    10    24     4      -6     12     -28
Diff. Sq.    256    100   576    16     36     144    784
Normalized   2.56   1     5.76   0.16   0.36   1.44   7.84

Step 1: Take the difference between the actual counts of births and the expected counts of births.
Step 2: Square each of the differences.
Step 3: Normalize each squared difference by dividing by the number we expected to see (e.g. Sunday: 256/100 = 2.56).
These are called the $\chi^2$ components, one for each category. The chi-squared statistic is then the sum of the components:
$$\chi^2 = 2.56 + 1 + 5.76 + 0.16 + 0.36 + 1.44 + 7.84 = 19.12$$
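The three steps above can be sketched in a few lines of Python. This is only an illustration of the slide's arithmetic; the variable names are assumptions made for the example.

```python
observed = [84, 110, 124, 104, 94, 112, 72]   # births, Sunday through Saturday
expected = [100] * 7                           # 700 births x 1/7 per day

# Step 1: observed count minus expected count.
differences = [o - e for o, e in zip(observed, expected)]
# Step 2: square each difference.
squared = [d ** 2 for d in differences]
# Step 3: divide each squared difference by its expected count.
components = [s / e for s, e in zip(squared, expected)]

# The chi-squared statistic is the sum of the components.
chi_square = sum(components)
print(components)
print(round(chi_square, 2))
```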

Normalization
By normalizing, we measure each squared difference relative to the count we expected, so a deviation of a given size counts for more when the expected count is small.
Thinking about $\chi^2$:
◮ $\chi^2$ is always ≥ 0 (why?)
◮ If we observed exactly what we expected, the $\chi^2$ value would be 0; observations close to what we expected give a value close to 0.
◮ The farther the data are from what we expected, the larger the $\chi^2$ value will be.
◮ Since we expect some random variation, we expect each component to contribute around 1 or so.
◮ Adding together 7 components means we should expect a $\chi^2$ around 7 or so.
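The "around 7 or so" heuristic can be checked with a small simulation. The sketch below is not part of the original slides: it draws repeated samples of 700 births with every day equally likely (that is, with the null hypothesis true) and averages the resulting statistics. The function names, the number of repetitions, and the seed are arbitrary choices made for illustration. The long-run average is in fact the number of categories minus one, here 6, which agrees with the rough figure above.

```python
import random

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_null(n=700, k=7, reps=1000, seed=0):
    """Simulate the statistic when all k categories really are equally likely."""
    rng = random.Random(seed)
    expected = [n / k] * k
    stats = []
    for _ in range(reps):
        counts = [0] * k
        for _ in range(n):
            counts[rng.randrange(k)] += 1   # each birth lands on a random day
        stats.append(chi_square_statistic(counts, expected))
    return stats

stats = simulate_null()
mean_stat = sum(stats) / len(stats)
print(round(mean_stat, 2))
```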

The $\chi^2$ distribution
To interpret the value $\chi^2 = 19.12$ we need to use the (appropriately named) chi-squared ($\chi^2$) distribution. This distribution is actually a family of distributions indexed by degrees of freedom, like the $t$-distribution.

The $\chi^2$ distribution
The $\chi^2$ distribution is not symmetric, and it never takes on values less than 0. To turn the test statistic into a P-value we need to determine the correct df to use and then find the area in the tail to the right of the test statistic.

Birthday Example

           S     M     T     W     Th    F     S
Births     84    110   124   104   94    112   72
Expected   100   100   100   100   100   100   100

The number of degrees of freedom is one less than the number of categories. In this case df = 7 − 1 = 6.
Reasoning behind the degrees of freedom: if we were to alter these proportions, how many could we alter? We could directly change at most 6, since the 7th must be whatever is needed for the proportions to sum to 1.
Note: Unlike the $t$ tests, the sample size does not determine the degrees of freedom.

Table Look-Up
Example: df = 6.
We find that with df = 6 the statistic $\chi^2 = 19.12$ has a P-value between 0.0025 and 0.005. This P-value indicates that the difference between our observations and our expectation is statistically significant, so we reject $H_0$.
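The table look-up can be cross-checked numerically. The sketch below uses a closed form for the upper-tail area of the chi-square distribution that is valid only when df is even, which covers df = 6 here; the function name is hypothetical. For arbitrary df a statistics library would normally be used instead, for example `scipy.stats.chi2.sf`.

```python
import math

def chi_square_upper_tail(x, df):
    """Area to the right of x under the chi-square curve (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form applies only to even, positive df")
    half = x / 2
    # For even df the tail area is exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
    return math.exp(-half) * sum(
        half ** j / math.factorial(j) for j in range(df // 2)
    )

p_value = chi_square_upper_tail(19.12, df=6)
print(round(p_value, 4))
```

The result falls between 0.0025 and 0.005, in agreement with the table.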

How is it Significant?
Since the variation in our data seems to be significant, we are interested in how it is different.

                 S      M     T      W      Th     F      S
Observed         84     110   124    104    94     112    72
Expected         100    100   100    100    100    100    100
χ² component     2.56   1     5.76   0.16   0.36   1.44   7.84

The larger components indicate days where there is a greater difference. It appears that Tuesday has more births and Saturday has fewer births than we would expect from random variation alone.
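One way to read the table is to rank the days by the size of their component and note the direction of each deviation. The following is a minimal sketch under that reading; the day labels and variable names are assumptions made for the example.

```python
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
observed = [84, 110, 124, 104, 94, 112, 72]
expected = [100] * 7

rows = []
for day, o, e in zip(days, observed, expected):
    component = (o - e) ** 2 / e
    if o > e:
        direction = "more"
    elif o < e:
        direction = "fewer"
    else:
        direction = "same"
    rows.append((component, day, direction))

# Largest components first: these days contribute most to the statistic.
rows.sort(reverse=True)
for component, day, direction in rows:
    print(f"{day}: {component:.2f} ({direction} births than expected)")
```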

Summary
The $\chi^2$ test for goodness of fit with $k$ categories has the null hypothesis
$$H_0: p_1 = p_{10},\ p_2 = p_{20},\ \ldots,\ p_k = p_{k0}$$
and the alternative hypothesis that at least one of the above equalities does not hold. With a sample of size $n$, where $X_i$ is the observed count in category $i$, the test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(\text{observed count}_i - \text{expected count}_i)^2}{\text{expected count}_i} = \sum_{i=1}^{k} \frac{(X_i - n p_{i0})^2}{n p_{i0}}$$
The P-value is the area under the chi-square curve with df = $k − 1$ degrees of freedom to the right of $\chi^2$.

Conditions on $\chi^2$ test
◮ We have some number $n$ of independent observations.
◮ There are only a finite number $k$ of categories.
◮ The probability of observing any given category does not change from observation to observation.
◮ The sample size $n$ is large enough that every expected count is at least 1 (although we may not actually see one of each category).
◮ And $n$ is large enough that at least 80% of the expected counts are larger than 5.
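The two sample-size conditions in the list above lend themselves to a mechanical check. The helper below is a hypothetical sketch, not something from the slides; it covers only the conditions on expected counts, since independence and constant probabilities have to be judged from the study design.

```python
def check_expected_counts(n, null_props):
    """Check the expected-count conditions for a goodness-of-fit test."""
    expected = [n * p for p in null_props]
    all_at_least_one = all(e >= 1 for e in expected)
    # Fraction of categories whose expected count is larger than 5.
    share_above_five = sum(e > 5 for e in expected) / len(expected)
    return {
        "expected": expected,
        "all_at_least_one": all_at_least_one,
        "share_above_five": share_above_five,
        "ok": all_at_least_one and share_above_five >= 0.8,
    }

# Birth example: 700 births, each day hypothesized to have proportion 1/7.
result = check_expected_counts(700, [1 / 7] * 7)
print(result["ok"])
```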

Another Example
In the fish H. rosaceus a certain genetic marker has two possible variants, A and B. From previous studies we expect the following proportions in the population:

Variant A   61%
Variant B   19%
No Marker   20%

We take a random sample of $n = 27$ fish from a remote population which we believe to be H. rosaceus. Testing for the marker in the sample gives the counts Variant A = 19, Variant B = 2, No Marker = 6. Does the distribution of the marker agree with the previous studies?
Are the conditions for performing the $\chi^2$ test met?
◮ The expected counts are (0.61)(27) = 16.47, (0.19)(27) = 5.13, and (0.20)(27) = 5.40, which are all at least 1 and all larger than 5, so they are large enough.

Another Example

Marker      Exp. %   Expected   Obs.   χ² component
Variant A   61%      16.47      19     (19 - 16.47)^2 / 16.47 = 0.39
Variant B   19%      5.13       2      (2 - 5.13)^2 / 5.13 = 1.91
No Marker   20%      5.40       6      (6 - 5.40)^2 / 5.40 = 0.07

Summing the components gives $\chi^2 = 2.37$. There are df = 3 − 1 = 2 degrees of freedom. Looking this up in the table gives a P-value > 0.25 (in fact, about 0.307). Thus we do not reject $H_0$: our observations are consistent with the sampled fish being H. rosaceus.
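The fish example can be reproduced as follows. This is a sketch under the stated numbers; the dictionary layout and names are illustrative assumptions. It relies on the fact that for df = 2 the area to the right of $x$ under the chi-square curve is exactly $e^{-x/2}$, so no table or statistics library is needed in this particular case.

```python
import math

null_props = {"Variant A": 0.61, "Variant B": 0.19, "No Marker": 0.20}
observed = {"Variant A": 19, "Variant B": 2, "No Marker": 6}
n = sum(observed.values())   # sample size, 27 fish

components = {}
for marker, p in null_props.items():
    expected = n * p         # expected count under the previous studies
    components[marker] = (observed[marker] - expected) ** 2 / expected

chi_square = sum(components.values())
df = len(null_props) - 1

# Closed form that holds only for df = 2: upper-tail area is exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print(round(chi_square, 2), df, round(p_value, 3))
```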

Not Rejecting $H_0$
As usual, not rejecting $H_0$ does not necessarily mean that $H_0$ is true. It just means our data are consistent with the null hypothesis. The following are all possible.
◮ $H_0$ is indeed true.
◮ $H_0$ is false, but the truth is too close to $H_0$ to distinguish statistically.
◮ $H_0$ is false, but the sample size is too small for the test to have enough power to detect it.
