SLIDE 1
ACMS 20340 Statistics for Life Sciences
Chapter 21: The Chi-Square Test for Goodness of Fit
Proportions
When studying population proportions, we are considering categorical variables that only have two possible classes, which we call
SLIDE 2
SLIDE 3
Categorical Variables with Multiple Categories
But categorical variables may have more than two outcomes:
[Figure: a categorical variable with three categories, A, B, and C]
This one has three.
SLIDE 4
How well does our data fit?
Suppose we expect that the world looks a certain way:
[Figure: the expected distribution over categories A, B, C next to the sample's distribution over A, B, C]
But our sample looks different. This difference may be due to the sample being random, or it may just be that the world does not look like our idea.
A χ² test measures how much the sample does not match the way we expected the world to be.
χ is the Greek letter “chi” (rhymes with “eye”).
SLIDE 5
Births
Suppose we consider the day of the week on which a birth occurs for a random sample of 700 births. We hypothesize each day of the week is equally likely. That is,

$$H_0: p_1 = \tfrac{1}{7},\ p_2 = \tfrac{1}{7},\ \ldots,\ p_7 = \tfrac{1}{7}$$

where $p_1$ is the proportion of births on Sunday, $p_2$ is the proportion on Monday, etc. This is our null hypothesis.
Notice that we could have made $p_1, \ldots, p_7$ anything we wanted, as long as they sum to 1.
The alternative hypothesis is that at least one of the $p_i$ is not equal to its hypothesized value $p_{i0}$ ($= \tfrac{1}{7}$).
SLIDE 6
Birth Data
We then collect the following birth data (from a sample of size n = 700):

Day       S     M     T     W     Th    F     S
Births    84    110   124   104   94    112   72
Percent   12%   16%   18%   15%   13%   16%   10%

Under the null hypothesis, the overall proportion of births on each day is 1/7 ≈ 14%. The question is whether the observed data are significantly different from what we expected.
SLIDE 7
Birth Data
[Figure: plot of births by day of the week (S M T W Th F S); vertical axis ticks from 20 to 120]
SLIDE 8
How significant is the difference?
Just how many births did we expect to see on Sunday? (This is the expected count of births on Sunday.)

Sunday proportion × 700: $p_1 n = \tfrac{1}{7} \cdot 700 = 100$

Doing the same for the other days gives:

Day        S     M     T     W     Th    F     S
Expected   100   100   100   100   100   100   100

Note: the expected counts will not always be whole numbers.
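The expected-count step can be sketched in a few lines of Python. This is only an illustration: the names `n`, `null_props`, and `expected` are chosen here and are not from the slides. Exact fractions are used so that (1/7)(700) comes out as exactly 100.

```python
from fractions import Fraction

n = 700                              # sample size from the births example
null_props = [Fraction(1, 7)] * 7    # hypothesized proportion for each day

# Expected count for each category: n * p_i0
expected = [n * p for p in null_props]

print([int(e) for e in expected])
```

With other hypothesized proportions the same product generally gives non-integer expected counts, which is fine for the test.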
SLIDE 9
Computing the χ² Statistic
Day          S      M      T      W      Th     F      S
Births       84     110    124    104    94     112    72
Expected     100    100    100    100    100    100    100
Difference   −16    10     24     4      −6     12     −28
Diff. Sq.    256    100    576    16     36     144    784
Normalized   2.56   1      5.76   0.16   0.36   1.44   7.84

Step 1: Take the difference of the actual counts of births and the expected counts of births.
Step 2: Square each of the differences.
Step 3: Normalize the squared difference by dividing by the number we expected to see (e.g. Sunday = 256/100).

These are called the χ² components, one for each category. The chi-squared statistic is then the sum of the components:

χ² = 2.56 + 1 + 5.76 + 0.16 + 0.36 + 1.44 + 7.84 = 19.12
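The three steps translate directly into code. The following is a minimal sketch using the birth counts from the table; the variable names are illustrative.

```python
observed = [84, 110, 124, 104, 94, 112, 72]   # births, Sunday through Saturday
expected = [100] * 7                          # 700 births * 1/7 per day

# Steps 1-3: take the difference, square it, divide by the expected count
components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# The chi-squared statistic is the sum of the components
chi_sq = sum(components)

print([round(c, 2) for c in components])
print(round(chi_sq, 2))
```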
SLIDE 10
Normalization
By normalizing, we are measuring each squared difference relative to the count we expected to see.
Thinking about χ²:
◮ χ² is always ≥ 0 (why?)
◮ If we observed exactly what we expected, the corresponding χ² value would be 0.
◮ The farther the data is from what we expected, the larger the χ² value will be.
◮ Since we expect some random variation, we expect each component to contribute around 1 or so.
◮ Adding together 7 components means we should expect a χ² around 7 or so.
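One way to get a feel for how large χ² is when the null hypothesis is true is to simulate it. The sketch below is an illustration, not part of the course material: it draws many samples of 700 births with every day equally likely and averages the resulting statistics. The function names, the seed, and the number of repetitions are arbitrary choices. In theory the long-run average equals the number of categories minus one (here 6), a little below the rough figure of 7.

```python
import random

def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_mean_chi_square(n=700, num_categories=7, reps=2000, seed=0):
    """Average chi-squared statistic over simulated samples drawn under H0."""
    rng = random.Random(seed)
    expected = [n / num_categories] * num_categories
    total = 0.0
    for _ in range(reps):
        counts = [0] * num_categories
        # Each birth lands on one of the days with equal probability
        for day in rng.choices(range(num_categories), k=n):
            counts[day] += 1
        total += chi_square_stat(counts, expected)
    return total / reps

mean_stat = simulate_mean_chi_square()
print(round(mean_stat, 2))
```

Against this background, an observed value of 19.12 is well above what random variation alone typically produces.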
SLIDE 11
The χ² distribution
To interpret the value χ² = 19.12 we need to use the (appropriately named) chi-squared (χ²) distribution. This distribution is actually a family of distributions indexed by degrees of freedom, like the t-distribution.
SLIDE 12
The χ² distribution
The χ² distribution is not symmetric, and it never takes on values less than 0.
To turn the test score into a P-value we need to determine the correct df to use and then find the area in the tail to the right of the test score.
SLIDE 13
Birthday Example
Day        S     M     T     W     Th    F     S
Births     84    110   124   104   94    112   72
Expected   100   100   100   100   100   100   100

The degrees of freedom is 1 less than the number of categories. In this case df = 7 − 1 = 6.
Reasoning behind the degrees of freedom: If we were to alter these proportions, how many could we alter? We could directly change at most 6, since the 7th one must be whatever is needed for the overall sum of proportions to be 1.
Note: Unlike the t tests, the sample size does not determine the degrees of freedom.
SLIDE 14
Table Look-Up
Ex: df = 6
We find that with df = 6 the score χ² = 19.12 has a P-value between 0.0025 and 0.005. This P-value indicates that the difference between our observation and our expectation is statistically significant, so we reject H0.
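The table look-up can be cross-checked numerically. The sketch below uses only the standard library and relies on a closed form for the right-tail area that holds when the degrees of freedom are even: P(χ² > x) = e^(−x/2) · Σ (x/2)^j / j!, summed over j = 0, …, df/2 − 1. The function name is made up for this illustration, and the formula does not cover odd df; a statistics library would be the usual tool in general.

```python
import math

def chi2_right_tail_even_df(x, df):
    """Right-tail area P(chi-square_df > x); valid for positive even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    # Sum of (x/2)^j / j! for j = 0 .. df/2 - 1
    series = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * series

p_value = chi2_right_tail_even_df(19.12, 6)
print(p_value)
```

The result should fall inside the bracket read from the table.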
SLIDE 15
How is it Significant?
Since the variation in our data seems to be significant, we are interested in how it is different.

Day            S      M      T      W      Th     F      S
Observed       84     110    124    104    94     112    72
Expected       100    100    100    100    100    100    100
χ² component   2.56   1      5.76   0.16   0.36   1.44   7.84

The larger components indicate days where there is a greater difference. It appears Tuesday has more births and Saturday has fewer births than expected from random variation.
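To see at a glance which days drive the result, the components can be sorted from largest to smallest along with the direction of the departure. This is an illustrative sketch; the day labels and variable names are chosen here.

```python
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
observed = [84, 110, 124, 104, 94, 112, 72]
expected = [100] * 7

rows = []
for day, o, e in zip(days, observed, expected):
    component = (o - e) ** 2 / e
    if o > e:
        direction = "more"
    elif o < e:
        direction = "fewer"
    else:
        direction = "as many"
    rows.append((component, day, direction))

# Largest components first
rows.sort(reverse=True)
for component, day, direction in rows:
    print(f"{day}: {component:.2f} ({direction} births than expected)")
```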
SLIDE 16
Summary
The χ² test for goodness of fit with k categories has a null hypothesis

$$H_0: p_1 = p_{10},\ p_2 = p_{20},\ \ldots,\ p_k = p_{k0}$$

and alternative hypothesis that at least one of the above equalities does not hold.
With a sample of size n, where $X_i$ is the observed count in category i, the test statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(X_i - n p_{i0})^2}{n p_{i0}} = \sum_{i=1}^{k} \frac{(\text{observed count}_i - \text{expected count}_i)^2}{\text{expected count}_i}$$

The P-value comes from finding the area under the chi-square curve with df = k − 1 degrees of freedom to the right of χ².
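The summary can be wrapped in a small general-purpose function. This is a sketch under the assumption that the inputs are a list of observed counts and a matching list of hypothesized proportions; the name `chi_square_gof` and its checks are illustrative, not a library API.

```python
import math

def chi_square_gof(observed, null_props):
    """Return (chi-squared statistic, degrees of freedom) for a goodness-of-fit test."""
    if len(observed) != len(null_props):
        raise ValueError("need one hypothesized proportion per category")
    if not math.isclose(sum(null_props), 1.0):
        raise ValueError("hypothesized proportions must sum to 1")
    n = sum(observed)
    # Sum over categories of (X_i - n*p_i0)^2 / (n*p_i0)
    stat = sum((x - n * p) ** 2 / (n * p) for x, p in zip(observed, null_props))
    return stat, len(observed) - 1

stat, df = chi_square_gof([84, 110, 124, 104, 94, 112, 72], [1 / 7] * 7)
print(round(stat, 2), df)
```

Returning the degrees of freedom together with the statistic keeps the two values needed for the P-value look-up in one place.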
SLIDE 17
Conditions on the χ² test
◮ We have some number n of independent observations.
◮ There are only a finite number k of categories.
◮ The probability of observing any given category does not change from observation to observation.
◮ The sample size n is large enough that we expect to see at least 1 of every category (although we may not actually see one of each category).
◮ And n is large enough that at least 80% of the expected counts are larger than 5.
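The two sample-size conditions are easy to check mechanically. The sketch below covers only those two; independence and constant category probabilities depend on how the data were collected and cannot be checked from the counts. The function name is hypothetical.

```python
def gof_conditions_met(n, null_props):
    """Check the two expected-count conditions for the chi-squared test."""
    expected = [n * p for p in null_props]
    # Every category should have an expected count of at least 1
    all_at_least_one = all(e >= 1 for e in expected)
    # At least 80% of the expected counts should be larger than 5
    share_above_five = sum(e > 5 for e in expected) / len(expected)
    return all_at_least_one and share_above_five >= 0.8

print(gof_conditions_met(700, [1 / 7] * 7))
```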
SLIDE 18
Another Example
In the fish H. rosaceus a certain genetic marker has two possible variants, A and B. From previous studies we expect the following proportions in the population:

Variant A   61%
Variant B   19%
No Marker   20%

We take a random sample of n = 27 fish from a remote population which we believe to be H. rosaceus. Testing for the marker in the sample gives the counts Variant A = 19, Variant B = 2, No Marker = 6.
Does the distribution of the marker agree with the previous studies?
Are the conditions for performing the χ² test met?
◮ The expected counts are (0.61)(27) = 16.47, (0.19)(27) = 5.13, (0.20)(27) = 5.40, which are all large enough.
SLIDE 19
Another Example
Marker      Exp. %   Expected   Obs.   χ² component
Variant A   61%      16.47      19     0.39 = (19 − 16.47)²/16.47
Variant B   19%      5.13       2      1.91 = (2 − 5.13)²/5.13
No Marker   20%      5.40       6      0.07 = (6 − 5.40)²/5.40

Summing the components gives χ² = 2.37. There are df = 3 − 1 = 2 degrees of freedom.
Looking this up in the table gives a P-value > 0.25 (in fact, about 0.307).
Thus we do not reject H0: our observation is consistent with the fish sampled being H. rosaceus.
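The fish example can be reproduced in the same way. With df = 2 the right-tail area of the chi-squared distribution has the simple closed form e^(−x/2), which the sketch below uses in place of the table. Variable names are illustrative, and the code works with unrounded components, so its statistic may differ slightly from the sum of the rounded values.

```python
import math

null_props = [0.61, 0.19, 0.20]   # Variant A, Variant B, No Marker
observed = [19, 2, 6]
n = sum(observed)                 # 27 fish

expected = [n * p for p in null_props]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For df = 2, P(chi-square > x) = exp(-x / 2)
p_value = math.exp(-chi_sq / 2)

print(chi_sq, p_value)
```

The P-value comes out a little above 0.3, in line with the quoted value and well above any usual significance level.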
SLIDE 20