SLIDE 1
ACMS 20340 Statistics for Life Sciences
Chapter 19: Inference about a Population Proportion
Estimating the population proportion
Recall that we estimated the proportion $p$ of a population having some characteristic with the sample proportion $\hat{p} = \frac{\text{number of successes}}{n}$.
SLIDE 2
SLIDE 3
Properties of $\hat{p}$
Recall that if the sample size is large enough, then the sampling distribution of $\hat{p}$ is approximately Normal with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$.
We can form confidence intervals of the form $\hat{p} \pm z^* \sqrt{\frac{p(1-p)}{n}}$.
Problem: We do not know $p$.
SLIDE 4
Confidence intervals for p, take 1
First solution: use $\hat{p}$ in place of $p$! The interval $\hat{p} \pm z^* \sqrt{\frac{p(1-p)}{n}}$ becomes $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
This works when there are at least 15 successes and 15 failures in the sample.
◮ Why $z^*$ and not $t^*$?
◮ Why is this the “first solution”?
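The large-sample interval on this slide can be sketched in Python. This is a minimal illustration rather than part of the original slides: the function name `large_sample_ci` and the example counts (40 successes in 100 observations) are hypothetical, and $z^*$ is taken from the standard library's `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_ci(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Large-sample interval p_hat +/- z* sqrt(p_hat (1 - p_hat) / n)."""
    failures = n - successes
    # Condition from the slide: at least 15 successes and 15 failures.
    if successes < 15 or failures < 15:
        raise ValueError("need at least 15 successes and 15 failures")
    p_hat = successes / n
    # z* is the critical value with area `level` between -z* and z*.
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical data: 40 successes in a sample of 100.
low, high = large_sample_ci(40, 100)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

The function raises an error instead of returning an interval when the 15-and-15 condition fails, which mirrors the restriction stated above.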
SLIDE 5
Why not t∗?
Previously, when we used the sample mean $\bar{x}$ to estimate the population mean $\mu$, $\sigma$ was involved in the description of the spread of the distribution of $\bar{x}$. But since we also estimated $\sigma$, we were forced to use a $t$ distribution.
Now, we are using the sample proportion $\hat{p}$ to estimate the population proportion $p$. However, since the standard deviation of $\hat{p}$ is $\sqrt{p(1-p)/n}$, this only depends on $p$.
So there is really only one parameter, $p$, describing the distribution of $\hat{p}$. Thus, a $t$ distribution isn’t needed.
SLIDE 6
Why is this only a “first solution”?
This approach works for large samples: at least 15 successes and 15 failures means the sample must contain at least 30 observations.
For smaller samples, however, the estimate $\sqrt{\hat{p}(1-\hat{p})/n}$ is not a very good one. In particular, using $\sqrt{\hat{p}(1-\hat{p})/n}$ gives confidence intervals which are too narrow, so they capture $p$ less often than the stated confidence level.
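To see how the large-sample interval can fall short for a small sample, we can compute how often it actually contains $p$. The sketch below is an illustration only: the function name `wald_coverage` and the scenario ($n = 20$, true $p = 0.1$) are assumptions, not taken from the slides. It sums the exact binomial probabilities of every outcome whose interval contains $p$.

```python
from math import comb, sqrt


def wald_coverage(n: int, p: float, z_star: float = 1.96) -> float:
    """Exact probability that p_hat +/- z* sqrt(p_hat (1 - p_hat) / n) contains p."""
    coverage = 0.0
    for x in range(n + 1):
        p_hat = x / n
        margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            # Add the binomial probability of observing exactly x successes.
            coverage += comb(n, x) * p**x * (1 - p) ** (n - x)
    return coverage


# Hypothetical small-sample scenario: n = 20 observations, true p = 0.1.
small = wald_coverage(20, 0.1)
print(f"actual coverage of the nominal 95% interval: {small:.3f}")
```

In this scenario the computed coverage falls below the nominal 95%. One reason is that a sample with zero successes gives $\hat{p} = 0$ and an interval of zero width.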
SLIDE 7
Confidence intervals for p, take 2
A simple modification to calculating $\hat{p}$ which almost always produces better estimates is the so-called “plus four” method. Before calculating $\hat{p}$, we first add four imaginary observations, two of which are successes and two of which are failures.
$\hat{p} = \frac{\text{number of successes}}{n} \;\Rightarrow\; \tilde{p} = \frac{\text{number of successes} + 2}{n + 4}$
We then use the resulting statistic $\tilde{p}$ for our estimates. A few conditions:
◮ We need $n > 10$, and
◮ We should only work with confidence levels of at least 90%.
SLIDE 8
An example
Find a 95% confidence interval for the proportion $p$ of faculty members that wore costumes for Halloween. In a sample of size $n = 42$ there were 6 faculty that wore costumes.
$\tilde{p} = \frac{6 + 2}{42 + 4} = 0.174$ and $SE(\tilde{p}) = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n + 4}} = \sqrt{\frac{0.144}{42 + 4}} = 0.056$.
This yields the interval $\tilde{p} \pm z^* SE(\tilde{p}) = 0.174 \pm (1.96)(0.056) = [0.064, 0.284]$.
Note that we used $n + 4$ when calculating the standard error!
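The plus-four calculation for this example can be reproduced with a short Python sketch. The function name `plus_four_ci` is hypothetical; the data (6 successes, $n = 42$) and $z^* = 1.96$ come from the slide. The code carries full precision through the calculation, so the last digit of each endpoint may differ slightly from a hand calculation with rounded intermediate values.

```python
from math import sqrt


def plus_four_ci(successes: int, n: int, z_star: float = 1.96) -> tuple[float, float, float]:
    """Return (p_tilde, lower, upper) for the plus-four interval."""
    # Add four imaginary observations: two successes and two failures.
    p_tilde = (successes + 2) / (n + 4)
    # The standard error also uses n + 4.
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde, p_tilde - z_star * se, p_tilde + z_star * se


# Costume example from the slide: 6 successes in a sample of n = 42.
p_tilde, low, high = plus_four_ci(6, 42)
print(f"p_tilde = {p_tilde:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```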
SLIDE 9
Estimating the needed sample size
When calculating a confidence interval for $p$ from a large sample, we used the interval $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$. (We are not taking the “plus four” estimation into account here.)
Suppose we want a confidence interval with a given margin of error $m$: $\hat{p} \pm m$. How large a sample do we need?
We want $m = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
SLIDE 10
Estimating the needed sample size
$m = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Solve for $n$: $n = \left(\frac{z^*}{m}\right)^2 \hat{p}(1-\hat{p})$
Problem: $\hat{p}$ is found after taking a sample. Yet, we need to find $n$ before taking the sample.
We need a number to use in place of $\hat{p}$. There are a few options.
◮ Guess some value $p^*$ which we think will be close to $\hat{p}$ (perhaps using some prior knowledge of the population).
◮ Use $p^* = 0.5$ as our guess. (Why 0.5? It maximizes the needed sample size over all possible $p$ values.)
SLIDE 11
Why does p∗ = 0.5 maximize the sample size?
Fixing $z^* = 1.96$ and $m = 0.1$ we consider what happens to the sample size as our guess $p^*$ varies between 0 and 1.
[Figure: plot of $(z^*/m)^2 p^*(1-p^*)$ against $p^*$ for $0 \le p^* \le 1$]
This is the graph of $n = (z^*/m)^2 p^*(1-p^*)$. The graph is always a parabola with zeros $p^* = 0$ and $p^* = 1$. The maximum is always at $p^* = 0.5$.
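The location of the maximum can also be checked by completing the square. The short derivation below is added for illustration and uses only the symbols already defined on this slide.

```latex
\[
n(p^*) = \left(\frac{z^*}{m}\right)^{2} p^*(1-p^*)
       = \left(\frac{z^*}{m}\right)^{2}
         \left[\frac{1}{4} - \left(p^* - \frac{1}{2}\right)^{2}\right]
\]
```

The squared term is never negative, so $n$ is largest when $p^* = 1/2$, where it equals $\frac{1}{4}(z^*/m)^2$. With $z^* = 1.96$ and $m = 0.1$ this maximum is $96.04$.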
SLIDE 12
An Example 1
We want to survey residents in the South Bend area to see how many are aware of the dangers of dihydrogen monoxide. How large a sample would we need to get a 95% confidence interval with a 2% margin of error?
For a 95% CI we use $z^* = 1.96$. Asking for a 2% margin of error is the same as asking for $m = 0.02$. We have no idea what the true proportion will be, so we take $p^* = 0.5$.
SLIDE 13
An Example 2
Calculating using the formula from before: $n = \left(\frac{z^*}{m}\right)^2 p^*(1-p^*) = \left(\frac{1.96}{0.02}\right)^2 (0.5)(0.5) = 2401$
We would need a sample of size at least 2401 to get such a CI.
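The sample-size formula can be sketched in Python as follows. The function name `required_sample_size` is hypothetical. As a design choice (an assumption of this sketch, not something the slides prescribe), inputs are passed as strings and handled with `decimal.Decimal`, so that floating-point noise cannot push an exact answer such as 2401 up to the next integer when rounding up.

```python
from decimal import Decimal, ROUND_CEILING


def required_sample_size(z_star: str, margin: str, p_guess: str = "0.5") -> int:
    """Smallest integer n with n >= (z*/m)^2 p*(1 - p*)."""
    z = Decimal(z_star)
    m = Decimal(margin)
    p = Decimal(p_guess)
    n_raw = (z / m) ** 2 * p * (1 - p)
    # Round up: a fractional observation is not possible.
    return int(n_raw.to_integral_value(rounding=ROUND_CEILING))


# Example from the slide: 95% confidence, 2% margin of error, p* = 0.5.
n_needed = required_sample_size("1.96", "0.02")
print(n_needed)
```

Passing a different guess, for example `required_sample_size("1.96", "0.02", "0.25")`, gives a smaller required sample, consistent with $p^* = 0.5$ being the most conservative choice.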
SLIDE 14
Hypotheses about Proportions
Continuing the dihydrogen monoxide study, we think that about 25% of the South Bend population knows about the dangers. We can formulate this thought as a hypothesis test:
$H_0 : p = 0.25$
$H_a : p \neq 0.25$
In general:
$H_0 : p = p_0$
$H_a : p \neq p_0$
SLIDE 15
The Test Statistic for a Proportion Hypothesis Test
Remember, when we do a hypothesis test we calculate the test statistic under the assumption that $H_0$ is true. If $H_0$ is true, then $p = p_0$ and so $\hat{p}$ has mean $p_0$ and standard deviation $\sqrt{p_0(1-p_0)/n}$.
So, from $\hat{p}$ we calculate the test statistic as $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$, where $z$ has approximately a standard Normal distribution when $H_0$ is true.
SLIDE 16
Conditions on proportion hypothesis tests
Some conditions need to be met to do a hypothesis test of $H_0 : p = p_0$.
◮ This test requires a sample size $n$ large enough that both $np_0 \ge 10$ and $n(1-p_0) \ge 10$ (i.e. the expected numbers of successes and failures are both at least 10).
◮ Fortunately, these expected counts only depend on $p_0$, the proportion we are performing the test against, not on the actual counts which show up in the sample.
◮ The “plus four” technique only applies to confidence intervals. We do not need to use it for the hypothesis test since knowing $p_0$ gives us the standard deviation of $\hat{p}$.
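The sample-size condition in the first bullet is easy to check in code. The sketch below is illustrative only: the function name `z_test_conditions_met` and the two example sample sizes are hypothetical.

```python
def z_test_conditions_met(n: int, p0: float) -> bool:
    """Check n*p0 >= 10 and n*(1 - p0) >= 10 for the large-sample z test."""
    expected_successes = n * p0
    expected_failures = n * (1 - p0)
    return expected_successes >= 10 and expected_failures >= 10


# Hypothetical checks: the same null value with a larger and a smaller sample.
large_ok = z_test_conditions_met(200, 0.25)  # expected counts 50 and 150
small_ok = z_test_conditions_met(30, 0.25)   # expected counts 7.5 and 22.5
print(large_ok, small_ok)
```

Note that the check uses only $n$ and $p_0$, never the observed counts, as the second bullet points out.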
SLIDE 17
DHMO Example Test, 1
In our survey we find 29 people are aware of DHMO in a sample of size $n = 183$. Let’s perform a large-sample hypothesis test at the $\alpha = 0.05$ significance level.
$H_0 : p = 0.25$
$H_a : p \neq 0.25$
Do the conditions to use the test apply? $np_0 = 183 \times 0.25 = 45.75 \ge 10$ and $n(1-p_0) = 183 \times 0.75 = 137.25 \ge 10$. So, yes they do.
SLIDE 18
DHMO Example Test, 2
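Finishing the calculation, we have $\hat{p} = 29/183 = 0.158$ and hence $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.158 - 0.25}{\sqrt{(0.25)(0.75)/183}} = -2.87$.
Using the table for a two-tailed test we get a p-value between 0.002 and 0.005. Since this is less than $\alpha = 0.05$, we reject $H_0$ at the level $\alpha = 0.05$.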
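The test statistic and two-sided p-value for the DHMO survey can be computed with the Python sketch below. The function name `one_proportion_z_test` is hypothetical, and the p-value comes from `statistics.NormalDist` rather than a table. Because the code uses the unrounded $\hat{p} = 29/183$, $z$ comes out at about $-2.86$ rather than the $-2.87$ obtained from the rounded value $0.158$; the conclusion is unchanged.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes: int, n: int, p0: float) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: p = p0 against Ha: p != p0."""
    p_hat = successes / n
    # Under H0 the standard deviation of p_hat is sqrt(p0 (1 - p0) / n).
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Two-sided p-value: area in both tails beyond |z|.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# DHMO survey from the slides: 29 aware out of n = 183, with p0 = 0.25.
z, p_value = one_proportion_z_test(29, 183, 0.25)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```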
SLIDE 19
DHMO confidence interval, 1
Using our DHMO survey data, $n = 183$ with 29 successes, find a 95% confidence interval for $p$. We will use the “plus four” technique. Are the conditions met?
◮ The confidence level is at least 90%.
◮ $n = 183 \ge 10$.
Yes, the conditions are met.
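A sketch of the resulting plus-four interval, with the two conditions checked in code, is given below. The function name `plus_four_ci_checked` is hypothetical. The slides state the sample-size condition both as $n > 10$ and as $n \ge 10$; this sketch assumes the latter. $z^*$ is taken from `statistics.NormalDist` rather than a table.

```python
from math import sqrt
from statistics import NormalDist


def plus_four_ci_checked(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Plus-four interval, refusing to run when the stated conditions fail."""
    if n < 10:
        raise ValueError("plus-four method needs a sample size of at least 10")
    if level < 0.90:
        raise ValueError("plus-four method needs a confidence level of at least 90%")
    # Add two imaginary successes and two imaginary failures.
    p_tilde = (successes + 2) / (n + 4)
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return p_tilde - z_star * se, p_tilde + z_star * se


# DHMO survey data: 29 successes in n = 183.
low, high = plus_four_ci_checked(29, 183)
print(f"95% plus-four CI: [{low:.3f}, {high:.3f}]")
```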
SLIDE 20