ACMS 20340 Statistics for Life Sciences Chapter 19: Inference - PowerPoint PPT Presentation

  1. ACMS 20340 Statistics for Life Sciences Chapter 19: Inference about a Population Proportion

  2. Estimating the population proportion: Recall that we estimated the proportion $p$ of a population having some characteristic with the sample proportion $\hat{p}$. Some percentage of faculty wore costumes on Halloween. A sample of 42 faculty members showed that 6 wore costumes and 36 did not. The parameter $p$ is the true proportion of faculty that wore costumes. The statistic $\hat{p}$ is the proportion of the sample that wore costumes: $\hat{p} = \dfrac{\text{the number who wore costumes}}{\text{the total size of the sample}}$. In this case, $\hat{p} = 6/42 = 0.143$.

  3. Properties of $\hat{p}$: Recall that if the sample size is large enough, then the sampling distribution of $\hat{p}$ is approximately Normal with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$. We can form confidence intervals of the form $\hat{p} \pm z^* \sqrt{p(1-p)/n}$. Problem: we do not know $p$.

  4. Confidence intervals for $p$, take 1: As a first solution, use $\hat{p}$ in place of $p$! The interval $\hat{p} \pm z^* \sqrt{p(1-p)/n}$ becomes $\hat{p} \pm z^* \sqrt{\hat{p}(1-\hat{p})/n}$. This works when there are at least 15 successes and 15 failures in the sample. ◮ Why $z^*$ and not $t^*$? ◮ Why is this the “first solution”?
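The large-sample interval above can be sketched in a few lines of Python. This is a minimal illustration, not course code: the function name `large_sample_ci` and the second sample (30 successes out of 100) are assumptions made for the example, and $z^*$ is obtained from the standard library's `NormalDist` rather than from a table.

```python
import math
from statistics import NormalDist


def large_sample_ci(successes, n, conf_level=0.95):
    """Large-sample z interval for a proportion, plus a conditions flag."""
    failures = n - successes
    p_hat = successes / n
    # z* is the critical value with area conf_level between -z* and z*.
    z_star = NormalDist().inv_cdf(0.5 + conf_level / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # The slides require at least 15 successes and 15 failures.
    conditions_met = successes >= 15 and failures >= 15
    return (p_hat - z_star * se, p_hat + z_star * se), conditions_met


# Faculty costume sample from the slides: only 6 successes, so the
# 15-successes / 15-failures condition fails.
_, ok_faculty = large_sample_ci(6, 42)
print("faculty sample meets conditions:", ok_faculty)

# Hypothetical sample (not from the slides): 30 successes out of 100.
interval, ok = large_sample_ci(30, 100)
print("hypothetical interval:", interval, "conditions met:", ok)
```

Returning the conditions flag alongside the interval keeps the check visible to the caller instead of silently producing an interval that should not be trusted.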

  5. Why not $t^*$? Previously, when we used the sample mean $\bar{x}$ to estimate the population mean $\mu$, $\sigma$ was involved in the description of the spread of the distribution of $\bar{x}$. But since we also estimated $\sigma$, we were forced to use a $t$ distribution. Now, we are using the sample proportion $\hat{p}$ to estimate the population proportion $p$. However, since the standard deviation of $\hat{p}$ is $\sqrt{p(1-p)/n}$, this only depends on $p$. So there is really only one parameter, $p$, describing the distribution of $\hat{p}$. Thus, a $t$ distribution isn’t needed.

  6. Why is this only a “first solution”? This approach works for large samples: at least 15 successes and 15 failures means the sample must contain at least 30 observations. For smaller samples, however, the estimate $\sqrt{\hat{p}(1-\hat{p})/n}$ is not a very good one. In particular, using $\sqrt{\hat{p}(1-\hat{p})/n}$ gives confidence intervals which are too small.

  7. Confidence intervals for $p$, take 2: A simple modification to calculating $\hat{p}$ which almost always produces better estimates is the so-called “plus four” method. Before calculating $\hat{p}$, we first add four imaginary observations, two of which are successes and two of which are failures: $\hat{p} = \dfrac{\text{number of successes}}{n} \;\Rightarrow\; \tilde{p} = \dfrac{\text{number of successes} + 2}{n + 4}$. We then use the resulting statistic $\tilde{p}$ for our estimates. A few conditions: ◮ We need $n \ge 10$, and ◮ We should only work with confidence levels of at least 90%.
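One way to see why the plus-four adjustment helps is to compute, exactly, how often each 95% interval captures the true proportion for a small sample. The sketch below sums binomial probabilities over every possible count; the values $n = 20$ and $p = 0.1$ are assumptions chosen for illustration and do not come from the slides, and the helper names are hypothetical. For these values the large-sample interval captures $p$ roughly 88% of the time, while the plus-four interval comes out near 96%.

```python
import math
from statistics import NormalDist

Z_STAR = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def covers(center, n_eff, p):
    """True if center +/- z* sqrt(center(1-center)/n_eff) contains p."""
    margin = Z_STAR * math.sqrt(center * (1 - center) / n_eff)
    return center - margin <= p <= center + margin


def coverage(n, p, plus_four):
    """Exact probability that the interval captures p when X ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        if plus_four:
            hit = covers((k + 2) / (n + 4), n + 4, p)
        else:
            hit = covers(k / n, n, p)
        if hit:
            total += binom_pmf(k, n, p)
    return total


# Assumed illustration values (not from the slides): n = 20, p = 0.1.
large_sample_cov = coverage(20, 0.1, plus_four=False)
plus_four_cov = coverage(20, 0.1, plus_four=True)
print(f"large-sample coverage: {large_sample_cov:.3f}")
print(f"plus-four coverage:    {plus_four_cov:.3f}")
```

Because the calculation enumerates all outcomes rather than simulating, it gives the same answer on every run.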

  8. An example: Find a 95% confidence interval for the proportion $p$ of faculty members that wore costumes for Halloween. In a sample of size $n = 42$ there were 6 faculty that wore costumes. $\tilde{p} = \dfrac{6 + 2}{42 + 4} = 0.174$ and $SE(\tilde{p}) = \sqrt{\dfrac{\tilde{p}(1-\tilde{p})}{n+4}} = \sqrt{\dfrac{0.144}{42+4}} \approx 0.056$. This yields the interval $\tilde{p} \pm z^* \, SE(\tilde{p}) = 0.174 \pm (1.96)(0.056) \approx [0.064, 0.283]$. Note that we used $n + 4$ when calculating the standard error!
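The arithmetic in this example can be checked with a short sketch. The function name `plus_four_ci` is hypothetical, and $z^* = 1.96$ is taken from the slides. No intermediate rounding is done, so the last digit may differ slightly from values worked by hand.

```python
import math


def plus_four_ci(successes, n, z_star=1.96):
    """Plus-four interval: add 2 imaginary successes and 2 imaginary failures."""
    p_tilde = (successes + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))  # note n + 4, not n
    return p_tilde, se, (p_tilde - z_star * se, p_tilde + z_star * se)


# Faculty costume sample: 6 successes out of n = 42.
p_tilde, se, (low, high) = plus_four_ci(6, 42)
print(f"p~ = {p_tilde:.3f}, SE = {se:.3f}, CI = [{low:.3f}, {high:.3f}]")
```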

  9. Estimating the needed sample size: When calculating a confidence interval for $p$ from a large sample, we used the interval $\hat{p} \pm z^* \sqrt{\hat{p}(1-\hat{p})/n}$. (We are not taking the “plus four” estimation into account here.) Suppose we want a confidence interval with a certain margin of error $m$: $\hat{p} \pm m$. How large a sample do we need? We want $m = z^* \sqrt{\hat{p}(1-\hat{p})/n}$.

  10. Estimating the needed sample size: Starting from $m = z^* \sqrt{\hat{p}(1-\hat{p})/n}$, solve for $n$: $n = \left(\dfrac{z^*}{m}\right)^2 \hat{p}(1-\hat{p})$. Problem: $\hat{p}$ is found after taking a sample. Yet, we need to find $n$ before taking the sample. We need a number to use in place of $\hat{p}$. There are a few options. ◮ Guess some value $p^*$ which we think will be close to $\hat{p}$ (perhaps using some prior knowledge of the population). ◮ Use $p^* = 0.5$ as our guess. (Why 0.5? It maximizes the needed sample size over all possible values of $p^*$.)

  11. Why does $p^* = 0.5$ maximize the sample size? Fixing $z^* = 1.96$ and $m = 0.1$, we consider what happens to the sample size as our guess $p^*$ varies between 0 and 1. This is the graph of $n = (z^*/m)^2 \, p^*(1-p^*)$. [Figure: $(z^*/m)^2 \, p^*(1-p^*)$ plotted against $p^*$ for $0 \le p^* \le 1$.] The graph is always a parabola with zeros at $p^* = 0$ and $p^* = 1$. The maximum is always at $p^* = 0.5$.
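The location of the maximum can also be confirmed with a short calculus check. The function name $f$ is introduced here only for the derivation.

```latex
f(p^*) = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*), \qquad
f'(p^*) = \left(\frac{z^*}{m}\right)^2 (1 - 2p^*) = 0
\;\Longrightarrow\; p^* = \tfrac{1}{2}, \qquad
f''(p^*) = -2\left(\frac{z^*}{m}\right)^2 < 0 .
```

Since the second derivative is negative, $p^* = 1/2$ is a maximum, and the peak value is $f(1/2) = \frac{1}{4}(z^*/m)^2$. For $z^* = 1.96$ and $m = 0.1$ this is $384.16 / 4 = 96.04$.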

  12. An Example 1: We want to survey residents in the South Bend area to see how many are aware of the dangers of dihydrogen monoxide. How large a sample would we need to get a 95% confidence interval with a 2% margin of error? For a 95% CI we use $z^* = 1.96$. Asking for a 2% margin of error is the same as asking for $m = 0.02$. We have no idea what the true proportion will be, so we take $p^* = 0.5$.

  13. An Example 2: Calculating using the formula from before: $n = \left(\dfrac{z^*}{m}\right)^2 p^*(1-p^*) = \left(\dfrac{1.96}{0.02}\right)^2 (0.5)(0.5) = 2401$. We would need a sample of size at least 2401 to get such a CI.
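The same calculation as a small helper, sketched under the assumption that a non-integer answer should be rounded up to the next whole observation. The function name `required_sample_size` is hypothetical, and the second call uses a guess of $p^* = 0.25$ purely to show how a less conservative guess shrinks the required sample.

```python
import math


def required_sample_size(margin, p_guess=0.5, z_star=1.96):
    """Smallest n with z* sqrt(p_guess (1 - p_guess) / n) <= margin."""
    raw = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    # Round up; the tiny tolerance guards against floating-point noise
    # pushing an exact integer answer just above itself.
    return math.ceil(raw - 1e-9)


n_needed = required_sample_size(0.02)            # conservative guess p* = 0.5
n_with_guess = required_sample_size(0.02, 0.25)  # illustrative guess p* = 0.25
print(n_needed, n_with_guess)
```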

  14. Hypotheses about Proportions: Continuing the dihydrogen monoxide study, we think that about 25% of the South Bend population knows about the dangers. We can formulate this thought as a hypothesis test: $H_0: p = 0.25$ versus $H_a: p \ne 0.25$. In general: $H_0: p = p_0$ versus $H_a: p \ne p_0$.

  15. The Test Statistic for a Proportion Hypothesis Test: Remember, when we do a hypothesis test we calculate the test statistic under the assumption that $H_0$ is true. If $H_0$ is true, then $p = p_0$ and so $\hat{p}$ has mean $p_0$ and standard deviation $\sqrt{p_0(1-p_0)/n}$. So, from $\hat{p}$ we calculate the test statistic as $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$, where $z$ has a standard Normal distribution.

  16. Conditions on proportion hypothesis tests: Some conditions need to be met to do a hypothesis test of $H_0: p = p_0$. ◮ This test requires a sample size $n$ large enough that both $np_0 \ge 10$ and $n(1-p_0) \ge 10$ (i.e., the expected numbers of successes and failures are both at least 10). ◮ Fortunately, these expected counts depend only on $p_0$, the proportion we are performing the test against, not on the actual counts which show up in the sample. ◮ The “plus four” technique only applies to confidence intervals. We do not need to use it for the hypothesis test since knowing $p_0$ gives us the standard deviation of $\hat{p}$.

  17. DHMO Example Test, 1: In our survey we find 29 people are aware of DHMO in a sample of size $n = 183$. Let’s perform a large-sample hypothesis test at the $\alpha = 0.05$ significance level: $H_0: p = 0.25$ versus $H_a: p \ne 0.25$. Do the conditions to use the test apply? $np_0 = 183 \times 0.25 = 45.75 \ge 10$ and $n(1-p_0) = 137.25 \ge 10$. So, yes they do.

  18. DHMO Example Test, 2: Finishing the calculation, we have $\hat{p} = 29/183 = 0.158$ and hence $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = -2.87$. Using the table for a two-tailed test we get a p-value between 0.002 and 0.005. Since this is below $\alpha = 0.05$, we reject $H_0$ at the level $\alpha = 0.05$.
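A minimal sketch of the whole test, including the conditions check, is given below. The function name `one_proportion_z_test` is hypothetical, and the p-value comes from the standard library's Normal CDF rather than a table. Carrying $\hat{p}$ at full precision gives $z \approx -2.86$; the value $-2.87$ arises when $\hat{p}$ is first rounded to 0.158. Either way the two-sided p-value falls between 0.002 and 0.005.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """Two-sided large-sample z test of H0: p = p0."""
    # Expected successes and failures under H0 must both be at least 10.
    conditions_met = n * p0 >= 10 and n * (1 - p0) >= 10
    p_hat = successes / n
    # Under H0 the standard deviation of p-hat uses p0, not p-hat.
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value, conditions_met


# DHMO survey: 29 aware out of n = 183, tested against p0 = 0.25.
z, p_value, ok = one_proportion_z_test(29, 183, 0.25)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}, conditions met: {ok}")
print("reject H0 at alpha = 0.05:", p_value < 0.05)
```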

  19. DHMO confidence interval, 1 Using our DHMO survey data, n = 183 with 29 successes, find a 95% confidence interval for p . We will use the “plus four” technique. Are the conditions met? ◮ The confidence level is at least 90%. ◮ n = 183 ≥ 10. Yes, the conditions are met.

  20. DHMO confidence interval, 2: Proceeding, we calculate $z^* = 1.96$ and $\tilde{p} = \dfrac{29 + 2}{183 + 4} = 0.166$. Plugging values into the formula $\tilde{p} \pm z^* \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$ yields $0.166 \pm 0.053$, which simplifies to $[0.112, 0.219]$.
