ACMS 20340 Statistics for Life Sciences Chapter 19: Inference - - PowerPoint PPT Presentation


Jul 06, 2023


SLIDE 1

ACMS 20340 Statistics for Life Sciences

Chapter 19: Inference about a Population Proportion

SLIDE 2

Estimating the population proportion

Recall that we estimated the proportion $p$ of a population having some characteristic with the sample proportion $\hat{p}$.

Some percentage of faculty wore costumes on Halloween. A sample of 42 faculty members showed that 6 wore costumes and 36 did not.

The parameter $p$ is the true proportion of faculty that wore costumes. The statistic $\hat{p}$ is the proportion of the sample that wore costumes.

$$\hat{p} = \frac{\text{number who wore costumes}}{\text{total size of the sample}}$$

In this case, $\hat{p} = 6/42 = 0.143$.

SLIDE 3

Properties of ˆ p

Recall that if the sample size is large enough, then the sampling distribution of $\hat{p}$ is approximately Normal with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$.

We can form confidence intervals of the form

$$\hat{p} \pm z^* \sqrt{\frac{p(1-p)}{n}}.$$

Problem: We do not know $p$.

SLIDE 4

Confidence intervals for p, take 1

First solution: use $\hat{p}$ in place of $p$!

$$\hat{p} \pm z^* \sqrt{\frac{p(1-p)}{n}} \quad\text{becomes}\quad \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

This works when there are at least 15 successes and 15 failures in the sample.

◮ Why $z^*$ and not $t^*$?

◮ Why is this the “first solution”?
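The large-sample interval can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `large_sample_ci` and the data (40 successes in 100 trials) are assumptions, not taken from the slides. The check inside the function mirrors the "at least 15 successes and 15 failures" condition.

```python
import math

def large_sample_ci(successes, n, z_star=1.96):
    """Large-sample interval: p_hat +/- z* * sqrt(p_hat * (1 - p_hat) / n)."""
    failures = n - successes
    # Condition from the slide: at least 15 successes and 15 failures.
    if successes < 15 or failures < 15:
        raise ValueError("need at least 15 successes and 15 failures")
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard deviation of p_hat
    return p_hat - z_star * se, p_hat + z_star * se

# Hypothetical data (not from the slides): 40 successes in 100 trials.
low, high = large_sample_ci(40, 100)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Calling `large_sample_ci(6, 42)` with the costume data raises an error, because 6 successes fall short of the condition.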

SLIDE 5

Why not t∗?

Previously, when we used the sample mean $\bar{x}$ to estimate the population mean $\mu$, $\sigma$ was involved in the description of the spread of the distribution of $\bar{x}$. But since we also estimated $\sigma$, we were forced to use a $t$ distribution.

Now, we are using the sample proportion $\hat{p}$ to estimate the population proportion $p$. The standard deviation of $\hat{p}$ is $\sqrt{p(1-p)/n}$, which depends only on $p$.

So there is really only one parameter, $p$, describing the distribution of $\hat{p}$. Thus, a $t$ distribution isn’t needed.

SLIDE 6

Why is this only a “first solution”?

This approach works for large samples: at least 15 successes and 15 failures means the sample must contain at least 30 observations.

For smaller samples, however, the estimate $\sqrt{\hat{p}(1-\hat{p})/n}$ of the standard deviation is not a very good one. In particular, using $\sqrt{\hat{p}(1-\hat{p})/n}$ gives confidence intervals which are too narrow, so they capture $p$ less often than the stated confidence level.
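To make the small-sample problem concrete, the sketch below computes the exact chance that the large-sample 95% interval contains the true $p$, by summing binomial probabilities over every possible count of successes. The function name `wald_coverage` and the values $n = 20$, $p = 0.1$ are illustrative assumptions, not from the slides.

```python
import math

def wald_coverage(n, p, z_star=1.96):
    """Exact probability that p_hat +/- z* * SE contains the true p."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - z_star * se <= p <= p_hat + z_star * se:
            # Binomial probability of observing exactly k successes.
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

# Illustrative values (assumed): a small sample with a small true proportion.
coverage = wald_coverage(20, 0.1)
print(f"actual coverage of the nominal 95% interval: {coverage:.3f}")
```

For these values the actual coverage comes out well below 95%. Much of the shortfall comes from samples with zero successes, where the estimated standard deviation is 0 and the interval collapses to a single point.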

SLIDE 7

Confidence intervals for p, take 2

A simple modification to calculating $\hat{p}$ which almost always produces better estimates is the so-called “plus four” method.

Before calculating $\hat{p}$, we first add four imaginary observations, two of which are successes and two of which are failures.

$$\hat{p} = \frac{\text{number of successes}}{n} \quad\Rightarrow\quad \tilde{p} = \frac{\text{number of successes} + 2}{n + 4}$$

We then use the resulting statistic $\tilde{p}$ for our estimates. A few conditions:

◮ We need $n \ge 10$, and

◮ We should only work with confidence levels of at least 90%.

SLIDE 8

An example

Find a 95% confidence interval for the proportion $p$ of faculty members that wore costumes for Halloween. In a sample of size $n = 42$ there were 6 faculty that wore costumes.

$$\tilde{p} = \frac{6 + 2}{42 + 4} = 0.174$$

and

$$SE(\tilde{p}) = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}} = \sqrt{\frac{0.144}{42+4}} = 0.056.$$

This yields the interval

$$\tilde{p} \pm z^* SE(\tilde{p}) = 0.174 \pm (1.96)(0.056) = [0.064, 0.284].$$

Note that we used $n + 4$ when calculating the standard error!
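The plus four calculation for the costume data can be reproduced with a short Python sketch. The function name `plus_four_ci` is an assumption made for illustration, and the function does not check the conditions listed on the previous slide. Because it carries unrounded values throughout, its endpoints can differ in the third decimal place from a hand calculation that rounds intermediate steps.

```python
import math

def plus_four_ci(successes, n, z_star=1.96):
    """Plus four interval: add 2 successes and 2 failures before estimating."""
    p_tilde = (successes + 2) / (n + 4)
    # Note the n + 4 in the standard error, as the slide emphasizes.
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde, se, (p_tilde - z_star * se, p_tilde + z_star * se)

# Costume example from the slides: 6 successes in a sample of 42.
p_tilde, se, (low, high) = plus_four_ci(6, 42)
print(f"p_tilde = {p_tilde:.3f}, SE = {se:.3f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```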

SLIDE 9

Estimating the needed sample size

When calculating a confidence interval for $p$ from a large sample, we used the interval

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

(We are not taking the “plus four” estimation into account here.)

Suppose we want a confidence interval with a certain margin of error $m$:

$$\hat{p} \pm m$$

How large a sample do we need? We want

$$m = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

SLIDE 10

Estimating the needed sample size

$$m = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Solve for $n$:

$$n = \left(\frac{z^*}{m}\right)^2 \hat{p}(1-\hat{p})$$

Problem: $\hat{p}$ is found after taking a sample. Yet, we need to find $n$ before taking the sample.

We need a number to use in place of $\hat{p}$. There are a few options.

◮ Guess some value $p^*$ which we think will be close to $\hat{p}$ (perhaps using some prior knowledge of the population).

◮ Use $p^* = 0.5$ as our guess. (Why 0.5? It maximizes the needed sample size over all possible $p$ values.)

SLIDE 11

Why does p∗ = 0.5 maximize the sample size?

Fixing z∗ = 1.96 and m = 0.1 we consider what happens to the sample size as our guess p∗ varies between 0 and 1.

[Figure: the needed sample size $(z^*/m)^2\, p^*(1-p^*)$ plotted against $p^*$ for $0 \le p^* \le 1$]

This is the graph of $n = (z^*/m)^2\, p^*(1-p^*)$. The graph is always a parabola with zeros $p^* = 0$ and $p^* = 1$. The maximum is always at $p^* = 0.5$.
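One way to confirm the location of the maximum without a graph is to complete the square. This short derivation is an added illustration that uses only the sample size formula:

$$p^*(1-p^*) = p^* - (p^*)^2 = \frac{1}{4} - \left(p^* - \frac{1}{2}\right)^2 \le \frac{1}{4},$$

with equality exactly when $p^* = \frac{1}{2}$. Hence $n = (z^*/m)^2\, p^*(1-p^*) \le \frac{1}{4}(z^*/m)^2$ for every guess $p^*$, so planning with $p^* = 0.5$ never gives a sample that is too small.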

SLIDE 12

An Example 1

We want to survey residents in the South Bend area to see how many are aware of the dangers of dihydrogen monoxide. How large of a sample would we need to get a 95% confidence interval with a 2% margin of error? For a 95% CI we use z∗ = 1.96. Asking for a 2% margin of error is the same as asking for m = 0.02. We have no idea what the true proportion will be so we take p∗ = 0.5.

SLIDE 13

An Example 2

Calculating using the formula from before:

$$n = \left(\frac{z^*}{m}\right)^2 p^*(1-p^*) = \left(\frac{1.96}{0.02}\right)^2 (0.5)(0.5) = 2401$$

We would need a sample of size at least 2401 to get such a CI.
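The same sample size calculation can be sketched in Python. The function name `needed_sample_size` and its parameter names are assumptions for illustration. The result is rounded up, since a sample cannot contain a fraction of a person.

```python
import math

def needed_sample_size(margin, z_star=1.96, p_guess=0.5):
    """Smallest whole n with z* * sqrt(p_guess * (1 - p_guess) / n) <= margin."""
    n = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    # Strip floating-point noise first so an exact answer is not bumped up by one.
    return math.ceil(round(n, 6))

# The survey example: 95% confidence, 2% margin of error, no prior guess.
print(needed_sample_size(0.02))
# A guess away from 0.5 (hypothetical) asks for fewer people.
print(needed_sample_size(0.02, p_guess=0.25))
```

Defaulting `p_guess` to 0.5 is the cautious choice: any other guess gives a smaller $n$, which may be too small if the guess turns out to be wrong.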

SLIDE 14

Hypotheses about Proportions

Continuing the dihydrogen monoxide study, we think that about 25% of the South Bend population knows about the dangers. We can formulate this thought as a hypothesis test:

$$H_0 : p = 0.25 \qquad H_a : p \ne 0.25$$

In general:

$$H_0 : p = p_0 \qquad H_a : p \ne p_0$$

SLIDE 15

The Test Statistic for a Proportion Hypothesis Test

Remember, when we do a hypothesis test we calculate the test statistic under the assumption that $H_0$ is true.

If $H_0$ is true, then $p = p_0$ and so $\hat{p}$ has mean $p_0$ and standard deviation $\sqrt{p_0(1-p_0)/n}$.

So, from $\hat{p}$ we calculate the test statistic as

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

where $z$ has approximately a standard Normal distribution.

SLIDE 16

Conditions on proportion hypothesis tests

Some conditions need to be met to test the hypothesis $H_0 : p = p_0$.

◮ This test requires a sample size $n$ large enough that both $np_0 \ge 10$ and $n(1-p_0) \ge 10$ (i.e. the expected numbers of successes and failures are both at least 10).

◮ Fortunately, these expected counts only depend on $p_0$, the proportion we are performing the test against, not on the actual counts which show up in the sample.

◮ The “plus four” technique only applies to confidence intervals. We do not need to use it for the hypothesis test since knowing $p_0$ gives us the standard deviation of $\hat{p}$.
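The sample size condition is easy to check before collecting any data, since it uses only $n$ and $p_0$. A minimal sketch follows; the function name `z_test_conditions_met` and the example values are illustrative assumptions.

```python
def z_test_conditions_met(n, p0, minimum=10):
    """Check that the expected successes and failures under H0 are both >= minimum."""
    expected_successes = n * p0
    expected_failures = n * (1 - p0)
    return expected_successes >= minimum and expected_failures >= minimum

# Hypothetical planning values: testing against p0 = 0.25.
print(z_test_conditions_met(40, 0.25))  # expected counts are 10 and 30
print(z_test_conditions_met(30, 0.25))  # expected successes are only 7.5
```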

SLIDE 17

DHMO Example Test, 1

In our survey we find 29 people are aware of DHMO in a sample of size $n = 183$. Let’s perform a large-sample hypothesis test at the $\alpha = 0.05$ significance level.

$$H_0 : p = 0.25 \qquad H_a : p \ne 0.25$$

Do the conditions to use the test apply? $np_0 = 183 \times 0.25 = 45.75 \ge 10$ and $n(1-p_0) = 137.25 \ge 10$. So, yes they do.

SLIDE 18

DHMO Example Test, 2

Finishing the calculation, we have $\hat{p} = 29/183 = 0.158$ and hence

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.158 - 0.25}{\sqrt{(0.25)(0.75)/183}} = -2.87.$$

Using the table for a two-tailed test we get a p-value between 0.002 and 0.005. Since this is less than $\alpha = 0.05$, we reject $H_0$.
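The DHMO test can also be run in Python instead of with a table. This is a sketch under stated assumptions: the function name `one_proportion_z_test` is made up for illustration, and the two-sided p-value is computed from the standard Normal tail using `math.erfc`. The code keeps $\hat{p}$ unrounded, so it gives $z \approx -2.86$ rather than the $-2.87$ obtained by hand with $\hat{p}$ rounded to 0.158.

```python
import math

def one_proportion_z_test(successes, n, p0):
    """Two-sided large-sample z test of H0: p = p0."""
    p_hat = successes / n
    # Standard deviation of p_hat computed under H0, so it uses p0, not p_hat.
    sd0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd0
    # P(|Z| >= |z|) for a standard Normal Z equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# DHMO survey from the slides: 29 aware out of 183, tested against p0 = 0.25.
z, p_value = one_proportion_z_test(29, 183, 0.25)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

The computed p-value falls in the same range the table gives, so the conclusion at $\alpha = 0.05$ is unchanged.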

SLIDE 19

DHMO confidence interval, 1

Using our DHMO survey data, n = 183 with 29 successes, find a 95% confidence interval for p. We will use the “plus four” technique. Are the conditions met?

◮ The confidence level is at least 90%.

◮ $n = 183 \ge 10$.

Yes, the conditions are met.

SLIDE 20

DHMO confidence interval, 2

Proceeding, we calculate $z^* = 1.96$ and

$$\tilde{p} = \frac{29 + 2}{183 + 4} = 0.166.$$

Plugging values into the formula

$$\tilde{p} \pm z^* \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$$

yields $0.166 \pm 0.053$, which gives the interval $[0.112, 0.219]$.