Inference for Proportions Marc H. Mehlman marcmehlman@yahoo.com - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Inference for Proportions

Marc H. Mehlman marcmehlman@yahoo.com

University of New Haven

Based on Rare Event Rule: “rare events happen – but not to me”.

Marc Mehlman (University of New Haven) Inference for Proportions 1 / 20

slide-2
SLIDE 2

Table of Contents

1. Inference for a Single Proportion
2. Comparing Two Proportions

slide-3
SLIDE 3

Inference for a Single Proportion


slide-4
SLIDE 4

Inference for a Single Proportion

Let $X_1, \dots, X_n$ be a random sample from $\mathrm{BIN}(1, p)$. Then $X = \sum_{j=1}^{n} X_j \sim \mathrm{BIN}(n, p)$.

Definition

The sample proportion is

$$\hat{p} \overset{\text{def}}{=} \frac{X}{n} = \bar{X}.$$

The standard error of $\hat{p}$ is

$$SE_{\hat{p}} \overset{\text{def}}{=} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

By the CLT, $\bar{X}$ is approximately $N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$ for big $n$ (the second argument is the standard deviation), and also $\hat{p}$ is approximately $p$ for big $n$. Thus for big $n$, $\bar{X}$ is approximately $N\left(\hat{p}, \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$.

Theorem (Large Sample Confidence Interval for p)

$$\text{margin of error} = m = z^\star \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = z^\star \, SE_{\hat{p}}$$

and the confidence interval is $\hat{p} \pm m$. Use this interval for confidence 90% or more and when the numbers of successes and failures are both at least 15.
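The large-sample interval can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `large_sample_ci` and the data in the usage lines are hypothetical, and the critical value $z^\star$ is passed in by hand from a normal table.

```python
import math


def large_sample_ci(successes, n, z_star):
    """Large-sample confidence interval p_hat +/- z* * SE for a proportion.

    Hypothetical helper: assumes at least 15 successes and 15 failures.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    margin = z_star * se                     # margin of error m
    return p_hat - margin, p_hat + margin


# Made-up data: 60 successes in 200 trials, 95% confidence (z* = 1.960).
low, high = large_sample_ci(60, 200, 1.960)
print(f"95% CI for p: ({low:.3f}, {high:.3f})")
```

The function returns the two endpoints rather than the pair (estimate, margin) so that the result can be read directly as an interval.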

slide-5
SLIDE 5

Inference for a Single Proportion

We compute a 90% confidence interval for the population proportion of arthritis patients who suffer some "adverse symptoms."

What is the sample proportion $\hat{p}$?

$$\hat{p} = \frac{23}{440} \approx 0.052$$

For a 90% confidence level, z* = 1.645. Using the large sample method:

$$m = z^\star \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 1.645 \sqrt{\frac{0.052(1-0.052)}{440}} = 1.645 \times 0.0106 \approx 0.017$$

$$90\%\ \text{CI for } p: \quad \hat{p} \pm m = 0.052 \pm 0.017$$

With 90% confidence level, between 3.5% and 6.9% of arthritis patients taking this pain medication experience some adverse symptoms.

Confidence level C   0.50   0.60   0.70   0.80   0.90   0.95   0.96
z*                   0.674  0.841  1.036  1.282  1.645  1.960  2.054
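The z* values in the table are quantiles of the standard Normal distribution: for confidence level C, z* cuts off area (1 + C)/2 to its left. A minimal sketch using Python's standard library follows; the helper name `z_star` is an assumption made for this illustration.

```python
from statistics import NormalDist


def z_star(confidence):
    """Critical value z* leaving central area `confidence` under N(0, 1)."""
    return NormalDist().inv_cdf((1 + confidence) / 2)


for c in (0.90, 0.95, 0.99):
    print(c, round(z_star(c), 3))
```

Rounded to three decimals, these calls reproduce the tabulated values 1.645, 1.960 and 2.576.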

slide-6
SLIDE 6

Inference for a Single Proportion

“Plus four” confidence interval for p

The “plus four” method gives reasonably accurate confidence intervals. We act as if we had four additional observations, two successes and two failures. Thus, the new sample size is n + 4 and the count of successes is X + 2.

The “plus four” estimate of p is:

$$\tilde{p} = \frac{\text{count of successes} + 2}{\text{count of all observations} + 4}$$

An approximate level C confidence interval is:

$$CI: \ \tilde{p} \pm m, \quad \text{with } m = z^\star \, SE = z^\star \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$$

Use this method when C is at least 90% and sample size is at least 10.

slide-7
SLIDE 7

Inference for a Single Proportion

We want a 90% CI for the population proportion of arthritis patients who suffer some “adverse symptoms.”

What is the value of the “plus four” estimate of p?

$$\tilde{p} = \frac{23 + 2}{440 + 4} = \frac{25}{444} \approx 0.056$$

An approximate 90% confidence interval for p using the “plus four” method is:

$$m = z^\star \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}} = 1.645 \sqrt{\frac{0.056(1-0.056)}{444}} = 1.645 \times 0.011 \approx 0.018$$

$$90\%\ \text{CI for } p: \quad \tilde{p} \pm m = 0.056 \pm 0.018$$

With 90% confidence, between 3.8% and 7.4% of the population of arthritis patients taking this pain medication experience some adverse symptoms.

Confidence level C   0.50   0.60   0.70   0.80   0.90   0.95   0.96   0.98   0.99   0.995  0.998  0.999
z*                   0.674  0.841  1.036  1.282  1.645  1.960  2.054  2.326  2.576  2.807  3.091  3.291
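As a rough check of the plus-four arithmetic, the calculation can be sketched in Python. The function name `plus_four_ci` is hypothetical; the data (23 of 440 patients, z* = 1.645) are the ones used in the example.

```python
import math


def plus_four_ci(successes, n, z_star):
    """Plus-four interval: add 2 successes and 2 failures, then p~ +/- z* * SE."""
    p_tilde = (successes + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    margin = z_star * se
    return p_tilde - margin, p_tilde + margin


# Arthritis example: 23 of 440 patients with adverse symptoms, 90% confidence.
low, high = plus_four_ci(23, 440, 1.645)
print(f"90% plus-four CI: ({low:.3f}, {high:.3f})")
```

Carrying full precision instead of the rounded 0.056 and 0.011 gives the same interval to three decimals, about 3.8% to 7.4%.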

slide-8
SLIDE 8

Inference for a Single Proportion

Theorem (Sample Size)

Given a desired margin of error, m, one should choose the following sample size, n, to obtain the confidence interval $\hat{p} \pm m$ (or $\tilde{p} \pm m$) of p:

$$n = \begin{cases} \left(\dfrac{z^\star}{m}\right)^2 p^\star(1 - p^\star) & \text{when } p^\star \text{ is an educated guess of what } p \text{ is} \\[2ex] \left(\dfrac{z^\star}{2m}\right)^2 & \text{with no educated guess of } p. \end{cases}$$

Note:

1. Round up n to ensure it is a positive integer.
2. The closer one’s educated guess, $p^\star$, of p is to 1/2, the safer one is.
3. $n = \dfrac{(z^\star)^2}{4m^2}$ (i.e., when $p^\star = 1/2$) is the most conservative estimate of n.
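The sample-size rule lends itself to a short Python sketch. The function name `required_sample_size` and its default argument are assumptions made for illustration; leaving `p_guess` at 0.5 gives the conservative case with no educated guess, since $p^\star(1-p^\star) = 1/4$ there.

```python
import math


def required_sample_size(margin, z_star, p_guess=0.5):
    """Smallest integer n with z* * sqrt(p(1 - p) / n) <= margin for guessed p.

    p_guess=0.5 is the conservative choice, equivalent to (z* / (2m))^2.
    """
    n = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    return math.ceil(n)  # always round up, never to the nearest integer


# Margin of error 0.03 at 95% confidence (z* = 1.960), no guess for p.
print(required_sample_size(0.03, 1.960))  # 1068
```

Rounding up rather than to the nearest integer matters: 1067 observations would leave the margin of error slightly above 0.03.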


slide-12
SLIDE 12

Inference for a Single Proportion

What sample size would we need in order to achieve a margin of error no more than 0.01 (1 percentage point) with a 90% confidence level?

We could use 0.5 for our guessed p*. However, since the drug has been approved for sale over the counter, we can safely assume that no more than 10% of patients should suffer “adverse symptoms” (a better guess than 50%). For a 90% confidence level, z* = 1.645.

$$n = \left(\frac{z^\star}{m}\right)^2 p^\star(1 - p^\star) = \left(\frac{1.645}{0.01}\right)^2 (0.1)(0.9) \approx 2435.4$$

To obtain a margin of error no more than 0.01 we need a sample size n of at least 2436 arthritis patients.

Confidence level C   0.50   0.60   0.70   0.80   0.90   0.95   0.96
z*                   0.674  0.841  1.036  1.282  1.645  1.960  2.054

slide-13
SLIDE 13

Inference for a Single Proportion

Theorem (Large Sample z–Test for a Population Proportion)

Let $X_1, \dots, X_n$ be a random sample where $X_j \sim \mathrm{BIN}(1, p)$ and such that $np \geq 10$ and $n(1 - p) \geq 10$. Let $H_0 : p = p_0$ where p is unknown. Then

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}} \sim N(0, 1)$$

is a test statistic for $H_0$.
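A minimal Python sketch of this test follows. The function name `one_proportion_z_test`, the `alternative` argument and the data in the usage lines are assumptions made for illustration; the P-value comes from the standard Normal CDF in the standard library.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alternative="greater"):
    """Large-sample z test of H0: p = p0; returns (z, P-value)."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard deviation under H0 (uses p0)
    z = (p_hat - p0) / se0
    cdf = NormalDist().cdf(z)
    if alternative == "greater":        # Ha: p > p0
        p_value = 1 - cdf
    elif alternative == "less":         # Ha: p < p0
        p_value = cdf
    elif alternative == "two-sided":    # Ha: p != p0
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value


# Made-up data: 62 successes in 100 trials, testing H0: p = 0.5 (two-sided).
z, p_value = one_proportion_z_test(62, 100, 0.5, alternative="two-sided")
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

Note that the denominator uses $p_0$ rather than $\hat{p}$, because the test statistic is computed assuming $H_0$ is true.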

slide-14
SLIDE 14

Inference for a Single Proportion

Example

A potato-chip producer has just received a truckload of potatoes from its main supplier. If the producer determines that more than 8% of the potatoes in the shipment have blemishes, the truck will be sent away to get another load from the supplier. A supervisor selects a random sample of 500 potatoes from the truck. An inspection reveals that 47 of the potatoes have blemishes. Carry out a significance test at the α = 0.10 significance level. What should the producer conclude?

We want to perform a test at the α = 0.10 significance level of

H0: p = 0.08
Ha: p > 0.08

where p is the actual proportion of potatoes in this shipment with blemishes.

If conditions are met, we should do a one-sample z test for the population proportion p.

Random: The supervisor took a random sample of 500 potatoes from the shipment.

Normal: Assuming H0: p = 0.08 is true, the expected numbers of blemished and unblemished potatoes are np0 = 500(0.08) = 40 and n(1 – p0) = 500(0.92) = 460, respectively. Because both of these values are at least 10, we should be safe doing Normal calculations.

slide-15
SLIDE 15

Inference for a Single Proportion

Example

The sample proportion of blemished potatoes is $\hat{p} = 47/500 = 0.094$.

Test statistic

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}} = \frac{0.094 - 0.08}{\sqrt{\dfrac{0.08(0.92)}{500}}} = 1.15$$

P-value

The desired P-value is: P(z ≥ 1.15) = 1 – 0.8749 = 0.1251

Since our P-value, 0.1251, is greater than the chosen significance level of α = 0.10, we fail to reject H0. There is not sufficient evidence to conclude that the shipment contains more than 8% blemished potatoes. The producer will use this truckload of potatoes to make potato chips.

slide-16
SLIDE 16

Comparing Two Proportions


slide-17
SLIDE 17

Comparing Two Proportions

Comparing 2 independent samples

We often need to compare 2 treatments with 2 independent samples. For large enough samples, the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ is approximately Normal.

However, neither $p_1$ nor $p_2$ is known.

slide-18
SLIDE 18

Comparing Two Proportions

Given two random samples, $X_1, \dots, X_{n_X}$ and $Y_1, \dots, Y_{n_Y}$, where $X_i \sim \mathrm{BIN}(1, p_X)$ and $Y_j \sim \mathrm{BIN}(1, p_Y)$, define $D \overset{\text{def}}{=} \hat{p}_X - \hat{p}_Y$. Notice that

1. $D$ is approximately normal for large $n_X$ and $n_Y$.
2. $\mu_D = \mu_{\hat{p}_X} - \mu_{\hat{p}_Y} = p_X - p_Y$.
3. $\sigma_D^2 = \sigma_{\hat{p}_X}^2 + \sigma_{\hat{p}_Y}^2 = \dfrac{p_X(1-p_X)}{n_X} + \dfrac{p_Y(1-p_Y)}{n_Y}$.

Definition

One can approximate

$$\sigma_D = \sqrt{\frac{p_X(1-p_X)}{n_X} + \frac{p_Y(1-p_Y)}{n_Y}}$$

with the standard error of D,

$$SE_D \overset{\text{def}}{=} \sqrt{\frac{\hat{p}_X(1-\hat{p}_X)}{n_X} + \frac{\hat{p}_Y(1-\hat{p}_Y)}{n_Y}}.$$

slide-19
SLIDE 19

Comparing Two Proportions

Thus for large $n_X$ and $n_Y$, $D$ is approximately $N(p_X - p_Y, SE_D)$. This gives

Theorem (Large–Sample CI for Difference Between Two Proportions)

A $(1 - \alpha)100\%$ CI for $p_X - p_Y$ is

$$(\hat{p}_X - \hat{p}_Y) \pm z^\star \cdot SE_D.$$

Warning

Use this method only when the numbers of heads and tails (successes and failures) are each at least 10 in each sample.

A plus four confidence interval for $p_X - p_Y$ is obtained by using the above procedure, but first adding one head and one tail to each of the random samples (increasing each sample size by 2). Use it when the confidence level is 90% or higher and each sample size is at least 5.
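Both intervals for a difference of proportions can be sketched with one Python function. The name `two_proportion_ci`, the `plus_four` flag and the data in the usage lines are hypothetical choices made for this illustration.

```python
import math


def two_proportion_ci(x_successes, n_x, y_successes, n_y, z_star, plus_four=False):
    """CI for pX - pY: (pX_hat - pY_hat) +/- z* * SE_D.

    With plus_four=True, one success and one failure are first added to
    each sample, so each sample size grows by 2.
    """
    if plus_four:
        x_successes, n_x = x_successes + 1, n_x + 2
        y_successes, n_y = y_successes + 1, n_y + 2
    p_x = x_successes / n_x
    p_y = y_successes / n_y
    se_d = math.sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)
    diff = p_x - p_y
    margin = z_star * se_d
    return diff - margin, diff + margin


# Made-up data: 45 of 100 successes versus 30 of 100, 95% confidence.
print(two_proportion_ci(45, 100, 30, 100, 1.960))
print(two_proportion_ci(45, 100, 30, 100, 1.960, plus_four=True))
```

An interval that contains 0 means the data are compatible with no difference between the two population proportions at that confidence level.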

slide-20
SLIDE 20

Comparing Two Proportions

Example

Lyme disease is spread by infected ticks. Ticks feed mainly on mice. Mice feed on acorns. An experiment compared two similar forest areas in a year with low acorn amounts. One area was supplied large amounts of acorns, and the other was left untouched. The next spring mice populations were compared:

                          trapped mice   breeding mice
Area 1: high in acorns         72             54
Area 2: low in acorns          17             10

Find a large–sample 95% confidence interval for the difference in proportion of breeders in high acorn and low acorn areas. Also find the plus–four 95% confidence interval.

Solution for Large–Sample 95% confidence interval:

$$(\hat{p}_X - \hat{p}_Y) \pm z^\star \cdot SE_D = \frac{54}{72} - \frac{10}{17} \pm 1.96 \sqrt{\frac{\frac{54}{72}\left(1 - \frac{54}{72}\right)}{72} + \frac{\frac{10}{17}\left(1 - \frac{10}{17}\right)}{17}} = 0.1617647 \pm 0.2544338.$$

Thus the answer is (−0.09, 0.42) (don’t imply more accuracy than there is).

Solution for Plus Four 95% confidence interval (one breeder and one non-breeder added to each sample):

$$(\hat{p}_X - \hat{p}_Y) \pm z^\star \cdot SE_D = \frac{55}{74} - \frac{11}{19} \pm 1.96 \sqrt{\frac{\frac{55}{74}\left(1 - \frac{55}{74}\right)}{74} + \frac{\frac{11}{19}\left(1 - \frac{11}{19}\right)}{19}} = 0.1642959 \pm 0.2432937.$$

Thus the answer is (−0.08, 0.41).

slide-21
SLIDE 21

Comparing Two Proportions

Theorem (Difference Between Two Proportions)

Let $X_1, \dots, X_{n_X}$ and $Y_1, \dots, Y_{n_Y}$ be independent r.s. where $X_j \sim \mathrm{BIN}(1, p_X)$ and $Y_k \sim \mathrm{BIN}(1, p_Y)$. Let $H_0 : p_X = p_Y = p$ where p is unknown. Define the pooled estimate, $\hat{p}$, and the pooled standard error of $p_X$ and $p_Y$ to be

$$\hat{p} \overset{\text{def}}{=} \frac{n_X \hat{p}_X + n_Y \hat{p}_Y}{n_X + n_Y}$$

and

$$SE_{D_p} \overset{\text{def}}{=} \sqrt{\frac{\hat{p}(1-\hat{p})}{n_X} + \frac{\hat{p}(1-\hat{p})}{n_Y}} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_X} + \frac{1}{n_Y}\right)}$$

and the test statistic be

$$z = \frac{\hat{p}_X - \hat{p}_Y}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}} = \frac{\hat{p}_X - \hat{p}_Y}{SE_{D_p}} \sim N(0, 1) \text{ for } H_0.$$

Warning

Use this method only when the number of heads and tails in each sample is at least 5.
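The pooled test can be sketched in Python as follows. The function name `two_proportion_z_test` and the usage data are assumptions made for illustration, and the returned P-value is for the one-sided alternative $H_a : p_X > p_Y$ only.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x_successes, n_x, y_successes, n_y):
    """Pooled z test of H0: pX = pY against Ha: pX > pY; returns (z, P-value)."""
    p_x = x_successes / n_x
    p_y = y_successes / n_y
    # Pooled estimate: (nX * pX_hat + nY * pY_hat) / (nX + nY), i.e. all
    # successes divided by all observations.
    p_pooled = (x_successes + y_successes) / (n_x + n_y)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_x + 1 / n_y))
    z = (p_x - p_y) / se_pooled
    p_value = 1 - NormalDist().cdf(z)  # upper tail, for Ha: pX > pY
    return z, p_value


# Made-up data: 40 of 100 successes in sample X versus 25 of 100 in sample Y.
z, p_value = two_proportion_z_test(40, 100, 25, 100)
print(f"z = {z:.2f}, one-sided P-value = {p_value:.4f}")
```

Pooling is appropriate here because the standard error is computed under $H_0$, where both samples share a single unknown p.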

slide-22
SLIDE 22

Comparing Two Proportions

Example

Gastric Freezing

Gastric freezing was once a treatment for ulcers. Patients would swallow a deflated balloon with tubes to cool the stomach for an hour in hope of reducing acid production and relieving ulcer pain. The treatment was shown to be safe and to significantly reduce ulcer pain, and it was widely used for years.

A randomized comparative experiment later compared the outcome of gastric freezing with that of a placebo: 28 of the 82 patients subjected to gastric freezing improved, while 30 of the 78 in the control group improved.

H0: $p_{gf} = p_{placebo}$
Ha: $p_{gf} > p_{placebo}$

slide-23
SLIDE 23

Comparing Two Proportions

Example (cont.)

Results:
28 of the 82 patients subjected to gastric freezing improved
30 of the 78 patients in the control group improved

H0: $p_{gf} = p_{placebo}$
Ha: $p_{gf} > p_{placebo}$

$$\hat{p}_{pooled} = \frac{28 + 30}{82 + 78} = 0.3625$$

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.3415 - 0.3846}{\sqrt{0.3625 \times 0.6375 \left(\dfrac{1}{82} + \dfrac{1}{78}\right)}} = \frac{-0.043}{0.076} \approx -0.57$$

[Figure: sampling distribution of $\hat{p}_{gf} - \hat{p}_{plac}$ under H0, horizontal axis from −0.3 to 0.3.]

The P-value is greater than 50%. Gastric freezing was not significantly better than a placebo (P-value > 0.1), and this treatment was abandoned.

ALWAYS USE A CONTROL!!!