[PPT] - Lecture 4: Hypothesis Testing Ani Manichaikul amanicha@jhsph.edu PowerPoint Presentation

SLIDE 1

Lecture 4: Hypothesis Testing

Ani Manichaikul amanicha@jhsph.edu 20 April 2007

1 / 69

SLIDE 2

Steps of Hypothesis Testing

Define the null hypothesis, H0
Define the alternative hypothesis, Ha, where Ha is usually of the form “not H0”
Define the type I error, α, usually 0.05
Calculate the test statistic
Calculate the p-value
If the p-value is less than α, reject H0
Otherwise, fail to reject H0

2 / 69

SLIDE 3

Hypothesis Testing

We will first discuss hypothesis testing as it applies to means of distributions for continuous variables

We will then discuss discrete data (specifically dichotomous variables)

3 / 69

SLIDE 4

Hypothesis test for a single mean I

Assume a population of normally distributed birth weights with a known standard deviation, σ = 1000 grams
Birth weights are obtained on a sample of 10 infants; the sample mean is calculated as 2500 grams
Question: Is the mean birth weight in this population different from 3000 grams?
Set up a two-sided test of H0 : µ = 3000 vs. Ha : µ ≠ 3000
Let α = 0.05 denote a 5% significance level

4 / 69

SLIDE 5

Hypothesis test for a single mean II

Calculate the test statistic:

$z_{obs} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{2500 - 3000}{1000/\sqrt{10}} = -1.58$

What does this mean? Our observed mean is 1.58 standard errors below the hypothesized mean
The test statistic is the standardized value of our data assuming the null hypothesis is true!
Question: If the true mean is 3000 grams, is our observed sample mean of 2500 “common” or is this value unlikely to occur?

5 / 69

SLIDE 6

Hypothesis test for a single mean III

Calculate the p-value:

$\text{p-value} = P(Z < -|z_{obs}|) + P(Z > |z_{obs}|) = 2 \times 0.057 = 0.114$

If the true mean is 3000 grams, our data or data more extreme than ours would occur in 11 out of 100 studies (of the same size, n=10)
In 11 out of 100 studies, just by chance we are likely to observe a sample mean of 2500 or more extreme if the true mean is 3000 grams
What does this say about our hypothesis?
General guideline: if p-value < α, then reject H0
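The single-mean calculation on slides 5 and 6 can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `two_sided_z_test` is an assumption of this sketch, not something from the lecture.

```python
import math

def two_sided_z_test(sample_mean, mu0, sigma, n):
    """Return (z_obs, p_value) for a two-sided one-sample z-test with known sigma."""
    standard_error = sigma / math.sqrt(n)
    z_obs = (sample_mean - mu0) / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z_obs) / math.sqrt(2))
    return z_obs, p_value

# Birth weight example: n = 10, sample mean 2500 g, sigma = 1000 g, H0: mu = 3000 g.
z_obs, p_value = two_sided_z_test(sample_mean=2500, mu0=3000, sigma=1000, n=10)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z_obs = {z_obs:.2f}, p-value = {p_value:.3f}, decision: {decision}")
```

Because the p-value is larger than α = 0.05, the guideline leads to "fail to reject H0" for this sample.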

6 / 69

SLIDE 7

Hypothesis test for a single mean IV

Could also use the “critical region” or “rejection region” approach
Based on our significance level (α = 0.05) and assuming H0 is true, how “far” does our sample mean have to be from H0 : µ = 3000 in order to reject?
Critical value = $z_c$ where $2 \times P(Z > |z_c|) = 0.05$
In our example, $z_c = 1.96$
The rejection region is any value of our test statistic that is less than -1.96 or greater than 1.96
Decision should be the same whether using the p-value or critical / rejection region

7 / 69

SLIDE 8

Hypothesis test for a single mean V

An alternative approach for the two-sided hypothesis test is to calculate a 100(1-α)% confidence interval for the mean

$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{10}} \rightarrow 2500 \pm 1.96 \cdot \frac{1000}{\sqrt{10}}$

We are 95% confident that the interval (1880, 3120) contains the true population mean µ
The hypothesized mean 3000 is a plausible value of the true mean given our data
We cannot say that the true mean is different from 3000
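As a rough check of the interval on this slide, the sketch below computes the critical value and the confidence interval with Python's standard library. The helper name `z_confidence_interval` is hypothetical and chosen only for this example.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when sigma is known."""
    alpha = 1 - confidence
    # Critical value z_{alpha/2}; about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

lower, upper = z_confidence_interval(sample_mean=2500, sigma=1000, n=10)

# The two-sided test fails to reject H0 exactly when mu0 lies inside the interval.
contains_null = lower <= 3000 <= upper
print(f"95% CI: ({lower:.0f}, {upper:.0f}); contains 3000: {contains_null}")
```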

8 / 69

SLIDE 9

P-values

Definition: The p-value for a hypothesis test is the probability, computed assuming H0 is true, of obtaining a value of the test statistic as or more extreme than the observed test statistic
The rejection region is determined by α, the desired level of significance, or probability of committing a type I error
Reporting the p-value associated with a test gives an indication of how common or rare the computed value of the test statistic is, given that H0 is true
We often use zobs to denote the computed value of the test statistic

9 / 69

SLIDE 10

Determining the correct test statistic

Depends on your assumptions on σ
When σ is known, we have a standard normal test statistic
When σ is unknown and our sample size is relatively small, the test statistic has a t-distribution
The only change in the procedure is that the calculation of the p-value or rejection region uses a t- instead of a normal distribution

10 / 69

SLIDE 11

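Hypothesis tests for one mean H0 : µ = µ0, Ha : µ ≠ µ0

Population Distribution | Sample Size | Population Variance | Test Statistic
Normal | Any | σ² known | $z_{obs} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Normal | Any | σ² unknown (uses s², df = n-1) | $t_{obs} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
Not Normal / Unknown | Large | σ² known | $z_{obs} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Not Normal / Unknown | Large | σ² unknown (uses s²) | $z_{obs} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
Not Normal / Unknown | Small | Any | Non-parametric methods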

11 / 69

SLIDE 12

Hypothesis tests for one proportion H0 : p = p0, Ha : p ≠ p0

Population Distribution | Sample Size | Test Statistic
Binomial | Large | $z_{obs} = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
Binomial | Small | Exact methods

12 / 69

SLIDE 13

Hypothesis tests for a difference of two means H0 : µ1 − µ2 = µ0, Ha : µ1 − µ2 ≠ µ0

Population Distribution | Sample Size | Population Variances | Test Statistic
Normal | Any | Known | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
Normal | Any | Unknown, assume σ1² = σ2², df = n1 + n2 − 2 | $t_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_p^2/n_1 + s_p^2/n_2}}$
Normal | Any | Unknown, assume σ1² ≠ σ2², df = ν | $t_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$

13 / 69

SLIDE 14

Example: Hypothesis test for two means (two independent samples) I

The EPREDA Trial: randomized, placebo-controlled trial to determine whether dipyridamole improves the efficacy of aspirin in preventing fetal growth retardation
Pregnant women randomized to placebo (n=73), aspirin or aspirin plus dipyridamole (n=156)
Mean birth weight was statistically significantly higher in the treated than in the placebo group

2751 (SD 670) grams vs. 2526 (SD 848) grams

14 / 69

SLIDE 15

Example: Hypothesis test for two means (two independent samples) II

Test the hypothesis: H0 : µplacebo = µtreated vs. Ha : µplacebo ≠ µtreated at the 5% significance level
The data are:

Treatment | n | mean | SD
Placebo | 73 | 2526 | 848
Treated | 156 | 2751 | 670

15 / 69

SLIDE 16

Example: Hypothesis test for two means (two independent samples) III

Calculate the test statistic:

$t_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_1^2/n_p + s_2^2/n_t}} = \frac{2526 - 2751}{\sqrt{848^2/73 + 670^2/156}} = -1.99$

The observed difference in mean birth weight comparing the placebo to treated groups is approximately 2 standard errors below the hypothesized difference of 0
Our sample size is pretty large, so the test statistic will behave like a standard normal variable

16 / 69

SLIDE 17

Example: Hypothesis test for two means (two independent samples) IV

What is the p-value in this example?

p-value= 0.047

What is your decision in this case?

Not straightforward
There may be a difference in birth weight comparing the two groups
Need to consider the practical implications

17 / 69

SLIDE 18

Example: Hypothesis test for two means (two independent samples) V

Can also give 95% confidence interval for the difference in the two means: (-446.13, -3.87)
Again, this is a plausible range of values for the true difference in birth weights comparing the placebo to treated groups
What is your null hypothesis? No difference!
Given this confidence interval, is “no difference” a plausible value? Almost?
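The EPREDA calculation can be reproduced from the summary statistics alone. The sketch below is a minimal illustration, assuming the large-sample normal approximation mentioned on slide 16 and the SDs from the data table (848 and 670); the function name `two_sample_z` is hypothetical.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2, mu0=0.0):
    """Unpooled two-sample test from summary statistics (normal approximation)."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    difference = mean1 - mean2
    statistic = (difference - mu0) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    # 95% confidence interval for the difference, using 1.96 as on the slides.
    ci = (difference - 1.96 * standard_error, difference + 1.96 * standard_error)
    return statistic, p_value, ci

# Placebo: n = 73, mean 2526, SD 848. Treated: n = 156, mean 2751, SD 670.
statistic, p_value, ci = two_sample_z(2526, 848, 73, 2751, 670, 156)
print(f"statistic = {statistic:.2f}, p-value = {p_value:.3f}")
print(f"95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The p-value computed from the unrounded statistic can differ slightly in the third decimal place from one computed after rounding the statistic to -1.99.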

18 / 69

SLIDE 19

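Hypothesis tests for a difference of two means H0 : µ1 − µ2 = µ0, Ha : µ1 − µ2 ≠ µ0

Population Distribution | Sample Size | Population Variances | Test Statistic
Not Normal / Unknown | Large | Known | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
Not Normal / Unknown | Large | Unknown, assume σ1² = σ2² | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_p^2/n_1 + s_p^2/n_2}}$
Not Normal / Unknown | Large | Unknown, assume σ1² ≠ σ2² | $z_{obs} = \frac{(\bar{X}_1 - \bar{X}_2) - \mu_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
Not Normal / Unknown | Small | Any | Nonparametric methods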

19 / 69

SLIDE 20

Additional Considerations: We’re not always right

Conclusion based on data (sample) | “Truth”: H0 true | “Truth”: H0 false
Reject H0 | Type I error | Correct
Fail to reject H0 | Correct | Type II error

20 / 69

SLIDE 21

Errors in hypothesis testing α

α = P(Type I error) = probability of rejecting a true null hypothesis = “level of significance”
Aim: to keep Type I error small by specifying a small rejection region
α is usually set before performing a test, typically at level α = 0.05

21 / 69

SLIDE 22

Errors in hypothesis testing β I

β = P(Type II error) = P(fail to reject H0 given H0 is false)
Power = 1 − β = probability of rejecting H0 when H0 is false
Aim: to keep Type II error small and achieve large power

22 / 69

SLIDE 23

Errors in hypothesis testing β II

β depends on sample size, α, and the specified alternative value
The value of β is usually unknown since the true mean (or other parameter) is generally unknown
Before data collection, scientists should decide
  the test they will perform
  the desired Type I error rate α
  the desired β, for a specified alternative value
After specifying this information, an appropriate sample size can be determined
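To make the link between α, β, the alternative value, and sample size concrete, here is a minimal sketch for the two-sided one-sample z-test with known σ. The alternative mean of 2500 grams and the target power of 0.80 are assumptions chosen for illustration (the slides do not specify them), and the function names are hypothetical. The sample size formula is the usual approximation that ignores the small rejection probability in the far tail.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_power(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-sided z-test when the true mean is mu_alt."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is normal with this mean and SD 1.
    shift = (mu_alt - mu0) / (sigma / math.sqrt(n))
    # Probability of landing in either tail of the rejection region.
    return STD_NORMAL.cdf(-z_crit - shift) + (1 - STD_NORMAL.cdf(z_crit - shift))

def sample_size_for_power(mu0, mu_alt, sigma, alpha=0.05, power=0.80):
    """Approximate n needed to reach the target power at mu_alt."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    n = ((z_alpha + z_beta) * sigma / abs(mu_alt - mu0)) ** 2
    return math.ceil(n)

# Hypothetical planning scenario based on the birth weight example.
power_n10 = z_test_power(mu0=3000, mu_alt=2500, sigma=1000, n=10)
n_needed = sample_size_for_power(mu0=3000, mu_alt=2500, sigma=1000)
print(f"power with n = 10: {power_n10:.2f}; n for 80% power: {n_needed}")
```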

23 / 69

SLIDE 24

Critical Regions I

24 / 69

SLIDE 25

Critical Regions II

25 / 69

SLIDE 26

Critical Regions III

26 / 69

SLIDE 27

Type II error

27 / 69

SLIDE 28

Dichotomous variables

Proportions
2 × 2 tables
Study Design
Hypothesis tests

28 / 69

SLIDE 29

Proportions and 2 × 2 tables

Population | Success | Failure | Total
Population 1 | x1 | n1 − x1 | n1
Population 2 | x2 | n2 − x2 | n2
Total | x1 + x2 | n − (x1 + x2) | n

Row 1 shows results of a binomial experiment with n1 trials
Row 2 shows results of a binomial experiment with n2 trials

29 / 69

SLIDE 30

How do we compare these proportions

Often, we want to compare p1, the probability of success in population 1, to p2, the probability of success in population 2

Usually:
  “Success” = Disease
  Population 1 = Treatment 1

How do we compare these proportions?

It depends!

30 / 69

SLIDE 31

Study Designs

Cross-sectional
Cohort
Case-control
  Matched case-control

31 / 69

SLIDE 32

Cohort Studies

Application to Aceh Vitamin A Trial
25,939 pre-school children in 450 Indonesian villages in northern Sumatra
200,000 IU vitamin A given 1-3 months after the baseline census, and again at 6-8 months
Consider 23,682 out of 25,939 who were visited on a pre-designed schedule

32 / 69

SLIDE 33

Trial Outcome

Vit A | Alive at 12 months? No | Alive at 12 months? Yes | Total
Yes | 46 | 12,048 | 12,094
No | 74 | 11,514 | 11,588
Total | 120 | 23,562 | 23,682

Does Vitamin A reduce mortality?
Calculate risk ratio or “relative risk”
  Relative Risk abbreviated as RR
Could also compare difference in proportions: called “attributable risk”

33 / 69

SLIDE 34

Relative Risk Calculation

$\text{Relative Risk} = \frac{\text{Rate with Vitamin A}}{\text{Rate without Vitamin A}} = \frac{\hat{p}_1}{\hat{p}_2} = \frac{46/12{,}094}{74/11{,}588} = \frac{0.0038}{0.0064} = 0.59$

Vitamin A group had 40% lower mortality!

34 / 69

SLIDE 35

Confidence interval for RR

Step 1: Find the estimate of the log RR: $\log(\hat{p}_1/\hat{p}_2)$
Step 2: Estimate the variance of the log(RR) as: $\frac{1 - p_1}{n_1 p_1} + \frac{1 - p_2}{n_2 p_2}$
Step 3: Find the 95% CI for log(RR): $\log(RR) \pm 1.96 \cdot SD(\log RR) = (\text{lower}, \text{upper})$
Step 4: Exponentiate to get 95% CI for RR: $(e^{\text{lower}}, e^{\text{upper}})$

35 / 69

SLIDE 36

Confidence interval for RR from Vitamin A Trial

95% CI for log relative risk is:

$\log(RR) \pm 1.96 \cdot SD(\log RR) = \log(0.59) \pm 1.96 \cdot \sqrt{\frac{0.9962}{46} + \frac{0.9936}{74}} = -0.53 \pm 0.37 = (-0.90, -0.16)$

95% CI for relative risk: $(e^{-0.90}, e^{-0.16}) = (0.41, 0.85)$
Does this confidence interval contain 1?
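The four steps for the relative risk interval can be sketched as follows, using the Vitamin A trial counts. The function name `relative_risk_ci` is hypothetical. Because the sketch works from unrounded proportions, its results may differ from the slide values in the second decimal place.

```python
import math

def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """Relative risk with a confidence interval built on the log scale."""
    p1, p2 = x1 / n1, x2 / n2
    rr = p1 / p2
    # Step 2: estimated variance of log(RR), then its square root.
    se_log_rr = math.sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
    # Steps 3 and 4: interval for log(RR), then exponentiate both endpoints.
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Deaths / total: 46 / 12,094 with Vitamin A and 74 / 11,588 without.
rr, lower, upper = relative_risk_ci(46, 12094, 74, 11588)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("interval contains 1:", lower <= 1 <= upper)
```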

36 / 69

SLIDE 37

What if the data were from a case-control study?

Recall: in case-control studies, individuals are selected by outcome status
Disease (mortality) status defines the population, and exposure status defines the success
p1 and p2 have a different interpretation in a case-control study than in a cohort study
Cohort:
  p1 = P(Disease | Exposure)
  p2 = P(Disease | No Exposure)
Case-Control:
  p1 = P(Exposure | Disease)
  p2 = P(Exposure | No Disease)
⇒ This is why we cannot estimate the relative risk from case-control data!

37 / 69

SLIDE 38

The Odds Ratio

The odds ratio measures association in Case-Control studies

$\text{Odds} = \frac{P(\text{event occurs})}{P(\text{event does not occur})}$

Odds ratio for death given Vitamin A status is the odds of death given Vitamin A divided by the odds of death given no Vitamin A

$OR = \frac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_2/(1 - \hat{p}_2)}$

38 / 69

SLIDE 39

Which p1 and p2 do we use?

Calculate OR both ways
Using “case-control” p1 and p2:

$OR = \frac{(46/120)/(74/120)}{(12048/23562)/(11514/23562)} = \frac{46/74}{12048/11514} = 0.59$

Using “cohort” p1 and p2:

$OR = \frac{(46/12094)/(12048/12094)}{(74/11588)/(11514/11588)} = \frac{46/12048}{74/11514} = 0.59$

We get the same answer either way!
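A short numerical check of this invariance, using the Vitamin A table, might look like the sketch below. The helper `odds` and the variable names are assumptions made for this example.

```python
import math

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Vitamin A trial counts: deaths and survivors by treatment group.
deaths_vit_a, alive_vit_a = 46, 12048
deaths_no_vit_a, alive_no_vit_a = 74, 11514

# "Cohort" view: probability of death within each exposure group.
p1_cohort = deaths_vit_a / (deaths_vit_a + alive_vit_a)
p2_cohort = deaths_no_vit_a / (deaths_no_vit_a + alive_no_vit_a)
or_cohort = odds(p1_cohort) / odds(p2_cohort)

# "Case-control" view: probability of exposure within each outcome group.
p1_case_control = deaths_vit_a / (deaths_vit_a + deaths_no_vit_a)
p2_case_control = alive_vit_a / (alive_vit_a + alive_no_vit_a)
or_case_control = odds(p1_case_control) / odds(p2_case_control)

# Both reduce to the cross-product ratio of the 2 x 2 table.
cross_product = (deaths_vit_a * alive_no_vit_a) / (deaths_no_vit_a * alive_vit_a)
print(f"{or_cohort:.2f} {or_case_control:.2f} {cross_product:.2f}")
```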

39 / 69

SLIDE 40

Bottom Line

The relative risk cannot be estimated from a case-control study
The odds ratio can be estimated from a case-control study
OR estimates the RR when the disease is rare
The OR is invariant to cohort or case-control designs; the RR is not

40 / 69

SLIDE 41

Confidence interval for OR

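Step 1: Find the estimate of the log OR: $\log\left(\frac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_2/(1 - \hat{p}_2)}\right)$
Step 2: Estimate the variance of the log(OR) as: $\frac{1}{n_1 p_1} + \frac{1}{n_1 q_1} + \frac{1}{n_2 p_2} + \frac{1}{n_2 q_2}$, where $q_i = 1 - p_i$
Step 3: Find the 95% CI for log(OR): $\log(OR) \pm 1.96 \cdot SD(\log OR) = (\text{lower}, \text{upper})$
Step 4: Exponentiate to get 95% CI for OR: $(e^{\text{lower}}, e^{\text{upper}})$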
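The slides do not work this interval through with numbers, so the sketch below applies the four steps to the Vitamin A table purely as an illustration. It uses the fact that each term of the variance, such as n1·p1, is just a cell count of the 2 × 2 table. The function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and CI for a 2 x 2 table with rows (a, b) and (c, d)."""
    odds_ratio = (a * d) / (b * c)
    # Step 2: variance of log(OR) is the sum of reciprocal cell counts,
    # since n1*p1 = a, n1*q1 = b, n2*p2 = c, n2*q2 = d.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Steps 3 and 4: interval on the log scale, then exponentiate.
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Rows: Vitamin A (46 deaths, 12,048 alive), no Vitamin A (74 deaths, 11,514 alive).
odds_ratio, lower, upper = odds_ratio_ci(46, 12048, 74, 11514)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because death is rare in this trial, the result is close to the relative risk interval, in line with the point that the OR estimates the RR when the disease is rare.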

41 / 69

SLIDE 42

Matched-pairs case-control study design I

Samples not independent
Cases and controls matched on age, race, sex, etc.
The data are summarized in a different type of table

42 / 69

SLIDE 43

Matched-pairs case-control study design II

Results
E = exposed
Ec = not exposed
N = total number of pairs
Concordant pair: same exposure
Discordant pair: different exposure

Cases | Controls: E | Controls: Ec | Total
E | a | b | a+b
Ec | c | d | c+d
Total | a+c | b+d | N

43 / 69

SLIDE 44

Matched-pairs case-control study design III

Concordant pairs provide little information about differences
We focus on the discordant pairs
  EEc pairs (b), in which the case is exposed and the control is unexposed
  EcE pairs (c), in which the case is unexposed and the control is exposed

44 / 69

SLIDE 45

Matched-pairs case-control study design IV

Under the null hypothesis of no difference: $P(EE^c) = P(E^cE) = \frac{1}{2} = p$
The number of EEc discordant pairs follows a binomial distribution
  mean = np
  variance = npq
  n = b+c (the total number of discordant pairs)
So we can test the null hypothesis, H0 : p = 1/2, using the test statistic

$z = \frac{b - \frac{n}{2}}{\sqrt{\frac{1}{2} \cdot \frac{1}{2} \cdot n}}$

which is approximately normally distributed

45 / 69

SLIDE 46

McNemar’s Test

Algebra shows that:

$z^2 = \left(\frac{b - \frac{n}{2}}{\sqrt{\frac{1}{2} \cdot \frac{1}{2} \cdot n}}\right)^2 = \frac{(b - c)^2}{b + c} \sim \chi^2_1$

This test statistic is much easier to look at, but always gives us the same result as our original z-test
Note that the $\chi^2_1$ distribution is defined as the distribution of $Z^2$ where $Z \sim N(0, 1)$
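For readers who want the intermediate algebra, one way to fill it in is to substitute n = b + c into the numerator and denominator of z:

```latex
\begin{align*}
b - \frac{n}{2} &= b - \frac{b + c}{2} = \frac{b - c}{2}, \\
\frac{1}{2} \cdot \frac{1}{2} \cdot n &= \frac{b + c}{4}, \\
z^2 &= \frac{\left(\frac{b - c}{2}\right)^2}{\frac{b + c}{4}} = \frac{(b - c)^2}{b + c}.
\end{align*}
```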

46 / 69

SLIDE 47

Example: Estrogen and Endometrial Cancer I

H0 : OR = 1
Ha : OR ≠ 1
Matched pairs design

Cases | Controls: Estrogen | Controls: No estrogen | Total
Estrogen | 17 | 76 | 93
No estrogen | 10 | 111 | 121
Total | 27 | 187 | 214 pairs

47 / 69

SLIDE 48

Example: Estrogen and Endometrial Cancer II

$OR = \frac{b}{c} = \frac{76}{10} = 7.6$ = estimate of the relative risk of disease for exposed vs. unexposed
McNemar’s test statistic:

$z^2 = \frac{(b - c)^2}{b + c} = \frac{(76 - 10)^2}{76 + 10} = 50.65$

The estimated odds of endometrial cancer among estrogen users are 7.6 times the odds of cancer among those with no estrogen exposure (p<0.001).
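McNemar's statistic and its p-value can be checked with a few lines of Python. This is a minimal sketch without a continuity correction, matching the formula on the slide; the function name `mcnemar_test` is an assumption of this example.

```python
import math

def mcnemar_test(b, c):
    """McNemar's chi-square statistic (1 df) and p-value from discordant pairs."""
    statistic = (b - c) ** 2 / (b + c)
    # A chi-square variable with 1 df is Z^2, so P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Discordant pairs from the estrogen example: b = 76, c = 10.
b, c = 76, 10
odds_ratio = b / c
statistic, p_value = mcnemar_test(b, c)
print(f"OR = {odds_ratio}, statistic = {statistic:.2f}, p < 0.001: {p_value < 0.001}")
```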

48 / 69

SLIDE 49

Confidence Interval for Matched-Pairs OR

Step 1: Find the estimate of log(OR): $\log(b/c) = \log(b) - \log(c)$
Step 2: Estimate the variance of log(OR): $\text{var}[\log(OR)] = \frac{1}{b} + \frac{1}{c}$
Step 3: Find the 95% CI for log(OR): $\log(OR) \pm 1.96 \cdot SD(\log OR) = (\text{lower}, \text{upper})$
Step 4: Exponentiate: $(e^{\text{lower}}, e^{\text{upper}})$

49 / 69

SLIDE 50

Confidence Interval for Matched-Pairs OR Estrogen and Endometrial Cancer Example

Thus, a 95% CI for the log odds ratio is:

$\log(OR) \pm 1.96 \cdot SD(\log OR) = \log(7.6) \pm 1.96 \cdot \sqrt{\frac{1}{76} + \frac{1}{10}} = 2.03 \pm 0.66 = (1.37, 2.69)$

The 95% CI for the odds ratio is $(e^{1.37}, e^{2.69}) = (3.93, 14.73)$
Does this interval contain 1?
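The matched-pairs interval can be sketched the same way. The function name `matched_pairs_or_ci` is hypothetical. Since the sketch does not round the log-scale endpoints before exponentiating, the upper limit it prints may differ slightly from the slide's 14.73.

```python
import math

def matched_pairs_or_ci(b, c, z=1.96):
    """Matched-pairs odds ratio b/c with a CI built on the log scale."""
    odds_ratio = b / c
    # Variance of log(OR) for matched pairs is 1/b + 1/c.
    se_log_or = math.sqrt(1 / b + 1 / c)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Discordant pairs from the estrogen example: b = 76, c = 10.
odds_ratio, lower, upper = matched_pairs_or_ci(76, 10)
print(f"OR = {odds_ratio}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("interval contains 1:", lower <= 1 <= upper)
```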

50 / 69

SLIDE 51

When to match subjects

Genetic predisposition to glaucoma or lifestyle might confound the association between contact lens use and development of glaucoma
Matching on potential confounders removes those variables from the analysis: the association is automatically adjusted for any matched variables

51 / 69

SLIDE 52

When not to match subjects I

Never match for a variable in the causal pathway between the predictor and the outcome: such matching may remove the association
Don’t match on too many things
  It may be hard to find controls matched on age, gender, SES, race, BMI, and smoking for each available case
  If such matched controls were available, the data might be “overmatched”, so that few differences remain between cases and controls

52 / 69

SLIDE 53

When not to match subjects II

An alternative to matching for potential confounders is to adjust for them

Continuous outcome: linear regression Binary outcome: logistic regression

When in doubt, design an unmatched study

We can always adjust for confounders We can never unmatch matched data

53 / 69

SLIDE 54

All kinds of χ2 tests

Test of goodness of fit
Test of independence
Test of homogeneity or (no) association
All of these test statistics have a χ2 distribution under the null

54 / 69

SLIDE 55

The χ2 distribution

Derived from the normal distribution

$\chi^2_1 = \left(\frac{y - \mu}{\sigma}\right)^2 = Z^2$

$\chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2$

where $Z_1, \ldots, Z_k$ are all independent standard normal random variables
k denotes the degrees of freedom
A $\chi^2_k$ random variable has
  mean = k
  variance = 2k

55 / 69

SLIDE 56

The χ2 test statistic

$\chi^2 = \sum_{i=1}^{k} \left[\frac{(O_i - E_i)^2}{E_i}\right]$

where
  $O_i$ = ith observed frequency
  $E_i$ = ith expected frequency
in the ith cell of a table
Note: This test is based on frequencies (cell counts) in a table, not proportions

56 / 69

SLIDE 57

χ2 Family of Distributions

57 / 69

SLIDE 58

χ2 Table

58 / 69

SLIDE 59

χ2 Goodness-of-Fit Test

Determine whether or not a sample of observed values of some random variable is compatible with the hypothesis that the sample was drawn from a population with a specified distributional form, e.g.
  Normal
  Binomial
  Poisson
  etc.
Here, the expected cell counts would be derived from the distributional assumption under the null hypothesis

59 / 69

SLIDE 60

Example: Handgun survey I

Survey 200 adults regarding handgun bill:

Statement: “I agree with a ban on handguns” Four categories: Strongly agree, agree, disagree, strongly disagree

Can one conclude that opinions are equally distributed over four responses?

60 / 69

SLIDE 61

Example: Handgun survey II

Response | 1: Strongly agree | 2: Agree | 3: Disagree | 4: Strongly disagree
Responding (Oi) | 102 | 30 | 60 | 8
Expected (Ei) | 50 | 50 | 50 | 50

$\chi^2 = \sum_{i=1}^{k} \left[\frac{(O_i - E_i)^2}{E_i}\right] = \frac{(102 - 50)^2}{50} + \frac{(30 - 50)^2}{50} + \frac{(60 - 50)^2}{50} + \frac{(8 - 50)^2}{50} = 99.36$

61 / 69

SLIDE 62

Example: Handgun survey III

Critical value: $\chi^2_{4-1, 0.05} = \chi^2_{3, 0.05} = 7.81$
Since 99.36 > 7.81, we conclude that our observation was unlikely by chance alone
Based on these data, opinions do not appear to be equally distributed among the four responses
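The goodness-of-fit calculation for the handgun survey can be sketched as below. The critical value 7.81 is taken from the slide rather than computed, and the function name `chi_square_statistic` is an assumption of this example.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed counts: strongly agree, agree, disagree, strongly disagree.
observed = [102, 30, 60, 8]

# Under H0 the 200 responses are spread equally over the four categories.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
critical_value = 7.81  # chi-square critical value for 3 df at the 0.05 level, from the slide

reject_h0 = statistic > critical_value
print(f"chi-square = {statistic:.2f}, df = {degrees_of_freedom}, reject H0: {reject_h0}")
```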

62 / 69

SLIDE 63

χ2 Test of Independence I

Test the null hypothesis that two criteria of classification are independent
r × c contingency table

Criterion 2 | Criterion 1: 1 | 2 | 3 | ··· | c | Total
1 | n11 | n12 | n13 | ··· | n1c | n1·
2 | n21 | n22 | n23 | ··· | n2c | n2·
3 | n31 | n32 | n33 | ··· | n3c | n3·
⋮ | ⋮ | ⋮ | ⋮ | | ⋮ | ⋮
r | nr1 | nr2 | nr3 | ··· | nrc | nr·
Total | n·1 | n·2 | n·3 | ··· | n·c | n

63 / 69

SLIDE 64

χ2 Test of Independence II

Test statistic:

$\chi^2 = \sum_{i=1}^{k} \left[\frac{(O_i - E_i)^2}{E_i}\right]$

Degrees of freedom = (r-1)(c-1)
Assume the marginal totals are fixed

64 / 69

SLIDE 65

χ2 Test of Homogeneity (No association)

Test the null hypothesis that the samples are drawn from populations that are homogeneous with respect to some factor

i.e. no association between group and factor

Same test statistic as χ2 test of independence

65 / 69

SLIDE 66

Example: Treatment response I

Observed numbers:

Treatment | Response to Treatment: Yes | No | Total
A | 37 | 13 | 50
B | 17 | 53 | 70
Total | 54 | 66 | 120

Calculate what numbers of “Yes” and “No” would be expected assuming the probability of “Yes” was the same in both groups
Condition on the total number of “Yes” and “No” responses

66 / 69

SLIDE 67

Example: Treatment response II

Expected proportion with “Yes” response = 54/120 = 0.45
Expected proportion with “No” response = 66/120 = 0.55

Observed (Expected) numbers:

Treatment | Response to Treatment: Yes | No | Total
A | 37 (22.5) | 13 (27.5) | 50
B | 17 (31.5) | 53 (38.5) | 70
Total | 54 | 66 | 120

67 / 69

SLIDE 68

Example: Treatment response III

Test statistic:

$\chi^2 = \sum_{i=1}^{k} \left[\frac{(O_i - E_i)^2}{E_i}\right] = \frac{(37 - 22.5)^2}{22.5} + \frac{(13 - 27.5)^2}{27.5} + \frac{(17 - 31.5)^2}{31.5} + \frac{(53 - 38.5)^2}{38.5} = 29.1$

Degrees of freedom = (r-1)(c-1) = (2-1)(2-1) = 1
where r = num. rows, and c = num. columns

68 / 69

SLIDE 69

Example: Treatment response IV

We see p<0.001
Reject the null hypothesis, and conclude that the treatment groups are not homogeneous (similar) with respect to response
Response appears to be associated with treatment
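The whole treatment response example, from expected counts to p-value, can be sketched as follows. The function names are hypothetical, and the closed-form p-value used here is valid only for 1 degree of freedom, which is the case for a 2 × 2 table.

```python
import math

def expected_counts(table):
    """Expected cell counts given fixed row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_from_table(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

# Rows: treatment A, treatment B. Columns: response Yes, response No.
observed = [[37, 13], [17, 53]]
statistic, df, expected = chi_square_from_table(observed)

# With 1 df the statistic is distributed as Z^2, so P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))
print(f"expected = {expected}")
print(f"chi-square = {statistic:.1f}, df = {df}, p < 0.001: {p_value < 0.001}")
```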

69 / 69