Hypothesis testing
DS GA 1002 Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall15
Carlos Fernandez-Granda
Example
In a medical study 10% of women and 12.5% of men suffer from heart disease
Hypothesis: Men are more prone to have heart disease than women
If there are 20 people in the study, the effect could be due to chance
If there are 20 000 people, we are more convinced
Hypothesis testing makes this precise
Hypothesis testing
Framework to decide whether patterns in data are random fluctuations
Aim: Establish whether a predefined hypothesis is supported by the data
The hypothesis testing framework Parametric testing Nonparametric testing Multiple testing
Null and alternative hypotheses
Null hypothesis H0: There is no underlying phenomenon (men are not more prone to heart disease)
Alternative hypothesis H1: There is an underlying phenomenon
We reject H0 if it does not explain the data well
Failing to reject H0 does not mean that we think it holds; we just don't have enough evidence
Frequentist perspective: A hypothesis holds or does not hold deterministically
Tests
A test is a procedure to decide whether to reject the null hypothesis
General strategy:
- 1. Compute a test statistic from the data T (x1, . . . , xn)
- 2. Decide on a rejection region R such that T (x1, . . . , xn) ∈ R is very unlikely if the null hypothesis holds
- 3. Reject the null hypothesis if T (x1, . . . , xn) ∈ R
Errors
Reject H0?    | No               | Yes
H0 is true    | Correct decision | Type I error
H1 is true    | Type II error    | Correct decision
Size and significance level
Priority: Control Type I errors
The size of a test is the probability of making a Type I error
The significance level is an upper bound on the size
Significance level
The effect is significant (at a level of 5%)
Translation: Given the assumed probabilistic model, the probability that we reject the null hypothesis when it is true is at most 5%
p value
The p value is the smallest significance level at which we would reject H0 for a particular dataset
It is a function of the data, not a probability
Power
The power of a test is the probability of rejecting H0 under H1
For a given significance level, we want as much power as possible
Problem: We need to know the distribution of the data under H1!
Overview
- 1. Choose a conjecture
- 2. Determine the corresponding null hypothesis
- 3. Choose a test
- 4. Gather the data
- 5. Compute the test statistic from the data
- 6. Compute the p value and reject the null hypothesis if it is below a predefined limit (typically 1% or 5%)
Example: Clutch
Conjecture: NBA player is more effective in 4th quarter
Null hypothesis: He's equally effective
Test statistic: Games out of 20 in which he scores more points per minute in the 4th quarter
What threshold do we need to ensure a significance level of 1%? Of 5%?
The test statistic is 14, what is the p value?
Example: Clutch
T0 represents the test statistic under the null hypothesis
We consider a rejection region of the form R := {t | t ≥ η}
The size of the test is
$P(T_0 \geq \eta) = \frac{1}{2^n} \sum_{k=\eta}^{n} \binom{n}{k}$
What is the distribution of the test statistic T0 under the null hypothesis?
Binomial with parameters 20 and 1/2 (so n = 20 in the formula above)
Distribution under null hypothesis
η            | 1     | 2     | 3     | 4     | 5
P (T0 ≥ η)   | 1.000 | 1.000 | 1.000 | 0.999 | 0.994

η            | 6     | 7     | 8     | 9     | 10
P (T0 ≥ η)   | 0.979 | 0.942 | 0.868 | 0.748 | 0.588

η            | 11    | 12    | 13    | 14    | 15
P (T0 ≥ η)   | 0.412 | 0.252 | 0.132 | 0.058 | 0.021

η            | 16    | 17    | 18    | 19    | 20
P (T0 ≥ η)   | 0.006 | 0.001 | 0.000 | 0.000 | 0.000
Example: Clutch
What threshold do we need to ensure a significance level of 1%? 16
What threshold do we need to ensure a significance level of 5%? 15
The test statistic is 14, what is the p value? 5.8 %
Is this the probability that the null hypothesis holds? No!
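The thresholds and the p value in the clutch example can be checked numerically. The following is a minimal sketch, assuming the test statistic is Binomial(20, 1/2) under the null hypothesis as stated in the slides; the function names `tail_probability` and `smallest_threshold` are illustrative, not part of the course material.

```python
from math import comb


def tail_probability(eta, n=20):
    """P(T0 >= eta) when T0 is Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(eta, n + 1)) / 2 ** n


def smallest_threshold(alpha, n=20):
    """Smallest eta such that the region {t >= eta} has size at most alpha."""
    for eta in range(n + 2):  # eta = n + 1 gives an empty region of size 0
        if tail_probability(eta, n) <= alpha:
            return eta


print(smallest_threshold(0.01))          # 16
print(smallest_threshold(0.05))          # 15
print(round(tail_probability(14), 3))    # 0.058, the p value for T = 14
```

Because the binomial distribution is discrete, the actual size at the chosen threshold is usually strictly below the target significance level.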
The hypothesis testing framework Parametric testing Nonparametric testing Multiple testing
Parametric testing
Data are sampled from a known distribution with unknown parameters
Probability measure $P_{\theta}$ depends on θ
Frequentist perspective: The parameter is deterministic and so are the hypotheses
Notation: $\vec{X}$ is a random vector distributed according to $P_{\theta}$, the data $\vec{x}$ are a realization of $\vec{X}$
If H0 is θ = θ0
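The size of a test with test statistic T and rejection region R is
$\alpha := P_{\theta_0}\left( T(\vec{X}) \in R \right)$
If the rejection region is of the form $T(\vec{x}) \geq \eta$,
$\alpha = P_{\theta_0}\left( T(\vec{X}) \geq \eta \right)$
The largest η at which we still reject H0 is $T(\vec{x})$, which gives the smallest size at which we reject:
$p = P_{\theta_0}\left( T(\vec{X}) \geq T(\vec{x}) \right)$
p value: probability under H0 of observing a test statistic that is at least as extreme as the one we observe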
Composite hypotheses
θ = θ0 is a simple hypothesis
A composite hypothesis is of the form θ ∈ S for a certain set S
The size of a composite test is
$\alpha = \sup_{\theta \in H_0} P_{\theta}\left( T(\vec{X}) \geq \eta \right)$
The p value is
$p = \sup_{\theta \in H_0} P_{\theta}\left( T(\vec{X}) \geq T(\vec{x}) \right)$
Power function
The power function of the test is defined as
$\beta(\theta) := P_{\theta}\left( T(\vec{X}) \in R \right)$
We want β (θ) ≈ 0 for θ ∈ H0 and β (θ) ≈ 1 for θ ∈ H1
Example: Coin flip
Conjecture: Coin is biased towards heads, θ > 1/2
Null hypothesis: Coin is not biased towards heads, θ ≤ 1/2
Test statistic: Number of heads out of n = 5, 10, 100 flips
Rejection region: Heads = n, Heads ≥ 3n/5
Power function?
Coin flip power function
If η = n,
$\beta_1(\theta) = P_{\theta}\left( T(\vec{X}) \in R \right) = \theta^n$
If η = 3n/5,
$\beta_2(\theta) = P_{\theta}\left( T(\vec{X}) \in R \right) = \sum_{k=3n/5}^{n} \binom{n}{k} \theta^k (1 - \theta)^{n-k}$
[Figure, two panels titled "η = n" and "η ≥ 3n/5": power function β(θ) plotted against θ for n = 5, n = 50 and n = 100]
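The two power functions can be evaluated directly from the binomial formula. This is a minimal sketch, assuming the test statistic is the number of heads in n independent flips with heads probability θ; the function name `power` and the grid of θ values are illustrative choices, and the threshold 3n/5 is computed with integer division, which assumes n is a multiple of 5.

```python
from math import comb


def power(theta, n, eta):
    """beta(theta) = P_theta(T >= eta) for T ~ Binomial(n, theta)."""
    return sum(
        comb(n, k) * theta ** k * (1 - theta) ** (n - k)
        for k in range(eta, n + 1)
    )


for n in (5, 50, 100):
    eta_all_heads = n           # rejection region: Heads = n
    eta_three_fifths = 3 * n // 5  # rejection region: Heads >= 3n/5
    for theta in (0.25, 0.5, 0.75):
        b1 = power(theta, n, eta_all_heads)
        b2 = power(theta, n, eta_three_fifths)
        print(f"n={n:3d} theta={theta:.2f} beta1={b1:.4f} beta2={b2:.4f}")
```

With η = n the size is small but so is the power for any θ < 1, whereas η = 3n/5 trades a larger size at small n for much higher power as n grows.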
Likelihood-ratio test
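Threshold the ratio between likelihoods: the rejection region is $\{\Lambda(\vec{x}) \leq \eta\}$, where
$\Lambda(\vec{x}) := \frac{\sup_{\theta \in H_0} L_{\vec{x}}(\theta)}{\sup_{\theta \in H_1} L_{\vec{x}}(\theta)}$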
Intuition: Unless the ratio is low, we cannot rule out the null hypothesis
Example: Gaussian with known variance σ2
Conjecture: µ ≠ µ0
Null hypothesis: µ = µ0
Test statistic: Likelihood ratio
Find threshold for significance level α
Example: Gaussian with known variance σ2
Empirical mean maximizes likelihood for any value of σ
$\operatorname{av}(\vec{x}) := \frac{1}{n} \sum_{i=1}^{n} x_i = \arg\max_{\mu} L_{\vec{x}}(\mu, \sigma)$
Example: Gaussian with known variance σ2
$$
\begin{aligned}
\Lambda(\vec{x}) &= \frac{\sup_{\mu \in H_0} L_{\vec{x}}(\mu)}{\sup_{\mu \in H_1} L_{\vec{x}}(\mu)} \\
&= \frac{L_{\vec{x}}(\mu_0)}{L_{\vec{x}}(\operatorname{av}(\vec{x}))} \\
&= \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ (x_i - \mu_0)^2 - (x_i - \operatorname{av}(\vec{x}))^2 \right] \right) \\
&= \exp\left( -\frac{1}{2\sigma^2} \left( 2 \operatorname{av}(\vec{x}) \sum_{i=1}^{n} x_i - n \operatorname{av}(\vec{x})^2 - 2\mu_0 \sum_{i=1}^{n} x_i + n\mu_0^2 \right) \right) \\
&= \exp\left( -\frac{n \left( \operatorname{av}(\vec{x}) - \mu_0 \right)^2}{2\sigma^2} \right)
\end{aligned}
$$
The last step uses $\sum_{i=1}^{n} x_i = n \operatorname{av}(\vec{x})$
Example: Gaussian with known variance σ2
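The likelihood-ratio test rejects H0 if
$\left| \operatorname{av}(\vec{x}) - \mu_0 \right| \geq \sigma \sqrt{\frac{-2 \log \eta}{n}}$
Under the null hypothesis $\operatorname{av}(\vec{X})$ is Gaussian with mean µ0 and variance σ²/n
$\alpha = P_{\mu_0}\left( \left| \frac{\operatorname{av}(\vec{X}) - \mu_0}{\sigma / \sqrt{n}} \right| \geq \sqrt{-2 \log \eta} \right) = 2 Q\left( \sqrt{-2 \log \eta} \right)$
where Q is the tail probability of a standard Gaussian
For a significance level of α, reject H0 if
$\left| \operatorname{av}(\vec{x}) - \mu_0 \right| \geq \frac{\sigma \, Q^{-1}(\alpha / 2)}{\sqrt{n}}$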
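The resulting two-sided test on the empirical mean can be sketched in a few lines. This is an illustration under the slide's assumptions (i.i.d. Gaussian data with known σ), using the standard library's `statistics.NormalDist`; the function name `gaussian_mean_test` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def gaussian_mean_test(x, mu0, sigma, alpha=0.05):
    """Two-sided test of mu = mu0 for Gaussian data with known sigma.

    Returns (reject, threshold, p_value).
    """
    n = len(x)
    standard = NormalDist()  # zero mean, unit variance
    # Q^{-1}(alpha / 2) equals the (1 - alpha / 2) quantile of a standard Gaussian.
    threshold = sigma * standard.inv_cdf(1 - alpha / 2) / sqrt(n)
    deviation = abs(fmean(x) - mu0)
    # p value: 2 Q(sqrt(n) |av(x) - mu0| / sigma)
    p_value = 2 * (1 - standard.cdf(deviation * sqrt(n) / sigma))
    return deviation >= threshold, threshold, p_value


# Hypothetical data: four observations, all equal to 1, tested against mu0 = 0.
reject, threshold, p_value = gaussian_mean_test([1.0, 1.0, 1.0, 1.0], mu0=0.0, sigma=1.0)
print(reject, round(threshold, 3), round(p_value, 4))
```

Rejecting when the deviation exceeds the threshold is equivalent to rejecting when the p value is at most α, since both compare the same standardized deviation against the same Gaussian tail.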
Neyman-Pearson Lemma
If H0 is θ = θ0 and H1 is θ = θ1 then the likelihood-ratio test has the highest power among all tests with a fixed size
Neyman-Pearson Lemma: Proof
We denote the rejection region of the likelihood-ratio test by $R_{LR}$
An arbitrary test with rejection region R has power $P_{\theta_1}(\vec{X} \in R)$
Our aim is to prove
$P_{\theta_1}(\vec{X} \in R_{LR}) \geq P_{\theta_1}(\vec{X} \in R)$
or equivalently
$P_{\theta_1}(\vec{X} \in R^c \cap R_{LR}) \geq P_{\theta_1}(\vec{X} \in R_{LR}^c \cap R)$
Neyman-Pearson Lemma: Proof
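Both tests have size α so
$P_{\theta_0}(\vec{X} \in R) = \alpha = P_{\theta_0}(\vec{X} \in R_{LR})$
and consequently
$$
\begin{aligned}
P_{\theta_0}(\vec{X} \in R^c \cap R_{LR}) &= P_{\theta_0}(\vec{X} \in R_{LR}) - P_{\theta_0}(\vec{X} \in R \cap R_{LR}) \\
&= P_{\theta_0}(\vec{X} \in R) - P_{\theta_0}(\vec{X} \in R \cap R_{LR}) \\
&= P_{\theta_0}(\vec{X} \in R \cap R_{LR}^c)
\end{aligned}
$$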
Neyman-Pearson Lemma: Proof
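◮ If $\vec{x} \in R_{LR}$, then $\Lambda(\vec{x}) = f_{\theta_0}(\vec{x}) / f_{\theta_1}(\vec{x}) \leq \eta$, so $f_{\theta_1}(\vec{x}) \geq \frac{f_{\theta_0}(\vec{x})}{\eta}$
◮ If $\vec{x} \in R_{LR}^c$, then $f_{\theta_1}(\vec{x}) \leq \frac{f_{\theta_0}(\vec{x})}{\eta}$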
Neyman-Pearson Lemma: Proof
$$
\begin{aligned}
P_{\theta_1}(\vec{X} \in R^c \cap R_{LR}) &= \int_{\vec{x} \in R^c \cap R_{LR}} f_{\theta_1}(\vec{x}) \, d\vec{x} \\
&\geq \frac{1}{\eta} \int_{\vec{x} \in R^c \cap R_{LR}} f_{\theta_0}(\vec{x}) \, d\vec{x} \\
&= \frac{1}{\eta} P_{\theta_0}(\vec{X} \in R^c \cap R_{LR}) \\
&= \frac{1}{\eta} P_{\theta_0}(\vec{X} \in R \cap R_{LR}^c) \\
&= \frac{1}{\eta} \int_{\vec{x} \in R \cap R_{LR}^c} f_{\theta_0}(\vec{x}) \, d\vec{x} \\
&\geq \int_{\vec{x} \in R \cap R_{LR}^c} f_{\theta_1}(\vec{x}) \, d\vec{x} \\
&= P_{\theta_1}(\vec{X} \in R \cap R_{LR}^c)
\end{aligned}
$$
The hypothesis testing framework Parametric testing Nonparametric testing Multiple testing
Permutation test
Aim: Compare two datasets xA and xB
Null hypothesis: The two datasets are sampled from the same distribution
No parametric model...
Test statistic
Choose test statistic T and evaluate the difference
$T_{\text{diff}}(\vec{x}) := T(\vec{x}_A) - T(\vec{x}_B)$
Test: R := {t | t ≥ η}
Problem: How do we determine significance level or p value?
Main insight: Exchangeability under permutations
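Under H0 the distribution of $T_{\text{diff}}(\vec{X})$ does not change if we permute labels
The joint distributions of $X_1, X_2, \ldots, X_n$ and of any permutation, for example $X_{24}, X_n, \ldots, X_3$, are the same
The values of $T_{\text{diff}}$ after permuting, $t_{\text{diff},1}, \ldots, t_{\text{diff},n!}$, are uniformly distributed
$P\left( T_{\text{diff}}(\vec{X}) \geq \eta \right) = \frac{1}{n!} \sum_{i=1}^{n!} 1_{t_{\text{diff},i} \geq \eta}$
This is the size of the test!
$p = P\left( T_{\text{diff}}(\vec{X}) \geq T_{\text{diff}}(\vec{x}) \right) = \frac{1}{n!} \sum_{i=1}^{n!} 1_{t_{\text{diff},i} \geq T_{\text{diff}}(\vec{x})}$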
Permutation test
- 1. Choose a conjecture as to how xA and xB are different
- 2. Choose a test statistic Tdiff
- 3. Compute $T_{\text{diff}}(\vec{x})$
- 4. Permute the labels m times and compute the corresponding values of Tdiff: $t_{\text{diff},1}, t_{\text{diff},2}, \ldots, t_{\text{diff},m}$
- 5. Compute the approximate p value
$p = P\left( T_{\text{diff}}(\vec{X}) \geq T_{\text{diff}}(\vec{x}) \right) \approx \frac{1}{m} \sum_{i=1}^{m} 1_{t_{\text{diff},i} \geq T_{\text{diff}}(\vec{x})}$
and reject the null hypothesis if it is below a predefined limit (1%, 5%)
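The procedure above can be sketched with the standard library alone. This is a minimal illustration, assuming the test statistic is the empirical mean and that labels are permuted by shuffling the pooled data; the function name `permutation_p_value`, the default m and the toy datasets are hypothetical, not taken from the course.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(x_a, x_b, statistic=mean, m=10_000, seed=0):
    """Approximate p value of T(x_a) - T(x_b) from m random label permutations."""
    rng = random.Random(seed)
    observed = statistic(x_a) - statistic(x_b)
    pooled = list(x_a) + list(x_b)
    n_a = len(x_a)
    count = 0
    for _ in range(m):
        rng.shuffle(pooled)  # random reassignment of the labels A and B
        t_diff = statistic(pooled[:n_a]) - statistic(pooled[n_a:])
        if t_diff >= observed:
            count += 1
    return count / m


# Toy data (made up for illustration): group A is clearly shifted upwards.
group_a = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
group_b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(permutation_p_value(group_a, group_b, m=2_000))
```

Sampling m permutations instead of enumerating all of them makes the p value itself random, so repeated runs with different seeds give slightly different values.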
Cholesterol levels
- 1. Study with 86 men and 182 women
- 2. Conjecture: men have higher cholesterol than women
- 3. Test statistic: empirical mean of cholesterol level
- 4. 261.3 mg/dl amongst men and 242.0 mg/dl amongst women
- 5. Null hypothesis: No difference, permuting data yields same distribution
- 6. We sample 10^6 permutations to compute an approximate p value
Cholesterol levels
[Figure: histograms of cholesterol levels for men and women]
[Figure, three plots: approximate distribution under the null hypothesis of the difference between the empirical means in men and women, with 19.22 marked on the horizontal axis; p value = 0.119%, 0.112% and 0.115%]
Blood pressure
- 1. Study with 86 men and 182 women
- 2. Conjecture: men have higher blood pressure than women
- 3. Test statistic: empirical mean of blood pressure
- 4. 133.2 mmHg amongst men and 130.6 mmHg amongst women
- 5. Null hypothesis: No difference, permuting data yields same distribution
- 6. We sample 10^6 permutations to compute an approximate p value
Blood pressure
[Figure: histograms of blood pressure for men and women]
[Figure, three plots: approximate distribution under the null hypothesis of the difference between the empirical means in men and women, with 2.6 marked on the horizontal axis; p value = 13.48%, 13.56% and 13.50%]
The hypothesis testing framework Parametric testing Nonparametric testing Multiple testing
Multiple testing
Often, we perform many simultaneous hypothesis tests
Computational genomics: many genes could be relevant
For n independent tests of size α
$P(\text{at least one false positive}) = 1 - P(\text{no false positives}) = 1 - (1 - \alpha)^n$
For α = 1% and n = 500 genes, P (at least one false positive) = 0.99!
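The figure quoted for 500 genes follows directly from the formula. A short check, assuming independent tests that each have size exactly α; the function name is illustrative.

```python
def prob_at_least_one_false_positive(alpha, n):
    """1 - (1 - alpha)^n for n independent tests of size alpha."""
    return 1 - (1 - alpha) ** n


for n in (1, 10, 100, 500):
    print(n, round(prob_at_least_one_false_positive(0.01, n), 4))
```

Even at α = 1% the probability of at least one false positive exceeds one half well before n reaches 100.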
Bonferroni’s method
Given n hypothesis tests, compute the corresponding p values p1, . . . , pn
For a fixed significance level α reject the ith null hypothesis if pi ≤ α/n
Probability of making a Type I error is bounded by α
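As a small illustration of the Bonferroni rule, the sketch below rejects the ith null hypothesis only when its p value is at most α/n. The function name `bonferroni_reject` and the p values are made up for the example.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if that null hypothesis is rejected."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


# Hypothetical p values from four tests; the per-test limit is 0.05 / 4 = 0.0125.
p_values = [0.001, 0.02, 0.04, 0.30]
print(bonferroni_reject(p_values))  # [True, False, False, False]
```

Two of these p values would pass an uncorrected 5% threshold, which shows how conservative the correction becomes as the number of tests grows.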
Bonferroni’s method
Union bound: For any events $S_1, \ldots, S_n$
$P\left( \cup_{i=1}^{n} S_i \right) \leq \sum_{i=1}^{n} P(S_i)$
$P(\text{Type I error}) = P\left( \cup_{i=1}^{n} \text{Type I error for test } i \right) \leq \sum_{i=1}^{n} P(\text{Type I error for test } i)$
Bonferroni’s method
Union bound: For any events S1, . . . , Sn P (∪n
i=1Si) ≤ n
- i=1
P (Si) P (Type I error) = P (∪n
i=1Type I error for test i)
≤
n
- i=1