
SLIDE 1

Hypothesis testing

DS GA 1002 Probability and Statistics for Data Science

http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall17

Carlos Fernandez-Granda

SLIDE 2

Example

In a medical study, 10% of women and 12.5% of men suffer from heart disease
Hypothesis: Men are more prone to heart disease than women
If there are 20 people in the study, the effect could be due to chance
If there are 20 000 people, we are more convinced
Hypothesis testing makes this precise

SLIDE 3

Hypothesis testing

Framework to decide whether patterns in data are random fluctuations
Aim: Establish whether a predefined hypothesis is supported by the data

SLIDE 4

The hypothesis testing framework
Parametric testing
Nonparametric testing
Multiple testing

SLIDE 5

Null and alternative hypotheses

Null hypothesis H0: There is no underlying phenomenon (men are not more prone to heart disease)
Alternative hypothesis H1: There is an underlying phenomenon
We reject H0 if it does not explain the data well
Failing to reject H0 does not mean that we think it holds; we just don’t have enough evidence to reject it
Frequentist perspective: A hypothesis holds or does not hold deterministically

SLIDE 6

Tests

A test is a procedure to decide whether to reject the null hypothesis

General strategy:

1. Compute a test statistic from the data T (x1, . . . , xn)
2. Decide on a rejection region R such that T (x1, . . . , xn) ∈ R is very unlikely if the null hypothesis holds
3. Reject the null hypothesis if T (x1, . . . , xn) ∈ R

SLIDE 7

Errors

                 Reject H0? No      Reject H0? Yes
H0 is true       Correct decision   Type I error
H1 is true       Type II error      Correct decision

SLIDE 8

Size and significance level

Priority: Control Type I errors
The size of a test is the probability of making a Type I error
The significance level is an upper bound on the size

SLIDE 9

Significance level

"The effect is significant (at a level of 5%)"
Translation: Given the assumed probabilistic model, the probability that we reject the null hypothesis when it is true is at most 5%

SLIDE 10

p value

The p value is the smallest significance level at which we would reject H0 for a particular dataset
It is a function of the data, not a probability

SLIDE 11

Overview

1. Choose a conjecture
2. Determine the corresponding null hypothesis
3. Choose a test
4. Gather the data
5. Compute the test statistic from the data
6. Compute the p value and reject the null hypothesis if it is below a predefined limit (typically 1% or 5%)

SLIDE 12

Example: Clutch

Conjecture: An NBA player is more effective in the 4th quarter
Null hypothesis: He’s equally effective
Test statistic: Number of games out of 20 in which he scores more points per minute in the 4th quarter
What threshold do we need to ensure a significance level of 1%? Of 5%?
The test statistic is 14. What is the p value?

SLIDE 16

Example: Clutch

T0 represents the test statistic under the null hypothesis
We consider a rejection region of the form R := {t | t ≥ η}
The size of the test is
$$ P(T_0 \ge \eta) = \frac{1}{2^n} \sum_{k=\eta}^{n} \binom{n}{k} $$
What is the distribution of the test statistic T0 under the null hypothesis?
Binomial with parameters 20 and 1/2

SLIDE 17

Distribution under null hypothesis

η             1      2      3      4      5
P (T0 ≥ η)    1.000  1.000  1.000  0.999  0.994

η             6      7      8      9      10
P (T0 ≥ η)    0.979  0.942  0.868  0.748  0.588

η             11     12     13     14     15
P (T0 ≥ η)    0.412  0.252  0.132  0.058  0.021

η             16     17     18     19     20
P (T0 ≥ η)    0.006  0.001  0.000  0.000  0.000
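The tail probabilities in this table can be recomputed directly from the binomial formula. The following is a minimal sketch using only the Python standard library; the function name `binomial_tail` is an illustrative choice, not something from the slides.

```python
from math import comb


def binomial_tail(n: int, eta: int, theta: float = 0.5) -> float:
    """Return P(T >= eta) for T ~ Binomial(n, theta)."""
    return sum(
        comb(n, k) * theta**k * (1 - theta) ** (n - k)
        for k in range(eta, n + 1)
    )


# Under the null hypothesis the statistic is Binomial(20, 1/2).
for eta in range(1, 21):
    print(f"eta = {eta:2d}   P(T0 >= eta) = {binomial_tail(20, eta):.3f}")
```

Each printed row should agree with the corresponding table entry after rounding to three decimals.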

SLIDE 23

Example: Clutch

What threshold do we need to ensure a significance level of 1%? 16
What threshold do we need to ensure a significance level of 5%? 15
The test statistic is 14, what is the p value? 5.8%
Is this the probability that the null hypothesis holds? No!
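These answers can be checked numerically. The sketch below searches for the smallest threshold whose rejection region has size at most the target level, and evaluates the p value for an observed statistic of 14. The helper names `tail` and `smallest_threshold` are hypothetical.

```python
from math import comb

N_GAMES = 20


def tail(eta: int, n: int = N_GAMES) -> float:
    """P(T0 >= eta) for T0 ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(eta, n + 1)) / 2**n


def smallest_threshold(alpha: float, n: int = N_GAMES) -> int:
    """Smallest eta such that the region {t >= eta} has size at most alpha."""
    for eta in range(n + 1):
        if tail(eta, n) <= alpha:
            return eta
    return n + 1  # no threshold in 0..n reaches the level: never reject


print(smallest_threshold(0.01))  # prints 16
print(smallest_threshold(0.05))  # prints 15
print(f"{tail(14):.3f}")         # prints 0.058, the p value for T = 14
```

Because the tail probability decreases as the threshold grows, the first threshold that meets the level is also the one that gives the largest rejection region.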

SLIDE 24

Power

The power of a test is the probability of rejecting H0 under H1
For a given significance level, we want as much power as possible
Problem: We need to know the distribution of the data under H1!
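To illustrate why the distribution under H1 is needed, the sketch below computes the power of the clutch test with threshold 15 under one assumed alternative. The value 0.7 for the probability of a better 4th quarter is purely hypothetical and does not come from the slides; a different alternative gives a different power.

```python
from math import comb


def rejection_probability(theta: float, eta: int, n: int = 20) -> float:
    """P(T >= eta) when T ~ Binomial(n, theta)."""
    return sum(
        comb(n, k) * theta**k * (1 - theta) ** (n - k)
        for k in range(eta, n + 1)
    )


# Size: probability of rejecting when H0 holds (theta = 1/2).
size = rejection_probability(0.5, eta=15)

# Power: probability of rejecting under an ASSUMED alternative, theta = 0.7.
power = rejection_probability(0.7, eta=15)

print(f"size  = {size:.3f}")
print(f"power = {power:.3f}")
```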

SLIDE 25

The hypothesis testing framework
Parametric testing
Nonparametric testing
Multiple testing

SLIDE 26

Parametric testing

Data are sampled from a known distribution with unknown parameters
Probability measure Pθ depends on θ
Frequentist perspective: The parameter is deterministic and so are the hypotheses
Notation: X is a random vector distributed according to Pθ; the data x are a realization of X

SLIDE 27

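If H0 is θ = θ0

The size of a test with test statistic T and rejection region R is
$$ \alpha := P_{\theta_0}\left( T(X) \in R \right) $$
If the rejection region is of the form T (x) ≥ η,
$$ \alpha = P_{\theta_0}\left( T(X) \ge \eta \right) $$
The largest η at which we still reject H0 is T (x), which gives the smallest size at which we reject:
$$ p = P_{\theta_0}\left( T(X) \ge T(x) \right) $$
p value: probability under H0 of observing a test statistic that is at least as extreme as the one we observe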

SLIDE 28

Composite hypotheses

θ = θ0 is a simple hypothesis
A composite hypothesis is of the form θ ∈ S for a certain set S
The size of a composite test is
$$ \alpha = \sup_{\theta \in H_0} P_{\theta}\left( T(X) \ge \eta \right) $$
The p value is
$$ p = \sup_{\theta \in H_0} P_{\theta}\left( T(X) \ge T(x) \right) $$

SLIDE 29

Power function

The power function of the test is defined as
$$ \beta(\theta) := P_{\theta}\left( T(X) \in R \right) $$
We want β (θ) ≈ 0 for θ ∈ H0 and β (θ) ≈ 1 for θ ∈ H1

SLIDE 30

Example: Coin flip

Conjecture: Coin is biased towards heads, θ > 1/2
Null hypothesis: Coin is not biased towards heads, θ ≤ 1/2
Test statistic: Number of heads out of n = 5, 10, 100 flips
Rejection region: Heads = n, or Heads ≥ 3n/5
Power function?

SLIDE 33

Coin flip power function

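If η = n,
$$ \beta_1(\theta) = P_{\theta}\left( T(X) \in R \right) = \theta^n $$
If η = 3n/5,
$$ \beta_2(\theta) = P_{\theta}\left( T(X) \in R \right) = \sum_{k=3n/5}^{n} \binom{n}{k} \theta^k (1-\theta)^{n-k} $$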
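Both power functions are easy to evaluate numerically. The sketch below does so for a few values of θ and n; the function names are illustrative, and the threshold 3n/5 is rounded up to an integer, which is an assumption for values of n where 3n/5 is not whole.

```python
from math import comb


def beta_all_heads(theta: float, n: int) -> float:
    """Power function for the rejection region {Heads = n}: theta^n."""
    return theta**n


def beta_three_fifths(theta: float, n: int) -> float:
    """Power function for the rejection region {Heads >= 3n/5}."""
    eta = (3 * n + 4) // 5  # integer ceiling of 3n/5
    return sum(
        comb(n, k) * theta**k * (1 - theta) ** (n - k)
        for k in range(eta, n + 1)
    )


for n in (5, 50, 100):
    for theta in (0.25, 0.5, 0.75):
        print(
            f"n = {n:3d}  theta = {theta:.2f}  "
            f"beta1 = {beta_all_heads(theta, n):.4f}  "
            f"beta2 = {beta_three_fifths(theta, n):.4f}"
        )
```

The first test keeps β(θ) small under H0 but has very little power for large n, whereas the second approaches the ideal step shape as n grows.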

SLIDE 34

η = n

[Figure: power function β(θ) plotted against θ, with curves for n = 5, n = 50 and n = 100]

SLIDE 35

η ≥ 3n/5

[Figure: power function β(θ) plotted against θ, with curves for n = 5, n = 50 and n = 100]

SLIDE 36

The hypothesis testing framework
Parametric testing
Nonparametric testing
Multiple testing

SLIDE 37

Permutation test

Aim: Compare two datasets xA and xB
Null hypothesis: The two datasets are sampled from the same distribution
No parametric model...

SLIDE 38

Test statistic

Choose a test statistic T and evaluate the difference
$$ T_{\text{diff}}(x) := T(x_A) - T(x_B) $$
Test: R := {t | t ≥ η}
Problem: How do we determine the significance level or p value?

SLIDE 39

Main insight: Exchangeability under permutations

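Under H0 the distribution of Tdiff(X) does not change if we permute the labels
The joint distributions of X1, X2, . . . , Xn and of any permutation, such as X24, Xn, . . . , X3, are the same
The values of Tdiff after permuting, tdiff,1, . . . , tdiff,n!, are uniformly distributed
$$ P\left( T_{\text{diff}}(X) \ge \eta \right) = \frac{1}{n!} \sum_{i=1}^{n!} 1_{t_{\text{diff},i} \ge \eta} $$
This is the size of the test!
$$ p = P\left( T_{\text{diff}}(X) \ge T_{\text{diff}}(x) \right) = \frac{1}{n!} \sum_{i=1}^{n!} 1_{t_{\text{diff},i} \ge T_{\text{diff}}(x)} $$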
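For very small datasets the exact p value can be computed by enumeration. The sketch below uses the difference of empirical means as the statistic and loops over every way of assigning the label A to points of the pooled data. Each such split corresponds to the same number of permutations, so the resulting fraction equals the average over all n! permutations. The function name and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(x_a, x_b):
    """Exact p value for T_diff = mean(A) - mean(B) over all relabelings."""
    pooled = list(x_a) + list(x_b)
    n_a = len(x_a)
    observed = mean(x_a) - mean(x_b)
    count = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        in_a = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in in_a]
        total += 1
        if mean(group_a) - mean(group_b) >= observed:
            count += 1
    return count / total


# Toy data (made up): only 1 of the 20 possible splits is as extreme.
print(exact_permutation_p_value([4, 5, 6], [1, 2, 3]))  # prints 0.05
```

The number of splits grows very quickly with the sample size, which is why larger studies rely on sampling permutations instead.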

SLIDE 40

Permutation test

1. Choose a conjecture as to how xA and xB are different
2. Choose a test statistic Tdiff
3. Compute Tdiff (x)
4. Permute the labels m times and compute the corresponding values of Tdiff: tdiff,1, tdiff,2, . . . , tdiff,m
5. Compute the approximate p value
$$ p = P\left( T_{\text{diff}}(X) \ge T_{\text{diff}}(x) \right) \approx \frac{1}{m} \sum_{i=1}^{m} 1_{t_{\text{diff},i} \ge T_{\text{diff}}(x)} $$
and reject the null hypothesis if it is below a predefined limit (1%, 5%)
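The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the studies in these slides: the statistic is the difference of empirical means, the data are synthetic, and the function name, sample sizes and seed are all assumptions.

```python
import random


def permutation_p_value(x_a, x_b, m=10_000, seed=0):
    """Approximate p value from m random relabelings of the pooled data."""
    rng = random.Random(seed)
    pooled = list(x_a) + list(x_b)
    n_a = len(x_a)
    observed = sum(x_a) / len(x_a) - sum(x_b) / len(x_b)
    hits = 0
    for _ in range(m):
        rng.shuffle(pooled)  # permute the labels
        group_a, group_b = pooled[:n_a], pooled[n_a:]
        t_diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
        if t_diff >= observed:
            hits += 1
    return hits / m


# Synthetic example data (made up for illustration only).
data_rng = random.Random(42)
group_a = [data_rng.gauss(5.0, 2.0) for _ in range(30)]
group_b = [data_rng.gauss(4.0, 2.0) for _ in range(40)]
print(f"approximate p value: {permutation_p_value(group_a, group_b, m=2000):.4f}")
```

Because the permutations are sampled at random, repeated runs with different seeds give slightly different p values.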

SLIDE 41

Cholesterol levels

1. Study with 86 men and 182 women
2. Conjecture: men have higher cholesterol than women
3. Test statistic: empirical mean of cholesterol level
4. 261.3 mg/dl amongst men and 242.0 mg/dl amongst women
5. Null hypothesis: No difference, permuting data yields same distribution
6. We sample 10^6 permutations to compute an approximate p value

SLIDE 42

Cholesterol levels

[Figure: histograms of cholesterol levels for men and women]

SLIDE 43

p value = 0.119%

Approximate distribution under the null hypothesis of the difference between the empirical means in men and women

[Figure: histogram of the permuted differences, with the observed difference of 19.22 marked on the horizontal axis]

SLIDE 44

p value = 0.112%

Approximate distribution under the null hypothesis of the difference between the empirical means in men and women

[Figure: histogram of the permuted differences, with the observed difference of 19.22 marked on the horizontal axis]

SLIDE 45

p value = 0.115%

Approximate distribution under the null hypothesis of the difference between the empirical means in men and women

[Figure: histogram of the permuted differences, with the observed difference of 19.22 marked on the horizontal axis]

SLIDE 46

Blood pressure

1. Study with 86 men and 182 women
2. Conjecture: men have higher blood pressure than women
3. Test statistic: empirical mean of blood pressure
4. 133.2 mmHg amongst men and 130.6 mmHg amongst women
5. Null hypothesis: No difference, permuting data yields same distribution
6. We sample 10^6 permutations to compute an approximate p value

SLIDE 47

Blood pressure

[Figure: histograms of blood pressure for men and women]

SLIDE 48

p value = 13.48%

Approximate distribution under the null hypothesis of the difference between the empirical means in men and women

[Figure: histogram of the permuted differences, with the observed difference of 2.6 marked on the horizontal axis]

SLIDE 49

p value = 13.56%

Approximate distribution under the null hypothesis of the difference between the empirical means in men and women

[Figure: histogram of the permuted differences, with the observed difference of 2.6 marked on the horizontal axis]

SLIDE 50

p value = 13.50%

Approximate distribution under the null hypothesis of the difference between the empirical means in men and women

[Figure: histogram of the permuted differences, with the observed difference of 2.6 marked on the horizontal axis]

SLIDE 51

The hypothesis testing framework
Parametric testing
Nonparametric testing
Multiple testing

SLIDE 53

Multiple testing

Often, we perform many simultaneous hypothesis tests
Computational genomics: many genes could be relevant
For n independent tests of size α
$$ \begin{aligned} P(\text{at least one false positive}) &= 1 - P(\text{no false positives}) \\ &= 1 - (1 - \alpha)^n \end{aligned} $$
For α = 1% and n = 500 genes, P (at least one false positive) = 0.99!
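The figure for 500 genes follows directly from the formula. Here is a minimal sketch, with a hypothetical function name, that also shows how quickly the probability grows with the number of tests:

```python
def prob_at_least_one_false_positive(alpha: float, n: int) -> float:
    """1 - (1 - alpha)^n for n independent tests of size alpha."""
    return 1 - (1 - alpha) ** n


for n in (1, 10, 100, 500):
    p = prob_at_least_one_false_positive(0.01, n)
    print(f"n = {n:3d}   P(at least one false positive) = {p:.3f}")
```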

SLIDE 54

Bonferroni’s method

Given n hypothesis tests, compute the corresponding p values p1, . . . , pn
For a fixed significance level α, reject the ith null hypothesis if pi ≤ α/n
The probability of making any Type I error is bounded by α
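As a small illustration, the sketch below applies the Bonferroni rule, rejecting the ith null hypothesis when pi ≤ α/n, to a list of p values. The function name and the example p values are made up.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if that null hypothesis is rejected."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


# Hypothetical p values from four tests; the per-test limit is 0.05 / 4 = 0.0125.
print(bonferroni_reject([0.001, 0.02, 0.04, 0.3]))  # prints [True, False, False, False]
```

Note that the second and third tests would be rejected at the 5% level on their own, but not after the correction.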

SLIDE 58

Bonferroni’s method

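Union bound: For any events S1, . . . , Sn
$$ P\left( \cup_{i=1}^{n} S_i \right) \le \sum_{i=1}^{n} P(S_i) $$
Applied to the tests:
$$ \begin{aligned} P(\text{Type I error}) &= P\left( \cup_{i=1}^{n} \text{Type I error for test } i \right) \\ &\le \sum_{i=1}^{n} P(\text{Type I error for test } i) \\ &\le n \cdot \frac{\alpha}{n} = \alpha \end{aligned} $$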

SLIDE 59

Example: Clutch

Conjecture: Among 10 NBA players, some are more effective in the 4th quarter
Null hypothesis: None is more effective in the 4th quarter
Test statistic: Number of games out of 20 in which the player scores more in the 4th quarter
What threshold do we need to ensure a significance level of 5%?

SLIDE 60

Distribution under null hypothesis

η             1      2      3      4      5
P (T0 ≥ η)    1.000  1.000  1.000  0.999  0.994

η             6      7      8      9      10
P (T0 ≥ η)    0.979  0.942  0.868  0.748  0.588

η             11     12     13     14     15
P (T0 ≥ η)    0.412  0.252  0.132  0.058  0.021

η             16     17     18     19     20
P (T0 ≥ η)    0.006  0.001  0.000  0.000  0.000
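One way to read the answer off this table is to apply Bonferroni's method: with 10 players and an overall level of 5%, each individual test must have size at most 0.5%. The sketch below assumes that this is the intended approach and that every player's statistic is Binomial(20, 1/2) under the null hypothesis; the helper names are hypothetical.

```python
from math import comb


def tail(eta: int, n: int = 20) -> float:
    """P(T0 >= eta) for T0 ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(eta, n + 1)) / 2**n


def bonferroni_threshold(alpha: float, n_tests: int, n_games: int = 20) -> int:
    """Smallest eta whose per-test size is at most alpha / n_tests."""
    per_test_level = alpha / n_tests
    for eta in range(n_games + 1):
        if tail(eta, n_games) <= per_test_level:
            return eta
    return n_games + 1  # no threshold reaches the level: never reject


print(bonferroni_threshold(0.05, n_tests=10))  # prints 17
```

Under these assumptions a player needs at least 17 games out of 20, compared with 15 when a single player is tested at the same level.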