Multiple Comparisons Methods in Genetic Epidemiology Studies (PowerPoint presentation)

SLIDE 1

Multiple Comparisons Methods in Genetic Epidemiology Studies

Yi Ren Wang, MPH
Department of Epidemiology
UCLA School of Public Health

SLIDE 2

Genetic Epidemiology Today

Genetic association studies have become more ambitious:

    Early studies focused on one or a few candidate SNPs
    Recent studies target many SNPs and haplotypes using high-throughput platforms

SLIDE 3

Genome-wide Association Study

Large number of genetic variations involved

    1 test for each of 500,000 SNPs
    25,000 expected to be significant at p<0.05, by chance alone

To make things worse

    Dominance (additive/dominant/recessive)
    Epistasis (multiple combinations of SNPs)
    Multiple phenotype definitions
    Subgroup analyses
    Multiple analytic methods

SLIDE 4

Motivating Example

DNA-DSBR Pathway and Lung & UADT Cancer Study

SLIDE 5

Goal of the study

This study intends to cover the genetic variations on the whole DNA-DSBR pathway, in order to systematically reveal a full picture of how genetic polymorphisms in the double-strand break repair pathway alter the risks of lung cancer and UADT cancer
The potential gene-gene and gene-environment interactions will be explored

SLIDE 6

Study Design

Population-based case-control study in Los Angeles
611 new cases of lung cancer
601 new cases of UADT cancer
1040 cancer-free controls matched to cases by age (within 10-year categories) and gender

SLIDE 7

Gene Selection

19 genes involved in the DNA-DSBR pathway were selected for evaluation based on evidence for their role in either the homologous recombination repair (HR) or the non-homologous end joining (NHEJ) pathways.

SLIDE 8

SNPs Selection

Known functional SNPs within the DNA double-strand break repair pathway were selected
As well as potential functional SNPs such as amino-acid-changing (nonsynonymous) SNPs (nsSNPs)
With a minor allele frequency (MAF) greater than 5%

SLIDE 9

SNPs Selection

189 SNPs analyzed are in or near one of 19 DNA-DSBR genes.

SLIDE 10

Study Design

SAS 9.1 software will be used for data analysis.
ORs and 95% CLs will be computed using unconditional logistic regression
Potential confounding factors adjusted:
    age, gender, ethnicity, educational level and tobacco smoking for lung cancer
    age, gender, ethnicity, educational level, tobacco smoking, alcohol drinking and diet for UADT cancer
χ2 test is performed to evaluate Hardy-Weinberg equilibrium.
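The Hardy-Weinberg check mentioned above can be sketched as follows. This is a minimal illustration in Python rather than the SAS code used in the study; the function name `hwe_chi_square` and the genotype counts are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic SNP.

    Takes genotype counts (AA, AB, BB) and returns (chi2, p_value) with 1 df.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical genotype counts, for illustration only
chi2, p_value = hwe_chi_square(298, 489, 213)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

With 189 SNPs, the equilibrium test is itself repeated many times, so its p-values are subject to the same multiple-comparisons concern as the association tests.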

SLIDE 11

Stratified Analyses

Lung Cancer:
    Non-small cell lung carcinoma (NSCLC)
    Small cell lung carcinoma (SCLC)
Head and Neck Cancer:
    Oral cancer
    Pharyngeal cancer
    Laryngeal cancer
    Esophageal cancer

SLIDE 12

Stratified and Multivariate Analyses

Interaction between DSBR and smoking for lung cancer
Interaction between DSBR and smoking for UADT cancer
Interaction between DSBR and alcohol drinking for UADT cancer
Haplotype analysis

SLIDE 13

What are the Genetic Epidemiology Issues?

Population stratification
    Variation of SNP frequency by ethnicity
    Genomic control parameter will be calculated to assess the validity of the results

High dimensional data

Gene-environment interactions
    Interaction of host genetics with environment

Gene-gene interactions
    Interaction of different SNPs

Multiple comparisons

SLIDE 14

Multiple comparisons issue

SLIDE 15

Hypothesis Testing

H0 : Null hypothesis vs. H1 : Alternative hypothesis
T : test statistic
C : critical value
If |T| > C, H0 is rejected. Otherwise H0 is retained

Ex) H0 : μ1 = μ2 vs. H1 : μ1 ≠ μ2
    T = (x̄1 - x̄2) / pooled SE
    If |T| > z(1-α/2), H0 is rejected at the significance level α (critical value Cα = z(1-α/2))
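The two-sample example can be worked through in a few lines. This is a sketch under assumptions: a known pooled standard deviation, a normal reference distribution, and made-up group summaries. The function name `two_sample_test` is hypothetical.

```python
import math
from statistics import NormalDist

def two_sample_test(mean1, mean2, sd_pooled, n1, n2, alpha=0.05):
    """Two-sided test of H0: mu1 = mu2 using T = (mean1 - mean2) / pooled SE."""
    se = sd_pooled * math.sqrt(1.0 / n1 + 1.0 / n2)      # pooled standard error
    t = (mean1 - mean2) / se
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z(1 - alpha/2)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))     # two-sided p-value
    reject = abs(t) > critical
    return t, critical, reject, p_value

# Hypothetical summaries: means 11.0 and 10.0, pooled SD 2.0, 50 subjects per group
t, critical, reject, p_value = two_sample_test(11.0, 10.0, 2.0, 50, 50)
print(f"T = {t:.2f}, critical value = {critical:.3f}, reject H0: {reject}, p = {p_value:.4f}")
```

Rejecting when |T| exceeds the critical value is equivalent to rejecting when the p-value falls below α.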

SLIDE 16

Hypothesis Testing

                    Hypothesis Result
                    Retained          Rejected
Truth    H0                           Type I error
         H1         Type II error

Type I error rate = false positives (α : significance level)
Type II error rate = false negatives
Power : 1 - Type II error rate

P-values : p = inf{ α | H0 is rejected at the significance level α }

SLIDE 17

Issues in Multiple Comparison

Q : Given n treatments, which two treatments are significantly different ? (simultaneous testing)
cf) Is treatment A different from treatment B ?

Ex) n treatment means : μ1,…,μn
    Hij : μi = μj where i≠j
    Tij = (x̄i - x̄j) / pooled SE

Type I error when testing n independent comparisons, each at the 0.05 significance level, one by one : 1 - (0.95)^n
Inflated Type I error, ex) α = 1 - (0.95)^10 = 0.401263
Remedies : Bonferroni Method

    Per-comparison Type I error rate = α / # of comparisons
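A short numerical check of the Bonferroni remedy, assuming the 10 comparisons are independent and all null hypotheses are true. The helper name `fwer_independent` is made up for this sketch.

```python
def fwer_independent(n_tests, alpha_per_test):
    """P(at least one Type I error) across n independent true-null tests."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests

n = 10
alpha = 0.05
unadjusted = fwer_independent(n, alpha)        # every comparison tested at 0.05
bonferroni = fwer_independent(n, alpha / n)    # every comparison tested at 0.05 / 10
print(f"unadjusted overall Type I error = {unadjusted:.6f}")
print(f"Bonferroni overall Type I error = {bonferroni:.6f}")
```

Testing each comparison at α / n keeps the overall Type I error just below the nominal 0.05, at the cost of power for each individual comparison.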

SLIDE 18

Multiple Comparisons

Probability of finding a false association by chance = 1 - 0.95^n

    n = 10, p = 40%
    n = 100, p = 99.4%

Our data:

    189 genotypes, 2 cancer sites, 10 subgroup analyses
    N = 189 × 12 = 2268, p > 99.99999%
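The figures on this slide can be reproduced directly. The sketch assumes independent tests and reads N = 2268 as 189 SNPs times 12 analyses (2 cancer sites plus 10 subgroup analyses), which is an interpretation of the slide rather than something it states.

```python
import math

def prob_any_false_positive(n_tests, alpha=0.05):
    """1 - (1 - alpha)^n, computed with log1p/expm1 to stay accurate for large n."""
    return -math.expm1(n_tests * math.log1p(-alpha))

n_snps = 189
n_analyses = 2 + 10            # assumed reading: 2 cancer sites + 10 subgroup analyses
n_tests = n_snps * n_analyses

for n in (10, 100, n_tests):
    print(f"n = {n:5d}: P(at least one false association) = {prob_any_false_positive(n):.7f}")
```

Real SNP tests are correlated through linkage disequilibrium, so the independence assumption overstates the effective number of tests; the qualitative conclusion is unchanged.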

SLIDE 19

Type I Error Rates

                    Hypothesis Result
                    #retained    #rejected    Total
Truth    H0         U            V            m0
         H1         T            S            m1
Total               m-R          R            m

Per-comparison error rate (PCER) = E(V) / m
Per-family error rate (PFER) = E(V)
Family-wise error rate (FWER) = Pr(V ≥ 1)
False discovery rate (FDR) = E(Q), where Q = V/R if R > 0, and Q = 0 if R = 0
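To make the table concrete, the cell counts and the ratio Q can be computed for one set of test results when the truth is known, as in a simulation. The data and function names below are invented for illustration.

```python
def confusion_counts(is_null, rejected):
    """Return (U, V, T, S) from paired truth flags and rejection decisions."""
    pairs = list(zip(is_null, rejected))
    u = sum(1 for h0, r in pairs if h0 and not r)       # true nulls retained
    v = sum(1 for h0, r in pairs if h0 and r)           # true nulls rejected (false positives)
    t = sum(1 for h0, r in pairs if not h0 and not r)   # false nulls retained (false negatives)
    s = sum(1 for h0, r in pairs if not h0 and r)       # false nulls rejected (true positives)
    return u, v, t, s

def false_discovery_proportion(v, s):
    """Q = V / R if R > 0, and 0 if R = 0."""
    r = v + s
    return v / r if r > 0 else 0.0

# Hypothetical outcome: 8 true nulls and 2 real associations
is_null = [True] * 8 + [False] * 2
rejected = [True] + [False] * 7 + [True, True]

u, v, t, s = confusion_counts(is_null, rejected)
q = false_discovery_proportion(v, s)
print(f"U={u} V={v} T={t} S={s} Q={q:.3f}")
```

Q here is a single realization. PCER, PFER, FWER and FDR are expectations or probabilities over repeated experiments, so estimating them would require averaging over many simulated data sets.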

SLIDE 20

False Positives

In the absence of bias, three factors determine the probability that a statistically significant finding is actually a false-positive finding
    the magnitude of the P value
    statistical power
    fraction of tested hypotheses that is true

SLIDE 21

Multiple Comparisons

There is a lack of consensus regarding the optimal approach to address the false-positive probability of single nucleotide polymorphism (SNP) associations.

SLIDE 22

Methods for Multiple Comparisons

Ignore it
Adjust p-values
    Familywise Error Rate (FWER)
        Chance of any false positives
    False discovery rate (FDR) Benjamini et al 2001
Use Bayesian methods
    False positive report probability (FPRP) Wacholder et al 2004

SLIDE 23

FWER controlling procedures

Bonferroni
    adj Pvalue = min(n*Pvalue, 1)
Holm (1979)
Hochberg (1988)
Westfall & Young (1993) maxT and minP

SLIDE 24

Bonferroni correction

For testing 500,000 SNPs

    5,000 expected to be significant at p<0.01
    500 expected to be significant at p<0.001
    ……
    0.05 expected to be significant at p<0.0000001

Suggests setting significance level to α = 10^-7
Bonferroni correction for m tests
    set significance level for p-values to α = 0.05 / m
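The arithmetic behind these counts is simply m × threshold under the complete null, and the Bonferroni level for 500,000 tests is the same 10^-7. A quick check (helper name chosen for this sketch):

```python
import math

def expected_false_positives(m, threshold):
    """Expected number of null tests with p below the threshold, out of m tests."""
    return m * threshold

m = 500_000
for threshold in (0.05, 0.01, 0.001, 1e-7):
    count = expected_false_positives(m, threshold)
    print(f"p < {threshold:g}: about {count:g} expected by chance")

bonferroni_alpha = 0.05 / m    # per-test significance level for family-wise alpha = 0.05
print(f"Bonferroni per-test level: {bonferroni_alpha:g}")
```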

SLIDE 25

Multiple Testing Procedures based on P-values that control the family-wise error rate

For a single hypothesis H1,
    p_1 = inf{ α | H1 is rejected at the significance level α }
    If p_1 < α, H1 is rejected. Otherwise H1 is retained

Adjusted p-values for multiple testing (p*)
    p_j* = inf{ α | Hj is rejected at FWER = α }
    If p_j* < α, Hj is rejected. Otherwise Hj is retained

Single-Step, Step-Down and Step-Up procedures

SLIDE 26

Single-Step Procedure

For a strong control of FWER,
    single-step Bonferroni adjusted p-values : p_j* = min( m p_j, 1 )
    single-step Sidak adjusted p-values : p_j* = 1 - (1 - p_j)^m

For a weak control of FWER,
    single-step minP adjusted p-values : p_j* = Pr( min_{1≤k≤m} P_k ≤ p_j | complete null )
    single-step maxT adjusted p-values : p_j* = Pr( max_{1≤k≤m} |T_k| ≥ C_j | complete null ), where C_j is the observed |t_j|

Under subset pivotal property, weak control = strong control
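The two closed-form single-step adjustments are easy to compute directly. A minimal sketch with made-up p-values (the minP and maxT versions need the joint null distribution and are not shown):

```python
def bonferroni_adjust(p_values):
    """Single-step Bonferroni: p_j* = min(m * p_j, 1)."""
    m = len(p_values)
    return [min(m * p, 1.0) for p in p_values]

def sidak_adjust(p_values):
    """Single-step Sidak: p_j* = 1 - (1 - p_j)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]

raw = [0.001, 0.01, 0.04, 0.5]     # hypothetical unadjusted p-values
for p, b, s in zip(raw, bonferroni_adjust(raw), sidak_adjust(raw)):
    print(f"raw = {p:<6} Bonferroni = {b:.4f}  Sidak = {s:.4f}")
```

The Sidak value is never larger than the Bonferroni value, so it is slightly less conservative; the difference is negligible for small p-values.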

SLIDE 27

Step-Down Procedure

Order the observed unadjusted p-values such that p_r1 ≤ p_r2 ≤ … ≤ p_rm
Accordingly, order the hypotheses H_r1, H_r2, …, H_rm

Holm's procedure
    j* = min{ j | p_rj > α / (m-j+1) }, reject H_rj for j = 1, …, j*-1

Adjusted step-down p-values (maximum taken over k = 1, …, j)
    Holm : p_rj* = max_k { min( (m-k+1) p_rk , 1 ) }
    Sidak : p_rj* = max_k { 1 - (1 - p_rk)^(m-k+1) }
    minP : p_rj* = max_k { Pr( min_{l ∈ {rk,…,rm}} P_l ≤ p_rk | complete null ) }
    maxT : p_rj* = max_k { Pr( max_{l ∈ {rk,…,rm}} |T_l| ≥ C_rk | complete null ) }
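Holm's adjusted p-values follow from the formula above with a running maximum over the ordered p-values. A minimal sketch with invented p-values:

```python
def holm_adjust(p_values):
    """Step-down Holm adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices r1, ..., rm
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):                    # rank = k - 1
        candidate = min((m - rank) * p_values[idx], 1.0)  # (m - k + 1) * p_rk
        running_max = max(running_max, candidate)         # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]    # hypothetical unadjusted p-values
print(holm_adjust(raw))
```

Rejecting every hypothesis whose adjusted p-value is at most α gives the same decisions as the j* rule stated on the slide.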

SLIDE 28

Step-Up Procedure

Order the observed unadjusted p-values such that p_r1 ≤ p_r2 ≤ … ≤ p_rm
Accordingly, order the hypotheses H_r1, H_r2, …, H_rm

Hochberg's procedure
    j* = max{ j | p_rj ≤ α / (m-j+1) }, reject H_rj for j = 1, …, j*

Adjusted step-up Hochberg's p-values (minimum taken over k = j, …, m)
    p_rj* = min_k { min( (m-k+1) p_rk , 1 ) }
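The step-up adjustment uses the same multipliers as Holm but takes a running minimum starting from the largest p-value. A minimal sketch with invented p-values:

```python
def hochberg_adjust(p_values):
    """Step-up Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices r1, ..., rm
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):                     # walk from largest p-value down
        idx = order[rank]
        candidate = min((m - rank) * p_values[idx], 1.0)  # (m - k + 1) * p_rk
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]    # hypothetical unadjusted p-values
print(hochberg_adjust(raw))
```

For the same input, the step-up values are never larger than the step-down Holm values, which is why the step-up procedure is more powerful; its FWER control relies on an additional assumption about the dependence between tests.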

SLIDE 29

Resampling Method

Bootstrap or permutation based method
For the bth permutation, b = 1, …, B, compute test statistics t_1,b, …, t_m,b

    p_j* = ∑_{b=1}^{B} I( |t_j,b| ≥ C_j ) / B

ex) Golub (1999)
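For a single test, the permutation p-value above can be sketched as follows. The statistic (difference in group means), the data, and the function names are assumptions made for illustration; C_j is taken to be the observed absolute statistic.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group1, group2, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)                       # fixed seed keeps the sketch reproducible
    observed = abs(mean(group1) - mean(group2))     # plays the role of C_j
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                         # b-th random relabelling
        t_b = mean(pooled[:n1]) - mean(pooled[n1:])
        if abs(t_b) >= observed:
            hits += 1
    return hits / n_perm                            # sum of indicators divided by B

# Hypothetical measurements in cases and controls
cases = [5.1, 6.3, 5.8, 6.9, 6.1, 5.5]
controls = [4.9, 5.2, 5.0, 5.6, 4.7, 5.3]
print(permutation_p_value(cases, controls))
```

Because labels are shuffled jointly for all SNPs in practice, the same permutations can also supply the joint null distribution needed by the minP and maxT adjustments.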

SLIDE 30

Resampling Method

Efron et al. (2000) and Tusher et al. (2001)

Compute a test statistic t_j for each gene j and define order statistics t_(j) such that t_(1) ≥ t_(2) ≥ … ≥ t_(m)
For each permutation b, b = 1, …, B, compute the test statistics and define the order statistics t_(1),b ≥ t_(2),b ≥ … ≥ t_(m),b
From the permutations, estimate the expected value (under the complete null) of the order statistics by t*_(j) = ∑_b t_(j),b / B
Form a Q-Q plot of the observed t_(j) vs. the expected t*_(j)
Efron et al.: for a fixed threshold Δ, genes with |t_(j) - t*_(j)| ≥ Δ are called significant
Tusher et al.: for a fixed threshold Δ, let j0 = max{ j : t_(j) - t*_(j) ≥ Δ, t*_(j) > 0 }
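The expected order statistics and the Efron-style rule can be sketched on simulated data. Everything here is hypothetical: the data, the choice of a difference in means as the statistic, and the function names. It is not the published SAM implementation.

```python
import random

def mean(values):
    return sum(values) / len(values)

def row_statistics(data, labels):
    """Difference in group means (label 1 minus label 0) for each row of data."""
    stats = []
    for row in data:
        g1 = [x for x, lab in zip(row, labels) if lab == 1]
        g0 = [x for x, lab in zip(row, labels) if lab == 0]
        stats.append(mean(g1) - mean(g0))
    return stats

def expected_order_statistics(data, labels, n_perm=200, seed=0):
    """Average, over label permutations, of the statistics sorted in decreasing order."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)                       # permute sample labels
        ordered = sorted(row_statistics(data, shuffled), reverse=True)
        totals = [tot + t for tot, t in zip(totals, ordered)]
    return [tot / n_perm for tot in totals]

def efron_calls(observed_sorted, expected_sorted, delta):
    """Ranks j (0-based) with |t_(j) - t*_(j)| >= delta."""
    return [j for j, (t, e) in enumerate(zip(observed_sorted, expected_sorted))
            if abs(t - e) >= delta]

# Simulated data: 20 features, 6 + 6 samples, the first feature shifted in group 1
rng = random.Random(1)
labels = [1] * 6 + [0] * 6
data = [[rng.gauss(0.0, 1.0) for _ in labels] for _ in range(20)]
data[0] = [x + 3.0 if lab == 1 else x for x, lab in zip(data[0], labels)]

observed = sorted(row_statistics(data, labels), reverse=True)
expected = expected_order_statistics(data, labels)
print(efron_calls(observed, expected, delta=1.5))
```

Plotting `observed` against `expected` gives the Q-Q plot described above; points far from the diagonal are the candidates for being called significant.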

SLIDE 31

Multiple Testing Correction Options

Without consideration of prior probability

Family-wise error rate (FWER)
    Very conservative and does not tolerate any false positives

False Discovery Rate (FDR)
    False positives as a percentage of called genes

No correction
    False positives as a percentage of genes being tested

SLIDE 32

The False Discovery Rate (FDR)

FDR is the expected ratio of erroneous rejections of the null hypothesis to the total number of rejected hypotheses among the SNPs analyzed in this report.

SLIDE 33

A Measure Attached to Each Individual Association: Q Value

Expected proportion of false positives incurred when calling that association significant.

SLIDE 34

Comparison of p-value and q-value

p-value : P(a null feature being as or more extreme than the observed one)
q-value : Expected proportion of false positives among all features as or more extreme than the observed one

SLIDE 35

Q-value Software

http://faculty.washington.edu/~jstorey/qvalue/

SLIDE 36

In DSBR Study

Bootstrap estimation method will be used to provide for each hypothesis test a q-value, which estimates the minimum FDR that can be attained when all tests with lower or equal p-values are called significant
This statistical procedure is appropriate to adjust for multiple testing in large scale association studies
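The "minimum FDR over all tests with lower or equal p-values" definition can be sketched as below. This is a simplified version, not the qvalue software: the proportion of true nulls `pi0` is passed in as a parameter and defaults to 1, which reduces to Benjamini-Hochberg adjusted p-values, whereas the bootstrap estimation of `pi0` mentioned above is not implemented here.

```python
def q_values(p_values, pi0=1.0):
    """q-value for each test: min over thresholds t >= p of pi0 * m * t / #{p_i <= t}."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):               # from largest p-value down
        idx = order[rank]
        candidate = pi0 * m * p_values[idx] / (rank + 1)   # estimated FDR at this cutoff
        running_min = min(running_min, candidate)
        q[idx] = running_min
    return q

raw = [0.01, 0.04, 0.03, 0.005]    # hypothetical unadjusted p-values
print(q_values(raw))
```

Calling every test with q at most 0.05 significant then targets an FDR of about 5% among the called associations. An estimate of `pi0` below 1 would scale all q-values down proportionally.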

SLIDE 37

The False-Positive Report Probability (FPRP)

FPRP is the probability of no true association between a genetic variant and disease given a statistically significant finding

SLIDE 38

Determinants of FPRP

1) prior probability of a true association
2) observed P value
3) statistical power to detect the odds ratio of the alternative hypothesis at the given α level or P value

SLIDE 39

In DSBR Study

FPRP will be applied on a range of prior probabilities (i.e. 0.01 to 0.25)
An FPRP criterion of 0.2 will be used to identify which, if any, findings should be considered noteworthy
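The three determinants combine through Bayes' rule as FPRP = α(1 - π) / (α(1 - π) + (1 - β)π), with π the prior probability, α the observed P value and 1 - β the power (Wacholder et al 2004). A minimal sketch follows; the P value of 0.01 and the power of 0.8 are invented inputs, while the prior range and the 0.2 criterion come from the slides.

```python
def fprp(alpha, power, prior):
    """False-positive report probability for a finding significant at level alpha."""
    false_positive_mass = alpha * (1.0 - prior)   # null is true and the test rejects
    true_positive_mass = power * prior            # association is real and the test rejects
    return false_positive_mass / (false_positive_mass + true_positive_mass)

observed_p = 0.01      # hypothetical observed P value
power = 0.8            # hypothetical power to detect the assumed odds ratio
criterion = 0.2        # noteworthiness criterion from the study plan

for prior in (0.01, 0.05, 0.10, 0.25):
    value = fprp(observed_p, power, prior)
    label = "noteworthy" if value < criterion else "not noteworthy"
    print(f"prior = {prior:.2f}: FPRP = {value:.3f} ({label})")
```

The same P value can be noteworthy under a prior of 0.25 and not under a prior of 0.01, which is why the result is reported over a range of priors.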
