Statistical significance for untangling complex genotype- phenotype - PowerPoint PPT Presentation

Statistical significance for untangling complex genotype-phenotype connections
Jun Sese, sese.jun@aist.go.jp, AIST, http://seselab.org/

Higher-order analyses of genome-wide data are incompatible with p-values
• Combinatorial effects
• Network analysis
Examples: epistatic effects (Carlborg O, and Haley CS. 2004. Nature Reviews Genetics); transcription factors OCT3/4, SOX2, KLF4, and C-MYC inducing iPS cells (Takahashi and Yamanaka. 2006, Cell).
[Figure: network with nodes s1 to s5]
Few combinations have been found from genome-wide data. Why?
• High computational cost? Yes, but recent supercomputers may be able to find small combinations, and still few results have been found.
• Statistical models are not suitable for the problem? Probably yes. Traditional approximations are too simple to analyze combinations.
• The statistical procedure has a problem. This work tries to solve it.

Higher-order analyses of genome-wide data are incompatible with p-values (continued)
Few combinations have been found from genome-wide data. Why?
• Existing multiple testing corrections are too conservative to find the combinations.
• We developed a multiple testing correction method that finds statistically significant combinations.

Contents
• Multiple Testing and Correction
• LAMP: multiple testing correction for combination discovery
• Tarone’s method: modify Bonferroni correction
• Key algorithm for LAMP
• Application to combinatorial TF discovery
• Derivative software
• Summary

Active motif discovery
• Think about the association between motifs and gene expression.
• To simplify the explanation, gene expression is categorized as high or low.
Contingency table (8 genes: 3 High, 5 Low)
                 High   Low   Total
Motif present      3     0      3
Motif absent       0     5      5
Total              3     5      8
Fisher’s exact test: p = 0.018 < 0.05 → significant?

Active motif discovery (continued)
Fisher’s exact test: p = 0.018 < 0.05 → significant?
No! We still need multiple testing correction.
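The p-value on this slide can be reproduced with a short sketch. It assumes a one-sided Fisher's exact test on the 2x2 table with cells 3, 0, 0, 5; the function name `fisher_one_sided` is hypothetical and only the Python standard library is used.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(top-left cell >= a) with all margins fixed,
    i.e. the upper tail of the hypergeometric distribution.
    """
    row1 = a + b          # size of the first row
    col1 = a + c          # size of the first column
    n = a + b + c + d     # total number of genes
    total = comb(n, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / total

# Table from the slide: 3 High genes with the motif, 5 Low genes without it.
p = fisher_one_sided(3, 0, 0, 5)
print(round(p, 3))  # 0.018
```

Only one table is as extreme as the observed one, so the p-value is 1 / C(8, 3) = 1/56.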

Multiple testing correction
• Single test: significance level α = 5%, false discovery ≦ 5%
• Ten tests, each at 5%: false discovery 40%
• Ten tests with multiple testing correction, each at 0.5%: false discovery 4.9%
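The 40% and 4.9% figures are consistent with the following sketch, which assumes the ten tests are independent and that every null hypothesis is true. The function name `fwer_independent` is hypothetical.

```python
def fwer_independent(alpha, n_tests):
    """Probability of at least one false positive among n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(round(fwer_independent(0.05, 1), 3))    # 0.05
print(round(fwer_independent(0.05, 10), 3))   # 0.401
print(round(fwer_independent(0.005, 10), 3))  # 0.049
```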

Bonferroni Correction
• Adjusted p-value = (number of tests) × (raw p-value)
• Equivalently, set the corrected significance level δ to α / N
• Controls the family-wise error rate (FWER): the probability that at least one false positive occurs
δ: corrected significance level, N: # of tests
$$\mathrm{FWER} = \Pr\Bigl(\bigcup_{i \in \{1,\dots,N\}} \{p_i \le \delta\}\Bigr) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta$$
[Figure: a family-wise error occurs unless p > δ for all treatments]

Bonferroni Correction (continued)
The upper bound Nδ on the FWER should not exceed α: Nδ ≤ α, hence δ ≤ α / N.

Bonferroni Correction (continued)
• Taking combinations of items (A, B, C, D → AB, AC, AD, BC, BD, CD, ...) makes the correction factor N larger.
• With δ ≤ α / N, detection of a functional complex of genes is extremely unlikely.
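A minimal sketch of why combinations inflate the correction factor. It assumes every combination up to a chosen arity is tested once; the names `bonferroni_adjust` and `n_combinations` are hypothetical.

```python
from math import comb

def bonferroni_adjust(p_values):
    """Adjusted p-value = (number of tests) * (raw p-value).
    Values above 1 are left uncapped and simply mean 'not significant'."""
    n = len(p_values)
    return [n * p for p in p_values]

def n_combinations(n_items, max_arity):
    """Number of non-empty combinations of at most max_arity items."""
    return sum(comb(n_items, k) for k in range(1, max_arity + 1))

# Four items A, B, C, D: 4 singles + 6 pairs = 10 tests.
n_tests = n_combinations(4, 2)
print(n_tests)  # 10

# A raw p-value of 0.018 no longer passes alpha = 0.05 after correction.
print(0.05 / n_tests)  # corrected significance level delta
print(0.018 * n_tests > 0.05)  # True
```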

Two problems to discover the combinations statistically
• Avoiding conservative multiple testing correction
  • But FWER should be kept below α
  • We introduce Tarone’s method [Tarone, Biometrics, 1990]
• Fast enumeration of all possible combinations/subgraphs
  • Counting the Bonferroni factor efficiently
  • We use a frequent pattern mining method for combinations and an efficient graph enumeration technique for subgraphs
  • Both are combined with Tarone’s method

Our Proposal: Limitless Arity Multiple testing Procedure (LAMP) [PNAS 2013]
• Can enumerate statistically significant combinations
• Techniques
  • Count the exact number of “testable” combinations
    • Infrequent combinations do not affect FWER
  • Stepwise procedure with frequent itemset mining
    • Calibrate the correction factor to the smallest possible value
• Discovered statistically significant motif combinations in yeast and breast cancer expression data

Bonferroni inequality
$$\Pr\Bigl(\bigcup_{i \in \{1,\dots,N\}} \{p_i \le \delta\}\Bigr) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta$$
Bonferroni factor N = # of tests.
• Testable: tests that can produce false positives. These must be counted in the Bonferroni factor.
• Untestable: tests that cannot produce false positives, $\Pr(p_i \le \delta) = 0$. These can be safely removed from the Bonferroni factor.
Tarone’s method: count only the testable tests in the Bonferroni factor.
• Bonferroni: $\delta N \le \alpha$, i.e. $\delta \le \alpha / N$
• Tarone: $\delta \cdot |\{ i \mid \Pr(p_i \le \delta) > 0 \}| \le \alpha$. Check all possible thresholds and select the largest δ.

Infrequent combinations never cause a significant result.
Consider this contingency table, where x is the number of genes that have the combination and n_u is the number of High genes:
                 With combination   Without combination   Total
High                    ?                    ?             n_u
Low                     ?                    ?             N − n_u
Total                   x                  N − x            N
The minimum p-value of Fisher’s exact test can be calculated as
$$f(x) = \binom{n_u}{x} \Big/ \binom{N}{x}$$
f(x) depends only on x, and f(x) decreases as x increases. With this f(x), the testable combinations can be described as $\{ i \mid f(x_i) \le \delta \}$.
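The minimum p-value can be sketched directly from the formula. This sketch assumes x ≤ n_u, the range in which the formula on the slide applies; the function name `min_p_value` is hypothetical, and the numbers reuse the 8-gene example (N = 8, n_u = 3).

```python
from math import comb

def min_p_value(x, n_u, n_total):
    """f(x) = C(n_u, x) / C(N, x): the smallest p-value Fisher's exact
    test can return for a combination that occurs in x genes.
    Assumption: only defined here for 1 <= x <= n_u."""
    if not 1 <= x <= n_u:
        raise ValueError("this sketch only covers 1 <= x <= n_u")
    return comb(n_u, x) / comb(n_total, x)

for x in (1, 2, 3):
    print(x, round(min_p_value(x, n_u=3, n_total=8), 4))
# 1 0.375
# 2 0.1071
# 3 0.0179
```

A combination seen in only one or two of these genes can never reach p ≤ 0.05, so it is untestable at that level.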

Tarone correction with frequency
$$\mathrm{FWER} = \Pr\Bigl(\bigcup_{i \in \{1,\dots,N\}} \{p_i \le \delta\}\Bigr) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) = \sum_{\{ i \mid f(x_i) \le \delta \}} \Pr(p_i \le \delta) \le |\{ i \mid f(x_i) \le \delta \}| \cdot \delta$$
Take the maximum δ that keeps the FWER bound below α.
$$g(x) = |\{ i \mid f(x_i) \le \delta \}| \cdot \delta$$
[Figure: g(x) examined at x_i = N, N−1, N−2, ...; the appropriate x gives the corrected significance threshold]
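A minimal sketch of the threshold search, using the integer-k form commonly attributed to Tarone (1990): find the smallest k such that at most k tests have a minimum p-value ≤ α/k, then use δ = α/k. This is not the stepwise LAMP procedure itself; the function names and the toy frequencies are assumptions for illustration.

```python
from math import comb

def min_p_value(x, n_u, n_total):
    """f(x) = C(n_u, x) / C(N, x), assuming 1 <= x <= n_u."""
    return comb(n_u, x) / comb(n_total, x)

def tarone_factor(min_p_values, alpha=0.05):
    """Return (k, delta) with delta = alpha / k, where k is the smallest
    integer such that the number of testable tests, m(k) = |{i : f_i <= alpha/k}|,
    is at most k. Then FWER <= m(k) * delta <= alpha."""
    k = 1
    while True:
        m_k = sum(1 for f in min_p_values if f <= alpha / k)
        if m_k <= k:
            return k, alpha / k
        k += 1

# Hypothetical frequencies x_i of six combinations among N = 8 genes, n_u = 3.
frequencies = [1, 1, 2, 2, 3, 3]
f_values = [min_p_value(x, n_u=3, n_total=8) for x in frequencies]

k, delta = tarone_factor(f_values, alpha=0.05)
print(k, delta)          # 2 0.025
print(len(frequencies))  # 6, the plain Bonferroni factor
```

Only the two combinations with x = 3 are testable, so the correction factor drops from 6 to 2.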

Frequent Pattern Mining
[Figure: lattice of itemsets explored stepwise; labels on the slide: frequency threshold x, count m_x, minimum p-value f(x), and the update x = x − 1]

An Example of Combinatorial Gene Regulation in Yeast
• 102 motifs (ChIP-chip: Harbison et al.)
• 5,935 genes, each labeled High or Low
• Heat shock condition (expression: Gasch et al.)

An Example of Combinatorial Gene Regulation in Yeast
Corrected p-values under heat shock condition (significant values were shown in red on the slide)
Motif combination          LAMP (arity ≦ 102), K = 303   Bonferroni (arity ≦ 4), K = 4,426,528
HSF1                       4.41E-24                      6.44E-20
MSN2                       3.73E-11                      5.45E-07
MSN4                       0.000532                      >1
SKO1                       0.00839                       >1
SNT2                       0.0192                        >1
PHD1, SUT1, SOK2, SKN7     0.0272                        >1
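The two correction factors can be related with a small check. It assumes the Bonferroni factor counts every combination of at most 4 of the 102 motifs, and that each corrected p-value is K times the raw p-value, as on the Bonferroni slide. The raw p-values are back-computed from the rounded table entries, so the result is approximate.

```python
from math import comb

# Bonferroni factor: all combinations of 1 to 4 motifs out of 102.
k_bonferroni = sum(comb(102, arity) for arity in range(1, 5))
print(k_bonferroni)  # 4426528

k_lamp = 303  # number of testable combinations reported for LAMP

# Convert LAMP-corrected p-values from the table into Bonferroni-corrected ones.
lamp_corrected = {"HSF1": 4.41e-24, "MSN2": 3.73e-11, "MSN4": 0.000532}
for motif, p_lamp in lamp_corrected.items():
    raw_p = p_lamp / k_lamp
    p_bonferroni = raw_p * k_bonferroni
    print(motif, f"{p_bonferroni:.2e}")
# HSF1 6.44e-20
# MSN2 5.45e-07
# MSN4 comes out above 1, matching the ">1" entry
```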

[Figure: genes ranked by expression from Up to Down, with motif occurrences and corrected p-values. Labels on the slide: single motifs PHD1, SUT1, SOK2, SKN7 with corrected p-values > 1; the combination PHD1, SUT1, SOK2, SKN7 with corrected p-value 0.0272.]
