Statistical significance for untangling complex genotype-phenotype connections - PowerPoint PPT Presentation



slide-1
SLIDE 1

Statistical significance for untangling complex genotype-phenotype connections

Jun Sese

sese.jun@aist.go.jp AIST http://seselab.org/

slide-2
SLIDE 2

Higher-order analyses of genome-wide data are incompatible with p-values

  • Combinatorial effects
  • Transcription factors: OCT3/4, SOX2, KLF4, C-MYC → iPS cells (Takahashi and Yamanaka. 2006, Cell)
  • Epistatic effects (Carlborg O, and Haley CS. 2004. Nature Reviews Genetics)
  • Network analysis (nodes s1 to s5)

Few combinations have been found from genome-wide data. Why?

  • Is the computational cost too high? Yes, but recent supercomputers should be able to find at least small combinations, and still few results have been found.
  • Are statistical models unsuitable for the problem? Probably yes. Traditional approximations are too simple to analyze combinations, and the statistical procedure itself has a problem. This work tries to solve that problem.

slide-3
SLIDE 3

Higher-order analyses of genome-wide data are incompatible with p-values

  • Combinatorial effects
  • Transcription factors: OCT3/4, SOX2, KLF4, C-MYC → iPS cells (Takahashi and Yamanaka. 2006, Cell)
  • Epistatic effects (Carlborg O, and Haley CS. 2004. Nature Reviews Genetics)
  • Network analysis (nodes s1 to s5)

Few combinations have been found from genome-wide data. Why? Existing multiple testing corrections are too conservative to find the combinations. We developed a multiple testing correction method that finds statistically significant combinations.

slide-4
SLIDE 4

Contents

  • Multiple Testing and Correction
  • LAMP: multiple testing correction for combination discovery
  • Tarone’s method: modify Bonferroni correction
  • Key algorithm for LAMP
  • Application to combinatorial TF discovery
  • Derivative software
  • Summary

slide-5
SLIDE 5

Active motif discovery

  • Think about the association between motifs and gene expression.
  • To simplify the explanation, gene expression is categorized as high or low.

Contingency table (8 genes: 3 High, 5 Low)

             With motif   Without motif   Total
High              3             0            3
Low               0             5            5
Total             3             5            8

Fisher’s exact test: p = 0.018 < 0.05 → significant?
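The p-value on this slide can be checked by hand. The sketch below is a minimal one-sided Fisher's exact test written with the standard library only. It assumes the table is the extreme one in which all 3 motif-containing genes are High and all 5 remaining genes are Low, which is the arrangement that reproduces p = 0.018. The function name `fisher_exact_one_sided` is illustrative and not part of LAMP.

```python
from math import comb


def fisher_exact_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, returns the probability of a top-left count
    at least as large as the observed `a` (hypergeometric upper tail).
    """
    row1 = a + b           # e.g. number of High genes
    col1 = a + c           # e.g. number of genes containing the motif
    n = a + b + c + d      # all genes
    largest_a = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, largest_a + 1)
    )
    return tail / comb(n, col1)


# Assumed table: High = (3 with motif, 0 without), Low = (0 with motif, 5 without)
p = fisher_exact_one_sided(3, 0, 0, 5)
print(f"p = {p:.3f}")
```

For this table only one arrangement is as extreme as the observed one, so p = 1 / C(8, 3) = 1/56.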

slide-6
SLIDE 6

Active motif discovery

  • Think about the association between motifs and gene expression.
  • To simplify the explanation, gene expression is categorized as high or low.

Contingency table (8 genes: 3 High, 5 Low)

             With motif   Without motif   Total
High              3             0            3
Low               0             5            5
Total             3             5            8

Fisher’s exact test: p = 0.018 < 0.05 → significant?

No! We need multiple testing correction.

slide-7
SLIDE 7

Multiple testing correction

Probability of a false discovery at significance level α:

  • Single test at 5%: ≦ 5%
  • Ten tests at 5% each: 40%
  • Ten tests at 0.5% each: 4.9%
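The 40% and 4.9% figures follow from a one-line calculation if the tests are assumed to be independent (an assumption made here for illustration; the slide does not state it). The helper name `fwer_independent` is hypothetical.

```python
def fwer_independent(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among `n_tests`
    independent tests, each run at significance level `alpha`."""
    return 1.0 - (1.0 - alpha) ** n_tests


for alpha, n_tests in [(0.05, 1), (0.05, 10), (0.005, 10)]:
    rate = fwer_independent(alpha, n_tests)
    print(f"alpha={alpha}, tests={n_tests}: {rate:.1%}")
```

Dividing the per-test level by the number of tests (0.5% instead of 5%) brings the overall rate back under 5%, which is the idea behind the Bonferroni correction.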

slide-8
SLIDE 8

Bonferroni Correction

  • Adjusted p-value = (number of tests) × (raw p-value)
  • Equivalently, the corrected significance level δ is set to α/N
  • Controls the family-wise error rate (FWER):
  • the probability that at least one test is falsely declared significant.

δ: corrected significance level, N: # of tests

A family-wise error occurs unless p > δ for all treatments:

\[
\mathrm{FWER} = \Pr\left( \bigcup_{i \in \{1,\dots,N\}} \{ p_i \le \delta \} \right) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta
\]

slide-9
SLIDE 9

Bonferroni Correction

  • Adjusted p-value = (number of tests) × (raw p-value)
  • Equivalently, the corrected significance level δ is set to α/N
  • Controls the family-wise error rate (FWER):
  • the probability that at least one test is falsely declared significant.

δ: corrected significance level, N: # of tests

A family-wise error occurs unless p > δ for all treatments:

\[
\mathrm{FWER} = \Pr\left( \bigcup_{i \in \{1,\dots,N\}} \{ p_i \le \delta \} \right) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta
\]

The upper bound Nδ should not exceed α, so

\[
\delta \le \alpha / N
\]

slide-10
SLIDE 10

Bonferroni Correction

  • Adjusted p-value = (number of tests) × (raw p-value)
  • Equivalently, the corrected significance level δ is set to α/N
  • Controls the family-wise error rate (FWER):
  • the probability that at least one test is falsely declared significant.

δ: corrected significance level, N: # of tests

\[
\mathrm{FWER} = \Pr\left( \bigcup_{i \in \{1,\dots,N\}} \{ p_i \le \delta \} \right) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta,
\qquad \delta \le \alpha / N
\]

[Figure: p-values for the single items A, B, C, D compared with p-values after taking combinations (A, B, C, D, AB, AC, AD, BC, BD, CD, ...). Taking combinations gives a larger correction factor.]

Detection of functional gene complexes becomes extremely unlikely.

slide-11
SLIDE 11

Two problems in discovering combinations statistically

  • Problem 1: Avoiding an overly conservative multiple testing correction
  • But FWER must still be kept below α
  • We introduce Tarone’s method [Tarone, Biometrics, 1990]
  • Problem 2: Fast enumeration of all possible combinations/subgraphs
  • Counting the Bonferroni factor efficiently
  • We use
  • a frequent pattern mining method for combinations and
  • an efficient graph enumeration technique for subgraphs.
  • Both are combined with Tarone’s method.

slide-12
SLIDE 12

Contents

  • Multiple Testing and Correction
  • LAMP: multiple testing correction for combination discovery
  • Tarone’s method: modify Bonferroni correction
  • Key algorithm for LAMP
  • Application to combinatorial TF discovery
  • Derivative software
  • Summary

slide-13
SLIDE 13

Our proposal: Limitless Arity Multiple-testing Procedure (LAMP) [PNAS 2013]

  • Can enumerate statistically significant combinations
  • Techniques
  • Count the exact number of “testable” combinations
  • Infrequent combinations do not affect FWER
  • Stepwise procedure with frequent itemset mining
  • Calibrate the correction factor to the smallest possible value
  • Discovered statistically significant motif combinations in yeast and breast cancer expression data

slide-14
SLIDE 14

Tarone’s method: only count testable tests in the Bonferroni factor

Bonferroni inequality:

\[
\mathrm{FWER} = \Pr\left( \bigcup_{i \in \{1,\dots,N\}} \{ p_i \le \delta \} \right) \le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta) \le N\delta
\]

  • Untestable: Pr(p_i ≤ δ) = 0. Tests that have NO possibility of producing false positives. These can be safely removed from the Bonferroni factor.
  • Testable: Tests that can produce false positives. These must be counted in the Bonferroni factor.

Bonferroni: the Bonferroni factor is N = # of tests.

\[
\delta N \le \alpha, \quad \text{i.e. } \delta \le \alpha / N
\]

Tarone: check all possible thresholds, and select the largest δ such that

\[
\delta \cdot \left| \{ i \mid \Pr(p_i \le \delta) > 0 \} \right| \le \alpha
\]
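A minimal sketch of the threshold search, using the common integer formulation of Tarone's method: find the smallest factor k for which at most k tests are testable at level α/k. It assumes each test's minimum achievable p-value is already known, so that a test is testable at δ exactly when its minimum p-value is ≤ δ. The function name `tarone_factor` and the example values are illustrative only.

```python
def tarone_factor(min_pvalues, alpha=0.05):
    """Return (k, delta) with delta = alpha / k, where k is the smallest
    integer such that at most k tests have a minimum achievable p-value
    <= alpha / k. Then FWER <= (number of testable tests) * delta <= alpha.
    """
    n = len(min_pvalues)
    if n == 0:
        raise ValueError("at least one test is required")
    for k in range(1, n + 1):
        delta = alpha / k
        testable = sum(1 for p in min_pvalues if p <= delta)
        if testable <= k:
            return k, delta
    # Unreachable: with k = n the number of testable tests is at most n.
    raise AssertionError("no valid factor found")


# Hypothetical minimum achievable p-values of six tests
min_ps = [0.001, 0.01, 0.02, 0.2, 0.5, 1.0]
k, delta = tarone_factor(min_ps, alpha=0.05)
print(k, delta)  # Bonferroni would use k = 6 for the same tests
```

In this toy example the factor drops from 6 (Bonferroni) to 3, because the tests whose minimum p-value can never reach the threshold are not counted.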

slide-15
SLIDE 15

Contents

  • Multiple Testing and Correction
  • LAMP: multiple testing correction for combination discovery
  • Tarone’s method: modify Bonferroni correction
  • Key algorithm for LAMP
  • Application to combinatorial TF discovery
  • Derivative software
  • Summary

slide-16
SLIDE 16

Infrequent combinations never give a significant result.

Contingency table with fixed margins (N genes, n_u of them High, x of them containing the combination):

             With combination   Without combination   Total
High                ?                    ?             n_u
Low                 ?                    ?             N − n_u
Total               x                  N − x            N

From this contingency table, the minimum p-value of Fisher’s exact test can be calculated as

\[
f(x) = \binom{n_u}{x} \Big/ \binom{N}{x}
\]

With this f(x), testable ones can be described as

\[
\{ i \mid f(x_i) \le \delta \}
\]

f(x) depends only on x, and f(x) decreases as x increases.
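The lower bound f(x) is easy to tabulate. The sketch below assumes x ≤ n_u, the range where the formula on this slide applies directly, and uses a toy data set of N = 8 genes with n_u = 3 High genes. The function name `min_p_value` is illustrative.

```python
from math import comb


def min_p_value(x: int, n_u: int, n_total: int) -> float:
    """Smallest Fisher's exact test p-value achievable by a combination
    that occurs in x of n_total genes, n_u of which are High.
    Valid for 0 <= x <= n_u (assumption of this sketch)."""
    if not 0 <= x <= n_u <= n_total:
        raise ValueError("requires 0 <= x <= n_u <= n_total")
    return comb(n_u, x) / comb(n_total, x)


n_total, n_u = 8, 3
for x in range(1, n_u + 1):
    print(x, round(min_p_value(x, n_u, n_total), 4))
```

With these numbers a combination seen in only one or two genes can never reach p ≤ 0.05, whatever its table looks like, so it cannot produce a false positive at that level.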

slide-17
SLIDE 17

Tarone correction with frequency

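Take the maximum δ (the corrected significance threshold) that keeps the FWER bound below α.

\[
\mathrm{FWER} = \Pr\left( \bigcup_{i \in \{1,\dots,N\}} \{ p_i \le \delta \} \right)
\le \sum_{i \in \{1,\dots,N\}} \Pr(p_i \le \delta)
= \sum_{\{ i \mid f(x_i) \le \delta \}} \Pr(p_i \le \delta)
\le \left| \{ i \mid f(x_i) \le \delta \} \right| \cdot \delta
\]

\[
g(\delta) = \left| \{ i \mid f(x_i) \le \delta \} \right| \, \delta
\]

[Figure: the bound with marks at x_i = N, x_i = N−1, x_i = N−2, and the appropriate δ indicated.]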

slide-18
SLIDE 18

[Figure: stepwise procedure with Frequent Pattern Mining. The itemsets are enumerated, the product f(x)·m_x is evaluated at each step, and x is decremented (x = x − 1).]
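The stepwise procedure needs, for each frequency x, a count of the combinations that are at least that frequent. The sketch below computes such counts by brute force over all motif subsets. It is a stand-in for illustration only, not the frequent pattern mining algorithm used by LAMP, and it assumes that m_x denotes the number of combinations occurring in at least x genes. All names and the toy data are hypothetical.

```python
from itertools import combinations


def support_histogram(motif_to_genes):
    """Map each support value to the number of motif combinations whose
    motifs co-occur in exactly that many genes (brute force, exponential
    in the number of motifs)."""
    motifs = sorted(motif_to_genes)
    histogram = {}
    for size in range(1, len(motifs) + 1):
        for combo in combinations(motifs, size):
            shared = set.intersection(*(motif_to_genes[m] for m in combo))
            histogram[len(shared)] = histogram.get(len(shared), 0) + 1
    return histogram


def count_at_least(histogram, x):
    """Number of combinations with support >= x (assumed meaning of m_x)."""
    return sum(n for support, n in histogram.items() if support >= x)


# Toy data: motif -> set of gene ids that contain the motif
toy = {"A": {1, 2, 3}, "B": {1, 2, 4}, "C": {2, 3, 5}}
hist = support_histogram(toy)
for x in (3, 2, 1):
    print(x, count_at_least(hist, x))
```

Because the support of a combination can only shrink when motifs are added, a real implementation can prune the search instead of visiting every subset, which is what frequent pattern mining provides.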

slide-19
SLIDE 19

Contents

  • Multiple Testing and Correction
  • LAMP: multiple testing correction for combination discovery
  • Tarone’s method: modify Bonferroni correction
  • Key algorithm for LAMP
  • Application to combinatorial TF discovery
  • Derivative software
  • Summary

slide-20
SLIDE 20

An Example of Combinatorial Gene Regulation in Yeast

  • Expression: Gasch et al.
  • ChIP-chip: Harbison et al.
  • 102 motifs, 5,935 genes (each gene categorized as High or Low)
  • Heat shock condition

slide-21
SLIDE 21

An Example of Combinatorial Gene Regulation in Yeast

Corrected p-values under the heat shock condition (significant values were shown in red on the slide)

Motif combination          LAMP (arity ≦ 102)   Bonferroni (arity ≦ 4)
                           K = 303              K = 4,426,528
HSF1                       4.41E-24             6.44E-20
MSN2                       3.73E-11             5.45E-07
MSN4                       0.000532             >1
SKO1                       0.00839              >1
SNT2                       0.0192               >1
PHD1, SUT1, SOK2, SKN7     0.0272               >1
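The Bonferroni factor on this slide can be reproduced by counting every motif combination of up to four motifs out of 102. The sketch below also rescales the LAMP-corrected p-value of HSF1 to the Bonferroni factor, assuming that a corrected p-value is the raw p-value multiplied by K (as on the Bonferroni slide); because the slide values are rounded, the result is approximate. The function name `bonferroni_factor` is illustrative.

```python
from math import comb


def bonferroni_factor(n_items: int, max_arity: int) -> int:
    """Number of combinations of 1..max_arity items out of n_items."""
    return sum(comb(n_items, k) for k in range(1, max_arity + 1))


k_bonferroni = bonferroni_factor(102, 4)   # all combinations of up to 4 of 102 motifs
k_lamp = 303                               # LAMP factor reported on the slide

hsf1_lamp = 4.41e-24                       # LAMP-corrected p-value of HSF1
hsf1_raw = hsf1_lamp / k_lamp              # assumed: corrected = K * raw
hsf1_bonferroni = hsf1_raw * k_bonferroni

print(k_bonferroni)
print(f"{hsf1_bonferroni:.2e}")
```

The same raw p-value is penalized about 14,600 times more heavily under Bonferroni, which is why MSN4 and the weaker combinations end up above 1.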

slide-22
SLIDE 22

[Figure: corrected p-values of PHD1, SUT1, SOK2, and SKN7, individually and in combination, shown against the rank of gene expression (up/down) on a p-value scale from 0.0 to 1.0. The combination PHD1, SUT1, SOK2, SKN7 has p = 0.0272; the other values shown are > 1, > 1, 0.666, and 0.111. Gene labels: HAP4, GAT2, MSN4, MGA1, GID8, YNL179C, RHO5.]

slide-23
SLIDE 23

Examples of Combinatorial Gene Regulations in Human Breast Cancer

  • Used MCF-7 breast cancer cell expression profiles
  • Treated with epidermal growth factor (EGF)
  • Dose: 0.1, 0.5, 1.0, 10 nM
  • Time: 5, 10, 15, 30, 45, 60, 90 min after the treatment
  • Motifs taken from MSigDB
  • 397 motifs, approx. 12,000 genes
  • LAMP: K = 1,174,108 to 2,750,336
  • Bonferroni: K = 1.4 × 10^16 (maximum arity = 8)

slide-24
SLIDE 24

Examples of Combinatorial Gene Regulations in Human Breast Cancer

[Figure, panels A and B: corrected p-values of motif combinations by number of motifs (k-motif, k = 1 to 8 and k = 1 to 7) on a p-value scale from 0.0 to 1.0. The motif label AACTTT also appears in the figure.]

  • 8-motif combination (EGF 0.5 nM, 15 min): GR, EVI1, LEF1, OCT1, NFAT, FOXO4, C/EBP, CTTTAAR
  • 7-motif combination (EGF 1.0 nM, 60 min): SREBP1, TATA, LEF1, MEF2, NFAT, FOXO4, EVI1
  • Most corrected p-values shown are > 1; the values below 0.05 are 0.0462, 0.0207, and 0.0137.
  • Expression: Nagashima et al.
  • Motif sites: Xie et al.

slide-25
SLIDE 25

LAMP multiple testing

  • Scripts are available on the Web
  • http://seselab.org/lamp

slide-26
SLIDE 26

ChIP2LAMP: Finding combinatorial regulations by RNA-seq and ChIP-seq

  • http://seselab.org/chip2lamp
  • Scripts to combine ChIP-seq and RNA-seq data and analyse them with LAMP

Workflow

  • ChIP-seq: Mapping (Bowtie2) → Peak calling (MACS1/2) → BED
  • RNA-seq: DEG detection (Tophat, Cufflinks, Cuffdiff) → Cuffdiff result
  • ChIP2LAMP: chip2lamp.py combines the peaks, the DEGs, and the TSSs from the genome annotation; lamp.py then runs LAMP

[Figure: example motif combination GR, EVI1, LEF1, OCT1, NFAT, FOXO4, C/EBP, CTTTAAR]

  • Example
  • 41 ENCODE ChIP-seq data sets from human ES cells (hESC)
  • RNA-seq: cells before and after differentiation (hESC; ectoderm, mesoderm, endoderm), Gifford et al.
  • A candidate combination in mesoderm: SIN3AK-20 (P = 1) and USF1 (P = 1) are not significant as single motifs, but the combination has P = 6.57E-56.

slide-27
SLIDE 27

LAMPLINK: Integrate LAMP with PLINK

  • The most widely used GWAS software is PLINK.
  • LAMP is integrated as an option of PLINK.
  • Easy to use for medical scientists
  • Available from http://seselab.org/lamplink

Example: Find combinations of SNPs found uniquely in JPN from exome data of 1000 Genomes

  • 12758 SNPs, 607 persons (including 105 persons)
  • Fisher’s exact test (significance level is 0.05)
  • Adjusted sig. level: 5.49e-07
  • SNPs shown: rs41292755, rs16879334, rs2303080, rs2287779, rs2287780, rs2472647, rs36012859, rs17238245

slide-28
SLIDE 28

Summary

  • LAMP is much more sensitive than Bonferroni correction, while FWER is still strictly kept below the threshold.
  • Combinations of any size can be considered.
  • A statistically significant eight-motif combination was found in the breast cancer transcriptome.
  • Limitations:
  • The p-value must be strictly positive.
  • LAMP is applicable to the t-test only in limited cases.
  • Chi-square and Mann-Whitney U tests are applicable.
  • Related to the robustness of the index
  • The Bonferroni factor might still be too large for application to GWAS.

slide-29
SLIDE 29

Acknowledgement

  • JSPS & U. Tokyo: Aika Terada
  • Tokyo Institute of Technology: Yuki Saito
  • RIKEN: Mariko Okada-Hatakeyama
  • The Univ. of Tokyo: Koji Tsuda
  • NII: Takeaki Uno
  • Hokkaido Univ.: Shin-ichi Minato

http://seselab.org/lamp/