slide-1
SLIDE 1

CHiCAGO: Statistical methodology for signal detection in Capture Hi-C data

Jonathan Cairns

jonathan.cairns@babraham.ac.uk @jonathancairns

Fraser/Spivakov labs, Babraham Institute

4th October 2016

slide-2
SLIDE 2

Table of Contents

1. Introduction
2. The CHiCAGO model
3. Results

  • J. Cairns (Babraham Institute)

regulatorygenomicsgroup.org/chicago 2 / 20

slide-4
SLIDE 4

Motivation


slide-5
SLIDE 5

CHi-C: improved resolution at promoters, over Hi-C

Lieberman-Aiden et al (2009)

slide-6
SLIDE 6

CHi-C: improved resolution at promoters, over Hi-C

  • Approx. 12-fold increase in read coverage

Schönfelder et al (2015), Mifsud et al (2015), Sahlén et al (2015)

slide-9
SLIDE 9

The data

Align reads & filter out artefacts with HiCUP (Wingett et al, 2016).

Obtain counts Xij:

[Figure: matrix of read counts Xij, with 22,000 baits (j) and 823,000 other ends (i)]

slide-10
SLIDE 10

[Figure: read count N against distance from viewpoint (−6e+05 to 6e+05) for two baits, MIR625−201 (224546) and PPP1CB (340147); legend: "no interaction", "interaction"]

slide-12
SLIDE 12

CHiCAGO

CHiCAGO – Capture Hi-C Analysis of Genomic Organization.


slide-17
SLIDE 17

Model

Background comes from two sources:

                        Brownian            Technical
Source                  Random collisions   Sequencing artefacts
Depends on distance?    Yes (decreasing)    No
Dominates               Close to bait       Far from bait

Under H0 (no interaction), counts are the sum of the two components:

Xij = Bij + Tij
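
The null model Xij = Bij + Tij can be illustrated with a small simulation. This is a minimal sketch, not the CHiCAGO implementation: it assumes the Brownian component is negative binomial, parameterised by a mean and a size (dispersion) and drawn as a gamma-Poisson mixture, and that the technical component is Poisson. All numeric values and function names are hypothetical.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_null_count(rng, mu_brownian, size, lam_technical):
    """Draw X = B + T under H0 (no interaction).

    B: negative binomial with mean mu_brownian and size parameter `size`,
       drawn as a Poisson whose rate is gamma distributed (assumed parameterisation).
    T: Poisson with mean lam_technical.
    """
    rate = rng.gammavariate(size, mu_brownian / size)
    return sample_poisson(rng, rate) + sample_poisson(rng, lam_technical)

rng = random.Random(0)
# Hypothetical parameters: Brownian mean 4, size 2, technical mean 0.5.
draws = [sample_null_count(rng, 4.0, 2.0, 0.5) for _ in range(20000)]
mean_count = sum(draws) / len(draws)  # should sit near 4.0 + 0.5
```

Close to the bait the Brownian mean is large and dominates; far from the bait it shrinks and the technical term is what remains.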

slide-21
SLIDE 21

Brownian background estimation

Xij = Bij + Tij

Bij ∼ NB, with E(Bij) = f(dij) × (bait bias)j × (other end bias)i

[Figure: expected count against distance from bait (−500000 to 500000); panels labelled 250, 1177, 5348, 5373, 5382]

slide-25
SLIDE 25

Brownian background estimation

Xij = Bij + Tij

Bij ∼ NB, with E(Bij) = f(dij) × (bait bias)j × (other end bias)i

f(d):

  • estimated close to bait (< 1.5Mb) in 20kb bins.
  • bin-wise estimates f(db) from geometric mean across baits
  • interpolation: cubic fit on log-log scale

[Figure: distance function, log(f(d)) against log(distance)]
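
The bin-wise step of the distance function can be sketched as follows. This is an illustration under stated assumptions, not the CHiCAGO code: the input layout (bait, distance, count), the function name, and the choice to skip bait/bin cells whose mean count is zero (so that the logarithm is defined) are all assumptions. The cubic interpolation on the log-log scale is not shown.

```python
import math
from collections import defaultdict

BIN_WIDTH = 20_000    # 20kb bins, as on the slide
MAX_DIST = 1_500_000  # only distances below 1.5Mb are used

def binwise_distance_estimates(observations):
    """observations: iterable of (bait_id, distance_bp, count).

    Returns {bin_index: geometric mean across baits of the per-bait mean count}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (bin, bait) -> [sum of counts, n fragments]
    for bait, dist, count in observations:
        d = abs(dist)
        if d >= MAX_DIST:
            continue
        cell = totals[(d // BIN_WIDTH, bait)]
        cell[0] += count
        cell[1] += 1

    logs_by_bin = defaultdict(list)
    for (bin_index, _bait), (total, n) in totals.items():
        mean = total / n
        if mean > 0:  # assumption: cells with zero mean are skipped
            logs_by_bin[bin_index].append(math.log(mean))

    # Geometric mean = exp(arithmetic mean of logs).
    return {b: math.exp(sum(v) / len(v)) for b, v in logs_by_bin.items()}

# Toy data: two baits, two bins, plus one observation beyond 1.5Mb that is ignored.
toy = [("a", 5_000, 4), ("b", 7_000, 9), ("a", 25_000, 2), ("b", -30_000, 8),
       ("a", 2_000_000, 100)]
f_by_bin = binwise_distance_estimates(toy)
```

The geometric mean is less sensitive than the arithmetic mean to a few baits with unusually high counts in a bin.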

slide-31
SLIDE 31

Brownian background estimation

Xij = Bij + Tij

Bij ∼ NB, with E(Bij) = f(dij) × (bait bias)j × (other end bias)i

Bait-specific bias:

  • Get bin-wise estimates for each bait.
  • Take median across bins (robust to interactions)

Other-end-specific bias:

  • Too sparse to estimate individually
  • Assume trans-chromosomal reads are mostly noise
  • Pool other-ends by trans counts
  • Estimate bias parameter, pool-wise
  • Bait-to-bait interactions treated separately

Dispersion parameter:

  • Established maximum likelihood methods.
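
A rough sketch of the bait-specific bias step, to show why the median is robust to interactions. It assumes that the bin-wise estimate for a bait is the ratio of its observed mean count in a bin to the distance function f(db) for that bin; that ratio form, the function name and the toy numbers are assumptions made for illustration.

```python
from statistics import median

def bait_bias(bin_means, f_by_bin):
    """Median across bins of (observed bin mean for this bait) / f(d_b).

    bin_means: {bin_index: mean count for this bait in that bin}
    f_by_bin:  {bin_index: distance-function estimate for that bin}
    """
    ratios = [bin_means[b] / f_by_bin[b]
              for b in bin_means
              if b in f_by_bin and f_by_bin[b] > 0]
    return median(ratios)

# Toy bait with twice the typical coverage; bin 3 holds a strong interaction.
f_by_bin = {0: 6.0, 1: 4.0, 2: 2.0, 3: 1.0}
bin_means = {0: 12.0, 1: 8.0, 2: 4.0, 3: 40.0}
bias = bait_bias(bin_means, f_by_bin)
```

The interaction in bin 3 gives a ratio of 40, but the median stays at the background ratio of 2, whereas a mean across bins would be pulled upwards.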

slide-35
SLIDE 35

Technical noise estimation

Xij = Bij + Tij

Tij ∼ Pois(λij)

  • Estimated entirely from trans-chromosomal reads
  • Pool baits and other-ends
  • Pool-wise estimate: average number of reads per pair of trans fragments.
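
The pool-wise estimate can be sketched as a simple ratio. This is a minimal illustration: the pool labels, the input layout and the function name are hypothetical, and how baits and other ends are assigned to pools is not shown.

```python
from collections import defaultdict

def technical_noise_rates(trans_reads, n_pairs):
    """Pool-wise Poisson mean: trans reads per pair of trans fragments.

    trans_reads: iterable of (bait_pool, other_end_pool, read_count)
    n_pairs:     {(bait_pool, other_end_pool): number of trans bait/other-end
                  fragment pairs in that pool combination, including pairs
                  with zero reads}
    """
    totals = defaultdict(int)
    for bait_pool, other_pool, count in trans_reads:
        totals[(bait_pool, other_pool)] += count
    return {key: totals[key] / n for key, n in n_pairs.items() if n > 0}

# Toy input with hypothetical pool labels.
reads = [("b1", "o1", 1), ("b1", "o1", 2), ("b2", "o1", 1)]
pairs = {("b1", "o1"): 300, ("b2", "o1"): 100, ("b2", "o2"): 50}
lam = technical_noise_rates(reads, pairs)
```

Counting the pairs with zero reads in the denominator matters: most trans pairs are never observed, so the resulting rates are small.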

slide-36
SLIDE 36

Calling p-values

Xij = Bij + Tij

  • B is Negative Binomial, T is Poisson. ⇒ X has Delaporte distribution.
  • One-sided hypothesis test: observed more than expected by chance?
  • Get p-value
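
As a sketch of the one-sided test, the Delaporte tail probability can be computed by convolving the negative binomial and Poisson probability mass functions directly. The mean/size parameterisation of the negative binomial and the function names are assumptions, and this direct sum is for illustration only: subtracting from 1 loses precision for very small p-values.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and size parameter size > 0."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def pois_pmf(k, lam):
    """Poisson pmf with mean lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def delaporte_pvalue(x, mu, size, lam):
    """One-sided p-value P(X >= x) for X = B + T, B ~ NB(mu, size), T ~ Pois(lam)."""
    prob_below = 0.0
    for k in range(x):
        # P(X = k) is the convolution of the two pmfs.
        prob_below += sum(nb_pmf(m, mu, size) * pois_pmf(k - m, lam)
                          for m in range(k + 1))
    return max(0.0, 1.0 - prob_below)

# Hypothetical background parameters for one bait/other-end pair.
p_small = delaporte_pvalue(3, mu=4.0, size=2.0, lam=0.5)
p_large = delaporte_pvalue(30, mu=4.0, size=2.0, lam=0.5)
```

With these toy parameters, an observed count of 30 gives a much smaller p-value than a count of 3, as expected for a test of "more than expected by chance".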

slide-41
SLIDE 41

Statistical model - p-value weighting

Simple p-value thresholding (even using Bonferroni/FDR) → many false positives (typically, at large distances, with only one read).

At large distances:

  • far fewer reproducible interactions
  • but vast majority of tests performed there

So, large-distance false positives dominate.

[Figure: empirical probability of reproducible interaction; log(empirical probability) against log(distance), showing data and fit]

Solution: p-value weighting (Genovese et al, 2009) to downweight long-distance interactions

slide-42
SLIDE 42

[Figure: read count N against distance from viewpoint (−6e+05 to 6e+05) for two baits, MIR625−201 (224546) and PPP1CB (340147); legend: "no interaction", "interaction"]

slide-44
SLIDE 44

Downstream analysis

CHiCAGO-derived interactions give us “Promoter-Interacting Regions” (PIRs).

  • Histone marks?
  • SNPs?
  • Other features?

slide-45
SLIDE 45

Histone marks – significant enrichment at other ends

[Figure: number of overlaps with each feature (CTCF, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9me3), significant interactions versus random samples; panels: GM12878 and mESC]

Paula Freire Pritchett

slide-46
SLIDE 46

Interactions in blood cells

Javierre* / Burren* / Wilder* / Kreuzhuber* / Hill* et al. (in press) Genomic regulatory architecture links disease variants to target genes.

  • PCHi-C in 17 blood cell types (primary cells)
  • “Interactomes” found to be cell type-specific, matching lineage tree

[Figure label: Disease-associated SNP]

slide-47
SLIDE 47

Conclusions

CHiCAGO finds interactions in Capture Hi-C data:

  • robustly
  • having normalised for various sources of bias
  • using p-value weighting (to account for variable true positive rate)

Results provide biological understanding:

  • can detect cell type-specific interactions.
  • can show enrichment for histone marks.
  • can link disease-associated SNPs to their target genes.

slide-48
SLIDE 48

Acknowledgements

CHiCAGO developers: Paula Freire Pritchett, Steven W. Wingett, Mikhail Spivakov

Statistical Advice: Vincent Plagnol (UCL/Inivata), Daniel Zerbino (EBI)

Additional Downstream Analysis: Csilla Várnai, Andrew Dimond

Data: Biola Javierre, Stefan Schönfelder, Cameron Osborne (KCL), Peter Fraser

http://www.regulatorygenomicsgroup.org/chicago

slide-49
SLIDE 49

p-value weighting

We make prior “guesses” Uij. We allow Uij to depend on dij, assuming that short-range interactions are more likely than long-range interactions, with a smooth transition between the two.

The Uij are transformed into weights Wij by dividing through by the mean value, Ū, ensuring that the average Wij value is 1.

Finally, weighted p-values are obtained by dividing the p-values by their respective weights:

Qij = pij / Wij

We now specify the Uij model in our particular context.
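
The weighting step described here is short enough to sketch directly. The function name and the toy values are hypothetical; the arithmetic follows the description above (weights are the prior guesses divided by their mean, and each p-value is divided by its weight).

```python
def weighted_pvalues(pvalues, prior_guesses):
    """Return (Q, W): weighted p-values Q = p / W and weights W = U / mean(U)."""
    mean_u = sum(prior_guesses) / len(prior_guesses)
    weights = [u / mean_u for u in prior_guesses]
    weighted = [p / w for p, w in zip(pvalues, weights)]
    return weighted, weights

# Two tests with the same raw p-value: one short-range (higher prior guess),
# one long-range (lower prior guess). Values are made up for illustration.
q, w = weighted_pvalues([0.01, 0.01], [0.3, 0.1])
```

Here the weights are 1.5 and 0.5, so the short-range p-value shrinks and the long-range one grows, while the average weight stays at 1.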

slide-50
SLIDE 50

p-value weighting

[Figure: empirical probability of reproducible interaction; log(empirical probability) against log(distance), showing data and fit]

Bounded logistic regression model: Uij is assumed a function of both dij and a vector of parameters Θ = (α, β, γ, δ), according to

Uij = ηij Umax + (1 − ηij) Umin

where

ηij = expit(α + β log(dij))
Umin = expit(γ)
Umax = expit(δ)

using the expit function, expit(x) = exp(x) / (1 + exp(x)).
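
The bounded logistic model can be written out in a few lines. This is a sketch only: the parameter values below are invented to give a decreasing curve and are not fitted values from the talk, and the logarithm is assumed to be the natural log.

```python
import math

def expit(x):
    """Logistic function, exp(x) / (1 + exp(x)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-x))

def prior_guess(distance, alpha, beta, gamma, delta):
    """U = eta * Umax + (1 - eta) * Umin, with eta = expit(alpha + beta * log(distance))."""
    eta = expit(alpha + beta * math.log(distance))
    u_min, u_max = expit(gamma), expit(delta)
    return eta * u_max + (1.0 - eta) * u_min

# Hypothetical parameters (beta < 0, so U falls as distance grows).
theta = dict(alpha=30.0, beta=-2.5, gamma=-12.0, delta=-4.0)
u_near = prior_guess(1e4, **theta)  # close to Umax = expit(delta)
u_far = prior_guess(1e7, **theta)   # close to Umin = expit(gamma)
```

Because ηij lies between 0 and 1, Uij is always bounded between Umin and Umax, which gives the smooth transition between short-range and long-range behaviour.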

slide-52
SLIDE 52

Numbers of called interactions

# interactions per sample: 130,000 − 190,000

# interactions per captured promoter:

[Figure: average number of interactions per captured promoter, by cell type: Naïve B, Total B, Fetal Thymus, Naïve CD4+, Naïve CD8+, Total CD8+, Macrophages M2, Macrophages M0, Macrophages M1, Endothelial Precursors, Megakaryocytes, Erythroblasts, Monocytes, Neutrophils, Total CD4+, Unstimulated Total CD4+, Stimulated Total CD4+]

slide-53
SLIDE 53
2. PCHiC: sequencing, HICUP & CHiCAGO

Cell Type                 Processed Reads   Capture Unique Valid Reads   Significant Interactions
Megakaryocytes            2,696,317,863     653,848,788                  150,203
Erythroblasts             2,338,677,291     588,786,672                  144,771
Neutrophils               2,241,977,639     736,055,569                  131,609
Monocytes                 1,942,858,536     572,357,387                  151,389
Macrophages M0            2,125,716,849     668,675,248                  163,791
Macrophages M1            2,067,485,594     497,683,496                  163,399
Macrophages M2            2,055,090,022     523,561,551                  173,449
Naïve B                   2,127,262,739     629,928,642                  171,439
Total B                   1,874,130,921     702,533,922                  183,119
Fetal Thymus              2,728,388,103     776,491,344                  145,577
Naïve CD4+                2,797,861,611     844,697,853                  192,048
Total CD4+                2,227,386,686     836,974,777                  166,668
Unstimulated Total CD4+   2,034,344,692     721,030,702                  177,371
Stimulated Total CD4+     1,971,143,855     749,720,649                  188,714
Naïve CD8+                1,910,881,702     747,834,572                  187,399
Total CD8+                1,849,225,803     628,771,947                  183,964
Endothelial Precursors    2,308,749,174     420,536,621                  141,382
Total                     37,297,499,080    11,299,489,740               2,816,292

Paula Freire-Pritchett, Steven Wingett (* HICUP, * CHiCAGO)

slide-54
SLIDE 54

Clustering of BR according to the score of 10K random interactions (cis1Mb)

[Figure: dendrogram (distance axis 400 to 700) with leaves labelled by cell type: Neutrophils, Monocytes, Erythroblasts, Megakaryocytes, Endothelial Precursors, Macrophages M0, Macrophages M1, Macrophages M2, Fetal Thymus, Naïve CD4+, Total CD4+, Unstimulated Total CD4+, Stimulated Total CD4+, Naïve CD8+, Total CD8+, Naïve B, Total B]

Sven Sewitz

slide-55
SLIDE 55

[Figure: read count N over −1e+06 to 1e+06 for 789407 − AP1S2; panels: B1_Monocyte_D1_step2_chicago2 and Megakaryocyte_D5_6_step2_chicago2]

slide-56
SLIDE 56

[Figure: read count N over −1e+06 to 1e+06 for 410124 − CD93−002,...; panels: B1_Monocyte_D1_step2_chicago2 and Megakaryocyte_D5_6_step2_chicago2]

slide-57
SLIDE 57