CHiCAGO: Statistical methodology for signal detection in Capture Hi-C data
Jonathan Cairns
jonathan.cairns@babraham.ac.uk
@jonathancairns
Fraser/Spivakov labs, Babraham Institute
4th October 2016
Table of Contents
1 Introduction
2 The CHiCAGO model
3 Results
- J. Cairns (Babraham Institute)
regulatorygenomicsgroup.org/chicago 2 / 20
Introduction
Motivation
CHi-C: improved resolution at promoters, over Hi-C
- Approx. 12-fold increase in read coverage
Lieberman-Aiden et al (2009); Schönfelder et al (2015), Mifsud et al (2015), Sahlén et al (2015)
The data
- Align reads & filter out artefacts with HiCUP (Wingett et al, 2016)
- Obtain counts Xij: a count matrix with approx. 22,000 baits (j) and 823,000 other ends (i)
[Figure: read count N against distance from viewpoint (−6e+05 to 6e+05) for two baits: MIR625−201 (224546), labelled "no interaction", and PPP1CB−004,PPP1CB−006,PPP1CB−005,PPP1CB−003,PPP1CB−001,PPP1CB−009,... (340147), labelled "interaction".]
The CHiCAGO model
CHiCAGO
CHiCAGO – Capture Hi-C Analysis of Genomic Organization.
Model
Background comes from two sources:

                        Brownian             Technical
Source                  Random collisions    Sequencing artefacts
Depends on distance?    Yes (decreasing)     No
Dominates               Close to bait        Far from bait

Under H0 (no interaction), counts are the sum of the two components:
Xij = Bij + Tij
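The two-component background can be illustrated with a minimal Python sketch. The power-law form of the Brownian mean and the constant technical rate are placeholder assumptions chosen for illustration only; they are not values or functions taken from CHiCAGO.

```python
# Hypothetical, distance-independent technical noise rate (illustrative only).
TECHNICAL_RATE = 0.05


def brownian_mean(distance_bp, scale=2.0e5, exponent=1.0):
    """Hypothetical decreasing distance function for the Brownian component."""
    return scale * distance_bp ** (-exponent)


def expected_null_count(distance_bp):
    """E[Xij] under H0, i.e. E[Bij] + E[Tij]."""
    return brownian_mean(distance_bp) + TECHNICAL_RATE


def dominant_component(distance_bp):
    """Name the component that contributes more to the expected count."""
    if brownian_mean(distance_bp) > TECHNICAL_RATE:
        return "Brownian"
    return "Technical"


for d in (2.0e4, 2.0e5, 2.0e7):
    print(d, expected_null_count(d), dominant_component(d))
```

With these toy numbers the Brownian term dominates close to the bait and the technical term dominates far from it, matching the table above.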
Brownian background estimation
Xij = Bij + Tij
Bij ∼ NB, with E(Bij) = f(dij) × (bait bias)j × (other end bias)i
[Figure: expected count against distance from bait (−500000 to 500000), five panels titled 250, 1177, 5348, 5373, 5382.]
f(d):
- estimated close to bait (< 1.5Mb) in 20kb bins
- bin-wise estimates f(db) from geometric mean across baits
- interpolation: cubic fit on log-log scale
[Figure: Distance function, log(f(d)) against log(distance).]
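The bin-wise step of the distance function estimate can be sketched as follows. This is a simplified illustration, not CHiCAGO's implementation: the input layout, the helper names, and the decision to skip baits with a zero mean in a bin are all assumptions, and the cubic log-log interpolation between bins is left out.

```python
import math

BIN_WIDTH = 20_000      # 20kb bins, as on the slide
MAX_DIST = 1_500_000    # f(d) is estimated only within 1.5Mb of the bait


def binwise_distance_function(counts_by_bait):
    """counts_by_bait maps bait id -> list of (distance_bp, count) pairs.

    Returns {bin index: geometric mean across baits of the per-bait mean count}.
    """
    per_bin = {}  # bin index -> list of per-bait mean counts
    for pairs in counts_by_bait.values():
        sums, sizes = {}, {}
        for distance_bp, count in pairs:
            if abs(distance_bp) >= MAX_DIST:
                continue
            b = int(abs(distance_bp) // BIN_WIDTH)
            sums[b] = sums.get(b, 0) + count
            sizes[b] = sizes.get(b, 0) + 1
        for b in sums:
            per_bin.setdefault(b, []).append(sums[b] / sizes[b])

    f_by_bin = {}
    for b, means in per_bin.items():
        positive = [m for m in means if m > 0]  # assumption: drop zero means
        if positive:
            mean_log = sum(math.log(m) for m in positive) / len(positive)
            f_by_bin[b] = math.exp(mean_log)
    return f_by_bin


toy = {
    "bait_a": [(5_000, 4), (15_000, 4), (25_000, 1)],
    "bait_b": [(-5_000, 16), (30_000, 4)],
}
f_hat = binwise_distance_function(toy)
print(f_hat)
```

The geometric mean is less sensitive than the arithmetic mean to a few baits with very high counts in a bin.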
Bait-specific bias:
- Get bin-wise estimates for each bait.
- Take median across bins (robust to interactions)
Other-end-specific bias:
- Too sparse to estimate individually
- Assume trans-chromosomal reads are mostly noise
- Pool other-ends by trans counts
- Estimate bias parameter, pool-wise
- Bait-to-bait interactions treated separately
Dispersion parameter:
- Established maximum likelihood methods.
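Why the median across bins is robust to interactions can be shown with a small sketch. It assumes, for illustration only, that a bait's bin-wise estimate is the ratio of its observed mean count in a bin to the distance function value for that bin; the function name and inputs are hypothetical.

```python
import statistics


def bait_bias(bait_bin_means, f_by_bin):
    """Median across bins of (observed bin mean / distance function value)."""
    ratios = [
        bait_bin_means[b] / f_by_bin[b]
        for b in bait_bin_means
        if f_by_bin.get(b, 0) > 0
    ]
    return statistics.median(ratios)


# Toy data: the bait is twice as "visible" as average in every bin,
# except bin 4, which contains a strong interaction (ratio 20).
f_by_bin = {0: 8.0, 1: 2.0, 2: 1.0, 3: 0.5, 4: 0.25}
observed = {0: 16.0, 1: 4.0, 2: 2.0, 3: 1.0, 4: 5.0}

ratios = [observed[b] / f_by_bin[b] for b in observed]
print("median:", bait_bias(observed, f_by_bin))
print("mean:  ", statistics.mean(ratios))
```

The single interacting bin pulls the mean well above 2 while leaving the median unchanged.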
Technical noise estimation
Xij = Bij + Tij
Tij ∼ Pois(λij)
- Estimated entirely from trans-chromosomal reads
- Pool baits and other-ends
- Pool-wise estimate: average number of reads per pair of trans fragments.
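The pool-wise estimate amounts to a ratio per pair of pools. In the sketch below, the pool labels, the dictionary layout, and the function name are hypothetical; how fragments are assigned to pools is not covered.

```python
def pooled_technical_rates(trans_reads, trans_pairs):
    """Pool-wise Poisson rate: trans reads per pair of trans fragments.

    Both arguments are dicts keyed by (bait_pool, other_end_pool):
    trans_reads holds the total trans-chromosomal read count for that pool pair,
    trans_pairs the number of trans (bait, other end) fragment pairs in it.
    """
    return {
        key: trans_reads.get(key, 0) / n_pairs
        for key, n_pairs in trans_pairs.items()
        if n_pairs > 0
    }


rates = pooled_technical_rates(
    {("low", "low"): 50, ("high", "high"): 4000},
    {("low", "low"): 1_000_000, ("high", "high"): 2_000_000, ("low", "high"): 500_000},
)
print(rates)
```

Every bait and other-end pair then takes its λij from the pool pair it belongs to.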
Calling p-values
Xij = Bij + Tij
- B is Negative Binomial, T is Poisson. ⇒ X has Delaporte distribution.
- One-sided hypothesis test: observed more than expected by chance?
- Get p-value
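A Delaporte p-value can be computed by convolving the Negative Binomial and Poisson probability mass functions directly. The sketch below uses only the Python standard library; the mean/size parameterisation of the Negative Binomial and the parameter values in the example are assumptions for illustration, and the brute-force sum is not meant to reflect how CHiCAGO evaluates the distribution.

```python
import math


def nb_pmf(k, mean, size):
    """Negative Binomial pmf with the given mean and dispersion (size)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )
    return math.exp(log_p)


def pois_pmf(k, lam):
    """Poisson pmf with rate lam > 0."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def delaporte_pmf(x, mean, size, lam):
    """pmf of X = B + T with B ~ NB(mean, size) and T ~ Pois(lam), independent."""
    return sum(nb_pmf(b, mean, size) * pois_pmf(x - b, lam) for b in range(x + 1))


def p_value(x_obs, mean, size, lam):
    """One-sided p-value P(X >= x_obs) under H0."""
    below = sum(delaporte_pmf(x, mean, size, lam) for x in range(x_obs))
    return max(0.0, 1.0 - below)


for x in (0, 3, 10):
    print(x, p_value(x, mean=2.0, size=1.5, lam=0.5))
```

Computing the tail as one minus the lower sum loses precision for very small p-values, so a real implementation would evaluate the upper tail more carefully.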
Statistical model - p-value weighting
Simple p-value thresholding (even using Bonferroni/FDR) → many false positives (typically, at large distances, with only one read).
At large distances:
- far fewer reproducible interactions
- but vast majority of tests performed there
So, large-distance false positives dominate.
[Figure: Empirical probability of reproducible interaction; log [empirical probability] against log(distance), showing data and fit.]
Solution: p-value weighting (Genovese et al, 2009) to downweight long-distance interactions
[Figure: read count N against distance from viewpoint (−6e+05 to 6e+05) for two baits: MIR625−201 (224546), labelled "no interaction", and PPP1CB−004,PPP1CB−006,PPP1CB−005,PPP1CB−003,PPP1CB−001,PPP1CB−009,... (340147), labelled "interaction".]
Results
Downstream analysis
CHiCAGO-derived interactions give us “Promoter-Interacting Regions” (PIRs).
- Histone marks?
- SNPs?
- Other features?
Histone marks – significant enrichment at other ends
[Figure: number of overlaps with each feature (CTCF, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9me3) for significant interactions versus random samples, in GM12878 and mESC. Credit: Paula Freire Pritchett]
Interactions in blood cells
Javierre* / Burren* / Wilder* / Kreuzhuber* / Hill* et al. (in press). Genomic regulatory architecture links disease variants to target genes.
- PCHi-C in 17 blood cell types (primary cells)
- “Interactomes” found to be cell type-specific, matching lineage tree
[Figure label: Disease-associated SNP]
Conclusions
CHiCAGO finds interactions in Capture Hi-C data:
- robustly
- having normalised for various sources of bias
- using p-value weighting (to account for variable true positive rate)
Results provide biological understanding:
- can detect cell type-specific interactions.
- can show enrichment for histone marks.
- can link disease-associated SNPs to their target genes.
Acknowledgements
CHiCAGO developers: Paula Freire Pritchett, Steven W. Wingett, Mikhail Spivakov
Statistical Advice: Vincent Plagnol (UCL/Inivata), Daniel Zerbino (EBI)
Additional Downstream Analysis: Csilla Várnai, Andrew Dimond
Data: Biola Javierre, Stefan Schönfelder, Cameron Osborne (KCL), Peter Fraser
http://www.regulatorygenomicsgroup.org/chicago
p-value weighting
- We make prior “guesses” Uij.
- We allow Uij to depend on dij, assuming that short-range interactions are more likely than long-range interactions, with a smooth transition between the two.
- The Uij are transformed into weights Wij by dividing through by the mean value, Ū, ensuring that the average Wij value is 1: Wij = Uij / Ū
- Finally, weighted p-values are obtained by dividing the p-values by their respective weights: Qij = pij / Wij
We now specify the Uij model in our particular context.
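The weighting step itself is a short calculation. The following Python sketch applies it to two toy tests with equal raw p-values; the function name and the numbers are illustrative assumptions.

```python
def weighted_p_values(p_values, prior_guesses):
    """Return (Q, W): weighted p-values Q = p / W and weights W = U / mean(U)."""
    mean_u = sum(prior_guesses) / len(prior_guesses)
    weights = [u / mean_u for u in prior_guesses]
    weighted = [p / w for p, w in zip(p_values, weights)]
    return weighted, weights


# Same raw p-value, but the first test is short-range (higher prior guess).
q, w = weighted_p_values([0.01, 0.01], [0.3, 0.1])
print("weights:", w)
print("weighted p-values:", q)
```

The short-range test ends up with a smaller weighted p-value than the long-range one, while the weights still average to 1.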
p-value weighting
[Figure: Empirical probability of reproducible interaction; log [empirical probability] against log(distance), showing data and fit.]
Bounded logistic regression model: Uij is assumed a function of both dij and a vector of parameters Θ = (α, β, γ, δ), according to
Uij = ηij Umax + (1 − ηij) Umin
where
ηij = expit(α + β log(dij))
Umin = expit(γ)
Umax = expit(δ)
using the expit function, expit(x) = 1 / (1 + exp(−x)).
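The bounded logistic model translates directly into code. In this Python sketch the parameter values are hypothetical, picked only so that the curve falls from Umax at short range to Umin at long range; they are not fitted CHiCAGO values.

```python
import math


def expit(x):
    """Logistic function, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def prior_guess(distance_bp, alpha, beta, gamma, delta):
    """Uij = eta * Umax + (1 - eta) * Umin, with eta = expit(alpha + beta * log(d))."""
    eta = expit(alpha + beta * math.log(distance_bp))
    u_min = expit(gamma)
    u_max = expit(delta)
    return eta * u_max + (1.0 - eta) * u_min


# Hypothetical parameters (illustrative only); beta < 0 makes U decrease with distance.
params = dict(alpha=30.0, beta=-2.5, gamma=-17.0, delta=-5.0)

for d in (1e4, 1e5, 1e6, 1e7):
    print(int(d), prior_guess(d, **params))
```

Because η lies between 0 and 1, Uij is always bounded by Umin and Umax, which is what keeps the weights from collapsing to zero at very long range.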
Numbers of called interactions
# interactions per sample: 130,000 − 190,000
# interactions per captured promoter:
[Figure: average number of interactions per captured promoter, by cell type: Naïve B, Total B, Fetal Thymus, Naïve CD4+, Naïve CD8+, Total CD8+, Macrophages M2, Macrophages M0, Macrophages M1, Endothelial Precursors, Megakaryocytes, Erythroblasts, Monocytes, Neutrophils, Total CD4+, Unstimulated Total CD4+, Stimulated Total CD4+.]
2. PCHiC: sequencing, HiCUP & CHiCAGO

Cell Type                  Processed Reads    Capture Unique Valid Reads    Significant Interactions
Megakaryocytes             2,696,317,863      653,848,788                   150,203
Erythroblasts              2,338,677,291      588,786,672                   144,771
Neutrophils                2,241,977,639      736,055,569                   131,609
Monocytes                  1,942,858,536      572,357,387                   151,389
Macrophages M0             2,125,716,849      668,675,248                   163,791
Macrophages M1             2,067,485,594      497,683,496                   163,399
Macrophages M2             2,055,090,022      523,561,551                   173,449
Naïve B                    2,127,262,739      629,928,642                   171,439
Total B                    1,874,130,921      702,533,922                   183,119
Fetal Thymus               2,728,388,103      776,491,344                   145,577
Naïve CD4+                 2,797,861,611      844,697,853                   192,048
Total CD4+                 2,227,386,686      836,974,777                   166,668
Unstimulated Total CD4+    2,034,344,692      721,030,702                   177,371
Stimulated Total CD4+      1,971,143,855      749,720,649                   188,714
Naïve CD8+                 1,910,881,702      747,834,572                   187,399
Total CD8+                 1,849,225,803      628,771,947                   183,964
Endothelial Precursors     2,308,749,174      420,536,621                   141,382
Total                      37,297,499,080     11,299,489,740                2,816,292

Credits: Paula Freire-Pritchett, Steven Wingett (* HiCUP, * CHiCAGO)
Clustering of BR according to the score of 10K random interactions (cis1Mb)
[Figure: dendrogram of the individual samples, labelled by cell type (Neutrophils, Monocytes, Erythroblasts, Megakaryocytes, Endothelial Precursors, Macrophages M0/M1/M2, Fetal Thymus, Naïve B, Total B, Naïve CD4+, Total CD4+, Unstimulated Total CD4+, Stimulated Total CD4+, Naïve CD8+, Total CD8+); axis: Distance (400 to 700). Credit: Sven Sewitz]
[Figure: read count N against distance (−1e+06 to 1e+06) for baits 789407 − AP1S2 and 410124 − CD93−002,... in samples B1_Monocyte_D1_step2_chicago2 and Megakaryocyte_D5_6_step2_chicago2.]