A linear time algorithm for constrained optimal segmentation
Toby Dylan Hocking toby.hocking@mail.mcgill.ca
joint work with Guillem Rigaill, Paul Fearnhead, Guillaume Bourque
11 Nov 2016

Outline:
Problem: optimizing ChIP-seq peak detection
New linear time algorithm using functional pruning
Results on benchmark data sets
Conclusions and future work
Chromatin immunoprecipitation sequencing (ChIP-seq)
Analysis of DNA-protein interactions. Source: “ChIP-sequencing,” Wikipedia.
Problem: find peaks in each of several samples
[Figure: UCSC genome browser view of chr11:118,095,000-118,125,000 (hg19), showing UCSC Genes, CRG alignability of 100mers (ENCODE/CRG, Guigo), and H3K4me3 signal for four samples: McGill0002 monocyte, McGill0004 CD4-positive helper T cell, McGill0091 B cell, McGill0103 B cell.]
Grey profiles are normalized aligned read count signals. Black bars are “peaks” called by MACS2 (Zhang et al, 2008):
◮ many false positives.
◮ overlapping peaks have different start/end positions.
Previous work in genomic peak detection
◮ Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008.
◮ SICER, Zang et al, 2009.
◮ HOMER, Heinz et al, 2010.
◮ CCAT, Xu et al, 2010.
◮ RSEG, Song et al, 2011.
◮ Triform, Kornacker et al, 2012.
◮ Histone modifications in cancer (HMCan), Ashoor et al, 2013.
◮ PeakSeg, Hocking, Rigaill, Bourque, ICML 2015.
◮ PeakSegJoint, Hocking and Bourque, arXiv:1506.01286.
◮ ... dozens of others.
Two big questions: how to choose the best...
◮ ...algorithm? (testing)
◮ ...parameters? (training)
How to choose parameters of unsupervised peak detectors?
19 parameters for Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008:
[-g GSIZE] [-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal]
[--nomodel] [--extsize EXTSIZE | --shiftsize SHIFTSIZE]
[-q QVALUE | -p PVALUE | -F FOLDENRICHMENT] [--to-large]
[--down-sample] [--seed SEED] [--nolambda] [--slocal SMALLLOCAL]
[--llocal LARGELOCAL] [--shift-control] [--half-ext] [--broad]
[--broad-cutoff BROADCUTOFF] [--call-summits]

10 parameters for Histone modifications in cancer (HMCan), Ashoor et al, 2013:
minLength 145
medLength 150
maxLength 155
smallBinLength 50
largeBinLength 100000
pvalueThreshold 0.01
mergeDistance 200
iterationThreshold 5
finalThreshold 0
maxIter 20
Which macs parameter is best for these data?
Compute likelihood/loss of piecewise constant model
Idea: choose the parameter with a lower loss
PeakSeg: search for the peaks with lowest loss
Choose the number of peaks via standard penalties (AIC, BIC, ...)
Or learned penalties based on visual labels (more on this later).
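The idea of choosing the number of peaks via a penalty can be sketched as penalized model selection: pick the number of peaks p minimizing loss_p + penalty * p, in the spirit of AIC/BIC. A minimal illustration; the `loss_by_peaks` numbers are made up, not from the talk:

```python
# Hypothetical sketch of AIC/BIC-style model selection: the loss values
# below are invented for illustration, not real segmentation losses.

def select_peaks(loss_by_peaks, penalty):
    """Return the number of peaks minimizing loss + penalty * peaks."""
    return min(loss_by_peaks, key=lambda p: loss_by_peaks[p] + penalty * p)

loss_by_peaks = {0: 100.0, 1: 60.0, 2: 45.0, 3: 44.0, 4: 43.9}
print(select_peaks(loss_by_peaks, penalty=5.0))  # -> 2 (larger penalty, fewer peaks)
print(select_peaks(loss_by_peaks, penalty=0.5))  # -> 3
```

A larger penalty selects fewer peaks, which is exactly the degree of freedom that labels can be used to train.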
Maximum likelihood Poisson segmentation models
◮ Previous work: unconstrained maximum likelihood mean for s segments (s − 1 changes).
◮ Hocking et al, ICML 2015: PeakSeg constraint enforces up, down, up, down changes (and not up, up, down).
◮ Odd-numbered segments are background noise, even-numbered segments are peaks.
◮ Constrained Dynamic Programming Algorithm, O(N^2) time for N data points.
But quadratic time is not fast enough for genomic data!
[Figure: hours of computation time (1 to 3) versus data points to segment N (0e+00 to 2e+05), showing the quadratic growth of the constrained DP algorithm.]
◮ Genomic data is large, N ≥ 10^6.
◮ Split into subsets? What if we split a peak in half?
◮ Need linear time algorithm for analyzing whole data set.
Problem: optimizing ChIP-seq peak detection New linear time algorithm using functional pruning Results on benchmark data sets Conclusions and future work
Statistical model is Poisson with change constraints
◮ We have N count data z_1, ..., z_N ∈ Z_+.
◮ Fix the number of segments S ∈ {1, 2, ..., N}.
◮ PeakSeg model: z_t ~ Poisson(m_t) such that the mean vector m has S − 1 up-down changes.
◮ Want to find means m_t which maximize the Poisson likelihood: P(Z = z_t | m_t) = m_t^{z_t} e^{−m_t} / (z_t!).
◮ Equivalent to finding means m_t which minimize the Poisson loss: ℓ(m_t, z_t) = m_t − z_t log m_t.
◮ Naive computation is O(N^S), since there are O(N^{S−1}) possible positions for S − 1 change-points, and it takes O(N) operations to compute the mean and loss for each.
◮ Comparison to Hidden Markov Model:
  Likelihood: same emission terms, no transition terms.
  Constraint: on the number of changes rather than on the values.
Relation to previous work
                      no pruning              functional pruning
  unconstrained       Dynamic Programming     Pruned DP
                      exact O(N^2)            exact O(N log N)
                      R pkg: changepoint      R pkgs: cghseg, Segmentor
  up-down constrained constrained DP          this work
                      inexact O(N^2)          exact O(N log N)
                      R pkg: PeakSegDP        R pkg: coseg
◮ Auger and Lawrence 1989, Jackson et al 2005.
◮ Rigaill 2010, Johnson 2013, Cleynen et al 2014.
◮ Hocking, Rigaill, Bourque 2015.
◮ Contribution: new algorithm that exactly computes the constrained optimal segmentation for N data points in O(N log N) time.
Dynamic programming and functional pruning
Classical dynamic programming (Auger and Lawrence 1989) computes the matrix of optimal loss values in s segments up to t data points, in O(SN^2) time:

  L_{1,1} ... L_{1,N}
   ...         ...
  L_{S,1} ... L_{S,N}

Dynamic programming with functional pruning (Rigaill 2010) computes a matrix of loss functions, the optimal loss up to t data points if segment s has mean µ_s, in O(SN log N) time:

  L_{1,1}(µ_1) ... L_{1,N}(µ_1)
   ...                ...
  L_{S,1}(µ_S) ... L_{S,N}(µ_S)

Contribution of this work: a new algorithm that applies the functional pruning technique to the up-down constrained model.
First segment, first data point
◮ For data z_1, ..., z_N ∈ Z_+ let γ_t(µ) = ℓ(µ, z_t) = µ − z_t log µ be the Poisson loss for each t ∈ {1, ..., N}.
◮ For example z = 2, 1, 9, 5, 10, 3.
◮ Then γ_1(µ) = L_{1,1}(µ) = 1µ − 2 log µ + 0.
◮ Need to store 3 coefficients (linear, log, constant).

[Figure: cost L_{1,1}(µ) as a function of the mean µ; 1 segment, 1 data point.]
First segment, other data points
◮ The loss of the first segment up to data point t is L_{1,t}(µ) = Σ_{i=1}^t γ_i(µ).
◮ For example z = 2, 1, 9, 5, 10, 3.
◮ L_{1,2}(µ) = 2µ − 3 log µ + 0.
◮ L_{1,3}(µ) = 3µ − 12 log µ + 0.
◮ ...

[Figure: cost as a function of the mean µ; 1 segment, 2 data points and 1 segment, 3 data points.]
Second segment, up to data point 2
◮ The mean cost in 2 segments up to data point 2 is
  L_{2,2}(µ_2) = γ_2(µ_2) + min_{µ_1 ≤ µ_2} L_{1,1}(µ_1)
               = γ_2(µ_2) + L≤_{1,1}(µ_2).
◮ The min-less operator is L≤(µ) = min_{x ≤ µ} L(x).

[Figure: cost as a function of the mean µ; 2 segments, 2 data points.]
Comparison with unconstrained Pruned DPA
◮ For our constrained algorithm, the first segment mean must be less than the second, so the previous-segment cost term is a function of µ_2:
  L_{2,2}(µ_2) = γ_2(µ_2) + min_{µ_1 ≤ µ_2} L_{1,1}(µ_1) = γ_2(µ_2) + L≤_{1,1}(µ_2).
◮ For the unconstrained algorithm, it is a constant:
  L_{2,2}(µ_2) = γ_2(µ_2) + min_{µ_1} L_{1,1}(µ_1) = γ_2(µ_2) + L_{1,1}.
◮ For example z = 2, 1, 9, 5, 10, 3.

[Figure: constrained min-less of the previous cost versus the unconstrained min; 2 segments, 2 data points.]
Storage as a piecewise function on intervals
◮ For example z = 2, 1, 9, 5, 10, 3.
◮ Storage:
  L_{2,2}(µ) = γ_2(µ) + { 1µ − 2 log µ + 0        if µ ∈ [1, 2],
                          0µ − 0 log µ + 0.6137   if µ ∈ [2, 10],
  where the first piece is L_{1,1}(µ) itself and the second is its constant minimum value L_{1,1}(2) ≈ 0.6137.

[Figure: piecewise storage of the cost; 2 segments, 2 data points.]
Second segment, up to data point 3
◮ For data point 3 we need to consider two possible change-points:
  L_{2,3}(µ) = γ_3(µ) + min{ L≤_{1,2}(µ),   change up after data point 2,
                             L_{2,2}(µ) },  change up after data point 1.
◮ For z = 2, 1, 9, 5, 10, 3 the min operation prunes a change after data point 1.

[Figure: cost as a function of the mean, colored by previous segment end (1 or 2); 2 segments, 3 data points.]
Second segment, up to data point t
◮ The updates continue for every data point t ∈ {3, ..., N}:
  L_{2,t}(µ) = γ_t(µ) + min{ L≤_{1,t−1}(µ),   change up after t − 1,
                             L_{2,t−1}(µ) },  change up before t − 1.
◮ For example for z = 2, 1, 9, 5, 10, 3, at data point t = 4 we only need to consider changes after 2 and 3 (1 has been pruned).

[Figure: cost as a function of the mean, colored by previous segment end (2 or 3); 2 segments, 4 data points.]
General functional pruning equation
◮ The constrained cost of a mean µ for segment s, up to data point t:
  L_{s,t}(µ) = γ_t(µ) + min{ L_{s,t−1}(µ), L*_{s−1,t−1}(µ) }.
◮ Time complexity of the min and the min-less/more operator * is linear in the number of intervals, empirically sub-linear O(log N).

[Figure: max, median and inter-quartile range of the number of intervals stored (about 50 to 250), versus log10(data points to segment) from 2 to 5.]

◮ Total time complexity: O(SN log N).
Problem: optimizing ChIP-seq peak detection New linear time algorithm using functional pruning Results on benchmark data sets Conclusions and future work
Previous work in computer vision: look and add labels to...
Photos: labels are names (face detection, CVPR 2013, 246 papers).
Cell images: labels are phenotypes (CellProfiler, 873 citations).
Copy number profiles: labels are alterations (SegAnnDB, Hocking et al, 2014).
Sources: http://en.wikipedia.org/wiki/Face_detection; Jones et al, PNAS 2009, "Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning."
Benchmark data sets, algorithms
http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/
◮ Hocking et al, Bioinformatics (2016). Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning.
◮ 37 labeled H3K4me3 samples (sharp peak pattern).
◮ 29 labeled H3K36me3 samples (broad peak pattern).
◮ 12,826 labeled regions with and without peaks.
◮ 2,752 separate segmentation problems.

Algorithms for segmenting N data points:

  package            constraint         exact?  complexity
  coseg (this work)  µ1 ≤ µ2 ≥ µ3 ...   yes     O(N log N)
  PeakSegDP          µ1 < µ2 > µ3 ...   no      O(N^2)
  Segmentor          none               yes     O(N log N)

Segmentor loss ≤ coseg loss ≤ PeakSegDP loss.
Linear time algorithms faster for larger data sets
[Figure: timings on 2,752 histone mark ChIP-seq data sets; log10(seconds) versus log10(data points to segment), with 1 second / 1 minute / 1 hour reference lines; coseg O(N log N), Segmentor O(N log N), PeakSegDP O(N^2).]

Total time to compute 10 models (0, ..., 9 peaks) for all data sets:
◮ PeakSegDP: 156 hours, inexact.
◮ coseg: 6 hours, exact.
8 false negative labels for models with 0 peaks
Models with 1 peak are better (6 FN)
Models with 2 peaks are better still (4 FN)
Models with 3 peaks are the same (4 FN)
Models with 4 peaks are better (2 FN)
Models with 5 peaks have no incorrect labels
Models with 6 peaks are worse (1 false positive)
Constrained optimization better than macs
0 errors for coseg/PeakSegDP, 6 errors for Segmentor
Minimum train error in all data sets
  model        errors     fp     fn  models  problems
  PeakSegDP       677    116    561   27469      2738
  coseg           789     94    695   21278      1498
  macs           1293    519    774
  Segmentor      1544     46   1498    8106         4
  hmcan.broad    2778    367   2411
  possible      12826  11037   7225   27520      2752
◮ Segmentor, PeakSegDP, coseg were used to compute up to 10 models for each problem (1, ..., 19 segments = 0, ..., 9 peaks).
◮ errors = fp + fn = total number of incorrect labels, after picking the best parameter for each of the 2,752 separate problems.
◮ models = number of models that obey the up-down constraint.
◮ New coseg algorithm has minimum train error almost as good as the slower PeakSegDP algorithm.
◮ Other algorithms are much less accurate.
Max-margin penalty learning algorithm
[Figure: max-margin penalty learning; log(penalty) versus input feature = log(max(coverage)), with labeled models (1, 2, 7 peaks) for samples McGill0002, McGill0091, McGill0322, McGill0004; a large-margin regression line achieves 0 errors, a small-margin line 0 errors, and a constant penalty 1 error.]
http://bl.ocks.org/tdhock/raw/9311ca39d643d127e04a088814c81ee1/
ROC curves for one test fold
http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/figure-roc-test/
Test AUC on 7 benchmark data sets
[Figure: test AUC (area under the Receiver Operating Characteristic curve) for MACS, HMCanBroad, Segmentor, PeakSegDP and coseg on 7 benchmark data sets: H3K36me3 AM immune, H3K36me3 TDH immune, H3K36me3 TDH other, H3K4me3 PGP immune, H3K4me3 TDH immune, H3K4me3 TDH other, H3K4me3 XJ immune.]
◮ 4-fold cross-validation: train on 3/4 of labels, test on 1/4.
◮ HMCanBroad is accurate for broad H3K36me3 data but not sharp H3K4me3 data.
◮ MACS is accurate for sharp H3K4me3 but not broad H3K36me3 data.
◮ Unconstrained Segmentor algorithm is not as accurate as the up-down constrained algorithms (coseg, PeakSegDP).
◮ Proposed algorithm in the coseg R package yields state-of-the-art accuracy in all benchmark data sets.
http://bl.ocks.org/tdhock/raw/886575874144c3b172ce6b7d7d770b9f/
Problem: optimizing ChIP-seq peak detection New linear time algorithm using functional pruning Results on benchmark data sets Conclusions and future work
PeakSeg: search for the peaks with lowest loss
Simple model with only one parameter to train (maxPeaks).
Conclusions
                      no pruning              functional pruning
  unconstrained       Dynamic Programming     Pruned DP
                      exact O(N^2)            exact O(N log N)
                      R pkg: changepoint      R pkgs: cghseg, Segmentor
  up-down constrained constrained DP          this work
                      inexact O(N^2)          exact O(N log N)
                      R pkg: PeakSegDP        R pkg: coseg
◮ New algorithm that exactly computes the constrained optimal change-points/peaks for N data points.
◮ C++ code in the coseg R package, O(N log N) memory: https://github.com/tdhock/coseg
◮ PeakSegFPOP program for big data (N > 10^6), O(log N) memory and O(N log N) disk space.
◮ TODO: regularized isotonic regression solver.
◮ TODO: supervised peak calling for ENCODE, Roadmap, ...
◮ TODO: interactive web app for creating labels.
Thanks for your attention!
Questions? toby.hocking@mail.mcgill.ca
◮ coseg R package,
https://github.com/tdhock/coseg
◮ PeakSegFPOP command line program,
https://github.com/tdhock/PeakSegFPOP
◮ source code for these slides:
https://github.com/tdhock/PeakSegFPOP-paper
Segmenting whole chromosomes?
◮ 365 regions with no gaps in hg19.
◮ 272 regions with no gaps on chr1-22, X, Y.
◮ Smallest: 31,833 bases (chr6:157,609,467-157,641,300).
◮ Largest: 115,591,997 bases (chr4:75,452,279-191,044,276).
Reasonable time to segment biggest region on chr1
[Figure: minutes of computation (0.1 to 30) versus N = data points to segment (100,000 to 4,485,563), for the in-memory and on-disk algorithms.]

◮ R package in memory: O(N log N) time.
◮ Command line program on disk: O(N log N) time.
Memory requirements reasonable for on-disk version
[Figure: memory usage (gigabytes, 0.001 to 20) versus N = data points to segment, up to 115,591,997 bases in the biggest chr1 region with no gap (1,193 bedGraph lines per megabase).]

◮ R package in memory: O(N log N) memory.
◮ Command line program on disk: O(log N) memory.
Disk usage reasonable
[Figure: disk usage (gigabytes, 0.1 to 32.2) versus N = data points to segment (100,000 to 4,485,563), for the in-memory and on-disk algorithms.]

◮ R package in memory: no disk usage.
◮ Command line program: O(N log N) disk space (temporary).
Two annotators provide consistent labels, but different precision
◮ TDH peakStart/peakEnd labels are more precise than AM peaks labels.
◮ AM noPeaks labels are more precise than TDH regions with no label.
Train on one person, test on another (same histone mark and samples)
[Figure: labeled genomic regions on chr1-chrY for the benchmark data sets (H3K36me3 AM immune, H3K36me3 TDH immune, H3K36me3 TDH other, H3K4me3 PGP immune, H3K4me3 TDH immune, H3K4me3 TDH other, H3K4me3 XJ immune, max/nrsf/srf NTNU k562 and Gm12878); train on H3K4me3_PGP_immune labels, test on H3K4me3_TDH_immune labels (and vice versa), other labels unused. x-axis: position on chromosome (Mb = mega bases).]
Train on one person, test on another (same histone mark and samples)
[Figure: percent incorrect test labels for models trained on labels from the same person versus a different person, for test data sets H3K4me3_PGP_immune and H3K4me3_TDH_immune; models compared: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]
Train on some samples, test on others (same histone mark and person)
[Figure: labeled genomic regions on chr1-chrY; train on H3K4me3_TDH_immune labels, test on H3K4me3_TDH_other labels (and vice versa), other labels unused. x-axis: position on chromosome (Mb = mega bases).]
Train on some samples, test on others (same histone mark and person)
[Figure: percent incorrect test labels for models trained on labels from the same cell types versus different cell types, for test data sets H3K4me3_TDH_immune and H3K4me3_TDH_other; models compared: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]
Train on one histone mark, test on another (same person and samples)
[Figure: labeled genomic regions on chr1-chrY; train on H3K36me3_TDH_immune labels, test on H3K4me3_TDH_immune labels (and vice versa), other labels unused. x-axis: position on chromosome (Mb = mega bases).]
Train on one histone mark, test on another (same person and samples)
[Figure: percent incorrect test labels for models trained on labels from the same experiment type versus a different experiment type, for test data sets H3K36me3_TDH_immune and H3K4me3_TDH_immune; models compared: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]
PeakSeg with 2 peaks more likely than default macs
Unconstrained model can have any changes
◮ The model above does NOT satisfy the PeakSeg up-down constraint (second change is up, should be down).
◮ Not directly interpretable as background and peaks.
PeakSeg constraint forces up then down changes
◮ Novelty of the PeakSeg model with respect to previous optimal segmentation models: the up-down constraint.
◮ Interpretable as background (S1, S3, ...) and peaks (S2, S4, ...).
What does our solution say about the PeakSeg solution?
◮ Our algorithm exactly solves a problem with non-strict inequality constraints.
◮ For example, with N = 3 data points and S = 3 segments:

  min_{m1 ≤ m2 ≥ m3} Σ_{t=1}^N m_t − z_t log m_t.

◮ But the PeakSeg problem has strict inequality constraints:

  min_{m1 < m2 > m3} Σ_{t=1}^N m_t − z_t log m_t.

When our algorithm returns equal values for adjacent segment means,
◮ our solution is not feasible for the PeakSeg problem, and
◮ the PeakSeg solution is undefined.
Example: 13, 14, 10, 1
What do you think is the best model with 1 peak? (3 segments, 1 change up, 1 change down)
Guessing game: https://github.com/tdhock/PeakSegFPOP-paper/blob/master/figure-min-undefined.R
PeakSegDP returns this highly suboptimal model
[Figure: aligned read counts 13, 14, 10, 1 at positions 1-4; model with a last-segment mean of 5.5; Poisson loss = −51.043373.]
A better model
[Figure: aligned read counts 13, 14, 10, 1; first two segment means 11.333 and 13.333; Poisson loss = −54.451874.]
Even better
[Figure: aligned read counts 13, 14, 10, 1; first two segment means 11.833 and 12.833; Poisson loss = −54.735311.]
Still better
[Figure: aligned read counts 13, 14, 10, 1; first two segment means 12.083 and 12.583; Poisson loss = −54.853063.]
We can keep going forever
[Figure: aligned read counts 13, 14, 10, 1; first two segment means 12.208 and 12.458; Poisson loss = −54.906104.]
Best model is not feasible
[Figure: aligned read counts 13, 14, 10, 1; first two segment means equal at 12.333333; Poisson loss = −54.955308.]
Unconstrained maximum likelihood segmentation
◮ We have a sequence of n count data z ∈ Z_+^n to segment.
◮ Choose the number of segments S ∈ {1, 2, ..., n}.

  minimize over m ∈ R^n, c ∈ {−1, 0, 1}^{n−1}:  Σ_{t=1}^n ℓ(m_t, z_t)
  subject to  1 + Σ_{t=1}^{n−1} I(c_t ≠ 0) = S,
              c_t = −1 ⇒ m_t > m_{t+1} (change down),
              c_t = 0  ⇒ m_t = m_{t+1} (no change),
              c_t = 1  ⇒ m_t < m_{t+1} (change up).

◮ The Poisson loss function is ℓ(m, z) = m − z log m.
◮ Every t such that c_t ≠ 0 is a change-point.
The PeakSeg constrained maximum likelihood model
  minimize over m ∈ R^n, c ∈ {−1, 0, 1}^{n−1}:  Σ_{t=1}^n ℓ(m_t, z_t)
  subject to  1 + Σ_{t=1}^{n−1} I(c_t ≠ 0) = S,
              c_t = −1 ⇒ m_t > m_{t+1} (change down),
              c_t = 0  ⇒ m_t = m_{t+1} (no change),
              c_t = 1  ⇒ m_t < m_{t+1} (change up),
              ∀t ∈ {1, ..., n − 1}, P_t(c) ∈ {0, 1}.

The only difference with the unconstrained problem is that we have added the constraint on the partial sums, P_t(c) = Σ_{i=1}^t c_i ∈ {0, 1}.
The problem solved by the PeakSegPDPA
  minimize over m ∈ R^n, c ∈ {−1, 0, 1}^{n−1}:  Σ_{t=1}^n ℓ(m_t, z_t)
  subject to  1 + Σ_{t=1}^{n−1} I(c_t ≠ 0) = S,
              c_t = −1 ⇒ m_t ≥ m_{t+1} (change down or no change),
              c_t = 0  ⇒ m_t = m_{t+1} (no change),
              c_t = 1  ⇒ m_t ≤ m_{t+1} (change up or no change),
              ∀t ∈ {1, ..., n − 1}, P_t(c) ∈ {0, 1}.

◮ The only difference with the PeakSeg problem is that we have changed the strict inequality constraints to non-strict inequality constraints.
◮ This model has at most S distinct segment means (some may be equal due to the non-strict inequality constraints).
Computation of optimal loss L_{s,t} for s = 1 segments up to data point t

  L_{1,t} = c_{(0,t]},  the optimal loss of the 1st segment (0, t].
Computation of optimal loss L_{s,t} for s = 2 segments up to data point t < d

  L_{2,t} = min_{t′ < t}  L_{1,t′}  +  c_{(t′,t]},
            (optimal loss in 1 segment up to t′) + (optimal loss of 2nd segment (t′, t]).
Computation of optimal loss L_{s,t} for s = 2 segments up to last data point t = d

  L_{2,t} = min_{t′ < t}  L_{1,t′}  +  c_{(t′,t]},
            (optimal loss in 1 segment up to t′) + (optimal loss of 2nd segment (t′, t]).
Dynamic programming is faster than grid search for s > 2 segments

Computation time in number of data points N:

  segments s  grid search  dynamic programming
  1           O(N)         O(N)
  2           O(N^2)       O(N^2)
  3           O(N^3)       O(N^2)
  4           O(N^4)       O(N^2)
  ...         ...          ...

For example, N = 5735 data points to segment:
N^2 = 32,890,225
N^3 = 188,625,440,375
...
Computation of optimal loss L_{s,t} for s = 3 segments up to data point t

  L_{3,t} = min_{t′ < t}  L_{2,t′}  +  c_{(t′,t]},
            (optimal loss in 2 segments up to t′) + (optimal loss of 3rd segment (t′, t]).
5 errors for coseg/Segmentor, only 1 error for PeakSegDP
[Figure: incorrect labels in the best competing model (macs, PeakSegDP, Segmentor) versus the best coseg model, with log10 counts of problems at each point.]
Constrained optimization worse than macs
[Figure: incorrect labels in the best competing model (macs, PeakSegDP, Segmentor) versus the best coseg model, with log10 counts of problems at each point.]
Functional pruning complete example
Pruning at data t = 34:   M_{3,34}(µ) = min{ C_{3,34}(µ), C≥_{2,34}(µ) }.
Cost up to data t = 35:   C_{3,35}(µ) = M_{3,34}(µ) + γ_35(µ).
Pruning at data t = 35:   M_{3,35}(µ) = min{ C_{3,35}(µ), C≥_{2,35}(µ) }.

[Figure: cost value C(µ) as a function of the mean µ of segment 3, at each step of the update.]
https://github.com/tdhock/PeakSegFPOP-paper
New coseg algorithm more accurate than unconstrained maximum likelihood Poisson model (Segmentor)
[Figure: incorrect labels in the best Segmentor model versus the best coseg model, with log10 counts of problems at each point.]
New coseg algorithm mostly agrees with slower inexact DP
[Figure: incorrect labels in the best PeakSegDP model versus the best coseg model, with log10 counts of problems at each point.]
New coseg algorithm sometimes disagrees with macs
[Figure: incorrect labels in the best macs model versus the best coseg model, with log10 counts of problems at each point.]
Constrained optimization better than macs
[Figure: incorrect labels in the best competing model (macs, PeakSegDP, Segmentor) versus the best coseg model, with log10 counts of problems at each point.]
0 errors for coseg/PeakSegDP, 6 errors for Segmentor
[Figure: incorrect labels in the best competing model (macs, PeakSegDP, Segmentor) versus the best coseg model, with log10 counts of problems at each point.]