A linear time algorithm for constrained optimal segmentation (PowerPoint presentation by Toby Dylan Hocking)



slide-1
SLIDE 1

A linear time algorithm for constrained optimal segmentation

Toby Dylan Hocking toby.hocking@mail.mcgill.ca joint work with Guillem Rigaill, Paul Fearnhead, Guillaume Bourque 11 Nov 2016

slide-2
SLIDE 2

Problem: optimizing ChIP-seq peak detection New linear time algorithm using functional pruning Results on benchmark data sets Conclusions and future work

slide-3
SLIDE 3

Chromatin immunoprecipitation sequencing (ChIP-seq)

Analysis of DNA-protein interactions. Source: “ChIP-sequencing,” Wikipedia.

slide-4
SLIDE 4

Problem: find peaks in each of several samples

[UCSC genome browser screenshot: chr11:118,095,000-118,125,000 (hg19), 10 kb scale, with tracks for UCSC genes, 100mer alignability (ENCODE/CRG), and H3K4me3 signal for four samples: McGill0002 monocyte, McGill0004 CD4-positive helper T cell, McGill0091 B cell, McGill0103 B cell; genes AMICA1, AK289390, MPZL3, MPZL2.]

Grey profiles are normalized aligned read count signals. Black bars are “peaks” called by MACS2 (Zhang et al, 2008):

◮ many false positives.
◮ overlapping peaks have different start/end positions.

slide-5
SLIDE 5

Previous work in genomic peak detection

◮ Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008.
◮ SICER, Zang et al, 2009.
◮ HOMER, Heinz et al, 2010.
◮ CCAT, Xu et al, 2010.
◮ RSEG, Song et al, 2011.
◮ Triform, Kornacker et al, 2012.
◮ Histone modifications in cancer (HMCan), Ashoor et al, 2013.
◮ PeakSeg, Hocking, Rigaill, Bourque, ICML 2015.
◮ PeakSegJoint, Hocking and Bourque, arXiv:1506.01286.
◮ ... dozens of others.

Two big questions: how to choose the best...

◮ ...algorithm? (testing)
◮ ...parameters? (training)

slide-6
SLIDE 6

How to choose parameters of unsupervised peak detectors?

19 parameters for Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008:
[-g GSIZE] [-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal] [--nomodel] [--extsize EXTSIZE | --shiftsize SHIFTSIZE] [-q QVALUE | -p PVALUE | -F FOLDENRICHMENT] [--to-large] [--down-sample] [--seed SEED] [--nolambda] [--slocal SMALLLOCAL] [--llocal LARGELOCAL] [--shift-control] [--half-ext] [--broad] [--broad-cutoff BROADCUTOFF] [--call-summits]

10 parameters for Histone modifications in cancer (HMCan), Ashoor et al, 2013:
minLength 145, medLength 150, maxLength 155, smallBinLength 50, largeBinLength 100000, pvalueThreshold 0.01, mergeDistance 200, iterationThreshold 5, finalThreshold 0, maxIter 20

slide-7
SLIDE 7

Which macs parameter is best for these data?

slide-8
SLIDE 8

Compute likelihood/loss of piecewise constant model

slide-9
SLIDE 9

Idea: choose the parameter with a lower loss

slide-10
SLIDE 10

PeakSeg: search for the peaks with lowest loss

Choose the number of peaks via standard penalties (AIC, BIC, ...), or learned penalties based on visual labels (more on this later).
slide-11
SLIDE 11

Maximum likelihood Poisson segmentation models

◮ Previous work: unconstrained maximum likelihood mean for s segments (s − 1 changes).
◮ Hocking et al, ICML 2015: the PeakSeg constraint enforces up, down, up, down changes (and not up, up, down).
◮ Odd-numbered segments are background noise; even-numbered segments are peaks.
◮ Constrained Dynamic Programming Algorithm, O(N²) time for N data points.

slide-12
SLIDE 12

But quadratic time is not fast enough for genomic data!

[Figure: hours of computation time versus N = data points to segment (up to 2×10⁵); the quadratic algorithm takes hours on large problems.]

◮ Genomic data is large, N ≥ 10⁶.
◮ Split into subsets? What if we split a peak in half?
◮ Need a linear time algorithm for analyzing the whole data set.

slide-13
SLIDE 13

Problem: optimizing ChIP-seq peak detection New linear time algorithm using functional pruning Results on benchmark data sets Conclusions and future work

slide-14
SLIDE 14

Statistical model is Poisson with change constraints

◮ We have N count data z1, . . . , zN ∈ Z+.
◮ Fix the number of segments S ∈ {1, 2, . . . , N}.
◮ PeakSeg model: zt ∼ Poisson(mt) such that the mean vector m has S − 1 up-down changes.
◮ Want to find means mt which maximize the Poisson likelihood: P(Z = zt | mt) = mt^zt e^−mt / (zt!).
◮ Equivalent to finding means mt which minimize the Poisson loss: ℓ(mt, zt) = mt − zt log mt.
◮ Naive computation is O(N^S), since there are O(N^(S−1)) possible positions for the S − 1 change-points, and it takes O(N) operations to compute the mean and loss for each.
◮ Comparison to Hidden Markov Models: same emission terms, but no transition terms; the constraint is on the number of changes rather than their values.
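The Poisson loss above is easy to check numerically. A minimal Python sketch (our own illustration, not code from any package discussed; helper names are ours) verifying that, for a single segment, the sample mean minimizes the total loss:

```python
import math

# Poisson loss l(m, z) = m - z*log(m): the negative Poisson log-likelihood
# of count z under mean m, dropping the constant log(z!) term.
def poisson_loss(mean, count):
    return mean - count * math.log(mean)

def total_loss(mean, counts):
    return sum(poisson_loss(mean, c) for c in counts)

z = [2, 1, 9, 5, 10, 3]          # example counts used later in the slides
best = sum(z) / len(z)           # for one segment, the MLE mean is the average
assert all(total_loss(best, z) <= total_loss(m, z)
           for m in (1.0, 4.0, 6.0, 10.0))
print(best, total_loss(best, z))
```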

slide-15
SLIDE 15

Relation to previous work

                      no pruning            functional pruning
unconstrained         Dynamic Programming   Pruned DP
                      exact O(N²)           exact O(N log N)
                      R pkg: changepoint    R pkgs: cghseg, Segmentor
up-down constrained   constrained DP        this work
                      inexact O(N²)         exact O(N log N)
                      R pkg: PeakSegDP      R pkg: coseg

◮ Auger and Lawrence 1989; Jackson et al, 2005.
◮ Rigaill 2010; Johnson 2013; Cleynen et al, 2014.
◮ Hocking, Rigaill, Bourque 2015.
◮ Contribution: a new algorithm that exactly computes the constrained optimal segmentation for N data points in O(N log N) time.


slide-19
SLIDE 19

Dynamic programming and functional pruning

Classical dynamic programming (Auger and Lawrence 1989) computes the matrix of optimal loss values in s segments up to data point t, in O(SN²) time:

L1,1 · · · L1,N
 ...        ...
LS,1 · · · LS,N

Dynamic programming with functional pruning (Rigaill 2010) computes a matrix of loss functions, where Ls,t(µs) is the optimal loss up to data point t if segment s has mean µs, in O(SN log N) time:

L1,1(µ1) · · · L1,N(µ1)
   ...            ...
LS,1(µS) · · · LS,N(µS)

Contribution of this work: a new algorithm that applies the functional pruning technique to the up-down constrained model.

slide-20
SLIDE 20

First segment, first data point

◮ For data z1, . . . , zN ∈ Z+, let γt(µ) = ℓ(µ, zt) = µ − zt log µ be the Poisson loss for each t ∈ {1, . . . , N}.
◮ For example z = 2, 1, 9, 5, 10, 3.
◮ Then γ1(µ) = L1,1(µ) = 1µ − 2 log µ + 0.
◮ Need to store 3 coefficients (linear, log, constant).

[Figure: cost L1,1 as a function of the mean µ ∈ [1, 10]; 1 segment, 1 data point.]

slide-21
SLIDE 21

First segment, other data points

◮ The loss of the first segment up to data point t is L1,t(µ) = Σ_{i=1}^{t} γi(µ).
◮ For example z = 2, 1, 9, 5, 10, 3.
◮ L1,2(µ) = 2µ − 3 log µ + 0.
◮ L1,3(µ) = 3µ − 12 log µ + 0.
◮ ...

[Figures: cost as a function of the mean µ for 1 segment, 2 data points, and for 1 segment, 3 data points.]
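The coefficient representation on this slide can be sketched in a few lines of Python (an illustration of the idea only; the coseg package implements this in C++, and these helper names are ours):

```python
import math

# Each cost function a*mu - b*log(mu) + c is stored as a coefficient
# triple (a, b, c); the per-point loss gamma_t(mu) = mu - z_t*log(mu)
# is the triple (1, z_t, 0).
def gamma(z):
    return (1.0, float(z), 0.0)

def add(f, g):
    # sum of two cost functions: coefficients add term by term
    return (f[0] + g[0], f[1] + g[1], f[2] + g[2])

def evaluate(f, mu):
    a, b, c = f
    return a * mu - b * math.log(mu) + c

z = [2, 1, 9, 5, 10, 3]
L = gamma(z[0])               # L_{1,1}(mu) = 1*mu - 2*log(mu) + 0
L = add(L, gamma(z[1]))       # L_{1,2}(mu) = 2*mu - 3*log(mu) + 0
L = add(L, gamma(z[2]))       # L_{1,3}(mu) = 3*mu - 12*log(mu) + 0
print(L)                      # (3.0, 12.0, 0.0), matching the slide
```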

slide-22
SLIDE 22

Second segment, up to data point 2

◮ The mean cost in 2 segments up to data point 2 is
L2,2(µ2) = γ2(µ2) + min_{µ1 ≤ µ2} L1,1(µ1) = γ2(µ2) + L≤1,1(µ2).
◮ The min-less operator is L≤(µ) = min_{x ≤ µ} L(x).

[Figure: cost as a function of the mean µ; 2 segments, 2 data points.]
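The min-less operator can be approximated on a grid of candidate means as a running (prefix) minimum; a small Python sketch (our own illustration, not library code):

```python
import math

# The min-less operator Lle(mu) = min over x <= mu of L(x), computed
# on a grid of candidate means as a running (prefix) minimum.
grid = [1 + i * 0.01 for i in range(901)]        # mu in [1, 10]
L11 = [mu - 2 * math.log(mu) for mu in grid]     # L_{1,1}(mu) for z_1 = 2

min_less, run = [], float("inf")
for value in L11:
    run = min(run, value)
    min_less.append(run)

# Lle is non-increasing, never above L, and flat past the minimum at mu = 2.
assert all(a >= b for a, b in zip(min_less, min_less[1:]))
assert all(m <= v for m, v in zip(min_less, L11))
print(min_less[-1])   # approximately 2 - 2*log(2), about 0.6137
```

The constant value past µ = 2, about 0.6137, is exactly the constant stored on the interval µ ∈ [2, 10] in the storage slide below.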

slide-23
SLIDE 23

Comparison with unconstrained Pruned DPA

◮ For our constrained algorithm, the first segment mean must be less than the second, so the previous-segment cost term is a function of µ2:
L2,2(µ2) = γ2(µ2) + min_{µ1 ≤ µ2} L1,1(µ1) = γ2(µ2) + L≤1,1(µ2).
◮ For the unconstrained algorithm, it is a constant:
L2,2(µ2) = γ2(µ2) + min_{µ1} L1,1(µ1) = γ2(µ2) + L1,1.
◮ For example z = 2, 1, 9, 5, 10, 3.

[Figure: previous-segment cost as a function of the mean, constrained (min-less) versus unconstrained (constant min); 2 segments, 2 data points.]

slide-24
SLIDE 24

Storage as a piecewise function on intervals

◮ For example z = 2, 1, 9, 5, 10, 3.
◮ Storage:
L2,2(µ) = γ2(µ) +
  1µ − 2 log µ + 0        if µ ∈ [1, 2]   (the function L1,1(µ)),
  0µ − 0 log µ + 0.6137   if µ ∈ [2, 10]  (the constant min of L1,1).

[Figure: cost as a function of the mean µ; 2 segments, 2 data points.]

slide-25
SLIDE 25

Second segment, up to data point 3

◮ For data point 3 we need to consider two possible change-points:
L2,3(µ) = γ3(µ) + min{ L≤1,2(µ) (change up after data point 2), L2,2(µ) (change up after data point 1) }.
◮ For z = 2, 1, 9, 5, 10, 3 the min operation prunes the change after data point 1.

[Figure: cost as a function of the mean µ, colored by previous segment end (1 or 2); 2 segments, 3 data points.]

slide-26
SLIDE 26

Second segment, up to data point t

◮ The updates continue for every data point t ∈ {3, . . . , N}:
L2,t(µ) = γt(µ) + min{ L≤1,t−1(µ) (change up after t − 1), L2,t−1(µ) (change up before t − 1) }.
◮ For example, for z = 2, 1, 9, 5, 10, 3, at data point t = 4 we only need to consider changes after 2 and 3 (the change after 1 has been pruned).

[Figure: cost as a function of the mean µ, colored by previous segment end (2 or 3); 2 segments, 4 data points.]

slide-27
SLIDE 27

General functional pruning equation

◮ The constrained cost of a mean µ for segment s, up to data point t:
Ls,t(µ) = γt(µ) + min{ Ls,t−1(µ), L*s−1,t−1(µ) },
where * is the min-less or min-more operator.
◮ The time complexity of the min and min-less/more operators is linear in the number of stored intervals, which is empirically O(log N).

[Figure: intervals stored (max, median and inter-quartile range, roughly 50 to 250) versus log10(data points to segment), for models with 2 to 5 segments.]

◮ Total time complexity: O(SN log N).
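The whole recursion can be illustrated with a grid approximation, replacing the exact piecewise-function updates with vectors of cost values over candidate means (our own sketch under that simplification; the real algorithm stores exact function pieces):

```python
import math

# Grid approximation of the constrained recursion on this slide:
#   L[s][t](mu) = gamma_t(mu) + min{ L[s][t-1](mu), L*[s-1][t-1](mu) },
# where * is min-less for a change up (into an even segment) and
# min-more for a change down (into an odd segment).
z = [2, 1, 9, 5, 10, 3]
grid = [0.5 + i * 0.005 for i in range(2001)]   # candidate means in [0.5, 10.5]

def gamma(zt):
    return [mu - zt * math.log(mu) for mu in grid]

def min_less(f):                                # prefix minimum: min over x <= mu
    out, run = [], float("inf")
    for v in f:
        run = min(run, v)
        out.append(run)
    return out

def min_more(f):                                # suffix minimum: min over x >= mu
    return list(reversed(min_less(list(reversed(f)))))

S = 3                                           # 3 segments = 1 peak
cost = [None] * (S + 1)
cost[1] = gamma(z[0])
for t in range(1, len(z)):
    g = gamma(z[t])
    for s in range(min(S, t + 1), 1, -1):       # update larger s first
        push = min_less(cost[s - 1]) if s % 2 == 0 else min_more(cost[s - 1])
        prev = cost[s] if cost[s] is not None else [float("inf")] * len(grid)
        cost[s] = [g_i + min(p, q) for g_i, p, q in zip(g, prev, push)]
    cost[1] = [a + b for a, b in zip(cost[1], g)]
constrained = min(cost[S])                      # optimal up-down constrained loss
print(constrained)
```

Since the candidate-mean grid contains the exact segment means of the optimal model, the grid minimum matches the exact optimum here; in general the grid version is only an approximation, which is why the real algorithm manipulates exact function pieces.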

slide-28
SLIDE 28

Problem: optimizing ChIP-seq peak detection New linear time algorithm using functional pruning Results on benchmark data sets Conclusions and future work

slide-29
SLIDE 29

Previous work in computer vision: look and add labels to...

Photos, labels: names (face detection, CVPR 2013, 246 papers). Source: http://en.wikipedia.org/wiki/Face_detection
Cell images, labels: phenotypes (CellProfiler, 873 citations). Source: Jones et al, PNAS 2009, Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning.
Copy number profiles, labels: alterations (SegAnnDB, Hocking et al, 2014).

slide-30
SLIDE 30

Benchmark data sets, algorithms

http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/

◮ Hocking et al, Bioinformatics (2016): Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning.
◮ 37 labeled H3K4me3 samples (sharp peak pattern).
◮ 29 labeled H3K36me3 samples (broad peak pattern).
◮ 12,826 labeled regions with and without peaks.
◮ 2,752 separate segmentation problems.

Algorithms for segmenting N data points:

package             constraint          exact?   complexity
coseg (this work)   µ1 ≤ µ2 ≥ µ3 ...    yes      O(N log N)
PeakSegDP           µ1 < µ2 > µ3 ...    no       O(N²)
Segmentor           none                yes      O(N log N)

Segmentor loss ≤ coseg loss ≤ PeakSegDP loss.

slide-31
SLIDE 31

Linear time algorithms faster for larger data sets

[Figure: log10(seconds) versus log10(data points to segment), timings on 2752 histone mark ChIP-seq data sets; coseg O(N log N) and Segmentor O(N log N) scale better than PeakSegDP O(N²), with times ranging from about 1 second to over 1 hour.]

Total time to compute 10 models (0, ..., 9 peaks) for all data sets:
◮ PeakSegDP: 156 hours, inexact.
◮ coseg: 6 hours, exact.

slide-32
SLIDE 32

8 false negative labels for models with 0 peaks

slide-33
SLIDE 33

Models with 1 peak are better (6 FN)

slide-34
SLIDE 34

Models with 2 peaks are better still (4 FN)

slide-35
SLIDE 35

Models with 3 peaks are the same (4 FN)

slide-36
SLIDE 36

Models with 4 peaks are better (2 FN)

slide-37
SLIDE 37

Models with 5 peaks have no incorrect labels

slide-38
SLIDE 38

Models with 6 peaks are worse (1 false positive)

slide-39
SLIDE 39

Constrained optimization better than macs

slide-40
SLIDE 40

0 errors for coseg/PeakSegDP, 6 errors for Segmentor

slide-41
SLIDE 41

Minimum train error in all data sets

             errors   fp     fn     models   problems
PeakSegDP    677      116    561    27469    2738
coseg        789      94     695    21278    1498
macs         1293     519    774
Segmentor    1544     46     1498   8106     4
hmcan.broad  2778     367    2411
possible     12826    11037  7225   27520    2752

◮ Segmentor, PeakSegDP and coseg were used to compute up to 10 models for each problem (1, ..., 19 segments = 0, ..., 9 peaks).
◮ errors = fp + fn = total number of incorrect labels, after picking the best parameter for each of the 2752 separate problems.
◮ models = number of computed models that obey the up-down constraint.
◮ The new coseg algorithm has minimum train error almost as good as the slower PeakSegDP algorithm.
◮ The other algorithms are much less accurate.

slide-42
SLIDE 42

Max-margin penalty learning algorithm

[Figure: log(penalty) versus input feature = log(max(coverage)) for samples McGill0002, McGill0091, McGill0322, McGill0004, with selected models of 1, 2 and 7 peaks; a learned large-margin penalty function yields 0 errors, a small-margin function 0 errors, and a constant penalty 1 error.]

http://bl.ocks.org/tdhock/raw/9311ca39d643d127e04a088814c81ee1/

slide-43
SLIDE 43

ROC curves for one test fold

http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/figure-roc-test/

slide-44
SLIDE 44

Test AUC on 7 benchmark data sets

[Figure: test AUC (area under the Receiver Operating Characteristic curve) for MACS, HMCanBroad, Segmentor, PeakSegDP and coseg on 7 benchmark data sets: H3K36me3 AM immune, H3K36me3 TDH immune, H3K36me3 TDH other, H3K4me3 PGP immune, H3K4me3 TDH immune, H3K4me3 TDH other, H3K4me3 XJ immune.]

◮ 4-fold cross-validation: train on 3/4 of labels, test on 1/4.
◮ HMCanBroad is accurate for broad H3K36me3 data but not sharp H3K4me3 data.
◮ MACS is accurate for sharp H3K4me3 data but not broad H3K36me3 data.
◮ The unconstrained Segmentor algorithm is not as accurate as the up-down constrained algorithms (coseg, PeakSegDP).
◮ The proposed algorithm in the coseg R package yields state-of-the-art accuracy in all benchmark data sets.

http://bl.ocks.org/tdhock/raw/886575874144c3b172ce6b7d7d770b9f/

slide-45
SLIDE 45

Problem: optimizing ChIP-seq peak detection New linear time algorithm using functional pruning Results on benchmark data sets Conclusions and future work

slide-46
SLIDE 46

How to choose parameters of unsupervised peak detectors?

19 parameters for Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008:
[-g GSIZE] [-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal] [--nomodel] [--extsize EXTSIZE | --shiftsize SHIFTSIZE] [-q QVALUE | -p PVALUE | -F FOLDENRICHMENT] [--to-large] [--down-sample] [--seed SEED] [--nolambda] [--slocal SMALLLOCAL] [--llocal LARGELOCAL] [--shift-control] [--half-ext] [--broad] [--broad-cutoff BROADCUTOFF] [--call-summits]

10 parameters for Histone modifications in cancer (HMCan), Ashoor et al, 2013:
minLength 145, medLength 150, maxLength 155, smallBinLength 50, largeBinLength 100000, pvalueThreshold 0.01, mergeDistance 200, iterationThreshold 5, finalThreshold 0, maxIter 20

slide-47
SLIDE 47

PeakSeg: search for the peaks with lowest loss

Simple model with only one parameter to train (maxPeaks).

slide-48
SLIDE 48

Linear time algorithms faster for larger data sets

[Figure: log10(seconds) versus log10(data points to segment), timings on 2752 histone mark ChIP-seq data sets; coseg O(N log N) and Segmentor O(N log N) scale better than PeakSegDP O(N²), with times ranging from about 1 second to over 1 hour.]

Total time to compute 10 models (0, ..., 9 peaks) for all data sets:
◮ PeakSegDP: 156 hours, inexact.
◮ coseg: 6 hours, exact.

slide-49
SLIDE 49

Conclusions

                      no pruning            functional pruning
unconstrained         Dynamic Programming   Pruned DP
                      exact O(N²)           exact O(N log N)
                      R pkg: changepoint    R pkgs: cghseg, Segmentor
up-down constrained   constrained DP        this work
                      inexact O(N²)         exact O(N log N)
                      R pkg: PeakSegDP      R pkg: coseg

◮ New algorithm that exactly computes the constrained optimal change-points/peaks for N data points.
◮ C++ code in the coseg R package, O(N log N) memory: https://github.com/tdhock/coseg
◮ PeakSegFPOP program for big data (N > 10⁶): O(log N) memory and O(N log N) disk space.
◮ TODO: regularized isotonic regression solver.
◮ TODO: supervised peak calling for ENCODE, Roadmap, ...
◮ TODO: interactive web app for creating labels.

slide-50
SLIDE 50

Thanks for your attention!

Questions? toby.hocking@mail.mcgill.ca

◮ coseg R package: https://github.com/tdhock/coseg
◮ PeakSegFPOP command line program: https://github.com/tdhock/PeakSegFPOP
◮ Source code for these slides: https://github.com/tdhock/PeakSegFPOP-paper

slide-51
SLIDE 51

Segmenting whole chromosomes?

◮ 365 regions with no gaps in hg19.
◮ 272 regions with no gaps on chr1-22, X, Y.
◮ Smallest: 31,833 bases (chr6:157,609,467-157,641,300).
◮ Largest: 115,591,997 bases (chr4:75,452,279-191,044,276).

slide-52
SLIDE 52

Reasonable time to segment biggest region on chr1

[Figure: minutes of computation time (0.1 to 30) versus N = data points to segment (100,000 to 4,485,563), for the in-memory and on-disk algorithms.]

◮ R package in memory: O(N log N) time.
◮ Command line program on disk: O(N log N) time.

slide-53
SLIDE 53

Memory requirements reasonable for on-disk version

[Figure: memory usage in gigabytes (0.001 to 1193) versus N = data points to segment (100,000 to 115,591,997 = bases in the biggest chr1 region with no gap; 4,485,563 = bedGraph lines in that region), for the in-memory and on-disk algorithms.]

◮ R package in memory: O(N log N) memory.
◮ Command line program on disk: O(log N) memory.

slide-54
SLIDE 54

Disk usage reasonable

[Figure: disk usage in gigabytes (0.1 to 32.2) versus N = data points to segment (100,000 to 4,485,563), for the in-memory and on-disk algorithms.]

◮ R package in memory: no disk usage.
◮ Command line program: O(N log N) disk space (temporary).

slide-55
SLIDE 55

Two annotators provide consistent labels, but different precision

◮ TDH peakStart/peakEnd labels are more precise than AM peaks labels.
◮ AM noPeaks labels are more precise than TDH regions with no label.

slide-56
SLIDE 56

Train on one person, test on another (same histone mark and samples)

[Figure: train/test split by labeled genomic region across chr1-chrY for all benchmark data sets; labels from one annotator (train H3K4me3_PGP_immune or train H3K4me3_TDH_immune) are used for training and the other annotator's labels for testing.]

slide-57
SLIDE 57

Train on one person, test on another (same histone mark and samples)

[Figure: percent incorrect test labels for models trained on labels from the same person (x axis) versus a different person (y axis), for test data sets H3K4me3_PGP_immune and H3K4me3_TDH_immune; models: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]

slide-58
SLIDE 58

Train on some samples, test on others (same histone mark and person)

[Figure: train/test split by labeled genomic region; labels from some samples (train H3K4me3_TDH_immune or train H3K4me3_TDH_other) are used for training and labels from other samples for testing.]

slide-59
SLIDE 59

Train on some samples, test on others (same histone mark and person)

[Figure: percent incorrect test labels for models trained on labels from the same cell types versus different cell types, for test data sets H3K4me3_TDH_immune and H3K4me3_TDH_other; models: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]

slide-60
SLIDE 60

Train on one histone mark, test on another (same person and samples)

[Figure: train/test split by labeled genomic region; labels from one histone mark (train H3K36me3_TDH_immune or train H3K4me3_TDH_immune) are used for training and labels from the other mark for testing.]

slide-61
SLIDE 61

Train on one histone mark, test on another (same person and samples)

[Figure: percent incorrect test labels for models trained on labels from the same experiments versus different experiments, for test data sets H3K36me3_TDH_immune and H3K4me3_TDH_immune; models: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]

slide-62
SLIDE 62

PeakSeg with 2 peaks more likely than default macs

slide-63
SLIDE 63

Unconstrained model can have any changes

◮ The model above does NOT satisfy the PeakSeg up-down constraint (the second change is up, but should be down).
◮ It is not directly interpretable as background and peaks.

slide-64
SLIDE 64

PeakSeg constraint forces up then down changes

◮ Novelty of the PeakSeg model with respect to previous optimal segmentation models: the up-down constraint.
◮ Interpretable as background (S1, S3, ...) and peaks (S2, S4, ...).

slide-65
SLIDE 65

What does our solution say about the PeakSeg solution?

◮ Our algorithm exactly solves a problem with non-strict inequality constraints.
◮ For example, for N = 3 data points and S = 3 segments:
min_{m1 ≤ m2 ≥ m3} Σ_{t=1}^{N} mt − zt log mt.
◮ But the PeakSeg problem has strict inequality constraints:
min_{m1 < m2 > m3} Σ_{t=1}^{N} mt − zt log mt.
When our algorithm returns equal values for adjacent segment means:
◮ our solution is not feasible for the PeakSeg problem, and
◮ the PeakSeg solution is undefined.

slide-66
SLIDE 66

Example: 13, 14, 10, 1

What do you think is the best model with 1 peak? (3 segments: 1 change up, 1 change down.) Guessing game: https://github.com/tdhock/PeakSegFPOP-paper/blob/master/figure-min-undefined.R

slide-67
SLIDE 67

PeakSegDP returns this highly suboptimal model

[Figure: aligned read counts z = 13, 14, 10, 1 at positions 1-4, segmented as [13], [14], [10, 1] with means 13, 14, 5.5; Poisson loss = −51.043373.]

slide-68
SLIDE 68

A better model

[Figure: read counts 13, 14, 10, 1 segmented as [13, 14], [10], [1] with means 11.333, 13.333, 1; Poisson loss = −54.451874.]

slide-69
SLIDE 69

Even better

[Figure: read counts 13, 14, 10, 1 with segment means 11.833, 12.833, 1; Poisson loss = −54.735311.]

slide-70
SLIDE 70

Still better

[Figure: read counts 13, 14, 10, 1 with segment means 12.083, 12.583, 1; Poisson loss = −54.853063.]

slide-71
SLIDE 71

We can keep going forever

[Figure: read counts 13, 14, 10, 1 with segment means 12.208, 12.458, 1; Poisson loss = −54.906104.]

slide-72
SLIDE 72

Best model is not feasible

[Figure: read counts 13, 14, 10, 1 with equal segment means 12.333333, 12.333333 (and 1); Poisson loss = −54.955308. The strict-inequality optimum is approached but never attained.]
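The loss values on these slides can be reproduced directly; a short Python check (the segment assignments are our reconstruction, verified against the printed loss values):

```python
import math

# Poisson loss of a segmentation: sum over data points of m - z*log(m).
def model_loss(means, counts):
    return sum(m - z * math.log(m) for m, z in zip(means, counts))

z = [13, 14, 10, 1]

# PeakSegDP's suboptimal model: segments [13], [14], [10, 1], means 13, 14, 5.5.
dp = model_loss([13, 14, 5.5, 5.5], z)

# Path of better models: segments [13, 14], [10], [1], with the first two
# segment means approaching equality at (13 + 14 + 10) / 3 = 12.333...
better = model_loss([34 / 3, 34 / 3, 40 / 3, 1], z)   # means 11.333, 13.333, 1
limit = model_loss([37 / 3, 37 / 3, 37 / 3, 1], z)    # equal means: infeasible
print(round(dp, 6), round(better, 6), round(limit, 6))
# -51.043373 -54.451874 -54.955308
```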

slide-73
SLIDE 73

Unconstrained maximum likelihood segmentation

◮ We have a sequence of N count data z1, . . . , zN ∈ Z+ to segment.
◮ Choose the number of segments S ∈ {1, 2, . . . , N}.

minimize over m ∈ R^N and c ∈ {−1, 0, 1}^(N−1):  Σ_{t=1}^{N} ℓ(mt, zt)
subject to  1 + Σ_{t=1}^{N−1} I(ct ≠ 0) = S,
ct = −1 ⇒ mt > mt+1 (change down),
ct = 0 ⇒ mt = mt+1 (no change),
ct = 1 ⇒ mt < mt+1 (change up).

◮ The Poisson loss function is ℓ(m, z) = m − z log m.
◮ Every t such that ct ≠ 0 is a change-point.

slide-74
SLIDE 74

The PeakSeg constrained maximum likelihood model

minimize over m ∈ R^N and c ∈ {−1, 0, 1}^(N−1):  Σ_{t=1}^{N} ℓ(mt, zt)
subject to  1 + Σ_{t=1}^{N−1} I(ct ≠ 0) = S,
ct = −1 ⇒ mt > mt+1 (change down),
ct = 0 ⇒ mt = mt+1 (no change),
ct = 1 ⇒ mt < mt+1 (change up),
∀t ∈ {1, . . . , N − 1}, Pt(c) ∈ {0, 1}.

The only difference with the unconstrained problem is that we have added the constraint on the peak indicator Pt(c) = Σ_{i=1}^{t} ci ∈ {0, 1}.
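The peak indicator constraint is easy to state in code; a tiny Python sketch (helper names are ours):

```python
# The peak indicator P_t(c) is the cumulative sum of the change variables;
# PeakSeg requires P_t(c) in {0, 1}: 0 = background, 1 = inside a peak.
def peak_indicators(changes):
    total, out = 0, []
    for c in changes:
        total += c
        out.append(total)
    return out

def feasible(changes):
    return all(p in (0, 1) for p in peak_indicators(changes))

assert feasible([1, 0, -1, 1, -1])       # up, down, up, down: valid peaks
assert not feasible([1, 1, -1])          # two changes up in a row: invalid
assert not feasible([-1, 1])             # starts with a change down: invalid
print(peak_indicators([1, 0, -1, 1, -1]))  # [1, 1, 0, 1, 0]
```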

slide-75
SLIDE 75

The problem solved by the PeakSegPDPA algorithm

minimize over m ∈ R^N and c ∈ {−1, 0, 1}^(N−1):  Σ_{t=1}^{N} ℓ(mt, zt)
subject to  1 + Σ_{t=1}^{N−1} I(ct ≠ 0) = S,
ct = −1 ⇒ mt ≥ mt+1 (change down or no change),
ct = 0 ⇒ mt = mt+1 (no change),
ct = 1 ⇒ mt ≤ mt+1 (change up or no change),
∀t ∈ {1, . . . , N − 1}, Pt(c) ∈ {0, 1}.

◮ The only difference with the PeakSeg problem is that we have changed the strict inequality constraints to non-strict inequality constraints.
◮ This model has at most S distinct segment means (some may be equal due to the non-strict inequality constraints).

slide-76
SLIDE 76

Computation of optimal loss Ls,t for s = 1 segments up to data point t

L1,t = c(0,t], the optimal loss of the 1st segment (0, t].
slide-79
SLIDE 79

Computation of optimal loss Ls,t for s = 2 segments up to data point t < d

L2,t = min_{t′ < t} { L1,t′ + c(t′,t] },
where L1,t′ is the optimal loss in 1 segment up to t′, and c(t′,t] is the optimal loss of the 2nd segment (t′, t].
slide-83
SLIDE 83

Computation of optimal loss Ls,t for s = 2 segments up to last data point t = d

L2,t = min_{t′ < t} { L1,t′ + c(t′,t] },
where L1,t′ is the optimal loss in 1 segment up to t′, and c(t′,t] is the optimal loss of the 2nd segment (t′, t], which now extends to the last data point.
slide-89
SLIDE 89

Dynamic programming is faster than grid search for s > 2 segments

Computation time in number of data points N:

segments s   grid search   dynamic programming
1            O(N)          O(N)
2            O(N²)         O(N²)
3            O(N³)         O(N²)
4            O(N⁴)         O(N²)
...          ...           ...

For example, with N = 5735 data points to segment: N² = 32,890,225; N³ = 188,625,440,375; ...
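The classical O(SN²) dynamic programming recursion in the table above can be sketched as follows (our own illustration, using cumulative sums so each candidate segment is scored in O(1)):

```python
import math

# Classical O(S*N^2) dynamic programming for unconstrained segmentation.
def pooled_loss(csum, i, j):
    """Optimal Poisson loss of one segment over counts[i:j] (mean = MLE)."""
    n, total = j - i, csum[j] - csum[i]
    mean = total / n
    return n * mean - total * math.log(mean)    # assumes total > 0

def best_loss(counts, S):
    N = len(counts)
    csum = [0] * (N + 1)
    for i, c in enumerate(counts):
        csum[i + 1] = csum[i] + c
    INF = float("inf")
    # L[s][t] = optimal loss of s segments covering counts[:t]
    L = [[INF] * (N + 1) for _ in range(S + 1)]
    for t in range(1, N + 1):
        L[1][t] = pooled_loss(csum, 0, t)
    for s in range(2, S + 1):
        for t in range(s, N + 1):               # last change-point at u
            L[s][t] = min(L[s - 1][u] + pooled_loss(csum, u, t)
                          for u in range(s - 1, t))
    return L[S][N]

print(best_loss([2, 1, 9, 5, 10, 3], 3))
```

The two nested loops over t and u give the O(N²) factor per segment; grid search over all change-point positions would instead cost O(N^S).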

slide-90
SLIDE 90

Computation of optimal loss Ls,t for s = 3 segments up to data point t

L3,t = min_{t′ < t} { L2,t′ + c(t′,t] },
where L2,t′ is the optimal loss in 2 segments up to t′, and c(t′,t] is the optimal loss of the 3rd segment (t′, t].
slide-94
SLIDE 94

5 errors for coseg/Segmentor, only 1 error for PeakSegDP

[Figure: number of benchmark problems (log10 scale) by incorrect labels in the best coseg model versus incorrect labels in the best competing model (macs, PeakSegDP, Segmentor).]


slide-96
SLIDE 96

Constrained optimization worse than macs

[Figure: number of benchmark problems (log10 scale) by incorrect labels in the best coseg model versus incorrect labels in the best competing model (macs, PeakSegDP, Segmentor).]


slide-99
SLIDE 99

Functional pruning complete example

[Figure, three panels plotting cost value C(µ) against the mean µ of segment 3. Left, pruning at data t = 34: M3,34 = min{C3,34, C≥2,34}. Middle, cost up to data t = 35: C3,35 = M3,34 + γ35. Right, pruning at data t = 35: M3,35 = min{C3,35, C≥2,35}.]

https://github.com/tdhock/PeakSegFPOP-paper

slide-100
SLIDE 100

New coseg algorithm more accurate than unconstrained maximum likelihood Poisson model (Segmentor)

[Figure: number of benchmark problems (log10 scale) by incorrect labels in the best coseg model versus incorrect labels in the best Segmentor model.]

slide-101
SLIDE 101

New coseg algorithm mostly agrees with slower inexact DP

[Figure: number of benchmark problems (log10 scale) by incorrect labels in the best coseg model versus incorrect labels in the best PeakSegDP model.]

slide-102
SLIDE 102

New coseg algorithm sometimes disagrees with macs

[Figure: number of benchmark problems (log10 scale) by incorrect labels in the best coseg model versus incorrect labels in the best macs model.]

slide-103
SLIDE 103

Constrained optimization better than macs

[Figure: number of benchmark problems (log10 scale) by incorrect labels in the best coseg model versus incorrect labels in the best competing model (macs, PeakSegDP, Segmentor).]

slide-104
SLIDE 104

0 errors for coseg/PeakSegDP, 6 errors for Segmentor

[Figure: number of benchmark problems (log10 scale) by incorrect labels in the best coseg model versus incorrect labels in the best competing model (macs, PeakSegDP, Segmentor).]