Genome-wide supervised ChIP-seq peak detection


SLIDE 1

Genome-wide supervised ChIP-seq peak detection

Toby Dylan Hocking, toby.hocking@mail.mcgill.ca
joint work with Guillem Rigaill, Paul Fearnhead, Guillaume Bourque
26 Jan 2017

SLIDE 2

Problem: optimizing ChIP-seq peak detection
Segment neighborhood model (constraint on number of peaks)
Results on benchmark data (labeled chromosome subsets)
Optimal partitioning model (penalize number of peaks)
Conclusions and future work

SLIDE 3

Chromatin immunoprecipitation sequencing (ChIP-seq)

Analysis of DNA-protein interactions. Source: “ChIP-sequencing,” Wikipedia.

SLIDE 4

Problem: find peaks in each of several samples

[UCSC genome browser view, chr11:118,095,000-118,125,000 (hg19, 10 kb scale): H3K4me3 coverage signal for four samples (McGill0002 monocyte, McGill0004 CD4-positive helper T cell, McGill0091 B cell, McGill0103 B cell), with UCSC gene annotations (AMICA1, AK289390, MPZL3, MPZL2) and the CRG 100mer alignability track from ENCODE.]

Grey profiles are normalized aligned read count signals. Black bars are “peaks” called by MACS2 (Zhang et al, 2008):

◮ many false positives.
◮ overlapping peaks have different start/end positions.

SLIDE 5

Previous work in genomic peak detection

◮ Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008.
◮ SICER, Zang et al, 2009.
◮ HOMER, Heinz et al, 2010.
◮ CCAT, Xu et al, 2010.
◮ RSEG, Song et al, 2011.
◮ Triform, Kornacker et al, 2012.
◮ Histone modifications in cancer (HMCan), Ashoor et al, 2013.
◮ PeakSeg, Hocking, Rigaill, Bourque, ICML 2015.
◮ PeakSegJoint, Hocking and Bourque, arXiv:1506.01286.
◮ ... dozens of others.

Two big questions: how to choose the best...

◮ ...algorithm? (testing)
◮ ...parameters? (training)

SLIDE 6

How to choose parameters of unsupervised peak detectors?

19 parameters for Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008:

[-g GSIZE] [-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal]
[--nomodel] [--extsize EXTSIZE | --shiftsize SHIFTSIZE]
[-q QVALUE | -p PVALUE | -F FOLDENRICHMENT] [--to-large]
[--down-sample] [--seed SEED] [--nolambda] [--slocal SMALLLOCAL]
[--llocal LARGELOCAL] [--shift-control] [--half-ext] [--broad]
[--broad-cutoff BROADCUTOFF] [--call-summits]

10 parameters for Histone modifications in cancer (HMCan), Ashoor et al, 2013:

minLength 145
medLength 150
maxLength 155
smallBinLength 50
largeBinLength 100000
pvalueThreshold 0.01
mergeDistance 200
iterationThreshold 5
finalThreshold 0
maxIter 20

SLIDE 7

Which MACS parameter is best for these data?

SLIDE 8

Compute likelihood/loss of piecewise constant model

SLIDE 9

Idea: choose the parameter with a lower loss

SLIDE 10

PeakSeg: search for the peaks with lowest loss

Choose the number of peaks via standard penalties (AIC, BIC, ...) or learned penalties based on visual labels (more on this later).
SLIDE 11

Maximum likelihood Poisson segmentation models

◮ Previous work: unconstrained maximum likelihood mean for s segments (s − 1 changes), Cleynen et al 2014.
◮ Hocking et al, ICML 2015: PeakSeg constraint enforces up, down, up, down changes (and not up, up, down).
◮ Odd-numbered segments are background noise, even-numbered segments are peaks.
◮ Constrained Dynamic Programming Algorithm, O(N^2) time for N data points.

SLIDE 12

But quadratic time is not fast enough for genomic data!

[Figure: hours of computation time versus number of data points N to segment (up to 2×10^5), for models with 1, 2, 3 segments.]

◮ Genomic data is large, N ≥ 10^6.
◮ Split into subsets? What if we split a peak in half?
◮ Need a linear time algorithm for analyzing the whole data set.

SLIDE 13

Problem: optimizing ChIP-seq peak detection
Segment neighborhood model (constraint on number of peaks)
Results on benchmark data (labeled chromosome subsets)
Optimal partitioning model (penalize number of peaks)
Conclusions and future work

SLIDE 14

Statistical model is Poisson with change constraints

◮ We have N count data z_1, . . . , z_N ∈ Z_+.
◮ Fix the number of segments S ∈ {1, 2, . . . , N}.
◮ PeakSeg model: z_t ∼ Poisson(m_t) such that the mean vector m has S − 1 up-down changes.
◮ Want to find means m_t which maximize the Poisson likelihood: P(Z = z_t | m_t) = m_t^{z_t} e^{−m_t} / (z_t!).
◮ Equivalent to finding means m_t which minimize the Poisson loss: ℓ(m_t, z_t) = m_t − z_t log m_t.
◮ Naive computation is O(N^S), since there are O(N^{S−1}) possible positions for the S − 1 change-points, and it takes O(N) operations to compute the mean and loss for each.
◮ Comparison to Hidden Markov Model: same emission terms, but no transition terms; the constraint is on the number of changes rather than their values.
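To make the loss concrete, here is a minimal C++ sketch (an illustration added here, not from the original slides): it evaluates the Poisson loss ℓ(m, z) = m − z log m on the running example z = 2, 1, 0, 4 used in the following slides, under an arbitrary candidate fit.

#include <cmath>
#include <cstdio>
#include <vector>

// Poisson loss for one data point z with segment mean m,
// dropping the constant log(z!) term: l(m, z) = m - z*log(m).
double poisson_loss(double m, double z) {
  return m - z * std::log(m);
}

int main() {
  // Running example from the following slides: z = 2, 1, 0, 4.
  std::vector<double> z = {2, 1, 0, 4};
  // A candidate fit with 2 segments: mean 1 on z[0..2], mean 4 on z[3].
  std::vector<double> m = {1, 1, 1, 4};
  double total = 0;
  for (size_t t = 0; t < z.size(); t++) {
    total += poisson_loss(m[t], z[t]);
  }
  std::printf("total Poisson loss = %f\n", total);
}

The dynamic programming algorithms below search over all such fits that satisfy the up-down constraint, without enumerating them.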

SLIDE 15

Relation to previous dynamic programming algorithms

                     no pruning            functional pruning
unconstrained        Dynamic Programming   Pruned DP
                     exact O(SN^2)         exact O(SN log N)
                     R pkg: changepoint    R pkgs: cghseg, Segmentor
up-down constrained  constrained DP        this work
                     inexact O(SN^2)       exact O(SN log N)
                     R pkg: PeakSegDP      R pkg: coseg

◮ Auger and Lawrence, Algorithms for the optimal identification of segment neighborhoods (Bull Math Biol 1989).
◮ Independent discovery: Rigaill 2010, Johnson 2013. R package Segmentor: Cleynen et al 2014.
◮ Hocking, Rigaill, Bourque 2015.
◮ Contribution: a new algorithm that exactly computes the constrained optimal segmentation for N data points and 1, . . . , S segments in O(SN log N) time.


SLIDE 19

Dynamic programming and functional pruning

Classical dynamic programming (Auger and Lawrence 1989) computes the matrix of optimal loss values in s segments up to t data points, O(SN^2):

L_{1,1} · · · L_{1,N}
   ⋮             ⋮
L_{S,1} · · · L_{S,N}

Dynamic programming with functional pruning (Rigaill 2010, Johnson 2013) computes a matrix of loss functions, the optimal loss up to t data points if segment s has mean µ_s, O(SN log N):

L_{1,1}(µ_1) · · · L_{1,N}(µ_1)
     ⋮                 ⋮
L_{S,1}(µ_S) · · · L_{S,N}(µ_S)

Contribution of this work: a new algorithm that applies the functional pruning technique to the up-down constrained model.

SLIDE 20

First segment, first data point

◮ For data z_1, . . . , z_N ∈ Z_+ let γ_t(µ) = ℓ(µ, z_t) = µ − z_t log µ be the Poisson loss for each t ∈ {1, . . . , N}.
◮ For example z = 2, 1, 0, 4.
◮ Then γ_1(µ) = L_{1,1}(µ) = 1µ − 2 log µ + 0.
◮ Need to store 3 coefficients (linear, log, constant).

[Figure: cost L_{1,1} as a function of the mean µ; 1 segment, 1 data point.]

SLIDE 21

First segment, other data points

◮ The loss of the first segment up to data point t is L_{1,t}(µ) = Σ_{i=1}^{t} γ_i(µ).
◮ For example z = 2, 1, 0, 4.
◮ L_{1,2}(µ) = 2µ − 3 log µ + 0.
◮ L_{1,3}(µ) = 3µ − 3 log µ + 0.
◮ ...

[Figure: cost as a function of the mean; 1 segment, 2 data points, and 1 segment, 3 data points.]
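Each cost function is fully determined by its three coefficients, and adding a data point just adds coefficients. A minimal C++ sketch of this representation (illustrative only; the real implementation stores one such triple per interval of µ):

#include <cmath>
#include <cstdio>
#include <vector>

// A function of the form Linear*mu + Log*log(mu) + Constant.
struct PoissonCoefs {
  double Linear, Log, Constant;
  double eval(double mu) const {
    return Linear * mu + Log * std::log(mu) + Constant;
  }
  void add(const PoissonCoefs &o) {
    Linear += o.Linear; Log += o.Log; Constant += o.Constant;
  }
};

// gamma_t for one data point z_t: mu - z_t*log(mu).
PoissonCoefs gamma_t(double z) { return {1.0, -z, 0.0}; }

int main() {
  std::vector<double> z = {2, 1, 0, 4};
  PoissonCoefs L = {0, 0, 0};
  for (int t = 0; t < 3; t++) L.add(gamma_t(z[t])); // accumulate L_{1,3}
  // L is now 3*mu - 3*log(mu) + 0; at mu = 1 the value is 3.
  std::printf("L_{1,3}(1) = %f\n", L.eval(1.0));
}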

SLIDE 22

Second segment, up to data point 2

◮ The mean cost in 2 segments up to data point 2 is
  L_{2,2}(µ_2) = γ_2(µ_2) + min_{µ_1 ≤ µ_2} L_{1,1}(µ_1) = γ_2(µ_2) + L^≤_{1,1}(µ_2).
◮ The min-less operator is L^≤(µ) = min_{x ≤ µ} L(x).

[Figure: cost as a function of the mean; 2 segments, 2 data points.]

SLIDE 23

Comparison with unconstrained Pruned DPA

◮ For our constrained algorithm, the first segment mean must be less than the second, so the first segment cost term is a function of µ_2:
  L_{2,2}(µ_2) = γ_2(µ_2) + min_{µ_1 ≤ µ_2} L_{1,1}(µ_1) = γ_2(µ_2) + L^≤_{1,1}(µ_2).
◮ For the unconstrained algorithm, it is a constant that does not depend on µ_2:
  L_{2,2}(µ_2) = γ_2(µ_2) + min_{µ_1} L_{1,1}(µ_1).
◮ For example z = 2, 1, 0, 4.

[Figure: constrained min-less cost, previous cost, and unconstrained min, as functions of the mean; 2 segments, 2 data points.]

SLIDE 24

Storage as a piecewise function on intervals

◮ For example z = 2, 1, 0, 4.

[Figure: cost as a function of the mean; 2 segments, 2 data points.]

◮ Storage: L_{2,2}(µ) = γ_2(µ) + L^≤_{1,1}(µ), where
  L^≤_{1,1}(µ) = 1µ − 2 log µ + 0        if µ ∈ [0, 2],
  L^≤_{1,1}(µ) = 0µ − 0 log µ + 0.6137   if µ ∈ [2, 4].
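The constant 0.6137 is just the minimum of the first piece, which a quick calculation confirms:

\[
\frac{d}{d\mu}\left(\mu - 2\log\mu\right) = 1 - \frac{2}{\mu} = 0
\;\Rightarrow\; \mu = 2,
\qquad
\min_{x \le \mu} L_{1,1}(x) = 2 - 2\log 2 \approx 0.6137
\quad\text{for all } \mu \ge 2.
\]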

SLIDE 25

Second segment, up to data point 3

◮ For data point 3 we need to consider two change-points:
  L_{2,3}(µ) = γ_3(µ) + min{ L^≤_{1,2}(µ), L_{2,2}(µ) },
  where L^≤_{1,2}(µ) means a change up after data point 2, and L_{2,2}(µ) means a change up after data point 1.
◮ For z = 2, 1, 0, 4 the min operation prunes the change after data point 1.

[Figure: cost as a function of the mean, colored by previous segment end (1 or 2); 2 segments, 3 data points.]

SLIDE 26

Second segment, up to data point t

◮ The updates continue for every data point t ∈ {3, . . . , N}:
  L_{2,t}(µ) = γ_t(µ) + min{ L^≤_{1,t−1}(µ), L_{2,t−1}(µ) },
  where L^≤_{1,t−1}(µ) means a change up after t − 1, and L_{2,t−1}(µ) means a change up before t − 1.
◮ For example for z = 2, 1, 0, 4, at data point t = 4 we only need to consider changes after 2 and 3 (1 has been pruned).

[Figure: cost as a function of the mean, colored by previous segment end (2 or 3); 2 segments, 4 data points.]

SLIDE 27

Dynamic programming for segment neighborhood model

◮ Dynamic programming update rule: the constrained cost of mean µ for segment s, up to data point t, is
  L_{s,t}(µ) = γ_t(µ) + min{ L_{s,t−1}(µ), L*_{s−1,t−1}(µ) },
  where * is the min-less operator ≤ for a change up into a peak segment, and the min-more operator ≥ for a change down into a background segment.
◮ Time complexity of the min and min-less/more operators is linear in the number of intervals stored, which is empirically sub-linear, O(log N).

[Figure: median and inter-quartile range of the number of intervals stored (roughly 50 to 250) versus log10(data points to segment).]

◮ Total time complexity: O(SN log N).
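The same recursion can be demonstrated with a runnable toy: below, each cost function L_{s,t} is discretized on a grid of candidate means, so the min-less and min-more operators become prefix and suffix minima. The real algorithm instead represents each function exactly on intervals; this grid version is only an approximation for illustration.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> z = {2, 1, 0, 4};
  const int N = (int)z.size(), S = 3; // 3 segments = 1 peak
  const int G = 1000;                 // grid resolution for the mean mu
  const double INF = 1e100;
  std::vector<double> mu(G);
  for (int g = 0; g < G; g++) mu[g] = 0.01 + 4.0 * g / (G - 1);
  auto gam = [](double m, double zt) { return m - zt * std::log(m); };
  // L[s][t][g] = cost in s segments up to data point t, last mean mu[g].
  std::vector<std::vector<std::vector<double>>> L(
      S + 1, std::vector<std::vector<double>>(N + 1, std::vector<double>(G, INF)));
  for (int g = 0; g < G; g++) L[1][1][g] = gam(mu[g], z[0]);
  for (int t = 2; t <= N; t++)
    for (int g = 0; g < G; g++)
      L[1][t][g] = L[1][t - 1][g] + gam(mu[g], z[t - 1]);
  for (int s = 2; s <= S; s++)
    for (int t = s; t <= N; t++) {
      std::vector<double> env(G);
      double best = INF;
      if (s % 2 == 0) // even segment = peak: change up, min-less = prefix min
        for (int g = 0; g < G; g++) env[g] = best = std::min(best, L[s - 1][t - 1][g]);
      else            // odd segment = background: change down, min-more = suffix min
        for (int g = G - 1; g >= 0; g--) env[g] = best = std::min(best, L[s - 1][t - 1][g]);
      for (int g = 0; g < G; g++)
        L[s][t][g] = std::min(env[g], L[s][t - 1][g]) + gam(mu[g], z[t - 1]);
    }
  std::printf("approx optimal loss in %d segments: %f\n", S,
              *std::min_element(L[S][N].begin(), L[S][N].end()));
}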

SLIDE 28

Problem: optimizing ChIP-seq peak detection
Segment neighborhood model (constraint on number of peaks)
Results on benchmark data (labeled chromosome subsets)
Optimal partitioning model (penalize number of peaks)
Conclusions and future work

SLIDE 29

Previous work in computer vision: look and add labels to...

          Photos           Cell images     Copy number profiles
Labels:   names            phenotypes      alterations
          CVPR 2013        CellProfiler    SegAnnDB
          246 papers       873 citations   Hocking et al, 2014

Sources: http://en.wikipedia.org/wiki/Face_detection; Jones et al PNAS 2009, Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning.

SLIDE 30

Benchmark data sets, algorithms

http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/

◮ Hocking et al Bioinformatics (2016). Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning.
◮ 37 labeled H3K4me3 samples (sharp peak pattern).
◮ 29 labeled H3K36me3 samples (broad peak pattern).
◮ 12,826 labeled regions with and without peaks.
◮ 2,752 separate segmentation problems.

Algorithms for segmenting N data points:

package            constraint           exact?  complexity
coseg (this work)  µ1 ≤ µ2 ≥ µ3 . . .   yes     O(N log N)
PeakSegDP          µ1 < µ2 > µ3 . . .   no      O(N^2)
Segmentor          none                 yes     O(N log N)

Segmentor loss ≤ coseg loss ≤ PeakSegDP loss.

SLIDE 31

Linear time algorithms faster for larger data sets

[Figure: log10(seconds) versus log10(data points to segment), timings on 2752 histone mark ChIP-seq data sets, with reference lines at 1 second, 1 minute, 1 hour: coseg O(N log N), Segmentor O(N log N), PeakSegDP O(N^2).]

Total time to compute 10 models (0, . . . , 9 peaks) for all data sets:

◮ PeakSegDP: 156 hours, inexact.
◮ coseg: 6 hours, exact.

SLIDE 32

8 false negative labels for models with 0 peaks

SLIDE 33

Models with 1 peak are better (6 FN)

SLIDE 34

Models with 2 peaks are better still (4 FN)

SLIDE 35

Models with 3 peaks are the same (4 FN)

SLIDE 36

Models with 4 peaks are better (2 FN)

SLIDE 37

Models with 5 peaks have no incorrect labels

SLIDE 38

Models with 6 peaks are worse (1 false positive)

SLIDE 39

Constrained optimization better than MACS

SLIDE 40

0 errors for coseg/PeakSegDP, 6 errors for Segmentor

SLIDE 41

Minimum train error in all data sets

algorithm    errors  fp      fn     feasible models
PeakSegDP    677     116     561    27,469
coseg        789     94      695    21,278
macs         1,293   519     774
Segmentor    1,544   46      1,498  8,106
hmcan.broad  2,778   367     2,411
possible     12,826  11,037  7,225  27,520

◮ Segmentor, PeakSegDP, and coseg were used to compute up to 10 models for each problem (1, . . . , 19 segments = 0, . . . , 9 peaks).
◮ errors = fp + fn = total number of incorrect labels, after picking the best parameter for each of the 2,752 separate problems.
◮ models = number of computed models that obey the up-down constraint.
◮ The new coseg algorithm has minimum train error almost as good as the slower PeakSegDP algorithm.
◮ The other algorithms are much less accurate.

SLIDE 42

Max-margin penalty learning algorithm

[Figure: log(penalty) versus input feature log(max(coverage)) for samples McGill0002, McGill0004, McGill0091, McGill0322, each labeled with its selected number of peaks (1, 2, or 7): a large margin learned function makes 0 errors, a small margin function makes 0 errors, and a constant function makes 1 error.]

http://bl.ocks.org/tdhock/raw/9311ca39d643d127e04a088814c81ee1/
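As a rough illustration of the idea (a sketch of max-margin interval regression with assumed toy data, not the solver used in the papers): each labeled problem i yields a target interval [lo_i, hi_i] of log(penalty) values achieving minimal label error, and we fit an affine function of the feature x_i = log(max(coverage)) whose predictions land inside the intervals with a margin.

#include <cstdio>
#include <vector>

struct Problem { double x, lo, hi; }; // feature, target interval of log(penalty)

int main() {
  // Hypothetical training data: four labeled problems.
  std::vector<Problem> train = {
      {8.0, 3.2, 4.1}, {9.5, 3.9, 4.8}, {11.0, 4.6, 5.5}, {13.0, 5.2, 6.0}};
  double w = 0, b = 0;
  const double margin = 0.1, rate = 0.01;
  // Subgradient descent on a hinge loss that is zero when the
  // prediction w*x + b is inside [lo + margin, hi - margin].
  for (int it = 0; it < 10000; it++) {
    double gw = 0, gb = 0;
    for (const Problem &p : train) {
      double f = w * p.x + b;
      if (f < p.lo + margin) { gw -= p.x; gb -= 1; }
      if (f > p.hi - margin) { gw += p.x; gb += 1; }
    }
    w -= rate * gw;
    b -= rate * gb;
  }
  std::printf("log(penalty) = %.3f * log(max coverage) + %.3f\n", w, b);
}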

SLIDE 43

ROC curves for one test fold

http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/figure-roc-test/

SLIDE 44

Test AUC on 7 benchmark data sets

[Figure: test AUC (area under the Receiver Operating Characteristic curve) of each algorithm (MACS, HMCanBroad, Segmentor, PeakSegDP, coseg) on 7 benchmark data sets: H3K36me3 AM immune, H3K36me3 TDH immune, H3K36me3 TDH other, H3K4me3 PGP immune, H3K4me3 TDH immune, H3K4me3 TDH other, H3K4me3 XJ immune.]

◮ 4-fold cross-validation: train on 3/4 of labels, test on 1/4.
◮ HMCanBroad is accurate for broad H3K36me3 data but not sharp H3K4me3 data.
◮ MACS is accurate for sharp H3K4me3 but not broad H3K36me3 data.
◮ The unconstrained Segmentor algorithm is not as accurate as the up-down constrained algorithms (coseg, PeakSegDP).
◮ The proposed algorithm in the coseg R package yields state-of-the-art accuracy in all benchmark data sets.

http://bl.ocks.org/tdhock/raw/886575874144c3b172ce6b7d7d770b9f/

SLIDE 45

Segmenting whole chromosomes?

◮ 365 regions with no gaps in hg19.
◮ 272 regions with no gaps on chr1-22, X, Y.
◮ Smallest: 31,833 bases (chr6:157,609,467-157,641,300).
◮ Largest: 115,591,997 bases (chr4:75,452,279-191,044,276).

SLIDE 46

But the segment neighborhood algorithm is too slow for large genomic regions

◮ Time complexity O(PN log N) to compute models with 0, . . . , P peaks in N data.
◮ Since P = O(N), the time complexity is O(N^2 log N), which is too slow for large genome regions.

[Figure: position and extent of the gap-free regions on each hg19 chromosome (chr1-22, the chr6 and chr17 alternate haplotypes, chrX, chrY), with the number of regions per chromosome; positions in megabases.]

SLIDE 47

Problem: optimizing ChIP-seq peak detection
Segment neighborhood model (constraint on number of peaks)
Results on benchmark data (labeled chromosome subsets)
Optimal partitioning model (penalize number of peaks)
Conclusions and future work

SLIDE 48

Optimal partitioning model penalizes the number of changes

◮ We have N count data z_1, . . . , z_N ∈ Z_+.
◮ PeakSeg Segment Neighborhood model: find the means m_t which minimize the Poisson loss, Σ_{t=1}^{N} ℓ(m_t, z_t), such that there are exactly S − 1 up-down changes.
◮ PeakSeg Optimal Partitioning model: find the means m_t with up-down changes which minimize the penalized Poisson loss,
  λ Σ_{t=1}^{N−1} I(m_t ≠ m_{t+1}) + Σ_{t=1}^{N} ℓ(m_t, z_t).
◮ No constraint on the number of changes, but λ > 0 penalizes models with more changes.
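Evaluating this penalized objective for a given fit is straightforward; a small self-contained check (the example means and λ are chosen arbitrarily for illustration):

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> z = {2, 1, 0, 4};
  std::vector<double> m = {1, 1, 1, 4}; // candidate piecewise constant means
  const double lambda = 1.0;            // arbitrary penalty, for illustration
  double loss = 0;
  int changes = 0;
  for (size_t t = 0; t < z.size(); t++) {
    loss += m[t] - z[t] * std::log(m[t]);                // l(m_t, z_t)
    if (t + 1 < z.size() && m[t] != m[t + 1]) changes++; // I(m_t != m_{t+1})
  }
  std::printf("penalized loss = %f (loss %f + lambda * %d changes)\n",
              loss + lambda * changes, loss, changes);
}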

SLIDE 49

Relation to previous dynamic programming algorithms

                     no pruning           functional pruning
unconstrained        Dynamic Programming  FPOP
                     exact O(N^2)         exact O(N log N)
                                          R package: fpop
up-down constrained                       this work
                                          exact O(N log N)
                                          R package: coseg

◮ Jackson et al 2005, An algorithm for optimal partitioning of data on an interval.
◮ Independent discovery: Johnson 2013, Maidstone et al 2016 (FPOP = Functional Pruning Optimal Partitioning).
◮ Contribution: a new dynamic programming algorithm that exactly computes the constrained optimal segmentation for N data points and a single penalty λ in O(N log N) time.
◮ All of these algorithms compute a model with S segments without having to compute the models with 1, . . . , S − 1 segments.


SLIDE 52

Dynamic programming and functional pruning for the optimal partitioning model

Classical dynamic programming (Jackson et al 2005) computes the optimal loss values up to each of the N data points, O(N^2):

L_1 · · · L_N

Dynamic programming with functional pruning (Johnson 2013, Maidstone et al 2016) computes loss functions, the optimal loss up to t data points as a function of the last segment mean µ, O(N log N):

L_1(µ) · · · L_N(µ)

Contribution of this work: a new algorithm that applies the functional pruning technique to the up-down constrained model.

SLIDE 53

Dynamic programming algorithm (PeakSegFPOP)

◮ Input: data z_1, . . . , z_N ∈ Z_+, penalty λ ≥ 0.
◮ Initialization t = 1, the optimal loss if the first data point has mean µ on a background/down segment:
  L^down_1(µ) = ℓ(µ, z_1).
◮ Initialization t = 2:
  L^up_2(µ) = ℓ(µ, z_2) + L^down,≤_1(µ) + λ,
  L^down_2(µ) = ℓ(µ, z_2) + L^down_1(µ).
◮ Updates: for all t > 2,
  L^up_t(µ) = ℓ(µ, z_t) + min{ L^up_{t−1}(µ), L^down,≤_{t−1}(µ) + λ },
  L^down_t(µ) = ℓ(µ, z_t) + min{ L^down_{t−1}(µ), L^up,≥_{t−1}(µ) + λ }.
◮ Output: a 2 × N matrix of optimal loss functions (up costs on peak segments, down costs on background segments):
          L^up_2(µ) · · · L^up_{N−1}(µ)
  L^down_1(µ) L^down_2(µ) · · · L^down_{N−1}(µ) L^down_N(µ)
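Using the same grid discretization as in the earlier segment neighborhood sketch, the optimal partitioning recursion needs only two cost vectors (up and down) and one penalty λ. Again, this is an illustrative approximation, not the exact piecewise-function implementation:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> z = {2, 1, 0, 4, 4, 1};
  const double lambda = 1.0; // arbitrary penalty, for illustration
  const int G = 1000;
  const double INF = 1e100;
  std::vector<double> mu(G);
  for (int g = 0; g < G; g++) mu[g] = 0.01 + 4.0 * g / (G - 1);
  auto loss = [](double m, double zt) { return m - zt * std::log(m); };
  std::vector<double> up(G, INF), down(G);
  for (int g = 0; g < G; g++) down[g] = loss(mu[g], z[0]); // t = 1
  for (size_t t = 1; t < z.size(); t++) {
    // Min-less of down (prefix min) and min-more of up (suffix min).
    std::vector<double> down_le(G), up_ge(G);
    double best = INF;
    for (int g = 0; g < G; g++) down_le[g] = best = std::min(best, down[g]);
    best = INF;
    for (int g = G - 1; g >= 0; g--) up_ge[g] = best = std::min(best, up[g]);
    std::vector<double> nup(G), ndown(G);
    for (int g = 0; g < G; g++) {
      nup[g] = std::min(up[g], down_le[g] + lambda) + loss(mu[g], z[t]);
      ndown[g] = std::min(down[g], up_ge[g] + lambda) + loss(mu[g], z[t]);
    }
    up.swap(nup);
    down.swap(ndown);
  }
  // The model must end on a background/down segment.
  std::printf("approx optimal penalized loss = %f\n",
              *std::min_element(down.begin(), down.end()));
}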

SLIDE 54

Implementation details

◮ Weighted loss function for compressed bedGraph data:
  2,1,1,1,0,0,0,4,4 ⇒ counts z_t = 2, 1, 0, 4 with weights w_t = 1, 3, 3, 2;
  ℓ(m_t, z_t, w_t) = w_t(m_t − z_t log m_t).
◮ L_t stores the mean rather than the total loss: with cumulative weights W_{1:t} = Σ_{i=1}^{t} w_i,
  L_t(µ) = [ ℓ(µ, z_t, w_t) + W_{1:(t−1)} min{ L_{t−1}(µ), L^≤_{t−1}(µ) + λ } ] / W_{1:t}.
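A sketch of the run-length compression step (plain C++, illustration only); bedGraph files already store coverage in this run-length form, one line per run of equal counts:

#include <cstdio>
#include <utility>
#include <vector>

int main() {
  std::vector<int> raw = {2, 1, 1, 1, 0, 0, 0, 4, 4};
  std::vector<std::pair<int, int>> runs; // (count z_t, weight w_t)
  for (int v : raw) {
    if (!runs.empty() && runs.back().first == v)
      runs.back().second++;   // extend the current run
    else
      runs.push_back({v, 1}); // start a new run
  }
  for (const auto &r : runs)
    std::printf("count=%d weight=%d\n", r.first, r.second);
  // Prints: (2,1), (1,3), (0,3), (4,2).
}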

◮ Root finding: solve a log µ + bµ + c = 0 (linear as µ → ∞), or with x = log µ, ax + b e^x + c = 0 (linear as x → −∞), by Newton root finding.
◮ Since the update rule at t depends only on t − 1, we can store the other functions 1, . . . , t − 2 on disk.
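These roots locate the breakpoints where two cost pieces cross. A toy Newton solver for the first equation (not the package's code; the example coefficients are chosen arbitrarily so that a root exists):

#include <cmath>
#include <cstdio>

// Solve f(mu) = a*log(mu) + b*mu + c = 0 for mu > 0 by Newton's method.
double newton_root(double a, double b, double c, double mu) {
  for (int it = 0; it < 50; it++) {
    double f = a * std::log(mu) + b * mu + c;
    double fprime = a / mu + b;
    double step = f / fprime;
    mu -= step;
    if (std::fabs(step) < 1e-12) break;
  }
  return mu;
}

int main() {
  // Example: mu - 2*log(mu) - 1 = 0 has roots mu = 1 and mu ~ 3.51.
  std::printf("root near mu = 4: %f\n", newton_root(-2, 1, -1, 4.0));
}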
SLIDE 55

BerkeleyDB C++ Standard Template Library

In memory C++ STL:

#include <vector>
std::vector<PiecewisePoissonLoss> cost_model_mat;

On disk BerkeleyDB STL:

#include <dbstl_vector.h>
// Functions to serialize/unserialize PiecewisePoissonLoss.
dbstl::DbstlElemTraits<PiecewisePoissonLoss> *funTraits =
  dbstl::DbstlElemTraits<PiecewisePoissonLoss>::instance();
funTraits->set_size_function(PiecewiseFunSize);
funTraits->set_copy_function(PiecewiseFunCopy);
funTraits->set_restore_function(PiecewiseFunRestore);
// Tell BerkeleyDB a file to store the data in the vector.
DbEnv *env = NULL;
Db *db = dbstl::open_db(
  env, "/path/to/database.db", DB_RECNO, DB_CREATE, 0);
dbstl::db_vector<PiecewisePoissonLoss> cost_model_mat(db, env);
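The on-disk variant keeps the same vector indexing interface as std::vector, so the dynamic programming code itself is unchanged. DB_RECNO gives record-number keyed access, which matches the access pattern here: each update only reads position t − 1, so older cost functions can stay on disk.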

SLIDE 56

Two C++ implementations of PeakSegFPOP algorithm

◮ PeakSegFPOP function in the coseg R package (constrained optimal segmentation), limited by memory.
  https://github.com/tdhock/coseg
◮ PeakSegFPOP command line program for big data, N > 10^6.
  https://github.com/tdhock/PeakSegFPOP

implementation   time        memory      disk
R package coseg  O(N log N)  O(N log N)
command line     O(N log N)  O(log N)    O(N log N)

N          min(MB)  max(MB)  min(time)  max(time)
10,000     12       43       1 sec      2 sec
100,000    189      627      12 sec     25 sec
1,000,000  3462     7148     3 min      5 min
7,135,956  5042     41695    18 min     56 min
7,806,082  5270     33425    35 min     167 min

SLIDE 57

Reasonable time to segment biggest region on chr1

[Figure: minutes of computation (0.1 to 30) versus N = data points to segment (100,000 to 4,485,563), for the in-memory and on-disk implementations.]

◮ R package in memory: O(N log N) time.
◮ Command line program on disk: O(N log N) time.

SLIDE 58

Memory requirements reasonable for on-disk version

[Figure: memory usage in gigabytes (0.001 to 20) versus N = data points to segment, up to the biggest chr1 region with no gap (4,485,563 bedGraph lines) and the biggest region with no gap (115,591,997 bases), for the in-memory and on-disk implementations.]

◮ R package in memory: O(N log N) memory.
◮ Command line program on disk: O(log N) memory.

SLIDE 59

Disk usage reasonable

[Figure: disk usage in gigabytes (0.1 to 32.2) versus N = data points to segment, for the in-memory and on-disk implementations.]

◮ R package in memory: no disk usage.
◮ Command line program: O(N log N) disk space (temporary).

SLIDE 60

Finding the optimal interval of penalty values

Broad peak pattern (H3K36me3), chr11:96,437,584-134,946,516.

penalty        peaks   status      fp  fn
Inf                0   feasible         2
5265297.3831       3   feasible         2
3368752.0956       7   feasible         2
3005673.8348       8   feasible
2686562.9821      10   feasible
794862.0964       29   feasible
411324.5305       43   feasible
375144.4400       44   feasible
361535.5526       45   feasible     1
333932.0146       47   feasible     1
283667.3315       51   feasible     1
211252.2486       61   feasible     1
55303.9465       123   feasible     2
2715.5727       1986   infeasible   5
233.2671       54409   infeasible   5
0.0000        671283   infeasible   5

SLIDE 61

Finding the optimal interval of penalty values

Sharp peak pattern (H3K4me3), chr11:96,437,584-134,946,516.

penalty       peaks   status      fp  fn
Inf               0   feasible     0  12
1232627.283      45   feasible         9
599902.993       84   feasible         6
467450.211       96   infeasible       4
436473.003       99   infeasible       4
427664.944      100   infeasible       3
423816.446      101   infeasible       3
415415.071      102   infeasible       3
375629.581      108   infeasible       3
227473.010      138   infeasible       3
144688.178      173   infeasible       3
117251.695      205   infeasible       3
103355.006      219   infeasible       3
99984.891       227   infeasible       3
97920.192       229   infeasible       3
97532.045       230   infeasible   4   3
97119.082       231   infeasible   4   3
95124.208       233   infeasible   4   3
46623.871       385   infeasible   7
12159.960       905   infeasible  13
1039.473      21060   infeasible  19
0.000        260804   infeasible  20

SLIDE 62

More peaks in sharp H3K4me3 than in broad H3K36me3

SLIDE 63

Smaller peaks in sharp H3K4me3 than in broad H3K36me3

SLIDE 64

Only need 10–25 runs to find optimal penalty interval

[Figure: number of penalties for which PeakSegFPOP was run (between 10 and 25) versus log10(number of data to segment = lines in bedGraph file): the number of penalties needed to find the target interval is roughly constant.]
SLIDE 65

Target interval computation time

SLIDE 66

Pipeline

Input:

◮ S bigWig coverage data files,
◮ problems.bed file with P genomic regions to segment,
◮ labels.txt file(s): gold standard locations with and without peaks.

For example S = 37 samples, P = 374 problems in hg19.

Step                          Jobs  Time/job
Create separate problems      1     5 min
Compute separate targets      SP    1–10 hours
Train separate model          1     5 min
Separate peak prediction      P     1–10 hours
Train joint model             1     5 min
Joint peak prediction         P     1–10 hours
Combine and plot predictions  1     10–60 minutes

Output: matrix of joint peak predictions, samples × peaks.

SLIDE 67

Problem: optimizing ChIP-seq peak detection
Segment neighborhood model (constraint on number of peaks)
Results on benchmark data (labeled chromosome subsets)
Optimal partitioning model (penalize number of peaks)
Conclusions and future work

SLIDE 68

How to choose parameters of unsupervised peak detectors?

19 parameters for Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008:

[-g GSIZE] [-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal]
[--nomodel] [--extsize EXTSIZE | --shiftsize SHIFTSIZE]
[-q QVALUE | -p PVALUE | -F FOLDENRICHMENT] [--to-large]
[--down-sample] [--seed SEED] [--nolambda] [--slocal SMALLLOCAL]
[--llocal LARGELOCAL] [--shift-control] [--half-ext] [--broad]
[--broad-cutoff BROADCUTOFF] [--call-summits]

10 parameters for Histone modifications in cancer (HMCan), Ashoor et al, 2013:

minLength 145
medLength 150
maxLength 155
smallBinLength 50
largeBinLength 100000
pvalueThreshold 0.01
mergeDistance 200
iterationThreshold 5
finalThreshold 0
maxIter 20

SLIDE 69

PeakSeg: search for the peaks with lowest loss

Simple model with only one parameter to train (maxPeaks).

SLIDE 70

Conclusions

                     no pruning           functional pruning
unconstrained        Dynamic Programming  PDPA, FPOP
                     exact O(N^2)         exact O(N log N)
                     R pkg: changepoint   R pkgs: cghseg, Segmentor
up-down constrained  constrained DP       this work
                     inexact O(N^2)       exact O(N log N)
                     R pkg: PeakSegDP     R pkg: coseg

◮ New algorithm that exactly computes the constrained optimal change-points/peaks for N data points.
◮ C++ code in the coseg R package, O(N log N) memory: https://github.com/tdhock/coseg
◮ PeakSegFPOP program for big N > 10^6 data, O(log N) memory and O(N log N) disk space.
◮ TODO: ICML paper.
◮ TODO: C3G-GSOC pipeline project.
◮ TODO: web app for interactive labeling and model learning.

SLIDE 71

Thanks for your attention!

Questions? toby.hocking@mail.mcgill.ca

◮ coseg R package: https://github.com/tdhock/coseg
◮ PeakSegFPOP command line program: https://github.com/tdhock/PeakSegFPOP
◮ Source code for these slides: https://github.com/tdhock/PeakSegFPOP-paper

SLIDE 72

Two annotators provide consistent labels, but different precision

◮ TDH peakStart/peakEnd more precise than AM peaks.
◮ AM noPeaks more precise than TDH no label.

SLIDE 73

Train on one person, test on another (same histone mark and samples)

[Figure: genomic positions (in megabases, chr1-chrY) of labeled regions in the benchmark data sets (H3K36me3 AM immune; H3K36me3 TDH immune; H3K36me3 TDH other; H3K4me3 PGP immune; H3K4me3 TDH immune; H3K4me3 TDH other; H3K4me3 XJ immune; max NTNU; k562 nrsf NTNU; k562 srf NTNU; Gm12878), showing the train/test/unused split when training on H3K4me3_PGP_immune or H3K4me3_TDH_immune.]

SLIDE 74

Train on one person, test on another (same histone mark and samples)

[Figure: percent incorrect test labels for a model trained on labels from the same person (x axis) versus a different person (y axis), on test data sets H3K4me3_PGP_immune and H3K4me3_TDH_immune; points above the diagonal mean more errors in the model trained on a different person, below the diagonal mean fewer. Models compared: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]

SLIDE 75

Train on some samples, test on others (same histone mark and person)

[Figure: genomic positions of labeled regions in the benchmark data sets, showing the train/test/unused split when training on H3K4me3_TDH_immune or H3K4me3_TDH_other.]

SLIDE 76

Train on some samples, test on others (same histone mark and person)

[Figure: percent incorrect test labels for a model trained on labels from the same cell types versus different cell types, on test data sets H3K4me3_TDH_immune and H3K4me3_TDH_other; points above the diagonal mean more errors in the model trained on different cell types. Models compared: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]

SLIDE 77

Train on one histone mark, test on another (same person and samples)

[Figure: genomic positions of labeled regions in the benchmark data sets, showing the train/test/unused split when training on H3K36me3_TDH_immune or H3K4me3_TDH_immune.]

SLIDE 78

Train on one histone mark, test on another (same person and samples)

[Figure: percent incorrect test labels for a model trained on labels from the same experiment type versus a different experiment type (histone mark), on test data sets H3K36me3_TDH_immune and H3K4me3_TDH_immune; points above the diagonal mean more errors in the model trained on a different experiment. Models compared: ccat.histone, ccat.tf, sicer, homer, rseg, hmcan, hmcan.broad, triform, macs, macs.broad, PeakSeg.]