SLIDE 1
Learning penalties for change-point detection using max-margin interval regression
Presented as a paper at ICML 2013
Toby Dylan Hocking, toby.hocking@mail.mcgill.ca
Co-authors: Guillem Rigaill, Francis Bach, Jean-Philippe Vert
24 May 2017
SLIDE 2
SLIDE 3
Motivation: tumor genome copy number analysis
◮ Comparative genomic hybridization microarrays (aCGH) allow genome-wide copy number analysis, since the logratio is proportional to DNA copy number (Pinkel et al., 1998).
◮ Tumors often contain breakpoints, amplifications, and deletions at specific chromosomal locations that we would like to detect.
◮ Which genomic alterations are linked with good or bad patient outcome?
◮ To answer clinical questions like this one, we first need to accurately detect these genomic alterations.
SLIDE 4
aCGH neuroblastoma copy number data
SLIDE 5
Copy number profiles are predictive of progression in neuroblastoma
Gudrun Schleiermacher, et al. Accumulation of Segmental Alterations Determines Progression in Neuroblastoma. J Clinical Oncology 2010.

2 types of profiles:
◮ Numerical: entire chromosome amplification. Good outcome.
◮ Segmental: deletion 1p 3p 11q, gain 1q 2p 17q. Bad outcome.
SLIDE 6
But which model is the best?
◮ GLAD: adaptive weights smoothing (Hupé et al., 2004)
◮ DNAcopy: circular binary segmentation (Venkatraman and Olshen, 2007)
◮ cghFLasso: fused lasso signal approximator with heuristics (Tibshirani and Wang, 2007)
◮ HaarSeg: wavelet smoothing (Ben-Yaacov and Eldar, 2008)
◮ GADA: sparse Bayesian learning (Pique-Regi et al., 2008)
◮ flsa: fused lasso signal approximator path algorithm (Hoefling 2009)
◮ cghseg: pruned dynamic programming (Rigaill 2010)
◮ PELT: pruned exact linear time (Killick et al., 2011)
Comparison study: Learning smoothing models of copy number profiles using breakpoint annotations (Hocking et al., 2012).
SLIDE 8
The cghseg.k/pelt.n least squares model
For a signal $y \in \mathbb{R}^d$, we define $\hat y^k$, the maximum likelihood model with $k \in \{1, \dots, d\}$ segments, as
$$\hat y^k = \arg\min_{\mu \in \mathbb{R}^d} \|y - \mu\|_2^2 \quad \text{subject to} \quad k - 1 = \sum_{j=1}^{d-1} 1_{\mu_j \neq \mu_{j+1}},$$
select the number of segments using the penalty
$$k^*(\lambda) = \arg\min_k\; \lambda k d + \|y - \hat y^k\|_2^2,$$
and use the constant $\lambda$ which maximizes agreement with a database of labels.

But why this particular penalty? Should we normalize using the variance of the signal $y$?
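A minimal R sketch of this selection rule (an illustration, not the original implementation), assuming a hypothetical vector squared.error holding $\|y - \hat y^k\|_2^2$ for $k = 1, \dots, k_{\max}$ as produced by any segmentation solver:

```r
## Select k*(lambda) = argmin_k lambda*k*d + ||y - yhat^k||^2,
## given the least squares error of each candidate model.
select.segments <- function(squared.error, lambda, d) {
  k <- seq_along(squared.error)
  which.min(lambda * k * d + squared.error)
}
## Example with hypothetical errors for a signal of d = 100 points:
## a smaller lambda selects more segments.
select.segments(c(50, 20, 18, 17.5, 17.4), lambda = 0.1, d = 100)  # 2
```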
SLIDE 10
Introduction: how to detect changes in copy number?
From labels to interval regression
A convex relaxation of the label error
Results on the neuroblastoma data
Conclusions and future work
SLIDE 11
Creating breakpoint labels (demo)
SLIDE 12
SLIDE 13
SLIDE 14
Labels for 2 signals
[Figure: logratio versus position on chromosome (mega base pairs) for signals i = 4.17 and i = 6.15, each with a labeled region: 1breakpoint or 0breakpoints.]
SLIDE 15
Estimated model with 1 segment
[Figure: estimated 1-segment model for signals i = 4.17 and i = 6.15; labels colored by status: correct, false negative.]
SLIDE 16
Estimated model with 2 segments
[Figure: estimated 2-segment model for the same signals; labels colored by status: correct, false positive.]
SLIDE 17
Estimated model with 3 segments
[Figure: estimated 3-segment model; label status: false positive.]
SLIDE 18
Estimated model with 4 segments
[Figure: estimated 4-segment model; label status: false positive.]
SLIDE 19
Label error curves for 2 signals
[Figure: annotation error $e_i(k)$ versus number of segments $k \in \{1, \dots, 20\}$ for signals i = 4.17 and i = 6.15.]
SLIDE 20
Definition of label error
For signal $i$ we have a set of labels $S_i$, where each label $(r, a) \in S_i$ has
◮ a specific labeled region, e.g. $r = [10, 30]$;
◮ a labeled set of allowed numbers of change-points, e.g. $a = \{0\}$.

The label error $e_i : \{1, \dots, k_{\max}\} \to \mathbb{R}^+$ is the total number of incorrectly predicted labels for the model with $k$ segments:
$$e_i(k) = \sum_{(r,a) \in S_i} 1_{|\hat P^k_i \cap r| \notin a},$$
where $\hat P^k_i$ is the set of predicted change-point positions, e.g. $\{25, 35\}$.
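A minimal R sketch of this count, simplified so each label allows a single number of changes (the definition above allows a set $a$); pred.changes and labels are hypothetical inputs:

```r
## e_i(k): number of labels whose region contains a number of
## predicted changes different from the allowed count.
label.error <- function(pred.changes, labels) {
  sum(sapply(seq_len(nrow(labels)), function(r) {
    in.region <- sum(labels$min[r] <= pred.changes &
                     pred.changes <= labels$max[r])
    in.region != labels$allowed[r]  # TRUE = incorrectly predicted label
  }))
}
labels <- data.frame(min = c(10, 50), max = c(30, 70), allowed = c(0, 1))
label.error(pred.changes = c(25, 35), labels)  # 2 incorrect labels
```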
SLIDE 21
Penalized model label error
For a log penalty $L = \log \lambda \in \mathbb{R}$ we define the optimal number of segments
$$z^*_i(L) = \arg\min_{k \in \{1, \dots, k_{\max}\}} \exp(L)\, k + \|y_i - \hat y^k_i\|_2^2$$
and the penalized model annotation error $E_i : \mathbb{R} \to \mathbb{R}^+$,
$$E_i(L) = e_i[z^*_i(L)].$$
Both are piecewise constant functions that can be computed exactly.
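One way to compute $z^*_i$ exactly is the standard path algorithm over the crossing penalties of competing models; a sketch (not the paper's code), assuming a decreasing loss vector C with entries $C_k = \|y_i - \hat y^k_i\|_2^2$:

```r
## Exact model selection path: each selected k with the smallest
## lambda = exp(L) at which it becomes optimal for lambda*k + C_k.
model.selection <- function(C) {
  k <- length(C)  # largest model is optimal as lambda -> 0
  path <- data.frame(segments = integer(), min.lambda = numeric())
  min.lambda <- 0
  while (k > 1) {
    smaller <- 1:(k - 1)
    ## penalty at which model j < k attains the same criterion as k
    crossing <- (C[smaller] - C[k]) / (k - smaller)
    path <- rbind(path, data.frame(segments = k, min.lambda = min.lambda))
    min.lambda <- min(crossing)
    k <- smaller[which.min(crossing)]  # next model as lambda increases
  }
  rbind(path, data.frame(segments = 1, min.lambda = min.lambda))
}
model.selection(C = c(50, 20, 18, 17.5, 17.4))
```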
SLIDE 22
Error and model selection curves for 2 signals
[Figure: number of segments $z^*_i(L)$ (1 to 20) and error $E_i(L)$ as functions of the log penalty $L$, for signals i = 4.17 and i = 6.15.]
SLIDE 24
Label error curves for 2 signals
[Figure: annotation error $E_i(L)$ versus log penalty $L$ for signals i = 4.17 and i = 6.15, with the limits $\underline L_i$ and $\overline L_i$ marked.]
SLIDE 25
Target interval [Li, Li] for 2 signals
[Figure: target interval $[\underline L_i, \overline L_i]$ of log penalty values plotted against the variance estimate $\log \hat s_i$, for the 2 signals.]
SLIDE 26
Target interval [Li, Li] for all signals
[Figure: target interval $[\underline L_i, \overline L_i]$ of log penalty values plotted against the variance estimate $\log \hat s_i$, for all signals.]
SLIDE 27
Limit point representation
[Figure: each target interval represented by its limit points $\underline L_i$ and $\overline L_i$ in the ($\log \hat s_i$, $L$) plane.]
SLIDE 28
Max margin regression line
[Figure: max margin regression line fit to the limit points in the ($\log \hat s_i$, $L$) plane.]
SLIDE 29
Max margin interval regression is a linear program
◮ Let $x_i \in \mathbb{R}^m$ be the signal features, e.g. number of points sampled $\log d_i$, variance estimate $\log \hat s_i$, ...
◮ Learn an affine function $f(x_i) = w'x_i + \beta$ from features to model complexity using a linear program:
$$\begin{aligned} \underset{\beta \in \mathbb{R},\, w \in \mathbb{R}^m,\, \mu \in \mathbb{R}^+}{\text{maximize}} \quad & \mu \\ \text{subject to} \quad & \forall i \text{ with } \underline L_i > -\infty:\; w'x_i + \beta - \underline L_i \geq \mu, \\ & \forall i \text{ with } \overline L_i < \infty:\; \overline L_i - w'x_i - \beta \geq \mu. \end{aligned}$$
◮ Linear objective, linear constraints ⇒ linear program.
◮ Many solvers exist, e.g. simplex, interior point, dual methods.
◮ Implemented in R: lpSolve, quadprog, ... (a minimal sketch follows).
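A sketch of this linear program using the CRAN solver lpSolve; since lpSolve assumes non-negative variables, $w$ and $\beta$ are split into positive and negative parts. X, lower, and upper are hypothetical inputs holding the features and the (possibly infinite) limits $\underline L_i$, $\overline L_i$; this illustrates the formulation above, not the paper's implementation:

```r
library(lpSolve)

max.margin <- function(X, lower, upper) {
  m <- ncol(X)
  ## Variables: w+ (m), w- (m), beta+, beta-, mu; all >= 0 in lpSolve.
  ## s = +1 encodes  w'x + beta - mu >= lower;
  ## s = -1 encodes -w'x - beta - mu >= -upper.
  row.for <- function(x, s) c(s * x, -s * x, s, -s, -1)
  A <- do.call(rbind, c(
    lapply(which(is.finite(lower)), function(i) row.for(X[i, ], +1)),
    lapply(which(is.finite(upper)), function(i) row.for(X[i, ], -1))))
  fit <- lp(direction = "max",
            objective.in = c(rep(0, 2 * m + 2), 1),  # maximize margin mu
            const.mat = A,
            const.dir = rep(">=", nrow(A)),
            const.rhs = c(lower[is.finite(lower)], -upper[is.finite(upper)]))
  list(w = fit$solution[1:m] - fit$solution[(m + 1):(2 * m)],
       beta = fit$solution[2 * m + 1] - fit$solution[2 * m + 2],
       margin = fit$solution[2 * m + 3])
}
```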
SLIDE 30
Learning the penalty function
◮ For every signal $i$, calculate a variance estimate feature $x_i = \log \hat s_i$.
◮ Use it to predict model complexity with an affine function $f(x_i) = w \log \hat s_i + \beta$.
◮ This is equivalent to learning a penalty function:
$$z^*_i[f(x_i)] = \arg\min_k\; \|y_i - \hat y^k_i\|_2^2 + \exp[f(x_i)]\, k = \arg\min_k\; \|y_i - \hat y^k_i\|_2^2 + \hat s_i^{w} e^{\beta} k.$$
SLIDE 32
Introduction: how to detect changes in copy number?
From labels to interval regression
A convex relaxation of the label error
Results on the neuroblastoma data
Conclusions and future work
SLIDE 33
Learning problem
◮ Signal features $x_i \in \mathbb{R}^m$, e.g. variance, number of points, etc.
◮ We would like to find a penalty function $f : \mathbb{R}^m \to \mathbb{R}$ with minimal label error:
$$\arg\min_f \sum_{i=1}^n E_i[f(x_i)].$$
◮ But $E_i$ is a non-convex, piecewise constant function.
◮ Grid search?
  ◮ Many features $m$ ⇒ intractable.
  ◮ No guarantee of finding the global min.
◮ Instead, we propose to minimize a margin-based convex relaxation:
$$\arg\min_f \sum_{i=1}^n l_i[f(x_i)].$$
SLIDE 34
Error and relaxation for 4 labeled signals
[Figure: annotation error $E_i(L)$ and surrogate loss $l_i(L)$ as functions of the penalty exponent $L$, for 4 labeled signals.]
SLIDE 35
Regularize using the ℓ1-norm for variable selection
◮ Convex surrogate loss $l_i : \mathbb{R} \to \mathbb{R}^+$.
◮ Learn an affine function $f(x_i) = w'x_i + \beta$ by solving
$$\arg\min_{\beta \in \mathbb{R},\, w \in \mathbb{R}^m}\; \gamma \|w\|_1 + \frac{1}{n} \sum_{i=1}^n l_i(w'x_i + \beta).$$
◮ This is a convex optimization problem, so we can use accelerated proximal gradient methods to find the global minimum (e.g. FISTA, Beck and Teboulle 2009).
◮ The $\ell_1$ norm encourages $w$ to be sparse:
$$\|w\|_1 = \sum_{j=1}^m |w_j|.$$
SLIDE 36
Coefficients depend on degree of ℓ1 regularization γ
For each signal $i$, we use features $x_i \in \mathbb{R}^3$: number of points sampled $d_i$, variance estimate $\hat s_i$, length in base pairs $b_i$.

[Figure: coefficients ($\beta$, $\log b_i$, $\log d_i$, $\log \hat s_i$) and percent annotation error (train and test) as functions of model complexity $-\log_{10}(\gamma)$.]
SLIDE 37
Introduction: how to detect changes in copy number?
From labels to interval regression
A convex relaxation of the label error
Results on the neuroblastoma data
Conclusions and future work
SLIDE 38
Original data (Systematic): label the same regions
SLIDE 39
Detailed data (Any): draw rectangles around regions
SLIDE 40
Two labeled data sets to analyze
                     original     detailed.low.density
protocol             Systematic   Any
annotated signals n  3418         3526
annotations          3418         4149
0breakpoints         2845         3214
1breakpoint                       514
>0breakpoints        573          421

Both sets of labels: data(neuroblastoma, package="neuroblastoma") on CRAN.
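For example, the labels can be loaded and tabulated as follows; the element and column names ($profiles, $annotations, annotation) are assumed from the package documentation:

```r
## Load the labeled neuroblastoma data from CRAN.
data(neuroblastoma, package = "neuroblastoma")
## Assumed layout: a list with $profiles (logratio data) and
## $annotations (labels); the names here are assumptions.
str(neuroblastoma, max.level = 1)
table(neuroblastoma$annotations$annotation)
```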
SLIDE 41
Optimal model complexity depends on variance
SLIDE 42
Optimal model complexity depends on variance
SLIDE 43
Optimal model complexity depends on number of points
SLIDE 44
Optimal model complexity depends on number of points
SLIDE 45
Interactive figures
If we have time, show how the learned regression line differs from BIC.
http://members.cbio.ensmp.fr/~thocking/figure-regression-interactive/
http://bl.ocks.org/tdhock/raw/9fc37a7aaf291cef364aab3fb41dd898/
http://bl.ocks.org/tdhock/raw/eee5fd673c258ae554702d9c7c60f69b/
SLIDE 46
Learned coefficients
Model with two features:
◮ Variance estimate $\log \hat s_i$.
◮ Number of points sampled $\log d_i$.
◮ Learned model complexity: $f(x_i) = w_1 \log \hat s_i + w_2 \log d_i + \beta$.
◮ Learned penalty function:
$$z^*_i[f(x_i)] = \arg\min_k\; \|y_i - \hat y^k_i\|_2^2 + \hat s_i^{w_1} d_i^{w_2} e^{\beta} k.$$

annotation data set    w1            w2            β
original               1.02 ± 0.05   0.95 ± 0.02   −2.64 ± 0.17
detailed.low.density   1.30 ± 0.01   0.93 ± 0.03   −2.00 ± 0.17

Mean ± 1 standard deviation over 4 folds.
SLIDE 47
Test error estimated using K-fold cross-validation
Predicted penalty $\log \lambda_i = f(x_i)$:
◮ BIC: $f(x_i) = \log \log d_i$, no learned parameters.
◮ cghseg.k: $f(x_i) = \beta + \log d_i$, learn $\beta$ using grid search.
◮ log.d: $f(x_i) = \beta + w \log d_i$, learn $\beta, w$ by minimizing the un-regularized ($\gamma = 0$) surrogate loss $l_i$.
◮ log.s.log.d: $f(x_i) = \beta + w_1 \log \hat s_i + w_2 \log d_i$, learn $\beta, w_1, w_2$.
◮ L1-reg: $f(x_i) = \beta + w'x_i$ with variance estimate, signal size, model error, and chromosome indicator features $x_i \in \mathbb{R}^{117}$; cross-validation to choose the degree of $\ell_1$ regularization $\gamma$.

[Figure: percent incorrect labels (test error in 4-fold CV) for models L1-reg, log.s.log.d, log.d, cghseg.k, and BIC, on the original and detailed.low.density data sets.]
SLIDE 48
Introduction: how to detect changes in copy number?
From labels to interval regression
A convex relaxation of the label error
Results on the neuroblastoma data
Conclusions and future work
SLIDE 49
Conclusions
◮ Labels can be quickly produced using a GUI.
◮ Labels + convex optimization = learned penalty functions.
◮ Learned penalties are more accurate change-point detectors than unsupervised methods.

Availability:
◮ Python Package Index: annotate regions GUI.
◮ R package implementing penalty learning algorithms: https://github.com/tdhock/penaltyLearning
◮ R package neuroblastoma on CRAN (3418 labeled change-point detection problems).
SLIDE 50
Discussion and future work
◮ Un-regularized generative linear interval regression models have been extensively studied in the survival analysis literature; e.g. the survival::survreg function in R implements Accelerated Failure Time models.
◮ This was the first paper to explore a regularized discriminative linear model for interval regression.
◮ Implementation on the web for interactive change-point detection: SegAnnDB, Bioinformatics 2014.
◮ Non-linear penalty functions? Tree model submitted to NIPS 2017.
◮ Multi-dimensional interval regression to learn arbitrary penalty constants.
SLIDE 51
Thanks for your attention!
Supplementary slides appear after this one.
SLIDE 52
Gradient of surrogate loss
We define the surrogate loss as
$$l_i(L) = \phi(L - \underline L_i) + \phi(\overline L_i - L),$$
so its partial derivatives are
$$\frac{\partial}{\partial \beta} l_i(w'x_i + \beta) = \phi'(w'x_i + \beta - \underline L_i) - \phi'(\overline L_i - w'x_i - \beta),$$
$$\frac{\partial}{\partial w_j} l_i(w'x_i + \beta) = x_{ij}\left[\phi'(w'x_i + \beta - \underline L_i) - \phi'(\overline L_i - w'x_i - \beta)\right],$$
where the derivative of the squared hinge loss $\phi$ is
$$\phi'(L) = \begin{cases} 2(L - 1) & \text{if } L \leq 1, \\ 0 & \text{if } L \geq 1. \end{cases}$$
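These formulas translate directly into R; a sketch with hypothetical inputs x, L.lo, L.hi for one signal's features and limits (an infinite limit contributes zero):

```r
## Squared hinge loss and its derivative.
phi <- function(u) ifelse(u <= 1, (u - 1)^2, 0)
phi.prime <- function(u) ifelse(u <= 1, 2 * (u - 1), 0)
## Surrogate loss and its gradient for one signal; L.lo = -Inf or
## L.hi = Inf drop the corresponding term, as in the slides.
surrogate.loss <- function(L, L.lo, L.hi) phi(L - L.lo) + phi(L.hi - L)
surrogate.grad <- function(w, beta, x, L.lo, L.hi) {
  L <- sum(w * x) + beta
  common <- phi.prime(L - L.lo) - phi.prime(L.hi - L)
  list(beta = common, w = x * common)  # d/d beta and d/d w_j = x_ij * common
}
```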
SLIDE 53
Optimality condition
Let the average surrogate loss be
$$\mathcal{L}(\beta, w) = \frac{1}{n} \sum_{i=1}^n l_i(w'x_i + \beta).$$
The optimization problem is
$$\arg\min_{\beta \in \mathbb{R},\, w \in \mathbb{R}^m}\; \gamma \|w\|_1 + \mathcal{L}(\beta, w),$$
and the approximate optimality condition is
$$\left| \frac{\partial}{\partial \beta} \mathcal{L}(\beta, w) \right| \leq \epsilon,$$
and for every variable $j \in \{1, \dots, m\}$,
$$\begin{cases} \left| -\gamma + \frac{\partial}{\partial w_j} \mathcal{L}(\beta, w) \right| \leq \epsilon & \text{if } w_j < 0, \\[4pt] \left( \left| \frac{\partial}{\partial w_j} \mathcal{L}(\beta, w) \right| - \gamma \right)_+ \leq \epsilon & \text{if } w_j = 0, \\[4pt] \left| \gamma + \frac{\partial}{\partial w_j} \mathcal{L}(\beta, w) \right| \leq \epsilon & \text{if } w_j > 0. \end{cases}$$
SLIDE 54
Proximal operator
To apply the FISTA method we need to solve the proximal operator $p_\eta : \mathbb{R}^{m+1} \to \mathbb{R}^{m+1}$, which in our case is
$$p_\eta(\beta, w) = \begin{pmatrix} \beta - \frac{1}{\eta} \frac{\partial}{\partial \beta} \mathcal{L}(\beta, w) \\[4pt] s_{\gamma/\eta}\!\left[ w_1 - \frac{1}{\eta} \frac{\partial}{\partial w_1} \mathcal{L}(\beta, w) \right] \\ \vdots \end{pmatrix}$$
where
◮ $\eta = m\sqrt{m}$ is a Lipschitz constant of the smooth loss.
◮ The soft-thresholding function $s_\lambda : \mathbb{R} \to \mathbb{R}$ is
$$s_\lambda(x) = \begin{cases} 0 & \text{if } |x| < \lambda, \\ x - \lambda\, \mathrm{sign}(x) & \text{otherwise.} \end{cases}$$
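Putting the pieces together, a minimal FISTA sketch using this soft-thresholding step; theta = c(beta, w), grad.fn, gamma, and eta are hypothetical inputs, and only the w coordinates are thresholded because the intercept is not regularized:

```r
## Soft-thresholding operator s_lambda from the slide above.
soft.threshold <- function(x, lambda)
  ifelse(abs(x) < lambda, 0, x - lambda * sign(x))

## FISTA (Beck and Teboulle 2009): accelerated proximal gradient.
## grad.fn(theta) must return the gradient of the smooth average
## loss at theta = c(beta, w).
fista <- function(theta, grad.fn, gamma, eta, n.iter = 200) {
  z <- theta
  t.old <- 1
  for (it in 1:n.iter) {
    step <- z - grad.fn(z) / eta            # gradient step, size 1/eta
    theta.new <- c(step[1],                 # beta: no l1 penalty
                   soft.threshold(step[-1], gamma / eta))
    t.new <- (1 + sqrt(1 + 4 * t.old^2)) / 2
    z <- theta.new + ((t.old - 1) / t.new) * (theta.new - theta)
    theta <- theta.new
    t.old <- t.new
  }
  theta
}
```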