Learning penalties for change-point detection using max-margin interval regression (PowerPoint PPT presentation)


SLIDE 1

Learning penalties for change-point detection using max-margin interval regression

Presented as a paper at ICML'13 by Toby Dylan Hocking, toby.hocking@mail.mcgill.ca

Co-authors: Guillem Rigaill, Francis Bach, Jean-Philippe Vert

24 May 2017

SLIDE 2

◮ Introduction: how to detect changes in copy number?
◮ From labels to interval regression
◮ A convex relaxation of the label error
◮ Results on the neuroblastoma data
◮ Conclusions and future work

SLIDE 3

Motivation: tumor genome copy number analysis

◮ Comparative genomic hybridization microarrays (aCGH) allow genome-wide copy number analysis, since the logratio is proportional to DNA copy number (Pinkel et al., 1998).

◮ Tumors often contain breakpoints, amplifications, and deletions at specific chromosomal locations that we would like to detect.

◮ Which genomic alterations are linked with good or bad patient outcome?

◮ To answer clinical questions like this one, we first need to accurately detect these genomic alterations.

SLIDE 4

aCGH neuroblastoma copy number data

SLIDE 5

Copy number profiles are predictive of progression in neuroblastoma

Gudrun Schleiermacher et al. Accumulation of Segmental Alterations Determines Progression in Neuroblastoma. J Clinical Oncology 2010.

Two types of profiles:

◮ Numerical: entire chromosome amplification. Good outcome.
◮ Segmental: deletion 1p 3p 11q, gain 1q 2p 17q. Bad outcome.

SLIDE 6

But which model is the best?

◮ GLAD: adaptive weights smoothing (Hupé et al., 2004)
◮ DNAcopy: circular binary segmentation (Venkatraman and Olshen, 2007)
◮ cghFLasso: fused lasso signal approximator with heuristics (Tibshirani and Wang, 2007)
◮ HaarSeg: wavelet smoothing (Ben-Yaacov and Eldar, 2008)
◮ GADA: sparse Bayesian learning (Pique-Regi et al., 2008)
◮ flsa: fused lasso signal approximator path algorithm (Hoefling, 2009)
◮ cghseg: pruned dynamic programming (Rigaill, 2010)
◮ PELT: pruned exact linear time (Killick et al., 2011)

Comparison study: Learning smoothing models of copy number profiles using breakpoint annotations (Hocking et al., 2012).

SLIDE 7

But which model is the best? (Identical to Slide 6.)

SLIDE 8

The cghseg.k/pelt.n least squares model

For a signal $y \in \mathbb{R}^d$, we define $\hat{y}^k$, the maximum likelihood model with $k \in \{1, \dots, d\}$ segments, as

$$\hat{y}^k = \arg\min_{\mu \in \mathbb{R}^d} \|y - \mu\|_2^2 \quad \text{subject to} \quad k - 1 = \sum_{j=1}^{d-1} 1_{\mu_j \neq \mu_{j+1}},$$

and select the number of segments using the penalty

$$k^*(\lambda) = \arg\min_k \; \lambda k d + \|y - \hat{y}^k\|_2^2,$$

using the constant λ which maximizes agreement with a database of labels.

But why this particular penalty? Should we normalize using the variance of the signal y?
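A minimal R sketch of this model selection step, assuming the per-k least squares costs have already been computed (for example by a dynamic programming segmentation solver); the function and variable names here are illustrative, not taken from the paper's code.

```r
## Select k*(lambda) = argmin_k lambda*k*d + cost[k], where cost[k] is the
## least squares cost of the best k-segment model (assumed precomputed).
select.segments <- function(cost, d, lambda) {
  k <- seq_along(cost)
  which.min(lambda * k * d + cost)
}

## Example with made-up costs for a signal of d = 100 data points.
cost <- c(50, 20, 12, 11.5, 11.4, 11.3)
select.segments(cost, d = 100, lambda = 0.01)
```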

SLIDE 9

The cghseg.k/pelt.n least squares model (identical to Slide 8.)

SLIDE 10

◮ Introduction: how to detect changes in copy number?
◮ From labels to interval regression
◮ A convex relaxation of the label error
◮ Results on the neuroblastoma data
◮ Conclusions and future work

SLIDE 11

Creating breakpoint labels (demo)

SLIDE 12

SLIDE 13

SLIDE 14

Labels for 2 signals

[Figure: logratio versus position on chromosome (mega base pairs) for signals i = 4.17 and i = 6.15, with 1-breakpoint and 0-breakpoints labels.]

SLIDE 15

Estimated model with 1 segment

[Figure: 1-segment model fit to the two labeled signals; label status: correct, false negative.]

SLIDE 16

Estimated model with 2 segments

[Figure: 2-segment model fit to the two labeled signals; label status: correct, false positive.]

SLIDE 17

Estimated model with 3 segments

[Figure: 3-segment model fit to the two labeled signals; label status: false positive.]

SLIDE 18

Estimated model with 4 segments

[Figure: 4-segment model fit to the two labeled signals; label status: false positive.]

SLIDE 19

Label error curves for 2 signals

[Figure: annotation error e_i(k) as a function of the number of segments k (1 to 20) for signals i = 4.17 and i = 6.15.]

SLIDE 20

Definition of label error

For signal i we have a set of labels $S_i$, where each label $(r, a) \in S_i$ has

◮ a specific labeled region, e.g. r = [10, 30],
◮ a labeled number of allowed change-points, e.g. a = {0}.

The label error $e_i : \{1, \dots, k_{\max}\} \to \mathbb{R}_+$ is the total number of incorrectly predicted labels for the model with k segments:

$$e_i(k) = \sum_{(r, a) \in S_i} 1_{|\hat{P}_i^k \cap r| \notin a},$$

where $\hat{P}_i^k$ is the set of predicted change-point positions, e.g. {25, 35}.
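A minimal R sketch of this label error, assuming the labels are stored as a data.frame with columns min and max for the region r and a list-column allowed for the set a; these names are illustrative, not the paper's data format.

```r
## Label error e_i(k): count labels whose region contains a number of
## predicted change-points that is not in the allowed set.
label.error <- function(pred.changes, labels) {
  sum(sapply(seq_len(nrow(labels)), function(l) {
    in.region <- sum(pred.changes >= labels$min[l] &
                     pred.changes <= labels$max[l])
    !(in.region %in% labels$allowed[[l]])
  }))
}

## Example: one 0-change-points label and one 1-change-point label.
labels <- data.frame(min = c(10, 60), max = c(30, 80))
labels$allowed <- list(0, 1)
label.error(pred.changes = c(25, 35), labels)  # 2: both labels are incorrect
```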

SLIDE 21

Penalized model label error

For a log penalty $L = \log \lambda \in \mathbb{R}$, we define the optimal number of segments

$$z_i^*(L) = \arg\min_{k \in \{1, \dots, k_{\max}\}} \exp(L)\, k + \|y_i - \hat{y}_i^k\|_2^2$$

and the penalized model annotation error $E_i : \mathbb{R} \to \mathbb{R}_+$,

$$E_i(L) = e_i[z_i^*(L)].$$

Both are piecewise constant functions that can be computed exactly.
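A small R sketch of these two functions, reusing the label.error sketch above and assuming the per-k costs and per-k change-point sets are precomputed; the names are illustrative only.

```r
## z*(L): optimal number of segments for log penalty L, given the per-k costs.
optimal.segments <- function(L, cost) {
  which.min(exp(L) * seq_along(cost) + cost)
}
## E(L): label error of the model selected by log penalty L. changes.per.k is
## a list whose k-th element is the change-point vector of the k-segment model.
penalized.error <- function(L, cost, changes.per.k, labels) {
  k <- optimal.segments(L, cost)
  label.error(changes.per.k[[k]], labels)
}

## Both functions are piecewise constant in L; evaluating them on a grid gives
## an approximation (the paper computes their breakpoints exactly).
## sapply(seq(-4, 2, by = 0.1), optimal.segments, cost = cost)
```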

SLIDE 22

Error and model selection curves for 2 signals

[Figure: model selection curves z_i^*(L) (number of segments) and error curves E_i(L), plotted against the log penalty L, for signals i = 4.17 and i = 6.15.]

SLIDE 23

Error and model selection curves for 2 signals (identical to Slide 22.)

SLIDE 24

Label error curves for 2 signals

[Figure: annotation error E_i(L) versus log penalty L for signals i = 4.17 and i = 6.15, with the limits $\underline{L}_i$ and $\overline{L}_i$ of the zero-error interval marked.]

SLIDE 25

Target interval $[\underline{L}_i, \overline{L}_i]$ for 2 signals

[Figure: log penalty L versus variance estimate log ŝ_i, showing the target interval limits $\underline{L}_i$ and $\overline{L}_i$ for the two labeled signals.]

SLIDE 26

Target interval $[\underline{L}_i, \overline{L}_i]$ for all signals

[Figure: target intervals of log penalty values plotted against the variance estimate log ŝ_i for all labeled signals.]

SLIDE 27

Limit point representation

[Figure: the same data represented as limit points $\underline{L}_i$ and $\overline{L}_i$ in the (log ŝ_i, L) plane.]

SLIDE 28

Max margin regression line

[Figure: the max margin regression line fit to the target interval limits in the (log ŝ_i, L) plane.]

SLIDE 29

Max margin interval regression is a linear program

◮ Let $x_i \in \mathbb{R}^m$ be the signal features, e.g. number of points sampled log d_i, variance estimate log ŝ_i, ...
◮ Learn an affine function $f(x_i) = w' x_i + \beta$ from features to model complexity using a linear program:

$$\max_{\beta \in \mathbb{R},\, w \in \mathbb{R}^m,\, \mu \in \mathbb{R}_+} \mu \quad \text{subject to} \quad \begin{aligned} &\forall i,\ \text{if } \underline{L}_i > -\infty:\ w' x_i + \beta - \underline{L}_i \geq \mu, \\ &\forall i,\ \text{if } \overline{L}_i < \infty:\ \overline{L}_i - w' x_i - \beta \geq \mu. \end{aligned}$$

◮ Linear objective, linear constraints ⇒ linear program (see the sketch after this list).
◮ Many solvers exist, e.g. simplex, interior point, dual methods.
◮ Implemented in R: lpSolve, quadprog, ...
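A minimal sketch of this LP using the CRAN package lpSolve (mentioned on the slide); this is not the paper's implementation, and the function name and variable layout are illustrative. Free variables w and β are split into nonnegative parts because lpSolve assumes all decision variables are nonnegative.

```r
library(lpSolve)

## X: n x m feature matrix; lower, upper: target interval limits, with
## -Inf / Inf where a limit is absent. Decision variables are
## (w+, w-, beta+, beta-, mu), all >= 0, and we maximize the margin mu.
max.margin.lp <- function(X, lower, upper) {
  m <- ncol(X)
  rows <- list()
  for (i in seq_len(nrow(X))) {
    if (is.finite(lower[i]))  # w'x + beta - mu >= lower
      rows[[length(rows) + 1]] <- c(X[i, ], -X[i, ], 1, -1, -1, lower[i])
    if (is.finite(upper[i]))  # -w'x - beta - mu >= -upper
      rows[[length(rows) + 1]] <- c(-X[i, ], X[i, ], -1, 1, -1, -upper[i])
  }
  A <- do.call(rbind, rows)
  rhs <- A[, ncol(A)]
  A <- A[, -ncol(A), drop = FALSE]
  obj <- c(rep(0, 2 * m + 2), 1)
  fit <- lp("max", obj, A, rep(">=", nrow(A)), rhs)
  s <- fit$solution
  list(w = s[1:m] - s[(m + 1):(2 * m)],
       beta = s[2 * m + 1] - s[2 * m + 2],
       margin = s[2 * m + 3])
}

## Example: one feature (log variance estimate) and three target intervals.
## max.margin.lp(X = cbind(c(-2.5, -2.2, -2.0)),
##               lower = c(-3.0, -2.8, -Inf), upper = c(-1.0, Inf, -1.5))
```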

SLIDE 30

Learning the penalty function

◮ For every signal i, calculate a variance estimate feature $x_i = \log \hat{s}_i$.
◮ Use that to predict model complexity using an affine function $f(x_i) = w \log \hat{s}_i + \beta$.
◮ Equivalent to learning a penalty function:

$$z_i^*[f(x_i)] = \arg\min_k \|y_i - \hat{y}_i^k\|_2^2 + \exp[f(x_i)]\, k = \arg\min_k \|y_i - \hat{y}_i^k\|_2^2 + \hat{s}_i^{w} e^{\beta} k.$$

SLIDE 31

Learning the penalty function (identical to Slide 30.)

SLIDE 32

◮ Introduction: how to detect changes in copy number?
◮ From labels to interval regression
◮ A convex relaxation of the label error
◮ Results on the neuroblastoma data
◮ Conclusions and future work

SLIDE 33

Learning problem

◮ Signal features $x_i \in \mathbb{R}^m$, e.g. variance, number of points, etc.
◮ We would like to find a penalty function $f : \mathbb{R}^m \to \mathbb{R}$ with minimal label error:

$$\arg\min_f \sum_{i=1}^{n} E_i[f(x_i)].$$

◮ But E_i is a non-convex, piecewise constant function.
◮ Grid search?
  ◮ Many features m ⇒ intractable.
  ◮ No guarantee to find the global min.
◮ Instead, we propose to minimize a margin-based convex relaxation:

$$\arg\min_f \sum_{i=1}^{n} l_i[f(x_i)].$$

SLIDE 34

Error and relaxation for 4 labeled signals

[Figure: annotation error E_i(L) and surrogate loss l_i(L) plotted against the penalty exponent L for 4 labeled signals.]

SLIDE 35

Regularize using the ℓ1-norm for variable selection

◮ Convex surrogate loss $l_i : \mathbb{R} \to \mathbb{R}_+$.
◮ Learn an affine function $f(x_i) = w' x_i + \beta$ by solving

$$\arg\min_{\beta \in \mathbb{R},\, w \in \mathbb{R}^m} \gamma \|w\|_1 + \frac{1}{n} \sum_{i=1}^{n} l_i(w' x_i + \beta).$$

◮ This is a convex optimization problem, so we can use accelerated proximal gradient methods to find the global minimum (e.g. FISTA, Beck and Teboulle 2009).
◮ The ℓ1 norm encourages w to be sparse: $\|w\|_1 = \sum_{j=1}^{m} |w_j|$.

SLIDE 36

Coefficients depend on degree of ℓ1 regularization γ

For each signal i, we use features $x_i \in \mathbb{R}^3$: number of points sampled d_i, variance estimate ŝ_i, length in base pairs b_i.

[Figure: coefficients for β, log b_i, log d_i, log ŝ_i and train/test percent annotation error as a function of model complexity $-\log_{10}(\gamma)$.]

SLIDE 37

◮ Introduction: how to detect changes in copy number?
◮ From labels to interval regression
◮ A convex relaxation of the label error
◮ Results on the neuroblastoma data
◮ Conclusions and future work

SLIDE 38

Original data (Systematic): label the same regions

SLIDE 39

Detailed data (Any): draw rectangles around regions

SLIDE 40

Two labeled data sets to analyze

                      original      detailed.low.density
protocol              Systematic    Any
annotated signals n   3418          3526
annotations           3418          4149
0breakpoints          2845          3214
1breakpoint                         514
>0breakpoints         573           421

Both label sets: data(neuroblastoma, package="neuroblastoma") on CRAN.
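A minimal sketch of loading this data set in R; the list component names ($profiles, $annotations) and the annotation column name are as I recall from the CRAN package documentation and should be checked against the str() output on your own installation.

```r
## install.packages("neuroblastoma")  # once, from CRAN
data(neuroblastoma, package = "neuroblastoma")
str(neuroblastoma$profiles)     # noisy logratio signals, one row per probe
str(neuroblastoma$annotations)  # breakpoint labels (regions and annotations)
table(neuroblastoma$annotations$annotation)
```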

SLIDE 41

Optimal model complexity depends on variance

SLIDE 42

Optimal model complexity depends on variance

SLIDE 43

Optimal model complexity depends on number of points

SLIDE 44

Optimal model complexity depends on number of points

SLIDE 45

Interactive figures

If we have time, show how the learned regression line differs from BIC.

http://members.cbio.ensmp.fr/~thocking/figure-regression-interactive/
http://bl.ocks.org/tdhock/raw/9fc37a7aaf291cef364aab3fb41dd898/
http://bl.ocks.org/tdhock/raw/eee5fd673c258ae554702d9c7c60f69b/

SLIDE 46

Learned coefficients

Model with two features:

◮ Variance estimate log ŝ_i.
◮ Number of points sampled log d_i.
◮ Learned model complexity $f(x_i) = w_1 \log \hat{s}_i + w_2 \log d_i + \beta$.
◮ Learned penalty function:

$$z_i^*[f(x_i)] = \arg\min_k \|y_i - \hat{y}_i^k\|_2^2 + \hat{s}_i^{w_1} d_i^{w_2} e^{\beta} k.$$

annotation data set     w1             w2             β
original                1.02 ±0.05     0.95 ±0.02     −2.64 ±0.17
detailed.low.density    1.30 ±0.01     0.93 ±0.03     −2.00 ±0.17

Mean ± 1 standard deviation over 4 folds.

SLIDE 47

Test error estimated using K-fold cross-validation

Models for predicting $\log \lambda_i = f(x_i)$:

◮ BIC: $f(x_i) = \log \log d_i$, no learned parameters.
◮ cghseg.k: $f(x_i) = \beta + \log d_i$, learn β using grid search.
◮ log.d: $f(x_i) = \beta + w \log d_i$, learn β, w by minimizing the un-regularized (γ = 0) surrogate loss l_i.
◮ log.s.log.d: $f(x_i) = \beta + w_1 \log \hat{s}_i + w_2 \log d_i$, learn β, w_1, w_2.
◮ L1-reg: $f(x_i) = \beta + w' x_i$ with variance estimate, signal size, model error, and chromosome indicator features $x_i \in \mathbb{R}^{117}$; CV to choose the degree of ℓ1 regularization γ.

[Figure: percent of incorrectly predicted labels (test error in 4-fold CV) for the models L1-reg, log.s.log.d, log.d, cghseg.k, and BIC on the original and detailed.low.density data sets.]

SLIDE 48

◮ Introduction: how to detect changes in copy number?
◮ From labels to interval regression
◮ A convex relaxation of the label error
◮ Results on the neuroblastoma data
◮ Conclusions and future work

SLIDE 49

Conclusions

◮ Labels can be quickly produced using a GUI.
◮ Labels + convex optimization = learned penalty functions.
◮ Learned penalties are more accurate changepoint detectors than unsupervised methods.

Availability:

◮ Python Package Index: annotate regions GUI.
◮ R package implementing penalty learning algorithms: https://github.com/tdhock/penaltyLearning
◮ R package neuroblastoma on CRAN (3418 labeled changepoint detection problems).

SLIDE 50

Discussion and future work

◮ Un-regularized generative linear interval regression models have been extensively studied in the survival analysis literature, e.g. the survival::survreg function in R implements Accelerated Failure Time models.
◮ This was the first paper to explore a regularized discriminative linear model for interval regression.
◮ Implementation on the web for interactive change-point detection: SegAnnDB, Bioinformatics 2014.
◮ Non-linear penalty functions? Tree model submitted to NIPS 2017.
◮ Multi-dimensional interval regression to learn arbitrary penalty constants.

SLIDE 51

Thanks for your attention!

Supplementary slides appear after this one.

SLIDE 52

Gradient of surrogate loss

We define the surrogate loss as

$$l_i(L) = \phi(L - \underline{L}_i) + \phi(\overline{L}_i - L),$$

so its partial derivatives are

$$\frac{\partial}{\partial \beta} l_i(w' x_i + \beta) = \phi'(w' x_i + \beta - \underline{L}_i) - \phi'(\overline{L}_i - w' x_i - \beta),$$

$$\frac{\partial}{\partial w_j} l_i(w' x_i + \beta) = x_{ij} \left[ \phi'(w' x_i + \beta - \underline{L}_i) - \phi'(\overline{L}_i - w' x_i - \beta) \right],$$

where the derivative of the squared hinge loss φ is

$$\phi'(L) = \begin{cases} 2(L - 1) & \text{if } L \leq 1 \\ 0 & \text{if } L \geq 1. \end{cases}$$
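A short R sketch of these formulas, assuming the target interval limits are stored as lower and upper with -Inf / Inf where a limit is absent; a sketch only, not the paper's implementation.

```r
## Squared hinge loss and its derivative.
phi <- function(x) ifelse(x <= 1, (x - 1)^2, 0)
phi.prime <- function(x) ifelse(x <= 1, 2 * (x - 1), 0)

## Surrogate loss for one signal at prediction L = w'x + beta; an infinite
## limit contributes zero loss because phi(Inf) = 0.
surrogate.loss <- function(L, lower, upper) {
  phi(L - lower) + phi(upper - L)
}
## Gradient of the surrogate loss with respect to (beta, w).
surrogate.grad <- function(x, L, lower, upper) {
  d <- phi.prime(L - lower) - phi.prime(upper - L)
  c(d, d * x)
}
```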

SLIDE 53

Optimality condition

Let the average surrogate loss be

$$\mathcal{L}(\beta, w) = \frac{1}{n} \sum_{i=1}^{n} l_i(w' x_i + \beta).$$

The optimization problem is

$$\arg\min_{\beta \in \mathbb{R},\, w \in \mathbb{R}^m} \gamma \|w\|_1 + \mathcal{L}(\beta, w),$$

and the approximate optimality condition is

$$\left| \frac{\partial}{\partial \beta} \mathcal{L}(\beta, w) \right| \leq \epsilon,$$

and for every variable $j \in \{1, \dots, m\}$,

$$\begin{cases} \left| -\gamma + \frac{\partial}{\partial w_j} \mathcal{L}(\beta, w) \right| \leq \epsilon & \text{if } w_j < 0 \\[4pt] \left[ \left| \frac{\partial}{\partial w_j} \mathcal{L}(\beta, w) \right| - \gamma \right]_+ \leq \epsilon & \text{if } w_j = 0 \\[4pt] \left| \gamma + \frac{\partial}{\partial w_j} \mathcal{L}(\beta, w) \right| \leq \epsilon & \text{if } w_j > 0. \end{cases}$$

SLIDE 54

Proximal operator

To apply the FISTA method we need the proximal operator $p_\eta : \mathbb{R}^{m+1} \to \mathbb{R}^{m+1}$, which in our case is

$$p_\eta(\beta, w) = \begin{bmatrix} \beta - \frac{1}{\eta} \frac{\partial}{\partial \beta} \mathcal{L}(\beta, w) \\ s_{\gamma/\eta}\!\left[ w_1 - \frac{1}{\eta} \frac{\partial}{\partial w_1} \mathcal{L}(\beta, w) \right] \\ \vdots \end{bmatrix},$$

where

◮ $\eta = m\sqrt{m}$ is a Lipschitz constant of the smooth loss.
◮ The soft-thresholding function $s_\lambda : \mathbb{R} \to \mathbb{R}$ is

$$s_\lambda(x) = \begin{cases} 0 & \text{if } |x| < \lambda \\ x - \lambda\, \mathrm{sign}(x) & \text{otherwise.} \end{cases}$$
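A small R sketch of the soft-thresholding function and one proximal gradient step following the formula above; grad.L is an assumed user-supplied function returning the gradient of the smooth average loss, and this is not the paper's FISTA code.

```r
## Soft-thresholding s_lambda(x).
soft.threshold <- function(x, lambda) {
  ifelse(abs(x) < lambda, 0, x - lambda * sign(x))
}

## One proximal gradient update of (beta, w); grad.L(beta, w) must return the
## gradient vector c(d/d.beta, d/d.w1, ..., d/d.wm) of the smooth loss.
prox.step <- function(beta, w, grad.L, gamma, eta) {
  g <- grad.L(beta, w)
  list(beta = beta - g[1] / eta,  # intercept is not l1-penalized
       w = soft.threshold(w - g[-1] / eta, gamma / eta))
}
```

FISTA would apply this update at an extrapolated (momentum) point and repeat until the optimality condition of the previous slide holds; this sketch shows only the basic proximal update.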