SLIDE 1

Learning algorithms and statistical software, with applications to bioinformatics

PhD defense of Toby Dylan Hocking
toby.hocking@inria.fr
http://cbio.ensmp.fr/~thocking/
20 November 2012

SLIDE 2

Summary of contributions

◮ Ch. 2: clusterpath for finding groups in data, ICML 2011.
◮ Ch. 3: breakpoint annotations for smoothing model training and evaluation, HAL-00663790.
◮ Ch. 4-5: penalties for breakpoint detection in simulated and real signals, under review.
◮ Statistical software contributions in R:
  ◮ Ch. 7: direct labels for readable statistical graphics, Best Student Poster at useR 2011.
  ◮ Ch. 8: documentation generation to convert comments into a package for distribution, accepted in JSS.
  ◮ Ch. 9: named capture regular expressions for extracting data from text files, talk at useR 2011, accepted into R-2.14.

SLIDE 3

Cancer cells show chromosomal copy number alterations

Spectral karyotypes show the number of copies of the sex chromosomes (X,Y) and autosomes (1-22). Source: Alberts et al. 2002. Normal cell with 2 copies of each autosome. Cancer cell with many copy number alterations.

SLIDE 4

Copy number profiles of neuroblastoma tumors

SLIDE 5
◮ Ch. 2: clusterpath finds groups in data
◮ Ch. 3: breakpoint annotations for smoothing model selection
◮ Ch. 4-5: penalties for breakpoint detection

SLIDE 6

The clusterpath relaxes a hard fusion penalty

The hard fusion problem is combinatorial:

$$\min_{\alpha \in \mathbb{R}^{n \times p}} ||\alpha - X||_F^2 \quad \text{subject to} \quad \sum_{i<j} 1_{\alpha_i \ne \alpha_j} \le t.$$

Relaxation:

$$\sum_{i<j} ||\alpha_i - \alpha_j||_q \, w_{ij} \le t.$$

The clusterpath is the path of optimal $\alpha$ obtained by varying $t$.

Related work: "fused lasso" Tibshirani and Saunders (2005), "convex clustering shrinkage" Pelckmans et al. (2005), "grouping pursuit" Shen and Huang (2010), "sum of norms" Lindsten et al. (2011).

[Figures: geometric sketch of points X1, X2, X3 whose optimal values fuse, $\alpha_C = \alpha_2 = \alpha_3$; and a clusterpath on data with axes labeled survival and number of breakpoints.]
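For concreteness, here is a minimal R sketch of the relaxed fusion penalty and the Gaussian weights defined above. This is illustration only: the clusterpath package implements the actual path algorithms, and the function name below is invented for this sketch.

```r
## Omega(alpha) = sum_{i<j} ||alpha_i - alpha_j||_q * w_ij, with Gaussian
## weights w_ij = exp(-gamma * ||X_i - X_j||_2^2). X is an n x p matrix.
fusion.penalty <- function(alpha, X, gamma = 1, q = 2) {
  n <- nrow(X)
  total <- 0
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    w.ij <- exp(-gamma * sum((X[i, ] - X[j, ])^2))
    total <- total + sum(abs(alpha[i, ] - alpha[j, ])^q)^(1 / q) * w.ij
  }
  total
}

## Example: penalty of the un-shrunk solution alpha = X.
X <- matrix(rnorm(20), 10, 2)
fusion.penalty(X, X, gamma = 0, q = 2)
```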

SLIDE 7

Choice of norm and weights alters the clusterpath

Take $X \in \mathbb{R}^{10 \times 2}$ and solve

$$\min_\alpha ||\alpha - X||_F^2 \quad \text{subject to} \quad \Omega(\alpha)/\Omega(X) \le t, \quad \text{here with } t = 0.$$

Penalty with $\ell_q$ norm: $\Omega(Y) = \sum_{i<j} ||Y_i - Y_j||_q w_{ij}$. Weights: $w_{ij} = \exp(-\gamma ||X_i - X_j||_2^2)$.

[Figure: clusterpath solutions in panels for norm $q \in \{1, 2, \infty\}$ and weights $\gamma \in \{0, 1\}$.]

(Slides 8-17 repeat this figure for constraint bounds $\Omega(\alpha)/\Omega(X) \le 0.1, 0.2, \dots, 1$, animating how the clusterpath splits the points apart as the constraint is relaxed.)

slide-18
SLIDE 18

Clusterpath learns a tree, even for odd cluster shapes

Comparison with other methods for finding 2 clusters. Caveat: it does not recover overlapping clusters, e.g. the iris data or a Gaussian mixture.

SLIDE 19

Contributions in chapter 2, future work

Hocking et al. Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties. ICML 2011.

◮ Theorem: no splits in the ℓ1 clusterpath with identity weights w_ij = 1. What about other situations?
◮ Convex and hierarchical clustering algorithms:
  ◮ ℓ1 homotopy method, O(pn log n).
  ◮ ℓ2 active-set method, O(pn²).
  ◮ ℓ∞ Frank-Wolfe algorithm.
◮ Implementation in R package clusterpath on R-Forge.

SLIDE 20
◮ Ch. 2: clusterpath finds groups in data
◮ Ch. 3: breakpoint annotations for smoothing model selection
◮ Ch. 4-5: penalties for breakpoint detection

SLIDE 21

How to detect breakpoints in 23 × 575 = 13,225 signals?

SLIDE 22

Which model should we use?

◮ GLAD: adaptive weights smoothing (Hupé et al., 2004)
◮ DNAcopy: circular binary segmentation (Venkatraman and Olshen, 2007)
◮ cghFLasso: fused lasso signal approximator with heuristics (Tibshirani and Wang, 2007)
◮ HaarSeg: wavelet smoothing (Ben-Yaacov and Eldar, 2008)
◮ GADA: sparse Bayesian learning (Pique-Regi et al., 2008)
◮ flsa: fused lasso signal approximator path algorithm (Hoefling, 2009)
◮ cghseg: pruned dynamic programming (Rigaill, 2010)
◮ PELT: pruned exact linear time (Killick et al., 2011)

... and how to select the smoothing parameter in each model?
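As an illustration of fitting one of these models, a minimal sketch using Bioconductor's DNAcopy package, assuming vectors logratio, chromosome, and position for one profile:

```r
library(DNAcopy)  # Bioconductor: circular binary segmentation

## Build a copy number array object from one profile and segment it.
cna <- CNA(logratio, chromosome, position, data.type = "logratio")
fit <- segment(smooth.CNA(cna), verbose = 0)
fit$output  # one row per segment: chromosome, start, end, segment mean
```

Each package exposes a different smoothing parameter (e.g. the undo.SD argument of DNAcopy::segment, or the λ of the fused lasso), which is the quantity the annotations are used to select.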

SLIDE 23

575 copy number profiles, each annotated in 6 regions

SLIDE 24

Not enough breakpoints

SLIDE 25

Too many breakpoints

SLIDE 26

Good agreement with annotated regions

SLIDE 27

Select the best model using the breakpoint annotations

Breakpoint detection training errors for 3 models of data(neuroblastoma, package="neuroblastoma").

[Figure: percent incorrectly predicted annotations in the training set versus log10(smoothing parameter λ), in panels for cghseg.k/pelt.n, flsa.norm, and dnacopy.sd (minimum errors roughly 2.2, 4.8, and 11.5 percent respectively), with false.positive, false.negative, and total error curves; more breakpoints to the left, fewer breakpoints to the right.]

Idea: for several smoothing parameters λ, calculate the annotation error function E(λ) (black line), then select the model with least error (black dot):

$$\hat\lambda = \arg\min_\lambda E(\lambda).$$
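A minimal R sketch of this selection rule, where annotation.error is a hypothetical helper returning the percent of incorrectly predicted annotations for a given λ:

```r
## Evaluate the annotation error over a grid of lambda values and keep
## the minimizer: lambda.hat = argmin_lambda E(lambda).
lambda.grid <- 10^seq(-5, 5, by = 0.1)
E <- sapply(lambda.grid, annotation.error)
lambda.hat <- lambda.grid[which.min(E)]
```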

SLIDE 28

PELT/cghseg show the best breakpoint detection

ROC curves for breakpoint detection training errors of each model, obtained by varying the smoothness parameter λ.

[Figure: true positive rate = P(predict breakpoint | breakpoint) versus false positive rate = P(predict breakpoint | normal), in panels for optimization-based models (cghseg.k, flsa.norm, flsa, pelt.n, pelt.default, cghseg.mBIC), approximate optimization (gada, dnacopy.alpha, dnacopy.prune, dnacopy.sd, dnacopy.default), and glad variants (glad.haarseg, glad.lambdabreak, glad.MinBkpWeight, glad.default).]

Open circle shows smoothness λ selected using annotations.

SLIDE 29

Few annotations required for a good breakpoint detector

[Figure: percent of correctly predicted annotations on test-set profiles (80 to 100) as a function of the number of annotated profiles in the global model training set (1 to 30), for glad.lambdabreak, dnacopy.sd, flsa.norm, and cghseg.k/pelt.n.]

SLIDE 30

Interactive web site for annotation and model building

Takita J et al. Aberrations of NEGR1 on 1p31 and MYEOV on 11q13 in neuroblastoma. Cancer Sci. 2011 Sep;102(9):1645-50.

SLIDE 31

Contributions in chapter 3: data, algorithms, comparison, code

Hocking et al. Learning smoothing models using breakpoint annotations. HAL-00663790, Jan 2012.

◮ data(neuroblastoma, package="neuroblastoma") on CRAN: 575 annotated profiles.
◮ Model training and evaluation protocols based on breakpoint annotations.
◮ Quantitative comparison of 17 breakpoint detection models on the neuroblastoma data set.
◮ Free/open-source GUIs for creating annotation databases.

More accurate breakpoint detection?
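A minimal sketch of loading the published data set, assuming the CRAN package stores it as a list of two data frames (profiles and annotations):

```r
## Annotated neuroblastoma copy number profiles, as distributed on CRAN.
data(neuroblastoma, package = "neuroblastoma")
str(neuroblastoma$profiles)     # assumed: profile.id, chromosome, position, logratio
str(neuroblastoma$annotations)  # assumed: profile.id, chromosome, min, max, annotation
```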

SLIDE 32

The cghseg.k/pelt.n least squares model

For a signal $y \in \mathbb{R}^d$, we define $\hat y^k$, the maximum likelihood model with $k \in \{1, \dots, d\}$ segments, as

$$\hat y^k = \arg\min_{\mu \in \mathbb{R}^d} ||y - \mu||_2^2 \quad \text{subject to} \quad k - 1 = \sum_{j=1}^{d-1} 1_{\mu_j \ne \mu_{j+1}}.$$

Lavielle (2005): select the number of segments using the penalty

$$k^*(\lambda) = \arg\min_k \lambda k d + ||y - \hat y^k||_2^2.$$

We use the tradeoff $\lambda$ which maximizes agreement with a database of breakpoint annotations.

But why this particular penalty? Should we normalize using the variance of the signal $y$?
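A sketch of Lavielle's selection rule in R, assuming rss[k] = ||y − ŷ^k||² has already been computed for k = 1, ..., kmax (e.g. by cghseg's pruned dynamic programming):

```r
## k*(lambda) = argmin_k lambda*k*d + RSS_k, for a signal of length d.
select.segments <- function(rss, d, lambda) {
  k <- seq_along(rss)
  which.min(lambda * k * d + rss)
}
```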

SLIDE 34
◮ Ch. 2: clusterpath finds groups in data
◮ Ch. 3: breakpoint annotations for smoothing model selection
◮ Ch. 4-5: penalties for breakpoint detection

SLIDE 35

Simulated signals with varying sampling density

Can we learn a parameter λ on the first signal, and use it for accurate breakpoint detection on the second?

SLIDE 36

Exact breakpoint error curves for 2 signals

[Figure: error relative to latent breakpoints versus the number of segments of the estimated cghseg model, for two simulated signals with bases/probe = 374 and bases/probe = 7.]

For each signal $i$, calculate the breakpoint error function $\text{BErr}_i : \{1, \dots, k_{\max}\} \to \mathbb{R}_+$.

SLIDE 37

Model selection error curves suggest α = 1/2

For each signal $i$, select the number of segments using

$$k_i^\alpha(\lambda) = \arg\min_k \lambda k d_i^\alpha + ||y_i - \hat y_i^k||_2^2.$$

Question: which penalty exponent $\alpha$ is best? Plot the error curves

$$E_i^\alpha(\lambda) = \text{BErr}_i\left[k_i^\alpha(\lambda)\right].$$

[Figure: error curves versus log10(λ) in panels for α ∈ {0, 0.5, 1}, for bases/probe ∈ {7, 374}.]

Then pick the parameter with minimal breakpoint error (dots):

$$\hat\lambda_i^\alpha = \arg\min_{\lambda \in \mathbb{R}_+} E_i^\alpha(\lambda).$$
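A sketch of these error curves in R, assuming rss[k] and the breakpoint error berr[k] are precomputed for one signal of length d:

```r
## Selection rule with penalty lambda * k * d^alpha.
k.alpha <- function(rss, d, lambda, alpha) {
  k <- seq_along(rss)
  which.min(lambda * k * d^alpha + rss)
}

## Error curve E_i^alpha(lambda) = BErr_i[k_i^alpha(lambda)] on a lambda grid.
E.alpha <- function(rss, berr, d, lambdas, alpha)
  sapply(lambdas, function(l) berr[k.alpha(rss, d, l, alpha)])
```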

SLIDE 38

Train and test error curves suggest α = 1/2

$n = 8$ signals $i$ considered, with $d_1 = 70, \dots, d_8 = 70{,}000$ points sampled, and a penalty with term $d_i^\alpha$.

[Figure: total error relative to latent breaks versus penalty exponent α, for train and test curves; both are minimized near α = 1/2.]

$$\text{train}(\alpha) = \min_\lambda \sum_{i=1}^n E_i^\alpha(\lambda), \qquad \text{test}(\alpha) = \sum_{i \ne j} E_i^\alpha(\hat\lambda_j^\alpha).$$

SLIDE 39

Suggested penalties do not work on real data

$$k_i(\lambda) = \arg\min_k \lambda k \sqrt{d_i} + ||y_i - \hat y_i^k||_2^2 \qquad (1)$$

For each signal $i$, normalize for
◮ the number of points sampled $d_i$,
◮ the signal length in base pairs $l_i$,
◮ and the estimated noise $\hat s_i$:

$$k_i(\lambda) = \arg\min_k \lambda k \hat s_i^2 \sqrt{d_i / l_i} + ||y_i - \hat y_i^k||_2^2 \qquad (2)$$

           exponents                      annotation error
model      points  length  variance      train  test.mean  test.sd
cghseg.k   1                             2.19   2.20       1.01
(1)        1/2                           3.51   3.87       1.58
(2)        1/2     -1/2    2             4.18   6.38       7.61

SLIDE 40

In real data, we only have weak annotations

[Figure: logratio versus position on chromosome (mega base pairs) for signals 1.7 and 10.7, with weak annotations: one region labeled 1 breakpoint and one labeled 0 breakpoints.]

SLIDE 41

Model selection and error curves for 2 signals

Optimal number of segments:

$$z_i^*(L) = \arg\min_k \exp(L)\, k + ||y_i - \hat y_i^k||_2^2.$$

[Figure: for signals 1.7 and 10.7, the selected segments $z_i^*(L)$ and the annotation error $E_i(L)$ as functions of the penalty exponent $L$.]

SLIDE 42

Target interval $[\underline{L}_i, \overline{L}_i]$ for 2 signals

[Figure: target interval limits $\underline{L}_i$ and $\overline{L}_i$ in the plane of variance estimate $\log \hat s_i$ versus penalty exponent $L$, for the two signals.]

SLIDE 43

Target interval $[\underline{L}_i, \overline{L}_i]$ for all signals

[Figure: same plane as the previous slide, showing the target intervals of all signals.]

SLIDE 44

Limit point representation, find a separator

[Figure: each interval reduced to its limit points $\underline{L}_i$, $\overline{L}_i$ in the ($\log \hat s_i$, $L$) plane; the goal is to find a separator between the two sets of limits.]

SLIDE 45

Max margin regression line

[Figure: the max margin regression line separating the lower limits $\underline{L}_i$ from the upper limits $\overline{L}_i$ in the ($\log \hat s_i$, $L$) plane.]

SLIDE 46

Learning the penalty function

◮ For every signal $i$, calculate a variance estimate feature $x_i = \log \hat s_i$.
◮ Predict model complexity using an affine function $f(x_i) = w \log \hat s_i + \beta$.
◮ Equivalent to learning a penalty function:

$$z_i^*[f(x_i)] = \arg\min_k ||y_i - \hat y_i^k||_2^2 + \exp[f(x_i)]\, k = \arg\min_k ||y_i - \hat y_i^k||_2^2 + \hat s_i^w e^\beta k.$$
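A sketch of applying the learned penalty to a new signal in R, given coefficients w and beta already fit to the annotation database:

```r
## Choose k with the learned penalty s.hat^w * exp(beta) * k, assuming
## rss[k] = ||y - yhat^k||^2 and a variance estimate s.hat for the signal.
select.k.learned <- function(rss, s.hat, w, beta) {
  k <- seq_along(rss)
  which.min(rss + s.hat^w * exp(beta) * k)
}
```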

SLIDE 47

Original data: annotate the same regions

SLIDE 48

Detailed data: draw rectangles around regions

SLIDE 49

Optimal model complexity depends on variance

$$\arg\min_{\beta \in \mathbb{R},\, w \in \mathbb{R}^m} \; \frac{1}{n} \sum_{i=1}^n l_i(w' x_i + \beta) + \gamma ||w||_1$$
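A minimal sketch of the un-regularized case ($\gamma = 0$, one feature), under the assumption that the surrogate $l_i$ is a squared hinge loss which is zero (with unit margin) when the predicted log penalty falls inside the target interval $[\underline{L}_i, \overline{L}_i]$:

```r
## Squared hinge: positive when z < 1, zero otherwise.
squared.hinge <- function(z) ifelse(z < 1, (1 - z)^2, 0)

## Mean interval loss of prediction w*x + beta against intervals [lo, hi].
interval.loss <- function(theta, x, lo, hi) {
  pred <- theta[1] * x + theta[2]  # w * x_i + beta
  mean(squared.hinge(pred - lo) + squared.hinge(hi - pred))
}

## fit <- optim(c(0, 0), interval.loss, x = x, lo = lo, hi = hi)
```

The loss is convex in (w, β), so any stationary point found by the optimizer is a global minimum.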

SLIDE 51

Error estimated using 10-fold cross-validation

$$f(x_i) = w_1 \log \hat s_i + w_2 \log d_i + \beta$$

◮ cghseg.k: $w_1 = 0$, $w_2 = 1$; learn $\beta$ using grid search to minimize the annotation error $E_i$.
◮ log.s.log.d: learn $\beta, w_1, w_2$ by minimizing the un-regularized ($\gamma = 0$) surrogate loss $l_i$.
◮ L1-reg: variance estimate, signal size, model error, and chromosome indicator features $x_i \in \mathbb{R}^{117}$; cross-validation to choose the degree of $\ell_1$ regularization $\gamma$.

[Figure: percent test annotation error (mean + 1 standard deviation) of L1-reg, log.s.log.d, and cghseg.k on the original and detailed.low.density annotation sets.]
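A sketch of the cross-validation at the profile level, where test.error is a hypothetical helper that trains a penalty model on one set of profiles and returns the percent annotation error on another:

```r
## 10-fold CV over profile ids from the neuroblastoma annotations.
ids <- unique(neuroblastoma$annotations$profile.id)
fold <- sample(rep(1:10, length.out = length(ids)))
cv <- sapply(1:10, function(f) test.error(ids[fold != f], ids[fold == f]))
mean(cv)
```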

SLIDE 52

Conclusions and future work

◮ Penalties suggested by theory and simulations are suboptimal in real data.
◮ Annotations + convex optimization = learned penalties.
◮ Learned penalties show good change-point detection.
◮ Learning more general penalty functions?
◮ Active learning strategy for picking the next signal to annotate?
◮ Do annotation-guided models help to predict clinical patient outcome?

SLIDE 53

Acknowledgements

Labs:
◮ INRIA-Sierra
◮ INSERM U830
◮ Institut Curie bioinformatics (U900)

Coauthors:
◮ Armand Joulin
◮ Isabelle Janoueix-Lerosey
◮ Gudrun Schleiermacher
◮ Olivier Delattre
◮ Guillem Rigaill

Thank you!
