
SLIDE 1

Fast sparse methods for genomics data

Jean-Philippe Vert Optimization and Statistical Learning workshop, Les Houches, January 6-11, 2013

JP Vert (ParisTech) Sparse methods in genomics Les Houches 2013 1 / 47

SLIDE 2

Normal vs cancer cells

What goes wrong? How to treat?

SLIDE 3

Biology is now quantitative, "high-throughput"

DOE Joint Genome Institute

SLIDE 4

Big data in biology

"The $1,000 genome, the $1 million interpretation" (B. Kopf)
High-dimensional, heterogeneous, structured data: "large p"
http://aws.amazon.com/1000genomes/

SLIDE 5

In this talk

1. Mapping DNA breakpoints in cancer genomes (w. K. Bleakley)
2. Isoform detection from RNA-seq data (w. E. Bernard, J. Mairal, L. Jacob)

SLIDE 6

Outline

1. Mapping DNA breakpoints in cancer genomes (w. K. Bleakley)
2. Isoform detection from RNA-seq data (w. E. Bernard, J. Mairal, L. Jacob)

SLIDE 7

Chromosomic aberrations in cancer

SLIDE 8

Comparative Genomic Hybridization (CGH)

Motivation

Comparative genomic hybridization (CGH) data measure DNA copy number along the genome. Very useful, in particular in cancer research, to systematically observe variants in DNA content.

[Figure: CGH log-ratio profile along chromosomes 1–23 and X]

SLIDE 9

Can we identify breakpoints and "smooth" each profile?

[Figure: noisy piecewise constant signal of length 1000 to be segmented]

A classical multiple change-point detection problem. Should scale to lengths of order 10^6 ∼ 10^9.

SLIDE 11

An optimal solution

[Figure: the same signal with its optimal piecewise constant approximation]

For a signal Y ∈ R^p, define an optimal approximation β ∈ R^p with k breakpoints as the solution of

min_{β ∈ R^p} ‖Y − β‖²  such that  Σ_{i=1}^{p−1} 1(β_{i+1} ≠ β_i) ≤ k

This is an optimization problem over the (p choose k) partitions...

Dynamic programming finds the solution in O(p²k) time and O(p²) memory. But: it does not scale to p = 10^6 ∼ 10^9...
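As a concrete sketch of this dynamic program (my own minimal Python illustration, not the authors' implementation; the recursion "best cost of a prefix with b breakpoints" is the standard O(p²k) one):

```python
import numpy as np

def dp_segmentation(y, k):
    """Exact segmentation of y with exactly k breakpoints, minimizing
    the squared error to a piecewise constant fit. O(p^2 k) time."""
    p = len(y)
    y = np.asarray(y, dtype=float)
    s = np.concatenate([[0.0], np.cumsum(y)])
    s2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(i, j):
        # squared error of the best constant fit on y[i:j]
        n = j - i
        return s2[j] - s2[i] - (s[j] - s[i]) ** 2 / n

    INF = float("inf")
    # C[b][j-1] = best cost of y[:j] using exactly b breakpoints
    C = [[cost(0, j) for j in range(1, p + 1)]]
    back = []
    for b in range(1, k + 1):
        row, arg = [], []
        for j in range(1, p + 1):
            best, best_t = INF, -1
            for t in range(b, j):          # last segment is y[t:j]
                c = C[b - 1][t - 1] + cost(t, j)
                if c < best:
                    best, best_t = c, t
            row.append(best)
            arg.append(best_t)
        C.append(row)
        back.append(arg)
    # backtrack the k breakpoint positions
    bps, j = [], p
    for b in range(k, 0, -1):
        t = back[b - 1][j - 1]
        bps.append(t)
        j = t
    return sorted(bps), C[k][p - 1]
```

On a toy signal the exact breakpoints come back, but the quadratic cost in p is precisely what prevents scaling to p = 10^6.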

SLIDE 15

Promoting sparsity with the ℓ1 penalty

The ℓ1 penalty (Tibshirani, 1996; Chen et al., 1998)

If R(β) is convex and "smooth", the solution of

min_{β∈R^p} R(β) + λ Σ_{i=1}^p |β_i|

is usually sparse.
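The mechanism behind this sparsity is easiest to see in one dimension: the prox of the ℓ1 penalty is soft-thresholding, which sets small coefficients exactly to zero (a minimal numpy sketch, not tied to any particular solver):

```python
import numpy as np

def soft_threshold(y, lam):
    """Coordinate-wise solution of min_b 0.5*(y - b)^2 + lam*|b|:
    the prox of the l1 penalty, which produces exact zeros."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, -0.5, 0.2, -2.0])
beta = soft_threshold(y, 1.0)
# coefficients smaller than lam in magnitude become exactly 0,
# larger ones are shrunk toward 0 by lam
```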

SLIDE 16

Promoting piecewise constant profiles penalty

The total variation / variable fusion penalty

If R(β) is convex and "smooth", the solution of

min_{β∈R^p} R(β) + λ Σ_{i=1}^{p−1} |β_{i+1} − β_i|

is usually piecewise constant (Rudin et al., 1992; Land and Friedman, 1996).

Proof: change of variables u_i = β_{i+1} − β_i, u_0 = β_1. We obtain a Lasso problem in u ∈ R^{p−1}; u sparse means β piecewise constant.
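The change of variables is easy to check numerically (a small numpy sketch; `np.diff` and `np.cumsum` play the roles of u and of its inversion):

```python
import numpy as np

# Change of variables behind the TV penalty: u_i = beta_{i+1} - beta_i.
# beta is piecewise constant exactly when u is sparse, so an l1 penalty
# on u (a Lasso) is a TV penalty on beta.
beta = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 2.0])   # 2 change-points
u = np.diff(beta)                                  # increments
beta_back = beta[0] + np.concatenate([[0.0], np.cumsum(u)])  # invert
```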

SLIDE 17

TV signal approximator

min_{β∈R^p} ‖Y − β‖²  such that  Σ_{i=1}^{p−1} |β_{i+1} − β_i| ≤ µ

Adding additional constraints does not change the change-points:
Σ_{i=1}^p |β_i| ≤ ν (Tibshirani et al., 2005; Tibshirani and Wang, 2008)
Σ_{i=1}^p β_i² ≤ ν (Mairal et al., 2010)

SLIDE 18

Solving TV signal approximator

min_{β∈R^p} ‖Y − β‖²  such that  Σ_{i=1}^{p−1} |β_{i+1} − β_i| ≤ µ

QP with sparse linear constraints in O(p²): 135 min for p = 10^5 (Tibshirani and Wang, 2008)
Coordinate descent-like method in O(p)?: 3 s for p = 10^5 (Friedman et al., 2007)
With the LARS in O(pk) (Harchaoui and Lévy-Leduc, 2008)
For all µ in O(p ln p) (Hoefling, 2009)
For the first k change-points in O(p ln k) (Bleakley and V., 2010)

SLIDE 19

Solving TV signal approximator in O(p ln k)

Theorem (V. and Bleakley, 2010; see also Hoefling, 2009)

TV signal approximator performs "greedy" dichotomic segmentation

Algorithm 1: Greedy dichotomic segmentation
Require: k, the number of intervals; γ(I), the gain of splitting an interval I into I_L(I), I_R(I)
1: I_0 ← the interval [1, n]
2: P ← {I_0}
3: for i = 1 to k do
4:   I* ← argmax_{I ∈ P} γ(I)
5:   P ← P \ {I*}
6:   P ← P ∪ {I_L(I*), I_R(I*)}
7: end for
8: return P

This apparently greedy algorithm finds the global optimum!
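A minimal Python sketch of Algorithm 1 (my own illustration: the gain γ(I) is taken here to be the plain decrease in squared error of the best split, whereas the theorem concerns the specific gain induced by the TV approximator):

```python
import heapq
import numpy as np

def best_split(y, lo, hi):
    """Best single split point of y[lo:hi] and its gain.
    Illustrative gain: decrease in squared error of the best split."""
    seg = np.asarray(y[lo:hi], dtype=float)
    n = len(seg)
    if n < 2:
        return 0.0, None
    sse = ((seg - seg.mean()) ** 2).sum()
    best_gain, best_t = 0.0, None
    for t in range(1, n):
        l, r = seg[:t], seg[t:]
        split = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
        if best_t is None or sse - split > best_gain:
            best_gain, best_t = sse - split, lo + t
    return best_gain, best_t

def greedy_dichotomic(y, k):
    """Algorithm 1: k times, split the interval with the largest gain."""
    intervals = [(0, len(y))]
    heap = []
    g, t = best_split(y, 0, len(y))
    heapq.heappush(heap, (-g, 0, len(y), t))
    for _ in range(k):
        _, lo, hi, t = heapq.heappop(heap)
        if t is None:          # nothing left to split
            break
        intervals.remove((lo, hi))
        intervals += [(lo, t), (t, hi)]
        for a, b in ((lo, t), (t, hi)):
            g2, t2 = best_split(y, a, b)
            heapq.heappush(heap, (-g2, a, b, t2))
    return sorted(intervals)
```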

SLIDE 20

Speed trial : 2 s. for k = 100, p = 107

[Figure: running time (seconds) vs. signal length up to 10^6, for k = 1, 10, 10², 10³, 10⁴, 10⁵]

SLIDE 21

Extension 1: linear discrimination / regression

[Figure: CGH profiles of melanoma samples]

Aggressive (left) vs non-aggressive (right) melanoma

SLIDE 22

Fused lasso for supervised classification

[Figure: CGH profiles of aggressive and non-aggressive melanoma samples]

Idea: find a linear predictor f(Y) = β⊤Y that best discriminates the aggressive vs. non-aggressive samples, subject to the constraint that it should be sparse and piecewise constant.

Mathematically: min_{β∈R^p} R(β) + λ₁ ‖β‖₁ + λ₂ ‖β‖_TV

Computationally: proximal methods
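To illustrate what "proximal methods" means here, a minimal ISTA sketch for the ℓ1 part alone (my own example with a plain least-squares loss; the actual fused lasso would use the prox of λ₁‖·‖₁ + λ₂‖·‖_TV instead of the plain soft-threshold):

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_b 0.5*||y - X b||^2 + lam*||b||_1.
    Each step: gradient step on the smooth part, then the prox of the
    l1 penalty (soft-thresholding)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * X.T @ (X @ b - y)     # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox
    return b
```

Swapping the prox at the last line for the prox of the fused-lasso penalty gives the method sketched on this slide.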

SLIDE 25

Prognosis in melanoma (Rapaport et al., 2008)

[Figure: weight assigned to each BAC along the genome by the learned predictor]

SLIDE 26

Extension 2: finding multiple change points shared by several profiles

[Figure: three profiles of length 1000 sharing common change-points]

SLIDE 28

"Optimal" segmentation by dynamic programming

[Figure: joint segmentation of three profiles sharing change-points]

Define the "optimal" piecewise constant approximation Û ∈ R^{p×n} of Y as the solution of

min_{U∈R^{p×n}} ‖Y − U‖²  such that  Σ_{i=1}^{p−1} 1(U_{i+1,•} ≠ U_{i,•}) ≤ k

DP finds the solution in O(p²kn) time and O(p²) memory. But: it does not scale to p = 10^6 ∼ 10^9...

SLIDE 29

Selecting pre-defined groups of variables

Group lasso (Yuan & Lin, 2006)

If groups of covariates are likely to be selected together, the ℓ1/ℓ2 norm induces sparse solutions at the group level:

Ω_group(w) = Σ_g ‖w_g‖₂

For example, Ω(w₁, w₂, w₃) = ‖(w₁, w₂)‖₂ + ‖w₃‖₂ = √(w₁² + w₂²) + √(w₃²)
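The group-level sparsity comes from the prox of Ω_group, a block soft-thresholding that zeroes whole groups at once (a minimal numpy sketch with made-up variable names):

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Prox of lam * sum_g ||w_g||_2: shrinks each group's norm by lam,
    setting entire groups to exactly zero when their norm is <= lam."""
    out = np.zeros(len(w))
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.3, -0.4])
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(w, groups, 1.0)
# the (3, 4) group (norm 5) is shrunk as a block;
# the (0.3, -0.4) group (norm 0.5) is zeroed entirely
```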

SLIDE 30

GFLseg (Bleakley and V., 2011)

Replace

min_{U∈R^{p×n}} ‖Y − U‖²  such that  Σ_{i=1}^{p−1} 1(U_{i+1,•} ≠ U_{i,•}) ≤ k

by

min_{U∈R^{p×n}} ‖Y − U‖²  such that  Σ_{i=1}^{p−1} w_i ‖U_{i+1,•} − U_{i,•}‖ ≤ µ

GFLseg = Group Fused Lasso segmentation

Questions
Practice: can we solve it efficiently?
Theory: does it recover the correct segmentation?

SLIDE 32

GFLseg as a group Lasso problem

Make the change of variables γ = U_{1,•}, β_{i,•} = w_i (U_{i+1,•} − U_{i,•}) for i = 1, . . . , p − 1.

The TV approximator is then equivalent to the following group Lasso problem (Yuan and Lin, 2006):

min_{β∈R^{(p−1)×n}} ‖Ȳ − X̄β‖² + λ Σ_{i=1}^{p−1} ‖β_{i,•}‖,

where Ȳ is the centered signal matrix and X̄ is a particular (p − 1) × (p − 1) design matrix.

SLIDE 33

TV approximator implementation

min_{β∈R^{(p−1)×n}} ‖Ȳ − X̄β‖² + λ Σ_{i=1}^{p−1} ‖β_{i,•}‖

Theorem

The TV approximator can be solved efficiently:
"approximately" with the group LARS in O(npk) time and O(np) memory
"exactly" with a block coordinate descent + active set method in O(np) memory

SLIDE 34

Proof: computational tricks... (from Zaid)

Although X̄ is (p − 1) × (p − 1):

For any R ∈ R^{p×n}, we can compute C = X̄⊤R in O(np) operations and memory.
For any two subsets of indices A = (a_1, . . . , a_{|A|}) and B = (b_1, . . . , b_{|B|}) in [1, p − 1], we can compute X̄⊤_{•,A} X̄_{•,B} in O(|A||B|) time and memory.
For any A = (a_1, . . . , a_{|A|}), a set of distinct indices with 1 ≤ a_1 < . . . < a_{|A|} ≤ p − 1, and for any |A| × n matrix R, we can compute C = (X̄⊤_{•,A} X̄_{•,A})^{−1} R in O(|A|n) time and memory.
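The first trick can be checked numerically if one assumes the usual step-function design behind the TV approximator, X_{i,j} = 1(i ≥ j), column-centered (this structure of X̄ is an assumption of the sketch, not stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 3
R = rng.standard_normal((p, n))

# Assumed design: column j of X is the step function 1(i >= j),
# then column-centered to give Xbar. This structure is what turns
# X^T R into a matter of cumulative sums.
X = (np.arange(p)[:, None] >= np.arange(1, p)[None, :]).astype(float)
Xbar = X - X.mean(axis=0)

naive = Xbar.T @ R                        # O(p^2 n) dense product
tail = R[::-1].cumsum(axis=0)[::-1]       # tail[i] = sum_{i' >= i} R_i'
# row j of X^T R is tail[j]; subtract the centering correction: O(pn)
trick = tail[1:] - X.mean(axis=0)[:, None] * R.sum(axis=0)[None, :]
```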

SLIDE 35

Speed trial

[Figure: log-log running times of GFLars and GFLasso as a function of signal length n, number of profiles p, and number of change-points k]

Figure 2: Speed trials for group fused LARS (top row) and Lasso (bottom row). Left column: varying n, with fixed p = 10 and k = 10; center column: varying p, with fixed n = 1000 and k = 10; right column: varying k, with fixed n = 1000 and p = 10. Figure axes are log-log. Results are averaged over 100 trials.

SLIDE 36

Consistency

Suppose a single change-point at position u = αp, with increments (β_i)_{i=1,...,n} such that β̄² = lim_{n→∞} (1/n) Σ_{i=1}^n β_i², corrupted by i.i.d. Gaussian noise of variance σ².

[Figure: three noisy profiles sharing a single change-point]

Does the TV approximator correctly estimate the first change-point as p increases?

SLIDE 37

Consistency of the unweighted TV approximator

min_{U∈R^{p×n}} ‖Y − U‖²  such that  Σ_{i=1}^{p−1} ‖U_{i+1,•} − U_{i,•}‖ ≤ µ

Theorem

The unweighted TV approximator finds the correct change-point with probability tending to 1 (resp. 0) as n → +∞ if σ² < σ̃²_α (resp. σ² > σ̃²_α), where

σ̃²_α = p β̄² (1 − α)² (α − 1/(2p)) / (α − 1/2 − 1/(2p)).

Correct estimation on [pε, p(1 − ε)] with ε = √(σ² / (2p β̄²)) + o(p^{−1/2}); wrong estimation near the boundaries.

SLIDE 38

Consistency of the weighted TV approximator

min_{U∈R^{p×n}} ‖Y − U‖²  such that  Σ_{i=1}^{p−1} w_i ‖U_{i+1,•} − U_{i,•}‖ ≤ µ

Theorem

The weighted TV approximator with weights w_i = √(i(p − i)/p) for all i ∈ [1, p − 1] correctly finds the first change-point with probability tending to 1 as n → +∞.

We see the benefit of increasing n, and the benefit of adding weights to the TV penalty.

SLIDE 39

Proof sketch

The first change-point î found by the TV approximator maximizes F_i = ‖ĉ_{i,•}‖², where ĉ = X̄⊤Ȳ = X̄⊤X̄β* + X̄⊤W.

ĉ is Gaussian, and F_i follows a non-central χ² distribution with

G_i = E F_i / p = (i(p − i) / (p w_i²)) σ² + (β̄² / (w_i² w_u² p²)) × { i²(p − u)² if i ≤ u ; u²(p − i)² otherwise }.

We then just check when G_u = max_i G_i.

SLIDE 40

Consistency for a single change-point

[Figure: single change-point accuracy vs. p; three panels: unweighted, weighted, weighted with varying change-point location]

Figure 3: Single change-point accuracy for the group fused Lasso. Accuracy as a function of the number of profiles p when the change-point is placed in a variety of positions, u = 50 to u = 90 (left and centre plots: unweighted and weighted group fused Lasso, respectively), or u = 50 ± 2 to u = 90 ± 2 (right plot: weighted, with varying change-point location), for a signal of length 100.

SLIDE 41

Estimation of several change-points

[Figure: multiple change-point accuracy vs. p for U-LARS, W-LARS, U-Lasso, and W-Lasso at three noise levels]

Figure 4: Multiple change-point accuracy. Accuracy as a function of the number of profiles p when change-points are placed at the nine positions {10, 20, . . . , 90} and the variance σ² of the centered Gaussian noise is 0.05 (left), 0.2 (center), or 1 (right). The profile length is 100.

SLIDE 42

Application: detection of frequent abnormalities

[Figure: three CGH log-ratio profiles with recurrent shared abnormalities along chromosomes 1–22]

SLIDE 43

Outline

1. Mapping DNA breakpoints in cancer genomes (w. K. Bleakley)
2. Isoform detection from RNA-seq data (w. E. Bernard, J. Mairal, L. Jacob)

SLIDE 44

Central dogma

SLIDE 45

Alternative splicing: 1 gene = many proteins

SLIDE 46

RNA-seq measures RNA abundance

SLIDE 47

RNA-seq and alternative splicing

(Costa et al., 2011)

SLIDE 48

The isoform deconvolution problem

(Xia et al., 2011)

SLIDE 49

More formally

e exons; c candidate isoforms (up to 2^e − 1); φ ∈ R^c_+ the vector of isoform abundances (unknown!)

U: a binary matrix whose rows are indexed by isoforms (isoform_1, . . . , isoform_c) and whose columns are indexed by exons and junctions (exon_1, . . . , exon_e, junction_{1,2}, . . . , junction_{e−1,e}), with U_{ij} = 1 if isoform i contains exon/junction j.

U⊤φ gives the abundance of each exon/junction.

Goal: estimate φ from the observed reads on each exon/junction.
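A toy instance of this notation (a hypothetical 3-exon gene; the exon/junction columns and abundances are made up for illustration):

```python
import numpy as np

# Rows of U = candidate isoforms; columns = exons and junctions.
#                e1  e2  e3  j12 j23 j13
U = np.array([[   1,  1,  1,  1,  1,  0],   # isoform e1-e2-e3
              [   1,  0,  1,  0,  0,  1],   # isoform e1-e3 (skips e2)
              [   0,  1,  1,  0,  1,  0]])  # isoform e2-e3
phi = np.array([2.0, 1.0, 0.5])             # unknown abundances

expected_coverage = U.T @ phi   # abundance of each exon/junction
```

Deconvolution is the inverse problem: recover a sparse nonnegative φ from noisy observations of U⊤φ.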

SLIDE 50

Isoform deconvolution with the Lasso

Estimate a sparse φ by solving

min_{φ∈R^c_+} R(U⊤φ) + λ ‖φ‖₁

IsoLasso (Li et al., 2011), NSMAP (Xia et al., 2011), SLIDE (Li et al., 2011)

Works well, BUT it is computationally challenging to enumerate all candidate isoforms (up to 2^e) for large genes!

SLIDE 51

Fast isoform deconvolution with the Lasso

Theorem (Bernard, Mairal, Jacob and V., 2012)

The isoform deconvolution problem

min_{φ∈R^c_+} R(U⊤φ) + λ ‖φ‖₁

can be solved in time polynomial in the number of exons.

Key ideas
1. U⊤φ corresponds to a flow on a graph
2. Reformulation as a convex cost flow problem (Mairal and Yu, 2012)
3. Recover the isoforms by a flow decomposition algorithm

"Feature selection over an exponential number of features, in polynomial time"

SLIDE 52

From isoforms to flows

Isoforms are paths; linear combinations of isoforms are flows.

SLIDE 53

Isoform deconvolution as convex cost flow problem

min_{φ∈R^c_+} R(U⊤φ) + λ ‖φ‖₁

is equivalent to

min_{f flow} R(f) + λ f_t

SLIDE 54

Speed trial

[Figure: elapsed time (s) vs. number of candidate isoforms, log scales: FLOW vs. NSMAP]

SLIDE 55

Thanks!

Kevin Bleakley (INRIA), Elsa Bernard (ParisTech/Institut Curie), Laurent Jacob (UC Berkeley/CNRS), Julien Mairal (UC Berkeley/INRIA)
