Fast sparse methods for genomics data
Jean-Philippe Vert, PowerPoint PPT Presentation


  1. Fast sparse methods for genomics data. Jean-Philippe Vert. Optimization and Statistical Learning workshop, Les Houches, January 6-11, 2013.

  2. Normal vs cancer cells. What goes wrong? How to treat?

  3. Biology is now quantitative, "high-throughput". (Image: DOE Joint Genome Institute.)

  4. Big data in biology. "The $1,000 genome, the $1 million interpretation" (B. Kopf). High-dimensional, heterogeneous, structured data: "large p". http://aws.amazon.com/1000genomes/

  5. In this talk: 1. Mapping DNA breakpoints in cancer genomes (with K. Bleakley); 2. Isoform detection from RNA-seq data (with E. Bernard, J. Mairal, L. Jacob).

  6. Outline: 1. Mapping DNA breakpoints in cancer genomes (with K. Bleakley); 2. Isoform detection from RNA-seq data (with E. Bernard, J. Mairal, L. Jacob).

  7. Chromosomic aberrations in cancer.

  8. Comparative genomic hybridization (CGH). Motivation: CGH data measure the DNA copy number along the genome. This is very useful, in particular in cancer research, to systematically observe variants in DNA content. (Figure: log-ratio copy-number profile along chromosomes 1-23 and X.)

  9. Can we identify breakpoints and "smooth" each profile? (Figure: noisy signal of length 1000 with its piecewise-constant fit.) A classical multiple change-point detection problem. Should scale to lengths of order 10⁶ ∼ 10⁹.


  10. An optimal solution. (Figure: noisy signal of length 1000 with its optimal piecewise-constant approximation.) For a signal Y ∈ ℝ^p, define an optimal approximation β ∈ ℝ^p with k breakpoints as the solution of

      min_{β ∈ ℝ^p} ‖Y − β‖²   such that   Σ_{i=1}^{p−1} 1(β_{i+1} ≠ β_i) ≤ k

  This is an optimization problem over the (p choose k) partitions... Dynamic programming finds the solution in O(p²k) time and O(p²) memory (a sketch follows below). But: it does not scale to p = 10⁶ ∼ 10⁹...

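  A minimal sketch of that dynamic program, as an illustration only (my code, not the authors'). Prefix sums make each segment cost O(1), which gives the O(p²k) running time quoted on the slide; the O(p²) memory of the classical version comes from precomputing all segment costs, which the prefix sums avoid here.

    import numpy as np

    def optimal_segmentation(y, k):
        """Optimal approximation of y with at most k breakpoints
        (k + 1 constant segments) by dynamic programming, O(p^2 k) time."""
        y = np.asarray(y, dtype=float)
        p = len(y)
        # Prefix sums of y and y^2: the squared error of fitting y[i:j]
        # by its mean then costs O(1) per segment.
        s1 = np.concatenate(([0.0], np.cumsum(y)))
        s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

        def seg_cost(i, j):  # squared error of a constant fit on y[i:j]
            n = j - i
            return (s2[j] - s2[i]) - (s1[j] - s1[i]) ** 2 / n

        # best[m, j]: lowest cost of fitting y[0:j] with m constant segments.
        best = np.full((k + 2, p + 1), np.inf)
        split = np.zeros((k + 2, p + 1), dtype=int)
        best[1, 1:] = [seg_cost(0, j) for j in range(1, p + 1)]
        for m in range(2, k + 2):
            for j in range(m, p + 1):
                cands = [best[m - 1, i] + seg_cost(i, j) for i in range(m - 1, j)]
                i_min = int(np.argmin(cands))
                best[m, j] = cands[i_min]
                split[m, j] = i_min + m - 1
        # Backtrack the k breakpoint positions.
        bps, j = [], p
        for m in range(k + 1, 1, -1):
            j = split[m, j]
            bps.append(j)
        return sorted(bps), best[k + 1, p]

    # Toy usage on a signal with 2 true breakpoints:
    y = np.r_[np.zeros(50), 2 * np.ones(50), np.ones(50)]
    print(optimal_segmentation(y + 0.1 * np.random.randn(150), 2)[0])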

  11. Promoting sparsity with the ℓ1 penalty. The ℓ1 penalty (Tibshirani, 1996; Chen et al., 1998): if R(β) is convex and "smooth", the solution of

      min_{β ∈ ℝ^p} R(β) + λ Σ_{i=1}^{p} |β_i|

  is usually sparse.
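  A concrete special case (my example, not on the slide): with R(β) = ½‖Y − β‖², the minimizer has a closed form, coordinate-wise soft thresholding, which is exactly why the ℓ1 penalty zeroes out small coordinates.

    import numpy as np

    def soft_threshold(y, lam):
        """argmin_b 0.5 * (y - b)**2 + lam * |b|, applied coordinate-wise."""
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

    y = np.array([3.0, -0.4, 1.2, -2.5, 0.1])
    print(soft_threshold(y, 1.0))  # small coordinates are set exactly to 0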

  12. Promoting piecewise-constant profiles: the total variation / variable fusion penalty. If R(β) is convex and "smooth", the solution of

      min_{β ∈ ℝ^p} R(β) + λ Σ_{i=1}^{p−1} |β_{i+1} − β_i|

  is usually piecewise constant (Rudin et al., 1992; Land and Friedman, 1996). Proof: the change of variable u_0 = β_1, u_i = β_{i+1} − β_i gives a Lasso problem in u ∈ ℝ^{p−1}, and u sparse means β piecewise constant (see the numerical illustration below).
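  The change of variable is easy to see numerically. A small illustration (mine): with L the lower-triangular matrix of ones, β = Lu, so the total variation of β is an ℓ1 norm on the differences u.

    import numpy as np

    p = 6
    L = np.tril(np.ones((p, p)))                   # beta = L @ u (cumulative sums)
    u = np.array([1.0, 0.0, 2.0, 0.0, 0.0, -1.5])  # sparse differences
    beta = L @ u
    print(beta)           # [1.  1.  3.  3.  3.  1.5]: piecewise constant
    print(np.diff(beta))  # recovers u[1:], so TV(beta) = ||u[1:]||_1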

  13. TV signal approximator:

      min_{β ∈ ℝ^p} ‖Y − β‖²   such that   Σ_{i=1}^{p−1} |β_{i+1} − β_i| ≤ µ

  Adding additional constraints does not change the change-points:
      Σ_{i=1}^{p} |β_i| ≤ ν (Tibshirani et al., 2005; Tibshirani and Wang, 2008)
      Σ_{i=1}^{p} β_i² ≤ ν (Mairal et al., 2010)
  (A numerical check follows below.)
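  This invariance can be checked numerically. The following sketch (my illustration, using the off-the-shelf cvxpy modeler rather than a dedicated solver) solves the constrained problem with and without an extra ℓ1 ball and compares the detected change-points; the value of ν is a hypothetical choice.

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(0)
    y = np.r_[np.zeros(30), 2 * np.ones(40), np.ones(30)]
    y += 0.1 * rng.standard_normal(y.size)

    def tv_approx(y, mu, nu=None):
        beta = cp.Variable(y.size)
        cons = [cp.tv(beta) <= mu]             # sum_i |beta_{i+1} - beta_i| <= mu
        if nu is not None:
            cons.append(cp.norm1(beta) <= nu)  # additional l1 ball constraint
        cp.Problem(cp.Minimize(cp.sum_squares(y - beta)), cons).solve()
        return beta.value

    change_points = lambda b: np.where(np.abs(np.diff(b)) > 1e-3)[0]
    print(change_points(tv_approx(y, mu=3.0)))
    print(change_points(tv_approx(y, mu=3.0, nu=100.0)))  # same positions expected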

  14. Solving TV signal approximator, min_{β ∈ ℝ^p} ‖Y − β‖² such that Σ_{i=1}^{p−1} |β_{i+1} − β_i| ≤ µ:
      QP with sparse linear constraints in O(p²): 135 min for p = 10⁵ (Tibshirani and Wang, 2008)
      Coordinate descent-like method in O(p)?: 3 s for p = 10⁵ (Friedman et al., 2007)
      With the LARS in O(pk) (Harchaoui and Levy-Leduc, 2008)
      For all µ in O(p ln p) (Hoefling, 2009)
      For the first k change-points in O(p ln k) (Bleakley and V., 2010)

  15. Solving TV signal approximator in O(p ln k). Theorem (V. and Bleakley, 2010; see also Hoefling, 2009): TV signal approximator performs "greedy" dichotomic segmentation.

  Algorithm 1: Greedy dichotomic segmentation
  Require: k, the number of intervals; γ(I), the gain function to split an interval I into I_L(I), I_R(I)
  1: I_0 ← the interval [1, n]
  2: P ← {I_0}
  3: for i = 1 to k do
  4:   I* ← argmax_{I ∈ P} γ(I)
  5:   P ← P \ {I*}
  6:   P ← P ∪ {I_L(I*), I_R(I*)}
  7: end for
  8: return P

  The apparently greedy algorithm finds the global optimum!
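  A direct transcription of Algorithm 1 in Python (my sketch, assuming γ is the squared-error reduction of the best binary split). This naive version recomputes each split in O(length), so it does not achieve the O(p ln k) bound of the paper, which maintains the candidate splits more cleverly.

    import heapq
    import numpy as np

    def greedy_dichotomic_segmentation(y, k):
        """Perform k greedy splits (k breakpoints) of y, always splitting
        the interval whose best binary split most reduces squared error."""
        y = np.asarray(y, dtype=float)
        s1 = np.concatenate(([0.0], np.cumsum(y)))
        s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

        def cost(i, j):  # squared error of fitting y[i:j] by its mean
            return (s2[j] - s2[i]) - (s1[j] - s1[i]) ** 2 / (j - i)

        def best_split(i, j):  # best gain and split point for y[i:j]
            if j - i < 2:
                return 0.0, None
            gains = [cost(i, j) - cost(i, m) - cost(m, j) for m in range(i + 1, j)]
            m = int(np.argmax(gains)) + i + 1
            return gains[m - i - 1], m

        # Max-heap of candidate splits, keyed by -gain.
        g, m = best_split(0, len(y))
        heap = [(-g, 0, len(y), m)]
        boundaries = []
        for _ in range(k):
            neg_g, i, j, m = heapq.heappop(heap)
            if m is None:          # no interval can be split any further
                break
            boundaries.append(m)
            for a, b in ((i, m), (m, j)):
                g, mm = best_split(a, b)
                heapq.heappush(heap, (-g, a, b, mm))
        return sorted(boundaries)

    y = np.r_[np.zeros(50), 2 * np.ones(50), np.ones(50)]
    print(greedy_dichotomic_segmentation(y + 0.1 * np.random.randn(150), 2))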

  16. Speed trial: 2 s for k = 100, p = 10⁷. (Figure: running time in seconds vs. signal length up to 10⁶, for k = 1, 10, 10², 10³, 10⁴, 10⁵.)

  17. Extension 1: linear discrimination / regression. (Figure: copy-number profiles of aggressive (left) vs non-aggressive (right) melanoma.)

  18. Fused lasso for supervised classification. (Figure: aggressive vs non-aggressive melanoma profiles.) Idea: find a linear predictor f(Y) = β⊤Y that best discriminates the aggressive vs non-aggressive samples, subject to the constraint that it should be sparse and piecewise constant. Mathematically:

      min_{β ∈ ℝ^p} R(β) + λ₁‖β‖₁ + λ₂‖β‖_TV

  Computationally: proximal methods.
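  As a toy prototype of this formulation (my sketch, with synthetic data and hypothetical λ values; the talk relies on dedicated proximal solvers for scalability, cvxpy is only for illustration), the fused-lasso logistic regression can be written directly:

    import cvxpy as cp
    import numpy as np

    # X: n samples x p probes; y in {-1, +1}: aggressive vs non-aggressive.
    rng = np.random.default_rng(0)
    n, p = 40, 200
    X = rng.standard_normal((n, p))
    w_true = np.zeros(p)
    w_true[50:80] = 1.0                              # piecewise-constant truth
    y = np.sign(X @ w_true + 0.5 * rng.standard_normal(n))

    beta = cp.Variable(p)
    logistic_loss = cp.sum(cp.logistic(cp.multiply(-y, X @ beta))) / n
    lam1, lam2 = 0.01, 0.05                          # hypothetical values
    objective = logistic_loss + lam1 * cp.norm1(beta) + lam2 * cp.tv(beta)
    cp.Problem(cp.Minimize(objective)).solve()
    print(np.round(beta.value, 2))                   # sparse and piecewise constant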
