

  1. Kernel based methods for microarray and mass spectrometry data analysis
     Fabian Ojeda
     ESAT-SCD-SISTA Division, Department of Electrical Engineering
     Katholieke Universiteit Leuven, Leuven, Belgium
     May 20, 2011
     Committee: Prof. dr. ir. B. De Moor (promotor), Prof. dr. ir. J.A.K. Suykens (co-promotor), Prof. dr. ir. P. Sas (chairman), Prof. dr. ir. Y. Moreau, Prof. dr. J. Rozenski, Prof. dr. ir. M. Van Barel, Prof. dr. ir. G. Bontempi (ULB)

  2. Outline
     1 Background
     2 Low rank updated LS-SVM
     3 Sparse linear models
     4 Entropy based spectral clustering
     5 Conclusions

  4. Motivation and problem description
     Goal: apply regularization/kernel based methods, adapted to high dimensional, low sample size and large scale data.
     Methods: prediction models, model selection, variable selection, clustering.
     Applications:
     - Efficient variable selection algorithms for microarray data.
     - Incorporating structural information of MSI data.
     - Clustering methods for large scale gene clustering.

  5. Motivation and problem description
     Microarray / mass spectrometry data:
     - Simultaneous measurement of thousands of genes / proteins.
     - Structural and prior information available.
     - Large number of variables, low sample size.
     - Many irrelevant variables.
     - Lack of labeled data.

  6. Regularized learning models
     Microarray/MSI data representation: D = {(x_i, y_i)}_{i=1}^n, with x_i ∈ R^d, y_i ∈ R: n samples measured over d variables. Linear model:
       y_i = Σ_{k=1}^d w_k x_i^k + ε_i,  ε_i ~ N(0, σ^2),   (1)
     where x_i^k is the k-th component of x_i. Solve for ŵ = (ŵ_1, ..., ŵ_d)^T ∈ R^d:
       ŵ = argmin_w ||y − Xw||_2^2 + λ P(w),   (2)
     with λ > 0 the regularization parameter (λ = 0 gives OLS). Ridge regression uses P(w) = ||w||_2^2, the LASSO uses P(w) = ||w||_1. The penalty P(·) encodes a priori assumptions to make the problem well-posed.
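
A minimal NumPy sketch of the ridge case of (2), where P(w) = ||w||_2^2 and the minimizer has a closed form; the data below are synthetic and purely illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Solve min_w ||y - X w||_2^2 + lam * ||w||_2^2 in closed form."""
    d = X.shape[1]
    # Normal equations with Tikhonov term; invertible for any lam > 0,
    # even when d > n and X^T X alone is singular.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))   # n = 50 samples, d = 200 variables
w_true = np.zeros(200)
w_true[:5] = 1.0                     # only the first 5 variables are relevant
y = X @ w_true + 0.1 * rng.standard_normal(50)
w_hat = ridge(X, y, lam=1.0)         # well-posed despite d > n
```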

  7. Kernel methods
     - No assumptions about data structure.
     - Allow introduction of prior knowledge.
     - Input space X is mapped to a high dimensional feature space F.
     - Non-linear, general versions of linear algorithms.
     Kernel trick: mapping x → ϕ(x), kernel K(x, z) = ϕ(x)^T ϕ(z).
     [Figure: classes +1/−1, non-separable in the input space X, become linearly separable in F under ϕ(·).]
     Common kernels:
     - Linear: K(x, z) = x^T z.
     - Polynomial: K(x, z) = (x^T z + τ)^p, p ∈ N, τ ≥ 0.
     - Gaussian: K(x, z) = exp(−||x − z||_2^2 / σ^2), with kernel width σ ∈ R.
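
The three kernels above, written as a short batch sketch for data matrices X (n × d) and Z (m × d); the function names are illustrative.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, tau=1.0, p=3):
    return (X @ Z.T + tau) ** p

def gaussian_kernel(X, Z, sigma=1.0):
    # ||x - z||^2 = ||x||^2 - 2 x^T z + ||z||^2, evaluated in batch
    sq = (X**2).sum(axis=1)[:, None] - 2 * X @ Z.T + (Z**2).sum(axis=1)[None, :]
    return np.exp(-sq / sigma**2)
```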

  8. Least Squares Support Vector Machines (LS-SVM)
     Model: f(x) = w^T ϕ(x) + b, with parameter vector w ∈ R^{d_h} and feature map ϕ(·): R^d → R^{d_h}. Primal optimization problem:
       min_{w,b,e} (1/2) w^T w + γ (1/2) Σ_{i=1}^n e_i^2
       s.t. y_i = w^T ϕ(x_i) + b + e_i,  i = 1, ..., n.
     Dual: solve for α ∈ R^n via the kernel trick, using only linear equations:
       [ Ω + γ^{-1} I_n   1 ] [ α ]   [ y ]
       [ 1^T              0 ] [ b ] = [ 0 ]
     with Ω_ij = K(x_i, x_j). Classifier: f(x) = sign(Σ_{i=1}^n α_i K(x, x_i) + b).
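
A direct sketch of the dual system above, given a precomputed kernel matrix K; it simply builds and solves the (n+1) × (n+1) linear system.

```python
import numpy as np

def lssvm_fit(K, y, gamma):
    """Solve the LS-SVM dual linear system for alpha and b."""
    n = K.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = K + np.eye(n) / gamma   # Omega + gamma^{-1} I_n
    A[:n, n] = 1.0                      # column of ones
    A[n, :n] = 1.0                      # 1^T row; bottom-right entry stays 0
    sol = np.linalg.solve(A, np.append(y, 0.0))
    return sol[:n], sol[n]              # alpha, b

def lssvm_decision(K_test, alpha, b):
    # f(x) = sign(sum_i alpha_i K(x, x_i) + b)
    return np.sign(K_test @ alpha + b)
```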

  9. Outline
     1 Background
     2 Low rank updated LS-SVM
     3 Sparse linear models
     4 Entropy based spectral clustering
     5 Conclusions

  10. Variable selection problem
     Definition: given D = {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d, let S = {x^1, ..., x^k, ..., x^d} be the full set of variables. Find a subset S* ⊂ S of size m < d minimizing the selection criterion J, i.e. J_{S*} ≤ J_S, where J is e.g. the LOO error.
     Desired elements (a greedy skeleton follows below):
     - J_{S*} easy/cheap to evaluate.
     - Exploit any available structure of the predictor.
     - Reduce computational complexity.
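
A hedged skeleton of greedy forward selection against such a criterion; `score` is a hypothetical user-supplied hook (e.g. a fast LOO estimate), not part of the original slides.

```python
def forward_select(X, y, m, score):
    """Grow S* one variable at a time, always adding the variable
    that minimizes the criterion J on the current candidate set."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(m):
        best = min(remaining, key=lambda k: score(X[:, selected + [k]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```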

  11. Rank-one updates
     A linear kernel matrix can be written in outer product form:
       Ω = [x^1, ..., x^d] [x^1, ..., x^d]^T = Σ_{k=1}^d x^k (x^k)^T,
     where x^k ∈ R^n collects the k-th variable over all samples, so
       H = Ω + γ^{-1} I_n = Σ_{k=1}^d x^k (x^k)^T + γ^{-1} I_n.
     At the level of variable x^k:
       H_k = Σ_{j=1}^{k-1} x^j (x^j)^T + γ^{-1} I_n + x^k (x^k)^T,
     that is,
       H_k = H_{k-1} + x^k (x^k)^T.   (3)
     Key point: compute H_k^{-1} from H_{k-1}^{-1} and obtain α*, b*.
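
One standard way to realize the key point of (3) is the Sherman-Morrison identity, which updates the inverse in O(n^2) instead of a fresh O(n^3) inversion; this is only an inverse-based sketch, the next slide follows the Cholesky route instead.

```python
import numpy as np

def sherman_morrison(H_inv, x):
    """Return (H + x x^T)^{-1} given H^{-1} (symmetric), in O(n^2)."""
    Hx = H_inv @ x
    return H_inv - np.outer(Hx, Hx) / (1.0 + x @ Hx)
```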

  12. Low rank updates
     With the Cholesky factorization L L^T = Ω + γ^{-1} I_n, adding a new variable x^k results in a rank-1 modification of L:
       L̃ L̃^T = L L^T + x^k (x^k)^T.   (4)
     The modified Cholesky factor satisfies
       L̃ L̃^T = L L^T + u u^T = L (I + q q^T) L^T = L L̄ L̄^T L^T,   (5)
     so L̃ = L L̄ can be computed directly from L. The updated model parameters become
       b̃ = (1^T χ̃)(1^T ν̃)^{-1},  α̃ = χ̃ − b̃ ν̃,   (6)
     where L̃ L̃^T χ̃ = y and L̃ L̃^T ν̃ = 1.
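
A sketch of (4)-(6) using a textbook Givens-style rank-one Cholesky update (O(n^2) per variable), followed by recovery of the LS-SVM parameters from the updated factor; this is a generic algorithm under the slide's assumptions, not necessarily the thesis' exact implementation.

```python
import numpy as np

def chol_update(L, x):
    """Rank-one update: return Lt with Lt Lt^T = L L^T + x x^T."""
    L, x = L.copy(), x.copy()
    for k in range(x.shape[0]):
        r = np.sqrt(L[k, k]**2 + x[k]**2)
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        L[k+1:, k] = (L[k+1:, k] + s * x[k+1:]) / c
        x[k+1:] = c * x[k+1:] - s * L[k+1:, k]
    return L

def lssvm_params(L, y):
    """Recover (6): chi = H^{-1} y, nu = H^{-1} 1, then b and alpha."""
    ones = np.ones(y.shape[0])
    chi = np.linalg.solve(L.T, np.linalg.solve(L, y))
    nu = np.linalg.solve(L.T, np.linalg.solve(L, ones))
    b = (ones @ chi) / (ones @ nu)
    return chi - b * nu, b              # alpha, b
```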

  13. Experiments
     Data: gene expression. Leukemia: n = 72, d = 7129. Colon cancer: n = 60, d = 2000.
     Algorithms:
     - SVM-RFE, with and without retraining.
     - Naive LS-SVM with forward selection.
     - LS-SVM with fast LOO and rank-one modifications.
     Validation: computational complexity; 10-fold cross-validation.

  14. Computational time, colon data set
     Goal: select 200 and 500 genes.
     [Figure: CPU time in seconds (log scale) versus number of variables, for forward algorithms (left: baseline LS-SVM (◦), LOO bound, low rank updated LS-SVM (∗)) and backward algorithms (right: SVM-RFE1 (◦), SVM-RFE2, low rank downdated LS-SVM (∗)).]
     The low rank approach improves computational time by two orders of magnitude.

  15. Prediction performance, test error
     [Figure: test error versus number of removed genes on the colon data set (SVM-RFE1 (•), SVM-RFE2, low rank downdated LS-SVM (∗)), and versus number of variables on the leukemia data set (LOO bound, dashed; low rank updated LS-SVM, solid).]

  16. Extension to polynomial kernels
     Explicit feature maps yield low rank matrices. The explicit feature map ϕ_p(·) for a scalar polynomial kernel of degree p is
       ϕ_p(z) = ( 1, sqrt(C(p,1)) z, ..., sqrt(C(p,p−1)) z^{p−1}, z^p )^T,   (7)
     with ϕ_p(·): R → R^{p+1} and C(p,l) the binomial coefficient. Hence the Gram matrix becomes
       Ω_p^d = Σ_{k=1}^d Ω_p^k,  with  Ω_p^k = Σ_{l=0}^p (ϕ_l ∘ x^k)(ϕ_l ∘ x^k)^T,   (8)
     where ϕ_l ∘ x^k applies the l-th feature map component elementwise to variable x^k. In matrix notation
       Ω_p^k = Φ_p^k (Φ_p^k)^T,   (9)
     where Φ_p^k = [ (ϕ_0 ∘ x^k), ..., (ϕ_p ∘ x^k) ] is an n × (p + 1) matrix.
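
A sketch of the explicit map (7) and the per-variable block Φ_p^k of (9); the assertion verifies that Φ Φ^T reproduces the scalar polynomial kernel (z z' + 1)^p, i.e. the τ = 1 case of the kernel on slide 7.

```python
import numpy as np
from math import comb

def phi_p(z, p):
    """Columns phi_l(z) = sqrt(C(p, l)) * z^l, l = 0..p, applied
    elementwise to one variable z in R^n; returns the n x (p+1) Phi_p^k."""
    return np.column_stack([np.sqrt(comb(p, l)) * z**l for l in range(p + 1)])

z = np.random.default_rng(1).standard_normal(6)
Phi = phi_p(z, 3)
assert np.allclose(Phi @ Phi.T, (np.outer(z, z) + 1.0) ** 3)
```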

  17. Polynomial updates
     For such a Gram matrix the following holds:
       rank(Ω_p^k) = rank(Φ_p^k (Φ_p^k)^T) = rank(Φ_p^k),   (10)
     that is, rank(Ω_p^k) ≤ p + 1. Over all inputs k = 1, ..., d, Ω_p^d is a sum of d rank-(p + 1) matrices:
       rank(Ω_p^d) = rank( Σ_{k=1}^d Ω_p^k ) ≤ Σ_{k=1}^d rank(Ω_p^k).   (11)
     Note: for the linear kernel Ω^d = Ω, the outer product definition.
     Rank-(p + 1) updates:
       L̃ L̃^T = L L^T + Φ_p^k (Φ_p^k)^T.   (12)
     Apply (p + 1) rank-1 updates sequentially over the columns of Φ_p^k (see the sketch below).
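
A short sketch of (12), reusing chol_update and phi_p from the sketches above: each column of Φ_p^k contributes one rank-one term.

```python
def add_variable_poly(L, xk, p):
    """Update the factor L (with L L^T = H) for a new variable xk
    under a degree-p polynomial kernel: p + 1 sequential rank-one
    Cholesky updates, one per column of Phi_p^k."""
    Phi = phi_p(xk, p)                  # n x (p+1)
    for l in range(p + 1):
        L = chol_update(L, Phi[:, l])
    return L
```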

  18. Experimental results
     Synthetic data set with n = 100 and d = 500:
       y_i = 10 sinc(x_i^1) + 20 (x_i^2 − 0.5)^2 + 10 x_i^3 + 5 x_i^4 + ε_i,  ε_i ~ N(0, 1).
     [Figure: average LOO PRESS versus number of ranked variables for polynomial degrees 1 to 5 (left); time in seconds required to compute 50 updates versus the degree of the polynomial kernel (right).]
     Linear and quadratic models do not retrieve the true variables.
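
A sketch generating a data set of this form; the slide does not state the input distribution, so uniform inputs on [0, 1] (as in Friedman-style benchmarks) are an assumption, and note that np.sinc is the normalized sinc sin(πx)/(πx).

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 500
X = rng.uniform(0.0, 1.0, size=(n, d))   # input distribution assumed, not stated on the slide
y = (10 * np.sinc(X[:, 0])               # np.sinc(x) = sin(pi x) / (pi x)
     + 20 * (X[:, 1] - 0.5) ** 2
     + 10 * X[:, 2]
     + 5 * X[:, 3]
     + rng.standard_normal(n))           # eps_i ~ N(0, 1); variables 5..500 are pure noise
```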
