High-dimensional omics data analysis using a variable screening protocol with prior knowledge integration (SKI)
Cong Liu
Oct. 3rd, 2016
International Conference on Genome Informatics (GIW)

High-throughput Omics Data

'P >> N' Paradigm
Reduce P: Feature Selection
- To avoid overfitting and improve model performance
- To provide faster and more cost-effective models
- To gain a deeper insight into the underlying biological processes

Previous Feature Selection Methods
- Univariate filter methods
  - Correlation criteria
  - FDR correction: q-value
  - Rank-Sum test
  - Mutual Information
- Multivariate filter methods
  - Sequential Search
    - Forward Search
    - Backward Search
  - Heuristic Algorithms
    - Genetic Algorithm
- Penalized Regression
  - Ridge Regression
  - LASSO
  - Adaptive LASSO
  - SCAD
  - Elastic-Net
- Screen + Regression

Aim: Develop a rank-based feature selection protocol with knowledge integration
- Background & Motivation
- Statistical framework
- Results
- Discussion

Sure Independence Screening ((i)SIS)
Fan and Lv (2008) proposed a two-stage method:
1. Select significant predictors by sorting the corresponding marginal likelihood (correlation in the linear model), thus rapidly reducing the ultra-high dimensionality $p$ to a relatively large scale $d$ (e.g. $o(n)$).
2. Use a lower-dimensional model selection method such as SCAD, lasso, or adaptive lasso to further reduce the model size from $d$ to $d'$.
When too many predictors are involved, the basic sure screening method might miss some important variables because of collinearity. In their paper, Fan and Lv developed an iterative version of SIS (iSIS) that makes fuller use of the joint information of the covariates rather than only the marginal information.
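The screening stage (step 1) can be sketched in a few lines. This is a minimal illustration rather than Fan and Lv's implementation: the helper names `pearson` and `sis_screen`, the toy data, and the screening size `n / log(n)` are all assumptions made for the example, and only the Python standard library is used.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, y, d):
    """Return the indices of the d predictors with the largest |marginal correlation|."""
    scores = [abs(pearson(col, y)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: -scores[j])
    return order[:d]


# Toy data (assumed for illustration): n = 100 samples, p = 200 predictors,
# and only predictors 0 and 1 influence the response.
rng = random.Random(0)
n, p = 100, 200
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * columns[0][i] - 3 * columns[1][i] + rng.gauss(0, 1) for i in range(n)]

d = int(n / math.log(n))  # assumed screening size; any d < n could be used
selected = sis_screen(columns, y, d)
```

Step 2 would then fit a penalized regression (SCAD, lasso, or adaptive lasso) on the `selected` columns only.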

Intuition of Our Method: Screening with Prior Knowledge Integration (SKI)
- Target Sequencing
- Variants Filtering
- Downstream Validation

General Idea of SKI protocol
- $R_{0j}$ is the rank based on external knowledge;
- $R_{1j}$ is the rank based on correlation with the response (residuals);
- The tuning parameter $\alpha$ can be user-defined or determined from the data.
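A minimal sketch of how the two ranks might be combined. The deck later describes the SKI rank as a weighted geometric mean of the two ranks; the exact form used here, with the external-knowledge rank raised to the tuning parameter and the data rank raised to one minus it, is an assumption, as are the function names and the toy ranks.

```python
def ranks_from_scores(scores):
    """Convert scores to ranks: rank 1 = largest score, ties broken by index."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def ski_rank(prior_ranks, data_ranks, alpha):
    """Weighted geometric mean of two ranks (assumed form); smaller is better."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [r0 ** alpha * r1 ** (1.0 - alpha)
            for r0, r1 in zip(prior_ranks, data_ranks)]


def top_k(combined, k):
    """Indices of the k variables with the smallest combined rank."""
    return sorted(range(len(combined)), key=lambda j: combined[j])[:k]


prior_ranks = [1, 2, 5, 4, 3]  # hypothetical external-knowledge ranks
data_ranks = [4, 5, 1, 2, 3]   # hypothetical marginal-correlation ranks

only_data = top_k(ski_rank(prior_ranks, data_ranks, 0.0), 2)   # ignores prior knowledge
only_prior = top_k(ski_rank(prior_ranks, data_ranks, 1.0), 2)  # ignores the data
balanced = top_k(ski_rank(prior_ranks, data_ranks, 0.5), 2)
```

Under this form, a weight of 0 reduces to plain marginal screening and a weight of 1 ranks purely by prior knowledge.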

Estimation of $\alpha$
Cross-validation would require us to further split the sample into training and testing sets, which can make the ultra-high dimensionality issue worse.
We compare the dev.ratio across different values of $\alpha$ and select the $\alpha$ that yields the largest dev.ratio as the final $\alpha$:
$$\mathrm{dev.ratio}(\alpha) = 1 - 2 \times \frac{\mathrm{loglike}_{\mathrm{sat}} - \mathrm{loglike}_{\alpha}}{\mathrm{loglike}_{\mathrm{sat}} - \mathrm{loglike}_{\mathrm{Null}}}$$
The rationale of this method is that the more biologically meaningful a set of variables is, the better it should fit a ridge regression model.
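The selection rule can be sketched as a grid search. This is an illustration under several assumptions, not the authors' code: the ridge penalty `lam` is fixed by hand, the combined rank is taken to be a weighted geometric mean of the two ranks, and for a Gaussian linear model the deviance ratio is scored as 1 - RSS/TSS (the fraction of deviance explained). Multiplying the ratio term by a positive constant would not change which value of the tuning parameter wins.

```python
import math
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ridge_dev_ratio(columns, y, lam):
    """1 - RSS/TSS of a ridge fit on centered data (Gaussian deviance ratio)."""
    n = len(y)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    xc = []
    for col in columns:
        mean = sum(col) / n
        xc.append([v - mean for v in col])
    d = len(xc)
    # Normal equations (X'X + lam * I) b = X'y; lam > 0 keeps the system non-singular.
    gram = [[sum(a * b for a, b in zip(xc[i], xc[j])) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(a * b for a, b in zip(xc[i], yc)) for i in range(d)]
    coef = solve(gram, rhs)
    fitted = [sum(coef[j] * xc[j][i] for j in range(d)) for i in range(n)]
    rss = sum((a - b) ** 2 for a, b in zip(yc, fitted))
    tss = sum(a ** 2 for a in yc)
    return 1.0 - rss / tss


def select_alpha(columns, y, prior_ranks, data_ranks, d, lam, grid):
    """Return the (alpha, dev.ratio) pair with the largest dev.ratio on the grid."""
    best_alpha, best_ratio = None, float("-inf")
    for alpha in grid:
        combined = [r0 ** alpha * r1 ** (1 - alpha)
                    for r0, r1 in zip(prior_ranks, data_ranks)]
        keep = sorted(range(len(combined)), key=lambda j: combined[j])[:d]
        ratio = ridge_dev_ratio([columns[j] for j in keep], y, lam)
        if ratio > best_ratio:
            best_alpha, best_ratio = alpha, ratio
    return best_alpha, best_ratio


# Toy data (assumed): predictors 0-2 are true signals and the prior ranks them first.
rng = random.Random(1)
n, p = 60, 30
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [columns[0][i] + columns[1][i] + columns[2][i] + rng.gauss(0, 1) for i in range(n)]
scores = [abs(pearson(col, y)) for col in columns]
data_ranks = [0] * p
for position, j in enumerate(sorted(range(p), key=lambda j: -scores[j]), start=1):
    data_ranks[j] = position
prior_ranks = list(range(1, p + 1))  # hypothetical external-knowledge ranks
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_alpha, best_ratio = select_alpha(columns, y, prior_ranks, data_ranks,
                                      d=5, lam=1.0, grid=grid)
```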

How to get $R_{0j}$? Examples:
- Domain Knowledge
- Database
- Text Mining
- Other Data Sources

Simulation Study

Experiment Dataset
- $n_x = 200$ samples $(X, Y_x)$ were simulated, with gene number $p = 10{,}000$.
- 200 clusters were simulated independently, and the 50 genes in each cluster were simulated from a multivariate normal distribution with $\mu = 0$, $\sigma^2 = 1$ and an AR(1) correlation structure with $\rho = 0.6$.
- In each cluster, the coefficients $\beta$ of the first ten genes were simulated from a Uniform(0.5, 1) distribution. All other $\beta$'s were set to zero.
- Continuous responses were generated from linear regression models with $\sigma_x^2 = 1$ (or 3).

Knowledge Dataset
- $n_z = 200$ samples $(Z, Y_z)$ were simulated, with gene number $p = 10{,}000$.
- Gene expressions and responses were simulated from the same structure as described for the experiment dataset.
- Non-zero coefficients $\beta$ were simulated to have 0%, 50%, and 100% overlap with the non-zero $\beta$ in the internal settings.
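The experiment-dataset design above could be simulated roughly as follows. This is a sketch under assumptions: the function name `simulate` and its parameters are hypothetical, the AR(1) structure is generated sequentially within each cluster, and the defaults follow the values on the slide. The small call at the end uses reduced sizes so that it runs quickly.

```python
import math
import random


def simulate(n=200, n_clusters=200, cluster_size=50, rho=0.6, n_signal=10,
             noise_var=1.0, seed=0):
    """Simulate (X, y, beta) with independent AR(1) gene clusters."""
    rng = random.Random(seed)
    p = n_clusters * cluster_size
    # First n_signal genes of each cluster get Uniform(0.5, 1) coefficients; others are 0.
    beta = [0.0] * p
    for c in range(n_clusters):
        for k in range(n_signal):
            beta[c * cluster_size + k] = rng.uniform(0.5, 1.0)
    scale = math.sqrt(1.0 - rho ** 2)
    noise_sd = math.sqrt(noise_var)
    X, y = [], []
    for _ in range(n):
        row = []
        for _c in range(n_clusters):
            value = rng.gauss(0.0, 1.0)  # first gene of the cluster: N(0, 1)
            row.append(value)
            for _k in range(cluster_size - 1):
                # AR(1) step: keeps unit variance, correlation rho**lag within a cluster.
                value = rho * value + scale * rng.gauss(0.0, 1.0)
                row.append(value)
        X.append(row)
        y.append(sum(b * v for b, v in zip(beta, row)) + rng.gauss(0.0, noise_sd))
    return X, y, beta


# Reduced-size run for illustration; simulate() with defaults gives n = 200, p = 10,000.
X, y, beta = simulate(n=20, n_clusters=4, cluster_size=5, n_signal=2, seed=1)
```

A knowledge dataset with a chosen overlap could be produced by calling the same function with a different seed and moving part of the non-zero coefficients to other positions.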

The number of true positives among different methods.

                                                          Top 1% (4)                    Top 5% (4)                    Top 10% (4)
% (5)   $\sigma_x^2$ (6)  $\sigma_z^2$ (7)  $\alpha$ (8)  SIS (1)   SKI (2)   Pool (3)  SIS       SKI       Pool      SIS       SKI       Pool
0.0     1                 1                 0.075         38.96     38.94     36.36     45.78     45.72     43.63     47.66     47.63     45.63
0.5     1                 1                 0.275         38.53     43.06     45.22     45.66     47.65     48.54     47.53     48.85     49.13
1.0     1                 1                 0.384         38.5      46.34     47.99     45.65     48.9      49.58     47.49     49.51     49.83
0.0     1                 3                 0.090         39.10     38.97     35.01     45.81     45.80     42.94     47.71     47.72     44.03
0.5     1                 3                 0.249         38.92     42.55     43.85     45.80     47.31     48.28     47.57     48.55     49.10
1.0     1                 3                 0.368         39.04     45.81     47.58     45.88     48.60     49.44     47.65     49.21     49.73
0.0     3                 1                 0.113         36.84     36.43     35.77     44.61     44.01     43.37     46.69     46.57     46.19
0.5     3                 1                 0.261         37.27     42.16     44.90     45.15     47.36     48.34     47.07     48.56     49.03
1.0     3                 1                 0.374         36.91     46.01     48.89     44.76     49.42     49.51     47.12     49.86     49.90
0.0     3                 3                 0.104         37.84     37.48     35.19     45.73     45.43     44.07     47.63     47.53     45.93
0.5     3                 3                 0.264         37.26     42.52     44.48     45.03     47.35     48.26     47.19     48.58     49.00
1.0     3                 3                 0.355         37.05     45.20     47.37     45.1      48.6      49.39     47.05     49.36     49.76

(1) SIS: variables were sorted by marginal correlation using only the internal dataset.
(2) SKI: variables were sorted by the weighted geometric mean of two marginal-correlation-based ranks using the two datasets.
(3) Pool: the two datasets were pooled together and treated as a single dataset, and then variables were sorted by marginal correlation.
(4) The top 1%, 5% and 10% of variables were selected, respectively, under different settings.
(5) The percentage of non-zero $\beta$'s that overlap between the two datasets.
(6) $\sigma_x^2$: the variance added in the internal dataset to generate the response $Y_x$.
(7) $\sigma_z^2$: the variance added in the external dataset to generate the response $Y_z$.
(8) $\alpha$: the estimated value of $\alpha$, which controls the weight of the two ranks in the geometric mean.

The number of true positives using iterative and non-iterative approaches when the top 1% of variables were selected.

%       $\rho$    $\alpha$  SIS       SKI       iSIS      iSKI
0       0.3       0.061     23.32     23.12     25.22     22.53
0.5     0.3       0.342     24.83     33.20     26.13     34.43
1       0.3       0.443     23.14     34.41     26.33     38.85
0       0.6       0.044     37.35     36.34     41.11     36.17
0.5     0.6       0.392     36.47     41.67     39.67     44.83
1       0.6       0.453     37.12     45.83     40.44     49.40

Notes:
- SIS: variables were sorted by marginal correlation using only the internal dataset.
- SKI: variables were sorted by the weighted geometric mean of two marginal-correlation-based ranks using the two datasets.
- Pool: the two datasets were pooled together and treated as a single dataset, and then variables were sorted by marginal correlation.
- The top 1%, 5% and 10% of variables were selected, respectively, under different settings.
- %: the percentage of non-zero $\beta$'s that overlap between the two datasets.
- $\sigma_x^2$: the variance added in the internal dataset to generate the response $Y_x$.
- $\sigma_z^2$: the variance added in the external dataset to generate the response $Y_z$.
- $\alpha$: the estimated value of $\alpha$, which controls the weight of the two ranks in the geometric mean.

Real Application: Drug Response Analysis
Selumetinib (AZD6244) is a drug used to treat various types of cancer, such as non-small cell lung cancer (NSCLC).
We applied the SKI procedure to identify potential biomarkers of response to selumetinib using the CCLE dataset. The CCLE dataset includes drug response data (i.e. Active Area) together with baseline omics measurements, which include gene expression, mutation data, and copy numbers. In total, 489 cell lines and 41,872 genomic features were measured.
We then searched the Drug2Gene database [25] to acquire prior knowledge of associations between selumetinib and genes. In total, 383 genes were identified as having associations with selumetinib.

The 18 variables, among the top 100 selected by the SKI procedure, whose association with selumetinib could be found in the database.

Gene Symbol  Probe ID    Type  $R_{SIS}$  $R_{SKI}$
BRAF         NA          Mut   4          1
ADCK3        56997_at    Exp   172        5
TESK1        7016_at     Exp   194        6
DCLK2        166614_at   Exp   196        8
TNIK         23043_at    Exp   206        9
NUAK2        81788_at    Exp   209        10
ERBB3        2065_at     Exp   328        14
PRKCD        5580_at     Exp   338        15
MYLK         4638_at     Exp   479        20
MAP3K1       4214_at     Exp   502        21
ULK3         25989_at    Exp   519        23
FGFR1        2260_at     Exp   556        25
SNRK         54861_at    Exp   582        26
RPS6KA3      6197_at     Exp   623        29
STK10        6793_at     Exp   691        31
MAPK9        5601_at     Exp   756        34
TAOK3        51347_at    Exp   761        35
PIK3CB       5291_at     Exp   764        36

[Figure: Boxplot of squared error for selumetinib response prediction]

Discussion
- The proposed approach is general and is not limited to any specific type of prior knowledge, as long as the variables can be ranked by some external criterion.
- Bergersen et al. proposed a weighted LASSO (wLASSO) procedure with data integration, which shares a similar idea with our approach.
- As a screening-based method, SKI can also be extended readily to more general settings (generalized linear models, additive models, Cox models, and model-free screening).
- Li et al. proposed a variant method, robust rank correlation screening (RRCS), which is based on the Kendall τ correlation coefficient between the response and the predictor variables rather than the Pearson correlation used by SIS.
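To illustrate the rank-correlation variant mentioned last, the sketch below screens by Kendall τ instead of Pearson correlation. It is a simplified illustration rather than Li et al.'s RRCS procedure: it uses the tau-a form (no tie correction) with a quadratic-time pair count, and the function names are assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def rank_screen(columns, y, d):
    """Return the indices of the d predictors with the largest |Kendall tau| with y."""
    scores = [abs(kendall_tau(col, y)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: -scores[j])[:d]


# A monotone but non-linear relationship still gets a perfect rank correlation.
tau_monotone = kendall_tau([1, 2, 3, 4], [1, 4, 9, 16])

columns = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]  # toy predictors
selected = rank_screen(columns, [10, 20, 30, 40], d=2)
```

Because τ depends only on the ordering of the values, this kind of screen is unaffected by monotone transformations of the response and is less sensitive to outliers than a Pearson-based screen.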
