

SLIDE 1

Microarray Data Integration and Machine Learning Techniques For Lung Cancer Survival Prediction

Daniel Berrar, Brian Sturgeon, Ian Bradbury, C. Stephen Downes, Werner Dubitzky

November 14, 2003

SLIDE 2

Outline

  • Summary of Results (1 slide)
  • Overview of Tasks (1 slide)
  • Data Integration (4 slides)
  • Methods (6 slides)
  • Results and Biological Interpretation (6 slides)
  • Conclusions (1 slide)
SLIDE 3

Summary of Results

  • With respect to tasks:

– Classification task: Prediction of 5-year survival is most accurate when we build a model using only patient data (age, tumor stage, …);
– Regression task: Prediction of survival in months is more accurate for the model relying on expression data than on patient data, and best when the model relies on both patient and expression data;

  • With respect to methods:

– “Best” model: Decision tree

SLIDE 4

Tasks

  • Task #1: Data integration

– Integration of the Harvard and Michigan lung cancer microarray data sets, and data pre-processing;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 5

Task #1: Data Integration [1/4]

SLIDE 6

Task #1: Data Integration [2/4]

[Diagram: integrated data set of 211 patients, combining patient data, expression data (3,588 genes), and the target variables]

SLIDE 7

Task #1: Data Integration [3/4]

  • Data pre-processing for classification task:

– Group patients into 2 classes:

  • LOW RISK: Survival ≥ 5 years
  • HIGH RISK: Survival < 5 years

– Discard patients that are censored before 60 months
– Remaining number of patients: 136

  • Data pre-processing for regression task:

– Include all 211 patients.

  • Data pre-processing for both tasks:

– Generate learning set and test set by randomly splitting the entire data set (~70% : ~30%).
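The pre-processing rules above can be sketched as follows; the record fields (`survival_months`, `censored`) and the function names are illustrative assumptions, not the authors' actual code:

```python
import random

def preprocess_classification(patients):
    """Group patients into LOW/HIGH risk; drop cases censored before 60 months."""
    kept = []
    for p in patients:
        if p["survival_months"] >= 60:
            kept.append({**p, "risk": "LOW"})    # survival >= 5 years
        elif not p["censored"]:
            kept.append({**p, "risk": "HIGH"})   # death before 5 years
        # censored before 60 months: discarded
    return kept

def split_learning_test(cases, frac=0.7, seed=0):
    """Random ~70% : ~30% split into learning set and test set."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]
```

For the regression task, `preprocess_classification` is skipped and all 211 patients are passed to the split.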

SLIDE 8

Task #1: Data Integration [4/4]

[Diagram] Task #2: Classification: the 136 patients are split into a learning set (96 patients) and a test set (40 patients); one model is built per data source.

Task #3: Regression: the 211 patients are split into a learning set (148 patients) and a test set (63 patients); one CART model is built per data source.

Data sources: patient data, expression data, patient + expression data.

SLIDE 9

Tasks

  • Task #1: Data pre-processing

– Integration of the Harvard and Michigan lung cancer microarray data sets;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 10

Methods – Overview

  • Methods used to address Classification-Task

(1) k-nearest neighbour (k-NN)
(2) Decision Tree C5.0
(3) Boosted Decision Trees
(4) Support Vector Machines (SVMs)
(5) Artificial Neural Networks (Multilayer Perceptrons, MLPs)
(6) Probabilistic Neural Networks (PNNs)

  • Methods used to address Regression-Task

(1) Classification and Regression Tree (CART)

SLIDE 11

Methods – Comparison of Principles

  • Consider the following 2-class problem
SLIDE 12

Methods – Decision Tree

  • Recursively split the data set into decision regions and generate a rule set
  • Classify the test case using the rule set

[Tree diagram: root node; if y ≤ split #1, assign class •; if y > split #1, split again]

Boosted decision trees:

  • Aggregate decision trees into a committee by weighted voting and resampling of the data set.
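A minimal sketch of the recursive-splitting idea on a single feature (illustrative only; the study used C5.0 and boosted trees, not this toy splitter):

```python
def impurity(cases):
    """Number of cases not in the local majority class."""
    if not cases:
        return 0
    counts = {}
    for _, lab in cases:
        counts[lab] = counts.get(lab, 0) + 1
    return len(cases) - max(counts.values())

def grow(cases, depth=0, max_depth=3, min_size=2):
    """cases: list of (y_value, label). Returns a nested rule tree."""
    labels = {lab for _, lab in cases}
    if len(labels) == 1 or depth == max_depth or len(cases) <= min_size:
        # Leaf: predict the majority class of this decision region
        majority = max(labels, key=lambda l: sum(1 for _, lab in cases if lab == l))
        return ("leaf", majority)
    # Choose the threshold split that minimizes misclassification
    best = None
    ys = sorted({y for y, _ in cases})
    for split in ys[:-1]:
        left = [c for c in cases if c[0] <= split]
        right = [c for c in cases if c[0] > split]
        err = impurity(left) + impurity(right)
        if best is None or err < best[0]:
            best = (err, split, left, right)
    _, split, left, right = best
    return ("node", split, grow(left, depth + 1), grow(right, depth + 1))

def classify(tree, y):
    """Apply the rule set to a test case."""
    if tree[0] == "leaf":
        return tree[1]
    _, split, left, right = tree
    return classify(left, y) if y <= split else classify(right, y)
```

Boosting then repeats this on resampled (reweighted) versions of the learning set and combines the resulting trees by weighted voting.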

SLIDE 13

Methods – Support Vector Machine

  • Find the optimal separating hyperplane by maximizing the margin between the 2 classes
  • Classify the test case using the hyperplane

SLIDE 14

Methods – Strengths and Weaknesses

*Lee Y., Lee C.K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19(9), pp. 1132-1139, (2003).

Most (if not all) models ultimately rely on a definition of distance between objects, and this definition is not trivial in high-dimensional space. Should the distance metric itself be treated as a tuning parameter? One option is the fractional distance metric [Aggarwal et al., ICDT, 2001].

SLIDE 15

Results of Task #2: Classification

SLIDE 16

Tasks

  • Task #1: Data pre-processing

– Integration of the Harvard and Michigan lung cancer microarray data sets;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 17

Methods – Classification and Regression Tree

  • Algorithm is similar to the decision tree C5.0
  • Heuristic is based on recursive partitioning of data set
  • Differences:
SLIDE 18

Results of Task #3: Regression [1/3]

  • Evaluation criteria:

– How many death events are correctly identified as death events, and how many are not? (accuracy)
– For the correctly identified death events, what is the deviance of the residuals between the real and the predicted survival time?
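A hedged sketch of these two criteria in Python; the event-record layout and the use of squared residuals as the deviance are illustrative assumptions (the slide does not give the exact formula):

```python
def evaluate(events):
    """events: list of (is_death, predicted_death, real_months, pred_months).

    Returns (accuracy of death-event identification,
             squared-residual deviance over correctly identified deaths).
    """
    deaths = [e for e in events if e[0]]
    hits = [e for e in deaths if e[1]]
    accuracy = len(hits) / len(deaths) if deaths else 0.0
    deviance = sum((real - pred) ** 2 for _, _, real, pred in hits)
    return accuracy, deviance
```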

SLIDE 19

Results of Task #3: Regression [2/3]

SLIDE 20

Results of Task #3: Regression [3/3]

[Results table: 10.9]

SLIDE 21

Tasks

  • Task #1: Data pre-processing

– Integration of the Harvard and Michigan lung cancer microarray data sets;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 22

Task #4: Biological Interpretation [1/2]

  • How to interpret the results? Using the literature, OMIM, PubMed, …
  • # of features relevant for the classification task: 8, e.g. ZNF174 (zinc finger protein)
  • Proteins of this family probably have an impact on repression of growth factor gene expression [OMIM, 603900]
  • Example: the Wilms tumour suppressor WT1 encodes a zinc finger protein that downregulates the expression of various growth factor genes [OMIM, 603900]
  • Decision tree: overexpression of ZNF174 is associated with LOW RISK, underexpression with HIGH RISK
  • ZNF174: important marker in Burkitt’s lymphoma cells [Li et al., PNAS, May 2003]

SLIDE 23

Task #4: Biological Interpretation [2/2]

  • # of features relevant for the regression task: 5, e.g. NifU
  • Function not fully understood yet
  • Likely to be involved in the mobilization of iron and sulfur for nitrogenase-specific iron-sulfur cluster formation
  • Important for breast cancer classification [Hedenfalk et al., N Engl J Med, Feb. 2001]
  • Decision tree: overexpression of NifU is associated with good clinical outcome for patients with early tumour stage.

SLIDE 24

Conclusions

  • Integrating clinical and transcriptional data might improve survival outcome prediction;
  • “Best” model in this study: decision tree, but…
  • There is no universal method of choice;
  • No Free Lunch Theorem: “No classifier is inherently superior to any other. The type of the problem determines which classifier is most appropriate.”
  • George Box: “Statisticians, like artists, have the bad habit of falling in love with their models.”

SLIDE 25

Acknowledgements

  • Brian Sturgeon
  • Ian Bradbury
  • C. Stephen Downes
  • Werner Dubitzky

Supplementary information will be available at http://research.bioinformatics.ulster.ac.uk/~dberrar/camda03.html.

SLIDE 26

Methods – Comparison of Principles

  • Consider the following 2-class problem of cases (x, y)
  • t = {0.1, 0.2, …, 10}
  • Class A: x = t·cos(t), y = t·sin(t)
  • Class B: x = t·sin(t), y = t·cos(t)
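The two classes above can be generated directly:

```python
import math

# t = 0.1, 0.2, ..., 10.0 (100 values)
ts = [0.1 * i for i in range(1, 101)]

# Class A: (x, y) = (t*cos t, t*sin t); Class B: the coordinates swapped
class_a = [(t * math.cos(t), t * math.sin(t)) for t in ts]
class_b = [(t * math.sin(t), t * math.cos(t)) for t in ts]
```

Because class B is class A mirrored about the line y = x, the two point sets form interleaved spirals that no linear boundary can separate.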
SLIDE 27

Methods – k-Nearest Neighbour

  • Retrieve the nearest neighbours of the test case
  • Classify the test case based on the class membership of the nearest neighbours

SLIDE 28

Methods – k-Nearest Neighbour

Learning:

  • For each case in the learning set, determine all neighbours and rank them with respect to similarity = 1 − distance
  • Determine the globally optimal number of nearest neighbours kopt (e.g., in LOOCV)

Test:

  • Use kopt for classifying the test cases
  • Interpret normalized similarities as a measure of confidence

Suppose that kopt = 3 and the following nearest neighbours:

  Case #   Similarity (= 1 − distance)   Normed     Class
  27       0.0921                        0.35795    •
  29       0.0833                        0.32375    (other)
  34       0.0819                        0.31831    •

  • Confidence for class •: 0.35795 + 0.31831 = 0.67626
  • Confidence for the other class: 0.32375
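The confidence computation on this slide can be reproduced directly; the labels "A" and "B" are placeholders for the two classes (marked with dots on the slide):

```python
def knn_confidences(neighbours):
    """neighbours: list of (case_id, similarity, class_label).

    Normalize the similarities of the k_opt nearest neighbours
    and sum them per class to get a confidence score.
    """
    total = sum(sim for _, sim, _ in neighbours)
    conf = {}
    for _, sim, label in neighbours:
        conf[label] = conf.get(label, 0.0) + sim / total
    return conf

# The slide's example with k_opt = 3:
neighbours = [(27, 0.0921, "A"), (29, 0.0833, "B"), (34, 0.0819, "A")]
conf = knn_confidences(neighbours)
# conf["A"] is approximately 0.67625, conf["B"] approximately 0.32375
```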

SLIDE 29

Methods – Support Vector Machine

  • Goal: find the optimal decision boundary between 2 classes
  • SVM heuristic for separable problems (non-overlapping classes):

– Construct the optimal separating hyperplane by maximizing the margin

SLIDE 30

Methods – Support Vector Machine

[Figures: linearly separable case; not linearly separable case; minimization calculus]

SLIDE 31

Methods – Support Vector Machine

  • Use projection to a higher-dimensional space via a kernel function:

(x, y) → (x, y, xy)
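A minimal illustration of why this mapping helps: the XOR-like points below are not linearly separable in 2-D, but after (x, y) → (x, y, xy) the third coordinate separates the classes with the plane z = 0. The points are made up for illustration.

```python
def phi(x, y):
    """The feature map on this slide: (x, y) -> (x, y, xy)."""
    return (x, y, x * y)

points_a = [(1, 1), (-1, -1)]   # xy = +1 for both
points_b = [(1, -1), (-1, 1)]   # xy = -1 for both

# No line separates points_a from points_b in 2-D, but in 3-D the
# hyperplane z = 0 does:
assert all(phi(x, y)[2] > 0 for x, y in points_a)
assert all(phi(x, y)[2] < 0 for x, y in points_b)
```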

SLIDE 32

Methods – Linear SVM

[Figures: the two classes separated by a linear SVM, and by a linear SVM in the higher-dimensional space]

SLIDE 33

Methods – SVM

[Diagram: three hyperplanes SVM1, SVM2, SVM3 for a multi-class problem]

Fruits on this side of the hyperplane, given by SVM1, cannot be bananas.

SLIDE 34

Methods – Multilayer Perceptron

  • Construct a non-linear decision boundary
  • Classify the test case using the decision boundary

SLIDE 35

Methods – Probabilistic Neural Network

  • Parallel implementation of Bayes-Parzen classifier
  • Bayes decision criterion

– pk: prior probability that a case belongs to class k
– ck: costs associated with a case of this class being misclassified
– fk: estimated density of class k

An unknown case z is classified as member of class i if for all j ≠ i :

pi ci fi (z) > pj cj fj (z)

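The decision rule above can be sketched with Parzen (Gaussian-kernel) density estimates standing in for fk; the data, priors, costs, and bandwidth sigma below are made-up illustrations, not values from the study.

```python
import math

def parzen_density(z, samples, sigma=1.0):
    """1-D Gaussian-kernel (Parzen) density estimate of f_k at z."""
    n = len(samples)
    return sum(
        math.exp(-((z - x) ** 2) / (2 * sigma ** 2)) for x in samples
    ) / (n * sigma * math.sqrt(2 * math.pi))

def pnn_classify(z, classes, priors, costs):
    """Assign z to the class i maximizing p_i * c_i * f_i(z)."""
    scores = {
        k: priors[k] * costs[k] * parzen_density(z, samples)
        for k, samples in classes.items()
    }
    return max(scores, key=scores.get)

# Toy example: two well-separated 1-D classes, equal priors and costs
classes = {"A": [0.0, 0.5, 1.0], "B": [4.0, 4.5, 5.0]}
priors = {"A": 0.5, "B": 0.5}
costs = {"A": 1.0, "B": 1.0}
```

With unequal costs ck, the rule shifts the boundary toward the cheaper class, which is how a PNN accounts for asymmetric misclassification costs.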

SLIDE 36

Methods – Probabilistic Neural Network

  • Parallel implementation of the Bayes-Parzen classifier
  • Takes into account class densities and class priors
  • Estimates class posteriors for test cases
  • Classifies new cases, e.g. using argmax(pi)

SLIDE 37

Curse of Dimensionality

  • Aka the large-p, small-n problem: many variables, few observations
  • Frequent in the life sciences (e.g., microarray data analysis)
  • Most machine learning methods have been developed for scenarios that are characterized by many observations and few variables
  • Problem: how to define similarity?
  • OK, similarity = 1 − distance, but how to define distance?
SLIDE 38

Curse of Dimensionality

  • L1 norm: Manhattan distance
  • L2 norm: Euclidean distance
  • The higher the dimension of the data, the less meaningful the Lk norm: the contrast between nearest and farthest neighbour shrinks [Aggarwal et al., 2001, “On the surprising behavior of distance metrics in high dimensional space”]
  • Fractional distance (Lk with k < 1) preserves more contrast
  • Implemented for PNN and k-NN in the present study
  • Additional tuning parameter: fract
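A minimal sketch of the Lk distance with a fractional exponent; the slide's fract parameter plays the role of k here:

```python
def minkowski(u, v, k):
    """L_k distance between vectors u and v; k < 1 gives a fractional metric."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)

# Euclidean distance is the special case k = 2:
d2 = minkowski((0, 0), (3, 4), 2)        # 5.0
# The same machinery with the fractional exponent k = 0.5:
dhalf = minkowski((0, 0), (1, 1), 0.5)   # (1 + 1)^2 = 4.0
```

Treating k (fract) as a tuning parameter lets the k-NN and PNN models pick the exponent that gives the best contrast on the high-dimensional expression data.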
SLIDE 39

Task #3: CART – Background

  • Method: Classification and Regression Tree (CART)
  • Heuristic: Recursive partitioning of data set
  • Example:
SLIDE 40

Task #3: CART – Background

  • Problem: Censored observations
SLIDE 41

Results of Task #3: Regression

SLIDE 42

Results of Task #3: Regression [1/3]

[Plots: Node #4 (Learning) and Node #4 (Test)]