

SLIDE 1

Microarray Data Integration and Machine Learning Techniques For Lung Cancer Survival Prediction

Daniel Berrar, Brian Sturgeon, Ian Bradbury, C. Stephen Downes, Werner Dubitzky

November 14, 2003

SLIDE 2

Outline

  • Summary of Results (1 slide)
  • Overview of Tasks (1 slide)
  • Data Integration (4 slides)
  • Methods (6 slides)
  • Results and Biological Interpretation (6 slides)
  • Conclusions (1 slide)
SLIDE 3

Summary of Results

  • With respect to tasks:

– Classification task: Prediction of 5-year survival is most accurate when we build a model using only patient data (age, tumor stage, …);
– Regression task: Prediction of survival in months is more accurate for the model relying on expression data than on patient data, and best when the model relies on both patient and expression data;

  • With respect to methods:

– “Best” model: Decision tree

SLIDE 4

Tasks

  • Task #1: Data integration

– Integration of the Harvard and Michigan lung cancer microarray data sets, and data pre-processing;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 5

Task #1: Data Integration [1/4]

SLIDE 6

Task #1: Data Integration [2/4]

[Diagram: integrated data set of 211 patients, combining patient data, expression data (3,588 genes), and the target variables]

SLIDE 7

Task #1: Data Integration [3/4]

  • Data pre-processing for classification task:

– Group patients into 2 classes:

  • LOW RISK: Survival ≥ 5 years
  • HIGH RISK: Survival < 5 years

– Discard patients that are censored before 60 months
– Remaining number of patients: 136

  • Data pre-processing for regression task:

– Include all 211 patients.

  • Data pre-processing for both tasks:

– Generate learning set and test set by randomly splitting the entire data set (~70% : ~30%).
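The pre-processing rules above can be sketched as follows; the record fields (`survival_months`, `censored`) and the function names are illustrative assumptions, not the authors' actual code:

```python
import random

def preprocess_classification(patients):
    """Group patients into LOW/HIGH risk; drop cases censored before 60 months."""
    kept = []
    for p in patients:
        if p["survival_months"] >= 60:
            kept.append({**p, "risk": "LOW"})    # survival >= 5 years
        elif not p["censored"]:
            kept.append({**p, "risk": "HIGH"})   # death before 5 years
        # censored before 60 months: discarded
    return kept

def split_learning_test(cases, frac=0.7, seed=0):
    """Random ~70% : ~30% split into learning set and test set."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]
```

For the regression task, `preprocess_classification` is skipped and all 211 patients are passed to the split.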

SLIDE 8

Task #1: Data Integration [4/4]

[Diagram] Task #2: Classification: the 136 patients are split into a learning set (96 patients) and a test set (40 patients); one model is built per data source.

Task #3: Regression: the 211 patients are split into a learning set (148 patients) and a test set (63 patients); one CART model is built per data source.

Data sources: patient data, expression data, patient + expression data.

SLIDE 9

Tasks

  • Task #1: Data pre-processing

– Integration of the Harvard and Michigan lung cancer microarray data sets;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 10

Methods – Overview

  • Methods used to address Classification-Task

(1) k-nearest neighbour (k-NN)
(2) Decision Tree C5.0
(3) Boosted Decision Trees
(4) Support Vector Machines (SVMs)
(5) Artificial Neural Networks (Multilayer Perceptrons, MLPs)
(6) Probabilistic Neural Networks (PNNs)

  • Methods used to address Regression-Task

(1) Classification and Regression Tree (CART)

SLIDE 11

Methods – Comparison of Principles

  • Consider the following 2-class problem
SLIDE 12

Methods – Decision Tree

  • Recursively split the data set into decision regions and generate a rule set
  • Classify the test case using the rule set

[Tree diagram: root node; if y ≤ split #1, assign class •; if y > split #1, split again]

Boosted decision trees:

  • Aggregate decision trees into a committee by weighted voting and resampling of the data set.
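A minimal sketch of the recursive-splitting idea on a single feature (illustrative only; the study used C5.0 and boosted trees, not this toy splitter):

```python
def impurity(cases):
    """Number of cases not in the local majority class."""
    if not cases:
        return 0
    counts = {}
    for _, lab in cases:
        counts[lab] = counts.get(lab, 0) + 1
    return len(cases) - max(counts.values())

def grow(cases, depth=0, max_depth=3, min_size=2):
    """cases: list of (y_value, label). Returns a nested rule tree."""
    labels = {lab for _, lab in cases}
    if len(labels) == 1 or depth == max_depth or len(cases) <= min_size:
        # Leaf: predict the majority class of this decision region
        majority = max(labels, key=lambda l: sum(1 for _, lab in cases if lab == l))
        return ("leaf", majority)
    # Choose the threshold split that minimizes misclassification
    best = None
    ys = sorted({y for y, _ in cases})
    for split in ys[:-1]:
        left = [c for c in cases if c[0] <= split]
        right = [c for c in cases if c[0] > split]
        err = impurity(left) + impurity(right)
        if best is None or err < best[0]:
            best = (err, split, left, right)
    _, split, left, right = best
    return ("node", split, grow(left, depth + 1), grow(right, depth + 1))

def classify(tree, y):
    """Apply the rule set to a test case."""
    if tree[0] == "leaf":
        return tree[1]
    _, split, left, right = tree
    return classify(left, y) if y <= split else classify(right, y)
```

Boosting then repeats this on resampled (reweighted) versions of the learning set and combines the resulting trees by weighted voting.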

SLIDE 13

Methods – Support Vector Machine

  • Find the optimal separating hyperplane by maximizing the margin between the 2 classes
  • Classify the test case using the hyperplane

SLIDE 14

Methods – Strengths and Weaknesses

*Lee Y., Lee C.K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19(9), pp. 1132-1139, (2003).

Most (if not all) models ultimately rely on a definition of distance between objects, and this definition is not trivial in high-dimensional space. Should the distance metric itself be treated as a tuning parameter? One option is the fractional distance metric [Aggarwal et al., ICDT, 2001].

SLIDE 15

Results of Task #2: Classification

SLIDE 16

Tasks

  • Task #1: Data pre-processing

– Integration of the Harvard and Michigan lung cancer microarray data sets;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 17

Methods – Classification and Regression Tree

  • Algorithm is similar to the decision tree C5.0
  • Heuristic is based on recursive partitioning of data set
  • Differences:
SLIDE 18

Results of Task #3: Regression [1/3]

  • Evaluation criteria:

– How many death events are correctly identified as death events, and how many are not? (accuracy)
– For the correctly identified death events, what is the deviance of the residuals between the real and the predicted survival time?
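A hedged sketch of these two criteria in Python; the event-record layout and the use of squared residuals as the deviance are illustrative assumptions (the slide does not give the exact formula):

```python
def evaluate(events):
    """events: list of (is_death, predicted_death, real_months, pred_months).

    Returns (accuracy of death-event identification,
             squared-residual deviance over correctly identified deaths).
    """
    deaths = [e for e in events if e[0]]
    hits = [e for e in deaths if e[1]]
    accuracy = len(hits) / len(deaths) if deaths else 0.0
    deviance = sum((real - pred) ** 2 for _, _, real, pred in hits)
    return accuracy, deviance
```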

SLIDE 19

Results of Task #3: Regression [2/3]

SLIDE 20

Results of Task #3: Regression [3/3]

[Results table: 10.9]

SLIDE 21

Tasks

  • Task #1: Data pre-processing

– Integration of the Harvard and Michigan lung cancer microarray data sets;

  • Task #2: Classification

– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;

  • Task #3: Regression

– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;

  • Task #4: Interpretation

– Biological interpretation of identified genes.

SLIDE 22

Task #4: Biological Interpretation [1/2]

  • How to interpret the results? Using the literature, OMIM, PubMed, …
  • # of features relevant for the classification task: 8, e.g. ZNF174 (zinc finger protein)
  • Proteins of this family probably have an impact on repression of growth factor gene expression [OMIM, 603900]
  • Example: the Wilms tumour suppressor WT1 encodes a zinc finger protein that downregulates the expression of various growth factor genes [OMIM, 603900]
  • Decision tree: overexpression of ZNF174 is associated with LOW RISK, underexpression with HIGH RISK
  • ZNF174: important marker in Burkitt’s lymphoma cells [Li et al., PNAS, May 2003]

SLIDE 23

Task #4: Biological Interpretation [2/2]

  • # of features relevant for the regression task: 5, e.g. NifU
  • Function not fully understood yet
  • Likely to be involved in the mobilization of iron and sulfur for nitrogenase-specific iron-sulfur cluster formation
  • Important for breast cancer classification [Hedenfalk et al., N Engl J Med, Feb. 2001]
  • Decision tree: overexpression of NifU is associated with good clinical outcome for patients with early tumour stage.

SLIDE 24

Conclusions

  • Integrating clinical and transcriptional data might improve survival outcome prediction;
  • “Best” model in this study: decision tree, but…
  • There is no universal method of choice;
  • No Free Lunch Theorem: “No classifier is inherently superior to any other. The type of the problem determines which classifier is most appropriate.”
  • George Box: “Statisticians, like artists, have the bad habit of falling in love with their models.”

SLIDE 25

Acknowledgements

  • Brian Sturgeon
  • Ian Bradbury
  • C. Stephen Downes
  • Werner Dubitzky

Supplementary information will be available at http://research.bioinformatics.ulster.ac.uk/~dberrar/camda03.html.

SLIDE 26

Methods – Comparison of Principles

  • Consider the following 2-class problem of cases (x, y)
  • t = {0.1, 0.2, …, 10}
  • Class A: x = t·cos(t), y = t·sin(t)
  • Class B: x = t·sin(t), y = t·cos(t)
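The two classes above can be generated directly:

```python
import math

# t = 0.1, 0.2, ..., 10.0 (100 values)
ts = [0.1 * i for i in range(1, 101)]

# Class A: (x, y) = (t*cos t, t*sin t); Class B: the coordinates swapped
class_a = [(t * math.cos(t), t * math.sin(t)) for t in ts]
class_b = [(t * math.sin(t), t * math.cos(t)) for t in ts]
```

Because class B is class A mirrored about the line y = x, the two point sets form interleaved spirals that no linear boundary can separate.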
SLIDE 27

Methods – k-Nearest Neighbour

  • Retrieve the nearest neighbours of the test case
  • Classify the test case based on the class membership of the nearest neighbours

SLIDE 28

Methods – k-Nearest Neighbour

Learning:

  • For each case in the learning set, determine all neighbours and rank them with respect to similarity = 1 − distance
  • Determine the globally optimal number of nearest neighbours kopt (e.g., in LOOCV)

Test:

  • Use kopt for classifying the test cases
  • Interpret normalized similarities as a measure of confidence

Suppose that kopt = 3 and the following nearest neighbours:

  Case #   Similarity (= 1 − distance)   Normed     Class
  27       0.0921                        0.35795    •
  29       0.0833                        0.32375    (other)
  34       0.0819                        0.31831    •

  • Confidence for class •: 0.35795 + 0.31831 = 0.67626
  • Confidence for the other class: 0.32375
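The confidence computation on this slide can be reproduced directly; the labels "A" and "B" are placeholders for the two classes (marked with dots on the slide):

```python
def knn_confidences(neighbours):
    """neighbours: list of (case_id, similarity, class_label).

    Normalize the similarities of the k_opt nearest neighbours
    and sum them per class to get a confidence score.
    """
    total = sum(sim for _, sim, _ in neighbours)
    conf = {}
    for _, sim, label in neighbours:
        conf[label] = conf.get(label, 0.0) + sim / total
    return conf

# The slide's example with k_opt = 3:
neighbours = [(27, 0.0921, "A"), (29, 0.0833, "B"), (34, 0.0819, "A")]
conf = knn_confidences(neighbours)
# conf["A"] is approximately 0.67625, conf["B"] approximately 0.32375
```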

SLIDE 29

Methods – Support Vector Machine

  • Goal: find the optimal decision boundary between 2 classes
  • SVM heuristic for separable problems (non-overlapping classes):

– Construct the optimal separating hyperplane by maximizing the margin

SLIDE 30

Methods – Support Vector Machine

[Figures: linearly separable case; not linearly separable case; minimization calculus]

SLIDE 31

Methods – Support Vector Machine

  • Use projection to a higher-dimensional space via a kernel function:

(x, y) → (x, y, xy)
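A minimal illustration of why this mapping helps: the XOR-like points below are not linearly separable in 2-D, but after (x, y) → (x, y, xy) the third coordinate separates the classes with the plane z = 0. The points are made up for illustration.

```python
def phi(x, y):
    """The feature map on this slide: (x, y) -> (x, y, xy)."""
    return (x, y, x * y)

points_a = [(1, 1), (-1, -1)]   # xy = +1 for both
points_b = [(1, -1), (-1, 1)]   # xy = -1 for both

# No line separates points_a from points_b in 2-D, but in 3-D the
# hyperplane z = 0 does:
assert all(phi(x, y)[2] > 0 for x, y in points_a)
assert all(phi(x, y)[2] < 0 for x, y in points_b)
```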

SLIDE 32

Methods – Linear SVM

[Figures: the two classes separated by a linear SVM, and by a linear SVM in the higher-dimensional space]

SLIDE 33

Methods – SVM

[Diagram: three hyperplanes SVM1, SVM2, SVM3 for a multi-class problem]

Fruits on this side of the hyperplane, given by SVM1, cannot be bananas.

SLIDE 34

Methods – Multilayer Perceptron

  • Construct a non-linear decision boundary
  • Classify the test case using the decision boundary

SLIDE 35

Methods – Probabilistic Neural Network

  • Parallel implementation of Bayes-Parzen classifier
  • Bayes decision criterion

– pk: prior probability that a case belongs to class k
– ck: costs associated with a case of this class being misclassified
– fk: estimated density of class k

An unknown case z is classified as member of class i if for all j ≠ i :

pi ci fi (z) > pj cj fj (z)

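The decision rule above can be sketched with Parzen (Gaussian-kernel) density estimates standing in for fk; the data, priors, costs, and bandwidth sigma below are made-up illustrations, not values from the study.

```python
import math

def parzen_density(z, samples, sigma=1.0):
    """1-D Gaussian-kernel (Parzen) density estimate of f_k at z."""
    n = len(samples)
    return sum(
        math.exp(-((z - x) ** 2) / (2 * sigma ** 2)) for x in samples
    ) / (n * sigma * math.sqrt(2 * math.pi))

def pnn_classify(z, classes, priors, costs):
    """Assign z to the class i maximizing p_i * c_i * f_i(z)."""
    scores = {
        k: priors[k] * costs[k] * parzen_density(z, samples)
        for k, samples in classes.items()
    }
    return max(scores, key=scores.get)

# Toy example: two well-separated 1-D classes, equal priors and costs
classes = {"A": [0.0, 0.5, 1.0], "B": [4.0, 4.5, 5.0]}
priors = {"A": 0.5, "B": 0.5}
costs = {"A": 1.0, "B": 1.0}
```

With unequal costs ck, the rule shifts the boundary toward the cheaper class, which is how a PNN accounts for asymmetric misclassification costs.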

SLIDE 36

Methods – Probabilistic Neural Network

  • Parallel implementation of the Bayes-Parzen classifier
  • Takes into account class densities and class priors
  • Estimates class posteriors for test cases
  • Classifies new cases, e.g. using argmax(pi)

SLIDE 37

Curse of Dimensionality

  • Aka the large-p, small-n problem: many variables, few observations
  • Frequent in the life sciences (e.g., microarray data analysis)
  • Most machine learning methods have been developed for scenarios that are characterized by many observations and few variables
  • Problem: how to define similarity?
  • OK, similarity = 1 − distance, but how to define distance?
SLIDE 38

Curse of Dimensionality

  • L1 norm: Manhattan distance
  • L2 norm: Euclidean distance
  • The higher the dimension of the data, the less meaningful the Lk norm: the contrast between nearest and farthest neighbour shrinks [Aggarwal et al., 2001, “On the surprising behavior of distance metrics in high dimensional space”]
  • Fractional distance (Lk with k < 1) preserves more contrast
  • Implemented for PNN and k-NN in the present study
  • Additional tuning parameter: fract
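A minimal sketch of the Lk distance with a fractional exponent; the slide's fract parameter plays the role of k here:

```python
def minkowski(u, v, k):
    """L_k distance between vectors u and v; k < 1 gives a fractional metric."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)

# Euclidean distance is the special case k = 2:
d2 = minkowski((0, 0), (3, 4), 2)        # 5.0
# The same machinery with the fractional exponent k = 0.5:
dhalf = minkowski((0, 0), (1, 1), 0.5)   # (1 + 1)^2 = 4.0
```

Treating k (fract) as a tuning parameter lets the k-NN and PNN models pick the exponent that gives the best contrast on the high-dimensional expression data.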
SLIDE 39

Task #3: CART – Background

  • Method: Classification and Regression Tree (CART)
  • Heuristic: Recursive partitioning of data set
  • Example:
SLIDE 40

Task #3: CART – Background

  • Problem: Censored observations
SLIDE 41

Results of Task #3: Regression

SLIDE 42

Results of Task #3: Regression [1/3]

[Plots: Node #4 (Learning) and Node #4 (Test)]