

SLIDE 1

Class discrimination for microarray studies

Vlad Popovici

Swiss Institute of Bioinformatics

February 5th, 2008

Vlad Popovici (SIB) Class discrimination for microarray studies February 5th, 2008 1 / 45

SLIDE 2

Outline

1. Introduction
2. Discriminant analysis
3. Performance assessment
4. Estimating the performance parameters


SLIDE 5

Introduction

Example: ER status prediction

Questions:
- How to decide which patient is ER+ and which is ER-?
- What is the expected error?
- What if I prefer to detect most of the ER+ patients?


SLIDE 7

Introduction

Know your problem!

Remember: good study ←→ clear objectives.

Problems:
- Class Comparison: find genes differentially expressed between predefined classes;
- Class Prediction: predict one of the predefined classes using the gene expressions;
- Class Discovery: cluster analysis – define new classes using clusters of genes/specimens.



SLIDE 10

Introduction

Class prediction

Typical applications:
- predict treatment response
- predict patient relapse
- predict the phenotype
- toxico-genomics: predict which chemicals are toxic
- ...

Characteristics:
- supervised learning: requires labelled training data
- the goal is prediction accuracy
- uses some measure of similarity
- relies on feature selection
- quite often incorrectly used

SLIDE 11

Introduction

Usage problems:
- improper methodological approach:
  - a well-fitted model does not ensure good prediction (overfitted model)
  - too many features used in the model (curse of dimensionality)
  - feature selection performed on the full dataset(!)
- reproducibility:
  - improper/insufficient validation
  - batch effects unaccounted for
  - insufficiently documented
- therapeutic relevance

SLIDE 12

Introduction

Workflow: data acquisition → design decisions (feature selection method(s), classifier(s), performance criterion) → model construction (feature selection, classifier design, performance estimation, model selection) → external validation.

- Data acquisition: everything up to (and including) normalization.
- Design decisions: should be taken before the actual modeling.
- Model design: DO NOT USE ALL DATA AT ONCE!
- External validation: other datasets; clinical trials, phase II and III.

SLIDE 13

Discriminant analysis

Goal

Find a separation boundary between the classes: a function f with f(x) > 0 on one side of the boundary, f(x) < 0 on the other, and f(x) = 0 on the boundary itself.


SLIDE 16

Discriminant analysis

Representing data

[Figure: expression matrix with p rows (probesets, e.g. 1007_s_at, 117_at, 1053_at, ..., 211584_s_at, 211585_at) and n columns (Tumor 1, Tumor 2, ..., Tumor k, ..., Tumor n).]

- each element to be classified is a vector
- usually we classify tumors/samples/patients → use columns
- p ≫ n

SLIDE 17

Discriminant analysis

Formalism:
- Data points: X = {x_i ∈ R^p | i = 1, ..., n}; x could be the log ratios or log signals.
- Labels: Y = {y_i ∈ {ω_1, ..., ω_k} | i = 1, ..., n} (k classes); e.g. ω_1 = pCR and ω_2 = non-pCR.
- Easier: take y_i ∈ {1, 2, ..., k}, or y_i ∈ {−1, +1} for two classes.

2-class problem (dichotomy)

Given a finite set of points X and their corresponding labels Y (a training set), find a discriminant function f such that, for all x,

  f(x) > 0  for x ∈ ω_1,
  f(x) < 0  for x ∈ ω_2.

SLIDE 18

Discriminant analysis

Assumptions:
- the training set is representative of the whole population
- the characteristics of the population do not change over time (e.g. same experimental conditions)

Comments:
- "for all x" means: infer a rule that works for unseen data – generalization
- perfect classification of the training data does not ensure generalization; e.g. memorizing f(x_i) = y_i will hardly work on new data
- as stated, the problem is ill-posed: there is an infinity of solutions
- real data is noisy: usually there is no perfect solution

SLIDE 19

Discriminant analysis

Linear discriminants

With x = (x_1, ..., x_p)^t and w = (w_1, ..., w_p)^t,

Linear discriminant functions

  f(x) = w^t x + w_0 = Σ_{i=1}^{p} w_i x_i + w_0,  w, x ∈ R^p, w_0 ∈ R

New problem: optimize some criterion

  (w*, w_0*) = arg max_{w, w_0} J(X, Y; w, w_0)

SLIDE 20

Discriminant analysis

Geometry of the linear discriminants

[Figure: the separating hyperplane, its normal vector w, and the offset w_0.]

SLIDE 22

Discriminant analysis

Fisher's LDA

Fisher's criterion

  J(w) = (w^t S_B w) / (w^t S_W w)

where
- S_B = (μ_1 − μ_2)(μ_1 − μ_2)^t is the between-class scatter matrix (μ_i is the average of class i);
- S_W = (1/(n−2)) (n_1 Σ̂_1 + n_2 Σ̂_2) is the pooled within-class covariance matrix (Σ̂_i is the estimated covariance of class i).

SLIDE 23

Discriminant analysis

Fisher's criterion (in plain English)

Find the direction w along which the two classes are best separated – in some sense.

Solution: w = S_W^{-1} (μ_1 − μ_2)

w_0 = ?
- assuming the data are normal with equal covariances: closed-form formula
- alternative: estimate w_0 by line search
- can be used to embed prior probabilities in the classifier
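The solution above can be checked numerically. The sketch below (plain Python, with toy 2-D data invented for illustration, not the lecture's dataset) computes w = S_W^{-1}(μ_1 − μ_2) and places w_0 so that the boundary passes through the midpoint of the two class means – the closed-form choice under equal covariances and equal priors:

```python
def mean(vs):
    n = len(vs)
    return [sum(v[j] for v in vs) / n for j in range(len(vs[0]))]

def scatter(vs, mu):
    # sum of outer products (x - mu)(x - mu)^t, for 2-D points
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vs:
        d = [v[0] - mu[0], v[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_lda(c1, c2):
    mu1, mu2 = mean(c1), mean(c2)
    n = len(c1) + len(c2)
    s1, s2 = scatter(c1, mu1), scatter(c2, mu2)
    # pooled within-class covariance S_W = (S_1 + S_2) / (n - 2)
    sw = [[(s1[i][j] + s2[i][j]) / (n - 2) for j in range(2)] for i in range(2)]
    # invert the 2x2 matrix S_W
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    # w = S_W^{-1} (mu1 - mu2)
    w = [inv[0][0] * dm[0] + inv[0][1] * dm[1],
         inv[1][0] * dm[0] + inv[1][1] * dm[1]]
    # w0: put the boundary halfway between the class means
    mid = [(mu1[0] + mu2[0]) / 2, (mu1[1] + mu2[1]) / 2]
    w0 = -(w[0] * mid[0] + w[1] * mid[1])
    return w, w0

# toy data: two well-separated 2-D clusters
c1 = [[2.0, 2.2], [2.5, 2.0], [3.0, 2.8], [2.2, 2.5]]
c2 = [[0.0, 0.5], [0.5, 0.0], [-0.5, 0.2], [0.3, -0.3]]
w, w0 = fisher_lda(c1, c2)
f = lambda x: w[0] * x[0] + w[1] * x[1] + w0
```

With these data, f(x) = w^t x + w_0 is positive on class 1 and negative on class 2, as required by the 2-class formalism of slide 17.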

SLIDE 24

Discriminant analysis

Versions of Fisher's DA

- under the normality assumption, if the covariance matrices are equal and the features are uncorrelated: Diagonal LDA (the covariance matrices are diagonal);
- if the covariance matrices are not equal: Quadratic DA.

SLIDE 25

Discriminant analysis

Versions of Fisher's DA

[Figure: decision boundaries of the DA variants; from Duda, Hart & Stork, Pattern Classification.]

SLIDE 26

Discriminant analysis

A Bayesian perspective

- 2 classes, ω_1, ω_2; one continuous feature (e.g. log expression)
- p(x|ω_i): class-conditional distribution
- from Bayes' formula: p(ω_i|x) = p(x|ω_i) p(ω_i) / p(x), i.e. posterior = likelihood × prior / evidence
- optimal decision (Bayes decision rule): decide x ∈ ω_1 if p(ω_1|x) > p(ω_2|x), otherwise decide x ∈ ω_2
- but p(ω_i) = ? and p(x|ω_i) = ?
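The Bayes rule above is easy to sketch for one continuous feature. In the toy model below everything is an assumption made for illustration: Gaussian class-conditionals p(x|ω_i) and priors p(ω_i) are simply postulated rather than estimated from data, and the class names merely echo the ER example:

```python
import math

def normal_pdf(x, mu, sigma):
    # Gaussian density, used here as the class-conditional p(x|omega_i)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

prior = {"ER+": 0.7, "ER-": 0.3}               # p(omega_i), assumed
like = {"ER+": (8.0, 1.0), "ER-": (5.0, 1.0)}  # (mu, sigma) of p(x|omega_i), assumed

def posterior(x):
    # p(omega_i|x) = p(x|omega_i) p(omega_i) / p(x)
    joint = {c: normal_pdf(x, *like[c]) * prior[c] for c in prior}
    evidence = sum(joint.values())             # p(x)
    return {c: j / evidence for c, j in joint.items()}

def bayes_decide(x):
    # decide the class with the largest posterior
    post = posterior(x)
    return max(post, key=post.get)
```

Because the posteriors are computed explicitly, the same code also gives the confidence measure mentioned on the next slide.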

SLIDE 27

Discriminant analysis

Comments:
- taking p(x|ω_i) to be normal densities ⇒ the Bayes decision boundary is one of the previous DA forms
- this hypothesis allows estimating the posteriors too ⇒ we can have a confidence measure on the decision
- estimating p(x|ω_i) from data ⇒ Naïve Bayes classifier

SLIDE 28

Discriminant analysis

Support Vector Machines (SVM)

- separation of the classes is measured in terms of margin
- rigorous mathematical framework (structural risk minimization)
- linear discriminant (optimal hyperplane)

[Figure: the maximum-margin hyperplane, with the margin and training errors marked.]


SLIDE 30

Discriminant analysis

Principle of SVM

Maximize the margin while minimizing the training error. Optimization problem: find the hyperplane such that

  1/margin² + cost · Σ_i error_i

is minimized.
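For a canonical hyperplane the margin is 1/‖w‖, so 1/margin² is just ‖w‖², and the errors are commonly measured by the hinge loss max(0, 1 − y_i f(x_i)). The sketch below only evaluates this objective for a hand-picked candidate hyperplane on invented toy data; it is not a solver and not the lecture's formulation verbatim:

```python
def svm_objective(w, w0, X, y, cost=1.0):
    # 1/margin^2 = ||w||^2 for a canonical hyperplane (margin = 1/||w||) ...
    margin_term = sum(wi * wi for wi in w)
    # ... plus the cost-weighted hinge errors max(0, 1 - y_i * f(x_i))
    hinge = sum(max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + w0))
                for x, yi in zip(X, y))
    return margin_term + cost * hinge

X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [+1, +1, -1, -1]
# a candidate separating hyperplane (not the optimum): w = (1, 1), w0 = -2
obj = svm_objective([1.0, 1.0], -2.0, X, y)
```

Training an SVM means searching for the (w, w0) that minimizes this quantity; here the candidate separates all points with margin ≥ 1, so only the ‖w‖² term contributes.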


SLIDE 32

Discriminant analysis

Kernel trick

Q: What if the data is not linearly separable?
A: Transform the data such that it becomes (more or less) linearly separable.

Kernels: polynomial, radial basis function, etc.

SLIDE 33

Discriminant analysis

Other commonly used classifiers

- nearest centroid, k-NN
- logistic regression
- classification trees (CART, C4.5); random forests
- compound covariate predictor


SLIDE 38

Discriminant analysis

Final remarks on classifiers

- try to understand the classifier: assumptions, strong/weak points
- feature selection should be related to the classifier used (e.g. the t-test is suited for LDA but not for SVM)
- complex classifiers generally need more data for training
- use the simplest classifier that gives good results...
- ...but not a simpler one!

SLIDE 39

Performance assessment

Aspects of classifiers' performance

- discriminability: how well the rule predicts unseen data
- reliability: robustness of the prediction, or how well the posterior probabilities are estimated

SLIDE 40

Performance assessment

Counting errors

- basic: how many times the predicted class differs from the true class (0–1 loss)
- cost-aware methods: some errors may be more costly than others
- root mean square error (RMSE): requires y_i ∈ {0, 1} and that the classifier outputs posterior probabilities p_i:

    RMSE = sqrt( (1/n) Σ_i (y_i − p_i)² )

- mean absolute error (MAE): same requirements as RMSE:

    MAE = (1/n) Σ_i |y_i − p_i|
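The two formulas translate directly into code. A minimal sketch, with invented labels and posterior probabilities for illustration:

```python
import math

def rmse(y, p):
    # y_i in {0, 1}, p_i = predicted posterior probability of class 1
    n = len(y)
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / n)

def mae(y, p):
    n = len(y)
    return sum(abs(yi - pi) for yi, pi in zip(y, p)) / n

y = [1, 1, 0, 0]          # true labels
p = [0.9, 0.6, 0.2, 0.3]  # predicted posteriors (toy values)
```

Both metrics reward well-calibrated posteriors, not only correct hard decisions; RMSE penalizes large deviations more strongly than MAE.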


SLIDE 42

Performance assessment

Confusion matrix

(for 2 classes, called positive and negative, respectively)

                        Ground truth (gold standard)
                        Positive    Negative
  Predicted  Positive   a           b
             Negative   c           d

- a + b + c + d = n; ideally b = c = 0
- a: true positives, b: false positives, c: false negatives, d: true negatives

SLIDE 43

Performance assessment

Error/performance metrics

These are point estimates:
- apparent positives = a + c, apparent negatives = b + d
- false positive rate FPR = b/(b + d), false negative rate FNR = c/(a + c)
- accuracy ACC = (a + d)/n
- misclassification rate Err = (b + c)/n
- sensitivity SN = a/(a + c) = true positive rate
- specificity SP = d/(b + d) = true negative rate
- positive predictive value PPV = a/(a + b)
- negative predictive value NPV = d/(c + d)
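All of these metrics are simple ratios over the confusion-matrix cells. A sketch, using the slide's a/b/c/d notation and an invented confusion matrix:

```python
def metrics(a, b, c, d):
    # a: true positives, b: false positives, c: false negatives, d: true negatives
    n = a + b + c + d
    return {
        "ACC": (a + d) / n,   # accuracy
        "Err": (b + c) / n,   # misclassification rate
        "SN":  a / (a + c),   # sensitivity = true positive rate
        "SP":  d / (b + d),   # specificity = true negative rate
        "FPR": b / (b + d),
        "FNR": c / (a + c),
        "PPV": a / (a + b),
        "NPV": d / (c + d),
    }

m = metrics(40, 5, 10, 45)  # toy counts, n = 100
```

Note the built-in identities SN + FNR = 1 and SP + FPR = 1, which is why ROC analysis (next slides) only needs the pair (FPR, TPR).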


SLIDE 45

Performance assessment

ROC and AUC

What if we would like to see the effect of trading off true positive rate against false positive rate (SN vs 1 − SP)?
- vary the decision threshold
- for each value of the threshold, measure TPR and FPR

[Figure: ROC curve – true positive rate (SN) against false positive rate (1 − SP).]

ROC = Receiver Operating Characteristic
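The threshold sweep described above can be sketched directly; the AUC of the slides after this one is then the trapezoidal area under the resulting staircase. Toy scores and labels below are invented for illustration:

```python
def roc_points(scores, labels):
    # sweep the threshold over all observed scores; labels are +1/-1
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    pts = set()
    for t in sorted(scores) + [float("inf")]:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == -1)
        pts.add((fp / neg, tp / pos))   # (FPR, TPR) = (1 - SP, SN)
    return sorted(pts)

def auc(points):
    # trapezoidal area under the (FPR, TPR) curve
    a = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        a += (x1 - x0) * (y0 + y1) / 2
    return a

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # classifier outputs (toy)
labels = [1, 1, -1, 1, -1, -1]            # ground truth (toy)
points = roc_points(scores, labels)
```

The curve always runs from (0, 0) (threshold above every score) to (1, 1) (threshold below every score); for these toy data the AUC is 8/9.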

SLIDE 46

Performance assessment

ROC and AUC

Different types of ROC curve:

[Figure: three ROC curves R0, R1, R2 in the (1 − SP, SN) plane; the diagonal corresponds to a random classifier.]

SLIDE 47

Performance assessment

ROC and AUC

ROC curves can be used for comparing classifiers...

[Figure: ROC curves R1 and R2.]

SLIDE 48

Performance assessment

ROC and AUC

...but not always.

[Figure: ROC curves R1 and R2 that cross, so neither dominates the other.]


SLIDE 50

Performance assessment

ROC and AUC

The Area Under the Curve (AUC) is a summary of the ROC...

[Figure: the shaded area under the ROC curve.]

...and can be used for comparing classifiers.

SLIDE 51

Estimating the performance parameters

Estimation schemes

Why estimation? Generally there is no analytic form of the error rate. We discuss the error rate, but this applies to any performance metric.

Simple schemes:
- apparent error: estimate on the training set
- holdout estimate: unique train/test split

Resampling schemes:
- k-fold cross-validation
- repeated k-fold cross-validation
- leave-one-out
- bootstrapping

SLIDE 52

Estimating the performance parameters

k-fold cross-validation

- separate train and test sets
- randomly divide the data into k subsets (folds); you may also choose to enforce the class proportions (stratified CV)
- train on k − 1 folds and test on the held-out fold
- estimate the error as the average error measured on the held-out folds

[Figure: data split into folds; each fold in turn serves as the test set while the rest form the train set.]

- usually k = 5 or k = 10
- if k = n ⇒ leave-one-out estimator
- improved estimation: repeated k-CV (e.g. 100 × (5-CV))
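The procedure above can be sketched in a few lines. The nearest-centroid stand-in classifier and the 1-D toy data are assumptions made for illustration only (they are not the lecture's classifier or dataset); the CV scaffolding is the point:

```python
import random

def kfold_indices(n, k, seed=0):
    # randomly divide {0, ..., n-1} into k folds
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[j::k] for j in range(k)]

def cv_error(X, y, train_fn, k=5):
    folds = kfold_indices(len(X), k)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # train on k-1 folds...
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        # ...and test on the held-out fold
        wrong = sum(1 for i in test_idx if predict(X[i]) != y[i])
        errors.append(wrong / len(test_idx))
    # the k-fold CV estimate is the average test-fold error
    return sum(errors) / k

# stand-in classifier: 1-D nearest centroid (assumption, not the lecture's model)
def train_centroid(X, y):
    mu = {c: sum(x for x, yi in zip(X, y) if yi == c) / sum(1 for yi in y if yi == c)
          for c in set(y)}
    return lambda x: min(mu, key=lambda c: abs(x - mu[c]))

X = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
err = cv_error(X, y, train_centroid, k=5)
```

Repeated k-CV simply reruns `cv_error` with different shuffling seeds and averages the estimates; stratified CV would constrain the shuffle so each fold preserves the class proportions.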


SLIDE 56

Estimating the performance parameters

k-fold cross-validation

From the k folds:
- ε_1, ..., ε_k: errors on the test folds
- Ê_{k-CV} = (1/k) Σ_{j=1}^{k} ε_j
- estimated standard deviation
- confidence intervals (simple version – binomial approximation):

    E ≈ Ê ± ( 0.5/n + z sqrt( Ê(1 − Ê)/n ) )

  where n is the dataset size and z = Φ^{-1}(1 − α/2), for a 1 − α confidence interval (e.g. z = 1.96 for a 95% confidence interval)
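The binomial-approximation interval is a one-liner. In the sketch below z is hard-coded to 1.96 (the 95% case) to stay within the standard library, and the sample size n = 80 is an assumed value used only to illustrate the call:

```python
def cv_confidence_interval(e_hat, n, z=1.96):
    # E ~ e_hat +/- (0.5/n + z * sqrt(e_hat * (1 - e_hat) / n)), clipped to [0, 1]
    half = 0.5 / n + z * (e_hat * (1.0 - e_hat) / n) ** 0.5
    return max(0.0, e_hat - half), min(1.0, e_hat + half)

# e.g. a CV error estimate of 0.0692 on an assumed n = 80 samples
lo, hi = cv_confidence_interval(0.0692, 80)
```

The 0.5/n term is the continuity correction; note that the interval depends on the dataset size n, not on the number of folds k.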

SLIDE 57

Estimating the performance parameters

Bootstrap error estimation

[Figure: from (X, Y), generate bootstrap datasets (X_1, Y_1), ..., (X_B, Y_B), each yielding an error estimate E_1, ..., E_B.]

1. generate a new dataset (X_b, Y_b) by resampling with replacement from the original dataset (X, Y);
2. train the classifier on (X_b, Y_b) and test on the left-out data, to obtain an error Ê_b;
3. repeat steps 1–2 for b = 1, ..., B and collect the Ê_b.

SLIDE 58

Estimating the performance parameters

Bootstrap error estimation

- estimate the error: for example, use the .632 estimator

    Ê = 0.368 E_0 + 0.632 (1/B) Σ_{b=1}^{B} Ê_b

  where E_0 is the error rate on the full training set (X, Y)
- use the empirical distribution of the Ê_b to obtain confidence intervals
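Steps 1–3 and the .632 combination can be sketched together. As before, the nearest-centroid classifier and the toy 1-D data are stand-ins assumed for illustration, not the lecture's model:

```python
import random

def train_centroid(X, y):
    # stand-in 1-D nearest-centroid classifier (assumption)
    mu = {c: sum(x for x, yi in zip(X, y) if yi == c) / sum(1 for yi in y if yi == c)
          for c in set(y)}
    return lambda x: min(mu, key=lambda c: abs(x - mu[c]))

def bootstrap_632(X, y, train_fn, B=50, seed=0):
    rng = random.Random(seed)
    n = len(X)
    # E0: error rate on the full training set (train and test on all data)
    predict = train_fn(X, y)
    e0 = sum(predict(x) != yi for x, yi in zip(X, y)) / n
    eb = []
    for _ in range(B):
        # 1. resample with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        # 2. train on the bootstrap sample, test on the left-out points
        in_bag = set(idx)
        out = [i for i in range(n) if i not in in_bag]
        if not out:
            continue
        p = train_fn([X[i] for i in idx], [y[i] for i in idx])
        eb.append(sum(p(X[i]) != y[i] for i in out) / len(out))
    # 3. combine the estimates: the .632 estimator
    return 0.368 * e0 + 0.632 * sum(eb) / len(eb)

X = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
est = bootstrap_632(X, y, train_centroid)
```

The collected `eb` values are an empirical distribution of the error, so their quantiles give the confidence intervals mentioned on the slide.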


SLIDE 60

Estimating the performance parameters

Case study (1)

Build a classifier to predict ER status:
- select the top 2 probesets using the ratio of between-groups to within-groups sum of squares (similar to a t-test)
- use LDA to discriminate between ER- and ER+

Algorithm

Let (X, Y) be the full dataset. For each of the k = 1, ..., K folds:
- let the training set be X_tr = X \ X^(k), Y_tr = Y \ Y^(k) (all but the k-th fold), and let the test set be X_ts = X^(k), Y_ts = Y^(k)
- select the best 2 probesets on (X_tr, Y_tr) and train LDA on these data
- test on (X_ts, Y_ts) and record the values for error, AUC, ...

Compute the expected error, AUC, ...
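The scoring step of the algorithm can be sketched as below. The key point of the case study is that `top_features` must only ever see the training folds (X_tr, Y_tr), never the full dataset; the toy expression matrix here is invented, with feature 0 deliberately informative:

```python
def bss_wss(values, labels):
    # ratio of between-groups to within-groups sum of squares for one feature
    overall = sum(values) / len(values)
    bss = wss = 0.0
    for c in set(labels):
        vc = [v for v, l in zip(values, labels) if l == c]
        mu = sum(vc) / len(vc)
        bss += len(vc) * (mu - overall) ** 2
        wss += sum((v - mu) ** 2 for v in vc)
    return bss / wss

def top_features(X, y, m=2):
    # X: samples as rows, p feature values per sample;
    # call this on the TRAINING folds only, never on the full dataset
    p = len(X[0])
    scores = [(bss_wss([x[j] for x in X], y), j) for j in range(p)]
    return [j for _, j in sorted(scores, reverse=True)[:m]]

# toy data: feature 0 separates the classes, features 1 and 2 are noise
X = [[1.0, 5.0, 2.0], [1.2, 4.0, 2.2], [0.9, 5.5, 1.9],
     [4.0, 5.1, 2.1], [4.2, 4.4, 2.0], [3.9, 5.2, 2.05]]
y = [0, 0, 0, 1, 1, 1]
sel = top_features(X, y, 2)
```

Repeating this selection inside every CV fold, then training LDA on the selected columns, is exactly what prevents the selection bias warned about on slide 11.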

SLIDE 61

Estimating the performance parameters

Case study (1)

  CV scheme      Error     95% CI for Error       AUC
  5-CV           0.0692    0.0073–0.1311          0.966
  10-CV          0.0615    0.0026–0.1204          0.970
  20 × (5-CV)    0.0615    (0.0538–0.0692)*       0.969

* from the quantiles of the empirical distribution of the error


SLIDE 63

Estimating the performance parameters

Case study (2)

Build a classifier to predict patients with pathologic complete response (pCR):
- find the best number of probesets to include in the model, using the ratio of between-groups to within-groups sum of squares (similar to a t-test)
- use LDA to discriminate between patients with pCR and those without

The best number of probesets is a meta-parameter, which has to be estimated within the cross-validation. Problem: if we estimate the meta-parameter on the full training set, we will likely fail to correctly estimate the performance (optimistic bias).

SLIDE 64

Estimating the performance parameters

Case study (2)

Two nested (external/internal) CV loops:

[Figure: the outer CV loop splits the data into train and test sets; an inner CV loop, run on the training set only, selects the meta-parameter.]
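The nested scheme can be sketched as two CV loops, one inside the other. The stand-in classifier below deliberately ignores the meta-parameter (it just thresholds at the training mean) so the scaffolding stays short and runnable; in the case study the meta-parameter would be the number of probesets, and `train_fn` would select features and fit LDA:

```python
import random

def nested_cv(X, y, train_fn, param_grid, k_outer=5, k_inner=3, seed=0):
    # outer loop: performance estimation; inner loop: meta-parameter selection
    def folds(idx, k, rng):
        idx = idx[:]
        rng.shuffle(idx)
        return [idx[j::k] for j in range(k)]

    def cv_err(idx, k, param, rng):
        errs = []
        for test in folds(idx, k, rng):
            held = set(test)
            tr = [i for i in idx if i not in held]
            pred = train_fn([X[i] for i in tr], [y[i] for i in tr], param)
            errs.append(sum(pred(X[i]) != y[i] for i in test) / len(test))
        return sum(errs) / k

    rng = random.Random(seed)
    outer_errs = []
    for test in folds(list(range(len(X))), k_outer, rng):
        held = set(test)
        tr = [i for i in range(len(X)) if i not in held]
        # inner CV on the outer-training data only: pick the best meta-parameter
        best = min(param_grid, key=lambda p: cv_err(tr, k_inner, p, rng))
        pred = train_fn([X[i] for i in tr], [y[i] for i in tr], best)
        outer_errs.append(sum(pred(X[i]) != y[i] for i in test) / len(test))
    return sum(outer_errs) / k_outer

# stand-in classifier: threshold at the training mean, parameter ignored (assumption)
def train_fn(X, y, param):
    thr = sum(X) / len(X)
    return lambda x: 1 if x > thr else 0

X = [0.1, 0.2, 0.3, 0.4, 0.5, 5.1, 5.2, 5.3, 5.4, 5.5]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
est = nested_cv(X, y, train_fn, ["p1", "p2"])
```

Because the inner loop never touches the outer test fold, the outer error estimate is free of the optimistic bias described on the previous slide.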


SLIDE 66

Estimating the performance parameters

What we do learn from CV:
- the expected performance of the modeling recipe;
- the imprecision in estimating the performance;
- we can have a look at:
  - which features are the most stable
  - which points are always misclassified

What we do not learn from CV:
- the best features
- the best classifier
- the best meta-parameters

We obtain these by training on the full dataset (no CV).

SLIDE 67

Bibliography

- Duda, Hart, Stork: Pattern Classification
- Hastie, Tibshirani, Friedman: The Elements of Statistical Learning
- T. Fawcett: ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, HP Laboratories Tech. Rep. HPL-2003-4
- A. Webb: Statistical Pattern Recognition
- I. Shmulevich, E. Dougherty: Genomic Signal Processing