Rank-Based Classification of Gene Expression Profiles


SLIDE 1

INTERFACE 2004

Rank-Based Classification of Gene Expression Profiles

Daniel Q. Naiman‡

Collaborators: Donald Geman†‡, Christian d’Avignon†§ & Raimond L. Winslow†§

‡Department of Applied Mathematics and Statistics
†Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute
§Department of Biomedical Engineering

Johns Hopkins University, Baltimore, MD

SLIDE 2

Basic approach to classification using gene expression

Use pairwise comparisons between the expression levels of gene pairs as features for classification.

Motivations

  • the small sample dilemma
  • parsimony/interpretability
  • transparency - invariance to normalization
  • experimental evidence

SLIDE 3

Microarray Data Analysis

Expression data: $G \times n$ matrix with labeled columns

  • $G$ = number of genes/ESTs
  • $n$ = number of samples (tissues) obtained under various biological conditions

Column labels indicate the class of each sample, e.g.

  • tumor/normal
  • disease/non-disease

SLIDE 4

Typical Experimental Objectives

Clustering – group genes or samples in meaningful ways

Modeling – describe statistical behavior of expression levels

  • marginal behavior for individual genes
  • joint behavior for multiple genes

Classification (the focus of this talk) – predict classes e.g.

  • cancerous tumor vs. normal tissue
  • treatment outcome (success/failure)
  • disease type

SLIDE 5

Statistical Perspective: Small Sample Dilemma

  • Problem: Small number of experiments ($n$), typically tens, relative to the number of genes $G$, typically thousands.
  • Example: $n = 34$ samples, and $G = 7{,}129$ genes.
  • Consequence: Standard methods in machine learning where algorithms are “tuned” (outside of the CV loop!!!) often lead to over-fitting and inflated estimates of performance.

SLIDE 6

The Bias-Variance Tradeoff

  • Machine learning community mantra: Complex models lead to low bias/high variance. Simpler models give rise to high bias/low variance.
  • Consequence: Minimization of error rates can result from choosing models in a smaller class.

SLIDE 7

Biological Perspective: Interpretability/Parsimony Dilemma

  • Problem: The decision boundary generated by standard classifiers can often be highly complex.
  • Examples: Support-vector machines, neural networks, random forests, logitboost, nearest neighbors.
  • The manner in which decisions are made too closely resembles a black box, and the decision rules lack transparency.
  • We seek transparent classifiers involving small numbers of genes.

SLIDE 8

Mathematical Formulation

  • Expression random variables: $X = (X_1, \ldots, X_G)$.
  • Class random variable: $Y \in \{1, 2\}$
  • Classifier: $f : \mathbb{R}^G \to \{1, 2\}$
  • Training data: a $G \times n$ matrix $S$ consisting of $n = n_1 + n_2$ columns (expression profiles) where $n_k$ of the columns are iid samples of $X$ given $Y = k$ for $k = 1, 2$.
  • Learning algorithm: Mapping $L$ that assigns a classifier $f = L(S)$ for every choice of training data $S$.
  • Generalization error: $e(f) = P[f(X) \neq Y]$, the probability of making an error on a future profile (depends on $L$ and the distribution of $(X, Y)$).
  • Estimated error rate: An estimate of $e(f)$ from data.

SLIDE 9

Pairwise Comparison

Focus on detecting “marker gene pairs” $(i, j)$ whose expression values invert in going from class 1 to class 2, that is, for which

$p_{ij}(k) := P(X_i < X_j \mid Y = k)$

changes considerably when changing from $k = 1$ to $k = 2$.

These probabilities are estimated by relative frequencies of occurrences of $X_i < X_j$ within profiles and over samples.
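
The relative-frequency estimate can be sketched in a few lines of NumPy. This is only an illustration: the function name `pair_frequency`, the toy matrix, and the strict comparison `X_i < X_j` are assumptions made for the example, not taken from the slides.

```python
import numpy as np

def pair_frequency(expr, labels, i, j, k):
    """Relative frequency of X_i < X_j among the samples of class k.

    expr   : G x n array (rows = genes, columns = samples)
    labels : length-n array of class labels (1 or 2)
    """
    cols = expr[:, labels == k]            # profiles belonging to class k
    return float(np.mean(cols[i] < cols[j]))

# Toy data (hypothetical values): 3 genes, 5 samples.
expr = np.array([[1.0, 2.0, 0.5, 9.0, 8.0],
                 [3.0, 4.0, 0.2, 1.0, 2.0],
                 [5.0, 5.0, 5.0, 5.0, 5.0]])
labels = np.array([1, 1, 1, 2, 2])

p1 = pair_frequency(expr, labels, 0, 1, 1)   # 2 of 3 class-1 profiles have X_0 < X_1
p2 = pair_frequency(expr, labels, 0, 1, 2)   # 0 of 2 class-2 profiles have X_0 < X_1
```

Each comparison is made within a single profile, so only the ordering of the two genes inside one sample matters.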

SLIDE 10

“Scoring” Gene Pairs

Define a “score” associated with each gene pair $(i, j)$:

$\Delta_{ij} = |p_{ij}(1) - p_{ij}(2)|$

We seek pairs $(i, j)$ with high scores $\Delta_{ij}$.

SLIDE 11

Gene Pair Score Example

           $X_i < X_j$   $X_i \geq X_j$   total
class 1        17             4           $n_1 = 21$
class 2         4            35           $n_2 = 39$

$\hat{p}_{ij}(1) = 17/21$, $\hat{p}_{ij}(2) = 4/39$

$\hat{\Delta}_{ij} = |17/21 - 4/39| = .707$
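
As a quick check of the arithmetic in this example, the score can be recomputed exactly from the four counts. The helper name `score_from_counts` is hypothetical; the counts are the ones shown on the slide.

```python
from fractions import Fraction

def score_from_counts(a1, n1, a2, n2):
    """Estimated score |p(1) - p(2)| where a_k of the n_k class-k profiles have X_i < X_j."""
    return abs(Fraction(a1, n1) - Fraction(a2, n2))

delta_hat = score_from_counts(17, 21, 4, 39)   # exact value 193/273
delta_rounded = round(float(delta_hat), 3)     # 0.707, matching the slide
```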

SLIDE 12

Interpretation of the Score

Consider a classification “stump” based on the feature defined by the indicator $I(X_i < X_j)$:

  • if $X_i < X_j$: $\hat{k} = \arg\max_k P(X_i < X_j \mid Y = k)$
  • if $X_i \geq X_j$: $\hat{k} = \arg\max_k P(X_i \geq X_j \mid Y = k)$

Sum of error probs = $P(\hat{k} = 2 \mid k = 1) + P(\hat{k} = 1 \mid k = 2) = 1 - \Delta_{ij}$
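
The identity for the sum of error probabilities follows in two lines. The derivation below is a sketch that assumes, without loss of generality, $p_{ij}(1) \geq p_{ij}(2)$, so the stump predicts class 1 when $X_i < X_j$ and class 2 otherwise; the other case is symmetric.

```latex
\begin{align*}
P(\hat{k} = 2 \mid k = 1) &= P(X_i \geq X_j \mid Y = 1) = 1 - p_{ij}(1), \\
P(\hat{k} = 1 \mid k = 2) &= P(X_i < X_j \mid Y = 2) = p_{ij}(2), \\
P(\hat{k} = 2 \mid k = 1) + P(\hat{k} = 1 \mid k = 2)
  &= 1 - \bigl(p_{ij}(1) - p_{ij}(2)\bigr) = 1 - \Delta_{ij}.
\end{align*}
```

A high score therefore corresponds directly to a low total error for the single-pair stump.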

SLIDE 13

Gene Pair Selection

  • Estimate $\Delta_{ij}$ for all gene pairs $(i, j)$.
  • Rank all pairs $(i, j)$ based on $\hat{\Delta}_{ij}$.
  • Select all of the pairs $(i, j)$ attaining the maximum score (ties are common).
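
The three selection steps can be sketched as follows. This is an illustrative brute-force version, not the authors' implementation: the function name `top_scoring_pairs` and the toy data are assumptions, and the $G \times G \times n$ comparison array is only practical for small $G$.

```python
import numpy as np

def top_scoring_pairs(expr, labels):
    """Return (maximum score, list of pairs (i, j) with i < j attaining it)."""
    x1 = expr[:, labels == 1]
    x2 = expr[:, labels == 2]
    # p_k[i, j] = fraction of class-k profiles with X_i < X_j
    p1 = (x1[:, None, :] < x1[None, :, :]).mean(axis=2)
    p2 = (x2[:, None, :] < x2[None, :, :]).mean(axis=2)
    delta = np.abs(p1 - p2)
    rows, cols = np.triu_indices(expr.shape[0], k=1)   # each unordered pair once
    best = delta[rows, cols].max()
    pairs = [(int(i), int(j)) for i, j in zip(rows, cols)
             if np.isclose(delta[i, j], best)]          # keep all ties
    return float(best), pairs

# Toy data (hypothetical values): 3 genes, 5 samples.
expr = np.array([[1.0, 2.0, 0.5, 9.0, 8.0],
                 [3.0, 4.0, 0.2, 1.0, 2.0],
                 [5.0, 5.0, 5.0, 5.0, 5.0]])
labels = np.array([1, 1, 1, 2, 2])

best, pairs = top_scoring_pairs(expr, labels)
```

On this toy matrix, genes 0 and 2 invert perfectly between the classes, so they form the single top scoring pair.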

SLIDE 14

The Top Scoring Pair (TSP) Classifier

  • Pair selection results in a family of distinguished top scoring pairs.
  • We seek classification decisions that are easily interpreted.
  • Voting is an example of an easy to interpret algorithm.
  • Let each pair $(i, j)$ vote using the maximum likelihood scheme described above.
  • Make a majority rules decision.
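
The voting scheme can be sketched as below, given a list of already selected pairs. The function name `tsp_predict`, the toy data, and the tie-breaking toward class 1 are assumptions made for the example; the slides do not specify how a tied vote is resolved.

```python
import numpy as np

def tsp_predict(expr, labels, pairs, profile):
    """Classify one profile by majority vote over the given gene pairs."""
    votes = {1: 0, 2: 0}
    for i, j in pairs:
        # Estimated p_ij(k) = P(X_i < X_j | Y = k) from the training data.
        p = {k: np.mean(expr[i, labels == k] < expr[j, labels == k]) for k in (1, 2)}
        if profile[i] < profile[j]:
            vote = 1 if p[1] >= p[2] else 2   # class making "X_i < X_j" more likely
        else:
            vote = 1 if p[1] <= p[2] else 2   # class making "X_i >= X_j" more likely
        votes[vote] += 1
    # Tie broken toward class 1: an assumption, not specified in the slides.
    return 1 if votes[1] >= votes[2] else 2

# Toy training data (hypothetical values): 3 genes, 5 samples.
expr = np.array([[1.0, 2.0, 0.5, 9.0, 8.0],
                 [3.0, 4.0, 0.2, 1.0, 2.0],
                 [5.0, 5.0, 5.0, 5.0, 5.0]])
labels = np.array([1, 1, 1, 2, 2])
pairs = [(0, 2)]                              # the single top scoring pair for this toy data

pred_a = tsp_predict(expr, labels, pairs, np.array([1.0, 0.0, 5.0]))   # X_0 < X_2
pred_b = tsp_predict(expr, labels, pairs, np.array([9.0, 0.0, 5.0]))   # X_0 >= X_2
```

The decision rule reads off directly: with one pair, the profile is assigned to class 1 when gene 0 is expressed below gene 2 and to class 2 otherwise.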

SLIDE 15

Voting and Maximum Likelihood

Under the following assumptions, the majority rules procedure can be interpreted as a maximum likelihood estimate of the class:

  • all informative pairs are included
  • individual comparisons are conditionally independent given the class
  • for some $p$ we have either $p_{ij}(k) = p$ or $p_{ij}(k) = 1 - p$ for all $(i, j)$ and for all classes $k = 1, 2$
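
A short derivation shows why the vote count decides the likelihood comparison. It is a sketch under the assumptions listed on this slide plus two that are added here for the argument: each voting pair has $p_{ij}(1) \neq p_{ij}(2)$, and $q := \max(p, 1 - p) > 1/2$. The symbols $m$ (number of voting pairs), $v_k$ (number of votes for class $k$) and $\mathcal{L}(k)$ (likelihood of the observed comparisons under class $k$) are notation introduced for this example.

```latex
% Each pair contributes a factor q to the likelihood of the class it votes for
% and a factor 1 - q to the likelihood of the other class; v_1 + v_2 = m.
\begin{align*}
\mathcal{L}(k) &= \prod_{(i,j)} P(\text{observed comparison of } X_i, X_j \mid Y = k)
                = q^{\,v_k} (1 - q)^{\,m - v_k}, \\
\frac{\mathcal{L}(1)}{\mathcal{L}(2)} &= \left( \frac{q}{1 - q} \right)^{v_1 - v_2} > 1
  \iff v_1 > v_2 .
\end{align*}
```

So the class with more votes is exactly the class with the larger likelihood.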

SLIDE 16

Miscellaneous Remarks

  • The TSP classifier is rank-based, hence invariant to a large class of normalization methods (monotone transformations).
  • NO PARAMETERS TO TUNE in TSP leading to HONEST ERROR RATES.
  • Natural generalization to k-TSP where we choose the k top scores
      • k determined inside a cross-validation loop (double CV)
      • method remains rank-based, hence invariant as above
  • Bø and Jonassen (2002) introduced an indirect approach to selecting gene pairs involving profile classification, linear discriminant analysis, and nearest neighbors.
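
The invariance remark can be checked numerically: any strictly increasing transformation applied within a sample leaves every within-profile comparison unchanged. The sketch below uses simulated intensities and three example transformations, all of which are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.uniform(1.0, 1000.0, size=(6, 8))   # hypothetical positive intensities, 6 genes x 8 samples

def comparisons(x):
    """Boolean G x G x n array: entry [i, j, s] is True when X_i < X_j in sample s."""
    return x[:, None, :] < x[None, :, :]

raw = comparisons(expr)
logged = comparisons(np.log2(expr))            # log transform
shifted = comparisons(3.0 * expr + 10.0)       # increasing affine map
# A different positive scale factor for each sample (column).
per_sample = comparisons(expr * rng.uniform(0.5, 2.0, size=8))

same_after_log = np.array_equal(raw, logged)
same_after_affine = np.array_equal(raw, shifted)
same_after_scaling = np.array_equal(raw, per_sample)
```

Because the pair scores and the votes depend on the data only through these comparisons, the selected pairs and the predictions are unchanged as well.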

SLIDE 17

Miscellaneous Remarks (cont.)

  • Another approach to selection is possible, where attention is first restricted to differentially expressed genes
      • possible to miss certain gene pairs when both are not significantly differentially expressed
      • loss of invariance to normalization
  • A gene may appear in more than one TSP, and this typically occurs

SLIDE 18

Class Prediction Problems

  • Cardiac study: Classifying tissue samples of patients diagnosed with idiopathic dilated cardiomyopathy (IDCM) vs. control.

3 publicly available studies from the Kent Ridge Bio-medical Data Set Repository:

  • Survival study: Predicting outcomes of treatment for tumors of the central nervous system.
  • Leukemia study: Classifying profiles into leukemia subtypes
  • Prostate study: Distinguishing prostate cancers from normal profiles.

SLIDE 19

Data Set Parameters

Study      $G$      $n$   class 1            class 2
Cardiac    22,283   22    10 normal          12 IDCM
Survival   7,129    60    21 non-survivor    39 survivor
Leukemia   7,129    72    47 ALL             25 AML
Prostate   12,600   102   52 tumors          50 normal

SLIDE 20

Numbers of Top Scoring Pairs

Generally, the larger the sample size relative to the number of genes, the fewer TSPs we expect to see.

Study      Number of TSPs
Cardiac    2,460
Survival   1
Leukemia   3
Prostate   1

SLIDE 21

TSP Classification

SLIDE 22

Performance Comparisons (Classification Rates by LOOCV)

Study      TSP     Previous results
Cardiac    100%    100%
Survival   83%     47%-77%
Leukemia   94%     85%, 95%
Prostate   95%     86%-92%

SLIDE 23

Significance by Permutation Analysis

Create artificial data sets by random permutations of column labels

  • maintain sample sizes of the two classes
  • preserve statistical dependency structure among genes
  • resulting top scores in artificial data are indicative of scores obtained when attempting to classify based on profile labels that cannot be predicted from expression values
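
The permutation procedure can be sketched as below. Permuting the label vector keeps both class sizes and leaves every expression column intact, which is what preserves the dependency structure among genes. The function names, the toy data, and the definition of the simulated p-value as the fraction of permuted data sets whose top score reaches the observed one are assumptions for the example; the slides do not give the exact formula.

```python
import numpy as np

def top_score(expr, labels):
    """Maximum of |p_ij(1) - p_ij(2)| over all gene pairs."""
    x1, x2 = expr[:, labels == 1], expr[:, labels == 2]
    p1 = (x1[:, None, :] < x1[None, :, :]).mean(axis=2)
    p2 = (x2[:, None, :] < x2[None, :, :]).mean(axis=2)
    return float(np.abs(p1 - p2).max())

def permutation_p_value(expr, labels, n_perm=1000, seed=0):
    """Observed top score and the fraction of label permutations that reach it."""
    rng = np.random.default_rng(seed)
    observed = top_score(expr, labels)
    null = np.array([top_score(expr, rng.permutation(labels))   # shuffle column labels only
                     for _ in range(n_perm)])
    return observed, float(np.mean(null >= observed))

# Toy data (hypothetical values): 3 genes, 5 samples.
expr = np.array([[1.0, 2.0, 0.5, 9.0, 8.0],
                 [3.0, 4.0, 0.2, 1.0, 2.0],
                 [5.0, 5.0, 5.0, 5.0, 5.0]])
labels = np.array([1, 1, 1, 2, 2])

observed, p_value = permutation_p_value(expr, labels, n_perm=200)
```

With so few samples the toy p-value is not meaningful; the point is only the mechanics of recomputing the top score under shuffled labels.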

SLIDE 24

Histograms of Simulated TSPs

SLIDE 25

Permutation Analysis Results

Study      Simulated p-value
Survival   .10
Leukemia   [missing in source]
Prostate   [missing in source]
Cardiac    large

(Based on 1,000 permutations)

SLIDE 26

Conclusions from Permutation Analysis

Prostate/Leukemia studies: Clear statistical significance of TSPs
Survival study: Ambiguous
Cardiac study: Insignificant*

*Note: Despite this, there must be informative pairs since otherwise, random voting in the LOOCV would lead to poor classification results.

SLIDE 27

Individual t-Statistics of TSPs

Study       Score   Genbank ID 1   t-stat1   Genbank ID 2   t-stat2
Survival*   .707    M73547         2.82      U39317         3.23
Prostate    .902    M84526         7.46      M55914         4.13
Leukemia    .979    L11373         1.99      X95735         10.92
Leukemia    .979    D86976         1.60      X95735         10.92
Leukemia    .979    J05243         7.87      M23197         6.62

*Neither gene for the TSP in the survival study shows significant differential expression by itself.

SLIDE 28

Conclusions

TSP classifier appears to have many desirable properties:

  • Simple model / easily interpretable
  • Competitive statistical performance
  • Invariant to normalization
  • Generalizes to more complex (but still simple) k-TSP classifier