Gene Expression Analysis Clustering Samples/Condi4ons Find groups - - PDF document

gene expression analysis
SMART_READER_LITE
LIVE PREVIEW

Gene Expression Analysis Clustering Samples/Condi4ons Find groups - - PDF document

4/3/09 CSCI1950Z Computa4onal Methods for Biology Lecture 16 Ben Raphael March 30, 2009 hHp://cs.brown.edu/courses/csci1950z/ Gene Expression Analysis Clustering Samples/Condi4ons Find groups of genes with similar expression


slide-1
SLIDE 1

4/3/09 1

CSCI1950‐Z Computa4onal Methods for Biology Lecture 16

Ben Raphael March 30, 2009

hHp://cs.brown.edu/courses/csci1950‐z/

Gene Expression Analysis

Clustering

  • Find groups of genes with

similar expression profiles. Biclustering

  • Find subsets of genes may

behave similarly under only a subset of condi4ons.

Gene expression Samples/Condi4ons

BMC Genomics 2006, 7:279

slide-2
SLIDE 2

4/3/09 2

Classifica4on

Use gene expression profiles to predict/classify samples into a number of predefined groups.

  • Tissue of origin (heart,

kidney, brain, etc.)

  • Cancer vs. normal
  • Cancer subtypes.

Gene expression Samples/Condi4ons

BMC Genomics 2006, 7:279

Cancer Subtype Classifica4on

Dis4nguish acute lyphoblas4c leukemia (ALL) and acute myeloid leukemia (AML). (Golub et al. Science 1999)

slide-3
SLIDE 3

4/3/09 3

Breast Cancer Prognosis

  • 70 gene signature to

predict breast cancer pa4ents with metastasis within 5 years (van de Vijver et al. 2002, van’t Veer et al. 2002)

  • Now an FDA

approved test: Mammaprint

Classifica4on

Binary classifica2on Given a set of examples (xi, yi), where yi = +‐ 1, from unknown distribu4on D. Design func4on f: Rn  {‐1,+1} that op7mally assigns addi4onal samples xi to

  • ne of two classes.

Supervised learning (xi, yi) training data xi(j): feature. Rn: feature space.

slide-4
SLIDE 4

4/3/09 4

Classifica4on in 2D k‐nearest neighbors

slide-5
SLIDE 5

4/3/09 5

Decision Tree

  • Par44on feature space Rn according to series of decision rules.
  • Internal nodes: Decision rules xi > c
  • Leaves: Label assigned to point.
  • “popular among medical scien4sts, perhaps because it mimics

the way that a doctor thinks.” (Has4e, Tibshirani, Friedman. Elements

  • f Sta4s4cal Learning)

Overfijng

  • Many classifica4on methods can

fit training data with no mistakes.

– k‐NN: k=1 – Decision tree with large depth

  • Classifier will be tuned to

“quirks” in training data. Poor generaliza4on to other data sets.

  • Evaluate classifier on test data,

not training data.

slide-6
SLIDE 6

4/3/09 6

Training and Tes4ng

  • Split data D into training/test sets.
  • Evaluate performance on test set.
  • Available data D olen limited. E.g. few gene

expression samples.

  • Splijng into training/test sets leaves too few

data points.

Cross‐Valida4on

Split D into K equally sized parts D1, …, Dk. For k = 1, …K Train on D’ = D \ Dk. Test on Dk.

Compute performance quan4ty ρk.

Output average of ρk. K‐fold cross‐valida2on. Special case k=n: leave one out cross valida2on.

slide-7
SLIDE 7

4/3/09 7

Feature Selec4on

  • Selec4ng a subset of

features (reduc4on of dimensionality) some4mes leads to beHer classifica4on, performance, etc.

  • Gene expression:

subset of genes informa4ve for classifica4on.

Results: Class Discovery with TNoM (ben‐Dor, Friedman, Yakhini, 2001)

  • Find op4mal labeling L.

– Solu4on: use heuris4c search

  • Find mul4ple (subop4mal) labelings

– Solu4on: Peeling: remove previously used genes from set.

slide-8
SLIDE 8

4/3/09 8

Results: Class Discovery with TNoM

(ben‐Dor, Friedman, Yakhini, 2001)

Leukemia (Golub et al. 1999): 72 expression profiles. 25 AML, 47 ALL. 7129 genes Lymphoma (Alizadeh et al.): 96 expression profiles, 46 Diffuse large B‐cell lymphoma (DLBCL) 50 from 8 different 4ssues. Lymphoma‐DLBCL: subset of 46 of above.

TNoM Results

(ben‐Dor, Friedman, Yakhini, 2001)

% survival 24 pa4ents with low clinical risk. 40 pa4ents years