Capturing Best Practice for Microarray Gene Expression Data Analysis - PowerPoint Presentation



SLIDE 1

Capturing Best Practice for Microarray Gene Expression Data Analysis

Gregory Piatetsky-Shapiro, Tom Khabaza, Sridhar Ramaswamy

Presented briefly by Joey Mudd

SLIDE 2
What is Microarray Data?

  • Microarray devices obtain RNA expression levels from gene samples
  • Data obtained can be used for a variety of medical purposes: diagnosis, predicting treatment outcome, etc.
  • Data produced are typically large and complex, which makes data mining a useful task

SLIDE 3
Standardizing the Data Mining Process

  • CRISP-DM: the CRoss-Industry Standard Process for Data Mining
  • CRISP-DM standardizes the steps taken in a data mining process using a high-level structure and terminology
  • Useful for describing best practice

SLIDE 4
Microarray Data Analysis Issues

  • Typical number of records is small (<100) due to the difficulty of collecting samples
  • Typical number of attributes (genes) is large (many thousands)
  • This can lead to false positives (correlation due to chance) and over-fitting
  • The paper suggests reducing the number of genes examined (feature reduction)
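The false-positive risk described above can be made concrete with a small simulation (illustrative only, not from the paper): with thousands of pure-noise "genes" and only ~50 samples, many genes appear correlated with random class labels by chance alone.

```python
import numpy as np

# Illustrative sketch: no gene here carries any real signal, yet
# many still look correlated with the (random) class labels.
rng = np.random.default_rng(0)
n_samples, n_genes = 50, 5000                 # typical microarray shape
X = rng.normal(size=(n_samples, n_genes))     # noise expression values
y = rng.integers(0, 2, size=n_samples)        # random binary labels

# Pearson correlation of each gene with the meaningless labels
yc = y - y.mean()
Xc = X - X.mean(axis=0)
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
)
n_spurious = int((np.abs(corr) > 0.3).sum())
print(n_spurious, "genes look correlated despite there being no signal")
```

With these dimensions the count of spurious hits is well above zero, which is exactly why feature reduction and careful validation matter.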

SLIDE 5
Data Cleaning and Preparation

  • Thresholding: determine an appropriate range of values (the authors used min 100, max 16,000 for Affymetrix arrays)
  • Normalization: required for clustering (the authors used mean 0, stddev 1)
  • Filtering: remove genes that do not vary enough across samples, e.g. those with MaxValue(G) − MinValue(G) < 500 or MaxValue(G)/MinValue(G) < 5
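The three preprocessing steps can be sketched in a few lines. This is a minimal sketch assuming a samples × genes matrix of raw Affymetrix values; the function name is illustrative, with defaults taken from the thresholds on this slide:

```python
import numpy as np

def preprocess(X, floor=100, ceil=16000, min_diff=500, min_ratio=5):
    """Sketch of the slide's cleaning steps for a samples x genes matrix."""
    # Thresholding: clip raw values into [floor, ceil]
    X = np.clip(X, floor, ceil)

    # Filtering: keep only genes that vary enough across samples
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax - gmin >= min_diff) & (gmax / gmin >= min_ratio)
    X = X[:, keep]

    # Normalization (needed for clustering): per gene, mean 0, stddev 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, keep
```

Filtering before normalization also avoids dividing by a zero standard deviation, since constant genes are dropped by the variation test.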

SLIDE 6

Feature Selection

  • Because of the large number of attributes and small number of samples, feature selection is important
  • Use statistical measures to determine the “best genes” for each class
  • To avoid under-representing some classes, apply a heuristic of selecting an equal number of genes from each class
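The per-class heuristic might be sketched as follows, using a one-vs-rest signal-to-noise score as the statistical measure (the paper also uses T-values; the function name and the epsilon guard are illustrative):

```python
import numpy as np

def select_genes_per_class(X, y, k):
    """Pick the top-k genes for each class by one-vs-rest signal-to-noise:
        S2N(g) = (mean_in_class - mean_outside) / (std_in_class + std_outside)
    Returns the sorted union of selected gene indices."""
    selected = set()
    for c in np.unique(y):
        in_c, out_c = X[y == c], X[y != c]
        s2n = (in_c.mean(axis=0) - out_c.mean(axis=0)) / (
            in_c.std(axis=0) + out_c.std(axis=0) + 1e-12
        )
        # highest scores = genes most over-expressed in class c
        selected.update(np.argsort(s2n)[-k:].tolist())
    return sorted(selected)
```

Taking k genes per class (rather than the global top genes) is what keeps minority classes represented in the final feature set.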

SLIDE 7

Building Classification Models

  • For this data, decision trees work poorly and neural nets work well
  • Feature reduction alone is not sufficient
  • Test models using a varying number of genes from each class
  • Five-fold cross-validation is sufficient; leave-one-out cross-validation is considered most accurate
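The leave-one-out evaluation loop can be sketched as below; a nearest-centroid classifier stands in here for the neural nets the authors actually used, and the loop over gene-subset sizes is shown only as a commented usage pattern:

```python
import numpy as np

def nearest_centroid_predict(Xtr, ytr, x):
    # Stand-in classifier (the paper used neural nets): assign the
    # class whose training-set centroid is closest to x.
    classes = np.unique(ytr)
    dists = [np.linalg.norm(x - Xtr[ytr == c].mean(axis=0)) for c in classes]
    return classes[int(np.argmin(dists))]

def loo_accuracy(X, y):
    """Leave-one-out cross-validation: train on all samples but one,
    test on the held-out sample, and average over all samples."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        hits += nearest_centroid_predict(X[mask], y[mask], X[i]) == y[i]
    return hits / len(y)

# Usage pattern for testing a varying number of genes per class
# (gene ranking assumed done elsewhere):
# for k in (5, 10, 20, 50):
#     print(k, loo_accuracy(X[:, top_genes[:k]], y))
```

Leave-one-out is attractive here precisely because the sample counts are so small: every record gets used for both training and testing.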

SLIDE 8

Case Study 1

  • Leukemia data, 2 classes (AML, ALL), 38 training samples, 34 test samples (a separate test set)
  • Filter to reduce the number of genes, then select the top 100 based on T-values
  • Build neural net models; 10 genes turned out to be the best subset size
  • 97% accuracy (33/34 test records correctly classified)
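The T-value ranking might look like the sketch below, using the standard two-sample t-statistic per gene (the paper's exact variant is not specified here, so this is an assumption):

```python
import numpy as np

def t_values(X, y):
    """Two-sample t-statistic per gene for a binary problem
    (AML vs ALL in the case study); rank genes by |T| afterwards."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) +
                  b.var(axis=0, ddof=1) / len(b))
    return num / den

# Usage: take the 100 genes with the largest absolute T-values
# top = np.argsort(np.abs(t_values(X, y)))[::-1][:100]
```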
SLIDE 9

Case Study 2

  • Brain data, 5 classes, 42 samples (no separate test set)
  • Same preprocessing as Case Study 1
  • Select top genes based on a signal-to-noise measure, selecting an equal number of genes per class
  • Build neural net models; 12 genes per class (60 total) gave the best results
  • Lowest average error rate was 15%
SLIDE 10

Case Study 3

  • Cluster analysis, with the goal of discovering natural classes
  • Leukemia data with 3 classes: ALL split into ALL-T and ALL-B
  • Same preprocessing as before; also normalize values for clustering
  • Used two clustering methods in the Clementine package; both were able to discover the natural classes in the data, to the authors’ satisfaction
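As a stand-in for the Clementine clustering methods (which the slide does not name), a minimal k-means over the normalized expression matrix illustrates the idea:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means sketch: alternate assigning samples to the
    nearest centroid and recomputing centroids.  X should already be
    normalized per gene (mean 0, stddev 1), as the slide requires."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every sample to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Whether the resulting clusters line up with the known classes (ALL-T, ALL-B, AML) is then checked by comparing cluster labels against the true labels.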

SLIDE 11

Conclusions

  • Ideas presented could be applicable to other domains where the balance between attributes and samples is similar (e.g. cheminformatics or drug design)
  • Future work could evaluate cost-sensitive classification, which minimizes errors based on the cost they inflict
  • A principled methodology can lead to good results