Capturing Best Practice for Microarray Gene Expression Data Analysis - PowerPoint Presentation



SLIDE 1

Capturing Best Practice for Microarray Gene Expression Data Analysis

Gregory Piatetsky-Shapiro, Tom Khabaza, Sridhar Ramaswamy

Presented briefly by Joey Mudd

SLIDE 2
What is Microarray Data?

  • Microarray devices obtain RNA expression levels from gene samples
  • Data obtained can be used for a variety of medical purposes: diagnosis, predicting treatment outcome, etc.
  • Data produced are typically large and complex, which makes data mining a useful task

SLIDE 3
Standardizing the Data Mining Process

  • CRISP-DM: the CRoss-Industry Standard Process for Data Mining
  • CRISP-DM standardizes the steps taken in a data mining process using a high-level structure and terminology
  • Useful for describing best practice

SLIDE 4
Microarray Data Analysis Issues

  • Typical number of records is small (<100) due to the difficulty of collecting samples
  • Typical number of attributes (genes) is large (many thousands)
  • This can lead to false positives (correlation due to chance) and over-fitting
  • The paper suggests reducing the number of genes examined (feature reduction)
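The false-positive risk described above can be made concrete with a small simulation (illustrative only, not from the paper): with thousands of pure-noise "genes" and only ~50 samples, many genes appear correlated with random class labels by chance alone.

```python
import numpy as np

# Illustrative sketch: no gene here carries any real signal, yet
# many still look correlated with the (random) class labels.
rng = np.random.default_rng(0)
n_samples, n_genes = 50, 5000                 # typical microarray shape
X = rng.normal(size=(n_samples, n_genes))     # noise expression values
y = rng.integers(0, 2, size=n_samples)        # random binary labels

# Pearson correlation of each gene with the meaningless labels
yc = y - y.mean()
Xc = X - X.mean(axis=0)
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
)
n_spurious = int((np.abs(corr) > 0.3).sum())
print(n_spurious, "genes look correlated despite there being no signal")
```

With these dimensions the count of spurious hits is well above zero, which is exactly why feature reduction and careful validation matter.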

SLIDE 5
Data Cleaning and Preparation

  • Thresholding: determine an appropriate range of values (the authors used min 100, max 16,000 for Affymetrix arrays)
  • Normalization: required for clustering (the authors used mean 0, stddev 1)
  • Filtering: remove genes that do not vary enough across samples, e.g. those with MaxValue(G) − MinValue(G) < 500 or MaxValue(G)/MinValue(G) < 5
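The three preprocessing steps can be sketched in a few lines. This is a minimal sketch assuming a samples × genes matrix of raw Affymetrix values; the function name is illustrative, with defaults taken from the thresholds on this slide:

```python
import numpy as np

def preprocess(X, floor=100, ceil=16000, min_diff=500, min_ratio=5):
    """Sketch of the slide's cleaning steps for a samples x genes matrix."""
    # Thresholding: clip raw values into [floor, ceil]
    X = np.clip(X, floor, ceil)

    # Filtering: keep only genes that vary enough across samples
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax - gmin >= min_diff) & (gmax / gmin >= min_ratio)
    X = X[:, keep]

    # Normalization (needed for clustering): per gene, mean 0, stddev 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, keep
```

Filtering before normalization also avoids dividing by a zero standard deviation, since constant genes are dropped by the variation test.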

SLIDE 6

Feature Selection

  • Because of the large number of attributes and small number of samples, feature selection is important
  • Use statistical measures to determine the “best genes” for each class
  • To avoid under-representing some classes, apply a heuristic of selecting an equal number of genes from each class
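The per-class heuristic might be sketched as follows, using a one-vs-rest signal-to-noise score as the statistical measure (the paper also uses T-values; the function name and the epsilon guard are illustrative):

```python
import numpy as np

def select_genes_per_class(X, y, k):
    """Pick the top-k genes for each class by one-vs-rest signal-to-noise:
        S2N(g) = (mean_in_class - mean_outside) / (std_in_class + std_outside)
    Returns the sorted union of selected gene indices."""
    selected = set()
    for c in np.unique(y):
        in_c, out_c = X[y == c], X[y != c]
        s2n = (in_c.mean(axis=0) - out_c.mean(axis=0)) / (
            in_c.std(axis=0) + out_c.std(axis=0) + 1e-12
        )
        # highest scores = genes most over-expressed in class c
        selected.update(np.argsort(s2n)[-k:].tolist())
    return sorted(selected)
```

Taking k genes per class (rather than the global top genes) is what keeps minority classes represented in the final feature set.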

SLIDE 7

Building Classification Models

  • For this data, decision trees work poorly and neural nets work well
  • Feature reduction alone is not sufficient
  • Test models using a varying number of genes from each class
  • Five-fold cross-validation is sufficient; leave-one-out cross-validation is considered most accurate
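The leave-one-out evaluation loop can be sketched as below; a nearest-centroid classifier stands in here for the neural nets the authors actually used, and the loop over gene-subset sizes is shown only as a commented usage pattern:

```python
import numpy as np

def nearest_centroid_predict(Xtr, ytr, x):
    # Stand-in classifier (the paper used neural nets): assign the
    # class whose training-set centroid is closest to x.
    classes = np.unique(ytr)
    dists = [np.linalg.norm(x - Xtr[ytr == c].mean(axis=0)) for c in classes]
    return classes[int(np.argmin(dists))]

def loo_accuracy(X, y):
    """Leave-one-out cross-validation: train on all samples but one,
    test on the held-out sample, and average over all samples."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        hits += nearest_centroid_predict(X[mask], y[mask], X[i]) == y[i]
    return hits / len(y)

# Usage pattern for testing a varying number of genes per class
# (gene ranking assumed done elsewhere):
# for k in (5, 10, 20, 50):
#     print(k, loo_accuracy(X[:, top_genes[:k]], y))
```

Leave-one-out is attractive here precisely because the sample counts are so small: every record gets used for both training and testing.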

SLIDE 8

Case Study 1

  • Leukemia data, 2 classes (AML, ALL), 38 training samples, 34 test samples (a separate test set)
  • Filter to reduce the number of genes, then select the top 100 based on T-values
  • Build neural net models; 10 genes turned out to be the best subset size
  • 97% accuracy (33/34 test records correctly classified)
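The T-value ranking might look like the sketch below, using the standard two-sample t-statistic per gene (the paper's exact variant is not specified here, so this is an assumption):

```python
import numpy as np

def t_values(X, y):
    """Two-sample t-statistic per gene for a binary problem
    (AML vs ALL in the case study); rank genes by |T| afterwards."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) +
                  b.var(axis=0, ddof=1) / len(b))
    return num / den

# Usage: take the 100 genes with the largest absolute T-values
# top = np.argsort(np.abs(t_values(X, y)))[::-1][:100]
```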
SLIDE 9

Case Study 2

  • Brain data, 5 classes, 42 samples (no separate test set)
  • Same preprocessing as Case Study 1
  • Select top genes based on a signal-to-noise measure, selecting an equal number of genes per class
  • Build neural net models; 12 genes per class (60 total) gave the best results
  • Lowest average error rate was 15%
SLIDE 10

Case Study 3

  • Cluster analysis, with the goal of discovering natural classes
  • Leukemia data with 3 classes: ALL split into ALL-T and ALL-B
  • Same preprocessing as before; also normalize values for clustering
  • Used two clustering methods in the Clementine package; both were able to discover the natural classes in the data, to the authors’ satisfaction
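As a stand-in for the Clementine clustering methods (which the slide does not name), a minimal k-means over the normalized expression matrix illustrates the idea:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means sketch: alternate assigning samples to the
    nearest centroid and recomputing centroids.  X should already be
    normalized per gene (mean 0, stddev 1), as the slide requires."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every sample to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

Whether the resulting clusters line up with the known classes (ALL-T, ALL-B, AML) is then checked by comparing cluster labels against the true labels.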

SLIDE 11

Conclusions

  • Ideas presented could be applicable to other domains where the balance between attributes and samples is similar (e.g. cheminformatics or drug design)
  • Future work could evaluate cost-sensitive classification, which minimizes errors based on the cost they inflict
  • A principled methodology can lead to good results