Tiered Computation (presentation by Raymond Ng)



SLIDE 1

Tiered Computation

Raymond Ng
CIO, Proof Centre
Acting Head, Department of Computer Science, UBC

SLIDE 2

5/4/2011

BiT Biomarker Discovery Strategy

“Omics” Tools and Approaches

BIOMARKER BIOLIBRARY: Blood, Urine, Tissue

METABOLOMICS
  • Serum and Urine
  • U of Alberta Metabolomics Platform, Edmonton, AB
  • NMR & Mass Spec Analysis

TRANSCRIPTOMICS
  • PAXgene Whole Blood
  • Microarray Core Laboratory, Children’s Hospital, LA, CA
  • Affymetrix Microarray Analysis; RNA Extraction

PROTEOMICS
  • Plasma (diagram labels: Nascent Plasma; Depleted Plasma; Bound to Column; Albumin)
  • UVic-Genome BC Proteomics Platform, Victoria, BC
  • ABI 4800 iTRAQ Analysis; Plasma Depletion

QA/QC – All sample collection and processing is done to SOP

SLIDE 3


Importance of Data Cleansing and Pre-processing

  • A. Clinical: “Detecting potential labeling errors in microarrays by data perturbation,” Bioinformatics 2006 (Malossini, Blanzieri)

  • B. mRNA: “MDQC: a new quality assessment method for microarrays based on quality control reports,” Bioinformatics 2007 (Cohen-Freue, Hollander et al.)

  • C. DNA: “Modelling Recurrent DNA Copy Number Alterations in array CGH Data,” Bioinformatics 2006, 2007 (Shah, Murphy, Lam)

SLIDE 4


Microarray Quality Control Assessment Tool

[Figure: three panels titled Sample Quality, Chip Quality, and RNA Quality, with “Sample” on the horizontal axis. Sample labels appearing in the plots: 21-4; 21-4, 17-6, 25-5, 302-7; 317-10, 13-2, 13-3, 13-4, 13-5, 13-6, 19-1, 320-1.]

SLIDE 5

Finding “Needles in a Haystack”


Pre-filtering

54,000 Probe Sets; 2,000 Proteins/Metabolites → < 10,000 Probe Sets; < 100 Proteins/Metabolites

  • I. Remove features with small variation across all samples (rejection or otherwise)

SLIDE 6

54,000 Probe Sets; 2,000 Proteins/Metabolites → < 10,000 Probe Sets; < 100 Proteins/Metabolites → ~ 100-500 Genes/Proteins/Metabolites/Clinical Variables

“Needles in the Haystack” (cont)


Ranking and Filtering

  • II. (Univariate) Rank each individual feature on how well it discriminates the rejection samples from the non-rejection samples

  • III. (Multi-variate) Rank groups of features together on their joint discrimination power, taking correlation into account
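
As a rough illustration of the univariate ranking step, the sketch below scores each feature with Welch's t-statistic between the rejection and non-rejection groups and sorts by absolute score. This is a plain stand-in, not the method used in the talk (the later slides list LIMMA, robust LIMMA and SAM); the function names and the toy data are assumptions made up for this example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_features(data, labels):
    """Return feature names sorted by |t|, most discriminating first.

    data:   dict mapping feature name -> list of values, one per sample
    labels: list of 1 (rejection) / 0 (non-rejection), aligned with the samples
    """
    scores = {}
    for name, values in data.items():
        rejection = [v for v, y in zip(values, labels) if y == 1]
        non_rejection = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(welch_t(rejection, non_rejection))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration: 3 features measured on 6 samples.
labels = [1, 1, 1, 0, 0, 0]
data = {
    "probe_a": [5.1, 5.3, 4.9, 5.0, 5.2, 5.1],  # no separation between groups
    "probe_b": [8.0, 8.4, 7.9, 5.1, 5.0, 5.4],  # strong separation
    "probe_c": [6.0, 6.9, 6.2, 5.6, 6.1, 5.2],  # weak separation
}
ranking = rank_features(data, labels)
print(ranking)
```

A feature that is constant within both groups would give a zero standard error here, which is one reason a variability pre-filter is usually applied before ranking.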

SLIDE 7

54,000 Probe Sets; 2,000 Proteins/Metabolites → < 10,000 Probe Sets; < 100 Proteins/Metabolites → ~ 100-500 Genes/Proteins/Metabolites/Clinical Variables → BIOMARKER PANEL → INTERNALLY VALIDATED BIOMARKER PANEL

“Needles in the Haystack” (cont)


Panel Selection, Model Building

  • IV. Select features to be included in the panel, possibly assigning different weights to different features
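
One simple way to picture a weighted panel is a linear score: each selected feature contributes its value times a weight, and the total is compared with a cut-off. The sketch below assumes that form; the feature names, weights, intercept and cut-off are all hypothetical and are not taken from the talk.

```python
def panel_score(sample, weights, intercept=0.0):
    """Weighted sum of the panel features for one sample.

    sample:  dict mapping feature name -> measured value
    weights: dict mapping panel feature name -> weight; features that are
             not in the panel are simply ignored
    """
    return intercept + sum(w * sample[name] for name, w in weights.items())


def classify(sample, weights, intercept=0.0, cutoff=0.0):
    """Label a sample 'rejection' when its panel score exceeds the cut-off."""
    score = panel_score(sample, weights, intercept)
    return "rejection" if score > cutoff else "non-rejection"


# Hypothetical two-feature panel with unequal weights.
weights = {"gene_x": 0.5, "protein_y": -1.0}
sample = {"gene_x": 2.0, "protein_y": 1.0, "gene_z": 9.0}  # gene_z not in panel
score = panel_score(sample, weights, intercept=0.25)
print(score, classify(sample, weights, intercept=0.25))
```

In practice the weights would come from the fitted classifier rather than being set by hand.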

SLIDE 8

Rich Space for Choices


Pre-filtering (remove probe-sets with low variability)
  1) k samples above absolute threshold
  2) First half using inter-quartile range
  3) First half using empirical central mass range
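
The first two pre-filtering options can be sketched as follows. The reading of "first half" as "the half of the probe-sets with the largest inter-quartile range" is an assumption, as are the function names and the toy expression values; option 3 is left out because the slide does not define the empirical central mass range.

```python
from statistics import quantiles


def filter_k_above(data, threshold, k):
    """Option 1: keep features with at least k samples above an absolute threshold."""
    return [name for name, values in data.items()
            if sum(v > threshold for v in values) >= k]


def iqr(values):
    """Inter-quartile range (Q3 - Q1) of one feature."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def filter_top_half_by_iqr(data):
    """Option 2 (as interpreted here): keep the half of the features with the largest IQR."""
    ranked = sorted(data, key=lambda name: iqr(data[name]), reverse=True)
    return ranked[: len(ranked) // 2]


# Toy expression values, invented for illustration: 4 probe-sets, 6 samples.
data = {
    "probe_a": [2.0, 2.1, 2.0, 2.1, 2.0, 2.1],  # flat, low signal
    "probe_b": [3.0, 9.0, 4.0, 8.5, 3.5, 9.5],  # variable
    "probe_c": [7.0, 7.1, 7.0, 7.2, 7.1, 7.0],  # flat, high signal
    "probe_d": [1.0, 6.5, 2.0, 7.0, 1.5, 6.0],  # variable
}
kept_by_threshold = filter_k_above(data, threshold=6.0, k=3)
kept_by_iqr = filter_top_half_by_iqr(data)
print(kept_by_threshold, kept_by_iqr)
```

The two options answer different questions: the threshold rule keeps probe-sets that are expressed at all (including flat but bright ones such as `probe_c`), while the IQR rule keeps probe-sets that vary across samples.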

Uni-variate ranking (FDR-based; per probe-set)
  1) Maximum of LIMMA, robust LIMMA and SAM
  2) LIMMA
  3) Robust LIMMA

Uni-variate filtering (per probe-set)
  1) FDR cut-off (FDR<0.01)
  2) Size cut-off: Top 50 probe-sets
  3) Combination rule: FDR<0.05 but at least 50 and at most 500 probe-sets
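
The combination rule can be read as "take everything with FDR below 0.05, then clamp the count to the range 50 to 500". The sketch below assumes that reading and uses Benjamini-Hochberg adjusted p-values as the FDR estimate; the talk obtains its FDR values from LIMMA or SAM, so the adjustment step and the function names here are illustrative assumptions.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def combination_rule(fdr, cutoff=0.05, min_size=50, max_size=500):
    """Indices of selected probe-sets, best FDR first.

    Keeps everything with FDR < cutoff, but never fewer than min_size
    and never more than max_size (or all features if fewer are available).
    """
    ranked = sorted(range(len(fdr)), key=lambda i: fdr[i])
    n_pass = sum(q < cutoff for q in fdr)
    n_keep = max(min_size, min(n_pass, max_size))
    return ranked[:n_keep]


# Toy p-values, invented for illustration; size limits shrunk to fit 5 features.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = bh_adjust(p_values)
selected = combination_rule(fdr, cutoff=0.05, min_size=1, max_size=3)
padded = combination_rule(fdr, cutoff=0.05, min_size=3, max_size=4)
print(fdr, selected, padded)
```

In the toy run only two features pass the FDR cut-off, so the second call pads the selection with the next-best feature to reach the minimum size.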

Multi-variate ranking (optional)
  1) Stepwise Discriminant Analysis
  2) SVM-based ranking (one step)
  3) Recursive Feature Elimination (multi-step)
  4) Elastic Net-based (coefficients)

Multi-variate filtering (optional)
  1) Significance of improvement cut-off
  2) Top 50 (as returned by multi-variate ranking)
  3) Non-zero coefficients (Elastic Net)

Classifier Generation
  1) Linear Discriminant Analysis
  2) Support Vector Machine
  3) Random Forest
  4) Elastic Net
  5) Logistic regression