The Analysis of Biomedical Data – Caveats and Challenges (PowerPoint PPT Presentation)



SLIDE 1

The Analysis of Biomedical Data - Caveats and Challenges

Ray L. Somorjai

Head, Biomedical Informatics Group

Institute for Biodiagnostics National Research Council Canada Winnipeg, MB Canada

SLIDE 2

The Prime Caveat:

“There Are No Panaceas in Data Analysis”

  • P. J. Huber, Annals of Statistics (1985)

SLIDE 3

Two Goals of Biomedical Data Classification:

  • 1. Develop Robust Classifiers
  • Capable of Reliably Classifying Unknown Patterns
  • 2. Identify Fewest Maximally Discriminatory Features

(genes, proteins, chemical compounds)

  • Find Biologically Relevant, Interpretable Features

Not All Classifiers Satisfy Both Requirements

SLIDE 4

The Two Realities of Biomedical Data

{Microarrays (Genomics), Mass Spectra (Proteomics) Magnetic Resonance, Raman & Infrared Spectra}:

The Clinical Reality:

Few Samples, K = O(10) – O(100)

The “Acquisitional” Reality:

Many Features (genes, M/Z values, spectral data points), N = O(1 000) – O(10 000)

SLIDE 5

Contrast

Classical Statistics –

The Art of Asymptotics: N → ∞

with

Modern “Statistics“ –

Methods Applicable when N → 0 ?

SLIDE 6

Two Realities ⇒ Two Curses:

The Curse of Dimensionality:

Penalty for Too Many Features

The Curse of Dataset Sparsity:

Penalty for Too Few Samples

SLIDE 7

The Curse of Dimensionality


Penalty for Too Many Features:

A Robust Classifier Needs a Sample to Feature Ratio (SFR) ≥ 10

For Biomedical Data SFR ~ 1/20 – 1/200

SLIDE 8

The Curse of Dataset Sparsity:

If Too Few Samples, Trivial to Classify Them Perfectly

More Samples, More Realistic Assessment of Intrinsic Class Overlap (Bayes Error)
SLIDE 9

Consequences of the Curses:

  • 1. Curse of Dimensionality (SFR low)
  • Danger of Overfitting
  • Conclusions Are Suspect
  • No Discriminatory Features Identified
  • 2. Curse of Dataset Sparsity

Insidious:

  • Practically Anything Seems to Work!
  • Several Equally Good Solutions:

Uniqueness Problematic - Classifier Robustness Is Suspect

SLIDE 10

Steps of Classifier Development:

  • 1. Partition Dataset into Training & Validation Sets
  • 2. Create Optimal Classifier Using Training Set Only
  • Important to Use External Crossvalidation
  • 3. Whenever Possible or Feasible, Validate Classifier with an Independent Validation Set Not Involved in Developing the Classifier
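These three steps can be sketched in a few lines of numpy. Everything below is illustrative: the toy Gaussian data, the nearest-centroid stand-in classifier, and the 5-fold split are assumptions, not the talk's actual LDA-based pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data (illustrative only, not the lecture's datasets)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(1.5, 1, (40, 5))])
y = np.array([0] * 40 + [1] * 40)

# Step 1: partition the dataset into training and validation sets
idx = rng.permutation(len(y))
train, valid = idx[:60], idx[60:]

def nearest_centroid_fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[d.argmin(axis=1)]

# Step 2: build/tune the classifier with external cross-validation,
# touching the TRAINING set only
folds = np.array_split(rng.permutation(train), 5)
cv_acc = []
for k in range(5):
    test_f = folds[k]
    train_f = np.concatenate([folds[j] for j in range(5) if j != k])
    m = nearest_centroid_fit(X[train_f], y[train_f])
    cv_acc.append((nearest_centroid_predict(m, X[test_f]) == y[test_f]).mean())

# Step 3: validate the final classifier on the untouched validation set
final = nearest_centroid_fit(X[train], y[train])
val_acc = (nearest_centroid_predict(final, X[valid]) == y[valid]).mean()
print(round(float(np.mean(cv_acc)), 2), round(float(val_acc), 2))
```

The key point of step 2 is that every cross-validation fold is carved out of the training set alone; the validation samples never influence classifier construction.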

SLIDE 11

A Classifier is Claimed Robust if

Training and Validation Set Results Are of

Comparable Accuracy

Fallacious when Curses Are “Active”!

SLIDE 12

Developed Statistical Classification Strategy (SCS) – Divide and Conquer:

Four-Stage, Multivariate, Robust

  • 1. Visualization of High-Dimensional Data
  • 2. Preprocessing/Feature Extraction (GA_ORS)
  • 3. Robust Classifier (“Bootstrap” Aggregation)
  • 4. Classifier Fusion (e.g. Stacked Generalization)

Very Successful!
SLIDE 13

Stage 1 – Visualization (later)
Stage 2 – Preprocessing

a) Normalization (alignment, common area)
b) Transformation (derivatives, rank ordering)
c) “Feature Space Reduction”:

Critical ⇒ Optimal Feature Selector
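The Stage 2 operations a) and b) can be illustrated in a few lines of numpy on synthetic "spectra" (the alignment step is omitted; the data and window sizes are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
spectra = rng.random((10, 200))  # 10 synthetic "spectra", 200 data points each

# a) Normalization: scale each spectrum to unit total area
area_norm = spectra / spectra.sum(axis=1, keepdims=True)

# b) Transformation, option 1: first derivative (finite differences)
deriv = np.diff(area_norm, axis=1)

# b) Transformation, option 2: rank ordering
# (replace intensities by their within-spectrum ranks; robust to scaling)
ranks = area_norm.argsort(axis=1).argsort(axis=1)

print(area_norm.shape, deriv.shape, ranks.shape)
```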

SLIDE 14

For Biomedical Spectra

Optimal Feature Selector ⇒ Optimal Region Selector (ORS_GA)

Characteristics of ORS_GA:

a) Retains Spectral Identity
b) Feature is Some Function of Adjacent Data Points (e.g. Average or Variance)
c) Genetic Algorithm (GA)-Driven

⇒ M < K << N Attributes
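A sketch of the region-averaging idea behind ORS: each feature is a function (here the mean) of adjacent data points in a selected spectral window. The windows below are hypothetical stand-ins for what a GA search might return; the GA itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
spectra = rng.random((5, 1000))  # N = 1000 data points per spectrum

# Hypothetical regions a GA might have selected: (start, end) index windows
regions = [(100, 120), (340, 360), (700, 750)]

def region_features(S, regions, func=np.mean):
    """Each feature is some function (here: the mean) of adjacent data points."""
    return np.stack([func(S[:, a:b], axis=1) for a, b in regions], axis=1)

feats = region_features(spectra, regions)  # M = 3 features << N = 1000 points
print(feats.shape)
```

Because each region spans contiguous data points, the reduced features keep their spectral identity: feature j corresponds to a named wavelength/M-Z window, not an abstract linear combination.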

SLIDE 15

Stage 3- Robust Classifier Development

How Do We “Robustify”?

  • 1. Already Completed: Feature Selection [ORS] to Satisfy Sample/Feature Ratio K / N ~ 5 – 10

  • 2. “Bootstrap-Inspired Classifier Aggregation”:

SLIDE 16

How Do We Create a Robust Classifier?

a) Training set: Pick randomly ~half of the samples
b) Using these, create optimum classifier (e.g., LDA/LOO)
c) Test set: the other half is used for validation
d) Repeat a) – c) B times (restarting with the full dataset), B ~ 5000 – 10000 (B sets of LDA coefficients)
e) Create a single classifier as the Q_test-weighted average of these B sets

Q_m^test = (C_m^test)^(1/2) · κ_m^test ;  0 ≤ κ_m, C_m ≤ 1

(κ_m^test is the chance-corrected accuracy, C_m^test the crispness of the m-th of the B test sets)

Classifier Outcome is a Class Probability
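A compressed numpy sketch of steps a)–e): a minimal Fisher-LDA implementation, random half-splits, and an aggregate weighted by test-half accuracy. The accuracy weight is a simplified stand-in for the talk's quality Q (which combines chance-corrected accuracy κ and crispness C); the data are toy Gaussians and B is reduced for speed.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(2, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)

def fisher_lda(X, y):
    """Fisher discriminant direction w and bias b for two classes."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
    b = -w @ (m0 + m1) / 2
    return w, b

B = 200  # the talk uses B ~ 5000-10000; smaller here for speed
W = np.zeros(X.shape[1]); bias = 0.0; Qsum = 0.0
for _ in range(B):
    idx = rng.permutation(len(y))
    tr, te = idx[:30], idx[30:]          # a) random ~half for training
    w, b = fisher_lda(X[tr], y[tr])      # b) build the classifier on it
    acc = (((X[te] @ w + b) > 0).astype(int) == y[te]).mean()  # c) validate
    q = float(acc)                       # simplified quality weight
    W += q * w; bias += q * b; Qsum += q

W /= Qsum; bias /= Qsum                  # e) single Q-weighted aggregate
agg_acc = (((X @ W + bias) > 0).astype(int) == y).mean()
print(round(float(agg_acc), 2))
```

Averaging many half-split classifiers damps the variance that any single split would suffer in the sparse-sample regime.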

SLIDE 17

Stage 4 - Classifier Aggregation / Fusion

Activated when the best single, N-attribute, C-class classifier is inaccurate and/or fuzzy:
a) Create L “independent” classifiers;
b) Treat their C-class probability outputs as L(C−1) features for a new classifier to be trained

Stacked Generalizer (Wolpert)
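A toy sketch of stacked generalization with L = 2 base classifiers: crude distance-based class-probability outputs on disjoint feature subsets become the L(C−1) = 2 meta-features of a simple second-level classifier. All data, feature subsets, and classifiers here are illustrative stand-ins, not the talk's actual components.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 6)), rng.normal(1.5, 1, (40, 6))])
y = np.array([0] * 40 + [1] * 40)

def base_prob(Xtr, ytr, Xte, cols):
    """Crude base classifier on a feature subset: class-1 probability
    from distances to the two class centroids (softmax of -distance)."""
    m0 = Xtr[ytr == 0][:, cols].mean(axis=0)
    m1 = Xtr[ytr == 1][:, cols].mean(axis=0)
    d0 = np.linalg.norm(Xte[:, cols] - m0, axis=1)
    d1 = np.linalg.norm(Xte[:, cols] - m1, axis=1)
    return np.exp(-d1) / (np.exp(-d0) + np.exp(-d1))

idx = rng.permutation(len(y))
tr, te = idx[:50], idx[50:]
subsets = [[0, 1, 2], [3, 4, 5]]   # L = 2 "independent" base classifiers

# Their probability outputs become the meta-features of a new classifier
meta_tr = np.stack([base_prob(X[tr], y[tr], X[tr], s) for s in subsets], axis=1)
meta_te = np.stack([base_prob(X[tr], y[tr], X[te], s) for s in subsets], axis=1)

# Second-level classifier trained on the meta-features (nearest mean)
mu0 = meta_tr[y[tr] == 0].mean(axis=0)
mu1 = meta_tr[y[tr] == 1].mean(axis=0)
pred = (np.linalg.norm(meta_te - mu1, axis=1)
        < np.linalg.norm(meta_te - mu0, axis=1)).astype(int)
acc = (pred == y[te]).mean()
print(round(float(acc), 2))
```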

DATA → Classifier 1, Classifier 2, …, Classifier L → Aggregate Classifier
SLIDE 18

Further Considerations / Problems

  • Tarnished Gold Standard
  • The “Reject” Class (Multimorbidity) Problem
  • The Multi-Class (K > 2) Problem
  • Nonlinear Mapping of Feature Space
  • Regression instead of Classification

(for diseases with steady progression)

SLIDE 19

Tarnished Gold Standard

  • Dangerous Assumption of Error-Free Class Labels
  • Incorrect Class Labels ⇒ Unreliable Classifiers
  • Accurate Class Labels ⇒ Robust Classifiers

Solutions?

  • Fuzzy Labels
  • Regression (2-Class Case)
  • Unsupervised Pattern Recognition (e.g., Clustering)?
SLIDE 20

What is a “Reject” Class?

The Stage:

  • K-Class Classifier System
  • Classify Unknown Sample SNew

Important Concepts:

  • Multivariate Outliers
  • “Open” vs. “Closed” K-Class Systems
  • Ambiguity vs. Distance “Rejects”

Example: Normal vs. Diabetes; Test: Arthritis

SLIDE 21

Classification Systems

OPEN system: the unknown sample SNew may fall outside every class C1 … CK and be assigned to the Reject Class
CLOSED system: SNew is forced into one of the classes C1 … CK
SLIDE 22

Multivariate Outliers:

Distance “Reject”
Ambiguity “Reject”
SLIDE 23

A Solution to the K-Class Problem:

1) Develop K(K−1)/2 Pair Classifiers Cmn for Classes Cm and Cn, m < n = 1,…,K
2) Combine Outcome Probabilities pmn(x) of Pair Classifiers Cmn for Sample x:

pm(x) = [1 + ∑n≠m (1/pmn(x) − 1)]⁻¹

Generalizable: Can Include Quality Qm of Cm
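The combination formula translates directly into code. The 3-class pairwise probabilities below are made-up inputs for illustration:

```python
import numpy as np

def combine_pairwise(p_pair, K):
    """Combine pairwise outcome probabilities into per-class probabilities.

    p_pair[(m, n)] (with m < n) is the probability that sample x belongs
    to class m rather than class n; for n < m use p_mn = 1 - p_nm.
    Implements p_m = [1 + sum_{n != m} (1/p_mn - 1)]^(-1).
    """
    def pmn(m, n):
        return p_pair[(m, n)] if m < n else 1.0 - p_pair[(n, m)]
    return np.array([1.0 / (1.0 + sum(1.0 / pmn(m, n) - 1.0
                                      for n in range(K) if n != m))
                     for m in range(K)])

# Toy example: class 0 wins both of its pairwise contests
p_pair = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.6}
p = combine_pairwise(p_pair, 3)
print(p.round(3), int(p.argmax()))
```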

SLIDE 24

Example of Combining Pair-Classifiers: “Simpler is Better”

6-Class, 12600 Gene Microarray Dataset for Acute Leukemia (ALL):

(Yeoh et al. 2002; Li et al. Bioinformatics 2003)

Class            # of Samples
1: t-all              43
2: e2a-pbx1           27
3: tel-aml1           79
4: bcr-abl            15
5: mll                20
6: hyperdip>50        64

SLIDE 25

Best 15 Pair-Classifiers (LDA-LOO):

Pair       Discriminatory Gene(s)
1 vs. 2    7715
1 vs. 3    9101
1 vs. 4    8763
1 vs. 5    3767, 4428
1 vs. 6    8273, 12436
2 vs. 3    7715
2 vs. 4    7715
2 vs. 5    7715
2 vs. 6    7715
3 vs. 4    5478, 8709
3 vs. 5    3391, 5478
3 vs. 6    2610, 4339, 6557
4 vs. 5    1137
4 vs. 6*   6863, 7521, 11072, 12157
5 vs. 6    7188, 11463

* 1 of 79 samples misclassified

SLIDE 26

Combining the Output Probabilities of 15 Pair Classifiers:

Confusion Matrix:

      1    2    3    4    5    6
1    43
2     0   27
3     0        79
4     0             14    1
5     0                  20
6     0                       64

SLIDE 27

An Alternate Solution to the K-Class Problem

(For K Large)

1) Develop K Pair Classifiers Cm,{n≠m} for Classes Cm and C{n≠m}, m = 1,…,K
2) Combine Pair Classifier Outcome Probabilities pm,{n≠m}(x) for Sample x

Advantage: K << K(K−1)/2 for Large K
Disadvantage: Unbalanced Classes

SLIDE 28

Nonlinear Mapping of Feature Space

E.g., Original Features: x1, x2
Best Hyperplane by LDA: a Line – Cannot Separate the Classes
Best Separating Hypersurface: a Circle

Nonlinear Mapping: y1 = x1², y2 = x2²
Transformed Features: y1, y2

In the Transformed Space, LDA Separates the Classes
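The circle example can be reproduced numerically: on the raw coordinates a Fisher-LDA line fails, while after the squaring map a hyperplane separates the classes. The data (an inner disk vs. an outer ring) and the minimal LDA implementation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Class 1: points inside a circle of radius 1; Class 2: ring of radius 2-3
r1 = rng.uniform(0, 1, 100); r2 = rng.uniform(2, 3, 100)
t = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([np.r_[r1, r2] * np.cos(t), np.r_[r1, r2] * np.sin(t)])
y = np.array([0] * 100 + [1] * 100)

def lda_accuracy(X, y):
    """Training accuracy of a minimal Fisher LDA with midpoint threshold."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
    b = -w @ (m0 + m1) / 2
    return (((X @ w + b) > 0).astype(int) == y).mean()

acc_original = lda_accuracy(X, y)   # a line cannot separate disk from ring
Y = X ** 2                          # nonlinear map: y1 = x1^2, y2 = x2^2
acc_mapped = lda_accuracy(Y, y)     # now a hyperplane works
print(round(float(acc_original), 2), round(float(acc_mapped), 2))
```

In the squared coordinates the circle x1² + x2² = const becomes the line y1 + y2 = const, which is exactly what LDA can find.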

SLIDE 29

Classification vs. Regression

(Same for Linear 2-Class Problems)

Classification vs. Regression figure: Class 1 samples are assigned target +1, Class 2 samples target −1; regression fits a continuous output to the same points

Advantages of Regression:

  • 1. Better for Continuous Transition
  • 2. “Robustifiable” – Outlier Detection
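For the linear 2-class case, the equivalence can be seen by regressing onto ±1 targets and thresholding the fitted output at zero. A least-squares sketch on toy data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(2, 1, (30, 3))])
t = np.array([+1.0] * 30 + [-1.0] * 30)   # regression targets: +1 / -1

# Linear least-squares regression on the +/-1 targets
A = np.column_stack([X, np.ones(len(t))])  # add an intercept column
coef, *_ = np.linalg.lstsq(A, t, rcond=None)

# Thresholding the regression output at 0 yields a linear classifier
pred = np.where(A @ coef > 0, 1.0, -1.0)
acc = (pred == t).mean()
print(round(float(acc), 2))
```

Replacing the ±1 targets with graded values (e.g. disease stage) gives the continuous-transition variant, and robust regression losses allow the outlier detection mentioned above.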

SLIDE 30

Some Successes of the SCS:

  • Brain, Prostate, Ovarian, Breast, Thyroid Cancer (MR; T)
  • Colon Cancer (MR, IR; T, F)
  • Alzheimer’s (IR; T)
  • Arthritis (Various forms), Diabetes I & II (IR; F)
  • Breast tumour grade & steroid receptor status (IR; T)
  • Kidney allograft rejection (MR, IR; F)
  • Barrett’s (in vivo Raman)
  • Proteomics: Bladder, Colon, Prostate, Ovarian Cancer (MS, F)
  • cDNA Microarray data (T): Public Domain Data; Scrapie

T = Tissue F = Fluid MR = Magnetic Resonance IR = Infrared MS = Mass Spectroscopy

SLIDE 31

Consequences of Dataset Sparsity

Example: Microarray Expression Profiles

for

Small Round Blue-Cell Tumours

(SRBCT)

SLIDE 32

4-Class SRBCT:

2308 Features: Expression Levels

                 EWS   BL   NB   RMS
Training set:     23    8   12    20
Validation set:    6    3    6     5

Developed 6 Pair Classifiers (LDA with LOO-CV) Exhaustive Search for Best Gene Pairs

SLIDE 33

Pair Classifiers   Number of Gene Pairs
EWS vs. BL         102
EWS vs. NB          14
EWS vs. RMS         36
BL vs. NB          163
BL vs. RMS         199*
NB vs. RMS          23

Results:

Small Round Blue-Cell Tumours (SRBCT): Perfect Classification of both Training & Validation Sets

*Single genes: 509 and 1932

Common Fallacy: If Both Training and Validation Sets Classify Accurately, Then the Classifier is Robust!

Curse of Dataset Sparsity

Biological Significance of Individual Results Doubtful

SLIDE 34

Examples

Proteomics:

SELDI Mass Spectra

SLIDE 35

The Datasets*:

1: Healthy vs. Ovarian Cancer (“6-19-02”)
2: Healthy vs. Prostate Cancer (“JNCI 7-3-02”)

15154 Features: M/Z values

Five Random Partitions, D1 – D5, of Datasets into:

Ovarian:   Training set: 122 (61 + 61)   Validation set: 116 (30 + 101)
Prostate:  Training set: 84 (42 + 42)    Validation set: 48 (21 + 27)

*http://clinicalproteomics.steem.com

SLIDE 36

Applying the SCS:

Feature Selection:

Exhaustive Search Not Feasible (Even for K = 2) 1 Million Random Sets of K ≥ 2 Features from 15154

“Wrapper” Classifier:

Linear Discriminant Analysis (LDA)

(with Leave-One-Out Crossvalidation)
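A scaled-down sketch of this random-subset wrapper search: 300 random feature pairs instead of 1 million sets, a nearest-centroid stand-in instead of LDA, and synthetic data with one weakly planted feature. The leave-one-out evaluation loop is the part that mirrors the talk's procedure.

```python
import numpy as np

rng = np.random.default_rng(7)
n, N = 40, 500                      # 40 samples, 500 features (sparse regime)
X = rng.normal(0, 1, (n, N))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 7] += 2.0                 # plant one weakly informative feature

def loo_accuracy(F, y):
    """Leave-one-out accuracy of a nearest-centroid 'wrapper' classifier."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        m0 = F[mask & (y == 0)].mean(0)
        m1 = F[mask & (y == 1)].mean(0)
        pred = int(np.linalg.norm(F[i] - m1) < np.linalg.norm(F[i] - m0))
        hits += pred == y[i]
    return hits / len(y)

best = (0.0, None)
for _ in range(300):                # the talk samples 1 million random sets
    cols = rng.choice(N, size=2, replace=False)
    acc = loo_accuracy(X[:, cols], y)
    if acc > best[0]:
        best = (acc, tuple(int(c) for c in cols))
print(best)
```

Even on mostly random features, the best of many random subsets scores well by chance, which is exactly the sparsity warning the following slides quantify.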

SLIDE 37

Results:

Healthy vs. Ovarian Cancer

Number of Solutions with 0 Errors for both Training & Validation Sets (D1 – D5)

# of Features   D1    D2   D3   D4   D5   All
2                3     1    1    –    –    –
3               27     1   16    1    –   18
4              142    13   79    3    1  108

Curse of Dataset Sparsity:

Biological Significance of Individual Results Doubtful –

Independent Biological Validation is Imperative!

SLIDE 38

Results:

Healthy vs. Prostate Cancer

Average of D1 – D5:

# of Features   Training (TS)   Validation (VS)
3               98.1%           92.9%
5               99.1%           94.6%
6               99.8%           95.0%

Multiple Solutions of Identical Accuracy!

E.g. D5 (5 Features): Two Sets with 100% (TS & VS)

SLIDE 39

Generic Problem:

Visualization & Display of High-Dimensional Data:

“A Picture is Worth a 1000 Words”

Stage 1 of the SCS

SLIDE 40

Previous Approaches:

Mapping from N Dimensions to 2-3, by Minimizing an Objective Function to Approximately Preserve All Inter-Pattern Distances:

Sammon’s Mapping
Niemann’s Mapping
Multidimensional Scaling
Kohonen’s Self-Organizing Map
Projection Pursuit

All Require Nonlinear Optimization

SLIDE 41

Our Approach: Mapping to the Relative Distance Plane (RDP)
SLIDE 42

The RDP Mapping Procedure:

Given: K N-Dimensional Patterns

  • 1. Select a Distance (Similarity) Measure
  • 2. Calculate the K X K Distance Matrix
  • 3. Select Any Two of K Patterns as

Reference Patterns R1 and R2

  • 4. Display the Positions of the Remaining K−2 Patterns Xm Relative to R1 and R2
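The four steps translate into a short numpy routine. With R1 placed at the origin and R2 at (D12, 0), each pattern's RDP coordinates follow from its two reference distances alone; this is a sketch assuming Euclidean distance and random toy patterns.

```python
import numpy as np

rng = np.random.default_rng(8)
K, N = 12, 50
P = rng.normal(0, 1, (K, N))        # K patterns in N dimensions

# Steps 1-2: choose a distance measure and compute the K x K distance matrix
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)

# Step 3: pick any two patterns as reference patterns R1 and R2
r1, r2 = 0, 1
d12 = D[r1, r2]

# Step 4: place each remaining pattern by its exact distances to R1 and R2
# (R1 sits at (0, 0), R2 at (d12, 0) in the relative distance plane)
coords = {}
for m in range(K):
    if m in (r1, r2):
        continue
    x = (D[r1, m] ** 2 + d12 ** 2 - D[r2, m] ** 2) / (2 * d12)
    ysq = D[r1, m] ** 2 - x ** 2    # non-negative for real patterns
    coords[m] = (x, np.sqrt(max(ysq, 0.0)))

# The mapping preserves each pattern's distances to both references exactly
m = 5
x, yv = coords[m]
print(round(float(np.hypot(x, yv)), 4), round(float(D[r1, m]), 4))
```

Cycling through different reference pairs (r1, r2) gives the "multiple viewpoints" of the data mentioned on a later slide, with no optimization needed.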

SLIDE 43

Mapping to the Relative Distance Plane (RDP)

[Figure: schematic of the RDP mapping — reference patterns R1 and R2 separated by D12; a new sample Xm placed by its distances D1m and D2m.]

slide-44
SLIDE 44

2-Class RDP Mapping with Training (TS) & Validation (VS) Sets

TS (red & blue disks), VS (yellow & turquoise triangles); Full Display and Zoomed Display

[Figure: full and zoomed RDP displays — R1, R2, D12, D1m, D2m, Xm; train and test margins; optimal LDA “hyperplane”.]

slide-45
SLIDE 45

RDP Mapping Preserves Exactly the Relative Distances of Any N-Dimensional Pattern P to Any Two Reference Patterns, R1, R2

RDP Mapping Provides Multiple Viewpoints of the Data: (“Discretized” Projection Pursuit)

Requires No Optimization!


slide-46
SLIDE 46

Fisher Iris

RDP Map (Mahalanobis), From 4 Dimensions; 2 vs. 3 (63 – 74)


slide-47
SLIDE 47

Generalization to 3D and Higher: Three Reference Points; Best Separating Plane (Still Exact!)

slide-48
SLIDE 48

Fisher Iris

RDP Map, from 4 Dimensions


slide-49
SLIDE 49

Distance Measures Implemented:

  • 1. L2 (Euclidean): D_ij = {Σ_k=1..N (x_ik − x_jk)²}^(1/2)
  • 2. L1 (City block): D_ij = Σ_k=1..N |x_ik − x_jk|
  • 3. L∞ (Max norm): D_ij = max_k |x_ik − x_jk|
  • 4. Anderson–Bahadur (AB) Distance:

D_ij² = (x_i − µ1)ᵀ Sα⁻¹ (x_j − µ2);  Sα = α·S1 + (1 − α)·S2;  0 ≤ α ≤ 1

(α controls the amount of mixing between S1 & S2; α = 0.5 gives the Mahalanobis Distance)

Generalized Discriminant Function: f_AB(x | µ1, µ2; α; β) = [x − (βµ1 + (1 − β)µ2)]ᵀ Sα⁻¹ (µ1 − µ2);  0 ≤ α, β ≤ 1  (β controls the position of the “hyperplane”)
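The four measures can be sketched in NumPy as follows (our function names; for the AB entry we use the symmetric pattern-difference form (x_i − x_j)ᵀ Sα⁻¹ (x_i − x_j), which matches the slide’s α = 0.5 Mahalanobis remark, rather than the slide’s exact expression):

```python
# Sketches of the distance measures above; all names are our own.
import numpy as np

def d_L2(xi, xj):                      # 1. Euclidean
    return float(np.sqrt(np.sum((xi - xj) ** 2)))

def d_L1(xi, xj):                      # 2. City block
    return float(np.sum(np.abs(xi - xj)))

def d_Linf(xi, xj):                    # 3. Max norm
    return float(np.max(np.abs(xi - xj)))

def d_AB(xi, xj, S1, S2, alpha=0.5):   # 4. AB-style, with Sα = α·S1 + (1−α)·S2
    Sa = alpha * S1 + (1 - alpha) * S2
    diff = xi - xj
    return float(np.sqrt(diff @ np.linalg.solve(Sa, diff)))
```

With S1 = S2 = I, `d_AB` reduces to the Euclidean distance, as expected from the Mahalanobis special case.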


slide-50
SLIDE 50

RDP Mapping of High-Dimensional Data: Examples of Applications

slide-51
SLIDE 51

Microarray Expression Profiles (Genomics) RDP Mapping with Three Classes

Visualizing the Distribution of a 3rd Class Relative to the Two Classes of Interest


slide-52
SLIDE 52

Visual Assessment of Dataset Sparsity

slide-53
SLIDE 53

CNS Tumors:

RDP Maps (L2) from 7129 Dimensions

A vs. B: 1505 / 1505 (4,72);  A vs. C: 233 / 2623 (8,78);  B vs. C: 607 / 2135 (14,38)


slide-54
SLIDE 54

Assessing the Relevance of Feature Set Dimensionality Reduction

(RDP Results Likely Provide Worst Case Classification Scenario)


slide-55
SLIDE 55

Ovarian Cancer

23-103 (D = 15154; 7,15,80 red);  66-152 (D = 2; 1120/14996; “Outliers”: 146, 188)


slide-56
SLIDE 56

Detection of Outliers

Proteomics Mass Spectra Ovarian Cancer

Confirmation Using Different Distance Measures and/or Different Reference Pairs


slide-57
SLIDE 57

Ovarian Cancer

RDP Mapping from 3 Features: 2193, 2241, 2349

AB Distance (α = 0.7; β = 0.45; 44,102) 2792 / 3844

“Outliers”: 11, 15, 22, 25, 191, 195, 216


slide-58
SLIDE 58

Ovarian Cancer

RDP Mapping from 3 Features: 2193, 2241, 2349

L2 norm; 35 / 3844; “Outliers”: 11,15,22
L1 norm; 3 / 3844; “Outliers”: 11,15


slide-59
SLIDE 59

Comparison of, and Distinction between, Equally Accurate Feature Sets

E.g. D5 (5 Features): Two Sets with 100% (TS & VS)


slide-60
SLIDE 60

Prostate Cancer

RDP Mapping (L2 Distance) from 5 Features:

7-12-17-37-53    9-22-26-43-54    3 of 1849 with:
TS: 0, VS: 0    TS: 7 (4 + 3); VS: 2 (1 + 1)


slide-61
SLIDE 61

Prostate Cancer

RDP Mapping (AB Distance: α = 0.72; β = 0.52) from 5 Features (7-12-17-37-53); 238 of 1849 with No TS or VS Errors


slide-62
SLIDE 62

Prostate Cancer

RDP Mapping (AB Distance: α = 0.61; β = 0.46) from 5 Features (9-22-26-43-54); 32 of 1849 with No TS or VS Errors


slide-63
SLIDE 63

Different Reference Point Pairs: Multiple Views of a High-Dimensional Dataset

slide-64
SLIDE 64

Prostate Cancer

5 Features: 7, 12, 17, 37, 53 (AB Distance: α = 0.72)

Reference Pair: 1 – 66


slide-65
SLIDE 65

Prostate Cancer

5 Features: 7, 12, 17, 37, 53 (AB Distance: α = 0.72; Reference Pair from Same Class)

41 – 46 (Red Class)    83 – 78 (Blue Class)


slide-66
SLIDE 66

Major Challenge: Dataset Sparsity. What to Do?

slide-67
SLIDE 67

Choice #1: Obtain More Real Data

Generally NOT Feasible for Biomedical Data

Choice #2: Create Surrogate Data

Possibilities:

  • 1. “Noise Injection” (Raudys)
  • 2. Convex Pseudo-Data (Breiman)
  • 3. Fit Gaussian Distribution to Data
  • 4. Within-Class Feature Permutation
  • 5. Random Selection: Feature-by-Feature CDF
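Option 4 is the easiest of these to sketch (our code; `permute_within_class` is a hypothetical name): shuffling each feature independently within a class preserves every marginal distribution while breaking inter-feature correlations.

```python
# Illustrative surrogate-data generator (within-class feature permutation).
import numpy as np

def permute_within_class(X_class, rng=None):
    """Surrogate patterns: permute each feature column within one class."""
    rng = rng if rng is not None else np.random.default_rng()
    surrogate = X_class.copy()
    for k in range(surrogate.shape[1]):
        surrogate[:, k] = rng.permutation(surrogate[:, k])
    return surrogate

X = np.random.default_rng(2).normal(size=(30, 5))     # one class, 5 features
X_surr = permute_within_class(X, np.random.default_rng(3))
```

Sorting each column of X and X_surr yields identical arrays: the marginals survive, and only the joint structure is scrambled.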


slide-68
SLIDE 68

Simultaneous Visualization and Comparison of Distributional Properties of Classes and/or Training and Validation Sets: Kolmogorov – Smirnov (KS) Test

CDF from Histograms on Reference Axis


slide-69
SLIDE 69

Relevant Properties:

If two distributions originate from different populations, the KS statistic is ~1. If the KS statistic is ~0, the two distributions derive from the same population.

Some Applications:

  • 1. Verify that a Training / Test Set Split is Meaningful
  • 2. Check Distributional Relevance of Surrogate Data
  • 3. Confirm Dataset Sparsity: Little or No Distinction

Between Training and Test Sets
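A hedged sketch of application 1, using SciPy’s two-sample KS test (the data here are synthetic stand-ins, not from any slide):

```python
# Illustrative only: KS statistic near 0 for a meaningful split,
# near 1 when the two samples come from different populations.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)
same = ks_2samp(rng.normal(size=200), rng.normal(size=200))
diff = ks_2samp(rng.normal(size=200), rng.normal(loc=3.0, size=200))
```

In practice one compares the training split against the test split (or against surrogate data), feature by feature or along a reference axis as in the CDF-from-histograms construction above.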


slide-70
SLIDE 70

SRBCT: EWS (23 + 8) vs. BL (6 + 3)

RDP Mapping (L2) from 2308 Features

Class 1 (TS) vs. Class 1 (VS) Class 2 (TS) vs. Class 2 (VS) KS = 0.96; p = 0.00008 KS = 1.00; p = 0.007


slide-71
SLIDE 71

Ovarian Cancer Surrogate Samples

Mapping (L2) from 15154 M/Z (Yellow (3) and Turquoise (4) Triangles) Misclassifications: 11 + 13 Misclassifications: 10 + 11 KS (1 vs. 2): 0.789 KS (1 vs. 3): 0.099; (2 vs. 4): 0.062


slide-72
SLIDE 72

Exploring the Possibility of Direct Classification in the RDP

slide-73
SLIDE 73

Fisher Iris

RDP Map (AB Distance: α = 0.5; β = 0.5) From 4 Dimensions; 2 vs. 3 (40 – 101) Two misclassified

Optimal LDA “hyperplane”

slide-74
SLIDE 74

Fisher Iris

RDP Map (AB Distance: α = 0.07; β = 0.53) From 4 Dimensions; 2 vs. 3 (54 – 99). One misclassified

Optimal LDA “hyperplane”

slide-75
SLIDE 75

Influence of α on RDP Mapping Distribution and on Classification: α = 0.0; α = 0.7 (Best); α = 1.0

slide-76
SLIDE 76

Work in Progress

  • 1. Introduce Quadratic (Nonlinear) Features:

Augment RDP x, y by x², y² & xy: From 2 to only 5 Dimensions

  • 2. Combine Different Reference Pair Classifiers:
  • Leave-Two-Out Crossvalidation
  • Classifier Fusion
  • 3. Create 2D Classifiers from Two ~Orthogonal

Reference Axes Ak and Am. Sample xj with Coordinates Sk(xj) and Sm(xj)

  • 4. Optimal Rotation of Class Separator Line
  • 5. Unsupervised RDP
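Item 1 admits a short sketch (our code; `augment_quadratic` is a hypothetical name): augmenting the RDP coordinates (x, y) with x², y², xy lifts the map from 2 to only 5 dimensions, so a linear separator there acts quadratically in the plane.

```python
# Illustrative quadratic-feature augmentation of RDP coordinates.
import numpy as np

def augment_quadratic(P):
    """(K, 2) RDP coordinates -> (K, 5): columns x, y, x^2, y^2, x*y."""
    x, y = P[:, 0], P[:, 1]
    return np.column_stack([x, y, x**2, y**2, x * y])

Q = augment_quadratic(np.array([[1.0, 2.0], [3.0, 4.0]]))
```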


slide-77
SLIDE 77

Ovarian Cancer

Optimal Rotation of Class Separator “Hyperplane”: AB Distance (α = 0.7; β = 0.45; γ = ??)


slide-78
SLIDE 78

Coworkers & Collaborators:

Richard Baumgartner Chris Bowman Stephanie Booth (National Microbiology Lab) Aleks Demko Brion Dolenko Marina Mandelzweig Sasha Nikulin Nick Pizzi Randy Summers Peter Zhilkin
