SLIDE 1

Many Features, Few Samples:
From cheminformatics to bioinformatics

Kristin P. Bennett
Department of Mathematical Sciences, Rensselaer Polytechnic Institute
and RPI DDASSL Project Members: C. Breneman, M. Embrechts, J. Bi, M. Momma, N. Sukumar, M. Song

Interface 2004, 5/04

SLIDE 2

Cheminformatics Problem

Given for each molecule $i$:

  • Descriptor vector $x_i$
  • Bioresponse $y_i$

Catch: many descriptors/attributes (600-1000+), very few data points (30-200), and the descriptors are highly correlated. Construct a function to predict the bioresponse:

$$f(x_i) \approx y_i$$

SLIDE 3

Electron Density-Derived TAE-Wavelet Descriptors

1) Surface properties are encoded on a 0.002 e/au³ electron density isosurface.

   Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18(2), pp. 182-197.

2) Histograms or wavelet encodings of the surface properties give the TAE property descriptors.

(Figure: PIP (Local Ionization Potential) surface, with its histogram and wavelet-coefficient encodings.)

SLIDE 4

PEST Hybrid Property/Shape Descriptors

  • Surface properties and shape information are encoded into alignment-free descriptors
  • 9 different surface properties

(Figure: PIP vs. segment length.)

SLIDE 5

Many Features / Little Data: Issues

  • Overfitting
  • Feature selection
  • Difficult validation
  • Model/parameter selection
  • High model variance
  • Not confident in any one model

SLIDE 6

DDASSL Learning Methodology: One Method with Three Engines

  • Method
    • Regularized kernel learning engines
    • Bagged feature selection/visualization
    • Bagged final models
  • Learning engines (linear and kernel)
    • Support Vector Machine (SVM)
    • Partial Least Squares (PLS)
    • Boosted Latent Analysis (BLA)

SLIDE 7

Minimize Regularized Loss

  • Minimize the training error and the capacity

Overfitting is likely with high-capacity functions; capacity control makes good generalization possible even in very high-dimensional input spaces. (Figure: two candidate fits $f_1(x)$ and $f_2(x)$ of the same data.)

$$\min_{f} \; \sum_i \mathrm{Loss}\big(f(x_i), y_i\big) + P(f)$$
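As a concrete instance of this template (a worked example, not a formula from the slide), squared-error loss with a squared-norm penalty on a linear model $f(x) = w \cdot x + b$ gives ridge regression:

$$\min_{w,b} \; \sum_{i=1}^{l} (w \cdot x_i + b - y_i)^2 + \lambda \|w\|^2$$

where $\lambda > 0$ sets the trade-off between training error and capacity.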

SLIDE 8

Support Vector Regression (SVR)

  • Minimize the regularized empirical error: training error + model complexity:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)$$

  • Overfitting is avoided by controlling the model complexity: $\|w\|$
  • Add kernels to create nonlinear functions

ε-insensitive loss function:

$$L_{\varepsilon}(y - f(x)) := \max\big(0, \, |y - f(x)| - \varepsilon\big)$$

(Figure: $L_\varepsilon$ plotted against $y - f(x)$; residuals inside the ε-tube incur no loss, and the slacks $\xi, \xi^*$ measure the excess outside it.)
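As a minimal runnable sketch of ε-SVR, assuming scikit-learn as the implementation (the slide names no software) and synthetic data shaped like the problem on Slide 2:

```python
# Hedged illustration: fit an RBF-kernel epsilon-SVR; C controls model
# complexity and epsilon is the half-width of the insensitive tube.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 718))   # few samples, many descriptors
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
print(model.predict(X[:5]))       # in-sample predictions
```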

SLIDE 9

Feature Selection via Sparse SVM/LP

  • Construct a linear ν-SVM using the 1-norm LP (a solver sketch follows):

$$\begin{aligned}
\min_{w, b, z, z^*, \varepsilon} \quad & \|w\|_1 + C \sum_{i=1}^{l} (z_i + z_i^*) + C\,\nu\,\varepsilon \\
\text{s.t.} \quad & (x_i \cdot w + b) - y_i \le \varepsilon + z_i, \\
& y_i - (x_i \cdot w + b) \le \varepsilon + z_i^*, \\
& z_i, z_i^* \ge 0, \quad i = 1, \ldots, l
\end{aligned}$$

  • Pick the best C for the SVM
  • Keep descriptors with nonzero coefficients: $|w_i| > 0$
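A sketch of this LP in code, assuming SciPy's linprog as the solver (the slide specifies none). The split $w = p - q$ with $p, q \ge 0$ linearizes $\|w\|_1$, and the nonzero tolerance is my choice:

```python
# Variable layout: [p (d), q (d), b+, b-, z (l), z* (l), eps].
import numpy as np
from scipy.optimize import linprog

def sparse_svm_lp(X, y, C=1.0, nu=0.5):
    l, d = X.shape
    n = 2 * d + 2 + 2 * l + 1
    c = np.zeros(n)
    c[:2 * d] = 1.0                          # ||w||_1 = sum(p + q)
    c[2 * d + 2:2 * d + 2 + 2 * l] = C       # C * sum(z + z*)
    c[-1] = C * nu                           # C * nu * eps
    ones = np.ones((l, 1))
    # (x_i.w + b) - y_i <= eps + z_i
    A1 = np.hstack([X, -X, ones, -ones, -np.eye(l), np.zeros((l, l)), -ones])
    # y_i - (x_i.w + b) <= eps + z*_i
    A2 = np.hstack([-X, X, -ones, ones, np.zeros((l, l)), -np.eye(l), -ones])
    res = linprog(c, A_ub=np.vstack([A1, A2]),
                  b_ub=np.concatenate([y, -y]),
                  bounds=[(0, None)] * n, method="highs")
    w = res.x[:d] - res.x[d:2 * d]
    keep = np.flatnonzero(np.abs(w) > 1e-8)  # descriptors with |w_i| > 0
    return w, keep
```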

SLIDE 10

Bagged Variable Selection

(Flowchart: the DATASET is split into a training set and a test set. Each bootstrap sample k of the training set is split into training and validation parts for tuning, and a sparse linear SVM is tuned and used for prediction, with random variables mixed into the descriptor pool. The descriptors selected across bootstraps define the Reduced Data, on which a nonlinear SVM predictive model is built and scored on the test set. A code sketch of the bagging loop follows.)
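Continuing the sketch, a hypothetical bagging loop that reuses sparse_svm_lp from the previous slide and keeps descriptors selected in a majority of bootstrap replicates (the majority threshold is an assumption; the slide does not state one):

```python
import numpy as np

def bagged_feature_selection(X, y, n_bags=20, C=1.0, nu=0.5, seed=None):
    rng = np.random.default_rng(seed)
    l, d = X.shape
    counts = np.zeros(d, dtype=int)
    for _ in range(n_bags):
        idx = rng.integers(0, l, size=l)             # bootstrap sample k
        _, keep = sparse_svm_lp(X[idx], y[idx], C=C, nu=nu)
        counts[keep] += 1
    selected = np.flatnonzero(counts >= n_bags / 2)  # majority vote (my choice)
    return selected, counts
```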

SLIDE 11

Final Bagged Predictive Model

To achieve better generalization performance, construct a series of nonlinear SVM models and use the average of all models as the final prediction, reducing variance. (A sketch follows.)
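A minimal sketch of this final stage, again assuming scikit-learn's SVR as the nonlinear SVM:

```python
import numpy as np
from sklearn.svm import SVR

def bagged_svr(X_train, y_train, X_test, n_bags=20, seed=None, **svr_kw):
    rng = np.random.default_rng(seed)
    l = len(y_train)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, l, size=l)   # bootstrap sample
        m = SVR(kernel="rbf", **svr_kw).fit(X_train[idx], y_train[idx])
        preds.append(m.predict(X_test))
    return np.mean(preds, axis=0)          # averaging reduces model variance
```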

SLIDE 12

CACO-2 Data

  • Human intestinal cell line
  • Predicts drug absorption
  • 27 molecules with tested permeability
  • 718 descriptors generated:
    • Electronic TAE
    • Shape/Property (PEST)
    • Traditional (MOE)

SLIDE 13

Molecular Surface Properties

  • Electronic Properties
    • Electrostatic Potential
    • Electronic Kinetic Energy Density
    • Electron Density Gradients ($\nabla\rho \cdot N$)
    • Laplacian of the Electron Density
    • Local Average Ionization Potential
    • Bare Nuclear Potential (BNP)
    • Fukui function $F^+(r) = \rho_{\mathrm{HOMO}}(r)$

$$EP(r) = \sum_{\alpha} \frac{Z_\alpha}{|r - R_\alpha|} - \int \frac{\rho(r')\,dr'}{|r - r'|}$$

$$K(r) = -\big(\psi^* \nabla^2 \psi + \psi \nabla^2 \psi^*\big), \qquad G(r) = -\nabla\psi^* \cdot \nabla\psi$$

$$L(r) = -\nabla^2 \rho(r) = K(r) - G(r)$$

$$PIP(r) = \frac{\sum_i \rho_i(r)\,\varepsilon_i}{\rho(r)}$$

SLIDE 14

Visualization of Feature Selection Results

To investigate the relative importance of the selected descriptors and their consistency across bootstrap replicates.

SLIDE 15

Caco-2: 14 Features (SVM)

(Star plot over the selected descriptors: a.don, KB54, SMR.VSA2, ANGLEB45, DRNB10, ABSDRN6, PEOE.VSA.FPPOS, DRNB00, BNPB31, FUKB14, SlogP.VSA0, PEOE.VSA.FNEG, ABSKMIN, SIKIA.)

  • Each star represents a descriptor
  • Each ray is a separate bootstrap
  • The area of a star represents the relative importance of that descriptor
  • Descriptors shaded cyan have a negative effect; unshaded ones have a positive effect

Interpretation:

  • Hydrophobicity: a.don
  • Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45. Large is bad; flat is bad; globular is good.
  • Polarity: PEOE.VSA...: negative partial charge is good.

SLIDE 16

Bagged SVM (RBF) on Caco-2

(Scatter plot: predicted vs. observed permeability, both axes spanning roughly 3 to 8.)

  • Train $R^2_{cv}$ = 0.93
  • Blind test $R^2$ = 0.83
  • Before feature selection: $R^2$ = 0.66

SLIDE 17

New Learning Engine: BLA (Boosted Latent Analysis)

Construct orthogonal latent features and a corresponding predictive model for (sub)differentiable loss functions.

  • Orthogonal boosting of linear functions
  • For least-squares loss, equivalent to PLS
  • Easy to tune
  • Easy-to-implement algorithm: small changes for different loss functions
  • Feature selection for linear models
  • Kernelizable for nonlinear models

SLIDE 18

Review of AnyBoost (Mason et al. 1999)

(Flowchart, summarized as a loop; a schematic code sketch follows.)

  • Predictive model $F = \sum_{i=1}^{L} c_i t_i$ with a chosen loss function
  • Pseudo response: $u$ = negative gradient of the loss at the current $F$
  • Weak learning algorithm: find a weak hypothesis $t$ with $t'u > 0$
  • Append it: $T = [T \; t_i]$
  • Compute the step size $c_i$, or backfit all of $c$
  • Repeat: this is steepest descent in function space
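A schematic sketch of the loop above; the function names, the stopping rule, and the exact-line-search step size (valid for squared loss) are my assumptions, with backfitting as the slide's alternative:

```python
# Schematic AnyBoost loop. grad(F, y) returns dLoss/dF; weak_learner(X, u)
# returns a hypothesis vector t. Both callables are assumptions of mine.
import numpy as np

def anyboost(X, y, grad, weak_learner, n_rounds=10):
    F = np.zeros(len(y))                 # current model outputs on the data
    T, c = [], []
    for _ in range(n_rounds):
        u = -grad(F, y)                  # pseudo response: negative gradient
        t = weak_learner(X, u)           # weak hypothesis
        if t @ u <= 0:                   # no descent direction left: stop
            break
        ci = (t @ u) / (t @ t)           # exact line search for squared loss
        F = F + ci * t                   # F = sum_i c_i t_i
        T.append(t)
        c.append(ci)
    return F, T, c

# Example plug-ins: squared loss and a "best single descriptor" weak learner.
squared_grad = lambda F, y: F - y
best_column = lambda X, u: X[:, np.argmax(np.abs(X.T @ u))]
```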

SLIDE 19

Orthogonal AnyBoost

The same loop as AnyBoost (predictive model $F = \sum_{i=1}^{L} c_i t_i$, pseudo response $u$ from the negative gradient, step size $c_i$ or backfitting $c$), except that the weak learning algorithm must find $t$ with $t'u > 0$ that is orthogonal to all previous hypotheses:

$$t' t_j = 0, \quad j = 1, \ldots, i-1$$

This yields a subspace or conjugate gradient algorithm.

SLIDE 20

Boosted Latent Analysis (Momma and Bennett 2004)

Least-squares instantiation, Loss $= \|y - Tc\|^2$ (a code sketch follows):

  • Initialize $t_0 = e$
  • Pseudo response: $u = y - Tc$ (for a generic loss function, $u = -\nabla \mathrm{Loss}$)
  • Weak learner (linear): maximize $t'u > 0$, giving $w = X'u$, $t = Xw$
  • Deflate X: $X \leftarrow (I - tt')X$
  • Append $T = [T \; t_i]$; compute the step size $c_i$, or backfit $c$
  • Predictive model: $F = \sum_{i=1}^{L} c_i t_i$, a linear function $F = x'w + \gamma$
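A sketch of the least-squares case of this flowchart (equivalent to PLS, per Slide 17); the notation $t$, $T$, $c$, $u$ and the deflation follow the slide:

```python
import numpy as np

def bla_least_squares(X, y, n_latent=4):
    X = np.array(X, dtype=float)            # working copy; deflated in place
    u = np.array(y, dtype=float)            # initial pseudo response u = y
    T, c, W = [], [], []
    for _ in range(n_latent):
        w = X.T @ u                         # weak learner: w = X'u
        t = X @ w                           # latent feature t = Xw
        t = t / np.linalg.norm(t)           # normalize t
        ci = t @ u                          # step size (least-squares fit)
        u = u - ci * t                      # pseudo response u = y - Tc
        X = X - np.outer(t, t @ X)          # deflate X: X <- (I - tt')X
        T.append(t)
        c.append(ci)
        W.append(w)
    F = np.sum([ci * ti for ci, ti in zip(c, T)], axis=0)  # fitted values
    return F, np.array(T).T, np.array(c), np.array(W).T
```

Scoring new molecules requires storing the per-step weights and replaying the deflation; this sketch produces only the training-side fit.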

SLIDE 21

Feature Selection

Replace the optimal $w^* = X'y$ with a good sparse $w$ (a sketch follows):

  • Look at the largest $q$ components of $w^*$
  • Evaluate the cluster quality of the $q$ descriptors using the gap statistic (Gene Shaving, 2000): the difference in the between-to-within variance ratio of the signed mean descriptors for real and permuted data
  • Let $w = w^*(i)$ for descriptors in the best cluster, 0 otherwise
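A hypothetical sketch of the sparsification step alone; the gap-statistic search over candidate $q$ values is omitted, so $q$ is passed in directly:

```python
# Keep the q largest-magnitude components of w* = X'u and zero the rest.
import numpy as np

def sparse_weak_learner(X, u, q):
    w_star = X.T @ u                        # optimal dense direction w* = X'u
    keep = np.argsort(np.abs(w_star))[-q:]  # indices of the q largest |w*_i|
    w = np.zeros_like(w_star)
    w[keep] = w_star[keep]                  # w = w*(i) on the selected set
    return w
```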
SLIDE 22

Leukemia Microarray Data (Golub et al. 1999)

Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL)

  • 7129 genes
  • Train: 27 ALL + 11 AML
  • Test: 20 ALL + 14 AML

SLIDE 23

One Model: 4 LV, 36 Genes, One Error

(Figure: Leukemia test points plotted against the latent variables, axes roughly 0.1 to 0.5; LV 1 uses 18 genes, LV 2 uses 10 genes; AML and ALL points separate with a single error.)

SLIDE 24

PLS versus BLV (4 LV and 20 bagged models)

  • Linear PLS (Wold et al.): 7032 descriptors; 4 errors on Leukemia
  • Linear BLV with least squares: 73 descriptors (8-38 per model); 1 error on Leukemia

SLIDE 25

Genes Used in Models

(Figure: "Pattern of Genes Used Across 20 Models": gene index, 1000-8000, versus model, 2-20.)

SLIDE 26

Understanding the Bagged Model

32/72 genes appeared in at least three models.

SLIDE 27

Conclusions

  • Robust methodology for many descriptors / few points (Analyze/StripMiner)
    • Bagged feature selection
    • Bagged predictive models
    • Regularized learning engines (SVM, PLS, BLV)
  • Proven in cheminformatics
  • Promising results in bioinformatics with BLV

SLIDE 28

ACKNOWLEDGMENTS

  • Members of the DDASSL group
  • Bennett Research Group (RPI Mathematics)
    • Jinbo Bi (Siemens)
    • Michi Momma (Fair Isaac)
    • Angela Zhang
  • Breneman Research Group (RPI Chemistry)
    • N. Sukumar
    • M. Sundling
    • C. Whitehead (Pfizer)
    • L. Shen
    • L. Lockwood (Albany Molecular)
    • M. Song
    • D. Zhuang
    • W. Katt
    • Q. Luo
  • Embrechts Research Group (RPI DSES)
  • Collaborators:
    • Cramer Research Group (RPI Chemical Engineering)
  • Funding
    • NIH
    • NSF