SLIDE 1

Many Features, Few Samples:
From cheminformatics to bioinformatics

Kristin P. Bennett
Department of Mathematical Sciences, Rensselaer Polytechnic Institute
and RPI DDASSL Project Members: C. Breneman, M. Embrechts, J. Bi, M. Momma, N. Sukumar, M. Song

Interface 2004, 5/04

SLIDE 2

Cheminformatics Problem

Given for each molecule $i$:

  • Descriptor vector $x_i$
  • Bioresponse $y_i$

Catch: many descriptors/attributes (600-1000+), very few data points (30-200), and the descriptors are highly correlated. Construct a function to predict the bioresponse:

$$f(x_i) \approx y_i$$

SLIDE 3

Electron Density-Derived TAE-Wavelet Descriptors

1) Surface properties are encoded on a 0.002 e/au³ electron density isosurface.

   Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18(2), pp. 182-197.

2) Histograms or wavelet encodings of the surface properties give the TAE property descriptors.

(Figure: PIP (Local Ionization Potential) surface, with its histogram and wavelet-coefficient encodings.)

SLIDE 4

PEST Hybrid Property/Shape Descriptors

  • Surface properties and shape information are encoded into alignment-free descriptors
  • 9 different surface properties

(Figure: PIP vs. segment length.)

SLIDE 5

Many Features / Little Data: Issues

  • Overfitting
  • Feature selection
  • Difficult validation
  • Model/parameter selection
  • High model variance
  • Not confident in any one model

SLIDE 6

DDASSL Learning Methodology: One Method with Three Engines

  • Method
    • Regularized kernel learning engines
    • Bagged feature selection/visualization
    • Bagged final models
  • Learning engines (linear and kernel)
    • Support Vector Machine (SVM)
    • Partial Least Squares (PLS)
    • Boosted Latent Analysis (BLA)

SLIDE 7

Minimize Regularized Loss

  • Minimize the training error and the capacity

Overfitting is likely with high-capacity functions; capacity control makes good generalization possible even in very high-dimensional input spaces. (Figure: two candidate fits $f_1(x)$ and $f_2(x)$ of the same data.)

$$\min_{f} \; \sum_i \mathrm{Loss}\big(f(x_i), y_i\big) + P(f)$$
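As a concrete instance of this template (a worked example, not a formula from the slide), squared-error loss with a squared-norm penalty on a linear model $f(x) = w \cdot x + b$ gives ridge regression:

$$\min_{w,b} \; \sum_{i=1}^{l} (w \cdot x_i + b - y_i)^2 + \lambda \|w\|^2$$

where $\lambda > 0$ sets the trade-off between training error and capacity.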

SLIDE 8

Support Vector Regression (SVR)

  • Minimize the regularized empirical error: training error + model complexity:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)$$

  • Overfitting is avoided by controlling the model complexity: $\|w\|$
  • Add kernels to create nonlinear functions

ε-insensitive loss function:

$$L_{\varepsilon}(y - f(x)) := \max\big(0, \, |y - f(x)| - \varepsilon\big)$$

(Figure: $L_\varepsilon$ plotted against $y - f(x)$; residuals inside the ε-tube incur no loss, and the slacks $\xi, \xi^*$ measure the excess outside it.)
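As a minimal runnable sketch of ε-SVR, assuming scikit-learn as the implementation (the slide names no software) and synthetic data shaped like the problem on Slide 2:

```python
# Hedged illustration: fit an RBF-kernel epsilon-SVR; C controls model
# complexity and epsilon is the half-width of the insensitive tube.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 718))   # few samples, many descriptors
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
print(model.predict(X[:5]))       # in-sample predictions
```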

SLIDE 9

Feature Selection via Sparse SVM/LP

  • Construct a linear ν-SVM using the 1-norm LP (a solver sketch follows):

$$\begin{aligned}
\min_{w, b, z, z^*, \varepsilon} \quad & \|w\|_1 + C \sum_{i=1}^{l} (z_i + z_i^*) + C\,\nu\,\varepsilon \\
\text{s.t.} \quad & (x_i \cdot w + b) - y_i \le \varepsilon + z_i, \\
& y_i - (x_i \cdot w + b) \le \varepsilon + z_i^*, \\
& z_i, z_i^* \ge 0, \quad i = 1, \ldots, l
\end{aligned}$$

  • Pick the best C for the SVM
  • Keep descriptors with nonzero coefficients: $|w_i| > 0$
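A sketch of this LP in code, assuming SciPy's linprog as the solver (the slide specifies none). The split $w = p - q$ with $p, q \ge 0$ linearizes $\|w\|_1$, and the nonzero tolerance is my choice:

```python
# Variable layout: [p (d), q (d), b+, b-, z (l), z* (l), eps].
import numpy as np
from scipy.optimize import linprog

def sparse_svm_lp(X, y, C=1.0, nu=0.5):
    l, d = X.shape
    n = 2 * d + 2 + 2 * l + 1
    c = np.zeros(n)
    c[:2 * d] = 1.0                          # ||w||_1 = sum(p + q)
    c[2 * d + 2:2 * d + 2 + 2 * l] = C       # C * sum(z + z*)
    c[-1] = C * nu                           # C * nu * eps
    ones = np.ones((l, 1))
    # (x_i.w + b) - y_i <= eps + z_i
    A1 = np.hstack([X, -X, ones, -ones, -np.eye(l), np.zeros((l, l)), -ones])
    # y_i - (x_i.w + b) <= eps + z*_i
    A2 = np.hstack([-X, X, -ones, ones, np.zeros((l, l)), -np.eye(l), -ones])
    res = linprog(c, A_ub=np.vstack([A1, A2]),
                  b_ub=np.concatenate([y, -y]),
                  bounds=[(0, None)] * n, method="highs")
    w = res.x[:d] - res.x[d:2 * d]
    keep = np.flatnonzero(np.abs(w) > 1e-8)  # descriptors with |w_i| > 0
    return w, keep
```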

SLIDE 10

Bagged Variable Selection

(Flowchart: the DATASET is split into a training set and a test set. Each bootstrap sample k of the training set is split into training and validation parts for tuning, and a sparse linear SVM is tuned and used for prediction, with random variables mixed into the descriptor pool. The descriptors selected across bootstraps define the Reduced Data, on which a nonlinear SVM predictive model is built and scored on the test set. A code sketch of the bagging loop follows.)
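Continuing the sketch, a hypothetical bagging loop that reuses sparse_svm_lp from the previous slide and keeps descriptors selected in a majority of bootstrap replicates (the majority threshold is an assumption; the slide does not state one):

```python
import numpy as np

def bagged_feature_selection(X, y, n_bags=20, C=1.0, nu=0.5, seed=None):
    rng = np.random.default_rng(seed)
    l, d = X.shape
    counts = np.zeros(d, dtype=int)
    for _ in range(n_bags):
        idx = rng.integers(0, l, size=l)             # bootstrap sample k
        _, keep = sparse_svm_lp(X[idx], y[idx], C=C, nu=nu)
        counts[keep] += 1
    selected = np.flatnonzero(counts >= n_bags / 2)  # majority vote (my choice)
    return selected, counts
```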

SLIDE 11

Final Bagged Predictive Model

To achieve better generalization performance, construct a series of nonlinear SVM models and use the average of all models as the final prediction, reducing variance. (A sketch follows.)
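A minimal sketch of this final stage, again assuming scikit-learn's SVR as the nonlinear SVM:

```python
import numpy as np
from sklearn.svm import SVR

def bagged_svr(X_train, y_train, X_test, n_bags=20, seed=None, **svr_kw):
    rng = np.random.default_rng(seed)
    l = len(y_train)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, l, size=l)   # bootstrap sample
        m = SVR(kernel="rbf", **svr_kw).fit(X_train[idx], y_train[idx])
        preds.append(m.predict(X_test))
    return np.mean(preds, axis=0)          # averaging reduces model variance
```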

SLIDE 12

CACO-2 Data

  • Human intestinal cell line
  • Predicts drug absorption
  • 27 molecules with tested permeability
  • 718 descriptors generated:
    • Electronic TAE
    • Shape/Property (PEST)
    • Traditional (MOE)

SLIDE 13

Molecular Surface Properties

  • Electronic Properties
    • Electrostatic Potential
    • Electronic Kinetic Energy Density
    • Electron Density Gradients ($\nabla\rho \cdot N$)
    • Laplacian of the Electron Density
    • Local Average Ionization Potential
    • Bare Nuclear Potential (BNP)
    • Fukui function $F^+(r) = \rho_{\mathrm{HOMO}}(r)$

$$EP(r) = \sum_{\alpha} \frac{Z_\alpha}{|r - R_\alpha|} - \int \frac{\rho(r')\,dr'}{|r - r'|}$$

$$K(r) = -\big(\psi^* \nabla^2 \psi + \psi \nabla^2 \psi^*\big), \qquad G(r) = -\nabla\psi^* \cdot \nabla\psi$$

$$L(r) = -\nabla^2 \rho(r) = K(r) - G(r)$$

$$PIP(r) = \frac{\sum_i \rho_i(r)\,\varepsilon_i}{\rho(r)}$$

SLIDE 14

Visualization of Feature Selection Results

To investigate the relative importance of the selected descriptors and their consistency across bootstrap replicates.

SLIDE 15

Caco-2: 14 Features (SVM)

(Star plot over the selected descriptors: a.don, KB54, SMR.VSA2, ANGLEB45, DRNB10, ABSDRN6, PEOE.VSA.FPPOS, DRNB00, BNPB31, FUKB14, SlogP.VSA0, PEOE.VSA.FNEG, ABSKMIN, SIKIA.)

  • Each star represents a descriptor
  • Each ray is a separate bootstrap
  • The area of a star represents the relative importance of that descriptor
  • Descriptors shaded cyan have a negative effect; unshaded ones have a positive effect

Interpretation:

  • Hydrophobicity: a.don
  • Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45. Large is bad; flat is bad; globular is good.
  • Polarity: PEOE.VSA...: negative partial charge is good.

SLIDE 16

Bagged SVM (RBF) on Caco-2

(Scatter plot: predicted vs. observed permeability, both axes spanning roughly 3 to 8.)

  • Train $R^2_{cv}$ = 0.93
  • Blind test $R^2$ = 0.83
  • Before feature selection: $R^2$ = 0.66

SLIDE 17

New Learning Engine: BLA (Boosted Latent Analysis)

Construct orthogonal latent features and a corresponding predictive model for (sub)differentiable loss functions.

  • Orthogonal boosting of linear functions
  • For least-squares loss, equivalent to PLS
  • Easy to tune
  • Easy-to-implement algorithm: small changes for different loss functions
  • Feature selection for linear models
  • Kernelizable for nonlinear models

SLIDE 18

Review of AnyBoost (Mason et al. 1999)

(Flowchart, summarized as a loop; a schematic code sketch follows.)

  • Predictive model $F = \sum_{i=1}^{L} c_i t_i$ with a chosen loss function
  • Pseudo response: $u$ = negative gradient of the loss at the current $F$
  • Weak learning algorithm: find a weak hypothesis $t$ with $t'u > 0$
  • Append it: $T = [T \; t_i]$
  • Compute the step size $c_i$, or backfit all of $c$
  • Repeat: this is steepest descent in function space
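A schematic sketch of the loop above; the function names, the stopping rule, and the exact-line-search step size (valid for squared loss) are my assumptions, with backfitting as the slide's alternative:

```python
# Schematic AnyBoost loop. grad(F, y) returns dLoss/dF; weak_learner(X, u)
# returns a hypothesis vector t. Both callables are assumptions of mine.
import numpy as np

def anyboost(X, y, grad, weak_learner, n_rounds=10):
    F = np.zeros(len(y))                 # current model outputs on the data
    T, c = [], []
    for _ in range(n_rounds):
        u = -grad(F, y)                  # pseudo response: negative gradient
        t = weak_learner(X, u)           # weak hypothesis
        if t @ u <= 0:                   # no descent direction left: stop
            break
        ci = (t @ u) / (t @ t)           # exact line search for squared loss
        F = F + ci * t                   # F = sum_i c_i t_i
        T.append(t)
        c.append(ci)
    return F, T, c

# Example plug-ins: squared loss and a "best single descriptor" weak learner.
squared_grad = lambda F, y: F - y
best_column = lambda X, u: X[:, np.argmax(np.abs(X.T @ u))]
```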

SLIDE 19

Orthogonal AnyBoost

The same loop as AnyBoost (predictive model $F = \sum_{i=1}^{L} c_i t_i$, pseudo response $u$ from the negative gradient, step size $c_i$ or backfitting $c$), except that the weak learning algorithm must find $t$ with $t'u > 0$ that is orthogonal to all previous hypotheses:

$$t' t_j = 0, \quad j = 1, \ldots, i-1$$

This yields a subspace or conjugate gradient algorithm.

SLIDE 20

Boosted Latent Analysis (Momma and Bennett 2004)

Least-squares instantiation, Loss $= \|y - Tc\|^2$ (a code sketch follows):

  • Initialize $t_0 = e$
  • Pseudo response: $u = y - Tc$ (for a generic loss function, $u = -\nabla \mathrm{Loss}$)
  • Weak learner (linear): maximize $t'u > 0$, giving $w = X'u$, $t = Xw$
  • Deflate X: $X \leftarrow (I - tt')X$
  • Append $T = [T \; t_i]$; compute the step size $c_i$, or backfit $c$
  • Predictive model: $F = \sum_{i=1}^{L} c_i t_i$, a linear function $F = x'w + \gamma$
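A sketch of the least-squares case of this flowchart (equivalent to PLS, per Slide 17); the notation $t$, $T$, $c$, $u$ and the deflation follow the slide:

```python
import numpy as np

def bla_least_squares(X, y, n_latent=4):
    X = np.array(X, dtype=float)            # working copy; deflated in place
    u = np.array(y, dtype=float)            # initial pseudo response u = y
    T, c, W = [], [], []
    for _ in range(n_latent):
        w = X.T @ u                         # weak learner: w = X'u
        t = X @ w                           # latent feature t = Xw
        t = t / np.linalg.norm(t)           # normalize t
        ci = t @ u                          # step size (least-squares fit)
        u = u - ci * t                      # pseudo response u = y - Tc
        X = X - np.outer(t, t @ X)          # deflate X: X <- (I - tt')X
        T.append(t)
        c.append(ci)
        W.append(w)
    F = np.sum([ci * ti for ci, ti in zip(c, T)], axis=0)  # fitted values
    return F, np.array(T).T, np.array(c), np.array(W).T
```

Scoring new molecules requires storing the per-step weights and replaying the deflation; this sketch produces only the training-side fit.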

SLIDE 21

Feature Selection

Replace the optimal $w^* = X'y$ with a good sparse $w$ (a sketch follows):

  • Look at the largest $q$ components of $w^*$
  • Evaluate the cluster quality of the $q$ descriptors using the gap statistic (Gene Shaving, 2000): the difference in the between-to-within variance ratio of the signed mean descriptors for real and permuted data
  • Let $w = w^*(i)$ for descriptors in the best cluster, 0 otherwise
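A hypothetical sketch of the sparsification step alone; the gap-statistic search over candidate $q$ values is omitted, so $q$ is passed in directly:

```python
# Keep the q largest-magnitude components of w* = X'u and zero the rest.
import numpy as np

def sparse_weak_learner(X, u, q):
    w_star = X.T @ u                        # optimal dense direction w* = X'u
    keep = np.argsort(np.abs(w_star))[-q:]  # indices of the q largest |w*_i|
    w = np.zeros_like(w_star)
    w[keep] = w_star[keep]                  # w = w*(i) on the selected set
    return w
```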
SLIDE 22

Leukemia Microarray Data (Golub et al. 1999)

Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL)

  • 7129 genes
  • Train: 27 ALL + 11 AML
  • Test: 20 ALL + 14 AML

SLIDE 23

One Model: 4 LV, 36 Genes, One Error

(Figure: Leukemia test points plotted against the latent variables, axes roughly 0.1 to 0.5; LV 1 uses 18 genes, LV 2 uses 10 genes; AML and ALL points separate with a single error.)

SLIDE 24

PLS versus BLV (4 LV and 20 bagged models)

  • Linear PLS (Wold et al.): 7032 descriptors; 4 errors on Leukemia
  • Linear BLV with least squares: 73 descriptors (8-38 per model); 1 error on Leukemia

SLIDE 25

Genes Used in Models

(Figure: "Pattern of Genes Used Across 20 Models": gene index, 1000-8000, versus model, 2-20.)

SLIDE 26

Understanding the Bagged Model

32/72 genes appeared in at least three models.

SLIDE 27

Conclusions

  • Robust methodology for many descriptors / few points (Analyze/StripMiner)
    • Bagged feature selection
    • Bagged predictive models
    • Regularized learning engines (SVM, PLS, BLV)
  • Proven in cheminformatics
  • Promising results in bioinformatics with BLV

SLIDE 28

ACKNOWLEDGMENTS

  • Members of the DDASSL group
  • Bennett Research Group (RPI Mathematics)
    • Jinbo Bi (Siemens)
    • Michi Momma (Fair Isaac)
    • Angela Zhang
  • Breneman Research Group (RPI Chemistry)
    • N. Sukumar
    • M. Sundling
    • C. Whitehead (Pfizer)
    • L. Shen
    • L. Lockwood (Albany Molecular)
    • M. Song
    • D. Zhuang
    • W. Katt
    • Q. Luo
  • Embrechts Research Group (RPI DSES)
  • Collaborators:
    • Cramer Research Group (RPI Chemical Engineering)
  • Funding
    • NIH
    • NSF