

SLIDE 1

Knowledge discovery in large biological data sets using hybrid classifier/evolutionary algorithms

  • Dr. Michael L. Raymer

Department of Computer Science and Engineering / Biomedical Sciences Program

SLIDE 2
  • M. Raymer, Interface 2004

EC Approaches

  • Knowledge Discovery

Solvation Prediction

  • Protein Structure Modeling/Prediction

Combinatorial comparative modeling

SLIDE 3

Ligand Screening & Docking

  • Complementarity

Shape, chemical, electrostatic

SLIDE 4

Solvation complication

  • The protein surface is highly solvated

Protein crystals are 27–77% water

SLIDE 5

Solvation conservation

  • Question 1:

Given a solvated crystal structure, find those water molecules that are likely to be conserved upon protein-ligand binding.

[Figure: protein surface, ligand, and water molecules.]

SLIDE 6

Water Binding Site Prediction

Question 2: Given a structural model or unsolvated structure, identify likely solvent binding positions.

Unsolvated and solvated Aspartic Protease (3APR) with peptidyl inhibitor.

SLIDE 7

Pattern Recognition Approach

[Figure: labeled training data, described by features f1–f5, train a classifier; the classifier then produces a classification/prediction for an unlabeled pattern with the same features.]

SLIDE 8

Crystallographic Waters

  • False positives

Crystallographic interfacial waters; reduction of R-free by including water molecules

  • False negatives

Poor resolution; smeared density and computational refinement

SLIDE 9

Data Set Generation

  • 30 pairs of proteins: ligand-bound and unbound

Minimal conformational change upon binding (backbone RMSD < 0.5 Å); 2.0 Å or better resolution; low residual error (R < 0.22)

  • ~3000 water molecules in the first hydration shell

SLIDE 10

Conserved and Displaced

Rigid body superimposition of aspartic protease, unbound structure (2APR, red), along with the peptidyl ligand-bound structure (3APR, cyan). Only active-site waters of the bound structure are shown.

SLIDE 11

Probe Site Generation

Aspartic protease (2APR) with crystallographically observed and computer-generated water molecules.
SLIDE 12

Feature Generation

  • Computable from crystal coordinates, or (less desirable) structure factors
  • Empirical
  • Likely to be associated with water binding
SLIDE 13

Atomic Density (ADN)

A water molecule in the ligand-free structure of dihydrofolate reductase (1DR2). The atomic density of this water molecule is 5.

SLIDE 14

Prediction of water molecules

DHFR complex with biopterin, colored according to AHP (1DR2/1DR3). Displaced water molecules from the free structure are shown as wireframe spheres.

SLIDE 15

Temperature Factor (B-Value)

The backbone of dihydrofolate reductase (1DR2) is shown as ribbons colored according to crystallographic temperature factor (B-value).

SLIDE 16

Features Measured

  • Temperature factor (BVAL)
  • Atomic Density (ADN)
  • Atomic Hydrophilicity (AHP)
  • Hydrogen bonds to protein (HBDP)
  • Hydrogen bonds to water (HBDW)
  • Mobility (MOB)
  • ABVAL
  • NBVAL

MOB = (B_w / B_avg) / (Occ_w / Occ_avg)

where B_w and Occ_w are the water molecule's temperature factor and occupancy, and B_avg and Occ_avg are the averages over all waters in the structure.
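As a quick sketch, the mobility feature above can be computed directly from a water's B-value and occupancy (function and argument names are ours, not from the talk):

```python
def mobility(b, occ, b_avg, occ_avg):
    """Mobility (MOB) of a water molecule: its temperature factor (B)
    and occupancy (Occ), each normalized by the averages over all
    waters in the structure. Larger values indicate a more mobile,
    less well-ordered water."""
    return (b / b_avg) / (occ / occ_avg)

# A fully occupied water with twice the average B-value is twice as
# "mobile" as an average water in the same structure.
print(mobility(b=60.0, occ=1.0, b_avg=30.0, occ_avg=1.0))  # → 2.0
```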

SLIDE 17

Highly overlapping distributions

B-value, H-bonds, and AHP, rotated to show the distribution. PCA shows similar overlap among the first two components. LDA obtains nearly random (55%) two-class accuracy.

SLIDE 18

Knowledge Discovery

The black box classifier does not help elucidate why the water molecules bind where they do.

Unsolvated and solvated Aspartic Protease (3APR) with peptidyl inhibitor.

SLIDE 19

EC: Feature Extraction

[Figure: a large-n, moderate-d database feeds a feature space projection (EA), which in turn feeds a classifier (KNN).]

SLIDE 20

Feature Weighted kNN

[Figure: (a) original feature space (feature 1 vs. feature 2) with Class 1, Class 2, and an unknown pattern; (b) the same space with the scale of feature 2 extended, changing which neighbors are nearest to the unknown.]

SLIDE 21

GA & kNN Interaction

[Figure: the genetic algorithm maintains a population of chromosomes (W1 W2 W3 W4 W5, ...); each masked weight vector & k is passed to the KNN classifier, which returns a fitness.]

Fitness — how is it calculated?

SLIDE 22

Weighting and Masking

  • How do we sample feature subsets?

Weight below a threshold value: slow sampling. Masking:

  • Classifier parameters (k) on the chromosome

73.2 | W1 W2 W3 W4 W5 | M1 M2 M3 M4 M5 | k
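The weight/mask chromosome reads directly as a kNN parameterization. A minimal NumPy sketch (the encoding details and names are our illustration, not the talk's exact implementation):

```python
import numpy as np

def masked_weighted_knn(query, X, y, weights, masks, k):
    """kNN where each feature axis is scaled by a GA-supplied weight
    and switched off entirely by a binary mask; k itself also comes
    from the chromosome [W1..Wd | M1..Md | k]."""
    w = np.asarray(weights) * np.asarray(masks)      # masked weight vector
    dists = np.linalg.norm((X - query) * w, axis=1)  # weighted Euclidean
    nearest = np.argsort(dists)[:k]
    return np.bincount(y[nearest]).argmax()          # majority vote

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(masked_weighted_knn(np.array([1.0, 0.5]), X, y,
                          weights=[1.0, 1.0], masks=[1, 1], k=3))  # → 0
```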

SLIDE 23

The Cost Function

  • We can direct the search toward any objective.

Classification accuracy; class balance; feature subset parsimony (reduce d)

  • The GA minimizes the cost function:

cost(w, k) = C_acc · err(w, k) + C_pars · nonzero(w) + C_vote · incorrect_votes(w, k) + C_bal · bal(w, k)
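Read as code, the cost is a weighted sum of the four objectives. The coefficient values below are illustrative placeholders only; the talk does not give them:

```python
def ga_cost(err, nonzero, incorrect_votes, bal,
            c_acc=1.0, c_pars=0.01, c_vote=0.1, c_bal=0.05):
    """cost(w, k) = C_acc*err + C_pars*nonzero
                  + C_vote*incorrect_votes + C_bal*bal.
    The GA searches for the (weights, masks, k) chromosome that
    minimizes this value; the C_* trade accuracy against parsimony
    and class balance."""
    return (c_acc * err + c_pars * nonzero
            + c_vote * incorrect_votes + c_bal * bal)

# Example: 30% error rate, 5 unmasked features, 12 incorrect
# neighbor votes, class balance 4.2
print(ga_cost(err=0.30, nonzero=5, incorrect_votes=12, bal=4.2))
```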

SLIDE 24

Data Partitioning

  • Classifier training
  • Tuning/fitness calculation
  • Validation

SLIDE 25

Cross Validation Results

Solvation site prediction — Accuracy (%):

Classifier      | Total  | non-site | site   | Balance
Logistic        | 69.331 | 65.496   | 73.164 | 7.668
NeuralNetwork   | 69.293 | 66.003   | 72.582 | 6.579
VotedPerceptron | 69.246 | 66.754   | 71.737 | 4.983
SMO             | 69.068 | 57.759   | 76.470 | 18.711

Ligand-binding conservation prediction — Accuracy (%):

Classifier      | Total  | disp   | cons   | Balance
NeuralNetwork   | 66.618 | 44.174 | 80.705 | 36.531
j48             | 66.023 | 37.061 | 84.200 | 47.138
ADTree          | 65.969 | 44.268 | 79.589 | 35.321
VotedPerceptron | 65.742 | 36.453 | 84.141 | 47.688

SLIDE 26

Solvated vs. Non-solvated

Bootstrap accuracy (%), balance, and feature weights:

Solvated | Non-solvated | Total | Bal Mean | Bal StDev | K  | ADN   | AHP   | HBDP  | HBDW  | ABVAL | NBVAL
68.60    | 67.75        | 68.18 | 3.50     | 2.66      | 23 | 0.284 | 0.000 | 0.524 | 0.000 | 0.000 | 0.193
66.87    | 68.74        | 67.81 | 3.50     | 2.71      | 75 | 0.545 | 0.000 | 0.235 | 0.000 | 0.000 | 0.219
65.93    | 65.93        | 65.93 | 3.20     | 2.51      | 53 | 0.530 | 0.000 | 0.197 | 0.000 | 0.000 | 0.274
65.21    | 70.44        | 67.83 | 5.49     | 3.68      | 45 | 0.194 | 0.000 | 0.721 | 0.000 | 0.086 | 0.000
66.19    | 69.40        | 67.79 | 4.39     | 3.27      | 35 | 0.332 | 0.000 | 0.643 | 0.000 | 0.024 | 0.000
67.32    | 67.74        | 67.53 | 3.52     | 2.42      | 61 | 0.651 | 0.000 | 0.289 | 0.000 | 0.060 | 0.000

SLIDE 27

Conserved vs. Non-Conserved

Bootstrap accuracy (%), balance, and feature weights:

Disp  | Cons  | Total | Bal Mean | Bal StDev | K  | ADN   | AHP   | BVAL  | HBDP  | HBDW  | MOB   | ABVAL | NBVAL
65.44 | 62.96 | 64.20 | 3.88     | 2.94      | 65 | 0.000 | 0.000 | 0.413 | 0.135 | 0.137 | 0.315 | 0.000 | 0.000
65.16 | 62.08 | 63.62 | 3.80     | 3.19      | 29 | 0.000 | 0.000 | 0.667 | 0.000 | 0.000 | 0.333 | 0.000 | 0.000
65.56 | 60.77 | 63.16 | 5.35     | 3.55      | 25 | 0.000 | 0.000 | 0.463 | 0.000 | 0.000 | 0.323 | 0.000 | 0.214
62.08 | 64.14 | 63.11 | 4.15     | 2.75      | 37 | 0.000 | 0.000 | 0.891 | 0.000 | 0.000 | 0.109 | 0.000 | 0.000
63.49 | 62.52 | 63.00 | 3.52     | 2.56      | 77 | 0.000 | 0.000 | 0.308 | 0.163 | 0.225 | 0.304 | 0.000 | 0.000
65.30 | 60.45 | 62.87 | 5.36     | 3.72      | 17 | 0.000 | 0.000 | 0.841 | 0.000 | 0.000 | 0.159 | 0.000 | 0.000
58.76 | 66.19 | 62.47 | 7.74     | 4.24      | 97 | 0.459 | 0.291 | 0.250 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
61.79 | 62.94 | 62.36 | 3.27     | 2.47      | 27 | 0.000 | 0.371 | 0.629 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
62.86 | 61.45 | 62.16 | 3.79     | 2.49      | 23 | 0.000 | 0.000 | 0.372 | 0.240 | 0.000 | 0.203 | 0.000 | 0.184
62.04 | 62.26 | 62.15 | 3.50     | 2.34      | 7  | 0.000 | 0.000 | 0.571 | 0.156 | 0.000 | 0.273 | 0.000 | 0.000
60.68 | 63.36 | 62.02 | 4.26     | 3.06      | 87 | 0.000 | 0.118 | 0.558 | 0.323 | 0.000 | 0.000 | 0.000 | 0.000
62.68 | 60.76 | 61.72 | 4.14     | 3.30      | 17 | 0.000 | 0.252 | 0.352 | 0.397 | 0.000 | 0.000 | 0.000 | 0.000
62.93 | 60.47 | 61.70 | 4.20     | 3.25      | 67 | 0.000 | 0.000 | 0.421 | 0.000 | 0.000 | 0.579 | 0.000 | 0.000
61.00 | 62.16 | 61.58 | 3.58     | 2.50      | 13 | 0.018 | 0.388 | 0.441 | 0.000 | 0.000 | 0.000 | 0.000 | 0.153
60.40 | 62.52 | 61.46 | 3.84     | 2.79      | 63 | 0.000 | 0.000 | 0.227 | 0.000 | 0.773 | 0.000 | 0.000 | 0.000
61.99 | 60.86 | 61.42 | 3.18     | 2.45      | 19 | 0.000 | 0.051 | 0.417 | 0.058 | 0.000 | 0.474 | 0.000 | 0.000
61.13 | 61.59 | 61.36 | 3.39     | 2.55      | 15 | 0.000 | 0.000 | 0.392 | 0.293 | 0.000 | 0.207 | 0.000 | 0.108
57.71 | 64.60 | 61.16 | 7.14     | 3.81      | 19 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
62.33 | 58.90 | 60.62 | 4.58     | 3.57      | 57 | 0.000 | 0.000 | 0.881 | 0.000 | 0.000 | 0.000 | 0.000 | 0.119
60.65 | 59.95 | 60.30 | 2.78     | 2.24      | 75 | 0.000 | 0.317 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.683
59.83 | 60.68 | 60.25 | 3.40     | 2.48      | 69 | 0.000 | 0.000 | 0.000 | 0.336 | 0.000 | 0.000 | 0.000 | 0.664

SLIDE 28

Cosine-based kNN Classifier

cos(x_i, x_j) = (x_i · x_j) / (‖x_i‖ ‖x_j‖)

[Figure: Class A, Class B, and a test pattern in feature space.]

q = Σ_{i=1}^{k} cos(x, x_i) · c(x_i)

where c(x_i) = 1 if x_i belongs to the positive class; −1 if x_i belongs to the negative class. If q is positive, the query point is assigned to the positive class; otherwise it is assigned to the negative class.
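The rule above can be sketched in a few lines of NumPy (a minimal illustration; names are ours):

```python
import numpy as np

def cosine_knn(query, X, c, k):
    """Cosine-based kNN: q = sum over the k most similar patterns of
    cos(query, x_i) * c(x_i), where c(x_i) is +1 for the positive
    class and -1 for the negative class; sign(q) gives the label."""
    sims = X @ query / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))
    nearest = np.argsort(sims)[-k:]              # k largest similarities
    q = float(np.sum(sims[nearest] * c[nearest]))
    return 1 if q > 0 else -1

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
c = np.array([1, 1, -1])                         # class labels (+1 / -1)
print(cosine_knn(np.array([1.0, 0.05]), X, c, k=2))  # → 1
```

Note that cosine similarity depends on the angle from the origin, which is why the origin offsets on the next slide matter.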

SLIDE 29

Feature Extraction Techniques

Shifting the origin in the search space may affect classification.

[Figure: an origin shift and an extension of the feature 2 axis both change how the test pattern relates to Class A and Class B.]

Assigning different weights to each feature may also affect classification, to a lesser extent.

SLIDE 30

GA/Classifier Hybrid Architecture

  • Genetic algorithm: population of feature weights, offsets, & K

W1 W2 ... W8 | O1 O2 ... O8 | K

  • Weight vector: weights to use for each feature axis during classification
  • Offset vector: feature offsets for the cosine point of reference during classification
  • K is also optimized
  • Cosine KNN classifier returns fitness, based on the number of correct predictions using the weight vector & the number of masked features

SLIDE 31

Cosine kNN & Feature Selection

GA-trained vs. SFFS-selected Cosine kNN Classification

Mean GA-trained bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 65.286  | 66.568 | 64.003 | 4.101   | 48 | 4
Water Solvation    | 69.910  | 67.778 | 72.041 | 4.359   | 80 | 5

Mean SFFS-selected bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 60.722  | 60.490 | 60.953 | 3.506   | 43 | 4
Water Solvation    | 68.764  | 66.165 | 71.363 | 5.328   | 65 | 5

SLIDE 32

Well-studied data

GA-trained vs. SFFS-selected Cosine kNN Classification

Mean GA-trained bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 65.286  | 66.568 | 64.003 | 4.101   | 48 | 4
Water Solvation    | 69.910  | 67.778 | 72.041 | 4.359   | 80 | 5
Pima Diabetes      | 76.720  | 75.000 | 78.384 | 8.504   | 12 | 7
Breast Cancer      | 97.647  | 96.882 | 98.411 | 2.705   | 7  | 4
Heart-Statlog      | 87.318  | 80.909 | 93.727 | 15.545  | 10 | 8
Hypothyroid        | 98.598  | 97.533 | 99.641 | 2.088   | 2  | 8
Ionosphere         | 89.818  | 86.500 | 92.941 | 10.044  | 2  | 19

Mean SFFS-selected bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 60.722  | 60.490 | 60.953 | 3.506   | 43 | 4
Water Solvation    | 68.764  | 66.165 | 71.363 | 5.328   | 65 | 5
Pima Diabetes      | 59.013  | 64.567 | 53.605 | 12.696  | 21 | 2
Breast Cancer      | 87.661  | 82.617 | 92.705 | 10.735  | 19 | 5
Heart-Statlog      | 72.590  | 59.545 | 85.636 | 26.636  | 23 | 8
Hypothyroid        | 96.038  | 95.484 | 96.589 | 1.538   | 23 | 2
Ionosphere         | 87.121  | 79.312 | 94.470 | 15.488  | 3  | 5

SLIDE 33

Results

SLIDE 34

Class-conditional Distributions

[Figure: class-conditional densities P(x | ω_1) and P(x | ω_2) plotted against x.]

  • The Bayes Classifier is:

Faster; more space efficient; well studied

SLIDE 35

Hybridizing the Bayes Classifier

  • Unfortunately, the Bayes classifier is invariant to feature weighting

[Figure: two histograms of the same feature on different scales (feature value vs. proportion of training samples); P(80 < x < 100) = 0.045 on one scale and P(8 < x < 10) = 0.045 on the other.]
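The invariance can be checked numerically: rescaling a feature rescales the histogram bin edges with it, so the probability mass the Bayes classifier sees is unchanged. A sketch (we use a factor of 2 rather than the slide's factor of 10, since doubling is exact in binary floating point; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100.0, 20.0, 10_000)  # a feature measured on one scale
scaled = 2.0 * x                     # the same feature, reweighted

# The mass between corresponding bin edges is identical, so the
# class-conditional estimates (and the resulting Bayes decisions)
# do not move when a feature is reweighted.
p_orig = np.mean((x > 80) & (x < 100))
p_scaled = np.mean((scaled > 160) & (scaled < 200))
print(p_orig == p_scaled)  # → True
```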

SLIDE 36

Bayes Discriminant Function

  • Bayes Decision Rule:

Given x, decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i

  • Two-class Discriminant Function:

g(x) = P(ω_1 | x) − P(ω_2 | x)
     = [P(x | ω_1) P(ω_1) − P(x | ω_2) P(ω_2)] / Σ_{i=1,2} P(x | ω_i) P(ω_i)
     ∝ P(x | ω_1) P(ω_1) − P(x | ω_2) P(ω_2)

SLIDE 37

Naïve Bayes Discriminant

g(x) = P(x | ω_1) P(ω_1) − P(x | ω_2) P(ω_2)

  • Independence Assumption:

g(x) = P(x_1 | ω_1) ··· P(x_d | ω_1) P(ω_1) − P(x_1 | ω_2) ··· P(x_d | ω_2) P(ω_2)

Since a > b ⇒ log(a) > log(b), the comparison can be made in log space:

g(x) = [Σ_d log P(x_d | ω_1) + log P(ω_1)] − [Σ_d log P(x_d | ω_2) + log P(ω_2)]
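The log-space discriminant is a one-liner once the per-feature likelihoods are available. A minimal sketch (the interface, with likelihoods passed in already evaluated at the query point, is our illustration):

```python
import math

def naive_bayes_log_g(lik_w1, lik_w2, prior_w1, prior_w2):
    """Two-class naive-Bayes discriminant in log form:
    g = [sum_d log P(x_d|w1) + log P(w1)]
      - [sum_d log P(x_d|w2) + log P(w2)].
    g > 0 -> decide omega_1; otherwise omega_2."""
    return (sum(math.log(p) for p in lik_w1) + math.log(prior_w1)
            - sum(math.log(p) for p in lik_w2) - math.log(prior_w2))

# Equal priors; each feature is twice as likely under class 1:
g = naive_bayes_log_g([0.5, 0.5], [0.25, 0.25], 0.5, 0.5)
print(g > 0)  # → True (decide omega_1)
```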

SLIDE 38

A Parameterized Discriminant

P*(x | ω_i) = C_1 log P(x_1 | ω_i) + C_2 log P(x_2 | ω_i) + ··· + C_d log P(x_d | ω_i) + log P(ω_i)

  • C_1, C_2, ..., C_d are optimized by an evolutionary algorithm.
  • Priors can also be optimized
  • Perhaps even the assumed covariance matrix
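As a sketch of how the EA-tuned coefficients enter the score (names are ours; with all C_d = 1 this reduces to the ordinary naive-Bayes log posterior):

```python
import math

def parameterized_score(likelihoods, prior, coeffs):
    """Per-class score P*(x|w_i) = C_1*log P(x_1|w_i) + ...
    + C_d*log P(x_d|w_i) + log P(w_i); decide the class with the
    larger score. The C_d (and optionally the priors) are the
    quantities the evolutionary algorithm optimizes."""
    return (sum(c * math.log(p) for c, p in zip(coeffs, likelihoods))
            + math.log(prior))

# With unit coefficients the score is the ordinary log posterior:
s = parameterized_score([0.5, 0.5], 0.5, [1.0, 1.0])
print(abs(s - math.log(0.125)) < 1e-9)  # → True
```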
SLIDE 39

EC Approaches

  • Knowledge Discovery

Solvation Prediction; Mass Spectroscopy Analysis

  • Protein Structure Modeling/Prediction

Combinatorial comparative modeling

SLIDE 40

Fold Recognition

  • Compare the target protein (unknown structure) to all of the candidate template proteins; pick the best few
  • Compare using sequence, SS, SA, burial

Sequence: Psi-Blast variant (Yona & Levitt)
Secondary structure: Jnet vs. dssp
Burial: hydrophobic conservation vs. shielding & dssp
Average features over multiple structures

SLIDE 41

Combinatorial Fold Recognition

  • Current fold recognition is generally at the fold family or domain level.
  • PSSM profile comparison
slide-42
SLIDE 42
  • M. Raymer, Interface 2004

42

The first four strands of OB-folds... The first four strands of OB-folds...

SLIDE 43

The fifth strand, clustered

SLIDE 44

The second helix, clustered

SLIDE 45

Combinatoric Modeling

  • CMPare – Combinatoric Modeling of Proteins
  • Create populations of chimeric proteins by swapping fragments between the family members of selected structures
  • Test each by Multiple Sequence Threading (Taylor)
  • Reduce the workload by recombining only representative canonical structures

SLIDE 46

Acknowledgments

Wright State University
  • Dr. Travis Doom
  • Dr. Dan Krane
  • Dr. Jerry Alter
  • Michael Peterson
  • Deacon Sweeney

Michigan State University
  • Dr. Bill Punch
  • Dr. Leslie Kuhn