

SLIDE 1

Knowledge discovery in large biological data sets using hybrid classifier/evolutionary algorithms

  • Dr. Michael L. Raymer

Department of Computer Science and Engineering / Biomedical Sciences Program

SLIDE 2
  • M. Raymer, Interface 2004

EC Approaches

  • Knowledge Discovery

Solvation Prediction

  • Protein Structure Modeling/Prediction

Combinatorial comparative modeling

SLIDE 3

Ligand Screening & Docking

  • Complementarity

Shape, chemical, electrostatic

SLIDE 4

Solvation complication

  • The protein surface is highly solvated

Protein crystals are 27–77% water

SLIDE 5

Solvation conservation

  • Question 1:

Given a solvated crystal structure, find those water molecules that are likely to be conserved upon protein-ligand binding.

[Figure: protein surface, ligand, and water molecules.]

SLIDE 6

Water Binding Site Prediction

Question 2: Given a structural model or unsolvated structure, identify likely solvent binding positions.

Unsolvated and solvated Aspartic Protease (3APR) with peptidyl inhibitor.

SLIDE 7

Pattern Recognition Approach

[Figure: labeled training data, described by features f1–f5, train a classifier; the classifier then produces a classification/prediction for an unlabeled pattern with the same features.]

SLIDE 8

Crystallographic Waters

  • False positives

Crystallographic interfacial waters; reduction of R-free by including water molecules

  • False negatives

Poor resolution; smeared density and computational refinement

SLIDE 9

Data Set Generation

  • 30 pairs of proteins: ligand-bound and unbound

Minimal conformational change upon binding (backbone RMSD < 0.5 Å); 2.0 Å or better resolution; low residual error (R < 0.22)

  • ~3000 water molecules in the first hydration shell

SLIDE 10

Conserved and Displaced

Rigid body superimposition of aspartic protease, unbound structure (2APR, red), along with the peptidyl ligand-bound structure (3APR, cyan). Only active-site waters of the bound structure are shown.

SLIDE 11

Probe Site Generation

Aspartic protease (2APR) with crystallographically observed and computer-generated water molecules.
SLIDE 12

Feature Generation

  • Computable from crystal coordinates, or (less desirable) structure factors
  • Empirical
  • Likely to be associated with water binding
SLIDE 13

Atomic Density (ADN)

A water molecule in the ligand-free structure of dihydrofolate reductase (1DR2). The atomic density of this water molecule is 5.

SLIDE 14

Prediction of water molecules

DHFR complex with biopterin, colored according to AHP (1DR2/1DR3). Displaced water molecules from the free structure are shown as wireframe spheres.

SLIDE 15

Temperature Factor (B-Value)

The backbone of dihydrofolate reductase (1DR2) is shown as ribbons colored according to crystallographic temperature factor (B-value).

SLIDE 16

Features Measured

  • Temperature factor (BVAL)
  • Atomic Density (ADN)
  • Atomic Hydrophilicity (AHP)
  • Hydrogen bonds to protein (HBDP)
  • Hydrogen bonds to water (HBDW)
  • Mobility (MOB)
  • ABVAL
  • NBVAL

MOB = (B_w / B_avg) / (Occ_w / Occ_avg)

where B_w and Occ_w are the water molecule's temperature factor and occupancy, and B_avg and Occ_avg are the averages over all waters in the structure.
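As a quick sketch, the mobility feature above can be computed directly from a water's B-value and occupancy (function and argument names are ours, not from the talk):

```python
def mobility(b, occ, b_avg, occ_avg):
    """Mobility (MOB) of a water molecule: its temperature factor (B)
    and occupancy (Occ), each normalized by the averages over all
    waters in the structure. Larger values indicate a more mobile,
    less well-ordered water."""
    return (b / b_avg) / (occ / occ_avg)

# A fully occupied water with twice the average B-value is twice as
# "mobile" as an average water in the same structure.
print(mobility(b=60.0, occ=1.0, b_avg=30.0, occ_avg=1.0))  # → 2.0
```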

SLIDE 17

Highly overlapping distributions

B-value, H-bonds, and AHP, rotated to show the distribution. PCA shows similar overlap among the first two components. LDA obtains nearly random (55%) two-class accuracy.

SLIDE 18

Knowledge Discovery

The black box classifier does not help elucidate why the water molecules bind where they do.

Unsolvated and solvated Aspartic Protease (3APR) with peptidyl inhibitor.

SLIDE 19

EC: Feature Extraction

[Figure: a large-n, moderate-d database feeds a feature space projection (EA), which in turn feeds a classifier (KNN).]

SLIDE 20

Feature Weighted kNN

[Figure: (a) original feature space (feature 1 vs. feature 2) with Class 1, Class 2, and an unknown pattern; (b) the same space with the scale of feature 2 extended, changing which neighbors are nearest to the unknown.]

SLIDE 21

GA & kNN Interaction

[Figure: the genetic algorithm maintains a population of chromosomes (W1 W2 W3 W4 W5, ...); each masked weight vector & k is passed to the KNN classifier, which returns a fitness.]

Fitness — how is it calculated?

SLIDE 22

Weighting and Masking

  • How do we sample feature subsets?

Weight below a threshold value: slow sampling. Masking:

  • Classifier parameters (k) on the chromosome

73.2 | W1 W2 W3 W4 W5 | M1 M2 M3 M4 M5 | k
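The weight/mask chromosome reads directly as a kNN parameterization. A minimal NumPy sketch (the encoding details and names are our illustration, not the talk's exact implementation):

```python
import numpy as np

def masked_weighted_knn(query, X, y, weights, masks, k):
    """kNN where each feature axis is scaled by a GA-supplied weight
    and switched off entirely by a binary mask; k itself also comes
    from the chromosome [W1..Wd | M1..Md | k]."""
    w = np.asarray(weights) * np.asarray(masks)      # masked weight vector
    dists = np.linalg.norm((X - query) * w, axis=1)  # weighted Euclidean
    nearest = np.argsort(dists)[:k]
    return np.bincount(y[nearest]).argmax()          # majority vote

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(masked_weighted_knn(np.array([1.0, 0.5]), X, y,
                          weights=[1.0, 1.0], masks=[1, 1], k=3))  # → 0
```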

SLIDE 23

The Cost Function

  • We can direct the search toward any objective.

Classification accuracy; class balance; feature subset parsimony (reduce d)

  • The GA minimizes the cost function:

cost(w, k) = C_acc · err(w, k) + C_pars · nonzero(w) + C_vote · incorrect_votes(w, k) + C_bal · bal(w, k)
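Read as code, the cost is a weighted sum of the four objectives. The coefficient values below are illustrative placeholders only; the talk does not give them:

```python
def ga_cost(err, nonzero, incorrect_votes, bal,
            c_acc=1.0, c_pars=0.01, c_vote=0.1, c_bal=0.05):
    """cost(w, k) = C_acc*err + C_pars*nonzero
                  + C_vote*incorrect_votes + C_bal*bal.
    The GA searches for the (weights, masks, k) chromosome that
    minimizes this value; the C_* trade accuracy against parsimony
    and class balance."""
    return (c_acc * err + c_pars * nonzero
            + c_vote * incorrect_votes + c_bal * bal)

# Example: 30% error rate, 5 unmasked features, 12 incorrect
# neighbor votes, class balance 4.2
print(ga_cost(err=0.30, nonzero=5, incorrect_votes=12, bal=4.2))
```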

SLIDE 24

Data Partitioning

  • Classifier training
  • Tuning/fitness calculation
  • Validation

SLIDE 25

Cross Validation Results

Solvation site prediction — Accuracy (%):

Classifier      | Total  | non-site | site   | Balance
Logistic        | 69.331 | 65.496   | 73.164 | 7.668
NeuralNetwork   | 69.293 | 66.003   | 72.582 | 6.579
VotedPerceptron | 69.246 | 66.754   | 71.737 | 4.983
SMO             | 69.068 | 57.759   | 76.470 | 18.711

Ligand-binding conservation prediction — Accuracy (%):

Classifier      | Total  | disp   | cons   | Balance
NeuralNetwork   | 66.618 | 44.174 | 80.705 | 36.531
j48             | 66.023 | 37.061 | 84.200 | 47.138
ADTree          | 65.969 | 44.268 | 79.589 | 35.321
VotedPerceptron | 65.742 | 36.453 | 84.141 | 47.688

SLIDE 26

Solvated vs. Non-solvated

Bootstrap accuracy (%), balance, and feature weights:

Solvated | Non-solvated | Total | Bal Mean | Bal StDev | K  | ADN   | AHP   | HBDP  | HBDW  | ABVAL | NBVAL
68.60    | 67.75        | 68.18 | 3.50     | 2.66      | 23 | 0.284 | 0.000 | 0.524 | 0.000 | 0.000 | 0.193
66.87    | 68.74        | 67.81 | 3.50     | 2.71      | 75 | 0.545 | 0.000 | 0.235 | 0.000 | 0.000 | 0.219
65.93    | 65.93        | 65.93 | 3.20     | 2.51      | 53 | 0.530 | 0.000 | 0.197 | 0.000 | 0.000 | 0.274
65.21    | 70.44        | 67.83 | 5.49     | 3.68      | 45 | 0.194 | 0.000 | 0.721 | 0.000 | 0.086 | 0.000
66.19    | 69.40        | 67.79 | 4.39     | 3.27      | 35 | 0.332 | 0.000 | 0.643 | 0.000 | 0.024 | 0.000
67.32    | 67.74        | 67.53 | 3.52     | 2.42      | 61 | 0.651 | 0.000 | 0.289 | 0.000 | 0.060 | 0.000

SLIDE 27

Conserved vs. Non-Conserved

Bootstrap accuracy (%), balance, and feature weights:

Disp  | Cons  | Total | Bal Mean | Bal StDev | K  | ADN   | AHP   | BVAL  | HBDP  | HBDW  | MOB   | ABVAL | NBVAL
65.44 | 62.96 | 64.20 | 3.88     | 2.94      | 65 | 0.000 | 0.000 | 0.413 | 0.135 | 0.137 | 0.315 | 0.000 | 0.000
65.16 | 62.08 | 63.62 | 3.80     | 3.19      | 29 | 0.000 | 0.000 | 0.667 | 0.000 | 0.000 | 0.333 | 0.000 | 0.000
65.56 | 60.77 | 63.16 | 5.35     | 3.55      | 25 | 0.000 | 0.000 | 0.463 | 0.000 | 0.000 | 0.323 | 0.000 | 0.214
62.08 | 64.14 | 63.11 | 4.15     | 2.75      | 37 | 0.000 | 0.000 | 0.891 | 0.000 | 0.000 | 0.109 | 0.000 | 0.000
63.49 | 62.52 | 63.00 | 3.52     | 2.56      | 77 | 0.000 | 0.000 | 0.308 | 0.163 | 0.225 | 0.304 | 0.000 | 0.000
65.30 | 60.45 | 62.87 | 5.36     | 3.72      | 17 | 0.000 | 0.000 | 0.841 | 0.000 | 0.000 | 0.159 | 0.000 | 0.000
58.76 | 66.19 | 62.47 | 7.74     | 4.24      | 97 | 0.459 | 0.291 | 0.250 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
61.79 | 62.94 | 62.36 | 3.27     | 2.47      | 27 | 0.000 | 0.371 | 0.629 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
62.86 | 61.45 | 62.16 | 3.79     | 2.49      | 23 | 0.000 | 0.000 | 0.372 | 0.240 | 0.000 | 0.203 | 0.000 | 0.184
62.04 | 62.26 | 62.15 | 3.50     | 2.34      | 7  | 0.000 | 0.000 | 0.571 | 0.156 | 0.000 | 0.273 | 0.000 | 0.000
60.68 | 63.36 | 62.02 | 4.26     | 3.06      | 87 | 0.000 | 0.118 | 0.558 | 0.323 | 0.000 | 0.000 | 0.000 | 0.000
62.68 | 60.76 | 61.72 | 4.14     | 3.30      | 17 | 0.000 | 0.252 | 0.352 | 0.397 | 0.000 | 0.000 | 0.000 | 0.000
62.93 | 60.47 | 61.70 | 4.20     | 3.25      | 67 | 0.000 | 0.000 | 0.421 | 0.000 | 0.000 | 0.579 | 0.000 | 0.000
61.00 | 62.16 | 61.58 | 3.58     | 2.50      | 13 | 0.018 | 0.388 | 0.441 | 0.000 | 0.000 | 0.000 | 0.000 | 0.153
60.40 | 62.52 | 61.46 | 3.84     | 2.79      | 63 | 0.000 | 0.000 | 0.227 | 0.000 | 0.773 | 0.000 | 0.000 | 0.000
61.99 | 60.86 | 61.42 | 3.18     | 2.45      | 19 | 0.000 | 0.051 | 0.417 | 0.058 | 0.000 | 0.474 | 0.000 | 0.000
61.13 | 61.59 | 61.36 | 3.39     | 2.55      | 15 | 0.000 | 0.000 | 0.392 | 0.293 | 0.000 | 0.207 | 0.000 | 0.108
57.71 | 64.60 | 61.16 | 7.14     | 3.81      | 19 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
62.33 | 58.90 | 60.62 | 4.58     | 3.57      | 57 | 0.000 | 0.000 | 0.881 | 0.000 | 0.000 | 0.000 | 0.000 | 0.119
60.65 | 59.95 | 60.30 | 2.78     | 2.24      | 75 | 0.000 | 0.317 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.683
59.83 | 60.68 | 60.25 | 3.40     | 2.48      | 69 | 0.000 | 0.000 | 0.000 | 0.336 | 0.000 | 0.000 | 0.000 | 0.664

SLIDE 28

Cosine-based kNN Classifier

cos(x_i, x_j) = (x_i · x_j) / (‖x_i‖ ‖x_j‖)

[Figure: Class A, Class B, and a test pattern in feature space.]

q = Σ_{i=1}^{k} cos(x, x_i) · c(x_i)

where c(x_i) = 1 if x_i belongs to the positive class; −1 if x_i belongs to the negative class. If q is positive, the query point is assigned to the positive class; otherwise it is assigned to the negative class.
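The rule above can be sketched in a few lines of NumPy (a minimal illustration; names are ours):

```python
import numpy as np

def cosine_knn(query, X, c, k):
    """Cosine-based kNN: q = sum over the k most similar patterns of
    cos(query, x_i) * c(x_i), where c(x_i) is +1 for the positive
    class and -1 for the negative class; sign(q) gives the label."""
    sims = X @ query / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))
    nearest = np.argsort(sims)[-k:]              # k largest similarities
    q = float(np.sum(sims[nearest] * c[nearest]))
    return 1 if q > 0 else -1

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
c = np.array([1, 1, -1])                         # class labels (+1 / -1)
print(cosine_knn(np.array([1.0, 0.05]), X, c, k=2))  # → 1
```

Note that cosine similarity depends on the angle from the origin, which is why the origin offsets on the next slide matter.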

SLIDE 29

Feature Extraction Techniques

Shifting the origin in the search space may affect classification.

[Figure: an origin shift and an extension of the feature 2 axis both change how the test pattern relates to Class A and Class B.]

Assigning different weights to each feature may also affect classification, to a lesser extent.

SLIDE 30

GA/Classifier Hybrid Architecture

  • Genetic algorithm: population of feature weights, offsets, & K

W1 W2 ... W8 | O1 O2 ... O8 | K

  • Weight vector: weights to use for each feature axis during classification
  • Offset vector: feature offsets for the cosine point of reference during classification
  • K is also optimized
  • Cosine KNN classifier returns fitness, based on the number of correct predictions using the weight vector & the number of masked features

SLIDE 31

Cosine kNN & Feature Selection

GA-trained vs. SFFS-selected Cosine kNN Classification

Mean GA-trained bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 65.286  | 66.568 | 64.003 | 4.101   | 48 | 4
Water Solvation    | 69.910  | 67.778 | 72.041 | 4.359   | 80 | 5

Mean SFFS-selected bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 60.722  | 60.490 | 60.953 | 3.506   | 43 | 4
Water Solvation    | 68.764  | 66.165 | 71.363 | 5.328   | 65 | 5

SLIDE 32

Well-studied data

GA-trained vs. SFFS-selected Cosine kNN Classification

Mean GA-trained bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 65.286  | 66.568 | 64.003 | 4.101   | 48 | 4
Water Solvation    | 69.910  | 67.778 | 72.041 | 4.359   | 80 | 5
Pima Diabetes      | 76.720  | 75.000 | 78.384 | 8.504   | 12 | 7
Breast Cancer      | 97.647  | 96.882 | 98.411 | 2.705   | 7  | 4
Heart-Statlog      | 87.318  | 80.909 | 93.727 | 15.545  | 10 | 8
Hypothyroid        | 98.598  | 97.533 | 99.641 | 2.088   | 2  | 8
Ionosphere         | 89.818  | 86.500 | 92.941 | 10.044  | 2  | 19

Mean SFFS-selected bootstrap accuracy:

Dataset            | Overall | Class1 | Class2 | Balance | K  | # F
Water Conservation | 60.722  | 60.490 | 60.953 | 3.506   | 43 | 4
Water Solvation    | 68.764  | 66.165 | 71.363 | 5.328   | 65 | 5
Pima Diabetes      | 59.013  | 64.567 | 53.605 | 12.696  | 21 | 2
Breast Cancer      | 87.661  | 82.617 | 92.705 | 10.735  | 19 | 5
Heart-Statlog      | 72.590  | 59.545 | 85.636 | 26.636  | 23 | 8
Hypothyroid        | 96.038  | 95.484 | 96.589 | 1.538   | 23 | 2
Ionosphere         | 87.121  | 79.312 | 94.470 | 15.488  | 3  | 5

SLIDE 33

Results

SLIDE 34

Class-conditional Distributions

[Figure: class-conditional densities P(x | ω_1) and P(x | ω_2) plotted against x.]

  • The Bayes Classifier is:

Faster; more space efficient; well studied

SLIDE 35

Hybridizing the Bayes Classifier

  • Unfortunately, the Bayes classifier is invariant to feature weighting

[Figure: two histograms of the same feature on different scales (feature value vs. proportion of training samples); P(80 < x < 100) = 0.045 on one scale and P(8 < x < 10) = 0.045 on the other.]
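The invariance can be checked numerically: rescaling a feature rescales the histogram bin edges with it, so the probability mass the Bayes classifier sees is unchanged. A sketch (we use a factor of 2 rather than the slide's factor of 10, since doubling is exact in binary floating point; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100.0, 20.0, 10_000)  # a feature measured on one scale
scaled = 2.0 * x                     # the same feature, reweighted

# The mass between corresponding bin edges is identical, so the
# class-conditional estimates (and the resulting Bayes decisions)
# do not move when a feature is reweighted.
p_orig = np.mean((x > 80) & (x < 100))
p_scaled = np.mean((scaled > 160) & (scaled < 200))
print(p_orig == p_scaled)  # → True
```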

SLIDE 36

Bayes Discriminant Function

  • Bayes Decision Rule:

Given x, decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i

  • Two-class Discriminant Function:

g(x) = P(ω_1 | x) − P(ω_2 | x)
     = [P(x | ω_1) P(ω_1) − P(x | ω_2) P(ω_2)] / Σ_{i=1,2} P(x | ω_i) P(ω_i)
     ∝ P(x | ω_1) P(ω_1) − P(x | ω_2) P(ω_2)

SLIDE 37

Naïve Bayes Discriminant

g(x) = P(x | ω_1) P(ω_1) − P(x | ω_2) P(ω_2)

  • Independence Assumption:

g(x) = P(x_1 | ω_1) ··· P(x_d | ω_1) P(ω_1) − P(x_1 | ω_2) ··· P(x_d | ω_2) P(ω_2)

Since a > b ⇒ log(a) > log(b), the comparison can be made in log space:

g(x) = [Σ_d log P(x_d | ω_1) + log P(ω_1)] − [Σ_d log P(x_d | ω_2) + log P(ω_2)]
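The log-space discriminant is a one-liner once the per-feature likelihoods are available. A minimal sketch (the interface, with likelihoods passed in already evaluated at the query point, is our illustration):

```python
import math

def naive_bayes_log_g(lik_w1, lik_w2, prior_w1, prior_w2):
    """Two-class naive-Bayes discriminant in log form:
    g = [sum_d log P(x_d|w1) + log P(w1)]
      - [sum_d log P(x_d|w2) + log P(w2)].
    g > 0 -> decide omega_1; otherwise omega_2."""
    return (sum(math.log(p) for p in lik_w1) + math.log(prior_w1)
            - sum(math.log(p) for p in lik_w2) - math.log(prior_w2))

# Equal priors; each feature is twice as likely under class 1:
g = naive_bayes_log_g([0.5, 0.5], [0.25, 0.25], 0.5, 0.5)
print(g > 0)  # → True (decide omega_1)
```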

SLIDE 38

A Parameterized Discriminant

P*(x | ω_i) = C_1 log P(x_1 | ω_i) + C_2 log P(x_2 | ω_i) + ··· + C_d log P(x_d | ω_i) + log P(ω_i)

  • C_1, C_2, ..., C_d are optimized by an evolutionary algorithm.
  • Priors can also be optimized
  • Perhaps even the assumed covariance matrix
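As a sketch of how the EA-tuned coefficients enter the score (names are ours; with all C_d = 1 this reduces to the ordinary naive-Bayes log posterior):

```python
import math

def parameterized_score(likelihoods, prior, coeffs):
    """Per-class score P*(x|w_i) = C_1*log P(x_1|w_i) + ...
    + C_d*log P(x_d|w_i) + log P(w_i); decide the class with the
    larger score. The C_d (and optionally the priors) are the
    quantities the evolutionary algorithm optimizes."""
    return (sum(c * math.log(p) for c, p in zip(coeffs, likelihoods))
            + math.log(prior))

# With unit coefficients the score is the ordinary log posterior:
s = parameterized_score([0.5, 0.5], 0.5, [1.0, 1.0])
print(abs(s - math.log(0.125)) < 1e-9)  # → True
```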
SLIDE 39

EC Approaches

  • Knowledge Discovery

Solvation Prediction; Mass Spectroscopy Analysis

  • Protein Structure Modeling/Prediction

Combinatorial comparative modeling

SLIDE 40

Fold Recognition

  • Compare the target protein (unknown structure) to all of the candidate template proteins; pick the best few
  • Compare using sequence, SS, SA, burial

Sequence: Psi-Blast variant (Yona & Levitt)
Secondary structure: Jnet vs. dssp
Burial: hydrophobic conservation vs. shielding & dssp
Average features over multiple structures

SLIDE 41

Combinatorial Fold Recognition

  • Current fold recognition is generally at the fold family or domain level.
  • PSSM profile comparison
slide-42
SLIDE 42
  • M. Raymer, Interface 2004

42

The first four strands of OB-folds... The first four strands of OB-folds...

SLIDE 43

The fifth strand, clustered

SLIDE 44

The second helix, clustered

SLIDE 45

Combinatoric Modeling

  • CMPare – Combinatoric Modeling of Proteins
  • Create populations of chimeric proteins by swapping fragments between the family members of selected structures
  • Test each by Multiple Sequence Threading (Taylor)
  • Reduce the workload by recombining only representative canonical structures

SLIDE 46

Acknowledgments

Wright State University
  • Dr. Travis Doom
  • Dr. Dan Krane
  • Dr. Jerry Alter
  • Michael Peterson
  • Deacon Sweeney

Michigan State University
  • Dr. Bill Punch
  • Dr. Leslie Kuhn