VISUALIZING THE PROTEIN SEQUENCE UNIVERSE
L. STANBERRY¹, R. HIGDON¹, W. HAYNES¹, N. KOLKER¹, W. BROOMALL¹, S. EKANAYAKE², A. HUGHES², Y. RUAN², J. QIU², E. KOLKER¹, G. FOX²
¹ SEATTLE CHILDREN'S, ² INDIANA UNIVERSITY
ECMLS 2012, 18 June 2012

Outline
- A 4th paradigm problem in biology
- Assigning functions (annotating) proteins
- Challenge
- Our goal
- PSU: methods & initial results
- Conclusions
Visualizing PSU ECMLS 2012

Grand Challenge of Functional Genomics
- New technologies produce peta- and exabytes of data
- Protein Sequence Universe (PSU), the protein sequence space, expands exponentially
- EMP, i5K, iPlant, NEON
- 30% of existing sequenced proteins unannotated
- Existing resources overwhelmed, many unsupported: COG, SYSTERS, CluSTr, eggNOG

Ultimate Goal: Annotate All Proteins
Our approach:
- Revitalize, expand & enhance protein annotation resources.
- Develop sustainable software framework.
- Use HPC and most powerful CI: grids & clouds.
- Provide rigorous and reliable tools to annotate protein sequences.

COG: Clusters of Orthologous Groups
- COG database was developed by NCBI.
- Proteins classified into groups with common function encoded in complete genomes.
- Prokaryotes (COG): 66 genomes, 200K proteins, 5K clusters.
- Eukaryotes (KOG): 7 genomes, 113K proteins, 5K clusters.
- Valuable scientific resource: 5K citations.
- Last updated: 2006.

Clustering 10 million UniRef100
- UniRef100: 10M proteins, including 5.3M bacterial & archaeal
- BLAST: a common sequence alignment approach
- All vs. all alignment on Azure
- 475 eight-core virtual machines produced 3+ billion filtered records in 6 days

Clustering 10 million UniRef100 (continued)
- Use prokaryotic COG as a starting point.
- Expand COGs ~20 fold (3.5 million proteins).
- Cluster 2M proteins into 500K functional groups.
- Single linkage clustering with MapReduce framework on Hadoop.
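Single linkage clustering at a fixed cutoff is equivalent to finding the connected components of the graph whose edges are the retained alignment hits. The following is a minimal single-machine sketch of that idea using union-find; it is not the MapReduce/Hadoop implementation from the talk, and the protein IDs and hit list are invented for illustration.

```python
def single_linkage(ids, hits):
    """Group ids into connected components of the hit graph (single linkage)."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical data: each pair is a hit that passed the similarity filter.
ids = ["P1", "P2", "P3", "P4", "P5"]
hits = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
clusters = single_linkage(ids, hits)
```

P1 and P3 share no direct hit but land in the same cluster through P2, which is the chaining behaviour that characterizes single linkage.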

Promise and Challenge of Annotation
- Clustering facilitates mass annotation, BUT
- Takes considerable effort and expertise
- Multiple cloud systems and compute solutions

Public Resources
- Struggle to cope with the influx of data
- Provide limited interactive and analytic capabilities
- Many no longer supported (SYSTERS, CluSTr, COG)
- Biological community needs a scalable, sustainable and efficient approach to visualize, explore and annotate new data.

Protein Sequence Universe
- PSU goal: enhance annotation resources with analytic and visualization (browser) tools.
- Project sequence data into 3D using multidimensional scaling (MDS).
- MDS interpolation allows expanding the universe without the time-consuming all-vs-all O(N^2) computation.
- 3D map allows much faster interpolation.
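One way to picture MDS interpolation: keep the already-mapped points fixed and place a single new point so that its 3D distances to a few reference points match its dissimilarities to them. The sketch below does this with plain gradient descent on the squared mismatch. It is an illustration under stated assumptions, not the interpolation method used by the authors; the function name, reference coordinates, step size and iteration count are all hypothetical.

```python
import math


def interpolate_point(refs, targets, steps=2000, lr=0.05):
    """Place one new point so its distances to fixed refs match targets."""
    dim = len(refs[0])
    # Start near the centroid of the references, nudged so no distance is zero.
    x = [sum(p[k] for p in refs) / len(refs) + 1e-3 * (k + 1) for k in range(dim)]
    for _ in range(steps):
        grad = [0.0] * dim
        for p, t in zip(refs, targets):
            d = math.dist(x, p)
            if d == 0.0:
                continue  # gradient undefined at a reference point; skip it
            coef = 2.0 * (d - t) / d
            for k in range(dim):
                grad[k] += coef * (x[k] - p[k])
        x = [x[k] - lr * grad[k] for k in range(dim)]
    return x


# Hypothetical fixed 3D map of four reference sequences.
refs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
truth = (0.3, 0.4, 0.2)  # where the new point "should" land
targets = [math.dist(truth, p) for p in refs]
placed = interpolate_point(refs, targets)
```

Each new sequence needs distances to only a handful of references, so the cost per added sequence does not grow with the square of the collection size.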

Multi-Dimensional Scaling (MDS)
- Sammon's objective function:
  \[ H = \sum_{i<j}^{n} \frac{\left[ f(\delta_{ij}) - d(x_i, x_j) \right]^2}{f(\delta_{ij})} \]
- δ_ij is the dissimilarity measure between sequences i and j.
- d is the Euclidean distance between projections x_i and x_j.
- Denominator: larger contribution from smaller dissimilarities.
- f is a monotone transformation of the dissimilarity measure, chosen "artistically".
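To make the objective concrete, the sketch below evaluates a Sammon-type stress for a given set of projections. It assumes the objective is the sum over pairs i < j of the squared difference between f(δ_ij) and the Euclidean distance d(x_i, x_j), divided by f(δ_ij); the function name, the identity default for f, and the toy data are illustrative choices.

```python
import math


def sammon_stress(delta, coords, f=lambda d: d):
    """Sum over pairs i < j of (f(delta_ij) - d(x_i, x_j))^2 / f(delta_ij)."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            target = f(delta[i][j])
            diff = target - math.dist(coords[i], coords[j])
            # Dividing by the target weights small dissimilarities more heavily.
            total += diff * diff / target
    return total


# Toy example: a 3-4-5 triangle embedded in 3D.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
exact = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
off = [[0, 3, 4], [3, 0, 6], [4, 6, 0]]  # one dissimilarity overstated by 1

h_exact = sammon_stress(exact, coords)
h_off = sammon_stress(off, coords)
```

The stress is zero when the projected distances reproduce the dissimilarities exactly, and the single mismatched pair contributes (6 - 5)^2 / 6.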

[Figure: Typical Metagenomic MDS]

MDS Details
- f chosen heuristically to increase the ratio of standard deviation to mean for f(δ_ij) and to increase the range of dissimilarity measures.
- O(n^2) complexity to map n sequences into 3D.
- MDS can be solved using EM (SMACOF: fastest but limited) or directly by Newton's method (it's just χ^2).
- Used robust implementation of nonlinear χ^2 minimization with Levenberg-Marquardt.
- 3D projections visualized in PlotViz.
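The ratio of standard deviation to mean mentioned in the first bullet is easy to inspect numerically. The sketch below compares it before and after a monotone transformation; the power transform and the sample values are assumptions chosen for illustration, not the transformation or data used in the talk.

```python
import statistics


def cv(values):
    """Ratio of (population) standard deviation to mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def power_transform(values, n):
    """Monotone transform d -> d**n for positive d and n > 0."""
    return [v ** n for v in values]


# Hypothetical dissimilarities that are tightly concentrated around 1.
d = [0.8, 0.9, 1.0, 1.1, 1.2]
cv_raw = cv(d)
cv_sq = cv(power_transform(d, 2))
cv_sqrt = cv(power_transform(d, 0.5))
```

For this sample, squaring spreads the values out (larger ratio) and taking the square root compresses them (smaller ratio), which is the kind of effect a heuristic choice of f can exploit.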

MDS Details (continued)
- Input data: 100K sequences from well-characterized prokaryotic COGs.
- Proximity measure: sequence alignment scores.
- Scores calculated using Needleman-Wunsch.
- Scores "sqrt 4D" transformed and fed into MDS.
- Analytic form for transformation to 4D.
- δ_ij^n decreases dimension for n > 1; increases it for n < 1.
- "sqrt 4D" reduced dimension of distance data from 244 for δ_ij to 14 for f(δ_ij).
- Hence more uniform coverage of Euclidean space.
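For readers unfamiliar with Needleman-Wunsch, the sketch below computes a global alignment score by dynamic programming. The scoring scheme (match +1, mismatch -1, linear gap -2) is an assumption made to keep the example short; the slides do not state the scoring used, and protein alignments in practice typically rely on a substitution matrix and affine gap penalties.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings a and b (Needleman-Wunsch)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap       # gap in b
            left = cur[j - 1] + gap  # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


identical = nw_score("GATTACA", "GATTACA")
one_gap = nw_score("ACGT", "AGT")
```

Because the alignment is global, every pair of sequences receives a score, which matches the later remark that NW yields a complete set of distances.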

[Figure: 3D View of 100K COG Sequences]

Implementation
- NW computed in parallel on a 100-node, 8-core system.
- Used Twister (IU) in the Reduce phase of MapReduce.
- MDS calculations performed on a 768-core MS HPC cluster (32 nodes).
- Scaling: parallel MPI with intranode threading.
- Parallel efficiency of the code approximately 70%.
- Efficiency lost due to memory bandwidth saturation.
- NW required 1 day; the MDS job required 3 days.

Cluster Annotation

COG      Annotation                                                           UniRef100
COG1131  ABC-type multidrug transport system, ATPase component                14406
COG1136  ABC-type antimicrobial peptide transport system, ATPase component    7306
COG1126  ABC-type polar amino acid transport system, ATPase component         4061
COG3839  ABC-type sugar transport systems, ATPase component                   4121
COG0444  ABC-type dipeptide/oligopeptide/nickel transport system ATPase comp  3520
COG4608  ABC-type oligopeptide transport system, ATPase component             3074
COG3842  ABC-type spermidine/putrescine transport systems, ATPase comp        3665
COG0333  Ribosomal protein L32                                                1148
COG0454  Histone acetyltransferase HPA2 and related acetyltransferases        14085
COG0477  Permeases of the major facilitator superfamily                       48590
COG1028  Dehydrogenases with different specificities                          37461

[Figure: Heatmap of NW vs Euclidean Distances]

[Figure: Dendrogram of Cluster Centroids]

[Figure: Selected Clusters]

[Figure: Heatmap for Selected Clusters]

Future Steps
- Comparison of Needleman-Wunsch vs. Blast vs. PSIBlast
- NW easier as complete; Blast has missing distances
- Different transformations: distance → monotonic function(distance), to reduce formal starting dimension (increase sigma/mean)
- Automate cluster consensus finding as the sequence that minimizes the maximum distance to other sequences
- Improve O(N^2) to O(N) complexity by interpolating new sequences to the original set and only doing small regions with O(N^2)
- Successful in metagenomics
- Can use Oct-tree from 3D mapping or set of consensus vectors
- Some clusters diffuse?
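The consensus rule in the list above (pick the sequence whose maximum distance to the other cluster members is smallest) can be sketched in a few lines. The function name, sequence labels and distance matrix below are hypothetical, and the sketch assumes a cluster with at least two members and a full symmetric distance matrix.

```python
def minimax_center(names, dist):
    """Return (name, radius) of the member minimizing its maximum distance to the rest."""
    best_name, best_radius = None, float("inf")
    for i, name in enumerate(names):
        # Radius of the cluster as seen from member i.
        radius = max(dist[i][j] for j in range(len(names)) if j != i)
        if radius < best_radius:
            best_name, best_radius = name, radius
    return best_name, best_radius


# Hypothetical pairwise distances within one cluster.
names = ["s1", "s2", "s3", "s4"]
dist = [
    [0, 2, 5, 9],
    [2, 0, 3, 6],
    [5, 3, 0, 4],
    [9, 6, 4, 0],
]
center = minimax_center(names, dist)
```

Here s3 is selected because no other member is more than 5 away from it, whereas every other candidate has a larger worst-case distance.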

[Figure: Blast 6]

[Figure: Full Data, Blast 6. Original run has 0.96 cut]

[Figure: Cluster Data, Blast 6. Original run has 0.96 cut]

Use Barnes-Hut OctTree, originally developed to reduce O(N^2) astrophysics computations to O(N log N)

OctTree for 100K sample of Fungi
We use OctTree for logarithmic interpolation
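The spatial subdivision behind an octree can be sketched briefly: a cubic cell holds points until it exceeds a capacity, then splits into eight children, so locating the cell for a query point takes a number of steps that grows roughly with the logarithm of the point count. The class below is an illustrative sketch only. It omits the per-cell summaries that Barnes-Hut adds on top of the tree, and the class name, capacity and random points are assumptions.

```python
import random


class Octree:
    """Cubic cell that splits into 8 children once it holds too many points."""

    def __init__(self, center, half, capacity=4):
        self.center, self.half, self.capacity = center, half, capacity
        self.points = []
        self.children = None  # None while this cell is a leaf

    def _child_index(self, p):
        # One bit per axis, set when the point lies on the upper side.
        return sum(1 << k for k in range(3) if p[k] >= self.center[k])

    def _split(self):
        h = self.half / 2.0
        self.children = []
        for idx in range(8):
            c = tuple(self.center[k] + (h if (idx >> k) & 1 else -h) for k in range(3))
            self.children.append(Octree(c, h, self.capacity))
        for p in self.points:
            self.children[self._child_index(p)].insert(p)
        self.points = []

    def insert(self, p):
        if self.children is not None:
            self.children[self._child_index(p)].insert(p)
            return
        self.points.append(p)
        # The size guard stops endless splitting on duplicate points.
        if len(self.points) > self.capacity and self.half > 1e-9:
            self._split()

    def leaf_for(self, p):
        """Descend to the leaf cell containing p; return (leaf, depth)."""
        node, depth = self, 0
        while node.children is not None:
            node = node.children[node._child_index(p)]
            depth += 1
        return node, depth

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(c.count() for c in self.children)


rng = random.Random(0)
tree = Octree(center=(0.5, 0.5, 0.5), half=0.5)
for _ in range(200):
    tree.insert((rng.random(), rng.random(), rng.random()))
leaf, depth = tree.leaf_for((0.1, 0.2, 0.3))
```

The points stored in the returned leaf are natural candidates for the nearby references a new sequence would be interpolated against.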

[Figure: 440K Interpolated]

Conclusions
- Data → Knowledge: protein annotation
- Overwhelming influx of new sequences
- Annotation is an immense challenge.
- HPC and advanced analytics needed.
- PSU as tool to facilitate annotation:
  - Interactive visualization and exploration
  - Integrates info on function, pathways, structure, and environment
  - MDS preserves grouping structure of protein space
  - MDS can use different proximities and biological data
  - Parallel MDS handles large-scale data
  - MDS interpolation quickly maps new sequences into existing space

DELSA: Data → Knowledge → Action
Data-Enabled Life Sciences Alliance International
- Collective innovation to tackle modern biological challenges through best computational practices and advanced cyberinfrastructure.
- Harness expertise and resources across disciplines
- Promote accurate, sustainable, scalable approaches
- Facilitate translation of data influx into tangible innovations and groundbreaking discoveries

References and Resources
- COG data is available at the NCBI site: ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/
- MDS results are available at http://manxcatcogblog.blogspot.com/
- All software used to analyze and visualize the data is open source.
- DELSA: http://www.delsaglobal.org
- Protein Global Atlas and Data Accessibility Projects

Acknowledgements
Grant support
- NSF: under DBI: 0969929 (EK) and 0910818 (GF)
- NIH: 5 RC2 HG 005806-02 (GF); NIGMS grant R01 GM-076680-04 (EK); NIDDK grants U01-DK-089571 and U01-DK-072473 (EK)

Thank you for your attention
