VISUALIZING THE PROTEIN SEQUENCE UNIVERSE



SLIDE 1

VISUALIZING THE PROTEIN SEQUENCE UNIVERSE

L. STANBERRY¹, R. HIGDON¹, W. HAYNES¹, N. KOLKER¹, W. BROOMALL¹, S. EKANAYAKE², A. HUGHES², Y. RUAN², J. QIU², E. KOLKER¹, G. FOX²

¹SEATTLE CHILDREN’S, ²INDIANA UNIVERSITY

ECMLS 2012, 18 June 2012

SLIDE 2

Outline

Visualizing PSU

• A 4th paradigm problem in biology: assigning functions to (annotating) proteins
• Challenge
• Our goal
• PSU: methods & initial results
• Conclusions

SLIDE 3

Grand Challenge of Functional Genomics

• New technologies produce peta- and exabytes of data
• The Protein Sequence Universe (PSU), the protein sequence space, expands exponentially
  • EMP, i5K, iPlant, NEON
  • 30% of existing sequenced proteins are unannotated
• Existing resources are overwhelmed and many are unsupported: COG, SYSTERS, CluSTr, eggNOG.

SLIDE 4

Ultimate Goal: Annotate All Proteins

Our approach:

• Revitalize, expand & enhance protein annotation resources.
• Develop a sustainable software framework.
• Use HPC and the most powerful cyberinfrastructure (CI): grids & clouds.
• Provide rigorous and reliable tools to annotate protein sequences.

SLIDE 5

COG: Clusters of Orthologous Groups

• The COG database was developed by NCBI.
• Proteins encoded in complete genomes are classified into groups with common function.
• Prokaryotes (COG): 66 genomes, 200K proteins, 5K clusters.
• Eukaryotes (KOG): 7 genomes, 113K proteins, 5K clusters.
• Valuable scientific resource: 5K citations.
• Last updated: 2006.

SLIDE 6

Clustering 10 million UniRef100

• UniRef100: 10M proteins, including 5.3M bacterial & archaeal
• BLAST: a common sequence alignment approach
• All vs. all alignment on Azure
• 475 eight-core virtual machines produced 3+ billion filtered records in 6 days

SLIDE 7

Clustering 10 million UniRef100

• Use prokaryotic COG as a starting point.
• Expand COGs ~20 fold (3.5 million proteins).
• Cluster 2M proteins into 500K functional groups.
• Single linkage clustering with the MapReduce framework on Hadoop.
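A minimal sketch of single linkage clustering, assuming pairwise similarity scores are already available as (i, j, score) records, for example from filtered alignment hits. The function name, the threshold and the toy scores are hypothetical. The slides run this step at scale with MapReduce on Hadoop; this single-process union-find version only illustrates the grouping rule (two proteins share a cluster if a chain of sufficiently similar pairs connects them).

```python
def single_linkage(n_items, scored_pairs, threshold):
    """Group items 0..n_items-1; pairs scoring >= threshold are linked."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical similarity scores (higher means more similar).
pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.95), (2, 3, 0.1)]
clusters = single_linkage(5, pairs, threshold=0.5)
print(clusters)
```

Items 0 and 2 end up together even though they were never compared directly, which is the defining behavior of single linkage.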

SLIDE 8

Promise and Challenge of Annotation

• Clustering facilitates mass annotation

BUT

• Takes considerable effort and expertise
• Requires multiple cloud systems and compute solutions

SLIDE 9

Public Resources

• Struggle to cope with the influx of data
• Provide limited interactive and analytic capabilities
• Many no longer supported (SYSTERS, CluSTr, COG)
• The biological community needs a scalable, sustainable and efficient approach to visualize, explore and annotate new data.

SLIDE 10

Protein Sequence Universe

• PSU goal: enhance annotation resources with analytic and visualization (browser) tools.
• Project sequence data into 3D using multidimensional scaling (MDS).
• MDS interpolation allows expanding the universe without a time-consuming all vs. all O(N²) computation.
• The 3D map allows much faster interpolation.
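The interpolation idea can be sketched as follows: keep the already-mapped points fixed and place one new point so that its 3D distances to a few of them match its target dissimilarities. This is only an illustration under stated assumptions. The slides do not specify the interpolation algorithm, so plain gradient descent on an unweighted squared-error stress is an assumed stand-in, and `interpolate_point`, the anchor coordinates, the step size and the iteration count are all hypothetical.

```python
import math


def interpolate_point(anchors, target_dists, steps=2000, lr=0.05):
    """Place one new 3D point given target distances to fixed anchor points."""
    k = len(anchors)
    # Start from the centroid of the anchors.
    x = [sum(p[c] for p in anchors) / k for c in range(3)]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for p, delta in zip(anchors, target_dists):
            diff = [x[c] - p[c] for c in range(3)]
            dist = math.sqrt(sum(v * v for v in diff)) or 1e-12
            # Gradient of (dist - delta)^2 with respect to x.
            coeff = 2.0 * (dist - delta) / dist
            for c in range(3):
                grad[c] += coeff * diff[c]
        x = [x[c] - lr * grad[c] / k for c in range(3)]
    return x


# Toy check: target distances are generated from a known location.
anchors = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
true_point = (0.3, 0.6, 0.2)
targets = [math.dist(true_point, p) for p in anchors]
placed = interpolate_point(anchors, targets)
print([round(v, 3) for v in placed])
```

Each new point costs work proportional to the number of anchors it is compared against, rather than requiring a new all vs. all run over the whole set.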

SLIDE 11

Multi-Dimensional Scaling (MDS)

• Sammon’s objective function:

$$H = \sum_{i<j\le n} \frac{\left(f(\delta_{ij}) - d(x_i, x_j)\right)^2}{f(\delta_{ij})}$$

• δ_ij is the dissimilarity measure between sequences i and j
• d is the Euclidean distance between projections x_i and x_j
• Denominator: larger contribution from smaller dissimilarities
• f is a monotone transformation of the dissimilarity measure, chosen “artistically”
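A small sketch of how a Sammon-type objective can be evaluated for a given 3D configuration: each pair contributes the squared difference between its transformed dissimilarity and its Euclidean distance, divided by the transformed dissimilarity. The function name `sammon_stress`, the identity default for `f`, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import math


def sammon_stress(points, dissim, f=lambda d: d):
    """Sum over pairs i < j of (f(dissim_ij) - d(x_i, x_j))^2 / f(dissim_ij)."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            target = f(dissim[i][j])
            embedded = math.dist(points[i], points[j])
            total += (target - embedded) ** 2 / target
    return total


# Toy configuration: three points in 3D.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

# Dissimilarities that the configuration reproduces exactly give zero stress.
exact = [[math.dist(p, q) for q in points] for p in points]
stress_exact = sammon_stress(points, exact)

# Two points at distance 1 with target dissimilarity 2: (2 - 1)^2 / 2 = 0.5.
stress_pair = sammon_stress(points[:2], [[0.0, 2.0], [2.0, 0.0]])
print(stress_exact, stress_pair)
```

The double loop makes the quadratic cost in the number of sequences explicit.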

SLIDE 12

Typical Metagenomic MDS

SLIDE 13

MDS Details

• f is chosen heuristically to increase the ratio of standard deviation to mean for f(δ_ij) and to increase the range of dissimilarity measures.
• O(n²) complexity to map n sequences into 3D.
• MDS can be solved using EM (SMACOF, fastest but limited) or directly by Newton's method (it’s just χ²).
• Used a robust implementation of nonlinear χ² minimization with Levenberg-Marquardt.
• 3D projections visualized in PlotViz.
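To make the "ratio of standard deviation to mean" criterion concrete, the sketch below compares that ratio for a few monotone transformations of the same dissimilarities. The power-law family and the sample values are assumptions chosen only for illustration; the slides do not state which family of transformations was searched.

```python
import statistics


def coeff_of_variation(values):
    """Ratio of (population) standard deviation to mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def power_transform(values, exponent):
    """Monotone transformation for positive values and a positive exponent."""
    return [v ** exponent for v in values]


# Hypothetical dissimilarities concentrated in a narrow range.
dissims = [0.80, 0.85, 0.90, 0.95, 1.00, 1.05]

cv_sqrt = coeff_of_variation(power_transform(dissims, 0.5))
cv_raw = coeff_of_variation(dissims)
cv_square = coeff_of_variation(power_transform(dissims, 2.0))

for label, cv in (("sqrt", cv_sqrt), ("raw", cv_raw), ("square", cv_square)):
    print(f"{label:>6}: std/mean = {cv:.4f}")
```

For positive data, a larger exponent spreads the values out and raises the ratio, while an exponent below 1 compresses them and lowers it.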

SLIDE 14

MDS Details

• Input data: 100K sequences from well-characterized prokaryotic COGs.
• Proximity measure: sequence alignment scores
• Scores calculated using Needleman-Wunsch
• Scores “sqrt 4D” transformed and fed into MDS
  • Analytic form for transformation to 4D
  • δ_ij^n decreases dimension for n > 1; increases it for n < 1
  • “sqrt 4D” reduced dimension of distance data from 244 for δ_ij to 14 for f(δ_ij)
  • Hence more uniform coverage of Euclidean space

SLIDE 15

3D View of 100K COG Sequences

SLIDE 16

Implementation

• NW computed in parallel on a 100-node 8-core system.
• Used Twister (IU) in the Reduce phase of MapReduce.
• MDS calculations performed on a 768-core MS HPC cluster (32 nodes).
• Scaling: parallel MPI with intranode threading.
• Parallel efficiency of the code approximately 70%.
• Lost efficiency due to memory bandwidth saturation.
• NW required 1 day; the MDS job required 3 days.
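For readers unfamiliar with Needleman-Wunsch (NW), here is a minimal sketch of the global alignment score for one pair of sequences. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption chosen for brevity; a protein pipeline would normally use a substitution matrix and tuned gap penalties, and the slides do not state which were used.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b using two rolling rows."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
        prev = curr
    return prev[-1]


score_same = needleman_wunsch_score("GATTACA", "GATTACA")
score_gap = needleman_wunsch_score("AAA", "AA")
print(score_same, score_gap)
```

Every pair is scored independently of every other pair, which is why the all-pairs computation parallelizes well across nodes.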

SLIDE 17

Cluster Annotation

COG      Annotation                                                           UniRef100
COG1131  ABC-type multidrug transport system, ATPase component                14406
COG1136  ABC-type antimicrobial peptide transport system, ATPase component    7306
COG1126  ABC-type polar amino acid transport system, ATPase component         4061
COG3839  ABC-type sugar transport systems, ATPase component                   4121
COG0444  ABC-type dipeptide/oligopeptide/nickel transport system ATPase comp  3520
COG4608  ABC-type oligopeptide transport system, ATPase component             3074
COG3842  ABC-type spermidine/putrescine transport systems, ATPase comp        3665
COG0333  Ribosomal protein L32                                                1148
COG0454  Histone acetyltransferase HPA2 and related acetyltransferases        14085
COG0477  Permeases of the major facilitator superfamily                       48590
COG1028  Dehydrogenases with different specificities                          37461

SLIDE 18

Heatmap of NW vs Euclidean Distances

SLIDE 19

Dendrogram of Cluster Centroids

SLIDE 20

Selected Clusters

SLIDE 21

Heatmap for Selected Clusters

SLIDE 22

Future Steps

• Comparison of Needleman-Wunsch vs. BLAST vs. PSI-BLAST
  • NW is easier as it is complete; BLAST has missing distances
• Different transformations, distance → monotonic function(distance), to reduce the formal starting dimension (increase sigma/mean)
• Automate cluster consensus finding: the consensus is the sequence that minimizes the maximum distance to other sequences
• Improve O(N²) to O(N) complexity by interpolating new sequences to the original set and only doing small regions with O(N²)
  • Successful in metagenomics
  • Can use Oct-tree from 3D mapping or a set of consensus vectors
• Some clusters diffuse?
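The consensus rule named above (pick the sequence that minimizes the maximum distance to the other sequences in its cluster) can be sketched in a few lines. The function name `minimax_consensus` and the distance matrix are hypothetical; any symmetric pairwise distance for the cluster members would serve as input.

```python
def minimax_consensus(dist):
    """Index of the member whose largest distance to any other member is smallest."""
    n = len(dist)

    def worst_case(i):
        # default=0.0 covers a single-member cluster.
        return max((dist[i][j] for j in range(n) if j != i), default=0.0)

    return min(range(n), key=worst_case)


# Hypothetical pairwise distances for a four-member cluster.
cluster_dist = [
    [0.0, 2.0, 3.0, 4.0],
    [2.0, 0.0, 1.0, 2.0],
    [3.0, 1.0, 0.0, 3.0],
    [4.0, 2.0, 3.0, 0.0],
]
center = minimax_consensus(cluster_dist)
print(center)
```

Member 1 is selected because its farthest neighbor is at distance 2, while every other member has a neighbor at distance 3 or more.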

SLIDE 23

Blast6

SLIDE 24

Full Data Blast6

Original run has 0.96 cut

SLIDE 25

Cluster Data Blast6

Original run has 0.96 cut

SLIDE 26

Use Barnes Hut OctTree

• Originally developed to make O(N²) astrophysics O(N log N)

SLIDE 27

OctTree for 100K sample of Fungi

We use OctTree for logarithmic interpolation
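A minimal sketch of the spatial data structure behind this idea: an octree recursively splits a cube into eight octants so that locating the small region around a 3D position takes a number of steps proportional to the tree depth instead of a scan over all points. The class name, the leaf capacity and the random test points are assumptions. The sketch does not implement the Barnes-Hut approximation itself or the interpolation scheme used in the slides, only tree construction and point location.

```python
import random


class OctreeNode:
    def __init__(self, center, half, capacity=4):
        self.center, self.half, self.capacity = center, half, capacity
        self.points = []      # points held while this node is a leaf
        self.children = None  # eight child nodes once the leaf is split

    def _octant(self, p):
        # Bit c is set when p lies on the high side of the center along axis c.
        return sum((p[c] >= self.center[c]) << c for c in range(3))

    def insert(self, p):
        if self.children is not None:
            self.children[self._octant(p)].insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.half > 1e-9:
            self._split()

    def _split(self):
        h = self.half / 2
        self.children = [
            OctreeNode(
                tuple(self.center[c] + (h if (o >> c) & 1 else -h) for c in range(3)),
                h,
                self.capacity,
            )
            for o in range(8)
        ]
        pending, self.points = self.points, []
        for p in pending:
            self.children[self._octant(p)].insert(p)

    def leaf_for(self, p):
        """Descend to the leaf whose cube contains p; return (leaf, depth)."""
        node, depth = self, 0
        while node.children is not None:
            node = node.children[node._octant(p)]
            depth += 1
        return node, depth


# Hypothetical data: 200 pseudo-random points in the unit cube.
rng = random.Random(0)
pts = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
root = OctreeNode(center=(0.5, 0.5, 0.5), half=0.5)
for p in pts:
    root.insert(p)

leaf, depth = root.leaf_for((0.25, 0.25, 0.25))
print(depth, len(leaf.points))
```

Only the handful of points stored in the located leaf need to be examined in detail, which is what makes a logarithmic-cost lookup possible.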

SLIDE 28

440K Interpolated

SLIDE 29

Conclusions

• Data → Knowledge: protein annotation
  • Overwhelming influx of new sequences
  • Annotation is an immense challenge.
  • HPC and advanced analytics needed.
• PSU as a tool to facilitate annotation:
  • Interactive visualization and exploration
  • Integrates info on function, pathways, structure, and environment
  • MDS preserves grouping structure of protein space
  • MDS can use different proximities and biological data
  • Parallel MDS handles large-scale data
  • MDS interpolation quickly maps new sequences into existing space

SLIDE 30

DELSA: Data → Knowledge → Action

Data-Enabled Life Sciences Alliance International

• Collective innovation to tackle modern biological challenges through best computational practices and advanced cyberinfrastructure.
• Harness expertise and resources across disciplines
• Promote accurate, sustainable, scalable approaches
• Facilitate translation of data influx into tangible innovations and groundbreaking discoveries

SLIDE 31

References and Resources

• COG data is available at the NCBI site: ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0303/
• MDS results are available at http://manxcatcogblog.blogspot.com/
• All software used to analyze and visualize the data is open source.
• DELSA: http://www.delsaglobal.org
  • Protein Global Atlas and Data Accessibility Projects

SLIDE 32

Acknowledgements

Grant support

• NSF: under DBI: 0969929 (EK) and 0910818 (GF)
• NIH: 5 RC2 HG 005806-02 (GF); NIGMS grant R01 GM-076680-04 (EK); NIDDK grants U01-DK-089571 and U01-DK-072473 (EK)

SLIDE 33

Thank you for your attention