
Protein Clustering: Parallelizing an Expensive, Irregular Computation James Larus EPFL AACBB February 23, 2019 San Diego, CA

PhD research
§ Stuart Byma, PhD “Parallel and Scalable Bioinformatics”, April 2020

What’s a protein?
§ Linear polymer of amino acids
  • Folds into complex 3D structures
§ Performs many biological functions

Central dogma of molecular biology
[Figure: DNA → RNA (transcription) → protein (translation)]
§ Gene expression
  • DNA → Protein
§ Encoded by genes in genome
§ 19,000 – 20,000 proteins in humans
  • 1.5% of human genome
§ Composed of 20 amino acids

Transcription
§ Transcribe DNA to RNA inside the nucleus

Translation
§ Once in the cytoplasm, mRNA is translated to a polypeptide
[Figure source: https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:Ribosome_mRNA_translation_en.svg]

Folding
§ Polypeptides fold spontaneously, or are assisted by chaperone proteins

Proteins & evolution
§ Homologous – similar due to shared ancestry
§ Ortholog – similar proteins that diverged through speciation
§ Similarities between proteins are proxies for similarities between genes
  • Infer function of a new protein from its similarity to a known protein
    § Extrapolation from a small number of model organisms
  • Infer evolutionary relationships between species
    § X evolved from Y
    § X, Y have a common ancestor
§ Several of the 100 most-cited scientific papers are on sequence homology

Sequence homology
[Figure: alignment showing protein similarity between hemoglobin α-subunits from human (Homo sapiens) and bonobo (Pan paniscus)]

Identifying similar proteins
§ Input → sequenced proteins
§ Output → sets of homologous proteins
§ All-against-all comparison
  • O(n^2) in number of sequences
  • Sequence comparison also O(n^2) in length of sequences (Smith-Waterman)
§ OMA protein database contains proteins from 2000 genomes
  • Required more than 10 million CPU hours
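Both quadratic costs can be seen in a minimal sketch of Smith-Waterman scoring wrapped in an all-against-all loop. The scoring values (match +2, mismatch -1, linear gap -1) and the function names are illustrative assumptions; the slides do not give a scoring scheme, and real protein comparison typically uses a substitution matrix and affine gap penalties.

```python
from itertools import combinations

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b; O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def all_against_all(seqs, threshold):
    """Compare every pair of sequences: n * (n - 1) / 2 alignments."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(seqs), 2)
        if smith_waterman_score(a, b) > threshold
    ]
```

Each call fills a table with one cell per pair of residues, and the outer loop makes one call per pair of sequences, which is where the two O(n^2) factors on the slide come from.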

Improvement needed!
“Computing orthologs between all complete proteomes has recently gone from typically a matter of CPU weeks to hundreds of CPU years, and new, faster algorithms and methods are called for.” – Quest for Orthologs consortium, 2014

Incremental greedy protein clustering
§ Wittwer, Pilizota, Altenhoff, Dessimoz. “Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology”, PeerJ, 2014.
§ Cluster similar proteins, then perform all-against-all comparison within each cluster
§ Reduces computation time by ~75%
§ Identifies >99.6% of pairs found by all-vs-all
[Figure: Cluster 0 … Cluster N]

Cluster representative
§ Input sequences compared against a cluster representative
  • Homologies are transitive
§ A, B homologous; B, C homologous ⇒ A, C homologous
§ No matches? Create a new cluster!
[Figure: clusters with representatives R_1 … R_n, each holding sequences S_1 … S_m; a new cluster R_n+1 holding S_1]
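The representative-based scheme on this slide can be sketched as follows. This is a simplified illustration, not the published implementation: the names `greedy_cluster`, `homologous_pairs` and `same_family` are hypothetical, the toy similarity test (same first letter) stands in for a real alignment, and letting a sequence join every cluster whose representative it matches is an assumption.

```python
from itertools import combinations

def greedy_cluster(seqs, similar):
    """Assign each sequence to the clusters whose representative it matches."""
    clusters = []  # list of (representative, members)
    for s in seqs:
        matched = False
        for rep, members in clusters:
            if similar(s, rep):
                members.append(s)
                matched = True
        if not matched:
            # No matches: s starts a new cluster and becomes its representative.
            clusters.append((s, [s]))
    return clusters

def homologous_pairs(clusters, similar):
    """All-against-all comparison, restricted to members of the same cluster."""
    return {
        frozenset((a, b))
        for _, members in clusters
        for a, b in combinations(members, 2)
        if similar(a, b)
    }

def same_family(a, b):
    # Toy stand-in for an alignment: an equivalence relation, hence transitive.
    return a[0] == b[0]

seqs = ["aa1", "aa2", "bb1", "aa3", "bb2", "cc1"]
clusters = greedy_cluster(seqs, same_family)
pairs = homologous_pairs(clusters, same_family)
```

With a transitive similarity such as this toy one, comparing within clusters finds every pair that all-against-all would find, while skipping comparisons across clusters.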

Protein similarity is not transitive

Clustering, v2
§ Multiple representatives
§ Ensure all sequences in a cluster are covered (± T residues)
[Figure: cluster with representatives R_1 … R_n and sequences S_1 … S_m, m >> n]

Incremental greedy protein clustering
§ Reduction in computation time of ~75%
  • Clusters are small, on average
§ Accuracy is excellent
  • Maintains >99.6% of all pairs identified by all-against-all (naive)

But,
§ Algorithm is not easily parallelized
§ Order in which clusters and representatives are chosen affects result
§ Data (clusters) is shared – difficult to distribute

Our approach: precise clustering
§ Precise clustering (PC)
  • All significant pairs are members of at least one cluster
  • Compare within cluster and find similarity
§ A pair of proteins is significant if their similarity is above a threshold
  • f(p_1, p_2) > T
§ PC is not a partition – a protein can be in more than one cluster
  • Relation f is not transitive, i.e. similarity is not equivalence
[Figure: example clusters over proteins p_1, p_2, p_3, p_4]
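The definition of a precise clustering can be written down as a small check. This is an illustrative sketch: the names `significant_pairs` and `is_precise` are hypothetical, and the toy proteins (sets of domain letters, with f counting shared domains) are an assumption standing in for real sequences and alignment scores.

```python
from itertools import combinations

def significant_pairs(proteins, f, T):
    """All pairs whose similarity exceeds the threshold T."""
    return {frozenset((p, q)) for p, q in combinations(proteins, 2) if f(p, q) > T}

def is_precise(clusters, proteins, f, T):
    """True if every significant pair shares at least one cluster."""
    co_clustered = {
        frozenset((p, q)) for c in clusters for p, q in combinations(c, 2)
    }
    return significant_pairs(proteins, f, T) <= co_clustered

def f(p, q):
    # Toy similarity: number of shared "domains".
    return len(p & q)

p1, p2, p3 = frozenset("AB"), frozenset("BC"), frozenset("C")
proteins = [p1, p2, p3]

partition = [[p1, p2], [p3]]        # misses the significant pair (p2, p3)
overlapping = [[p1, p2], [p2, p3]]  # p2 is in two clusters
```

Here p1 and p2 are similar, p2 and p3 are similar, but p1 and p3 are not, so no partition into similar groups can be precise; the overlapping clustering is.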

Cluster representative
§ Each cluster has a unique representative R_C
  • ∀e ∈ C, f(e, R_C) > T
§ Two elements in cluster may not be similar: e_1, e_2 ∈ C ⊬ f(e_1, e_2) > T
[Figure: p_1, p_2]

Approach 1
§ New element e is compared against cluster representatives
  • If similar, e is added to cluster
§ This does not work!
  • An element e that is not a representative will not be compared against subsequent elements
  • Because f is not transitive, clustering will not be precise – may miss matches
[Figure: example clusters over proteins p_1 … p_5]

Transitive similarity
§ Transitivity R(e_1, e_2) implies e_2 will be similar to e_3 if e_1 is similar to e_3
  • ∀ i, j, k ∈ S, R(i, j) ⇒ f(i, k) > T ∧ f(j, k) > T
[Figure: p_1, p_2, p_3 with edges labelled R(p_1, p_2), R(p_1, p_3) and f(p_2, p_3)]

Protein similarity
§ Similarity function f
  • Smith-Waterman alignment > T (threshold parameter)
§ Not transitive
§ Comparison order matters
[Figure: Seq A (rep) with f > T to Seq B and f > T to Seq C; comparison orders A B C and B A C]

Protein transitivity
[Figure: Smith-Waterman (S-W) alignment of X and Y, with uncovered subsequences uX and uY]
§ R(X, Y): score > minT, uY < maxU
§ R(Y, X): score > minT, uX < maxU
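The two conditions on this slide can be sketched as a predicate over an alignment result. The `Alignment` record, the function names, and the reading of "uncovered" as the number of residues outside the aligned region are assumptions made for illustration; the slide gives only the thresholds minT and maxU.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alignment:
    """Hypothetical alignment result: score plus half-open aligned ranges."""
    score: int
    x_start: int
    x_end: int
    y_start: int
    y_end: int

def uncovered(length, start, end):
    # Residues of a sequence that fall outside the aligned region (assumed meaning).
    return length - (end - start)

def r_xy(aln, len_y, min_t, max_u):
    """R(X, Y): score > minT and uY < maxU."""
    return aln.score > min_t and uncovered(len_y, aln.y_start, aln.y_end) < max_u

def r_yx(aln, len_x, min_t, max_u):
    """R(Y, X): score > minT and uX < maxU."""
    return aln.score > min_t and uncovered(len_x, aln.x_start, aln.x_end) < max_u

# X has 100 residues, Y has 60; the alignment covers almost all of Y
# but leaves 42 residues of X uncovered.
aln = Alignment(score=200, x_start=20, x_end=78, y_start=0, y_end=58)
```

In this example R(X, Y) holds and R(Y, X) does not, which shows that R is directional even though the alignment score is shared.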

Incremental greedy precise clustering
§ Construct clusters one element at a time
§ First element becomes cluster representative
[Figure: cluster containing p_1]

Incremental greedy precise clustering
§ Compare subsequent elements against cluster representative
[Figure: p_2 compared with p_1: R? f?]

Incremental greedy precise clustering
§ If transitively similar, add to cluster
[Figure: R, p_2 ➔ p_1; cluster now holds p_1, p_2]

Incremental greedy precise clustering
§ If only similar, add to cluster and create a new cluster
[Figure: f, p_2 ➔ p_1; cluster of p_1 holds p_1, p_2; new cluster holds p_2]

Incremental greedy precise clustering
§ Continue until all elements clustered
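The steps on the preceding slides can be sketched as a single loop. This is a minimal sketch under stated assumptions: proteins are modelled as sets of domain letters, `similar` (at least one shared domain) stands in for f > T, and `transitive` (the element is fully covered by the representative) stands in for R. The rule for an element that matches several clusters, where a new cluster is created only if no representative is transitively similar to it, is also an assumption; the slides show the single-cluster case only.

```python
def similar(a, b):
    # Toy stand-in for f(a, b) > T: the proteins share at least one domain.
    return bool(a & b)

def transitive(rep, e):
    # Toy stand-in for R: e is fully covered by rep, so anything
    # similar to e is also similar to rep.
    return e <= rep

def precise_cluster(items):
    clusters = []  # list of (representative, members)
    for e in items:
        covered = False
        for rep, members in clusters:
            if similar(e, rep):
                members.append(e)
                covered = covered or transitive(rep, e)
        if not covered:
            # Only similar (or no match at all): e also becomes the
            # representative of a new cluster.
            clusters.append((e, [e]))
    return clusters

AB, BC, C, A = (frozenset(s) for s in ("AB", "BC", "C", "A"))
clusters = precise_cluster([AB, BC, C, A])
```

In this run BC is only similar to AB, so it joins the first cluster and also starts its own; C then finds BC through that second cluster, which a single-representative scheme would have missed.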

Parallelism
§ Unlike original Wittwer algorithm, order does not matter for precise clustering
§ Clusters can be constructed independently and merged
[Figure: R() ?]

Merging clusters
[Figure: R() ?]

Merging clusters
[Figure: R() ? f() ? f()]

Merging sets of clusters
[Figure: Set 1 and Set 2 with clusters 1, 2, 3, 4; merged result with clusters 1, 4, and 2 & 3]
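Building cluster sets independently and then merging them might look like the sketch below. The slides show the merge only as figures, so the merge rule here is an assumption rather than the authors' method: if one representative is transitively similar to the other, the whole cluster is absorbed; if the representatives are only similar, the members that match the other representative are copied across. Proteins are again modelled as sets of domain letters, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from itertools import combinations

def similar(a, b):       # toy stand-in for f(a, b) > T
    return bool(a & b)

def transitive(rep, e):  # toy stand-in for R: e is fully covered by rep
    return e <= rep

def build(items):
    """Incremental greedy clustering of one partition of the input."""
    clusters = []
    for e in items:
        covered = False
        for rep, members in clusters:
            if similar(e, rep):
                members.append(e)
                covered = covered or transitive(rep, e)
        if not covered:
            clusters.append((e, [e]))
    return clusters

def merge(set_a, set_b):
    """Merge two cluster sets by comparing representatives (assumed rule)."""
    merged = [(rep, list(members)) for rep, members in set_a]
    n_a = len(merged)
    for rep_b, members_b in set_b:
        absorbed = False
        for rep_a, members_a in merged[:n_a]:
            if not similar(rep_a, rep_b):
                continue
            if transitive(rep_a, rep_b):
                incoming = members_b  # the whole cluster moves over
                absorbed = True
            else:
                incoming = [m for m in members_b if similar(m, rep_a)]
            members_a.extend([m for m in incoming if m not in members_a])
        if not absorbed:
            merged.append((rep_b, list(members_b)))
    return merged

def cluster_in_parallel(partitions):
    with ThreadPoolExecutor() as pool:
        cluster_sets = list(pool.map(build, partitions))
    return reduce(merge, cluster_sets)

def is_precise(clusters, items):
    together = {frozenset(p) for _, m in clusters for p in combinations(m, 2)}
    return all(frozenset((a, b)) in together
               for a, b in combinations(items, 2) if similar(a, b))

items = [frozenset(s) for s in ("AB", "BC", "A", "C", "CD", "D", "E")]
result = cluster_in_parallel([items[:3], items[3:]])
```

Threads are used only to keep the example short; with a CPU-bound alignment in place of the toy `similar`, separate processes or machines would be the natural choice. In this toy model the merged result still places every similar pair in a common cluster, which `is_precise` checks.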

Cluster merge

Parallelization 1

Parallelization 2
§ Parallelize merge of two large sets
§ Each computation is a partial merge
