
Protein Clustering: Parallelizing an Expensive, Irregular Computation



  1. Protein Clustering: Parallelizing an Expensive, Irregular Computation
     James Larus, EPFL
     AACBB, February 23, 2019, San Diego, CA

  2. PhD research
     Stuart Byma, PhD, “Parallel and Scalable Bioinformatics”, April 2020

  3. What’s a protein?
     § Linear polymer of amino acids
       • Fold into complex 3D structures
     § Perform many biological functions

  4. Central dogma of molecular biology
     [Diagram: DNA → (Transcription) → RNA → (Translation) → Protein]
     § Gene expression
       • DNA → Protein
     § Encoded by genes in genome
     § 19,000 – 20,000 proteins in humans
       • 1.5% of human genome
     § Composed of 20 amino acids

  5. Transcription
     § Transcribe DNA to RNA inside the nucleus

  6. Translation
     § Once in cytoplasm, mRNA is translated to polypeptide
     [Figure: https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:Ribosome_mRNA_translation_en.svg]

  7. Folding
     § Polypeptides fold spontaneously, or are assisted by chaperone proteins

  8. Proteins & evolution
     § Homologous – similar due to shared ancestry
     § Ortholog – similar proteins diverged through speciation
     § Similarities between proteins are proxies for similarities between genes
       • Infer function of new protein because of its similarity to known protein
         § Extrapolation from small number of model organisms
       • Infer evolutionary relationships between species
         § X evolved from Y
         § X, Y have common ancestor
     § Several of the 100 most-cited scientific papers are on sequence homology

  9. Sequence homology
     [Figure: alignment showing protein similarity between hemoglobin α-subunits from human (Homo sapiens) and bonobo (Pan paniscus)]

  10. Identifying similar proteins
     § Input → sequenced proteins
     § Output → sets of homologous proteins
     § All-against-all comparison
       • O(n²) in number of sequences
       • Sequence comparison also O(n²) in length of sequences (Smith-Waterman)
     § OMA protein database contains proteins from 2000 genomes
       • Required more than 10 million CPU hours
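
The two quadratic costs on this slide can be illustrated with a minimal sketch. The scoring values (match +2, mismatch -1, linear gap -1) and the function names are assumptions chosen for brevity; real protein comparison typically uses a substitution matrix and affine gap penalties, which this sketch does not attempt.

```python
from itertools import combinations

def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score of a and b; O(len(a) * len(b)) time.

    Toy scoring (assumption): identity match/mismatch and a linear gap
    penalty instead of a substitution matrix with affine gaps.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def all_against_all(seqs, threshold):
    """Naive baseline: score every pair, n * (n - 1) / 2 comparisons."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(seqs), 2)
            if smith_waterman_score(a, b) > threshold]
```

With n sequences the outer loop performs n(n-1)/2 alignments, and each alignment fills a table whose size is the product of the two sequence lengths, which is why the naive approach becomes so expensive.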

  11. Improvement needed!
     “Computing orthologs between all complete proteomes has recently gone from typically a matter of CPU weeks to hundreds of CPU years, and new, faster algorithms and methods are called for.”
     – Quest for Orthologs consortium, 2014

  12. Incremental greedy protein clustering
     § Wittwer, Pilizota, Altenhoff, Dessimoz. “Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology”, PeerJ, 2014.
     § Cluster similar proteins, then perform all-against-all comparison within each cluster
     § Reduces computation time by ~75%
     § Identify >99.6% of pairs found by all-vs-all
     [Diagram: Cluster 0 … Cluster N]

  13. Cluster representative
     § Input sequences compared against a cluster representative
       • Homologies are transitive
         § A, B homologous; B, C homologous ⇒ A, C homologous
     § No matches? Create a new cluster!
     [Diagram: clusters with representatives R_1 … R_n, each holding sequences S_1 … S_m, and a new cluster R_{n+1} holding S_1]

  14. Proteins not transitive

  15. Clustering, v2
     § Multiple representatives
     § Ensure all sequences in a cluster are covered (± T residues)
     [Diagram: cluster with representatives R_1 … R_n and sequences S_1 … S_m, m >> n]

  16. Incremental greedy protein clustering
     § Reduction in computation time of ~75%
       • Clusters are small, on average
     § Accuracy is excellent
       • Maintain >99.6% of all pairs identified by all-against-all (naive)

  17. But,
     § Algorithm is not easily parallelized
     § Order in which clusters and representatives are chosen affects result
     § Data (clusters) is shared – difficult to distribute

  18. Our approach: precise clustering
     § Precise clustering (PC)
       • All significant pairs are members of at least one cluster
       • Compare within cluster and find similarity
     § A pair of proteins is significant if their similarity is above a threshold
       • f(p_1, p_2) > T
     § PC is not a partition – a protein can be in more than one cluster
       • Relation f is not transitive, i.e. similarity is not equivalence
     [Diagram: clusters over p_1, p_2, p_3, p_4, with p_2 appearing in two clusters]

  19. Cluster representative
     § Each cluster has a unique representative R_C
       • ∀ e ∈ C, f(e, R_C) > T
     § Two elements in cluster may not be similar: e_1, e_2 ∈ C ⊬ f(e_1, e_2) > T
     [Diagram: p_1, p_2]

  20. Approach 1
     § New element e is compared against cluster representatives
       • If similar, e is added to cluster
     § This does not work!
       • e, other than representative, will not be compared against subsequent elements
       • Because f is not transitive, clustering will not be precise – may miss matches
     [Diagram: clusters over p_1 … p_5, with p_2 appearing in two clusters]
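
A small sketch may make the failure concrete. Everything here is a toy assumption: elements are integers, `similar` stands in for f > T, and `approach_one` is a hypothetical name for the representative-only scheme the slide describes.

```python
from itertools import combinations

def approach_one(elements, similar):
    """Representative-only clustering; clusters[i][0] is the representative."""
    clusters = []
    for e in elements:
        matched = False
        for cluster in clusters:
            if similar(cluster[0], e):   # only the representative is compared
                cluster.append(e)
                matched = True
        if not matched:
            clusters.append([e])         # e becomes a new representative
    return clusters

def pairs_within(clusters, similar):
    """Significant pairs that an all-against-all inside each cluster would find."""
    return {frozenset(p) for c in clusters
            for p in combinations(c, 2) if similar(*p)}

def significant_pairs(elements, similar):
    """Ground truth from a full all-against-all comparison."""
    return {frozenset(p) for p in combinations(elements, 2) if similar(*p)}

# Toy, non-transitive similarity (assumption): values within 2 of each other.
similar = lambda a, b: abs(a - b) <= 2
elements = [0, 2, 4]
clusters = approach_one(elements, similar)
missed = significant_pairs(elements, similar) - pairs_within(clusters, similar)
```

Here 2 joins the cluster of 0, and 4 is only compared against the representative 0, so the significant pair (2, 4) never shares a cluster and ends up in `missed`.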

  21. Transitive similarity
     § Transitivity R(e_1, e_2) implies e_2 will be similar to e_3 if e_1 is similar to e_3
       • ∀ i, j, k ∈ S, R(i, j) ⇒ f(i, k) > T ∧ f(j, k) > T
     [Diagram: p_1, p_2, p_3 with edges R(p_1, p_2), R(p_1, p_3), f(p_2, p_3)]

  22. Protein similarity
     § Similarity function f
       • Smith-Waterman alignment > T (threshold parameter)
     § Not transitive
     § Comparison order matters
     [Diagram: Seq A (rep), Seq B, Seq C with f > T annotations; comparison orders A B C and B A C]

  23. Protein transitivity
     [Diagram: sequences X and Y joined by an S-W score, with uncovered subsequences uX and uY]
     § R(X, Y): score > minT, uY < maxU
     § R(Y, X): score > minT, uX < maxU
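
The two conditions on this slide can be sketched as a pair of predicates. The `Alignment` record, the half-open coordinates, and the reading of "uncovered" as the number of residues outside the aligned span are assumptions made for illustration; the slide does not specify them, and the threshold values below are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alignment:
    """Hypothetical result of one Smith-Waterman alignment of X against Y."""
    score: int
    x_start: int  # aligned span in X, half-open [x_start, x_end)
    x_end: int
    y_start: int  # aligned span in Y, half-open [y_start, y_end)
    y_end: int

def uncovered(length: int, start: int, end: int) -> int:
    """Residues of a sequence that fall outside its aligned span (assumption)."""
    return length - (end - start)

def r_xy(aln: Alignment, len_y: int, min_t: int, max_u: int) -> bool:
    """R(X, Y): score > minT and uY < maxU, i.e. Y is almost fully covered."""
    return aln.score > min_t and uncovered(len_y, aln.y_start, aln.y_end) < max_u

def r_yx(aln: Alignment, len_x: int, min_t: int, max_u: int) -> bool:
    """R(Y, X): score > minT and uX < maxU, i.e. X is almost fully covered."""
    return aln.score > min_t and uncovered(len_x, aln.x_start, aln.x_end) < max_u

# Made-up example: a short Y aligns almost entirely inside a long X.
aln = Alignment(score=180, x_start=40, x_end=135, y_start=2, y_end=97)
covers_y = r_xy(aln, len_y=100, min_t=135, max_u=15)   # uY = 5
covers_x = r_yx(aln, len_x=300, min_t=135, max_u=15)   # uX = 205
```

The example shows that R is not symmetric under this reading: the same alignment can satisfy R(X, Y) while failing R(Y, X) when X has a long uncovered region.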

  24. Incremental greedy precise clustering
     § Construct clusters one element at a time
     § First element becomes cluster representative
     [Diagram: p_1]

  25. Incremental greedy precise clustering
     § Compare subsequent elements against cluster representative
     [Diagram: p_2 compared against p_1: R? f?]

  26. Incremental greedy precise clustering
     § If transitively similar, add to cluster
     [Diagram: R holds between p_2 and p_1 ➔ one cluster containing p_1, p_2]

  27. Incremental greedy precise clustering
     § If only similar, add to cluster and create a new cluster
     [Diagram: f holds between p_2 and p_1 ➔ cluster containing p_1, p_2 plus a new cluster containing p_2]

  28. Incremental greedy precise clustering
     § Continue until all elements clustered
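
The steps on slides 24 to 28 can be sketched as a single loop. This is a reading of the slides under stated assumptions, not the authors' implementation: `similar` and `transitive` are caller-supplied stand-ins for f > T and R, an element is compared against every existing representative, and an element that matches no representative starts its own cluster.

```python
from typing import Callable, List, Sequence, TypeVar

E = TypeVar("E")

def greedy_precise_clustering(
    elements: Sequence[E],
    similar: Callable[[E, E], bool],      # stand-in for f(rep, e) > T
    transitive: Callable[[E, E], bool],   # stand-in for R(rep, e)
) -> List[List[E]]:
    """Build clusters one element at a time; clusters[i][0] is the representative."""
    clusters: List[List[E]] = []
    for e in elements:
        fully_represented = False
        for cluster in clusters:
            rep = cluster[0]
            if transitive(rep, e):
                cluster.append(e)          # slide 26: transitively similar
                fully_represented = True
            elif similar(rep, e):
                cluster.append(e)          # slide 27: only similar
        if not fully_represented:
            # Slides 24 and 27: e starts a new cluster as its representative.
            clusters.append([e])
    return clusters

# Toy relations on integers (assumptions): R is a stricter form of f.
toy_similar = lambda a, b: abs(a - b) <= 3
toy_transitive = lambda a, b: abs(a - b) <= 1
result = greedy_precise_clustering([0, 1, 3, 10], toy_similar, toy_transitive)
```

In the toy run, 3 is only similar to the representative 0, so it joins that cluster and also founds its own, which matches the earlier point that precise clustering is not a partition.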

  29. Parallelism
     § Unlike original Wittwer algorithm, order does not matter for precise clustering
     § Clusters can be constructed independently and merged
     [Diagram: two clusters, R()?]

  30. Merging clusters
     [Diagram: R()?]

  31. Merging clusters
     [Diagram: R()?, f()?, f()]

  32. Merging sets of clusters
     [Diagram: Set 1 and Set 2; clusters 1, 2, 3, 4 become 1, 4, 2 & 3]

  33. Cluster merge

  34. Parallelization 1

  35. Parallelization 2
     § Parallelize merge of two large sets
     § Each computation is a partial merge
