Protein Clustering: Parallelizing an Expensive, Irregular Computation
James Larus EPFL
AACBB
February 23, 2019 San Diego, CA
Stuart Byma PhD “Parallel and Scalable Bioinformatics”, April 2020
§ Linear polymer of amino acids
§ Perform many biological functions
§ Gene Expression
§ Encoded by genes in genome
§ 19,000 – 20,000 proteins in humans
§ Composed of 20 amino acids
DNA → (transcription) → RNA → (translation) → Protein
§ Transcribe DNA to RNA inside the nucleus
§ Once in cytoplasm, mRNA is translated to polypeptide
https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:Ribosome_mRNA_translation_en.svg
§ Polypeptides fold spontaneously, or are assisted by chaperone proteins
§ Homologous – similar due to shared ancestry
§ Ortholog – similar proteins diverged through speciation
§ Similarities between proteins are proxies for similarities between genes
§ Extrapolation from small number of model organisms
§ X evolved from Y
§ X, Y have common ancestor
§ Several of the 100 most-cited scientific papers are on sequence homology
Human (Homo sapiens), Bonobo (Pan paniscus)
Alignment showing the similarity between hemoglobin α-subunits from human and bonobo
§ Input → sequenced proteins
§ Output → sets of homologous proteins
§ All-against-all comparison
§ OMA protein database contains proteins from 2000 genomes
“Computing orthologs between all complete proteomes has recently gone from typically a matter of CPU weeks to hundreds of CPU years, and new, faster algorithms and methods are called for.” – Quest for Orthologs consortium, 2014
§ Wittwer, Pilizota, Altenhoff, Dessimoz. “Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology.” PeerJ, 2014.
§ Cluster similar proteins, then perform all-against-all comparison within each cluster
§ Reduces computation time by ~75%
§ Identifies >99.6% of pairs found by all-vs-all
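To see why comparing only within clusters saves work, the pair counts can be worked out directly. The sketch below is illustrative arithmetic only: the cluster sizes are made up, clusters are assumed not to overlap, and the cost of building the clusters is ignored, so the saving it computes is not the ~75% figure from the paper.

```python
def all_vs_all_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def clustered_pairs(cluster_sizes: list[int]) -> int:
    """Pairs compared when all-against-all runs only inside each cluster."""
    return sum(all_vs_all_pairs(size) for size in cluster_sizes)


# Hypothetical example: 1,000 sequences split into 4 non-overlapping clusters.
sizes = [400, 300, 200, 100]
total = all_vs_all_pairs(sum(sizes))
within = clustered_pairs(sizes)
savings = 1 - within / total
print(f"all-vs-all: {total} pairs, within clusters: {within} pairs, saved: {savings:.1%}")
```

The quadratic growth of `all_vs_all_pairs` is what makes the full comparison expensive; the saving depends on how evenly the sequences spread across clusters.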
Cluster 0 … Cluster N
§ Input sequences compared against a cluster representative
§ A, B homologous; B, C homologous ⇒ A, C homologous
§ No matches? Create a new cluster!
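A minimal sketch of the representative-based greedy step described above, with a single representative per cluster. The names `greedy_cluster` and `toy_similar` are hypothetical, the similarity test is a stand-in (count of matching positions) rather than a real alignment score, and joining only the first matching cluster is an assumption, not something the slides state.

```python
from typing import Callable, List


def toy_similar(a: str, b: str) -> bool:
    """Stand-in similarity: at least 3 positions agree (not a real alignment score)."""
    return sum(x == y for x, y in zip(a, b)) >= 3


def greedy_cluster(seqs: List[str], similar: Callable[[str, str], bool]) -> List[List[str]]:
    """Each cluster is a list whose first element is its representative."""
    clusters: List[List[str]] = []
    for s in seqs:
        for cluster in clusters:
            if similar(cluster[0], s):   # compare against the representative only
                cluster.append(s)
                break                    # assumption: join the first matching cluster
        else:
            clusters.append([s])         # no matches: s starts a new cluster
    return clusters


clusters = greedy_cluster(["AAAA", "AAAT", "CCCC", "CCCG", "AATT"], toy_similar)
print(clusters)
```

In this toy input, `AATT` is similar to `AAAT` but not to the representative `AAAA`, so it ends up in its own cluster. That pair would be missed by within-cluster comparison, which is the kind of loss that multiple representatives aim to reduce.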
[Figure: clusters with representatives R1 … Rn, each holding sequences S1 … Sm; a new cluster Rn+1 is created holding S1]
§ Multiple representatives
§ Ensure all sequences in a cluster are covered (± T residues)
[Figure: cluster with representatives R1 … Rn and sequences S1 … Sm, m >> n]
§ Reduction in computation time of ~75%
§ Accuracy is excellent
§ Algorithm is not easily parallelized
§ Order in which clusters and representatives are chosen affects result
§ Data (clusters) is shared – difficult to distribute
§ Precise clustering (PC)
§ A pair of proteins is significant if their similarity is above a threshold
§ PC is not a partition – a protein can be in more than one cluster
§ Each cluster has a unique representative RC
§ Two elements in a cluster may not be similar: e1, e2 ∈ C ⊬ f(e1, e2) > T
§ New element e is compared against cluster representatives
§ This does not work!
§ Transitivity: R(e1, e2) implies e2 will be similar to e3 if e1 is similar to e3
⇒ f(i, k) > T ∧ f(j, k) > T
[Figure: elements p1, p2, p3 with relations R(p1, p3), R(p1, p2), f(p2, p3)]
§ Similarity function f
§ Not transitive
§ Comparison order matters
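A small example of why a non-transitive similarity makes representative-based clustering order-dependent. The similarity `f` here is a toy stand-in (number of agreeing positions) and the threshold `T = 2` is arbitrary; both are assumptions for illustration, as is the helper `cluster_in_order`.

```python
T = 2  # hypothetical threshold


def f(a: str, b: str) -> int:
    """Toy similarity: number of positions where two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


def cluster_in_order(seqs):
    """Greedy clustering against the first element (the representative) of each cluster."""
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if f(cluster[0], s) > T:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


A, B, C = "AAAA", "AAAT", "AATT"

# f is not transitive: A~B and B~C hold, but A~C does not.
print(f(A, B) > T, f(B, C) > T, f(A, C) > T)

# The same three sequences give different clusterings in different orders.
print(cluster_in_order([A, B, C]))  # A is the representative
print(cluster_in_order([B, A, C]))  # B is the representative
```

With `A` as representative, `C` is left out of the cluster even though it is similar to `B`; with `B` as representative, all three land together.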
[Figure: Seq A (rep), Seq B, Seq C, with f > T marked on two pairs; comparison orders A B C and B A C; labels: uncovered subsequence, S-W score]
§ Construct clusters one element at a time
§ First element becomes cluster representative
§ Compare subsequent elements against cluster representative
§ If transitively similar, add to cluster
§ If only similar, add to cluster and create a new cluster
§ Continue until all elements clustered
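The construction steps above can be sketched as follows. This is one reading of the slides, not the authors' implementation: `precise_cluster` is a hypothetical name, "transitively similar" is taken to mean both `f(rep, e) > T` and `R(rep, e)`, and an element starts its own cluster whenever no existing representative is transitively similar to it. The toy `f` (shared tags) and `R` (identical tag sets) are assumptions chosen so that `R` satisfies the stated transitivity property. The sketch does not attempt to reproduce the order-independence of the real algorithm.

```python
def precise_cluster(elements, f, R, T):
    """Build clusters one element at a time; each cluster is (representative, members)."""
    clusters = []
    for e in elements:
        absorbed = False
        for rep, members in clusters:
            if f(rep, e) > T:          # similar to this representative
                members.append(e)      # add to the cluster either way
                if R(rep, e):          # transitively similar: no new cluster needed
                    absorbed = True
        if not absorbed:
            # Only similar, or no match at all: e becomes a new representative.
            clusters.append((e, [e]))
    return clusters


# Toy data (hypothetical): each protein is described by a set of tags.
tags = {"p1": {"x", "y"}, "p2": {"x", "y"}, "p3": {"y", "z"}, "p4": {"w"}}


def shared(a, b):
    """Toy similarity f: number of shared tags."""
    return len(tags[a] & tags[b])


def same_tags(a, b):
    """Toy transitivity R: identical tag sets are similar to exactly the same elements."""
    return tags[a] == tags[b]


clusters = precise_cluster(["p1", "p2", "p3", "p4"], shared, same_tags, T=0)
for rep, members in clusters:
    print(rep, members)
```

Here `p2` is transitively similar to `p1` and is simply added, `p3` is only similar and so both joins `p1`'s cluster and starts its own, and `p4` matches nothing. Note that `p3` ends up in two clusters, consistent with precise clustering not being a partition.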
§ Unlike the original Wittwer algorithm, order does not matter for precise clustering
§ Clusters can be constructed independently and merged
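One way to picture "construct independently and merge" is to start from singleton cluster sets and merge them pairwise through a queue until one set remains. The sketch below is an assumption-laden illustration: `merge_sets` and `cluster_by_merging` are hypothetical names, the rule for handling clusters whose representatives are similar but not transitively similar is a guess, and the toy `f`/`R` (same first letter) is an equivalence relation, so only the full-merge path is exercised.

```python
from collections import deque


def merge_sets(set1, set2, f, R, T):
    """Merge two cluster sets; each cluster is a (representative, members) pair."""
    result = [(rep, set(members)) for rep, members in set1]
    n_original = len(result)
    for rep2, members2 in set2:
        absorbed = False
        for rep1, members1 in result[:n_original]:
            if f(rep1, rep2) > T:
                if R(rep1, rep2) and not absorbed:
                    members1 |= members2      # transitively similar: fold the cluster in
                    absorbed = True
                else:
                    members1.add(rep2)        # only similar: share the representative
        if not absorbed:
            result.append((rep2, set(members2)))
    return result


def cluster_by_merging(elements, f, R, T):
    """Start from one singleton set per element and merge sets two at a time."""
    queue = deque([[(e, {e})] for e in elements])
    while len(queue) > 1:
        a, b = queue.popleft(), queue.popleft()
        queue.append(merge_sets(a, b, f, R, T))
    return queue[0] if queue else []


def same_family(a, b):
    """Toy similarity: 1.0 if the names share a first letter, else 0.0."""
    return 1.0 if a[0] == b[0] else 0.0


final = cluster_by_merging(
    ["a1", "b1", "a2", "c1", "b2"],
    f=same_family,
    R=lambda a, b: a[0] == b[0],
    T=0.5,
)
print(sorted(sorted(members) for _, members in final))
```

Because each merge touches only its two input sets, the merges waiting in the queue are independent of one another, which is the property that makes them candidates for parallel execution.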
[Figure: Set 1 (clusters 1, 2) merged with Set 2 (clusters 3, 4), giving clusters 1, 4, and 2 & 3]
§ Parallelize merge of two large sets
§ Each computation is a partial merge
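A rough sketch of splitting one large merge into partial merges that a thread pool can run. Treating a partial merge as "one cluster of set 2 compared against every representative of set 1" is an assumption, as are the names `partial_merge` and `parallel_merge` and the toy `f`/`R`. In CPython, threads will not speed up pure-Python comparisons because of the global interpreter lock, so this shows only the structure of the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_merge(reps1, cluster2, f, R, T):
    """Read-only step: find which set-1 representatives match this set-2 cluster."""
    rep2, _ = cluster2
    hits = []
    for i, rep1 in enumerate(reps1):
        if f(rep1, rep2) > T:
            hits.append((i, R(rep1, rep2)))   # (index in set 1, transitively similar?)
    return hits


def parallel_merge(set1, set2, f, R, T, workers=4):
    """Run the comparisons as independent partial merges, then apply the results."""
    result = [(rep, set(members)) for rep, members in set1]
    reps1 = [rep for rep, _ in set1]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        all_hits = list(pool.map(lambda c: partial_merge(reps1, c, f, R, T), set2))
    # Applying the results is sequential here; hits only index original set-1 clusters.
    for (rep2, members2), hits in zip(set2, all_hits):
        absorbed = False
        for i, transitive in hits:
            if transitive and not absorbed:
                result[i][1].update(members2)
                absorbed = True
            else:
                result[i][1].add(rep2)
        if not absorbed:
            result.append((rep2, set(members2)))
    return result


def same_family(a, b):
    """Toy similarity: 1.0 if the names share a first letter, else 0.0."""
    return 1.0 if a[0] == b[0] else 0.0


set1 = [("a1", {"a1"}), ("c1", {"c1"})]
set2 = [("b2", {"b1", "b2"}), ("a2", {"a2", "a3"})]
merged = parallel_merge(set1, set2, same_family, lambda a, b: a[0] == b[0], T=0.5)
print(sorted(sorted(members) for _, members in merged))
```

The comparison phase only reads `reps1`, so partial merges do not contend for shared state; only the short apply phase mutates the result.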
[Figure: Set Merge Queue, Partial Merge Queue, ThreadPool, Cluster Sets; steps: Partial Merge, done?]
[Figure: Set Merge Queue, Work Queue, Cluster Sets, Batch / Split, Workers (Batch? Split, Partial? Partial Merge), Partial Merge ID, Partially Merged State, Add new seqs]
§ Every remote worker has copy of all sequences
§ Workers cache copies of sets and only transfer diffs
§ Careful queue management
§ Aggressive load balancing
§ Recall
§ Scalability / performance
§ Dataset
§ Similarity
§ Transitivity
§ Incremental greedy clustering (1 / 3 representatives)
§ Precise cluster merge (Shared-CM/Dist-CM)
§ Smaller data set (28,600 sequences)
§ Incremental greedy clustering [Wittwer] (1 / 3 representatives)
§ Original clustering (1 representative)
§ Shared-CM (48 threads)
60.2x speedup
48 threads; hyperthreading of no benefit
Vary number of nodes
Dist-CM: 604x on 32 nodes (768 cores), 79% efficiency
1,400x over Wittwer
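The efficiency figure follows from the speedup and core count, assuming efficiency is defined as speedup divided by the number of cores. The function name below is illustrative.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved on the given number of cores."""
    return speedup / cores


eff = parallel_efficiency(604, 768)   # Dist-CM on 32 nodes (768 cores)
print(f"{eff:.1%}")                   # rounds to the 79% quoted on the slide
```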
§ Dataset of 13 bacterial genomes
§ Dataset of 33 closely related Streptococcus bacteria genomes
§ Closely related ⇒ fewer clusters
§ Shared-CM (48 threads)
§ Larger, more diverse datasets (w/ friends from UNIL)
§ Seeding clusters with known significant pairs
§ Hardware acceleration of Smith-Waterman comparison
§ Think beyond DNA!
§ Hardware acceleration is premature if your application does not have near-linear speedup on a cluster
§ Keeping cores busy is key to efficient parallelism
James Larus