Protein Clustering: Parallelizing an Expensive, Irregular - - PowerPoint PPT Presentation



SLIDE 1

Protein Clustering: Parallelizing an Expensive, Irregular Computation

James Larus EPFL

AACBB

February 23, 2019 San Diego, CA

SLIDE 2

PhD research


Stuart Byma PhD “Parallel and Scalable Bioinformatics”, April 2020

SLIDE 3

§ Linear polymer of amino acids

  • Fold into complex 3D structures

§ Perform many biological functions

What’s a protein?

SLIDE 4

§ Gene Expression

  • DNA → Protein

§ Encoded by genes in genome

§ 19,000 – 20,000 proteins in humans

  • 1.5% of human genome

§ Composed of 20 amino acids

Central dogma of molecular biology


DNA → RNA → Protein (transcription, then translation)

SLIDE 5

Transcription


§ Transcribe DNA to RNA inside the nucleus

SLIDE 6

Translation

§ Once in cytoplasm, mRNA is translated to polypeptide


https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:Ribosome_mRNA_translation_en.svg

SLIDE 7

§ Polypeptides fold spontaneously, or are assisted by chaperone proteins

Folding

SLIDE 8

§ Homologous – similar due to shared ancestry

§ Ortholog – similar proteins diverged through speciation

§ Similarities between proteins are proxies for similarities between genes

  • Infer function of new protein because of its similarity to known protein

§ Extrapolation from small number of model organisms

  • Infer evolutionary relationships between species

§ X evolved from Y

§ X, Y have common ancestor

§ Several of the 100 most-cited scientific papers are about sequence homology

Proteins & evolution

SLIDE 9

Sequence homology


Human (Homo Sapiens) Bonobo (Pan Paniscus)

Alignment showing protein similarity between hemoglobin α-subunits from human and bonobo proteins

SLIDE 10

§ Input → sequenced proteins

§ Output → sets of homologous proteins

§ All-against-all comparison

  • O(n²) in number of sequences
  • Sequence comparison also O(n²) in length of sequences (Smith-Waterman)

§ OMA protein database contains proteins from 2000 genomes

  • Required more than 10 million CPU hours

Identifying similar proteins
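To make the cost concrete, here is a minimal sketch of a textbook Smith-Waterman score for one pair, plus the number of pairs an all-against-all pass must score. The match, mismatch, and gap values are illustrative assumptions; they are not the substitution matrix or gap model used in the work presented here.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_against_all_pairs(n: int) -> int:
    """Number of unordered pairs scored by an all-against-all comparison."""
    return n * (n - 1) // 2
```

The inner double loop is the quadratic cost in sequence length; `all_against_all_pairs` is the quadratic cost in the number of sequences.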

SLIDE 11

Improvement needed!


“Computing orthologs between all complete proteomes has recently gone from typically a matter of CPU weeks to hundreds of CPU years, and new, faster algorithms and methods are called for.” – Quest for Orthologs consortium, 2014

SLIDE 12

§ Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology, PeerJ, 2014, Wittwer, Pilizota, Altenhoff, Dessimoz.

§ Cluster similar proteins, then perform all-against-all comparison within each cluster

§ Reduces computation time by ~75%

§ Identifies >99.6% of pairs found by all-vs-all

Incremental greedy protein clustering


… Cluster 0 Cluster N

SLIDE 13

§ Input sequences compared against a cluster representative

  • Homologies are transitive

§ A, B homologous; B, C homologous ⇒ A, C homologous

§ No matches? Create a new cluster!

Cluster representative

Cluster R1 S1 … Sm Cluster Rn S1 … Sm


Cluster Rn+1 S1
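The single-representative scheme on this slide can be sketched as follows. This is a minimal illustration, not the published implementation: `similar` stands in for the real sequence comparison, `shares_prefix` is a toy predicate invented for the demo, and adding a sequence to every cluster whose representative it matches is an assumption.

```python
from typing import Callable, List


def greedy_cluster(seqs: List[str],
                   similar: Callable[[str, str], bool]) -> List[List[str]]:
    """Each cluster is a list whose first element is its representative."""
    clusters: List[List[str]] = []
    for s in seqs:
        matched = False
        for c in clusters:
            if similar(c[0], s):   # compare against the representative only
                c.append(s)
                matched = True
        if not matched:            # no matches: create a new cluster
            clusters.append([s])
    return clusters


def shares_prefix(a: str, b: str) -> bool:
    """Toy similarity (demo assumption): same first three letters."""
    return a[:3] == b[:3]
```

For example, `greedy_cluster(["MKVA", "MKVB", "QQQA"], shares_prefix)` groups the two `MKV` sequences under the first one and opens a second cluster for `QQQA`.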

SLIDE 14

Proteins not transitive

SLIDE 15

§ Multiple representatives

§ Ensure all sequences in a cluster are covered (± T residues)

Clustering, v2

Cluster R1 … Rn S1 … Sm >> n

SLIDE 16

§ Reduction in computation time of ~75%

  • Clusters are small, on average

§ Accuracy is excellent

  • Maintain >99.6% of all pairs identified by all-against-all (naive)

Incremental greedy protein clustering

SLIDE 17

§ Algorithm is not easily parallelized

§ Order in which clusters and representatives are chosen affects result

§ Data (clusters) is shared – difficult to distribute

But,

SLIDE 18

§ Precise clustering (PC)

  • All significant pairs are members of at least one cluster
  • Compare within cluster and find similarity

§ A pair of proteins is significant if their similarity is above a threshold

  • f(p1, p2) > T

§ PC is not a partition – a protein can be in more than one cluster

  • Relation f is not transitive, i.e. similarity is not equivalence

Our approach: precise clustering


p1 p2 p3 p2 p4
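The definition of a precise clustering can be checked directly on small inputs. The sketch below lists significant pairs that share no cluster; an empty result means the clustering is precise. The `closeness` function is a toy, non-transitive similarity invented for the demo, not a protein comparison.

```python
from itertools import combinations
from typing import Callable, Iterable, List, Set, Tuple


def uncovered_pairs(elements: List[int],
                    clusters: Iterable[Set[int]],
                    f: Callable[[int, int], float],
                    threshold: float) -> Set[Tuple[int, int]]:
    """Significant pairs (f > threshold) that are not together in any cluster."""
    cluster_list = list(clusters)
    missing = set()
    for a, b in combinations(elements, 2):
        if f(a, b) > threshold and not any(a in c and b in c for c in cluster_list):
            missing.add((a, b))
    return missing


def closeness(a: int, b: int) -> float:
    """Toy similarity (demo assumption): larger when the integers are closer."""
    return 10 - abs(a - b)
```

With a threshold of 8, elements 1, 2, 3 have significant pairs (1, 2) and (2, 3) but not (1, 3). The overlapping clusters `{1, 2}` and `{2, 3}` are precise, while the partition `{1, 2}`, `{3}` misses (2, 3), which is why a precise clustering cannot always be a partition.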

SLIDE 19

§ Each cluster has a unique representative RC

  • ∀e ∈ C, f (e, RC ) > T

§ Two elements in cluster may not be similar: e1, e2 ∈ C ⊬ f (e1, e2) > T

Cluster representative


p1 p2

SLIDE 20

§ New element e is compared against cluster representatives

  • If similar, e is added to cluster

§ This does not work!

  • e, other than representative, will not be compared against subsequent elements
  • Because f is not transitive, clustering will not be precise – may miss matches

Approach 1


p5 p1 p2 p3 p2 p4

SLIDE 21

§ Transitivity R(e1, e2) implies e2 will be similar to e3 if e1 is similar to e3

  • ∀ i, j, k ∈ S, R(i, j) ⇒ f(i, k) > T ∧ f(j, k) > T

Transitive similarity


p1 p2 p3

R(p1, p3) R(p1, p2) f(p2, p3)

SLIDE 22

§ Similarity function f

  • Smith-Waterman alignment score > T (threshold parameter)

§ Not transitive

§ Comparison order matters

Protein similarity


Seq A (rep) Seq B

Seq C

f > T f > T

A B C B A C

SLIDE 23

Protein transitivity


X Y

Uncovered Subsequence

S-W score

uX uY

R(X, Y): score > minT, uY < maxU

R(Y, X): score > minT, uX < maxU
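One possible reading of the R(X, Y) test, written as a predicate. The `Alignment` record and its field names are hypothetical, and treating the uncovered length uY as "residues of Y outside the aligned region" is an assumption about what the figure shows.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical result of aligning X against Y."""
    score: int     # Smith-Waterman score
    y_start: int   # first aligned position in Y (0-based, inclusive)
    y_end: int     # one past the last aligned position in Y
    y_len: int     # total length of Y


def uncovered(aln: Alignment) -> int:
    """Residues of Y outside the aligned region (assumed meaning of uY)."""
    return aln.y_len - (aln.y_end - aln.y_start)


def transitively_similar(aln: Alignment, min_t: int, max_u: int) -> bool:
    """R(X, Y): score > minT and uY < maxU."""
    return aln.score > min_t and uncovered(aln) < max_u
```

The test is asymmetric because the uncovered length is measured on the second sequence, so R(X, Y) and R(Y, X) need separate checks.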

SLIDE 24

§ Construct clusters one element at a time

§ First element becomes cluster representative

Incremental greedy precise clustering


p1

SLIDE 25

§ Compare subsequent elements against cluster representative

Incremental greedy precise clustering


R? f?

p1 p2

SLIDE 26

§ If transitively similar, add to cluster

Incremental greedy precise clustering


R

p1 p2 p1 p2

SLIDE 27

§ If only similar, add to cluster and create a new cluster

Incremental greedy precise clustering


f

p1 p2 p1 p2

p2

SLIDE 28

§ Continue until all elements clustered

Incremental greedy precise clustering
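The steps on the last few slides can be put together in a short sketch. This is one reading of the slides, not the authors' code: `similar` plays the role of f > T, `transitive` plays the role of R, and both are supplied by the caller. The toy predicates in the usage note are invented for illustration.

```python
from typing import Callable, List, TypeVar

E = TypeVar("E")


def precise_cluster(elements: List[E],
                    similar: Callable[[E, E], bool],
                    transitive: Callable[[E, E], bool]) -> List[List[E]]:
    """Each cluster is a list whose first element is its representative."""
    clusters: List[List[E]] = []
    for e in elements:
        covered = False                # True once a representative "stands in" for e
        for c in clusters:
            rep = c[0]
            if transitive(rep, e):     # transitively similar: add to cluster
                c.append(e)
                covered = True
            elif similar(rep, e):      # only similar: add, but e still needs its own cluster
                c.append(e)
        if not covered:                # first element, no match, or only-similar matches
            clusters.append([e])
    return clusters
```

With integers, `similar = lambda a, b: abs(a - b) <= 2` and `transitive = lambda a, b: abs(a - b) <= 1`, the input `[10, 11, 12, 20]` puts 12 in the cluster of 10 and also makes it the representative of a new cluster, so an element can appear in more than one cluster.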

SLIDE 29

§ Unlike original Wittwer algorithm, order does not matter for precise clustering

§ Clusters can be constructed independently and merged

Parallelism
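The "construct independently, then merge" structure might look like the sketch below. It is deliberately simplified and every name in it is hypothetical: only the transitive relation is used, the merge compares representatives only, and the final merge is a sequential fold rather than a parallel one.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Callable, List

Clusters = List[List[int]]


def cluster_chunk(chunk: List[int], transitive: Callable[[int, int], bool]) -> Clusters:
    """Cluster one chunk on its own; first element of a cluster is its representative."""
    clusters: Clusters = []
    for e in chunk:
        for c in clusters:
            if transitive(c[0], e):
                c.append(e)
                break
        else:
            clusters.append([e])
    return clusters


def merge_sets(a: Clusters, b: Clusters, transitive: Callable[[int, int], bool]) -> Clusters:
    """Fold set b into set a by comparing cluster representatives (simplified)."""
    merged = [list(c) for c in a]
    for cb in b:
        for ca in merged:
            if transitive(ca[0], cb[0]):
                ca.extend(cb)
                break
        else:
            merged.append(list(cb))
    return merged


def parallel_cluster(elements: List[int], transitive: Callable[[int, int], bool],
                     chunk_size: int = 2, workers: int = 4) -> Clusters:
    chunks = [elements[i:i + chunk_size] for i in range(0, len(elements), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Chunks are clustered independently; map() returns results in input order.
        sets = list(pool.map(lambda ch: cluster_chunk(ch, transitive), chunks))
    return reduce(lambda x, y: merge_sets(x, y, transitive), sets, [])
```

A real merge would also need to handle representatives that are similar but not transitively similar, which this sketch leaves out.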


R() ?

SLIDE 30

Merging clusters


R() ?

SLIDE 31

Merging clusters


R() ? f() ? f()

SLIDE 32

Merging sets of clusters


Set 1 Set 2 1 2 3 4 1 4 2 & 3

SLIDE 33

Cluster merge

SLIDE 34

Parallelization 1

SLIDE 35

§ Parallelize merge of two large sets

§ Each computation is a partial merge

Parallelization 2

SLIDE 36

Shared-memory (Shared-CM)


Set Merge Queue

Partial Merge Queue

ThreadPool Cluster Sets

done?

Partial Merge

SLIDE 37

Distributed (Dist-CM)


Set Merge Queue

Work Queue Cluster Sets

Batch / Split

Workers Batch? Split Partial? Partial Merge

Partial Merge ID Partially Merged State ID State

Add new seqs

SLIDE 38

§ Every remote worker has copy of all sequences

  • Sequences named by index (4-byte)

§ Workers cache copies of sets and only transfer diffs

§ Careful queue management

§ Aggressive load balancing

Dist-CM optimization
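To illustrate the first two optimizations, the sketch below names sequences by 4-byte indices and computes the diff between a cached and a current cluster. The wire format (little-endian unsigned 32-bit integers) and all function names are assumptions made for the example; the slides only state that indices are 4 bytes and that diffs are transferred.

```python
import struct
from typing import Iterable, List, Set, Tuple


def encode_indices(indices: Iterable[int]) -> bytes:
    """Pack sequence indices as little-endian unsigned 32-bit integers."""
    idx = list(indices)
    return struct.pack(f"<{len(idx)}I", *idx)


def decode_indices(payload: bytes) -> List[int]:
    """Inverse of encode_indices."""
    return list(struct.unpack(f"<{len(payload) // 4}I", payload))


def cluster_diff(cached: Set[int], current: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Return (added, removed) indices relative to the worker's cached copy."""
    return current - cached, cached - current
```

Because every worker already holds all sequences, a cluster of three sequences costs 12 bytes on the wire, and an update only needs the added and removed indices.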

SLIDE 39

§ Recall

  • Number of significant pairs, relative to all-against-all

§ Scalability / performance

Evaluation
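Recall as defined here can be computed on a toy input as follows. This is a minimal sketch: elements are assumed to be sortable so that each pair has one canonical order, and `closeness` is an invented similarity, not a protein score.

```python
from itertools import combinations
from typing import Callable, Iterable, List


def recall(elements: List[int],
           clusters: Iterable[Iterable[int]],
           f: Callable[[int, int], float],
           threshold: float) -> float:
    """Significant pairs found inside clusters, relative to all-against-all."""
    truth = {(a, b) for a, b in combinations(sorted(elements), 2)
             if f(a, b) > threshold}
    found = set()
    for c in clusters:
        for a, b in combinations(sorted(c), 2):
            if f(a, b) > threshold:
                found.add((a, b))
    return len(found & truth) / len(truth) if truth else 1.0


def closeness(a: int, b: int) -> float:
    """Toy similarity (demo assumption): larger when the integers are closer."""
    return 10 - abs(a - b)
```

With a threshold of 8 and elements 1 to 4, the clusters `[1, 2]` and `[3, 4]` find two of the three significant pairs, while `[1, 2, 3]` and `[3, 4]` find all of them.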

SLIDE 40

§ Dataset

  • 13 bacterial genomes, ~59,000 sequences

§ Similarity

  • S-W threshold of 181 with PAM250 substitution matrix (Wittwer)

§ Transitivity

  • mT = 250, mU = 15

§ Incremental greedy clustering (1 / 3 representatives)

  • 99.6% / 99.9% recall (compared to all-vs-all)

§ Precise cluster merge (Shared-CM/Dist-CM)

  • 99.8 ± 0.01% recall
  • Missed 10-6 significant pairs, mainly low scoring ones (avg. 191, median 235)

Recall

SLIDE 41

Sensitivity analysis

SLIDE 42

§ Smaller data set (28,600 sequences)

§ Incremental greedy clustering [Wittwer] (1 / 3 representatives)

  • 4x / 2x faster than all-vs-all

§ Original clustering (1 representative)

  • 89,486 seconds = 24.9 hours

§ Shared-CM (48 threads)

  • 1,486 seconds = 0.41 hours

Shared-memory speedup


60.2x speedup
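The speedup figure follows directly from the two timings on this slide. The short check below only reproduces that arithmetic; the constant and function names are chosen for the example.

```python
ORIGINAL_SECONDS = 89_486    # original clustering, 1 representative
SHARED_CM_SECONDS = 1_486    # Shared-CM, 48 threads


def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Ratio of baseline run time to parallel run time."""
    return baseline_seconds / parallel_seconds


def to_hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600
```

`speedup(ORIGINAL_SECONDS, SHARED_CM_SECONDS)` rounds to 60.2, and the two run times convert to roughly 24.9 and 0.41 hours.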

SLIDE 43

Shared-memory scalability


  • 24-core Xeon
  • 48 threads
  • Hyperthreading of no benefit

SLIDE 44

Shared-memory scalability

SLIDE 45

Distributed – strong scaling


  • Dataset fixed
  • Vary number of nodes
  • Dist-CM 604x on 32 nodes (768 cores)
  • 79% efficiency
  • 1,400x over Wittwer
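The efficiency figure is the speedup divided by the number of cores. The helper below is only a worked check of that arithmetic; the function name is chosen for the example.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores
```

For the numbers on this slide, `parallel_efficiency(604, 768)` is about 0.786, which rounds to the 79% quoted.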

SLIDE 46

Distributed – weak scaling


  • Dataset grows ~ n
SLIDE 47

§ Dataset of 13 bacterial genomes

  • 59,013 sequences

§ Dataset of 33 closely related Streptococcus bacteria genomes

  • 69,648 sequences

§ Closely related ⇒ fewer clusters

  • Closer to O(n log n) performance

§ Shared-CM (48 threads)

  • Streptococcus 283 sec. and 10,500 clusters
  • (vs 1,486 sec. and 33,562 clusters)

Dataset composition

SLIDE 48

§ Larger, more diverse datasets (w/ friends from UNIL)

§ Seeding clusters with known significant pairs

§ Hardware acceleration of Smith-Waterman comparison

  • Proteins are long (300 - 30,000 amino acids)
  • Alphabet is richer (20 amino acids)
  • More complex scoring function.

Improvements / Future work

SLIDE 49

§ Think beyond DNA!

  • Proteins are richer and more challenging than DNA

§ Hardware acceleration is premature if your application does not have near-linear speedup on a cluster

  • Bioinformatics needs parallel algorithms and implementations

§ Keeping cores busy is key to efficient parallelism

  • Communications efficiency
  • Work distribution and load balancing

Conclusion

SLIDE 50

Merci

James Larus