Information-Theoretic Analysis of Molecular (Co)Evolution Using Graphics Processing Units


SLIDE 1

Information-Theoretic Analysis of Molecular (Co)Evolution Using Graphics Processing Units

Michael Waechter, Kathrin Jaeger, Stephanie Weissgraeber, Sven Widmer, Michael Goesele, and Kay Hamacher

June 18, 2012 | ECMLS 2012 | Michael Waechter & Stephanie Weissgraeber

...AEERYAEYKEAFTLFDSDGD...
...TEEQGRQFRQMFEMFDKNGD...
...TDEQQRQYRQMFETFDKDGN...
...TKEQVEEFKQAFSMFDTDGD...
...SEEQVAEFKEAFDRFDKNKD...
...SKEQVAKFKEAFDRIDKNKD...
...SPEQVAEFKQAFSRFDKNGD...
...SEEQVAKFKAAFSRFDTNGD...
...PPEQVAKFKEVFSRFDKNGD...

SLIDE 2

Motivation

  • Huge number of Multiple Sequence Alignments (MSAs) available, some of them very large
  • E.g., HIV protease [1]: > 45,000 sequences of length > 1400
  • Put them to use for coevolutionary and structural analysis
  • But: Our computations take > 25 days

[1] Pan et al.: “The HIV positive selection mutation database”

SLIDE 3

Outline

  • In this talk we will show…
    – MSA analysis using Mutual Information
    – GPU parallelization & speed improvements
    – 3-point Mutual Information
    – an application to a well-known protein
    – that the use of this is beneficial

contributions

SLIDE 4

Introduction – Mutual Information

  • Given an MSA:

Sequence 1: AEERYAEYKEAFTLFDSDGD...
Sequence 2: TEEQGRQFRQMFEMFDKNGD...
Sequence 3: TDEQQRQYRQMFETFDKDGN...
Sequence 4: TKEQVEEFKQAFSMFDTDGD...
Sequence 5: SEEQVAEFKEAFDRFDKNKD...
Sequence 6: SKEQVAKFKEAFDRIDKNKD...
Sequence 7: SPEQVAEFKQAFSRFDKNGD...
Sequence 8: SEEQVAKFKAAFSRFDTNGD...

  • Mutual Information between two columns (correlation → coevolution)
  • Iteration over all column pairs → MI matrix
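The two steps on this slide can be sketched as follows. This is a minimal illustration, assuming the standard definition MI(X;Y) = Σ p(x,y) · log2[ p(x,y) / (p(x) · p(y)) ] with probabilities estimated as relative frequencies in the columns. The function names and the four-sequence toy alignment are hypothetical, and this is not the authors' GPU code.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_x, col_y):
    """MI in bits between two equally long MSA columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def mi_matrix(msa):
    """Symmetric MI matrix over all column pairs; the diagonal is left at 0."""
    columns = list(zip(*msa))  # transpose: one tuple per MSA column
    m = len(columns)
    matrix = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        matrix[i][j] = matrix[j][i] = mutual_information(columns[i], columns[j])
    return matrix


# Toy alignment (hypothetical input: four sequences, four columns).
msa = ["AEER", "TEEQ", "TDEQ", "TKEQ"]
matrix = mi_matrix(msa)
```

In the toy alignment, columns 0 and 3 vary together (A with R, T with Q) and get a positive MI, while the fully conserved column 2 has an MI of zero with every other column.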

SLIDE 5

Introduction – Shuffling Null-Model

  • MI is sensitive to the underlying amino acid distribution
  • Computational Normalization: Shuffling Null-Model [2]
  • Is MI distinguishable from “random evolution” MI?

[2] K. Hamacher: “Relating sequence evolution of HIV1-protease to its underlying molecular mechanics”

SLIDE 6

Introduction – Shuffling Null-Model

  • Compute original MI
  • Iterate 10,000 times:
    – Shuffle each MSA column
    – Compute rand. MI matrix
  • Normalize original MI using random MI

[Figure: example MSA excerpts (AEER..., TEEQ..., TDEQ..., ...) illustrating column shuffling]
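The procedure can be sketched for a single column pair as below. The normalization formula is not legible in this transcript, so the sketch assumes a Z-score, (original MI − mean of random MI) / standard deviation of random MI, which is one common choice. The function names, the default iteration count, and the seed are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def mutual_information(col_x, col_y):
    """MI in bits between two equally long MSA columns."""
    n = len(col_x)
    count_x, count_y = Counter(col_x), Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def shuffled_z_score(col_x, col_y, iterations=1000, seed=0):
    """Normalize the original MI against MI values of shuffled columns.

    Assumption: Z-score normalization; the slide's own formula is not shown.
    """
    rng = random.Random(seed)
    original = mutual_information(col_x, col_y)
    x, y = list(col_x), list(col_y)
    random_mi = []
    for _ in range(iterations):
        # Shuffling within a column keeps its amino acid distribution
        # but destroys any pairing between the two columns.
        rng.shuffle(x)
        rng.shuffle(y)
        random_mi.append(mutual_information(x, y))
    mean = statistics.fmean(random_mi)
    std = statistics.pstdev(random_mi)
    return (original - mean) / std if std > 0 else 0.0


z_coupled = shuffled_z_score("AATTAATTAATT", "RRQQRRQQRRQQ")
z_conserved = shuffled_z_score("EEEEEEEEEEEE", "RRQQRRQQRRQQ")
```

A perfectly coupled pair scores far above the shuffled background, whereas a conserved column yields zero MI in every shuffle and therefore a score of zero.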

SLIDE 7

Massive parallelism

  • Highly compute-intensive
  • HIV-1 protease on single core:
    – MI computation for all column pairs: ~3.5 min
    – Repeat for 10,000 iterations: > 25 days
  • But:
    – Computation of each MI matrix entry independent of all others
    – Shuffling of each MSA column independent of all others
  • Parallelizable (to hundreds of thousands of threads)
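A quick back-of-the-envelope check of these numbers, using only the figures quoted on the slides. The column count of 1400 is a lower bound, since the alignment length is given as "> 1400".

```python
from math import comb

minutes_per_matrix = 3.5   # one MI matrix on a single core (from the slide)
iterations = 10_000        # shuffling iterations (from the slide)
columns = 1400             # lower bound; the slides say length > 1400

mi_days = minutes_per_matrix * iterations / (60 * 24)
independent_entries = comb(columns, 2)  # number of distinct column pairs

print(f"MI computation alone: ~{mi_days:.1f} days")              # ~24.3 days
print(f"Independent MI matrix entries: {independent_entries:,}")  # 979,300
```

The MI computation alone accounts for roughly 24 days, so the quoted total of more than 25 days is plausible once shuffling is included (an assumption about what that figure covers). Close to a million independent matrix entries is what makes hundreds of thousands of threads usable.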

SLIDE 8

GPU Implementation

  • Iterate 10,000 times:
  • Shuffling
    – Map MSA columns to blocks of threads
    – Shuffle columns (GPU-suited algorithm)
    – Synchronize
  • MI computation
    – Map MI matrix entries to blocks of threads (suitable for MSA access pattern)
    – Compute MI matrix entries
    – Synchronize
  • Combine results & normalize orig. MI with randomized MI

...AEERYA...
...TEEQGR...
...TDEQQR...
...TKEQVE...
...SEEQVA...
...SKEQVA...
...SPEQVA...
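The two-phase structure of one iteration can be mimicked on the CPU as a rough analogy. In this sketch a Python thread pool stands in for GPU thread blocks, and waiting for all tasks of a phase plays the role of the "Synchronize" step. It illustrates the task decomposition only, not the speed or the actual GPU kernels; all names and the toy alignment are hypothetical.

```python
import math
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def mutual_information(col_x, col_y):
    """MI in bits between two equally long MSA columns."""
    n = len(col_x)
    count_x, count_y = Counter(col_x), Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def shuffle_column(task):
    """One independent task per MSA column, each with its own seed."""
    column, seed = task
    column = list(column)
    random.Random(seed).shuffle(column)
    return column


def one_iteration(columns, pool, seed):
    # Phase 1: map MSA columns to workers and shuffle them.
    tasks = [(col, seed * len(columns) + k) for k, col in enumerate(columns)]
    shuffled = list(pool.map(shuffle_column, tasks))  # list() waits: "synchronize"

    # Phase 2: map MI matrix entries (column pairs) to workers.
    pairs = list(combinations(range(len(columns)), 2))
    values = pool.map(
        lambda p: mutual_information(shuffled[p[0]], shuffled[p[1]]), pairs
    )
    return dict(zip(pairs, values))  # consuming the iterator waits: "synchronize"


columns = list(zip(*["AEER", "TEEQ", "TDEQ", "TKEQ"]))  # toy alignment
with ThreadPoolExecutor(max_workers=4) as pool:
    random_mi = one_iteration(columns, pool, seed=0)
```

No task in either phase reads another task's output from the same phase, which is the independence property that the parallelization relies on.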

SLIDE 9

Speed Results

  • Problem size dependent

MSA | GeForce GTX 480 | 4 threads on Core i7-960 | Speed-up
Calmodulin (753 sequences of length 264) | 1.1 min | 13.4 min | ~12x
HIV-1 protease (> 45,000 seqs. of length > 1400) | 1.85 days | 7.3 days | ~4x

SLIDE 10

Implications

  • One order of magnitude speed-up
  • Quickly redo previous steps (e.g., alignment) and recompute MI
  • New analysis tool feasible:
    – 3-point MI: Coevolution of a ‘3-clique’ of MSA columns
  • Can we deduce more information from 3-point MI than we could from 2-point MI alone?
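The slides do not spell out how 3-point MI is defined. The sketch below assumes one common generalization, the total correlation Σ p(x,y,z) · log2[ p(x,y,z) / (p(x) · p(y) · p(z)) ], which may differ from the authors' definition. The function name and toy columns are hypothetical.

```python
import math
from collections import Counter
from math import comb


def three_point_mi(col_x, col_y, col_z):
    """Total correlation in bits of three equally long MSA columns.

    Assumption: this generalization of MI is used for illustration only.
    """
    n = len(col_x)
    cx, cy, cz = Counter(col_x), Counter(col_y), Counter(col_z)
    cxyz = Counter(zip(col_x, col_y, col_z))
    total = 0.0
    for (x, y, z), c in cxyz.items():
        # p(x,y,z) / (p(x) p(y) p(z)) simplifies to c * n^2 / (cx * cy * cz)
        total += (c / n) * math.log2(c * n * n / (cx[x] * cy[y] * cz[z]))
    return total


coupled = three_point_mi("AATT", "RRQQ", "EEDD")    # three columns varying together
conserved = three_point_mi("AAAA", "RRRR", "EEEE")  # no variation at all

# Work per null-model iteration for an alignment of length 264 (Calmodulin MSA).
pairs = comb(264, 2)
triples = comb(264, 3)
```

Moving from column pairs to column triples multiplies the number of entries per iteration by roughly two orders of magnitude for an alignment of this length, which suggests why the speed-up matters for making the analysis feasible.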

SLIDE 11

Calmodulin

  • 149 amino acids
  • Ca2+ binding → conformational change
  • Regulates various signaling pathways

SLIDE 12

Coevolution in Calmodulin – 2-point MI

  • Finding coevolving pairs of amino acids
  • Structural or functional connection
  • Here: Coevolution within N- and C-terminus
  • Ca2+ binding
  • Propagation of conformational change
  • Conserved inner helix
  • No coevolution without variation

SLIDE 13

Coevolution in Calmodulin – 3-point MI

  • ‘3-cliques’ of amino acids
  • Higher order correlations
  • Concerted motions
  • Binding sites

SLIDE 14

Coevolution in Calmodulin – 3-point MI

  • ‘3-cliques’ of amino acids
  • Color indicates the frequency with which an amino acid contributes to the ‘3-cliques’ set
  • Key residues for important functions

SLIDE 15

Conclusions

  • MI for coevolutionary analysis
  • GPU implementation ~10x faster on typical MSAs
  • 3-point MI analysis possible in acceptable time
  • 3-point MI does reveal new insights
  • Next step could be k-point MI
  • It may be possible to detect key residues in unknown proteins

SLIDE 16

What happened since?

  • Multi-GPU parallelization:
    – Distribute Shuffling Null-Model iterations among GPUs
    – First tests: 32 GPUs → ~32x speed-up (on top of basic GPU speed-up!)
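Because the null-model iterations are independent of each other, distributing them is mostly bookkeeping. The sketch below shows one way to split the iteration count among workers and merge their partial statistics. It assumes that the normalization only needs the mean and standard deviation of the random MI per matrix entry; the function names and the tuple layout are hypothetical.

```python
def split_iterations(total, workers):
    """Split `total` iterations as evenly as possible among `workers`."""
    base, extra = divmod(total, workers)
    return [base + (1 if w < extra else 0) for w in range(workers)]


def combine(partials):
    """Merge per-worker (count, sum, sum of squares) for one MI matrix entry
    into a global mean and (population) standard deviation."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    mean = s / n
    variance = max(sq / n - mean * mean, 0.0)  # guard against rounding below 0
    return mean, variance ** 0.5


shares = split_iterations(10_000, 32)  # 16 workers get 313 iterations, 16 get 312

# Two workers that saw the random MI values (1, 2) and (3, 4) for one entry.
mean, std = combine([(2, 3.0, 5.0), (2, 7.0, 25.0)])
```

Each worker only has to return a few running sums per matrix entry, so communication stays small compared with the computation, which fits the near-linear speed-up reported in the first tests.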

SLIDE 17

Thank you.

Please visit tinyurl.com/tud-comic for code & documentation or contact us.