Sequence-Structure Alignment: A General Formulation


  1. Sequence-Structure Alignment: A General Formulation
     "Unifying view on Edit Distance, SA&F, ..."
     IN
     • $S_1, \dots, S_k \in \Sigma^*$: sequences
     • $P_1, \dots, P_k$ with $P_i \subseteq \{1, \dots, |S_i|\}^2$: sets of basepairs
     • score on alignments
     OUT
     Alignment $A = (S^*_1, P^*_1, \dots, S^*_k, P^*_k)$ that maximizes score($A$), where $S^*_i|_\Sigma = S_i$, "$P^*_i|_\Sigma$" $\subseteq P_i$, ...
     Exact conditions and score vary.
     Problem classes: restrict input and output structures, score.
     S.Will, 18.417, Fall 2011

  2. Alignment with Fixed Input Structures
     Jiang et al. A General Edit Distance between RNA Structures. JCB, 2002.
     • "$P^*_i|_\Sigma$" $= P_i$, i.e. output structure = input structure
     • score is a rather general edit distance (breaking of basepairs)
     • only pairwise, $k = 2$
     • efficient only for NESTED/CROSSING with a "not so general score"

  3. Alignment with Fixed Input Structures: Pseudoknots
     • CROSSING/CROSSING, i.e. pseudoknots allowed
     • restricted pseudoknots: e.g., no crossing of 3 basepairs
       Patricia A. Evans. Finding common RNA pseudoknot structures in polynomial time. CPM 2006.
       (Figure: a) a three-knot; b) interleaved left-right endpoints.)
       Möhl, Will, Backofen. Lifting prediction to alignment of RNA pseudoknots. RECOMB 2009.
     • general crossing:
       Möhl, Will, Backofen. Fixed parameter tractable alignment of RNA structures including arbitrary pseudoknots. CPM 2008.

  4. Simultaneous Alignment and Folding (SA&F)
     David Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 1985.
     • "$P^*_i|_\Sigma$" $\subseteq P_i$
     • input structures crossing (all potential basepairs)
     • output structures non-crossing
     Example input:
       S_1 = ACGGACUUACGGACUUGACUCGGACU
       S_2 = CGGAACGUAUACGGACUCCAGACUACGUGCA
     (Figure: potential basepairs $P_1$ drawn over positions 1 to 26 of $S_1$, and $P_2$ over positions 1 to 31 of $S_2$.)

  5. Example SA&F
     IN:
       S_1 = ACGGACUUACGGACUUGACUCGGACU
       S_2 = CGGAACGUAUACGGACUCCAGACUACGUGCA
       (Figure: potential basepairs P_1 over positions 1 to 26 and P_2 over positions 1 to 31.)
     OUT:
       P*_1 ≡ ----.(.((..(........)..)).)...----
       S*_1 = ----ACGGACUUACGGACUUGACUCGGACU----
       S*_2 = CGGAACGUAUACGGACUCCAGACUACG---UGCA
       P*_2 ≡ .....(.((..(........)..)).)---....
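
The conditions from the general formulation can be checked mechanically on this example. The following is a minimal sketch, not part of the original slides: the helper names (`parse_pairs`, `project`, `paired_bases`) are hypothetical, and treating G-U as an admissible pair is an assumption. It checks that removing gaps from each output row gives back the input sequence, that both rows carry the same structure column by column, and that every paired column holds a canonical or wobble pair.

```python
GAP = "-"
# Assumption: canonical Watson-Crick pairs plus G-U wobble pairs are admissible.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def parse_pairs(dotbracket):
    """Return the set of (open_column, close_column) pairs of a dot-bracket row."""
    stack, pairs = [], set()
    for col, ch in enumerate(dotbracket):
        if ch == "(":
            stack.append(col)
        elif ch == ")":
            pairs.add((stack.pop(), col))
    if stack:
        raise ValueError("unbalanced dot-bracket string")
    return pairs

def project(row):
    """Remove gap symbols, i.e. restrict an alignment row to the alphabet."""
    return row.replace(GAP, "")

def paired_bases(row, pairs):
    """Return the two-letter strings found at the paired columns of a row."""
    return [row[i] + row[j] for i, j in sorted(pairs)]

s1 = "ACGGACUUACGGACUUGACUCGGACU"
s2 = "CGGAACGUAUACGGACUCCAGACUACGUGCA"
p1_star = "----.(.((..(........)..)).)...----"
s1_star = "----ACGGACUUACGGACUUGACUCGGACU----"
s2_star = "CGGAACGUAUACGGACUCCAGACUACG---UGCA"
p2_star = ".....(.((..(........)..)).)---...."

# S*_i restricted to the alphabet must equal S_i.
rows_ok = project(s1_star) == s1 and project(s2_star) == s2
# Both rows share the same paired columns (a consensus structure).
same_structure = parse_pairs(p1_star) == parse_pairs(p2_star)
# Every paired column holds an admissible base pair in both sequences.
canonical_ok = all(
    bp in CANONICAL
    for row in (s1_star, s2_star)
    for bp in paired_bases(row, parse_pairs(p1_star))
)
print(rows_ok, same_structure, canonical_ok)
```

Dot-bracket notation can only express nested structures, so a successful parse also confirms that the output structure is non-crossing.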

  6. Incomplete history of SA&F
     • 1985 Sankoff. Computationally heavy, no implementation
     • 1997 Foldalign (Gorodkin et al.), only stems, simpler energy
     • 2002 Dynalign (Mathews, Turner), first "full" implementation
     • 2004 PMcomp (Hofacker et al.), clever simplification
     • 2007 FoldalignM (Torarinsson et al.), PMcomp implementation
     • 2007 LocARNA (Will et al.), PMcomp-based, more time and space efficient, optionally local
     • 2008 RAF (Do et al.), PMcomp-based, sequence-sparsity, machine learning
     • 2011 LocARNA-P (Will et al.), efficient partition function

  7. PMcomp: A Realistic Nussinov-style Sankoff-Algorithm
     Idea:
     • Simplify energy model of SA&F: loop-based (Zuker-style) ⇒ base-pair-based (Nussinov-style)
     • Advantage?
     • Problem?
     • Add realistic energy scoring again: McCaskill pair probabilities

  8. PMcomp: Nussinov-style Sankoff: Recursion
     \[
     M_{i\,j;\,k\,l} = \max \begin{cases}
       M_{i\,j-1;\,k\,l-1} + \sigma(A_j, B_l) \\
       M_{i\,j-1;\,k\,l} + \gamma \\
       M_{i\,j;\,k\,l-1} + \gamma \\
       \max_{j'\,l'} \; M_{i\,j'-1;\,k\,l'-1} + D_{j'\,j;\,l'\,l}
     \end{cases}
     \]
     \[
     D_{i\,j;\,k\,l} = M_{i+1\,j-1;\,k+1\,l-1} + \tau(i, j, k, l)
     \]
     (Figure: $A_{i..j}$ aligned to $B_{k..l}$, with base pair $(j', j)$ of $A$ matched to base pair $(l', l)$ of $B$.)
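
The recursion on this slide can be sketched as a memoized function. This is a minimal illustration, not the PMcomp implementation. It makes several assumptions: indices are 0-based and inclusive, an empty subsequence is encoded as `j == i - 1`, empty-versus-nonempty ranges are initialized with linear gap cost, and `tau` returns `None` for pairs that may not be matched. The helpers `match_score`, `no_pairs` and `toy_tau` are hypothetical toy scoring functions.

```python
from functools import lru_cache

CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def pmcomp_score(A, B, sigma, gamma, tau):
    """Best score of the Nussinov-style Sankoff recursion for A and B.

    sigma(a, b): score for aligning two bases; gamma: score per gap column;
    tau(i, j, k, l): score for matching pair (i, j) of A with (k, l) of B,
    or None if that match is not allowed.
    """
    @lru_cache(maxsize=None)
    def M(i, j, k, l):
        if j < i:                       # A[i..j] empty: gap out B[k..l]
            return gamma * (l - k + 1)
        if l < k:                       # B[k..l] empty: gap out A[i..j]
            return gamma * (j - i + 1)
        best = max(
            M(i, j - 1, k, l - 1) + sigma(A[j], B[l]),   # align A_j with B_l
            M(i, j - 1, k, l) + gamma,                   # A_j against a gap
            M(i, j, k, l - 1) + gamma,                   # B_l against a gap
        )
        for jp in range(i, j):          # j': left end of a pair (j', j) in A
            for lp in range(k, l):      # l': left end of a pair (l', l) in B
                d = D(jp, j, lp, l)
                if d is not None:
                    best = max(best, M(i, jp - 1, k, lp - 1) + d)
        return best

    def D(i, j, k, l):
        t = tau(i, j, k, l)
        if t is None:
            return None
        return M(i + 1, j - 1, k + 1, l - 1) + t

    return M(0, len(A) - 1, 0, len(B) - 1)

def match_score(a, b):
    """Toy base score: +1 for identical bases, -1 otherwise."""
    return 1 if a == b else -1

def no_pairs(i, j, k, l):
    """Toy tau that forbids all pair matches (plain sequence alignment)."""
    return None

def toy_tau(A, B, bonus=5, min_span=3):
    """Toy tau: fixed bonus when both pairs are canonical and span enough bases."""
    def tau(i, j, k, l):
        ok = (j - i >= min_span and l - k >= min_span
              and A[i] + A[j] in CANONICAL and B[k] + B[l] in CANONICAL)
        return bonus if ok else None
    return tau

plain = pmcomp_score("ACGU", "ACU", match_score, -2, no_pairs)
hairpin = pmcomp_score("GAAAC", "GAAAC", match_score, -2,
                       toy_tau("GAAAC", "GAAAC"))
print(plain, hairpin)
```

In the hairpin example, matching the G-C pair of both sequences contributes the toy bonus of 5 and the three inner A-A columns contribute 3. The paired bases themselves receive no `sigma` score because `D` only adds `tau`.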

  10. PMcomp: Scoring
      The base-pair match score $\tau$ enters the recursion via $D_{i\,j;\,k\,l} = M_{i+1\,j-1;\,k+1\,l-1} + \tau(i, j, k, l)$.
      Idea:
      • $\tau(i, j, k, l) = \Psi^A_{ij} + \Psi^B_{kl}$
      • $\Psi^A_{ij}$, $\Psi^B_{kl}$: log odds scores for base-pairs
      • "McCaskill"-basepair probabilities vs. background
      Hofacker et al. Alignment of RNA base pairing probability matrices. Bioinformatics, 2004.
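
To illustrate the idea of a log odds score, here is a minimal sketch. It assumes the plain form log(p / p0) with a hypothetical background probability `p0`; the published PMcomp score may normalize or clip this value differently. The dictionaries `probs_a` and `probs_b` hold made-up pair probabilities, not McCaskill output.

```python
import math

def pair_log_odds(p, p0):
    """Log odds of a base-pair probability p against a background p0."""
    return math.log(p / p0)

def tau_score(probs_a, probs_b, a, b, p0=0.05):
    """tau for matching pair a of sequence A with pair b of sequence B.

    probs_a, probs_b: dicts mapping (left, right) to a pair probability.
    p0 is an assumed background probability, not a value from the slides.
    """
    return pair_log_odds(probs_a[a], p0) + pair_log_odds(probs_b[b], p0)

# Hypothetical pair probabilities for illustration only.
probs_a = {(1, 10): 0.5, (2, 9): 0.01}
probs_b = {(2, 9): 0.5}

strong = tau_score(probs_a, probs_b, (1, 10), (2, 9))
weak = tau_score(probs_a, probs_b, (2, 9), (2, 9))
print(strong, weak)
```

A pair that is more probable than the background gets a positive contribution, a less probable pair a negative one, so matching two well-supported pairs is rewarded.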

  11. Complexity PMcomp
      • $O(n^2 \cdot m^2)$ entries in $M$
      • per entry: $O(nm)$ time (maximization over $j'$, $l'$)
      Total Complexity: $O(n^3 m^3)$ time, $O(n^2 m^2)$ space

  12. LocARNA: Making PMcomp/Sankoff practical
      Ideas:
      • follow PMcomp idea for scoring
      • only consider significant base pairs: "cut-off probability"
      • reformulate recursion
      • profit in time and space complexity
      (Figure: base pairs over sequence positions 1 to 64.)

  13. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.005$ (Figure: base pairs over sequence positions 1 to 64.)

  14. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.01$ (Figure: base pairs over sequence positions 1 to 64.)

  15. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.05$ (Figure: base pairs over sequence positions 1 to 64.)

  16. Effect of Base-Pair Filtering: $p_{\text{cutoff}} = 0.1$ (Figure: base pairs over sequence positions 1 to 64.)
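
The filtering step can be sketched in a few lines. This is an illustration under assumptions: the probabilities are invented, the function name `filter_pairs` is hypothetical, and keeping pairs with probability equal to the cutoff (`>=`) is an arbitrary choice. It shows how the set of considered base pairs shrinks for the four cutoff values above.

```python
def filter_pairs(probs, cutoff):
    """Keep only base pairs whose probability reaches the cutoff."""
    return {pair: p for pair, p in probs.items() if p >= cutoff}

# Hypothetical base-pair probabilities, keyed by (left, right) position.
probs = {
    (1, 20): 0.95,
    (2, 19): 0.6,
    (3, 18): 0.08,
    (5, 15): 0.02,
    (6, 14): 0.007,
    (7, 12): 0.001,
}

# Number of base pairs that survive each cutoff.
kept = {c: len(filter_pairs(probs, c)) for c in (0.005, 0.01, 0.05, 0.1)}
print(kept)
```

Raising the cutoff trades sensitivity for speed: fewer pairs survive, and both the running time and the memory of the alignment depend on how many remain.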

  17. LocARNA Basic Algorithm: Matrices
      (Figure: sequence A, positions 1 to n, with base pairs a1, a2, a3; sequence B, positions 1 to m, with base pairs b1, b2, b3, b4; matrix D indexed by the base pairs of A and B.)

  18. LocARNA Basic Algorithm: Matrices
      (Figure: as on the previous slide, plus matrix M indexed by positions 1 to n of A and 1 to m of B.)

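  20. LocARNA Basic Algorithm: Recursion
      With base pairs $a = (a_l, a_r)$ and $b = (b_l, b_r)$:
      \[
      D(a, b) = M(a, b;\, a_r - 1, b_r - 1) + \tau(a, b)
      \]
      (Figure: the match of base pair $a$ with base pair $b$ decomposes into the alignment of the regions enclosed by $a$ and $b$, plus the pair match itself.)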

  21. LocARNA Basic Algorithm: Recursion
      With $a = (a_l, a_r)$ and $b = (b_l, b_r)$:
      \[
      M(a, b;\, i, j) = \max \begin{cases}
        M(a, b;\, i-1, j-1) + \sigma(A_i, B_j) \\
        M(a, b;\, i, j-1) + \gamma \\
        M(a, b;\, i-1, j) + \gamma \\
        \max_{a', b'} \; M(a, b;\, a'_l - 1, b'_l - 1) + D(a', b') \quad \text{where } a'_r = i,\; b'_r = j
      \end{cases}
      \]
      (Figure: the four cases drawn on the ranges $a_l + 1 .. i$ of A and $b_l + 1 .. j$ of B, below the base pairs $a$ and $b$.)

  22. LocARNA Basic Algorithm: Recursion
      \[
      M^{a\,b}(i, j) = \max \begin{cases}
        M^{a\,b}(i-1, j-1) + \sigma(A_i, B_j) \\
        M^{a\,b}(i-1, j) + \gamma \\
        M^{a\,b}(i, j-1) + \gamma \\
        \max_{a', b'} \; M^{a\,b}(a'_l - 1, b'_l - 1) + D(a', b') \quad \text{where } a'_r = i,\; b'_r = j
      \end{cases}
      \]
      \[
      D(a, b) = M^{a\,b}(a_r - 1, b_r - 1) + \tau(a, b)
      \]
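
The recursion on this slide can be sketched as follows. This is a simplified illustration, not the LocARNA implementation. It fills one table per pair of base pairs `(a, b)` and does not share work between pairs with a common left end. It also assumes that positions are 1-based, that the whole sequences are enclosed by the pseudo pairs `(0, n + 1)` and `(0, m + 1)`, and that `tau(a, b)` returns a number for every combination. The function and helper names are hypothetical.

```python
from functools import lru_cache

def locarna_basic(A, B, P1, P2, sigma, gamma, tau):
    """Score of the best alignment of A and B with matched base pairs.

    P1, P2: sets of 1-based (left, right) base pairs of A and B.
    """
    n, m = len(A), len(B)
    by_right1, by_right2 = {}, {}        # base pairs grouped by right end
    for a in P1:
        by_right1.setdefault(a[1], []).append(a)
    for b in P2:
        by_right2.setdefault(b[1], []).append(b)

    def fill_M(al, ar, bl, br):
        """Fill M^{ab} for A[al+1..ar-1] vs B[bl+1..br-1]; return last entry."""
        M = {}
        for i in range(al, ar):
            for j in range(bl, br):
                if i == al and j == bl:
                    M[i, j] = 0          # both ranges empty
                    continue
                cands = []
                if i > al and j > bl:    # align A_i with B_j
                    cands.append(M[i - 1, j - 1] + sigma(A[i - 1], B[j - 1]))
                if i > al:               # A_i against a gap
                    cands.append(M[i - 1, j] + gamma)
                if j > bl:               # B_j against a gap
                    cands.append(M[i, j - 1] + gamma)
                for a2 in by_right1.get(i, []):      # pairs a' with a'_r = i
                    for b2 in by_right2.get(j, []):  # pairs b' with b'_r = j
                        if a2[0] > al and b2[0] > bl:
                            cands.append(M[a2[0] - 1, b2[0] - 1] + D(a2, b2))
                M[i, j] = max(cands)
        return M[ar - 1, br - 1]

    @lru_cache(maxsize=None)
    def D(a, b):
        return fill_M(a[0], a[1], b[0], b[1]) + tau(a, b)

    return fill_M(0, n + 1, 0, m + 1)

def match_score(x, y):
    """Toy base score: +1 for identical bases, -1 otherwise."""
    return 1 if x == y else -1

def const_tau(a, b):
    """Toy pair match score: a fixed bonus for every pair combination."""
    return 5

same = locarna_basic("GAAAC", "GAAAC", {(1, 5)}, {(1, 5)},
                     match_score, -2, const_tau)
shorter = locarna_basic("GAAAC", "GAAC", {(1, 5)}, {(1, 4)},
                        match_score, -2, const_tau)
plain = locarna_basic("ACGU", "ACU", set(), set(), match_score, -2, const_tau)
print(same, shorter, plain)
```

Only base pairs listed in `P1` and `P2` are ever considered, so the innermost loops run over far fewer candidates than the `j'`, `l'` maximization of the position-indexed recursion.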

  23. Complexity LocARNA
      • compute $D(a, b)$ for all base-pair edges $a \in P_1$, $b \in P_2$ [and $a$, $b$ compatible] $\Rightarrow O(|P_1||P_2|)$
      • combine $D(a, b)$-computation for common $(a_l, b_l)$ $\Rightarrow O(nm)$
      • per $(a_l, b_l)$: $O(nm \cdot \mathrm{rdeg}_1 \mathrm{rdeg}_2)$
      Total Complexity: $O(nm|P_1||P_2|)$ time, $O(|P_1||P_2| + nm)$ space
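
A quick back-of-the-envelope comparison shows what the bounds mean in practice. The numbers below are hypothetical (two sequences of length 100, 200 significant base pairs each after filtering), and constant factors are ignored, so the counts are only orders of magnitude.

```python
def pmcomp_time(n, m):
    """Leading term of the PMcomp time bound, O(n^3 m^3)."""
    return n**3 * m**3

def pmcomp_space(n, m):
    """Leading term of the PMcomp space bound, O(n^2 m^2)."""
    return n**2 * m**2

def locarna_time(n, m, p1, p2):
    """Leading term of the LocARNA time bound, O(n m |P1| |P2|)."""
    return n * m * p1 * p2

def locarna_space(n, m, p1, p2):
    """Leading term of the LocARNA space bound, O(|P1| |P2| + n m)."""
    return p1 * p2 + n * m

# Hypothetical instance: lengths 100, 200 filtered base pairs per sequence.
n = m = 100
p1 = p2 = 200

time_ratio = pmcomp_time(n, m) // locarna_time(n, m, p1, p2)
space_ratio = pmcomp_space(n, m) // locarna_space(n, m, p1, p2)
print(time_ratio, space_ratio)
```

The gain depends entirely on how sparse the filtered base-pair sets are: with all quadratically many potential pairs kept, the LocARNA time bound is no better than the PMcomp one.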

  24. Affine Gap Cost
      • Basic algorithm: linear gap cost
      • Affine gap cost $g(k) = \alpha + \beta \cdot k$: à la Gotoh
      (Figure: matrices M, E and F over positions 1 to n and 1 to m.)

  25. Affine Gap Cost
      \[
      M^{a\,b}(i, j) = \max \begin{cases}
        M^{a\,b}(i-1, j-1) + \sigma(A_i, B_j) \\
        E^{a\,b}_i(j) \\
        F^{a\,b}_{i\,j} \\
        \max_{a', b'} \; M^{a\,b}(a'_l - 1, b'_l - 1) + D(a', b') \quad \text{where } a'_r = i,\; b'_r = j
      \end{cases}
      \]
      \[
      D(a, b) = M^{a\,b}(a_r - 1, b_r - 1) + \tau(a, b)
      \]
      \[
      E^{a\,b}_i(j) = \max \{ E^{a\,b}_{i-1}(j) + \beta,\; M^{a\,b}(i-1, j) + \alpha + \beta \}
      \]
      \[
      F^{a\,b}_{i\,j} = \max \{ F^{a\,b}_{i\,j-1} + \beta,\; M^{a\,b}(i, j-1) + \alpha + \beta \}
      \]
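
The E and F recurrences can be tried out in isolation on plain sequence alignment. The sketch below leaves out the base-pair match case and the per-pair tables, so it is only the Gotoh part of the recursion. It assumes that `alpha` and `beta` are non-positive scores (gap opening and gap extension) and uses full matrices for readability; `match_score` is a hypothetical toy scoring function.

```python
NEG_INF = float("-inf")

def gotoh_score(A, B, sigma, alpha, beta):
    """Global alignment score with affine gap score g(k) = alpha + beta * k."""
    n, m = len(A), len(B)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # A_i is in a gap column
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # B_j is in a gap column
    M[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:   # extend a gap or open a new one after M(i-1, j)
                E[i][j] = max(E[i - 1][j] + beta, M[i - 1][j] + alpha + beta)
            if j > 0:   # extend a gap or open a new one after M(i, j-1)
                F[i][j] = max(F[i][j - 1] + beta, M[i][j - 1] + alpha + beta)
            diag = NEG_INF
            if i > 0 and j > 0:
                diag = M[i - 1][j - 1] + sigma(A[i - 1], B[j - 1])
            M[i][j] = max(diag, E[i][j], F[i][j])
    return M[n][m]

def match_score(x, y):
    """Toy base score: +1 for identical bases, -1 otherwise."""
    return 1 if x == y else -1

# One gap of length 2 costs alpha + 2 * beta = -5; three matches give +3.
affine = gotoh_score("ACGGU", "ACU", match_score, alpha=-3, beta=-1)
# With alpha = 0 the score reduces to linear gap cost.
linear = gotoh_score("ACGU", "ACU", match_score, alpha=0, beta=-2)
print(affine, linear)
```

Because `M` already takes the maximum over `E` and `F`, opening a gap from `M` right after another gap is allowed but never beats extending it, as long as `alpha` is not positive.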
