multiple sequence alignment

Multiple Sequence Alignment, Mirko Riesterer, 10.09.18


  1. Cost Partitioning Techniques for Multiple Sequence Alignment Mirko Riesterer, 10.09.18

  2. Agenda
     1 Introduction
     2 Formal Definition
     3 Solving MSA
     4 Combining Multiple Pattern Databases
     5 Cost Partitioning
     6 Experiments
     7 Conclusion
     Cost Partitioning Techniques for Multiple Sequence Alignment, Mirko Riesterer, 10.09.18, Universität Basel

  3. Introduction. Multiple Sequence Alignment
     − Biological sequences mutate during evolution
     − Insertion, deletion, substitution
     − Some mutations are more likely than others (A ↔ G, C ↔ T)
     − Observe phylogenetic relationships

  4. Introduction. Multiple Sequence Alignment
     − Insert gaps within sequences
     − Maximize correspondence between letters in columns
     Sequences    Alignment
     ACGTG        ACGT-G
     ACTAG        AC-TAG
     CGTAG        -CGTAG

  5. Introduction. Judging the alignment quality
     − Count matches/mismatches
     − Score matrix
     − Point accepted mutation ($\mathrm{PAM}_n$) matrix (Dayhoff et al., 1978)
     − Blocks substitution matrix (BLOSUM) (Henikoff and Henikoff, 1992)
     Score matrix (upper triangle; "-" is the gap symbol):
          A  C  T  G  -
       A  0  4  2  2  3
       C     1  4  3  3
       T        0  6  3
       G           1  3
       -              0

  6. Agenda. 1 Introduction 2 Formal Definition 3 Solving MSA 4 Combining Multiple Pattern Databases 5 Cost Partitioning 6 Experiments 7 Conclusion

  7. Formal Definition
     Family of sequences $S = \{s_1, \dots, s_n\}$ over alphabet $\Sigma$, and $\Sigma' = \Sigma \cup \{-\}$.
     Alignment matrix $A_{n \times m} = (a_{ij})$, where
     − $a_{ij} \in \Sigma'$
     − row $a_i$ without "-" is exactly $s_i$
     − no column contains only "-"
     Score matrix (upper triangle):
          A  C  T  G  -
       A  0  4  2  2  3
       C     1  4  3  3
       T        0  6  3
       G           1  3
       -              0
     Sequences: ACT, CTG
     Alignment $A$:
       A C T -
       - C T G
     $C(A) = 3 + 1 + 0 + 3 = 7$

  8. Formal Definition
     The score matrix can be viewed as a function $\mathrm{sub} : \Sigma' \times \Sigma' \to \mathbb{N}$.
     Given an alignment $A$ and a score matrix $\mathrm{sub}$:
     Pair score: $C_{ij}(A) = \sum_{k=1}^{m} \mathrm{sub}(a_{ik}, a_{jk})$
     Sum-of-pairs score: $C(A) = \sum_{1 \le i < j \le n} C_{ij}(A)$
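The two definitions above can be sketched in a few lines of Python. This is only an illustration: the names `SUB`, `pair_score`, and `sum_of_pairs` are made up here, and the matrix is the example score matrix from the earlier slides, stored symmetrically.

```python
from itertools import combinations

# Example score matrix from the slides (upper triangle), "-" is the gap symbol.
SYMBOLS = "ACTG-"
UPPER = [
    [0, 4, 2, 2, 3],  # A against A, C, T, G, -
    [1, 4, 3, 3],     # C against C, T, G, -
    [0, 6, 3],        # T against T, G, -
    [1, 3],           # G against G, -
    [0],              # - against -
]

# Expand the triangle into a symmetric lookup table sub(a, b).
SUB = {}
for r, row in enumerate(UPPER):
    for offset, cost in enumerate(row):
        a, b = SYMBOLS[r], SYMBOLS[r + offset]
        SUB[a, b] = SUB[b, a] = cost


def pair_score(row_i, row_j):
    """C_ij(A): column-wise sum of sub over two rows of the alignment."""
    return sum(SUB[a, b] for a, b in zip(row_i, row_j))


def sum_of_pairs(alignment):
    """C(A): pair scores summed over all row pairs i < j."""
    return sum(pair_score(x, y) for x, y in combinations(alignment, 2))


print(sum_of_pairs(["ACT-", "-CTG"]))  # 7, as in the example on slide 7
```

Applied to the two-row alignment of ACT and CTG from slide 7, the sketch reproduces the value 3 + 1 + 0 + 3 = 7.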

  9. Formal Definition. Shortest Path Problem
     Directed acyclic graph $G = (V, E)$ with
     $V = \{(x_1, \dots, x_n) \mid x_i \in \{0, \dots, l_i\}\}$
     $E = \bigcup_{e \in \{0,1\}^n} \{(v, v + e) \mid v, v + e \in V,\ e \neq 0\}$
     Figure: 3D edge structure
     Figure: 2D graph alignment (http://www.csbio.unc.edu/mcmillan/Comp555S16/Lecture14.html)
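A minimal sketch of the edge set, assuming vertices are stored as tuples of sequence positions and `lengths` holds the sequence lengths $l_1, \dots, l_n$. The function name `successors` is hypothetical.

```python
from itertools import product


def successors(v, lengths):
    """Yield every v + e with e in {0,1}^n, e != 0, that stays inside the lattice."""
    for e in product((0, 1), repeat=len(v)):
        if not any(e):
            continue  # e = 0 is excluded: it would be a self-loop
        w = tuple(x + step for x, step in zip(v, e))
        if all(x <= limit for x, limit in zip(w, lengths)):
            yield w


# An interior vertex of a 3-dimensional lattice has 2^3 - 1 = 7 successors.
print(len(list(successors((0, 0, 0), (3, 3, 3)))))
```

Every edge increases at least one coordinate and never decreases any, which is why the graph is acyclic.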

  10. Agenda. 1 Introduction 2 Formal Definition 3 Solving MSA 4 Combining Multiple Pattern Databases 5 Cost Partitioning 6 Experiments 7 Conclusion

  11. Solving MSA. Needleman-Wunsch algorithm
     − Dynamic programming approach
     − Generates a zero-based index table with optimal scores
     − For $n$ sequences (dimensions) of length $l$: complexity $O(l^n)$
     Figure 2: Needleman-Wunsch score table using a score matrix
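For two sequences the table can be filled as in the following sketch. It assumes the scores are costs to be minimized, which matches the shortest-path formulation on slide 9, and it reuses the example score matrix from the introduction. The function name `needleman_wunsch_table` is made up for this illustration.

```python
# Example score matrix from the slides (upper triangle), "-" is the gap symbol.
SYMBOLS = "ACTG-"
UPPER = [[0, 4, 2, 2, 3], [1, 4, 3, 3], [0, 6, 3], [1, 3], [0]]
SUB = {}
for r, row in enumerate(UPPER):
    for offset, cost in enumerate(row):
        SUB[SYMBOLS[r], SYMBOLS[r + offset]] = cost
        SUB[SYMBOLS[r + offset], SYMBOLS[r]] = cost


def needleman_wunsch_table(a, b, sub):
    """table[i][j] = minimal cost of aligning the prefixes a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):  # first column: a[:i] against gaps only
        table[i][0] = table[i - 1][0] + sub[a[i - 1], "-"]
    for j in range(1, cols):  # first row: b[:j] against gaps only
        table[0][j] = table[0][j - 1] + sub["-", b[j - 1]]
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j - 1] + sub[a[i - 1], b[j - 1]],  # align both letters
                table[i - 1][j] + sub[a[i - 1], "-"],           # gap in b
                table[i][j - 1] + sub["-", b[j - 1]],           # gap in a
            )
    return table


table = needleman_wunsch_table("ACT", "CTG", SUB)
print(table[-1][-1])  # 7, the cost of the alignment shown on slide 7
```

The table has $(l+1)^2$ cells for two sequences. The same recurrence over $n$ sequences needs $(l+1)^n$ cells, which is where the $O(l^n)$ bound comes from.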

  12. Solving MSA. Pattern databases
     Family of sequences $S = \{s_1, \dots, s_n\}$.
     A pattern is a subset $P \subseteq S$ with $|P| \ge 2$.
     A pattern database (PDB) is the perfect heuristic $h^*$ for the subproblem induced by pattern $P$.

  13. Solving MSA. Heuristic search estimators
     Family of sequences $S = \{s_1, \dots, s_n\}$.
     $h^{\mathrm{pair}}$ (Ikeda and Imai, 1994):
     $h^{\mathrm{pair}}(v) = \sum_{1 \le i < j \le n} h_{ij}(v)$
     − Uses the information of every 2-dimensional PDB
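A rough sketch of how $h^{\mathrm{pair}}$ could be evaluated. It assumes each 2-dimensional PDB is stored as a table of optimal suffix alignment costs, and it uses a hypothetical unit-cost function `sub` (0 for a match, 1 otherwise) instead of the score matrix from the slides to keep the example short. The names `pairwise_pdb` and `h_pair` are illustrative only.

```python
from itertools import combinations


def sub(a, b):
    """Hypothetical unit cost: 0 for equal symbols, 1 otherwise."""
    return 0 if a == b else 1


def pairwise_pdb(a, b):
    """table[x][y] = minimal cost of aligning the suffixes a[x:] and b[y:]."""
    la, lb = len(a), len(b)
    table = [[0] * (lb + 1) for _ in range(la + 1)]
    for x in range(la - 1, -1, -1):  # only a has letters left
        table[x][lb] = table[x + 1][lb] + sub(a[x], "-")
    for y in range(lb - 1, -1, -1):  # only b has letters left
        table[la][y] = table[la][y + 1] + sub("-", b[y])
    for x in range(la - 1, -1, -1):
        for y in range(lb - 1, -1, -1):
            table[x][y] = min(
                table[x + 1][y + 1] + sub(a[x], b[y]),
                table[x + 1][y] + sub(a[x], "-"),
                table[x][y + 1] + sub("-", b[y]),
            )
    return table


def h_pair(v, pdbs):
    """Sum the lookups h_ij(v) over all pairs i < j."""
    return sum(pdb[v[i]][v[j]] for (i, j), pdb in pdbs.items())


sequences = ["ACT", "CTG", "ACG"]
pdbs = {
    (i, j): pairwise_pdb(sequences[i], sequences[j])
    for i, j in combinations(range(len(sequences)), 2)
}
print(h_pair((0, 0, 0), pdbs))  # estimate at the start vertex
```

Each lookup is a lower bound on the cost that the corresponding pair contributes to any full alignment, and every pair is counted exactly once, so the sum does not overestimate the sum-of-pairs cost.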

  14. Solving MSA. Heuristic search estimators
     Family of sequences $S = \{s_1, \dots, s_n\}$.
     $h^{\mathrm{all},k}$ (Kobayashi and Imai, 1998):
     $h^{\mathrm{all},k}(v) = \frac{1}{\binom{n-2}{k-2}} \sum_{1 \le x_1 < \dots < x_k \le n} h_{x_1, \dots, x_k}(v)$
     − Uses the information of every $k$-dimensional PDB (every 3-dimensional PDB for $k = 3$)
     − Every pair of sequences appears $\binom{n-2}{k-2}$ times → normalize
     − If $k = 3$ and lengths ~ 500, each PDB contains on the order of $10^8$ vertices!
     − Branching factor $2^n - 1$
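The normalization factor can be checked by brute force. The sketch below counts in how many $k$-element index sets a fixed pair occurs and compares the count with $\binom{n-2}{k-2}$. The helper name `pair_multiplicity` is made up here.

```python
from itertools import combinations
from math import comb


def pair_multiplicity(n, k):
    """Count the k-subsets of {0, ..., n-1} that contain the fixed pair (0, 1)."""
    return sum(
        1 for subset in combinations(range(n), k) if 0 in subset and 1 in subset
    )


for n, k in [(4, 3), (5, 3), (6, 4)]:
    print(n, k, pair_multiplicity(n, k), comb(n - 2, k - 2))
```

Once the pair is fixed, the remaining $k - 2$ members of the pattern are chosen from the other $n - 2$ sequences, so each pairwise cost is counted $\binom{n-2}{k-2}$ times in the unnormalized sum.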

  15. Solving MSA. Heuristic search estimators
     Family of sequences $S = \{s_1, \dots, s_n\}$.
     $h^{\mathrm{one},k}$ (Kobayashi and Imai, 1998):
     $h^{\mathrm{one},k}(v) = h_{x_1, \dots, x_k}(v) + h_{x_{k+1}, \dots, x_n}(v) + \sum_{i=1}^{k} \sum_{j=k+1}^{n} h_{x_i, x_j}(v)$
     − 1 or 2 higher-dimensional PDBs + remaining 2-dimensional PDBs
     − Avoids normalization by choosing PDBs carefully
     $h^{\mathrm{pair}} \le h^{\mathrm{one},k} \le h^{\mathrm{all},k}$

  16. Agenda. 1 Introduction 2 Formal Definition 3 Solving MSA 4 Combining Multiple Pattern Databases 5 Cost Partitioning 6 Experiments 7 Conclusion

  17. Combining Multiple Pattern Databases. Additivity
     − A pattern collection of $S = \{s_1, \dots, s_n\}$ is a collection $P = \{P_1, \dots, P_m\}$ with $P_i \subseteq S$.
     − $P$ is non-conflicting if no pair of elements of $P$ conflict.
     − Then the sum of PDBs is additive.
     Pattern collection heuristic:
     $h^{P}(v) = \sum_{i=1}^{m} h^{P_i}(v)$
     − Admissible if $P$ is non-conflicting

  18. Combining Multiple Pattern Databases
     − Conflicting pattern collections may violate admissibility
     − Parts may still be useful?
     Canonical PDB heuristic (Haslum et al., 2007):
     $h^{\mathrm{CAN}}(v) = \max_{S \in \mathrm{MNS}} \sum_{P \in S} h^{P}(v)$
     where MNS is the set of maximal non-conflicting subsets of the pattern collection.

  19. Combining Multiple Pattern Databases. Post-hoc optimization (Pommerening et al., 2013)
     − Use linear programming to solve the constrained problem
     − A pattern collection is strictly conflicting if $\left| \bigcap_{i=1}^{m} P_i \right| > 1$
     Let $w_1, \dots, w_m$ be the solution to the linear program that maximizes
     $h^{\mathrm{PHO}}(v) = \sum_{i=1}^{m} w_i \, h^{P_i}(v)$
     s.t. $\sum_{i : P_i \in S'} w_i \le 1$ for all strictly conflicting pattern collections $S' \subseteq P$
     s.t. $0 \le w_i \le 1$ for all $P_i$

  20. Combining Multiple Pattern Databases. Post-hoc optimization (Pommerening et al., 2013)
     $h^{\mathrm{all},k}$ equals $h^{\mathrm{PHO}}$ if we choose the same patterns.
     Score matrix (upper triangle):
          A  C  T  G  -
       A  1  1  1  0  1
       C     1  1  0  1
       T        1  0  1
       G           1  1
       -              1
     Proof sketch:
     Four sequences $S = \{s_1, s_2, s_3, s_4\}$ of length 1
     $s_1 = A,\ s_2 = C,\ s_3 = T,\ s_4 = G$
     $P = \{P_1 = \{s_1, s_2, s_3\},\ P_2 = \{s_1, s_2, s_4\},\ P_3 = \{s_1, s_3, s_4\},\ P_4 = \{s_2, s_3, s_4\}\}$
     $h^{P_1}(s) = 3$
     $h^{P_2}(s) = h^{P_3}(s) = h^{P_4}(s) = 1$
     → $h^{\mathrm{PHO}}(s) = 1 \cdot 3 + 0 \cdot 1 + 0 \cdot 1 + 0 \cdot 1 = 3 = \frac{3 + 1 + 1 + 1}{2} = h^{\mathrm{all},3}$
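The numbers in this example can be reproduced with a short script. The sketch makes two assumptions that are not on the slide. First, each pattern value is computed as the cost of placing the three letters in a single column, which yields the stated values 3, 1, 1, 1. Second, instead of calling an LP solver, the weights are searched on the grid {0, 0.5, 1}, which is enough for this tiny instance but is not a general LP method. All variable names are illustrative.

```python
from itertools import combinations, product
from math import comb

letters = {1: "A", 2: "C", 3: "T", 4: "G"}  # s_1 .. s_4
# Letter-to-letter costs taken from the score matrix on this slide.
cost = {
    frozenset("AC"): 1, frozenset("AT"): 1, frozenset("CT"): 1,
    frozenset("AG"): 0, frozenset("CG"): 0, frozenset("TG"): 0,
}

# All 3-element patterns, in the order P_1 .. P_4 used on the slide.
patterns = [frozenset(p) for p in combinations(letters, 3)]
h = [
    sum(cost[frozenset((letters[i], letters[j]))]
        for i, j in combinations(sorted(p), 2))
    for p in patterns
]

# h^{all,3}: normalized sum over all patterns.
h_all = sum(h) / comb(len(letters) - 2, 3 - 2)

# Strictly conflicting sub-collections: more than one shared sequence.
constraints = [
    idx
    for r in range(2, len(patterns) + 1)
    for idx in combinations(range(len(patterns)), r)
    if len(frozenset.intersection(*(patterns[i] for i in idx))) > 1
]


def feasible(w):
    return all(sum(w[i] for i in idx) <= 1 for idx in constraints)


# Grid search over candidate weights (assumption: sufficient for this instance).
h_pho = max(
    sum(wi * hi for wi, hi in zip(w, h))
    for w in product((0, 0.5, 1), repeat=len(patterns))
    if feasible(w)
)

print(h, h_all, h_pho)
```

Both the weighting (1, 0, 0, 0) from the slide and the uniform weighting (0.5, 0.5, 0.5, 0.5) are feasible and reach the value 3. The uniform weighting is exactly the normalization used by $h^{\mathrm{all},3}$.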

  21. Combining Multiple Pattern Databases
     A factored representation of MSA with operators
     $O = \{ o^{i,j}_{(x,y) \to (x',y')} \mid 1 \le i < j \le n,\ 0 \le x \le l_i,\ 0 \le y \le l_j \}$
     An operator $o^{i,j}_{(x,y) \to (x',y')}$ affects heuristic $h^{P}$ if $s_i, s_j \in P$.
     Example: edge $(3,3,5) \to (4,3,6)$ is factored into 3 operators:
     $\{ o^{1,2}_{(3,3) \to (4,3)},\ o^{1,3}_{(3,5) \to (4,6)},\ o^{2,3}_{(3,5) \to (3,6)} \}$
     − Basic factors for operators in higher dimensions
     − Fewer operators than defining all operators
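The factoring in the example can be written as a projection of an edge onto every pair of dimensions. This is a sketch of that one step only: the tuple layout `((i, j), source, target)` and the name `factor_edge` are assumptions, and how the slides treat projections that do not move is not covered.

```python
from itertools import combinations


def factor_edge(v, w):
    """Project the edge v -> w onto each pair of dimensions (1-based indices)."""
    operators = []
    for i, j in combinations(range(len(v)), 2):
        source = (v[i], v[j])
        target = (w[i], w[j])
        operators.append(((i + 1, j + 1), source, target))
    return operators


for op in factor_edge((3, 3, 5), (4, 3, 6)):
    print(op)
```

For the edge from the slide this yields the three operators on sequence pairs (1,2), (1,3), and (2,3). An $n$-dimensional edge always factors into $\binom{n}{2}$ pairwise operators, and each of them can be shared by many higher-dimensional edges.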

  22. Agenda. 1 Introduction 2 Formal Definition 3 Solving MSA 4 Combining Multiple Pattern Databases 5 Cost Partitioning 6 Experiments 7 Conclusion
