SLIDE 1

Cost Partitioning Techniques for Multiple Sequence Alignment

Mirko Riesterer, 10.09.18

SLIDE 2

Agenda.

Cost Partitioning Techniques for Multiple Sequence Alignment, Mirko Riesterer, 10.09.18 Universität Basel 2

1 Introduction
2 Formal Definition
3 Solving MSA
4 Combining Multiple Pattern Databases
5 Cost Partitioning
6 Experiments
7 Conclusion

SLIDE 3

Introduction

Multiple Sequence Alignment
− Biological sequences mutate during evolution
− Insertion, deletion, substitution
− Some mutations are more likely (A↔G / C↔T)
− Observe phylogenetic relationships

SLIDE 4

Introduction

Multiple Sequence Alignment
− Insert gaps within sequences
− Maximize correspondence between letters in columns

Sequences:
ACGTG
ACTAG
CGTAG

Alignment:
ACGT-G
AC-TAG
-CGTAG

SLIDE 5

Introduction

Judging the alignment quality
− Count matches/mismatches
− Score matrix
− Point accepted mutation ($\mathrm{PAM}_n$) matrix (Dayhoff et al., 1978)
− Blocks substitution matrix (BLOSUM) (Henikoff and Henikoff, 1992)

Score matrix (symmetric, upper triangle shown):

      A  C  T  G  –
  A   0  4  2  2  3
  C      1  4  3  3
  T         0  6  3
  G            1  3

SLIDE 7

Formal Definition

Family of sequences $S = \{s_1, \dots, s_n\}$ over alphabet $\Sigma$, and $\Sigma' = \Sigma \cup \{-\}$

Alignment matrix $A_{n \times m} = (a_{ij})$, where
− $a_{ij} \in \Sigma'$
− Row $a_i$ without $-$ is exactly $s_i$
− No column contains only $-$

Sequences:
ACT
CTG

Alignment $A$:
A C T _
_ C T G

$C^A = 3 + 1 + 0 + 3 = 7$

Score matrix (symmetric, upper triangle shown):

      A  C  T  G  –
  A   0  4  2  2  3
  C      1  4  3  3
  T         0  6  3
  G            1  3

SLIDE 8

Formal Definition

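Score matrix can be viewed as a function $\mathit{sub} : \Sigma' \times \Sigma' \to \mathbb{N}$

Given alignment $A$ and score matrix $\mathit{sub}$:

Pair score
$$C^A_{ij} = \sum_{k=1}^{m} \mathit{sub}(a_{ik}, a_{jk})$$

Sum of pairs score
$$C^A = \sum_{1 \le i < j \le n} C^A_{ij}$$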
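The two scores can be checked on the two-sequence example from the previous slide. The following is a minimal sketch: the dictionary layout and the function names are illustrative choices, and the gap/gap entry is an assumption because the slide matrix does not list it.

```python
from itertools import combinations

# Score matrix from the slides, stored once per unordered pair ("-" is the gap).
SUB = {
    ("A", "A"): 0, ("A", "C"): 4, ("A", "T"): 2, ("A", "G"): 2, ("A", "-"): 3,
    ("C", "C"): 1, ("C", "T"): 4, ("C", "G"): 3, ("C", "-"): 3,
    ("T", "T"): 0, ("T", "G"): 6, ("T", "-"): 3,
    ("G", "G"): 1, ("G", "-"): 3,
    ("-", "-"): 0,  # assumption: this entry is not given on the slide
}

def sub(a, b):
    """Symmetric lookup in the score matrix."""
    return SUB[(a, b)] if (a, b) in SUB else SUB[(b, a)]

def pair_score(row_i, row_j):
    """Pair score: sum of sub over all columns of two alignment rows."""
    return sum(sub(a, b) for a, b in zip(row_i, row_j))

def sum_of_pairs(alignment):
    """Sum of pairs score: pair scores summed over all row pairs i < j."""
    return sum(pair_score(r1, r2) for r1, r2 in combinations(alignment, 2))

score = sum_of_pairs(["ACT-", "-CTG"])
print(score)
```

With only two rows the sum of pairs score equals the single pair score, which reproduces the column costs 3 + 1 + 0 + 3 from the example.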

SLIDE 9

Formal Definition

Shortest Path Problem

Directed acyclic graph $G = (V, E)$
$$V = \{(x_1, \dots, x_n) \mid x_i \in \{0, \dots, l_i\}\}$$
$$E = \bigcup_{e \in \{0,1\}^n} \{(v, v + e) \mid v, v + e \in V,\ e \neq 0\}$$

Figure: 2D graph alignment
Figure: 3D edge structure (http://www.csbio.unc.edu/mcmillan/Comp555S16/Lecture14.html)
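The edge set can be made concrete by enumerating the successors of a vertex. This is a small sketch of the definition above; the function name `successors` and the example lengths are hypothetical.

```python
from itertools import product

def successors(v, lengths):
    """Yield every vertex v + e with e in {0,1}^n, e != 0, that stays inside V."""
    for e in product((0, 1), repeat=len(v)):
        if not any(e):
            continue  # e = 0 is excluded by the definition of E
        w = tuple(x + d for x, d in zip(v, e))
        if all(x <= l for x, l in zip(w, lengths)):
            yield w

lengths = (5, 5, 5)  # three sequences of length 5 (illustrative)
interior = list(successors((0, 0, 0), lengths))
border = list(successors((5, 3, 5), lengths))
print(len(interior), border)
```

Away from the border a vertex has $2^n - 1$ successors (7 for three sequences); at the border only the directions that stay within the sequence lengths remain.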

SLIDE 11

Solving MSA

Needleman-Wunsch algorithm
Dynamic programming approach
Generates zero-based index table with optimal scores
Dimension $n$, lengths $l$: complexity $O(l^n)$

Figure 2: Needleman-Wunsch score table using a score matrix
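For two sequences the table can be filled as sketched below. This assumes that scores are costs to be minimized, as in the shortest-path formulation, and uses the score matrix from the earlier slides; the helper names are illustrative.

```python
# Upper triangle of the slide score matrix; row r starts at column r.
LETTERS = "ACTG-"
UPPER = [
    [0, 4, 2, 2, 3],  # A vs A, C, T, G, gap
    [1, 4, 3, 3],     # C vs C, T, G, gap
    [0, 6, 3],        # T vs T, G, gap
    [1, 3],           # G vs G, gap
]
SUB = {}
for r, row in enumerate(UPPER):
    for offset, cost in enumerate(row):
        a, b = LETTERS[r], LETTERS[r + offset]
        SUB[a, b] = SUB[b, a] = cost

def needleman_wunsch_table(s, t, gap="-"):
    """Return the (len(s)+1) x (len(t)+1) table of optimal prefix alignment costs."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                      # first column: s against gaps only
        table[i][0] = table[i - 1][0] + SUB[s[i - 1], gap]
    for j in range(1, cols):                      # first row: t against gaps only
        table[0][j] = table[0][j - 1] + SUB[gap, t[j - 1]]
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j - 1] + SUB[s[i - 1], t[j - 1]],  # align two letters
                table[i - 1][j] + SUB[s[i - 1], gap],           # gap in t
                table[i][j - 1] + SUB[gap, t[j - 1]],           # gap in s
            )
    return table

table = needleman_wunsch_table("ACT", "CTG")
print(table[-1][-1])
```

The bottom-right entry is the optimal cost for the full sequences; for ACT and CTG it matches the cost 7 of the alignment shown in the formal definition.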

SLIDE 12

Solving MSA

Pattern databases
Family of sequences $S = \{s_1, \dots, s_n\}$.
A pattern is a subset $P \subseteq S$, $|P| \ge 2$.
A pattern database (PDB) is the perfect heuristic $h^*$ for the subproblem induced by pattern $P$.

SLIDE 13

Solving MSA

Heuristic search estimators
Family of sequences $S = \{s_1, \dots, s_n\}$.

$h_{pair}$ (Ikeda and Imai, 1994):
$$h_{pair}(v) = \sum_{1 \le i < j \le n} h_{ij}(v)$$
− Uses the information of every 2-dimensional PDB
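A 2-dimensional PDB can be built with a backward dynamic program over suffixes, and $h_{pair}$ is then a table lookup per pair. The sketch below assumes a toy unit-cost score function (0 for a match, 1 otherwise) rather than the slide matrix; all names are hypothetical.

```python
from itertools import combinations

def unit_cost(a, b):
    """Toy score function (assumption): 0 for equal symbols, 1 otherwise."""
    return 0 if a == b else 1

def pairwise_pdb(s, t, sub, gap="-"):
    """h[x][y] = optimal cost of aligning the suffixes s[x:] and t[y:]."""
    ls, lt = len(s), len(t)
    h = [[0] * (lt + 1) for _ in range(ls + 1)]
    for x in range(ls - 1, -1, -1):               # t exhausted: only gaps remain
        h[x][lt] = h[x + 1][lt] + sub(s[x], gap)
    for y in range(lt - 1, -1, -1):               # s exhausted: only gaps remain
        h[ls][y] = h[ls][y + 1] + sub(gap, t[y])
    for x in range(ls - 1, -1, -1):
        for y in range(lt - 1, -1, -1):
            h[x][y] = min(
                h[x + 1][y + 1] + sub(s[x], t[y]),
                h[x + 1][y] + sub(s[x], gap),
                h[x][y + 1] + sub(gap, t[y]),
            )
    return h

def build_pdbs(seqs, sub):
    """One 2-dimensional PDB per pair of sequences (0-based indices)."""
    return {(i, j): pairwise_pdb(seqs[i], seqs[j], sub)
            for i, j in combinations(range(len(seqs)), 2)}

def h_pair(v, pdbs):
    """Sum of the pairwise PDB entries at the projections of vertex v."""
    return sum(table[v[i]][v[j]] for (i, j), table in pdbs.items())

seqs = ["ACGTG", "ACTAG", "CGTAG"]
pdbs = build_pdbs(seqs, unit_cost)
print(h_pair((0, 0, 0), pdbs))
```

For the three sequences from the introduction each pair has cost 2 under unit costs, so the estimate at the start vertex is 6 and the estimate at the goal vertex is 0.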

SLIDE 14

Solving MSA

Heuristic search estimators
Family of sequences $S = \{s_1, \dots, s_n\}$.

$h_{all,k}$ (Kobayashi and Imai, 1998):
$$h_{all,k}(v) = \frac{1}{\binom{n-2}{k-2}} \sum_{1 \le x_1 < \dots < x_k \le n} h_{x_1,\dots,x_k}(v)$$
− Uses the information of every 3-dimensional PDB
− Every pair of sequences appears $\binom{n-2}{k-2}$ times → normalize
− If $k = 3$, lengths ~ 500, each PDB contains about $10^8$ vertices!
− Branching factor $2^n - 1$
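The normalization and the size estimate can be reproduced with a few lines of arithmetic. This is a sketch only: the PDB values 3, 1, 1, 1 are those of the four-sequence example used later in the talk, and the function name is hypothetical.

```python
from math import comb

def h_all_k(pdb_values, n, k):
    """Sum the values of all k-dimensional PDBs and divide by C(n-2, k-2)."""
    return sum(pdb_values) / comb(n - 2, k - 2)

n, k = 4, 3
num_patterns = comb(n, k)             # number of k-subsets of n sequences
value = h_all_k([3, 1, 1, 1], n, k)   # every pair is counted C(2, 1) = 2 times

vertices = 500 ** 3                   # one 3-dimensional PDB, lengths ~ 500
branching = 2 ** 3 - 1                # successors per vertex in three dimensions
print(num_patterns, value, vertices, branching)
```

The vertex count of 125,000,000 is where the order of magnitude $10^8$ on the slide comes from.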

SLIDE 15

Solving MSA

Heuristic search estimators
Family of sequences $S = \{s_1, \dots, s_n\}$.

$h_{one,k}$ (Kobayashi and Imai, 1998):
$$h_{one,k}(v) = h_{x_1,\dots,x_k}(v) + h_{x_{k+1},\dots,x_n}(v) + \sum_{i=1}^{k} \sum_{j=k+1}^{n} h_{x_i,x_j}(v)$$
− 1 or 2 higher-dimensional PDBs + remaining 2-dimensional PDBs
− Avoids normalization by choosing PDBs carefully

$h_{pair} \le h_{one,k} \le h_{all,k}$

SLIDE 17

Combining Multiple Pattern Databases

Additivity
− A pattern collection of $S = \{s_1, \dots, s_n\}$ is a collection $\mathcal{P} = \{P_1, \dots, P_m\}$, $P_i \subseteq S$.
− $\mathcal{P}$ is non-conflicting if no pair of elements of $\mathcal{P}$ conflict.
− Then the sum of PDBs is additive

Pattern collection heuristic
$$h^{\mathcal{P}}(v) = \sum_{i=1}^{m} h^{P_i}(v)$$
− Admissible if $\mathcal{P}$ is non-conflicting

SLIDE 18

Combining Multiple Pattern Databases

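− Conflicting pattern collections may violate admissibility
− Parts may still be useful?

Canonical PDB heuristic (Haslum et al., 2007)
$$h^{CAN}(v) = \max_{S \in MNS} \sum_{P \in S} h^{P}(v)$$
where $MNS$ denotes the maximal non-conflicting subsets of the pattern collection.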
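The maximization can be sketched by brute force for small collections. The conflict test below is an assumption (two patterns conflict when they share at least two sequences, so that some pair of sequences would be counted twice); patterns are tuples of sequence indices and all names are hypothetical.

```python
from itertools import combinations

def conflict(p, q):
    """Assumed conflict test: the two patterns share at least one pair of sequences."""
    return len(set(p) & set(q)) >= 2

def non_conflicting(subset):
    return all(not conflict(p, q) for p, q in combinations(subset, 2))

def maximal_non_conflicting_subsets(patterns):
    """Enumerate all non-conflicting subsets and keep those not strictly contained in another."""
    candidates = [
        frozenset(c)
        for r in range(1, len(patterns) + 1)
        for c in combinations(patterns, r)
        if non_conflicting(c)
    ]
    return [s for s in candidates if not any(s < other for other in candidates)]

def h_can(pdb_value, patterns):
    """Maximum over maximal non-conflicting subsets of the summed PDB values."""
    return max(sum(pdb_value[p] for p in s)
               for s in maximal_non_conflicting_subsets(patterns))

patterns = [(1, 2), (3, 4), (1, 2, 3)]
values = {(1, 2): 1, (3, 4): 2, (1, 2, 3): 2}   # illustrative PDB values at one vertex
print(h_can(values, patterns))
```

Here {(1, 2), (3, 4)} and {(3, 4), (1, 2, 3)} are the maximal non-conflicting subsets, with sums 3 and 4. The enumeration is exponential in the number of patterns, so it only illustrates the definition.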

SLIDE 19

Combining Multiple Pattern Databases

Post-hoc optimization (Pommerening et al., 2013)
− Use linear programming to solve constrained problem
− Pattern collection is strictly conflicting if $\left|\bigcap_{i=1}^{m} P_i\right| > 1$

Let $w_1, \dots, w_m$ be the solution to the linear program that maximizes
$$h^{PHO}(v) = \sum_{i=1}^{m} w_i \, h^{P_i}(v)$$
s.t. $\sum_{i : P_i \in S'} w_i \le 1$ for all strictly conflicting pattern collections $S' \subseteq \mathcal{P}$
s.t. $0 \le w_i \le 1$ for all $P_i$

SLIDE 20

Combining Multiple Pattern Databases

Post-hoc optimization (Pommerening et al., 2013)
$h_{all,k}$ equals $h^{PHO}$ if we choose the same patterns

Proof sketch:
Four sequences $S = \{s_1, s_2, s_3, s_4\}$ of length 1
$s_1 = A$, $s_2 = C$, $s_3 = T$, $s_4 = G$
$\mathcal{P} = \{P_1 = \{s_1, s_2, s_3\},\ P_2 = \{s_1, s_2, s_4\},\ P_3 = \{s_1, s_3, s_4\},\ P_4 = \{s_2, s_3, s_4\}\}$

$h^{P_1}(s) = 3$
$h^{P_2}(s) = h^{P_3}(s) = h^{P_4}(s) = 1$
$$\to h^{PHO}(s) = 1 \cdot 3 + 0 \cdot 1 + 0 \cdot 1 + 0 \cdot 1 = 3 = \frac{3 + 1 + 1 + 1}{2} = h_{all,3}(s)$$

Score matrix (symmetric, upper triangle shown):

      A  C  T  G  –
  A   1  1  1  0  1
  C      1  1  0  1
  T         1  0  1
  G            1  1
  –               1

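The linear program behind this example can be written out explicitly. The sketch below assumes the definition of strictly conflicting collections from the previous slide: any two of the four patterns share two sequences, while any three or all four share at most one, so only the pairwise constraints appear.

```latex
\begin{align*}
\text{maximize}\quad   & 3 w_1 + w_2 + w_3 + w_4 \\
\text{subject to}\quad & w_i + w_j \le 1 \qquad (1 \le i < j \le 4), \\
                       & 0 \le w_i \le 1 \qquad (1 \le i \le 4). \\[4pt]
\text{Since } w_i \le 1 - w_1 \text{ for } i \ge 2:\quad
                       & 3 w_1 + w_2 + w_3 + w_4 \le 3 w_1 + 3 (1 - w_1) = 3 .
\end{align*}
```

The bound is attained at $w = (1, 0, 0, 0)$, which gives the value 3 used in the proof sketch.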
SLIDE 21

Combining Multiple Pattern Databases

A factored representation of MSA with operators
$$O = \{\, o^{i,j}_{(x,y) \to (x',y')} \mid 1 \le i < j \le n,\ 0 \le x \le l_i,\ 0 \le y \le l_j \,\}$$

An operator $o^{i,j}_{(x,y) \to (x',y')}$ affects heuristic $h^{P}$ if $s_i, s_j \in P$

Example: edge $(3,3,5) \to (4,3,6)$ is factored into 3 operators:
$$\{\, o^{1,2}_{(3,3) \to (4,3)},\ o^{1,3}_{(3,5) \to (4,6)},\ o^{2,3}_{(3,5) \to (3,6)} \,\}$$
− Basic factors for operators in higher dimensions
− Fewer operators than defining all operators
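The factoring of an edge into pairwise operators can be sketched as follows. Operators are represented as plain tuples `(i, j, source, target)` with 1-based sequence indices; this representation, the function names, and the choice to skip pairs in which neither coordinate moves are assumptions made for illustration.

```python
from itertools import combinations

def factor_edge(v, w):
    """Split the edge v -> w into one operator per pair of sequences."""
    operators = []
    for i, j in combinations(range(len(v)), 2):
        source, target = (v[i], v[j]), (w[i], w[j])
        if source != target:  # assumption: no operator if neither coordinate advances
            operators.append((i + 1, j + 1, source, target))
    return operators

def affects(operator, pattern):
    """An operator on sequences i, j affects h^P if both sequences belong to P."""
    i, j, _, _ = operator
    return i in pattern and j in pattern

ops = factor_edge((3, 3, 5), (4, 3, 6))
for op in ops:
    print(op)
```

For the edge from the example this yields the three operators listed above; the first one affects a pattern such as {1, 2, 4}, the other two do not.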

SLIDE 23

Cost Partitioning

Better estimates than using max. among available?
− Patterns only consider parts of the problem
− Combine multiple heuristic values
− Distribute operator costs among them

Formal (Seipp et al., 2017)
Given a pattern collection of size $m$
Cost partitioning is a tuple $C = \langle c_1, \dots, c_m \rangle$ s.t. $\sum_{i=1}^{m} c_i(o) \le c(o)$ for all operators $o$
CP heuristic is $h^{C}(v) := \sum_{i=1}^{m} h^{P_i, c_i}(v)$

SLIDE 24

Cost Partitioning

Greedy zero-one cost partitioning (Haslum et al., 2005; Edelkamp, 2006)
− Assign full costs to at most one PDB
− Multiple PDBs affected? Greedily choose from ordering
− Assign full costs to $c_i(o)$ if $o \in \mathit{aff}(h^{P_i})$ and $o \notin \bigcup_{j=1}^{i-1} \mathit{aff}(h^{P_j})$
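The assignment rule can be sketched directly from the last bullet. Operators are plain strings here and the affected sets are given explicitly; both are illustrative assumptions, not the representation used in the experiments.

```python
def greedy_zero_one(costs, affected_by):
    """costs: operator -> cost; affected_by: ordered list of aff(h^{P_i}) sets.

    Returns one cost function per PDB. Each operator keeps its full cost in the
    first PDB (in the given order) that it affects and gets cost 0 in all others.
    """
    claimed = set()
    partition = []
    for aff in affected_by:
        c_i = {op: (cost if op in aff and op not in claimed else 0)
               for op, cost in costs.items()}
        partition.append(c_i)
        claimed |= aff
    return partition

costs = {"a": 2, "b": 3, "c": 5}
partition = greedy_zero_one(costs, [{"a", "b"}, {"b", "c"}])
print(partition)
```

Operator "b" affects both PDBs but is charged only in the first one, so the result depends on the chosen ordering.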

SLIDE 25

Cost Partitioning

Saturated cost partitioning (Seipp and Helmert, 2014)
− Assign exploitable parts of the costs to components
− Remainder can contribute to other components

Saturated cost function
Assigns the least possible cost without changing outcome
Formal: $\mathit{saturate}(h^{P}, c)$ is the minimal cost function $c' \le c$ with $h^{P,c'}(v) = h^{P,c}(v)$

Saturated cost partitioning $C = \langle c_1, \dots, c_m \rangle$
remaining cost functions $\bar{c}_0, \dots, \bar{c}_m$
$$\bar{c}_0 = c$$
$$c_i = \mathit{saturate}(h^{P_i}, \bar{c}_{i-1})$$
$$\bar{c}_i = \bar{c}_{i-1} - c_i$$
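The recursion can be illustrated on tiny explicit transition systems instead of real MSA PDBs. In this sketch the saturated cost of an operator is taken as the largest drop $h(a) - h(b)$ over its transitions $a \to b$; operators without transitions receive cost 0, which is a simplification of the minimal cost function. The abstractions, operator names, and costs are hypothetical.

```python
import heapq
from math import inf

def goal_distances(states, transitions, goals, cost):
    """Dijkstra on reversed transitions; transitions are (src, op, dst) triples."""
    dist = {s: inf for s in states}
    queue = []
    for g in goals:
        dist[g] = 0
        heapq.heappush(queue, (0, g))
    while queue:
        d, s = heapq.heappop(queue)
        if d > dist[s]:
            continue  # stale queue entry
        for src, op, dst in transitions:
            if dst == s and d + cost[op] < dist[src]:
                dist[src] = d + cost[op]
                heapq.heappush(queue, (dist[src], src))
    return dist

def saturate(transitions, h, cost):
    """Smallest cost per operator that preserves h (0 for unused operators)."""
    saturated = {op: 0 for op in cost}
    seen = set()
    for src, op, dst in transitions:
        if h[src] == inf or h[dst] == inf:
            continue  # transitions touching dead ends impose no requirement
        needed = h[src] - h[dst]
        saturated[op] = needed if op not in seen else max(saturated[op], needed)
        seen.add(op)
    return saturated

def saturated_cost_partitioning(abstractions, cost):
    """Process abstractions in order, passing on the unused part of the costs."""
    remaining = dict(cost)
    heuristics = []
    for states, transitions, goals in abstractions:
        h = goal_distances(states, transitions, goals, remaining)
        c_i = saturate(transitions, h, remaining)
        remaining = {op: remaining[op] - c_i[op] for op in remaining}
        heuristics.append(h)
    return heuristics, remaining

cost = {"o1": 4, "o2": 1, "o3": 1}
first = ([0, 1, 2], [(0, "o1", 2), (0, "o2", 1), (1, "o3", 2)], {2})
second = ([0, 1], [(0, "o1", 1)], {1})
heuristics, remaining = saturated_cost_partitioning([first, second], cost)
print(heuristics[0][0] + heuristics[1][0])
```

In the first abstraction the direct operator o1 costs 4, but the detour via o2 and o3 costs 2, so only 2 of the 4 units are needed to preserve the estimate. The remaining 2 units let the second abstraction contribute another 2, for a combined estimate of 4.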

SLIDE 28

Experiments

− Using MSASolver Java program by Matthew Hatem
− BAliBASE Benchmark Reference Set 1 (Thompson et al., 1999)

SLIDE 31

Conclusion

GZOCP
− No benefit over existing heuristics

PHO
− LP solver in every search step
− Expensively computed PDB may be left unused

Future Work
− Implement other cost partitioning techniques
− Generate PDBs automatically, e.g. like Haslum et al. (2007)
− M&S heuristics (Dräger et al., 2009; Helmert et al., 2014)

SLIDE 32

Thank you for your attention.

SLIDE 33

Bibliography

MO Dayhoff, RM Schwartz, and BC Orcutt. A model of evolutionary change in proteins. In Atlas of protein sequence and structure, volume 5, pages 345–352. National Biomedical Research Foundation Silver Spring, MD, 1978.
Steven Henikoff and Jorja G Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89:10915–10919, 1992.
Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443–453, 1970.
Takahiro Ikeda and Hiroshi Imai. Fast A* algorithms for multiple sequence alignment. Genome Informatics, 5:90–99, 1994.
Hirotada Kobayashi and Hiroshi Imai. Improvement of the A* algorithm for multiple sequence alignment. Genome Informatics, 9:120–130, 1998.
Patrik Haslum, Adi Botea, Malte Helmert, Blai Bonet, Sven Koenig, et al. Domain-independent construction of pattern database heuristics for cost-optimal planning. In AAAI, volume 7, pages 1007–1012, 2007.
Florian Pommerening, Gabriele Röger, and Malte Helmert. Getting the most out of pattern databases for classical planning. In Francesca Rossi, editor, Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), pages 2357–2364, 2013.
Jendrik Seipp, Thomas Keller, and Malte Helmert. A comparison of cost partitioning algorithms for optimal classical planning. In Proceedings of the Twenty-Seventh International Conference on Automated Planning and Scheduling (ICAPS 2017). AAAI Press, 2017.
Patrik Haslum, Blai Bonet, Héctor Geffner, et al. New admissible heuristics for domain-independent planning. In AAAI, volume 5, pages 9–13, 2005.

SLIDE 34

Bibliography

Stefan Edelkamp. Automated creation of pattern database search heuristics. In International Workshop on Model Checking and Artificial Intelligence, pages 35–50. Springer, 2006.
Jendrik Seipp and Malte Helmert. Diverse and additive Cartesian abstraction heuristics. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS 2014). AAAI Press, pages 289–297, 2014.
Julie D. Thompson, Frédéric Plewniak, and Olivier Poch. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics (Oxford, England), 15:87–88, 1999.
Klaus Dräger, Bernd Finkbeiner, and Andreas Podelski. Directed model checking with distance-preserving abstractions. International Journal on Software Tools for Technology Transfer, 11(1):27–37, 2009.
Malte Helmert, Patrik Haslum, Jörg Hoffmann, and Raz Nissim. Merge-and-shrink abstraction: A method for generating lower bounds in factored state spaces. Journal of the ACM (JACM), 61(3):16, 2014.
