CSI5126. Algorithms in bioinformatics

RNA Secondary Structure Search Problem

Marcel Turcotte

School of Electrical Engineering and Computer Science (EECS), University of Ottawa

Version November 20, 2018

Summary

We learnt that RNA evolves so as to preserve base-pairing patterns more than sequence. We discussed the impact on traditional bioinformatics approaches. Finally, we derived a dynamic programming algorithm to solve the inference problem. In this lecture, we will consider the search problem.

General objective

Implement a pattern matching algorithm using context-free grammars, specifically to detect sequences that could fold into a specific structure.

Reading

Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press. Pages 277-297.

Project

Presentations: 20 minutes

  • Tuesday, November 27, 2018
  • Thursday, November 29, 2018
  • Tuesday, December 4, 2018

https://docs.google.com/document/d/1gfcGDWWF4iLxpxLEAaBHDi-aY6Ome_p9D5RE2evLJE0

Summary

  • RNA molecules play important cellular roles
  • Secondary structure is more preserved than sequence
  • Nussinov-Jacobson is an O(n^3) algorithm that maximizes the total number of base pairs
  • MFOLD (by Zuker) is an O(n^3) algorithm that minimizes the free energy
  • The accessible pairs, cycles and order notation are key to understanding the recurrence equations of MFE methods
  • Consensus methods*, based on Sankoff's 1985 algorithm, perform more consistently, but have a high time/space complexity

*Simultaneous alignment and folding

RNA secondary structure

[Figure: drawing of the RNA secondary structure, with positions numbered from 1 to 105]

GCACGACACUAGCAGUCAGUGUCAGACUGCAIACAGCACGACACUAGCAGUCAGUGUCAGACUGCAIACAGCACGACACUAGCAGUCAGUGUCAGACUGCA
(((((...(((((...(((((...(((((.....)))))...))))).....(((((...(((((.....)))))...))))).....)))))...)))))
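The dot-bracket string shown with the sequence encodes each base pair as a matching pair of parentheses and each unpaired base as a dot. The following is a minimal Python sketch of how such a string could be decoded into a list of base pairs. The function name `parse_dot_bracket` and the 0-based positions are illustrative choices and do not come from the lecture material.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs (0-based) encoded by a dot-bracket string."""
    stack = []  # positions of '(' still waiting for their partner
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {j}")
            # The most recently opened position pairs with the current one.
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..))"))  # → [(0, 5), (1, 4)]
```

Because a stack is used, the sketch only handles nested (pseudoknot-free) structures, which is the class of structures considered in this lecture.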

Inference problem: Nussinov-Jacobson

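[Figure: the four cases of the Nussinov-Jacobson recurrence for the subsequence i..j: i and j paired (leaving i+1..j−1), i unpaired (leaving i+1..j), j unpaired (leaving i..j−1), and a bifurcation at k (i..k and k+1..j)]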

Nussinov-Jacobson algorithm

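Initialisation:

$$\gamma(i, i+k) = 0 \quad \text{for } k = 0 \text{ to } 1 \text{ and for } i = 1 \text{ to } n-k.$$

Recurrence:

$$\gamma(i, j) = \max \begin{cases} \gamma(i+1, j-1) + \delta(i, j); \\ \gamma(i+1, j); \\ \gamma(i, j-1); \\ \max_{i<k<(j-1)} \left[ \gamma(i, k) + \gamma(k+1, j) \right]. \end{cases}$$

Matching score:

$$\delta(i, j) = \begin{cases} 1, & \text{if } a_i : a_j \in \{A{:}U,\ U{:}A,\ G{:}C,\ C{:}G\} \cup \{G{:}U,\ U{:}G\}; \\ 0, & \text{otherwise.} \end{cases}$$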
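The recurrence can be turned into code almost line by line. Below is a minimal Python sketch, assuming 0-based indices and assuming that the initial values γ(i, i) and γ(i, i+1) are 0. The names `PAIRS`, `delta` and `nussinov` are illustrative, and only the score (not the traceback) is computed.

```python
# Watson-Crick pairs plus the G:U wobble pairs, as in the matching score delta.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def delta(a, b):
    """Matching score: 1 if the two bases can pair, 0 otherwise."""
    return 1 if (a, b) in PAIRS else 0


def nussinov(seq):
    """Maximum number of nested base pairs according to the recurrence above."""
    n = len(seq)
    if n == 0:
        return 0
    # Initialisation: every cell starts at 0, which covers gamma(i, i) and gamma(i, i+1).
    gamma = [[0] * n for _ in range(n)]
    for d in range(2, n):            # d = j - i, filled by increasing distance
        for i in range(n - d):
            j = i + d
            best = max(
                gamma[i + 1][j - 1] + delta(seq[i], seq[j]),  # i and j paired
                gamma[i + 1][j],                              # i unpaired
                gamma[i][j - 1],                              # j unpaired
            )
            for k in range(i + 1, j - 1):                     # bifurcation, i < k < j-1
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma[0][n - 1]


print(nussinov("GGGAAAUCC"))  # → 3
```

The three nested loops over d, i and k are where the O(n^3) running time comes from.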

[Figure: Nussinov-Jacobson dynamic programming matrix and resulting structure for the example sequence GGGAAAUCC]

Other paradigms

  • Reporting sub-optimal structures (MFOLD, SFOLD)
  • Partition function and McCaskill's calculation of the pairing probabilities P_ij
  • Folding kinetics, identifying riboswitches
  • MFE secondary structure of interacting RNA molecules
  • Partition function for the secondary structure of interacting RNA molecules
  • Identification of non-coding RNAs (ncRNA genes) (EvoFold, RNAz…)

Now what?

A secondary structure was inferred!

  • It can be analyzed in order to propose new experiments, to propose a mechanism of action, or to develop novel therapeutic approaches (a new drug, for instance)
  • It can be used for finding new members of its family (homologues); this requires adapted database searching techniques
  • It can serve as a starting point for predicting the three-dimensional structure

Database search problem

Find all sequences matching a user-specified secondary structure motif, or all the sequences that can be folded into a user-specified structure.

Non-probabilistic approaches

  • The first practical approaches were non-probabilistic
  • A description language allows users to represent structural motifs and to search databases
  • Examples: RNAMOT, RNABOB, PatScan, and RNAMOTIF

parms
    wc += gu;
descr
    h5(minlen=6,maxlen=7)
    ss(len=2)
    h5(minlen=3,maxlen=4)
    ss(minlen=4,maxlen=11)
    h3
    ss(len=1)
    h5(minlen=4,maxlen=5)
    ss(len=7)
    h3
    ss(minlen=4,maxlen=21)
    h5(minlen=4,maxlen=5)
    ss(len=7)
    h3
    h3
    ss(len=4)
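To give a feel for what a descriptor-based search does, here is a minimal Python sketch that scans a sequence for a single hairpin, loosely corresponding to a 5' strand, a single-stranded loop and the matching 3' strand (`h5 ss h3`), with G:U pairs accepted in addition to Watson-Crick pairs (which is what `wc += gu` appears to request). It is a naive illustration written for this text and not how RNAMOT, RNABOB, PatScan, or RNAMOTIF are implemented. The function name `find_hairpins` and its parameters are hypothetical.

```python
# Watson-Crick pairs plus G:U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_hairpins(seq, stem_len, loop_min, loop_max):
    """Yield (start, loop_len) for every exact hairpin match in seq.

    A match is a 5' strand of stem_len bases, a loop of loop_min..loop_max
    bases, and a 3' strand of stem_len bases that pairs with the 5' strand.
    """
    n = len(seq)
    for start in range(n):
        for loop_len in range(loop_min, loop_max + 1):
            end = start + 2 * stem_len + loop_len  # one past the last base
            if end > n:
                break
            five = seq[start:start + stem_len]
            three = seq[end - stem_len:end]
            # Base k of the 5' strand pairs with base k counted from the end of the 3' strand.
            if all((a, b) in PAIRS for a, b in zip(five, reversed(three))):
                yield start, loop_len


hits = list(find_hairpins("AAGGGCUUCGGCCCAA", stem_len=4, loop_min=3, loop_max=5))
print(hits)  # → [(2, 4)]
```

The match is all-or-nothing: a single non-pairing position in the stem removes the hit.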

RNAMOT

Gautheret D., Major F. & Cedergren R. (1990) Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA. Comp. Appl. Biosci. 6, 325-331.

Laferriere A., Gautheret D. & Cedergren R. (1994) An RNA pattern matching program with enhanced performances and portability. Comp. Appl. Biosci. 10, 209-210.

rna.igmors.u-psud.fr/gautheret/download

RNABOB

RNABOB is an implementation of D. Gautheret’s RNAMOT, but with a different underlying algorithm using a non-deterministic finite state machine with node rewriting rules. (Computer scientists would probably cringe in horror. It works, and it’s fast, but is it street legal in a computer science department? Who knows.) If you’re looking for an RNA motif that fits a hard consensus pattern (a la PROSITE patterns, but with base-pairing), you might check out RNABOB.

http://eddylab.org/software.html

RNAMOTIF

Macke et al. (2001) Nuc. Acids. Res. 29(22):4724-4735.

  • Sophisticated scripting language
  • Matches can be ranked using a user-defined scoring function
  • Minimum free energy can be used in the definition of the scoring function

casegroup.rutgers.edu/casegr-sh-2.5.html

Discussion

What are the main limitations?

  • These computer programs are practical and can be applied to large data sets
  • A hard consensus pattern means hit-or-miss
  • The major difficulty arises from the subjectivity in deriving the best descriptor for a family of sequences
  • It can be quite difficult to design a pattern with both high sensitivity and high specificity

Discussion

How can one move away from “hard” patterns?

Edit-distance

  • G. Myers. Approximately matching context-free languages. Information Processing Letters, vol. 54 (2), pp. 85-92, 1995.
  • O(P^5 N^8 8^P), where P is the size of the grammar and N is the length of the string.

k-mismatches

  • N. El-Mabrouk, M. Raffinot, J.E. Duchesne, M. Lajoie and N. Luc. Approximate Matching of Secondary Structures. Journal of Bioinformatics and Computational Biology, Vol. 3, No. 2, pp. 317-342, 2005.
  • O(krpn), where k is the error threshold, n is the string size, p is the secondary structure size, and r is the number of “union” symbols.

Probabilistic, a principled approach
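As a small illustration of what moving away from a hard pattern means in the k-mismatches setting, the Python sketch below counts the non-pairing positions between the two strands of a candidate helix and accepts the helix when that count does not exceed a threshold k. This only shows the acceptance criterion. It is not the algorithm of either cited paper and says nothing about their complexity. The names `helix_mismatches` and `matches_with_k_mismatches` are hypothetical.

```python
# Watson-Crick pairs plus G:U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def helix_mismatches(five, three):
    """Count positions where the 5' strand fails to pair with the 3' strand read in reverse."""
    if len(five) != len(three):
        raise ValueError("both strands must have the same length")
    return sum((a, b) not in PAIRS for a, b in zip(five, reversed(three)))


def matches_with_k_mismatches(five, three, k):
    """Accept the helix if at most k positions do not pair."""
    return helix_mismatches(five, three) <= k


print(helix_mismatches("GGGC", "GCCC"))              # → 0
print(helix_mismatches("GGAC", "GCCC"))              # → 1
print(matches_with_k_mismatches("GGAC", "GCCC", 1))  # → True
```

With k = 0 the test reduces to the hit-or-miss behaviour of a hard pattern.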

Transformational grammars

  • Pioneered by Noam Chomsky in the ’50s to model natural languages
  • Formal grammars make it possible to determine whether novel sentences are grammatical or not
  • Transformational grammars are sometimes called generative grammars
  • We look at non-probabilistic grammars first!

Chomsky hierarchy of transformational grammars

[Figure: nested classes of grammars, from innermost to outermost: regular, context-free, context-sensitive, unrestricted]

Increasing order of expressivity, but also increasing order of computational resources.

Each class of languages has its associated machine that serves for parsing (accepting, deciding, recognizing) sentences of this language.

Transformational grammars: definitions

  • Constituted of symbols and rewriting rules (also called production rules) having the following form: α → β
  • Two types of symbols: terminal symbols and non-terminal symbols
  • The left-hand side of a rule contains at least one non-terminal symbol, which is rewritten into the right-hand side of the rule
  • Terminal symbols represent instances of the language, here nucleotides, and will be represented by lower-case letters
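These definitions can be made concrete with a toy grammar. The Python sketch below stores production rules in a dictionary and applies them one at a time to the leftmost non-terminal. Following the convention stated above, terminals are lower-case nucleotides and the non-terminal is an upper-case letter. The grammar itself (a stem closed by the fixed loop `gaaa`) and the function `derive` are made up for illustration and are not taken from the lecture.

```python
# Production rules of a toy grammar: the non-terminal S is rewritten either
# into a base pair wrapped around S, or into a fixed four-base loop.
RULES = {
    "S": ["aSu", "uSa", "gSc", "cSg", "gaaa"],
}


def derive(start, choices):
    """Apply the chosen rules in order to the leftmost non-terminal.

    Returns the list of sentential forms, from the start symbol to the result.
    """
    form = start
    steps = [form]
    for choice in choices:
        # Locate the leftmost non-terminal (upper-case letter).
        pos = next((p for p, sym in enumerate(form) if sym.isupper()), None)
        if pos is None:
            raise ValueError("no non-terminal left to rewrite")
        form = form[:pos] + RULES[form[pos]][choice] + form[pos + 1:]
        steps.append(form)
    return steps


print(" => ".join(derive("S", [2, 0, 4])))
# → S => gSc => gaSuc => gagaaauc
```

The derivation stops once the form contains only terminals, here the sequence `gagaaauc`, whose first two bases pair with its last two.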

slide-53
SLIDE 53

Transformational grammars: definitions

A small example, a grammar denoted by G:

S → gS1 | cS2
S1 → cS2 | ϵ
S2 → gS1 | ϵ

A derivation is the successive application of the rules, starting with S (the start nonterminal):

S ⇒ cS2 ⇒ cgS1 ⇒ cgcS2 ⇒ cgcgS1 ⇒ cgcg

The language generated by G, denoted L(G), is the set of all strings that can be derived from S, {w | S ⇒* w}. A string is accepted by the grammar if there exists a derivation of the string from S.
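The definitions of derivation and acceptance can be tried out on G directly. The sketch below is only an illustration, not part of the lecture material: the names `RULES`, `generate` and `accepts` are hypothetical, and the empty string ϵ is encoded as `""` with no pending nonterminal.

```python
from collections import deque

# Grammar G from the slide. Each rule is (terminal emitted, next nonterminal);
# ("", None) encodes the rule W → ϵ, which ends the derivation.
RULES = {
    "S":  [("g", "S1"), ("c", "S2")],
    "S1": [("c", "S2"), ("", None)],
    "S2": [("g", "S1"), ("", None)],
}

def generate(max_len):
    """Return all strings of L(G) with at most max_len symbols, shortest first."""
    found = set()
    queue = deque([("", "S")])  # (terminals emitted so far, pending nonterminal)
    while queue:
        prefix, nonterminal = queue.popleft()
        for terminal, next_nonterminal in RULES[nonterminal]:
            word = prefix + terminal
            if next_nonterminal is None:
                found.add(word)                  # ϵ-rule applied: a complete string
            elif len(word) <= max_len:
                queue.append((word, next_nonterminal))
    return sorted(found, key=lambda w: (len(w), w))

def accepts(word):
    """True if there exists a derivation of word from S."""
    pending = {"S"}
    for symbol in word:
        pending = {nxt for w in pending
                   for terminal, nxt in RULES[w]
                   if terminal == symbol and nxt is not None}
    # The derivation can only stop on a nonterminal that has an ϵ-rule.
    return any(("", None) in RULES[w] for w in pending)

print(generate(4))
print(accepts("cgcg"), accepts("cc"))
```

Under these assumptions, `generate` enumerates the strings of alternating g and c, and `accepts` replays the derivation one terminal at a time.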

slide-57
SLIDE 57

A derivation can be visualized as a parse tree. Terminals are leaves and non-terminals are internal nodes. What was the input string? Can you enumerate some of the productions of the grammar?

[Figure: parse tree, shown here as an indented outline. Each nonterminal is followed by its children; a terminal leaf is written after a colon.]

S0
  S1: u
  S2
    S3
      S5: a
      S6
        S7
          S9: g
          S10
            S11
              S13: a
              S14
                S15: g
                S16
                  S17: a
                  S18: g
            S12: c
        S8: u
    S4: a

slide-61
SLIDE 61

Transformational grammars

A small example:

S → gS1 | cS2
S1 → cS2 | ϵ
S2 → gS1 | ϵ

Give examples of sentences accepted (generated) by the grammar. Which class of grammar is this?

slide-72
SLIDE 72

Chomsky hierarchy of transformational grammars

Grammar type        Decided by                 Productions
Regular             finite state automata      W → aW, W → a
Context-free        push-down automata         W → γ
Context-sensitive   linear bounded automata    αWβ → αγβ
Unrestricted        Turing machines            α → β
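The production forms in the table can be turned into a small classifier. This is a rough sketch under stated assumptions: the names `production_type`, `grammar_type` and `ORDER` are hypothetical, symbols are given as tuples, γ is required to be non-empty in the context-sensitive form, and W → ϵ is treated as regular, as in the lecture's small example grammar (the table itself does not list that form).

```python
ORDER = ["regular", "context-free", "context-sensitive", "unrestricted"]

def production_type(lhs, rhs, nonterminals):
    """Most restrictive class whose production form matches lhs → rhs.

    lhs and rhs are tuples of symbols; rhs == () stands for ϵ.
    """
    if len(lhs) == 1 and lhs[0] in nonterminals:
        if len(rhs) == 0:
            return "regular"            # assumption: W → ϵ counted as regular
        if len(rhs) == 1 and rhs[0] not in nonterminals:
            return "regular"            # W → a
        if len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals:
            return "regular"            # W → aW
        return "context-free"           # W → γ
    # Context-sensitive: αWβ → αγβ, i.e. one nonterminal W is rewritten
    # while its left context α and right context β are copied unchanged.
    for p, symbol in enumerate(lhs):
        if symbol not in nonterminals:
            continue
        alpha, beta = lhs[:p], lhs[p + 1:]
        gamma_len = len(rhs) - len(alpha) - len(beta)
        if (gamma_len >= 1
                and rhs[:len(alpha)] == alpha
                and rhs[len(rhs) - len(beta):] == beta):
            return "context-sensitive"
    return "unrestricted"               # α → β

def grammar_type(productions, nonterminals):
    """Class of a grammar: the least restrictive class among its productions."""
    kinds = [production_type(lhs, rhs, nonterminals) for lhs, rhs in productions]
    return max(kinds, key=ORDER.index)

small_example = [
    (("S",), ("g", "S1")), (("S",), ("c", "S2")),
    (("S1",), ("c", "S2")), (("S1",), ()),
    (("S2",), ("g", "S1")), (("S2",), ()),
]
print(grammar_type(small_example, {"S", "S1", "S2"}))
```

The classifier only looks at the shape of each rule; it says nothing about whether a simpler grammar could generate the same language.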

slide-75
SLIDE 75

Prosite

N-glycosylation site: n-{p}-[st]-{p}

S0 → nS1
S1 → aS2 | cS2 | ... | yS2   (every amino acid except p)
S2 → sS3 | tS3
S3 → a | c | ... | y   (every amino acid except p)

What type of grammar is that?

www.expasy.ch/prosite
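Because this pattern uses only fixed positions and character classes, it can be matched with an ordinary regular expression. The sketch below is an illustration only: `prosite_to_regex` and `find_sites` are hypothetical helpers that handle just the three element kinds appearing on the slide (a single residue, `[..]` for "any of", `{..}` for "none of"), and the test sequence is made up.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        if element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")   # {p}: anything but p
        elif element.startswith("[") and element.endswith("]"):
            parts.append(element)                       # [st]: s or t
        else:
            parts.append(re.escape(element))            # a single residue
    return "".join(parts)

def find_sites(sequence, pattern="n-{p}-[st]-{p}"):
    """Start positions (0-based) of all matches, overlapping ones included."""
    # The lookahead (?=...) keeps matches zero-width so that overlaps are reported.
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]

print(prosite_to_regex("n-{p}-[st]-{p}"))
print(find_sites("aanvtapnpsa"))
```

Note that `[^p]` accepts any character other than p, not only the amino acid letters, so the input is assumed to be a clean lower-case protein sequence.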

slide-79
SLIDE 79

RNA secondary structure

Write a grammar whose language consists of all the sequences folding into either of the following two stem-loop structures.

[Figure: two stem-loop structures, each with a stem of three N-N’ base pairs closed by a four-nucleotide loop; the two loops correspond to agag and gaga in the grammar below.]

S → aAu | cAg | gAc | uAa
A → aBu | cBg | gBc | uBa
B → aCu | cCg | gCc | uCa
C → agag | gaga

What type of grammar is that?
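Each of the rules for S, A and B wraps one base pair around what remains, and C supplies the loop, so membership can be checked from the outside in. The following is a minimal sketch of that idea, not the lecture's algorithm; `PAIRS`, `LOOPS`, `STEM`, `derives` and `search` are hypothetical names, and the test sequence is invented.

```python
PAIRS = {("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")}  # pairs emitted by S, A and B
LOOPS = {"agag", "gaga"}                                   # right-hand sides of C
STEM = 3                                                   # one pair each for S, A, B

def derives(sequence):
    """True if S ⇒* sequence for the stem-loop grammar."""
    if len(sequence) != 2 * STEM + 4:
        return False
    for k in range(STEM):
        # Rule k pairs the k-th symbol from the left with the k-th from the right.
        if (sequence[k], sequence[len(sequence) - 1 - k]) not in PAIRS:
            return False
    return sequence[STEM:len(sequence) - STEM] in LOOPS

def search(sequence):
    """Start positions (0-based) of all windows generated by the grammar."""
    width = 2 * STEM + 4
    return [i for i in range(len(sequence) - width + 1)
            if derives(sequence[i:i + width])]

print(derives("uagagagcua"))
print(search("ccuagagagcuagg"))
```

The nested dependency between the first and last symbols is what a regular grammar of the earlier kind cannot express in general, which is the point of the question on the slide.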

slide-85
SLIDE 85

Cocke-Younger-Kasami (CYK) algorithm

CYK is a widely used algorithm for parsing context-free grammars (CFG). The CFG must first be transformed into Chomsky normal form (CNF). All the productions must be of the form:

A → BC (exactly two nonterminals) or A → a (exactly one terminal)
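The two allowed forms are easy to check mechanically before running CYK. The sketch below assumes a hypothetical representation in which each production is a pair `(lhs, rhs)` with `rhs` a tuple of symbols; `is_cnf` is an illustrative name, not a library function.

```python
def is_cnf(productions, nonterminals):
    """True if every production is A → BC or A → a."""
    for lhs, rhs in productions:
        if lhs not in nonterminals:
            return False
        two_nonterminals = len(rhs) == 2 and all(x in nonterminals for x in rhs)
        one_terminal = len(rhs) == 1 and rhs[0] not in nonterminals
        if not (two_nonterminals or one_terminal):
            return False
    return True

nonterminals = {"S", "S1", "S2", "S4", "T"}

# S → g T c mixes terminals and a nonterminal on the right-hand side.
original = [("S", ("g", "T", "c"))]

# An equivalent set of productions in CNF (the rules for T are left out).
converted = [
    ("S", ("S1", "S2")),
    ("S1", ("g",)),
    ("S2", ("T", "S4")),
    ("S4", ("c",)),
]

print(is_cnf(original, nonterminals), is_cnf(converted, nonterminals))
```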

slide-90
SLIDE 90

Cocke-Younger-Kasami (CYK) algorithm

The production

S → g T c

is rewritten in CNF as

S → S1 S2
S1 → g
S2 → T S4
S4 → c

slide-99
SLIDE 99

Cocke-Younger-Kasami (CYK) algorithm

Write a CFG in CNF for the following stem-loop structure.

[Figure: stem-loop structure with a four-nucleotide loop (G, A, A, G as drawn) above the base pairs G-C, A-U and U-A.]

S → S1 S2, S1 → u
S2 → S3 S4, S4 → a
S3 → S5 S6, S5 → a
S6 → S7 S8, S8 → u
S7 → S9 S10, S9 → g
S10 → S11 S12, S12 → c
S11 → S13 S14, S13 → a
S14 → S15 S16, S15 → g
S16 → S17 S18, S17 → a, S18 → g
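In this grammar every nonterminal has exactly one production, so the language contains a single string. A short sketch, with the hypothetical names `RULES` and `expand`, recovers it by expanding the start symbol recursively:

```python
# The CNF grammar of the slide: a tuple is a rule A → BC, a string is a rule A → a.
RULES = {
    "S": ("S1", "S2"),     "S1": "u",
    "S2": ("S3", "S4"),    "S4": "a",
    "S3": ("S5", "S6"),    "S5": "a",
    "S6": ("S7", "S8"),    "S8": "u",
    "S7": ("S9", "S10"),   "S9": "g",
    "S10": ("S11", "S12"), "S12": "c",
    "S11": ("S13", "S14"), "S13": "a",
    "S14": ("S15", "S16"), "S15": "g",
    "S16": ("S17", "S18"), "S17": "a", "S18": "g",
}

def expand(symbol):
    """Return the terminal string derived from symbol."""
    rhs = RULES[symbol]
    if isinstance(rhs, str):
        return rhs                       # A → a
    left, right = rhs
    return expand(left) + expand(right)  # A → BC

print(expand("S"))
```

The result reads the leaves of the parse tree from left to right: the three stem positions, the four loop nucleotides, then the three partners of the stem in reverse order.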

slide-101
SLIDE 101

[Figure: parse tree for the stem-loop grammar, with the start symbol written S0; it is the same tree as the one shown on slide 57.]

slide-112
SLIDE 112

Cocke-Younger-Kasami (CYK) algorithm: idea

For a given grammar G, let W ⇒* α indicate that the string α can be derived from the nonterminal W of G. Also, let s be an input string of length n. Remember that G is in Chomsky normal form!

Let V(i, l) = { W | W ⇒* s[i, i + l − 1] }

For l = 1:
V(i, 1) = { W | W → s[i, i] }

For l > 1:
V(i, l) = { A | A → BC, B ⇒* s[i, i + k − 1], C ⇒* s[i + k, i + l − 1], 1 ≤ k < l }

The string s is generated by G if and only if the start symbol S belongs to V(1, n).
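The recurrence translates almost line for line into code, since B ⇒* s[i, i + k − 1] means B ∈ V(i, k) and C ⇒* s[i + k, i + l − 1] means C ∈ V(i + k, l − k). The sketch below keeps the 1-based indices of the slide. The function names and the small grammar for gⁿcⁿ (S → GC | GX, X → SC, G → g, C → c) are assumptions made for the illustration, not taken from the lecture.

```python
def cyk(s, binary_rules, terminal_rules):
    """Fill V(i, l) for every start i and length l (1-based, as on the slide).

    binary_rules:   list of (A, B, C) for productions A → BC
    terminal_rules: list of (W, a) for productions W → a
    """
    n = len(s)
    V = {}
    for i in range(1, n + 1):                      # l = 1
        V[i, 1] = {W for W, a in terminal_rules if a == s[i - 1]}
    for l in range(2, n + 1):                      # l > 1, shortest substrings first
        for i in range(1, n - l + 2):
            V[i, l] = set()
            for k in range(1, l):                  # split point, 1 ≤ k < l
                for A, B, C in binary_rules:
                    if B in V[i, k] and C in V[i + k, l - k]:
                        V[i, l].add(A)
    return V

def accepts(s, binary_rules, terminal_rules, start="S"):
    """True if the start symbol derives the whole string."""
    return len(s) > 0 and start in cyk(s, binary_rules, terminal_rules)[1, len(s)]

# Hypothetical CNF grammar for g^n c^n, n ≥ 1.
BINARY = [("S", "G", "C"), ("S", "G", "X"), ("X", "S", "C")]
TERMINAL = [("G", "g"), ("C", "c")]

print(accepts("ggcc", BINARY, TERMINAL), accepts("gcgc", BINARY, TERMINAL))
```

The three nested loops over l, i and k give a running time proportional to n³ times the number of productions. For the search problem, the same table also tells which substrings are generated: the start symbol belongs to V(i, l) exactly for those.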

slide-113
SLIDE 113

Cocke-Younger-Kasami (CYK) algorithm: example

S → AB | BC
A → BA | a
B → CC | b
C → AB | a

V(i, l) = { W | W ⇒* s[i, i + l − 1] }

s        b        a        a        b        a
i        1        2        3        4        5
l = 1    B        A, C     A, C     B        A, C
l = 2    S, A     B        S, C     S, A
l = 3    ∅        B        B
l = 4    ∅        S, A, C
l = 5    S, A, C
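The table can be recomputed row by row. In this sketch the productions are indexed by their right-hand side, which is one possible encoding among others; `BINARY`, `UNARY` and `fill` are hypothetical names.

```python
from itertools import product

# Productions of the example, indexed by right-hand side.
BINARY = {
    ("A", "B"): {"S", "C"},   # S → AB, C → AB
    ("B", "C"): {"S"},        # S → BC
    ("B", "A"): {"A"},        # A → BA
    ("C", "C"): {"B"},        # B → CC
}
UNARY = {"a": {"A", "C"}, "b": {"B"}}   # A → a, C → a, B → b

def fill(s):
    """Return the CYK table V(i, l), with 1-based start i and length l."""
    n = len(s)
    V = {(i, 1): set(UNARY.get(s[i - 1], ())) for i in range(1, n + 1)}
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):
                # Every pair (B, C) with B in V(i, k) and C in V(i + k, l − k).
                for pair in product(V[i, k], V[i + k, l - k]):
                    cell |= BINARY.get(pair, set())
            V[i, l] = cell
    return V

s = "baaba"
V = fill(s)
for l in range(1, len(s) + 1):
    row = [", ".join(sorted(V[i, l])) or "∅" for i in range(1, len(s) - l + 2)]
    print(f"l = {l}: " + " | ".join(row))
```

Within a cell the nonterminals are printed in alphabetical order, so a cell written S, A on the slide appears as A, S here. Since S belongs to V(1, 5), the string baaba is generated by the grammar.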


[Figure: parse tree for s = baaba, built up step by step from the CYK table. The root is S → AB, where A derives ba (A → BA) and B derives aba (B → CC, with the first C expanded by C → AB).]


Cocke-Younger-Kasami (CYK) algorithm: algorithm

{ Initialization }
for i = 1 to n do
    V(i, 1) = { A | A → a is a production and s[i] = a }

{ Iteration }
for l = 2 to n do
    for i = 1 to n − l + 1 do
        V(i, l) = ∅
        for k = 1 to l − 1 do
            V(i, l) = V(i, l) ∪ { A | A → BC, B ∈ V(i, k) and C ∈ V(i + k, l − k) }

Given an input of size n and a grammar having m nonterminal symbols, CYK runs in O(mn²) space and O(m²n³) time.
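The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the dictionaries `UNARY` and `BINARY` and the function names are assumptions made here, and the grammar is the toy grammar of the example (S → AB | BC, A → BA | a, B → CC | b, C → AB | a).

```python
from itertools import product

# Toy grammar in Chomsky normal form, taken from the example slide.
UNARY = {"a": {"A", "C"}, "b": {"B"}}                    # A -> a, C -> a, B -> b
BINARY = {("A", "B"): {"S", "C"}, ("B", "C"): {"S"},     # S -> AB | BC, C -> AB
          ("B", "A"): {"A"}, ("C", "C"): {"B"}}          # A -> BA, B -> CC


def cyk_table(s):
    """Return V as a dict: V[i, l] = nonterminals deriving s[i .. i+l-1] (1-based i)."""
    n = len(s)
    V = {}
    # Initialization: substrings of length 1.
    for i in range(1, n + 1):
        V[i, 1] = set(UNARY.get(s[i - 1], ()))
    # Iteration: increasing length l, every start i, every split point k.
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):
                for B, C in product(V[i, k], V[i + k, l - k]):
                    cell |= BINARY.get((B, C), set())
            V[i, l] = cell
    return V


def accepts(s, start="S"):
    """True if the start symbol derives the whole (non-empty) string."""
    return bool(s) and start in cyk_table(s)[1, len(s)]
```

Calling `cyk_table("baaba")` should reproduce the table of the example, with the start symbol S present in the cell for i = 1, l = 5.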


Cocke-Younger-Kasami (CYK) algorithm: remarks

  • An RNA secondary structure (motif) can be represented as a CFG (in CNF)
  • CYK can be used for finding all its occurrences in a database
  • CYK finds exact matches only, so it is still a hit-or-miss algorithm
  • Gene Myers adapted the algorithm for finding approximate matches
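To illustrate how a single CYK pass can report every occurrence of a motif, the sketch below records each cell V(i, l) that contains the start symbol. The motif grammar S → g S c | g a a c (a stem of g-c pairs closing an "aa" loop) and its Chomsky normal form are hypothetical, chosen here only for illustration; the function name `find_occurrences` is likewise an assumption.

```python
# Hypothetical motif grammar (an assumption for illustration): S -> g S c | g a a c,
# rewritten in Chomsky normal form:
#   S -> G T | G U,  T -> S C,  U -> A V,  V -> A C,  G -> g,  C -> c,  A -> a
UNARY = {"g": {"G"}, "c": {"C"}, "a": {"A"}}
BINARY = {("G", "T"): {"S"}, ("G", "U"): {"S"}, ("S", "C"): {"T"},
          ("A", "V"): {"U"}, ("A", "C"): {"V"}}


def find_occurrences(s, start="S"):
    """Return (i, l, substring) for every substring s[i .. i+l-1], l >= 2, derivable from start."""
    n = len(s)
    # Cells of length 1; characters outside the grammar give an empty set.
    V = {(i, 1): set(UNARY.get(s[i - 1], ())) for i in range(1, n + 1)}
    hits = []
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):
                for B in V[i, k]:
                    for C in V[i + k, l - k]:
                        cell |= BINARY.get((B, C), set())
            V[i, l] = cell
            if start in cell:            # exact match of the motif at (i, l)
                hits.append((i, l, s[i - 1:i - 1 + l]))
    return hits


print(find_occurrences("uggaaccu"))  # [(3, 4, 'gaac'), (2, 6, 'ggaacc')]
```

Because matching is exact, a single substitution inside the loop (for example "ggauacc") yields no hit for that region, which is the hit-or-miss behaviour noted above.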


Discussion

ACCU
ACUU
ACCU
GCCC
(..)

AUUU
ACCU
GCUC

AUUU is not accepted. ACCU and GCUC are both accepted, but one is the consensus and the other the exception.


Stochastic (Context-Free) grammars

Because of their discrete nature, it’s difficult to design patterns that 1) are specific enough and 2) yet are general enough to match unseen cases.

Any grammar in the Chomsky hierarchy can be transformed into a probabilistic model.

In practice, because the cost of parsing a string (sequence or database) using context-sensitive and unrestricted grammars is prohibitive, applications are restricted to regular and context-free grammars.


Stochastic grammars

A stochastic context-free grammar (SCFG) for an RNA will have production rules of the following forms:

    S0 → (.25) : g S1 c | (.25) : c S1 g | (.25) : a S1 u | (.25) : u S1 a

to represent base pairs, and

    Si → (.50) : u Sj | (.50) : g Sj

to represent single-stranded regions.
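One way to see what the probabilities mean is to generate sequences from such a grammar: each rule is chosen according to its probability, and the probability of a derivation is the product of the probabilities of the rules used. The sketch below assumes a tiny grammar in which S0 and S1 follow the forms above, while S2 (which ends the single-stranded region) is an assumption added here so that derivations terminate.

```python
import random

# Rules: nonterminal -> list of (probability, right-hand side).
# S0 and S1 follow the forms on the slide; S2 is an assumption added for illustration.
RULES = {
    "S0": [(0.25, ["g", "S1", "c"]), (0.25, ["c", "S1", "g"]),
           (0.25, ["a", "S1", "u"]), (0.25, ["u", "S1", "a"])],
    "S1": [(0.5, ["u", "S2"]), (0.5, ["g", "S2"])],
    "S2": [(0.5, ["u"]), (0.5, ["g"])],
}


def sample(symbol="S0", rng=random):
    """Expand symbol recursively; return (sequence, probability of the derivation)."""
    if symbol not in RULES:          # terminal symbol: emitted as is
        return symbol, 1.0
    weights = [p for p, _ in RULES[symbol]]
    prob, rhs = rng.choices(RULES[symbol], weights=weights, k=1)[0]
    seq = ""
    for sym in rhs:
        sub_seq, sub_prob = sample(sym, rng)
        seq += sub_seq
        prob *= sub_prob
    return seq, prob
```

With this grammar every generated sequence has four letters, its first and last letters form a Watson-Crick pair, and every derivation has probability 0.25 × 0.5 × 0.5 = 0.0625.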


Stochastic grammars: problems

  • Alignment: given a sequence, find the most likely parse
  • Scoring: compute the probability that this SCFG produces that sequence
  • Training: estimate the probabilities of the model


Notation

  • Given an SCFG in Chomsky normal form with M nonterminal symbols, $W = W_1, \dots, W_M$, and $W_1$ the start symbol
  • Let v, y and z denote the indices for the nonterminal symbols $W_v$, $W_y$ and $W_z$
  • Production rules are of the form $W_v \to W_y W_z$ and $W_v \to a$
  • Let the probability parameters be called $t_v(y, z)$ for transitions and $e_v(a)$ for emissions
  • Finally, let i, j and k be the indices for the symbols $x_i$, $x_j$ and $x_k$ in the sequence x of length n


CYK algorithm (alignment)

{ Initialization }
for i = 1 to n, v = 1 to M
    $\gamma(i, 1, v) = e_v(x_i)$

{ Iteration }
for l = 2 to n, i = 1 to n − l + 1, v = 1 to M
    $\gamma(i, l, v) = \max_{y,z} \max_{k=1,\dots,l-1} \{ \gamma(i, k, y)\, \gamma(i+k, l-k, z)\, t_v(y, z) \}$

{ Termination }
    $P(x, \hat{\pi} \mid \theta) = \gamma(1, n, 1)$


CYK algorithm: probabilistic

{ Initialization }
for i = 1 to n, v = 1 to M
    $\gamma(i, 1, v) = \log e_v(x_i)$

{ Iteration }
for l = 2 to n, i = 1 to n − l + 1, v = 1 to M
    $\gamma(i, l, v) = \max_{y,z} \max_{k=1,\dots,l-1} \{ \gamma(i, k, y) + \gamma(i+k, l-k, z) + \log t_v(y, z) \}$

{ Termination }
    $\log P(x, \hat{\pi} \mid \theta) = \gamma(1, n, 1)$
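The log-space recursion can be sketched in Python as below. The grammar is hypothetical: it is the toy grammar of the earlier CYK example with every rule given probability 0.5 (an assumption made here so that the probabilities of each nonterminal sum to 1). The names `EMIT`, `TRANS` and `cyk_log` are illustrative, and nonterminals are identified by name rather than by index, with "S" playing the role of $W_1$.

```python
import math

# Hypothetical SCFG in Chomsky normal form: the toy grammar of the CYK example
# with every rule given probability 0.5 (the probabilities are an assumption).
EMIT = {"A": {"a": 0.5}, "B": {"b": 0.5}, "C": {"a": 0.5}}              # e_v(a)
TRANS = {"S": {("A", "B"): 0.5, ("B", "C"): 0.5}, "A": {("B", "A"): 0.5},
         "B": {("C", "C"): 0.5}, "C": {("A", "B"): 0.5}}               # t_v(y, z)
NONTERMINALS = ["S", "A", "B", "C"]
NEG_INF = float("-inf")


def log_or_neg_inf(p):
    """log p, with log 0 represented as -inf."""
    return math.log(p) if p > 0 else NEG_INF


def cyk_log(x, start="S"):
    """Return log P(x, best parse | grammar) for a non-empty x, or -inf if x cannot be derived."""
    n = len(x)
    gamma = {}
    for i in range(1, n + 1):                       # initialization
        for v in NONTERMINALS:
            gamma[i, 1, v] = log_or_neg_inf(EMIT.get(v, {}).get(x[i - 1], 0.0))
    for l in range(2, n + 1):                       # iteration
        for i in range(1, n - l + 2):
            for v in NONTERMINALS:
                best = NEG_INF
                for (y, z), t in TRANS.get(v, {}).items():
                    for k in range(1, l):
                        score = gamma[i, k, y] + gamma[i + k, l - k, z] + math.log(t)
                        best = max(best, score)
                gamma[i, l, v] = best
    return gamma[1, n, start]                       # termination
```

For this grammar, any parse of a string of length n uses n − 1 binary rules and n unary rules, so the best parse of "baaba" has log-probability 9 · log 0.5. Working in log space avoids the numerical underflow that products of many small probabilities would cause on long sequences.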


Complexity

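Memory: O(L²M)
Time: O(L³M³)

Here L is the length of the sequence (n in the algorithm) and M is the number of nonterminal symbols.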


CYK algorithm: inside (scoring)

{ Initialization }
for i = 1 to n, v = 1 to M
    $\alpha(i, 1, v) = e_v(x_i)$

{ Iteration }
for l = 2 to n, i = 1 to n − l + 1, v = 1 to M
    $\alpha(i, l, v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=1}^{l-1} \alpha(i, k, y)\, \alpha(i+k, l-k, z)\, t_v(y, z)$

{ Termination }
    $P(x \mid \theta) = \alpha(1, n, 1)$
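The inside recursion differs from the CYK alignment recursion only in replacing the maximum by a sum, which the following sketch makes explicit. As a hedged illustration it uses a hypothetical SCFG: the toy grammar of the earlier CYK example with every rule given probability 0.5. The names `EMIT`, `TRANS` and `inside` are assumptions made here.

```python
# Hypothetical SCFG in Chomsky normal form: the toy grammar of the CYK example
# with every rule given probability 0.5 (the probabilities are an assumption).
EMIT = {"A": {"a": 0.5}, "B": {"b": 0.5}, "C": {"a": 0.5}}              # e_v(a)
TRANS = {"S": {("A", "B"): 0.5, ("B", "C"): 0.5}, "A": {("B", "A"): 0.5},
         "B": {("C", "C"): 0.5}, "C": {("A", "B"): 0.5}}               # t_v(y, z)
NONTERMINALS = ["S", "A", "B", "C"]


def inside(x, start="S"):
    """Return P(x | grammar) for a non-empty x, summed over all parses rooted at start."""
    n = len(x)
    alpha = {}
    for i in range(1, n + 1):                       # initialization
        for v in NONTERMINALS:
            alpha[i, 1, v] = EMIT.get(v, {}).get(x[i - 1], 0.0)
    for l in range(2, n + 1):                       # iteration
        for i in range(1, n - l + 2):
            for v in NONTERMINALS:
                total = 0.0
                # Rules absent from TRANS have t_v(y, z) = 0 and contribute nothing.
                for (y, z), t in TRANS.get(v, {}).items():
                    for k in range(1, l):
                        total += alpha[i, k, y] * alpha[i + k, l - k, z] * t
                alpha[i, l, v] = total
    return alpha[1, n, start]                       # termination
```

Under this grammar the string "baaba" has two parses rooted at S (one starting with S → AB, one with S → BC), each of probability 0.5⁹, so `inside("baaba")` should equal 2 × 0.5⁹. For long sequences a practical implementation would work with scaled or log-transformed values; this sketch keeps plain probabilities to mirror the recursion.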


Estimating the probabilities

The transition and emission probabilities are estimated from the user input data (alignment and structure).

In theory:

The inside-outside algorithm, an iterative expectation-maximization (EM) procedure, can be used for parameter re-estimation

In practice:

Parameters are extracted from a user input alignment


Expectation-Maximization (EM)

Iterative algorithm for finding the maximum-likelihood estimates of the parameters.


Estimating the parameters

((((..))))
GGAGAUCUCC
GGGGA-CCCC
UGGGAACCCA
GGGGAUCCCU
GGGGAACCCC

S  → (0.8) : g S1 c | (0.2) : u S1 a
S1 → (1.0) : g S2 c
S2 → (0.8) : g S3 c | (0.2) : a S3 u
S3 → (1.0) : g S4 c
S4 → (1.0) : S5 S6
S5 → (1.0) : a
S6 → (0.4) : a S6 | (0.4) : u S6 | (0.2) : ϵ
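Extracting parameters from an annotated alignment amounts to counting. The sketch below, a minimal illustration with function names chosen here, finds the paired columns from the structure line with a stack and then turns column counts into relative frequencies. The alignment rows are transcribed from the slide; no pseudocounts are used, which is a simplifying assumption.

```python
from collections import Counter

STRUCTURE = "((((..))))"
ALIGNMENT = ["GGAGAUCUCC", "GGGGA-CCCC", "UGGGAACCCA", "GGGGAUCCCU", "GGGGAACCCC"]


def paired_columns(structure):
    """Map each '(' column to its matching ')' column (0-based), using a stack."""
    stack, pairs = [], {}
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs[stack.pop()] = j
    return pairs


def pair_frequencies(alignment, i, j):
    """Relative frequency of each (left, right) base pair observed in columns i and j."""
    counts = Counter((row[i].lower(), row[j].lower()) for row in alignment)
    return {pair: c / len(alignment) for pair, c in counts.items()}


def column_frequencies(alignment, i):
    """Relative frequency of each symbol (including the gap '-') in column i."""
    counts = Counter(row[i].lower() for row in alignment)
    return {sym: c / len(alignment) for sym, c in counts.items()}
```

For instance, `pair_frequencies(ALIGNMENT, 2, 7)` gives 0.8 for the pair (g, c) and 0.2 for (a, u), matching the two S2 rules, and `column_frequencies(ALIGNMENT, 5)` gives 0.4 for a, 0.4 for u and 0.2 for the gap, matching the S6 rules.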


tRNA: a more realistic input

# STOCKHOLM 1.0
#=GF AU Koala
DA0260         GGGCGAAUAGUGUCAGC.GGGAGCACACCAGACUUGCAUCUGGUAG.GGAGGGUUCGAGUCCCUCUUUGUCCACCA
#=GR DA0260 SS (((((((..((((.........)))).(((((......))))).....(((((.......))))))))))))....
DA0261         GGGCGAAUAGUGUCAGC.GGGAGCACACCAGACUUGCAUCUGGUAG.GGAGGGUUCGAGUCCCUCUUUGUCCACCA
#=GR DA0261 SS (((((((..((((.........)))).(((((......))))).....(((((.......))))))))))))....
DA0340         GGGCUCGUAGCUCAGC..GGGAGAGCGCCGCCUUUGCAGGCGGAGGCCGCGGGUUCAAAUCCCGCCGAGUCCA...
#=GR DA0340 SS (((((((..((((.........)))).(((((......))))).....(((((.......))))))))))))....
DA0380         GGGCCCAUAGCUCAGU..GGUAGAGUGCCUCCUUUGCAGGAGGAUGCCCUGGGUUCGAAUCCCAGUGGGUCCA...
#=GR DA0380 SS (((((((..((((.........)))).(((((......))))).....(((((.......))))))))))))....
DA0420         GGGCCCAUAGCUCAGU..GGUAGAGUGCCUCCUUUGCAGGAGGAUGCCCUGGGUUGGAAUCCCAGUGGGUCCA...
#=GR DA0420 SS (((((((..((((.........)))).(((((......))))).....(((((.......))))))))))))....
DA0580         GGGCCCGUAGCUCAGACUGGGAGAGCGCCGCCCUUGCAGGCGGAGGCCCCGGGUUCAAAUCCCGGUGGGUCCA...
#=GR DA0580 SS (((((((..((((.........)))).(((((......))))).....(((((.......))))))))))))....
DA0620         GGGCCCGUAGCUCAGACUGGGAGAGCGCCGCCCUUGCAGGCGGAGGCCCCGGGUUCAAAUCCCGGUGGGUCCA...
#=GR DA0620 SS (((((((..((((.........)))).(((((......))))).....(((((.......))))))))))))....
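An input of this kind can be read with a few lines of code. The sketch below is a simplified reader written for illustration only: it handles sequence lines and per-sequence `#=GR ... SS` structure lines and ignores every other annotation, so it is not a complete Stockholm parser. The record in `EXAMPLE` and the function name `parse_stockholm` are hypothetical.

```python
def parse_stockholm(text):
    """Return (sequences, structures): two dicts keyed by sequence name."""
    sequences, structures = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        fields = line.split()
        if fields[0] == "#=GR" and len(fields) == 4 and fields[2] == "SS":
            # Per-sequence secondary structure annotation.
            structures[fields[1]] = structures.get(fields[1], "") + fields[3]
        elif not fields[0].startswith("#") and len(fields) == 2:
            # Sequence line: name followed by the aligned sequence.
            sequences[fields[0]] = sequences.get(fields[0], "") + fields[1]
    return sequences, structures


# Hypothetical two-sequence record, used only to exercise the reader.
EXAMPLE = """# STOCKHOLM 1.0
#=GF AU Koala
seq1 GGGAAACCC
#=GR seq1 SS (((...)))
seq2 GCGAAACGC
#=GR seq2 SS (((...)))
//
"""

sequences, structures = parse_stockholm(EXAMPLE)
```

A quick sanity check after parsing is that each structure string has the same length as its sequence, since the annotation is column-aligned.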


Stochastic Context-Free Grammars (SCFG)

Sean Eddy, one of the pioneers of the use of SCFGs in bioinformatics, has developed several tools: http://eddylab.org/software.html

  • RSEARCH. Aligns an RNA query to target sequences, using SCFG algorithms to score both secondary structure and primary sequence alignment simultaneously;
  • Infernal. RNA structure analysis using covariance models (new);
  • COVE. RNA structure analysis using covariance models (old).


RSearch

Input: an RNA sequence and its secondary structure
Output: similar RNAs on the basis of both primary sequence and secondary structure

R.J. Klein and S.R. Eddy (2003). RSEARCH: Finding homologs of single structured RNA sequences. BMC Bioinformatics, 4:44 (doi:10.1186/1471-2105-4-44)

SLIDE 179


RSearch

# STOCKHOLM 1.0
#=GS Holley DE tRNA-Ala that Holley sequenced from Yeast genome
Holley         GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATTCCGGACTCGTCCA
#=GR Holley SS (((((.(..((((........)))).(((((.......))))).....(((((.......)))))).))))).
//
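The SS line of the query is a dot-bracket string: matching brackets mark paired positions and dots mark unpaired ones. As a rough illustration of how a program might read it, the Python sketch below recovers the base pairs with a stack. The function name `dot_bracket_pairs` is hypothetical (it is not part of RSEARCH), and the sketch assumes a pseudoknot-free annotation.

```python
def dot_bracket_pairs(structure):
    """Return the sorted list of (i, j) base pairs (0-based) in a dot-bracket string."""
    stack = []   # positions of opening brackets not yet matched
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol in "(<":
            stack.append(j)
        elif symbol in ")>":
            if not stack:
                raise ValueError(f"unmatched closing bracket at position {j}")
            pairs.append((stack.pop(), j))
        # any other symbol (e.g. '.') is treated as unpaired
    if stack:
        raise ValueError(f"unmatched opening bracket at position {stack[-1]}")
    return sorted(pairs)


# Structure line of the query shown above.
query_ss = "(((((.(..((((........)))).(((((.......))))).....(((((.......)))))).)))))."
pairs = dot_bracket_pairs(query_ss)
print(len(pairs), "base pairs; outermost pair:", pairs[0])
```

Each pair (i, j) is a position where a structure-aware search scores the two bases jointly rather than one at a time.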

SLIDE 180


Remarks

  • Uses RIBOSUM substitution matrices (analogous to residue substitution scores such as PAM and BLOSUM, but for base pairs)

  • Reports the statistical significance of all the matches

  • Execution time is O(NM³), where N is the size of the database and M is the length of the input sequence

  • “(…) a typical single search of a metazoan genome may take a few thousand CPU hours.”
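To get a feel for the O(NM³) bound, the Python sketch below compares the dominant term for a few input sizes. It only looks at ratios, so no assumption about machine speed is needed. The function name `relative_cost` and the sizes used are illustrative assumptions, not measurements of RSEARCH.

```python
def relative_cost(n_database, m_query):
    """Dominant term N * M^3 of the running-time bound, in abstract units."""
    return n_database * m_query ** 3


# Hypothetical sizes: a 1 Mb database and a 73 nt query.
base = relative_cost(1_000_000, 73)

# Doubling the database doubles the cost; doubling the query multiplies it by 8.
print(relative_cost(2_000_000, 73) / base)
print(relative_cost(1_000_000, 146) / base)
```

The cubic dependence on M is why long structured queries are so much more expensive to search with than short ones.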

SLIDE 184


INFERNAL

  • INFERNAL 1.1: Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).

  • Rfam 14 (August 2018, 2791 families, hand curated): Kalvari, I. et al. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 46, D335–D342 (2018).

SLIDE 186


INFERNAL/Rfam covariance models

# STOCKHOLM 1.0
#=GC SS_cons <<<<..>>>>
seq1         GGAGAUCUCC
seq2         GGGGAUCCCC
seq3         UGGGAACCCA
seq4         GGGGAUCCCU
seq5         GGGGAACCCC
//

[Figure: covariance model fragment. Labels: S0 S1 S2 u S3 S4 a S5 S6 a S8 u ...]
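A covariance model is built from a structure-annotated alignment like the one above. One ingredient is a tally of which base pairs occur at each pair of columns linked by SS_cons. The Python sketch below computes only these raw counts. The names `paired_columns` and `pair_counts` are hypothetical, and this is not Infernal's actual estimation procedure, which does considerably more than counting.

```python
from collections import Counter

ss_cons = "<<<<..>>>>"
alignment = {
    "seq1": "GGAGAUCUCC",
    "seq2": "GGGGAUCCCC",
    "seq3": "UGGGAACCCA",
    "seq4": "GGGGAUCCCU",
    "seq5": "GGGGAACCCC",
}


def paired_columns(ss):
    """Return the sorted (i, j) column pairs (0-based) linked by '<' and '>'."""
    stack, pairs = [], []
    for j, symbol in enumerate(ss):
        if symbol == "<":
            stack.append(j)
        elif symbol == ">":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


def pair_counts(aln, ss):
    """For each paired column (i, j), count the two-base strings seen in the alignment."""
    return {
        (i, j): Counter(seq[i] + seq[j] for seq in aln.values())
        for i, j in paired_columns(ss)
    }


counts = pair_counts(alignment, ss_cons)
for columns, tally in counts.items():
    print(columns, dict(tally))
```

Columns 2 and 7 pair as G-C in four sequences and as A-U in seq1: the bases change but the pairing is preserved, which is the kind of signal a covariance model captures and a sequence-only profile misses.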

SLIDE 187


Summary

  • Hard consensus patterns are difficult to design

  • SCFGs are powerful but slow (thousands of hours for scanning a bacterial genome)

  • Specialised programs have been developed, each recognising a specific structure; these programs are generally sensitive, specific and (relatively) fast:

      • tRNAscan-SE (by Sean Eddy) detects 99% of the known tRNAs with an error rate of 1 false positive per 15 billion nucleotides

SLIDE 197


Think about it!

Printing these notes is probably not necessary!
