CSI5126. Algorithms in bioinformatics: Suffix Trees (Marcel Turcotte)



SLIDE 1

Outline: Preamble, Repeats, Generalized Suffix Tree, More Repeats, LCA, LCE

CSI5126. Algorithms in bioinformatics

Suffix Trees

Marcel Turcotte

School of Electrical Engineering and Computer Science (EECS) University of Ottawa

Version September 20, 2018

SLIDE 2

Summary

In today's lecture, we explore the fact that suffix trees expose all the internal repeats of an input string. We look at generalized suffix trees. Finally, we see how introducing an additional result, the lowest common ancestor, opens the door to solving problems such as k-mismatch effectively.

General objective

Creating suffix tree based algorithms for solving a variety of problems on strings.

Reading

Dan Gusfield (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press. Chapters 8 (optional), 9.

Wing-Kin Sung (2010) Algorithms in Bioinformatics: A Practical Introduction. Chapman & Hall/CRC. QH 324.2 .S86 2010. Pages 61–63.

See also: http://suffixtree.org

SLIDE 3

Summary

[Figure: suffix tree for the string monotonous$, with root r and leaves numbered 1 to 11 by suffix start position.]

SLIDE 8

Summary

A suffix tree can be built in linear time and space.

Suffix trees were developed to determine if a string P occurs in a text T in time proportional to |P| (after pre-processing, i.e. building the tree). Indeed, P is a substring of T iff P is a prefix of a suffix of T.

To locate P, it suffices to follow the unique path from the root of the tree to a node, explicit or implicit, that corresponds to the end of the pattern. This takes time proportional to the length of the pattern.

Nowadays, suffix tree based algorithms have been developed to solve a large array of problems for which no efficient algorithm was known. This lecture presents some of them.
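The statement "P is a substring of T iff P is a prefix of a suffix of T" can be illustrated with a minimal sketch. It assumes a naive, uncompressed suffix trie (one character per edge, quadratic construction) in place of the linear-time suffix tree; the class name SuffixTrieSearch and its methods are hypothetical and not part of the course library.

```java
import java.util.HashMap;
import java.util.Map;

/** Naive suffix trie (assumption: stands in for a real suffix tree). */
final class SuffixTrieSearch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    /** Inserts every suffix of text, character by character: O(n^2) time and space. */
    SuffixTrieSearch(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int k = start; k < text.length(); k++) {
                node = node.children.computeIfAbsent(text.charAt(k), c -> new Node());
            }
        }
    }

    /** True iff pattern is a prefix of some suffix of the text; at most |pattern| steps. */
    boolean contains(String pattern) {
        Node node = root;
        for (int k = 0; k < pattern.length(); k++) {
            node = node.children.get(pattern.charAt(k));
            if (node == null) return false; // fell off the tree: no suffix starts with pattern
        }
        return true;
    }

    public static void main(String[] args) {
        SuffixTrieSearch index = new SuffixTrieSearch("monotonous");
        System.out.println(index.contains("onot")); // true
        System.out.println(index.contains("tonu")); // false
    }
}
```

The query loop performs one step per pattern character whatever the length of the text, which is the property the slide relies on.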

SLIDE 12

Longest repeated substring

Let (i, j) denote the substring of S starting at i and ending at j, i.e. S[i..j]. A repeat is a pair ((i, j), (i′, j′)) such that i < i′ and S[i..j] = S[i′..j′]. The longest repeated substring is given by a repeat ((i, j), (i′, j′)) whose length, j − i + 1, is maximum. For example, the longest repeated substring of abracadabra is abra.

SLIDE 14

Naïve algorithm

Imagine an algorithm to find the longest repeated substring without using a suffix tree. What is its time complexity?

O(n^4), O(n^3), O(n^2), O(n)?
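One possible naive algorithm, given here only as a point of comparison, tries every pair of start positions and extends the match character by character. The class and method names are hypothetical. With two nested loops over start positions and an extension loop of up to n steps, this particular formulation takes O(n^3) time in the worst case.

```java
/** Brute-force longest repeated substring (a sketch, not the lecture's method). */
final class NaiveLongestRepeat {
    /** Returns a longest substring that starts at two distinct positions (overlaps allowed). */
    static String longestRepeat(String s) {
        int bestStart = 0;
        int bestLength = 0;
        for (int i = 0; i < s.length(); i++) {
            for (int j = i + 1; j < s.length(); j++) {
                // Extend the match between the suffixes starting at i and j.
                int k = 0;
                while (j + k < s.length() && s.charAt(i + k) == s.charAt(j + k)) k++;
                if (k > bestLength) {
                    bestLength = k;
                    bestStart = i;
                }
            }
        }
        return s.substring(bestStart, bestStart + bestLength);
    }

    public static void main(String[] args) {
        System.out.println(longestRepeat("abracadabra")); // abra
    }
}
```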

SLIDE 15

Longest repeated substring

Let C[i, j] be the length of the longest common extension of the suffixes i and j of S, i.e. the length of the longest common prefix of S[i..|S|] and S[j..|S|]. Clearly, the largest C[i, j] value, over all i < j, is the solution to the longest repeat problem.

SLIDE 16

[Figure: table of C[i, j] values for S = mississippi, with only the last column (the base conditions) filled in.]

Base conditions.

C[i, |S|] = 1 if S(i) = S(|S|), 1 ≤ i < |S|
C[i, |S|] = 0 if S(i) ≠ S(|S|), 1 ≤ i < |S|

SLIDE 17

[Figure: the table of C[i, j] values for S = mississippi with more entries filled in; the largest entry is 4.]

General case.

C[i, j] = 0 if S(i) ≠ S(j), 1 ≤ i < j < |S|
C[i, j] = 1 + C[i + 1, j + 1] if S(i) = S(j), 1 ≤ i < j < |S|
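The recurrence above can be sketched as follows. The sketch assumes 0-based indices and adds one extra row and column of zeros to the table, so that the base conditions become a special case of the general rule; the class name LongestRepeatDP is hypothetical.

```java
/** O(n^2) dynamic programming for the longest repeated substring (a sketch). */
final class LongestRepeatDP {
    static String longestRepeat(String s) {
        int n = s.length();
        // c[i][j]: length of the longest common extension of suffixes i and j (0-based, i < j).
        // Row n and column n stay at zero and stand for "past the end of s".
        int[][] c = new int[n + 1][n + 1];
        int bestStart = 0;
        int bestLength = 0;
        for (int i = n - 1; i >= 0; i--) {
            for (int j = n - 1; j > i; j--) {
                if (s.charAt(i) == s.charAt(j)) {
                    c[i][j] = 1 + c[i + 1][j + 1]; // general case of the recurrence
                    if (c[i][j] > bestLength) {
                        bestLength = c[i][j];
                        bestStart = i;
                    }
                } // otherwise c[i][j] keeps its default value 0
            }
        }
        return s.substring(bestStart, bestStart + bestLength);
    }

    public static void main(String[] args) {
        System.out.println(longestRepeat("mississippi")); // issi
    }
}
```

Filling the table from the bottom-right corner guarantees that C[i + 1, j + 1] is available when C[i, j] is computed.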

SLIDE 18

Exercise (easy)

Solve the longest common substring problem using dynamic programming. Problem: given two input strings, S and T, find the longest substrings that are common to both S and T.

SLIDE 19

Suffix tree-based algorithm

[Figure: suffix tree for CATTATTAGGA$, with root r, internal nodes u, v, w, x, y, z and leaves numbered 1 to 12.]

Outline a suffix tree based algorithm for finding repeats. What characterizes a repeat?

SLIDE 20

Definition

Let's define a branching node, sometimes called a fork, as a node having two or more children.

[Figure: a fork with path-label δ and outgoing edges starting with a and b, next to the two occurrences of δ at positions i and j in S.]

The path-label of a node is the concatenation of all the edge labels along the path from the root to the node.

SLIDE 21

Finding the longest repeated substring

It suffices to traverse the tree and find a node that 1) is a fork and 2) has the longest path-label among all forks. Since the tree is built and traversed in linear time, finding the longest repeated substring of a text T takes O(|T|) time.

SLIDE 22

[Figure: suffix tree for the string xabxac, leaves numbered 1 to 6.]

SLIDE 23

[Figure: suffix tree for tamtam$ (positions 0 to 6), with each edge label stored as a (start, length) pair, e.g. 0,3 for tam, 1,2 for am, 2,1 for m, 3,4 for tam$ and 6,1 for $.]

SLIDE 24

public class Annotation implements Info {

    private int pathLength;
    private Info next;

    Annotation(int pathLength) {
        this.pathLength = pathLength;
    }

    public Info getNextInfo() {
        return next;
    }

    public void setNextInfo(Info next) {
        this.next = next;
    }

    public int getPathLength() {
        return pathLength;
    }

    // Annotates every node of the tree with the length of its path-label.
    public static void addPathLength(SuffixTree tree) {
        InternalNode root = (InternalNode) tree.getRoot();
        if (root != null)
            addPathLength(0, (NodeInterface) root.getFirstChild());
    }

    // prefix is the path-label length of the parent of node.
    private static void addPathLength(int prefix, NodeInterface node) {
        if (node == null)
            return;
        int pathLength = prefix + node.getLength();
        node.setInfo(new Annotation(pathLength));
        if (node instanceof InternalNode)
            addPathLength(pathLength, (NodeInterface) ((InternalNode) node).getFirstChild());
        // Siblings hang off the same parent, hence the same prefix.
        addPathLength(prefix, (NodeInterface) node.getRightSybling());
    }
}

SLIDE 25

Longest repeated substring algorithm

1. Build a suffix tree for S, the input string.
2. Traverse the tree top-down, adding path-label information to each node, and record the longest path-label of a fork seen so far.
3. Report the longest path-label recorded.
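The three steps can be sketched as follows. This is an illustration under stated assumptions, not the course implementation: it builds a naive suffix trie (quadratic size) instead of a linear-time suffix tree, it assumes that '$' does not occur in the input, and the names LongestRepeatTrie and visit are hypothetical. In a suffix trie, the nodes with two or more children are exactly the forks of the corresponding suffix tree.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestRepeatTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final int suffixStart; // start of the suffix that first created this node
        Node(int suffixStart) { this.suffixStart = suffixStart; }
    }

    static String longestRepeat(String s) {
        String text = s + "$"; // step 1 (naive): insert all suffixes of s$
        Node root = new Node(0);
        for (int start = 0; start < text.length(); start++) {
            final int suffixStart = start;
            Node node = root;
            for (int k = start; k < text.length(); k++) {
                node = node.children.computeIfAbsent(text.charAt(k), c -> new Node(suffixStart));
            }
        }
        int[] best = {0, 0}; // {longest fork path-label length so far, start of one occurrence}
        visit(root, 0, best); // step 2: top-down traversal carrying the path-label length
        return text.substring(best[1], best[1] + best[0]); // step 3: report
    }

    private static void visit(Node node, int depth, int[] best) {
        if (node.children.size() >= 2 && depth > best[0]) { // a fork deeper than any seen so far
            best[0] = depth;
            best[1] = node.suffixStart;
        }
        for (Node child : node.children.values()) visit(child, depth + 1, best);
    }

    public static void main(String[] args) {
        System.out.println(longestRepeat("CATTATTAGGA")); // ATTA
    }
}
```

The traversal is recursive, so this sketch is only suitable for short strings.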

SLIDE 31

Generalized suffix tree

To find the longest common substring of a set of strings, we need to introduce the concept of generalized suffix tree. A generalized suffix tree represents all the suffixes of a set of strings {S1, S2, ..., SK}.

In the suffix tree for a single sequence, leaves are labeled with the starting position of the suffix within the string. In a generalized suffix tree, the leaves are labeled with a tuple: the first index, in 1..K, indicates the string this suffix belongs to, and the second index indicates the starting position.

Because some of the K strings might have a common suffix, some leaves might contain more than one tuple. Alternatively, a unique terminator can be appended to each string so that a leaf designates a unique suffix.
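The leaf labelling can be illustrated with a small sketch. It assumes a naive generalized suffix trie with a single shared terminator '$' (so that common suffixes end at the same leaf), 1-based tuples as in the slides, and hypothetical names GeneralizedSuffixTrie and leafLabels.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GeneralizedSuffixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<int[]> labels = new ArrayList<>(); // (string index, start position), both 1-based
    }

    private final Node root = new Node();

    GeneralizedSuffixTrie(String... strings) {
        for (int s = 0; s < strings.length; s++) {
            String text = strings[s] + "$"; // assumes '$' does not occur in the input
            for (int start = 0; start < strings[s].length(); start++) {
                Node node = root;
                for (int k = start; k < text.length(); k++) {
                    node = node.children.computeIfAbsent(text.charAt(k), c -> new Node());
                }
                node.labels.add(new int[] {s + 1, start + 1}); // node is the leaf of this suffix
            }
        }
    }

    /** Tuples stored at the leaf of the given suffix, or an empty list if there is no such leaf. */
    List<int[]> leafLabels(String suffix) {
        Node node = root;
        for (char ch : (suffix + "$").toCharArray()) {
            node = node.children.get(ch);
            if (node == null) return Collections.emptyList();
        }
        return node.labels;
    }

    public static void main(String[] args) {
        GeneralizedSuffixTrie trie = new GeneralizedSuffixTrie("axbaxb", "bxbab");
        for (int[] tuple : trie.leafLabels("b")) {
            System.out.println(tuple[0] + "," + tuple[1]); // 1,6 then 2,5
        }
    }
}
```

With the shared terminator, the suffix b of both strings ends at one leaf that carries two tuples; giving each string its own terminator would split it into two leaves.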

SLIDE 32

Generalized suffix tree: an example

[Figure: generalized suffix tree for the two strings below, with leaves labeled by (string, position) tuples 1,1 to 1,6 and 2,1 to 2,5.]

S1 = axbaxb and S2 = bxbab

SLIDE 33

(Generalized) Substring Problem

Definition. A set of strings, or database, is known in advance and fixed. After spending a linear amount of time pre-processing the input database, the algorithm is presented with a collection of query strings and, for each of them, should be able to tell whether it occurs as a substring of one or more strings of the database.

SLIDE 34

Application

DNA identification. The U.S. army sequences a portion of the DNA of each member of its personnel. The sequence is selected so that 1) it is easy to retrieve that exact sequence and 2) it is unique to each individual. In the case of a severe casualty, this particular DNA sequence can be used to uniquely identify a person.

Solution. A generalized suffix tree is built that contains all the input sequences. This takes time proportional to the sum of their lengths. Identifying a person takes time proportional to the length of the sequence identifier. The solution also works if only part of the sequence identifier can be recovered (in extreme cases).
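A minimal sketch of the database query follows. It assumes a naive generalized suffix trie in which every node stores the set of database entries whose suffixes pass through it; a real generalized suffix tree would instead collect the leaves below the node where the query ends. The names SubstringDatabase and query are hypothetical, and entries are numbered from 0.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubstringDatabase {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final Set<Integer> owners = new TreeSet<>(); // entries containing this node's path-label
    }

    private final Node root = new Node();

    /** Pre-processing: insert every suffix of every database entry (quadratic in this sketch). */
    SubstringDatabase(List<String> database) {
        for (int id = 0; id < database.size(); id++) {
            String text = database.get(id);
            root.owners.add(id); // the empty string occurs in every entry
            for (int start = 0; start < text.length(); start++) {
                Node node = root;
                for (int k = start; k < text.length(); k++) {
                    node = node.children.computeIfAbsent(text.charAt(k), c -> new Node());
                    node.owners.add(id);
                }
            }
        }
    }

    /** Indices of the entries containing query as a substring; |query| steps down the trie. */
    Set<Integer> query(String query) {
        Node node = root;
        for (int k = 0; k < query.length(); k++) {
            node = node.children.get(query.charAt(k));
            if (node == null) return Collections.emptySet();
        }
        return Collections.unmodifiableSet(node.owners);
    }

    public static void main(String[] args) {
        SubstringDatabase db = new SubstringDatabase(Arrays.asList("axbaxb", "bxbab"));
        System.out.println(db.query("xba")); // [0, 1]
        System.out.println(db.query("bax")); // [0]
    }
}
```

Because a query may be any substring, a partially recovered identifier can still be matched, as the slide points out.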

SLIDE 35

Longest Common Substring (LCS)

Finding the longest common substring of a set of strings is a recurring problem, and one which has many applications in bioinformatics. In 1970, Donald Knuth conjectured that it would be impossible to find a linear time algorithm to solve this problem.

The longest common substring of S1 = axbaxb and S2 = bxbab is xba.

This problem can be elegantly solved in O(|S1| + |S2|) time using generalized suffix trees. How?

SLIDE 36

Longest Common Substring Algorithm

Let's consider the case of two sequences; the generalization to k strings is trivial.

1. Construct a generalized suffix tree for S1 and S2;
2. In linear time, traverse the tree and label each node with (1), (2) or (1,2), depending on whether the subtree underneath the node contains only leaves from the first string, only leaves from the second string, or a mixture of the two (hint: use a bottom-up traversal);
3. In linear time, find the node such that 1) it is labeled (1,2) and 2) it has the longest path-label.
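The three steps can be sketched as follows, assuming a naive generalized suffix trie rather than a linear-time generalized suffix tree (so the sketch is quadratic, not O(|S1| + |S2|)). Labels are encoded as bits: 1 for (1), 2 for (2) and 3 for (1,2). The node where a suffix ends plays the role of its leaf, and steps 2 and 3 are merged into one bottom-up pass. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestCommonSubstring {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int label; // 1: suffixes of S1 only, 2: suffixes of S2 only, 3: both, i.e. (1,2)
    }

    private final Node root = new Node();
    private String best = "";

    static String of(String s1, String s2) {
        LongestCommonSubstring solver = new LongestCommonSubstring();
        solver.insertSuffixes(s1, 1); // step 1: generalized suffix trie
        solver.insertSuffixes(s2, 2);
        solver.labelBottomUp(solver.root, new StringBuilder()); // steps 2 and 3
        return solver.best;
    }

    private void insertSuffixes(String s, int bit) {
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int k = start; k < s.length(); k++) {
                node = node.children.computeIfAbsent(s.charAt(k), c -> new Node());
            }
            node.label |= bit; // a suffix of this string ends here
        }
    }

    /** Post-order traversal: a node's label is the union of the labels found below it. */
    private int labelBottomUp(Node node, StringBuilder path) {
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            path.append(edge.getKey().charValue());
            node.label |= labelBottomUp(edge.getValue(), path);
            path.setLength(path.length() - 1);
        }
        if (node.label == 3 && path.length() > best.length()) best = path.toString();
        return node.label;
    }

    public static void main(String[] args) {
        System.out.println(of("axbaxb", "bxbab")); // xba
    }
}
```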

SLIDE 37

Longest Common Substring

[Figure: the generalized suffix tree for S1 = axbaxb and S2 = bxbab.]

⇒ The node with path-label xba is the deepest node (longest path-label) that has descendant leaves from both strings.

SLIDE 38

DNA contamination problem

A host organism can be used to store foreign DNA molecules. Clone library: a foreign DNA segment can be inserted in a host organism in a way that makes it easy to retrieve the segment for later use. The host, yeast for example, is selected for its ability to replicate rapidly, and therefore to make an endless number of copies of the original information.

SLIDE 39

DNA contamination problem

It sometimes occurs that the retrieved segments are contaminated with DNA from the host. The DNA contamination problem consists in finding all the substrings that are common to the host, S1, and the segment, S2, and are at least l nucleotides long.

SLIDE 40

DNA contamination problem

Solution: build a generalized suffix tree for S1 and S2. Traverse the tree and annotate all the nodes whose subtree contains leaves from both sequences; this takes a linear amount of time. Traverse the tree again and, for each node annotated with both 1 and 2 whose path-label is at least l characters long, print the string and its locations. This traversal of the tree also takes a linear amount of time, plus the time needed to print the output.
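A sketch of this solution follows, under several assumptions that are not in the slides: a naive generalized suffix trie replaces the suffix tree, every node stores the 0-based start positions of its path-label in both strings (instead of a separate annotation pass), and, to keep the output small, only common substrings that cannot be extended to the right are reported. The names Contamination and shared are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Contamination {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        final List<Integer> hostStarts = new ArrayList<>();    // occurrences in S1 (host)
        final List<Integer> segmentStarts = new ArrayList<>(); // occurrences in S2 (segment)
        boolean isCommon() { return !hostStarts.isEmpty() && !segmentStarts.isEmpty(); }
    }

    /** Maps each reported substring to {start positions in host, start positions in segment}. */
    static Map<String, List<List<Integer>>> shared(String host, String segment, int minLength) {
        Node root = new Node();
        insert(root, host, true);
        insert(root, segment, false);
        Map<String, List<List<Integer>>> out = new TreeMap<>();
        collect(root, new StringBuilder(), minLength, out);
        return out;
    }

    private static void insert(Node root, String s, boolean isHost) {
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int k = start; k < s.length(); k++) {
                node = node.children.computeIfAbsent(s.charAt(k), c -> new Node());
                (isHost ? node.hostStarts : node.segmentStarts).add(start);
            }
        }
    }

    private static void collect(Node node, StringBuilder path, int minLength,
                                Map<String, List<List<Integer>>> out) {
        boolean extended = false;
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            if (!edge.getValue().isCommon()) continue; // prune subtrees seen in one string only
            extended = true;
            path.append(edge.getKey().charValue());
            collect(edge.getValue(), path, minLength, out);
            path.setLength(path.length() - 1);
        }
        if (!extended && path.length() > 0 && path.length() >= minLength) {
            out.put(path.toString(), Arrays.asList(node.hostStarts, node.segmentStarts));
        }
    }

    public static void main(String[] args) {
        System.out.println(shared("axbaxb", "bxbab", 2)); // {ba=[[2], [2]], xba=[[1], [1]]}
    }
}
```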

SLIDE 41

String Repeats

Repetitive sequences (strings) constitute a large fraction of genomes. Transposable elements represent:

  • 35.0–50% of the Homo sapiens genome (human)
  • 50.0% of Zea mays (maize, corn)
  • 15.0% of Drosophila melanogaster (fruit fly)
  • 2.0% of Arabidopsis thaliana (a flowering plant)
  • 1.8% of Caenorhabditis elegans (a nematode, roundworm)
  • 3.1% of Saccharomyces cerevisiae (baker's yeast)

⇒ Certain repeats have been related to diseases, regulation and molecular evolution.

SLIDE 42

Human Genome Organization

Human genome (3 Gb)
  Genes (900 Mb)
    Coding DNA (90 Mb)
    Noncoding DNA (810 Mb)
      Pseudogenes
      Gene fragments
      Introns, leaders and trailers
  Extragenic DNA (2.1 Gb)
    Repetitive DNA (420 Mb)
      Tandem repeats
        Satellite
        Minisatellite
        Microsatellite
      Interspersed
        LTRs
        LINEs
        SINEs
        Transposons
    Unique and low copy number (1.6 Gb)

SLIDE 43

Sequence repeats: classification

  • Satellites: located near the centromeres or telomeres, up to one million bp long.
  • Microsatellite: 2 to 5 bp, 100 copies, found at the end of the eukaryotic chromosomes (telomeres); in humans, hundreds of copies of TTAGGG.
  • Minisatellite: up to 25 bp, 30 to 2,000 copies.

SLIDE 44

Sequence repeats: classification

  • Transposable elements: sequences that have the ability to move from one location of the genome to another. They play an important role in evolution and are classified according to their mechanism of transposition.

SLIDE 45

  • Class I: RNA mediated.
      • Long terminal repeat (LTR): retrotransposons (related to retroviruses).
      • SINEs: short interspersed nuclear elements, 80-300 bp. One particular family is called Alu; in the human genome, there are 1.2 million copies (10%) (other sources say 300,000 copies, i.e. ca. 5% of the genome).
      • LINEs: long interspersed nuclear elements, 6-800 Kbp. One particular family is called LINE1; the human genome contains 593,000 copies (14.6%).
  • Class II: DNA mediated. The human genome has ca. 200,000 copies of elements of this type.

SLIDE 46

  • Class III: has features of class I and class II.
      • MITEs: miniature inverted repeat transposable elements, 400 bp, discovered in flowering plants, frequently associated with regulatory regions of genes.

SLIDE 47

Sequence repeats

>ALU Human ALU interspersed repetitive sequence - a consensus. ggccgggcgcggtggctcacgcctgtaatcccagcactttgggaggccgaggcgggaggatcacttgagc ccaggagttcgagaccagcctgggcaacatagtgaaaccccgtctctacaaaaaatacaaaaattagccg ggcgtggtggcgcgcgcctgtagtcccagctactcgggaggctgaggcaggaggatcgcttgagcccggg aggtcgaggctgcagtgagccgtgatcgcgccactgcactccagcctgggcgacagagcgagaccctgtc tcaaaaaaaa

The Alu element itself consists of repeats of length approx. 40. It is often flanked by a tandem repeat of length 7-10, such that the left and right sequences are complementary palindromes. There are 300,000+ nearly identical, but not identical, copies dispersed throughout the genome.

slide-52
SLIDE 52


Finding all repetitive structures

For a given string of length n, there are Θ(n²) substrings (one of length n, two of length n − 1, three of length n − 2, …, n substrings of length 1). There are therefore Θ(n⁴) possible pairs of substrings: about 8.1 × 10³⁷ in the case of the human genome (n ≈ 3 × 10⁹).

We must carefully define which pairs are interesting; otherwise far too many results will be returned to the user!
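The orders of magnitude above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the genome length of 3 × 10⁹ is the rounded value that reproduces the 8.1 × 10³⁷ figure.

```python
def count_substrings(n: int) -> int:
    """Number of (start, length) substrings of a string of length n:
    1 of length n, 2 of length n - 1, ..., n of length 1."""
    return n * (n + 1) // 2


def count_substring_pairs(n: int) -> int:
    """Number of unordered pairs of distinct (start, length) substrings."""
    m = count_substrings(n)
    return m * (m - 1) // 2


if __name__ == "__main__":
    n = 3 * 10**9  # assumed (rounded) length of the human genome
    print(f"n^4             = {n**4:.2e}")
    print(f"exact pair count = {count_substring_pairs(n):.2e}")
```

The exact number of pairs is smaller than n⁴ by a constant factor only, which is why the Θ(n⁴) estimate is the relevant one.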

slide-53
SLIDE 53


Definition

Definition. A maximal pair (or maximal repeat pair) is a pair of identical substrings α and β that cannot be extended either to the left or to the right without causing a mismatch. In other words, the character to the immediate left of α differs from the one to the immediate left of β and, similarly, the characters immediately following α and β differ.

⇒ A maximal pair is denoted (pα, pβ, n′), where pα and pβ are the starting positions of the two substrings and n′ is their length. The set of all the maximal pairs of S is denoted R(S).

slide-54
SLIDE 54


Maximal pairs

xyzbcdeeebcdxyzbcd

The first and second occurrences of bcd form a maximal pair, (4, 10, 3), and the second and third occurrences form a maximal pair, (10, 16, 3), but occurrences one and three do not.

Do the two occurrences of xyzbcd form a maximal pair? To ensure that suffixes and prefixes can participate in maximal pairs, a terminator is added at both ends.

$xyzbcdeeebcdxyzbcd$

Our definition does not prevent overlapping substrings, and this is fine.
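The example can be verified with a brute-force enumeration. The sketch below is not the suffix tree algorithm; it simply tries every pair of starting positions, and the function name `maximal_pairs` is hypothetical. String boundaries play the role of the terminators: a missing character on the left or on the right is treated as a mismatch.

```python
def maximal_pairs(s: str) -> list[tuple[int, int, int]]:
    """Return all maximal pairs (p_alpha, p_beta, length), 1-based positions.

    Brute force, cubic time in the worst case: for illustration only.
    """
    n = len(s)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-extendable pairs are skipped; position 0 has no left character.
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Extend to the right as far as possible: the result is right-maximal.
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1
            if length > 0:
                pairs.append((i + 1, j + 1, length))
    return pairs


if __name__ == "__main__":
    found = maximal_pairs("xyzbcdeeebcdxyzbcd")
    print((4, 10, 3) in found, (10, 16, 3) in found, (4, 16, 3) in found)
    # prints: True True False
```

Under this convention the two occurrences of xyzbcd are reported as the maximal pair (1, 13, 6).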

slide-63
SLIDE 63


Where to find maximal pairs?

Construct a suffix tree for S. Repeats are found at internal nodes, so let V be the current internal node under consideration. Let α denote its path-label. What next?

Let's take care of the right-hand side. How can you make sure that α cannot be extended on the right? Select pairs of suffixes such that the two elements of the pair come from distinct children of V.

How would you take care of the left-hand side? Keep only the pairs of suffixes i, j such that S[i − 1] ≠ S[j − 1].

There are still many possible pairs of suffixes! Let's consider a more constrained problem.
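The two conditions can be illustrated without building the tree. In the sketch below (hypothetical function `maximal_pairs_at`), the children of V are emulated by grouping the occurrences of α by the character that follows them; a real implementation would read these groups from the subtrees of V. The terminator $ is assumed not to occur in S.

```python
from collections import defaultdict
from itertools import combinations


def maximal_pairs_at(s: str, alpha: str) -> list[tuple[int, int, int]]:
    """Maximal pairs whose common substring is alpha (1-based positions)."""
    padded = "$" + s + "$"  # index p in padded is position p in s
    m = len(alpha)
    children = defaultdict(list)  # next character -> starting positions
    for p in range(1, len(s) - m + 2):
        if padded.startswith(alpha, p):
            children[padded[p + m]].append(p)
    pairs = []
    # Right-hand side: the two suffixes come from distinct children of V.
    for (_, group1), (_, group2) in combinations(children.items(), 2):
        for i in group1:
            for j in group2:
                # Left-hand side: the preceding characters must differ.
                if padded[i - 1] != padded[j - 1]:
                    pairs.append((min(i, j), max(i, j), m))
    return sorted(pairs)


if __name__ == "__main__":
    print(maximal_pairs_at("xyzbcdeeebcdxyzbcd", "bcd"))
    # prints: [(4, 10, 3), (10, 16, 3)]
```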

slide-64
SLIDE 64


Maximal Unique Matches (MUMs)

Algorithms to compare biological sequences (to be presented later) run in quadratic time and space. In the case of complete genomic sequences, this is not feasible. To circumvent this limitation, algorithms have been developed that first find a set of MUMs, which are then used as starting points (anchors) for further processing by conventional sequence alignment techniques.

slide-65
SLIDE 65


Maximal Unique Matches (MUMs)

Given two sequences S1, S2 ∈ A∗ and l > 0, a maximal unique match is a string u such that:

  • |u| ≥ l;
  • u occurs exactly once in S1 and exactly once in S2;
  • for every a ∈ A, neither au nor ua occurs in both S1 and S2.
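The definition translates almost literally into a brute-force check. The sketch below is meant only to make the three conditions concrete; the names `occurrences`, `is_mum` and `brute_force_mums` are hypothetical, and the default alphabet is assumed to be DNA.

```python
def occurrences(text: str, u: str) -> int:
    """Count the (possibly overlapping) occurrences of u in text."""
    return sum(1 for i in range(len(text) - len(u) + 1) if text.startswith(u, i))


def is_mum(u: str, s1: str, s2: str, min_len: int = 1, alphabet: str = "ACGT") -> bool:
    """Check the three conditions of the definition for the string u."""
    if len(u) < min_len:
        return False
    if occurrences(s1, u) != 1 or occurrences(s2, u) != 1:
        return False
    for a in alphabet:
        if (a + u in s1 and a + u in s2) or (u + a in s1 and u + a in s2):
            return False  # u can be extended in both sequences
    return True


def brute_force_mums(s1: str, s2: str, min_len: int = 1, alphabet: str = "ACGT") -> list[str]:
    """Test every substring of s1: far too slow for genomes, fine for toy inputs."""
    candidates = {s1[i:j] for i in range(len(s1)) for j in range(i + 1, len(s1) + 1)}
    return sorted(u for u in candidates if is_mum(u, s1, s2, min_len, alphabet))


if __name__ == "__main__":
    print(brute_force_mums("GATCG", "CTCGT"))  # prints: ['TCG']
```

In the toy example, CG also occurs exactly once in each string, but it is rejected because TCG occurs in both.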

slide-69
SLIDE 69


Maximal Unique Matches (MUMs)

ACAAGTCTTCTATCAGACTCCAGAAAAGTATCAGAGAGCAATGAA
CCACACTGCCTACCAGGTGTATCAGACCCACAAGTCCTTCTTAGA

slide-77
SLIDE 77


Where to look for MUMs?

Construct a generalized suffix tree for S1 and S2.

Repeats and common substrings are found at internal nodes. Look for an internal node that has children in S1 and S2; let's call it V, and let u be its path-label.

Can V have more than 2 children? No, this would mean that u occurs more than once in one or both input strings.

Is it possible that there are internal nodes along one of the paths from V to a leaf? No, again it would mean that u occurs more than once in one or both input strings.

So V has to be an internal node that has exactly 2 children, both of which are leaves.

slide-85
SLIDE 85


Where to look for MUMs?

Is it enough? No.

Is it possible that u is embedded in a longer motif? In other words, that u is not maximal? u certainly cannot be extended to the right. But how about the left? Yes, it is quite possible that u is in fact part of a larger motif, say au.

How can we check for that? The leaves beneath V contain the starting positions i and j of the string u in S1 and S2; therefore, it suffices to compare S1[i − 1] and S2[j − 1].

Time and space complexity?

slide-86
SLIDE 86

[Figure: generalized suffix tree, rooted at r, for S1 = GATCG$ and S2 = CTCGT&. Each leaf is labelled Sk,i, where k identifies the string and i the starting position of the suffix.]

slide-87
SLIDE 87

all_mums(node v)
    if v is a leaf
        return
    if #children is 2
        if left child is a leaf and right child is a leaf
            set u to the path label of v
            if one leaf is from S1 and the other from S2
               and the char to the left of u in S1 differs from the char to the left of u in S2
               and the path is long enough then
                display mum information
        else
            all_mums(left child)
            all_mums(right child)
    else
        for each child of v
            all_mums(child)
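A runnable counterpart to the traversal can be sketched without implementing a suffix tree. The version below swaps the tree for a sorted list of suffixes of S1$S2&: an internal node with exactly two leaf children corresponds to two adjacent suffixes whose common prefix is strictly longer than the one they share with either neighbour. The name `all_mums` is reused for convenience; the quadratic sorting step is an assumption made for brevity, so this sketch does not achieve linear time.

```python
def all_mums(s1: str, s2: str, min_len: int = 1) -> list[tuple[int, int, int]]:
    """Return the MUMs as (position in s1, position in s2, length), 1-based.

    Assumes that '$' and '&' occur in neither s1 nor s2.
    """
    text = s1 + "$" + s2 + "&"
    n1 = len(s1)
    order = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix sort

    def lcp(i: int, j: int) -> int:
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k

    lcps = [lcp(order[k], order[k + 1]) for k in range(len(order) - 1)]
    mums = []
    for k, depth in enumerate(lcps):
        before = lcps[k - 1] if k > 0 else 0
        after = lcps[k + 1] if k + 1 < len(lcps) else 0
        # Exactly two suffixes share this prefix: a node with two leaf children.
        if depth < min_len or depth <= before or depth <= after:
            continue
        p1, p2 = sorted((order[k], order[k + 1]))
        if not (p1 < n1 < p2):
            continue  # both suffixes come from the same string
        if p1 > 0 and text[p1 - 1] == text[p2 - 1]:
            continue  # extendable to the left, hence not maximal
        mums.append((p1 + 1, p2 - n1, depth))
    return sorted(mums)


if __name__ == "__main__":
    print(all_mums("GATCG", "CTCGT"))  # prints: [(3, 2, 3)]
```

For the two strings of the figure, GATCG and CTCGT, the only MUM is TCG, starting at position 3 in S1 and position 2 in S2.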

slide-88
SLIDE 88


Genome Alignment

The subsequent steps of a complete algorithm for the alignment of two genomic sequences involve:

  • Finding the longest sequence of MUMs occurring in the same order in the two sequences.
  • Applying an alignment algorithm (to be presented later) to the pairs of regions in between two consecutive MUMs.
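The first step is an instance of the longest increasing subsequence problem: once the MUMs are sorted by their position in S1, we look for the longest subsequence whose positions in S2 are also increasing. The sketch below assumes that each MUM is given as a tuple (position in S1, position in S2, length) and ignores possible overlaps between consecutive MUMs; the function name `longest_collinear_chain` is hypothetical.

```python
from bisect import bisect_left


def longest_collinear_chain(mums: list[tuple[int, int, int]]) -> list[tuple[int, int, int]]:
    """Longest sequence of MUMs occurring in the same order in both sequences."""
    ordered = sorted(mums)  # by position in S1 (unique, since MUMs are unique)
    tails = []      # tails[k]: index of the last MUM of the best chain of length k + 1
    tail_pos2 = []  # position in S2 of that MUM (kept sorted for bisect)
    prev = [-1] * len(ordered)
    for idx, (_, pos2, _) in enumerate(ordered):
        k = bisect_left(tail_pos2, pos2)
        if k == len(tails):
            tails.append(idx)
            tail_pos2.append(pos2)
        else:
            tails[k] = idx
            tail_pos2[k] = pos2
        prev[idx] = tails[k - 1] if k > 0 else -1
    # Follow the back-pointers from the end of the longest chain.
    chain = []
    idx = tails[-1] if tails else -1
    while idx != -1:
        chain.append(ordered[idx])
        idx = prev[idx]
    return chain[::-1]


if __name__ == "__main__":
    example = [(1, 5, 3), (4, 1, 2), (6, 3, 4), (9, 7, 2), (12, 10, 5)]
    print(longest_collinear_chain(example))
    # prints: [(4, 1, 2), (6, 3, 4), (9, 7, 2), (12, 10, 5)]
```

The regions left between consecutive MUMs of the chain are the ones passed to the alignment algorithm in the second step.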

slide-89
SLIDE 89


Lowest Common Ancestor

Definition. The lowest common ancestor (lca) of any two nodes of a rooted tree is the deepest node which is an ancestor* of both nodes.

[Figure: a rooted tree with nodes numbered 1 to 7.]

The lca of 5 and 7 is 6, the lca of 1 and 3 is 2, and so on.

*A node u is an ancestor of a node v if u is a node that occurs on the unique path from the root to v.

slide-90
SLIDE 90


Lowest Common Ancestor

[Figure: a rooted tree with nodes numbered 1 to 7.]

How would you find the lowest common ancestor? What is the time complexity?

slide-91
SLIDE 91


Lowest Common Ancestor in O(3n) time

Using two stacks Si and Sj:

  • Starting at node i, visit all the parent nodes until reaching the root of the tree; each visited node is pushed onto Si.
  • Repeat the same operations starting at node j; this time, each visited node is pushed onto Sj.
  • Whilst the top nodes are identical, pop(Si) and pop(Sj).
  • The last identical node is the lowest common ancestor.
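The steps above can be sketched as follows. The tree is assumed to be given as a dictionary mapping each node to its parent (None for the root), which is a representation chosen for this example. The 7-node tree used in the demonstration is also an assumption: it is a complete binary tree numbered in-order, which is consistent with the lca values quoted earlier (the lca of 5 and 7 is 6, the lca of 1 and 3 is 2).

```python
def lca(parent: dict, i, j):
    """Lowest common ancestor of nodes i and j using two stacks."""

    def path_to_root(v) -> list:
        stack = []
        while v is not None:
            stack.append(v)  # the root ends up on top of the stack
            v = parent[v]
        return stack

    stack_i, stack_j = path_to_root(i), path_to_root(j)
    last_common = None
    # Pop whilst the top nodes are identical.
    while stack_i and stack_j and stack_i[-1] == stack_j[-1]:
        last_common = stack_i.pop()
        stack_j.pop()
    return last_common


if __name__ == "__main__":
    # Assumed tree: root 4, internal nodes 2 and 6, leaves 1, 3, 5, 7.
    parent = {4: None, 2: 4, 6: 4, 1: 2, 3: 2, 5: 6, 7: 6}
    print(lca(parent, 5, 7), lca(parent, 1, 3), lca(parent, 1, 7))
    # prints: 6 2 4
```

Each of the three phases (two climbs and the popping loop) visits at most n nodes, which is where the 3n in the title comes from.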

slide-92
SLIDE 92


Lowest Common Ancestor Problem (Overview)

Consider an input tree with n nodes, and assume that n < 4,294,967,296 = 2^32. In the unit-cost RAM model, O(log n) bits can be read, written or used as an address in constant time; here, words of 32 bits.

slide-93
SLIDE 93

Lowest Common Ancestor Problem (Overview)

(Although not necessary) Let’s also make the following assumptions.

  • 1. O(log n) bits can be compared, added, subtracted, multiplied, or divided in constant time.
  • 2. Bit-level operations on O(log n)-bit numbers can be performed in constant time, including AND, OR, XOR, left or right shift by up to O(log n) bits, creating masks of 1s, and finding the position of the left-most or right-most 1.

It can be shown, but we will not, that after pre-processing the input tree in time linear in the number of nodes, the lca of any two nodes can be found in constant time! See (Gusfield 1997) §8.
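To make the bit-level primitives of assumption 2 concrete, here is a small Python sketch. The function names are illustrative assumptions; positions are counted from the right, starting at 0, and the arguments are assumed to be positive integers. Python integers are unbounded, so this only illustrates the operations, it does not demonstrate the constant-time claim of the RAM model.

```python
def leftmost_one(x):
    """Index of the left-most 1 bit of x > 0 (0 = right-most position)."""
    return x.bit_length() - 1


def rightmost_one(x):
    """Index of the right-most 1 bit of x > 0; x & -x isolates that bit."""
    return (x & -x).bit_length() - 1


def low_mask(k):
    """A mask made of k 1s in the k right-most positions."""
    return (1 << k) - 1


x = 0b0110
print(leftmost_one(x), rightmost_one(x), bin(low_mask(3)))  # 2 1 0b111
```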

slide-94
SLIDE 94

LCA Algorithm: Overview

[Figure: a complete binary tree with 15 nodes labelled in in-order from 1 to 15 (root 8, its children 4 and 12, leaves 1, 3, 5, ..., 15); each label is also shown as a 4-bit binary number, from 0001 to 1111.]

For this overview of the lca algorithm, let’s consider the case of a complete rooted binary tree. This tree has p leaves and n nodes, where n = 2p − 1.

slide-98
SLIDE 98

LCA Algorithm: Overview

Furthermore, consider the in-order (Left-Root-Right) labelling of the tree and its interpretation as binary numbers. How much does it cost to label this tree? O(n) time. This is the pre-processing step.

slide-101
SLIDE 101

LCA Algorithm: Overview

The number of edges on any path from the root to any leaf is d = log₂ p. Let’s now interpret the numbers (labels) as (d + 1)-bit path numbers: reading the number from the left, each bit represents a direction, 0 = left, 1 = right.

slide-102
SLIDE 102

LCA Algorithm: Overview

The structure of a path number is as follows. For a node v at level i, the left-most i bits are the path bits, followed by a 1, which acts as a separator, and the remaining bits are 0s, i.e. (path bits, 1, 0s). For example, node 6 = 0110 is at level 2: its path bits are 01 (left, then right), followed by the separator 1 and a single 0.
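The (path bits, 1, 0s) structure can be checked with a short Python sketch that splits a label into its level and its sequence of directions. The function name `decode_path_number` and the "L"/"R" encoding of directions are assumptions made for this illustration; d is the depth of the complete binary tree (d = 3 for the 15-node example).

```python
def decode_path_number(label, d):
    """Split a (d + 1)-bit path number into (level, directions from the root)."""
    # Index of the separator: the right-most 1, counted from the right.
    sep = (label & -label).bit_length() - 1
    level = d - sep                      # number of path bits left of the separator
    bits = label >> (sep + 1)            # the path bits alone
    directions = "".join(
        "R" if (bits >> (level - 1 - b)) & 1 else "L" for b in range(level)
    )
    return level, directions


print(decode_path_number(0b0110, 3))  # (2, 'LR'): root 8 -> left to 4 -> right to 6
print(decode_path_number(0b1000, 3))  # (0, ''): the root
```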

slide-104
SLIDE 104

LCA Algorithm: Overview

Any two nodes that have a common ancestor at level k are labelled with path numbers whose first k bits are identical. Consider the nodes 5 and 7, 3 and 6, or 7 and 9.

slide-106
SLIDE 106

LCA Algorithm: Overview

Given two nodes u and v, what property of XOR(u, v), the bitwise exclusive-or of their path numbers, would be particularly interesting here? For example, for lca(9, 11): XOR(1001, 1011) = 0010.

slide-107
SLIDE 107

LCA Algorithm: Overview

Given two nodes u and v whose lowest common ancestor is at level k (and neither of which is an ancestor of the other), the left-most k bits of their path numbers are identical, so the left-most k bits of XOR(u, v), the XOR of the two path numbers, are all 0s. The left-most 1 of XOR(u, v) occurs at position k + 1 (from the left).

slide-108
SLIDE 108

LCA Algorithm: Overview

Given two nodes u and v, the path number of lca(u, v) is obtained as follows: calculate XOR(u, v) and find its left-most 1; let k be the position of that left-most 1 (counted from the left, starting at 1); shift u right by d + 1 − k positions; set the right-most bit to 1; shift the result left by d + 1 − k positions, thus inserting 0s.

slide-109
SLIDE 109

LCA Algorithm: Overview

Example: lca(9, 14). XOR(1001, 1110) = 0111, so k = 2. Shift 1001 right by d + 1 − k = 3 + 1 − 2 = 2 positions, the result is 10; set the right-most bit to 1, the result is 11; shift 11 left by d + 1 − k = 2 positions, padding with 0s, the result is 1100, i.e. node 12.
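The shift-and-set procedure can be sketched in Python as below. Two points are assumptions of this sketch rather than part of the slides: positions are counted from the right (so the shift amount t plays the role of d + 1 − k), and the case where one node is an ancestor of the other is handled by also taking into account the separator bit (right-most 1) of each label.

```python
def lca_complete(u, v):
    """lca of two nodes of a complete binary tree, given their in-order labels."""

    def lowest_set(x):
        # Index (from the right) of the separator bit of a path number.
        return (x & -x).bit_length() - 1

    # Index of the left-most differing bit; -1 when u == v.
    diverge = (u ^ v).bit_length() - 1
    # The lca's separator sits at the highest of: the divergence point,
    # the separator of u, the separator of v (the latter two cover ancestors).
    t = max(diverge, lowest_set(u), lowest_set(v))
    # Shift right by t, set the right-most bit to 1, shift back left by t.
    return ((u >> t) | 1) << t


print(lca_complete(9, 14))  # 12
print(lca_complete(9, 11))  # 10
print(lca_complete(8, 9))   # 8 (the root is an ancestor of 9)
```

The function uses a fixed number of bit-level operations, matching the constant-time query claimed for the complete binary tree.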

slide-110
SLIDE 110

LCA Algorithm: Overview

Pre-processing (labelling) requires O(n) time. lca(u, v) requires a fixed number of bit-level operations, each of which can be performed in constant time.

The idea behind the general lca algorithm is to conceptually map the nodes of a complete binary tree, labelled with path numbers, onto the nodes of the input tree, in such a way that the result of an lca query on the complete binary tree can be used to answer a query on the input tree.

slide-112
SLIDE 112

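LCA and Suffix Trees

What does lca mean in the context of suffix trees?

[Figure: a string S in which the suffixes starting at positions i and j both begin with α, followed by γ and δ respectively; in the suffix tree, the leaves i and j hang below a node v whose path label is α.]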

slide-113
SLIDE 113

LCA and Suffix Trees

lca(i, j) returns the deepest node which is a common ancestor of both leaves i and j. The path from the root to that node spells the longest common prefix of the suffixes i and j. Therefore, once the tree has been pre-processed, which takes a linear amount of time, the longest common prefix of any two suffixes can be found in constant time!

slide-117
SLIDE 117

Definition

Definition. The longest common extension problem is as follows. Given two strings, S1 and S2, after a preprocessing phase one should be able to find, for any pair of positions i and j, the longest substring that starts at position i in S1 and also at position j in S2.

slide-118
SLIDE 118

Longest Common Extension Algorithm

To solve this problem, first build a generalized suffix tree for S1 and S2. Then pre-process the tree so that lca queries can be answered in constant time, and label every node with its string-depth; each of these two steps takes linear time.

The length of the longest common extension starting at positions i and j is the string-depth recorded at the node lca(i, j), where i and j denote the leaves of suffix i of S1 and suffix j of S2.
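Building the generalized suffix tree and the constant-time lca structure is beyond a short example. As a stand-in, the sketch below answers lce queries by direct character comparison. This is a swapped-in naive technique, not the suffix-tree method described above: a query costs time proportional to the answer instead of O(1), but it returns the same value and is handy as a reference when testing. The function name and the 0-based indexing are assumptions of this sketch.

```python
def lce(s1, i, s2, j):
    """Length of the longest common extension of s1[i:] and s2[j:] (0-based)."""
    length = 0
    while (i + length < len(s1) and j + length < len(s2)
           and s1[i + length] == s2[j + length]):
        length += 1
    return length


# Positions 2 and 3 in the 1-based numbering of the slides are indices 1 and 2.
print(lce("xabcde", 1, "yyabcf", 2))  # 3, the common extension is "abc"
```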

slide-119
SLIDE 119

Is it useful?

slide-120
SLIDE 120

k-mismatch problem

Definition. Given a pattern P, a text T and a number of mismatches, k, fixed in advance and independent of |P| and |T|, a k-mismatch of P against T is a substring of T that matches P with at most k mismatches (errors); in other words, there are at least |P| − k matches. The k-mismatch problem consists in finding all k-mismatches.

slide-121
SLIDE 121

k-mismatch problem

Given T = Taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu (the name of a hill south of Waipukurau, New Zealand) and P = auauatamateapakaiwhemua, find all 3-mismatches.

slide-122
SLIDE 122

k-mismatch check

Is there a 3-mismatch occurrence of P at position 25 of T?

T = taumatawhakatangihangakoauauotamateapokaiwhenuakit...
P = auauatamateapakaiwhemua

slide-125
SLIDE 125

k-mismatch check

Checking for a k-mismatch starting from position i in T, where n = |P|.

  • 1. Set j to 1, i′ to i and count to 0;
  • 2. Compute l = lce((P, j), (T, i′));
  • 3. If j + l = n + 1 then a k-mismatch of P in T occurs at position i, stop.
  • 4. If count < k then increment count by one, set j to j + l + 1 and i′ to i′ + l + 1, go to 2.
  • 5. A k-mismatch of P does not occur in T at position i.

⇒ A similar algorithm exists for matching with wild cards.
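The check can be sketched in Python as follows. Several choices here are assumptions of the sketch: `lce` is a naive character-by-character stand-in for the constant-time suffix-tree query, the position i is 1-based as in the slides while the internal cursors are 0-based, and a window of T shorter than P is rejected up front. A mismatch is skipped only while fewer than k mismatches have been used.

```python
def lce(a, i, b, j):
    """Naive longest common extension of a[i:] and b[j:] (0-based stand-in)."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def k_mismatch_at(p, t, i, k):
    """True if p matches t at 1-based position i with at most k mismatches."""
    if i < 1 or i - 1 + len(p) > len(t):
        return False                      # the window does not fit in t
    j, ti, count = 0, i - 1, 0            # 0-based cursors into p and t
    while True:
        l = lce(p, j, t, ti)
        if j + l == len(p):               # the end of p has been reached
            return True
        if count == k:                    # no mismatch budget left
            return False
        count += 1                        # skip over the mismatching character
        j += l + 1
        ti += l + 1


T = "taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu"
P = "auauatamateapakaiwhemua"
print(k_mismatch_at(P, T, 25, 3))  # True
print(k_mismatch_at(P, T, 25, 2))  # False
```

At most k + 1 lce queries are made per position, so with constant-time lce the check costs O(k) and scanning all positions of T costs O(k|T|).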

slide-126
SLIDE 126

Exact matching with wild cards

A protein motif called the Zinc finger†: c..c.............h..h, where the “.” symbol matches any character.

The following (regular) expression matches 45 words in /usr/share/dict/words: i...ement, including: imbuement, implement, inclement, increment, induement and inurement.

†PROSITE is a collection of known protein motifs (www.expasy.ch/prosite).

slide-127
SLIDE 127

Exact String Matching with Wild Cards

Finding a match starting from position i in T, where n = |P|.

  • 1. Set j to 1 and i′ to i.
  • 2. Compute l = lce(j, i′), where j is a starting position in P and i′ a position in T.
  • 3. If j + l = n + 1 then P occurs in T at position i, stop.
  • 4. If P(j + l) or T(i′ + l) is a wild card, set j to j + l + 1 and i′ to i′ + l + 1, go to 2.
  • 5. P does not occur in T at position i.

⇒ How much space is needed? For a fixed number k of wild cards, the check at one position takes O(k) time. How much time is needed to find all occurrences?
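A Python sketch of the steps above follows. As assumptions of this sketch, the wild card is the "." character, `lce` is a naive stand-in that treats "." as an ordinary symbol (so an extension always stops at a wild card unless both strings have one), and the position i is 1-based.

```python
WILD = "."


def lce(a, i, b, j):
    """Naive longest common extension of a[i:] and b[j:] (0-based stand-in)."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def occurs_at(p, t, i):
    """True if p, possibly containing wild cards, occurs in t at 1-based position i."""
    if i < 1 or i - 1 + len(p) > len(t):
        return False
    j, ti = 0, i - 1
    while True:
        l = lce(p, j, t, ti)
        if j + l == len(p):
            return True
        if p[j + l] == WILD or t[ti + l] == WILD:
            j += l + 1                    # step over the wild card
            ti += l + 1
        else:
            return False                  # a genuine mismatch


print(occurs_at("i...ement", "implement", 1))  # True
print(occurs_at("i...ement", "apartment", 1))  # False
```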

slide-128
SLIDE 128

[Figure: P aligned against T at position i, with j at the first position of P and i′ = i.]

⇒ At the start of the algorithm.

slide-129
SLIDE 129

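[Figure: the suffixes starting at j in P and at i′ in T share the prefix α, which is followed by a wild card (“.”); in the generalized suffix tree, the leaves j and i′ meet at a node v whose path label is α, and j and i′ then advance past the wild card.]

⇒ Since the longest common extension is immediately followed by a wild card, the algorithm is allowed to continue.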

slide-130
SLIDE 130

[Figure: the next longest common extension, β, starting at the new positions j and i′, runs up to the end of P (marked by $).]

⇒ The end of P has been reached.

slide-131
SLIDE 131

Definition

Definition. A maximal palindrome of radius d is a substring S′ of S such that S′ = α α^r, where α^r denotes the reverse of α and |α| = d, and such that no substring of S with the same centre and a radius d′ > d is a palindrome.

According to the above definition, the length of S′ is even.

slide-137
SLIDE 137

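Maximal Palindromes

[Figure: a string S containing the even-length palindrome c b a a b c, centred between positions k and k+1, shown above its reverse S^r, in which the reverse of the left half (a b c) starts at position n−k+1.]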

slide-138
SLIDE 138

Finding All Maximal Palindromes

  • 1. Create a generalized suffix tree for S and S^r (the reverse of S), and process the tree so that lce queries can be answered in constant time. Creating S^r, building the generalized suffix tree and the pre-processing necessary for lce take O(|S|) time.
  • 2. For q from 1 to |S| − 1, k = lce((S, q + 1), (S^r, |S| − q + 1)) is the radius of the longest palindrome centred between positions q and q + 1.

Each iteration takes O(1) time.

⇒ Here lce is the “longest common extension”.
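The loop of step 2 can be sketched in Python as below. The `lce` helper is a naive stand-in for the constant-time query (an assumption of this sketch), so the overall running time here is quadratic in the worst case rather than linear; the centres q are reported in the 1-based numbering of the slides.

```python
def lce(a, i, b, j):
    """Naive longest common extension of a[i:] and b[j:] (0-based stand-in)."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def maximal_even_palindromes(s):
    """List of (q, radius): maximal palindrome centred between positions q and q+1."""
    r = s[::-1]
    n = len(s)
    result = []
    for q in range(1, n):
        # 1-based position q+1 of S is index q; position n-q+1 of S^r is index n-q.
        radius = lce(s, q, r, n - q)
        if radius > 0:
            result.append((q, radius))
    return result


print(maximal_even_palindromes("xcbaabcy"))  # [(4, 3)], i.e. "cbaabc"
```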

slide-139
SLIDE 139

Finding Maximal Palindromes

Hmm… is it biologically relevant? Probably not.

slide-140
SLIDE 140

Finding Maximal Palindromes

a u c g a u a a a a a g g

The above string is biologically relevant, why?

slide-141
SLIDE 141

Finding Maximal Palindromes

[Figure: the string above, with the substring a u c g a u centred between positions k and k+1.]

It contains a biological palindrome: a string that reads the same when reversed and complemented. The complement is obtained with the following rules: A is the complement of U (and vice versa), and G is the complement of C (and vice versa).

slide-142
SLIDE 142

Finding Maximal Palindromes

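[Figure: the biological palindrome A U A C G C G C G U A U, whose second half G C G U A U is the reverse complement of its first half A U A C G C.]

Here α^c denotes the complement of α.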

slide-143
SLIDE 143

Finding Maximal Palindromes

How can we find the maximal biological palindromes?

slide-144
SLIDE 144

Finding Maximal Palindromes

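[Figure: the string S, with the biological palindrome a u c g a u centred between positions k and k+1, shown above its reverse complement S^rc, in which the matching suffix starts at position n−k+1.]

Given an input sequence S of length n, compute its reverse complement, S^rc; then, for every k, compute lce((S, k + 1), (S^rc, n − k + 1)).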
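The same loop as for ordinary palindromes applies, with the reverse complement in place of the reverse. The Python sketch below makes a few assumptions: the RNA alphabet is written in upper case (A, C, G, U), `lce` is a naive stand-in for the constant-time query, and the result is a dictionary from the centre k (1-based, between positions k and k+1) to the radius found there.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def lce(a, i, b, j):
    """Naive longest common extension of a[i:] and b[j:] (0-based stand-in)."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def reverse_complement(s):
    """Complement every base (A<->U, G<->C), then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]


def biological_palindromes(s):
    """Map each centre k with a non-zero radius to the radius of its palindrome."""
    src = reverse_complement(s)
    n = len(s)
    radii = {}
    for k in range(1, n):
        radius = lce(s, k, src, n - k)
        if radius > 0:
            radii[k] = radius
    return radii


print(reverse_complement("AUCGAU"))                  # AUCGAU
print(biological_palindromes("AUACGCGCGUAU")[6])     # 6
```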

slide-145
SLIDE 145

Finding Maximal Palindromes

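[Figure: a string S and its reverse S^r; the suffix of S starting at position q+1 and the suffix of S^r starting at position n−q+1 share the prefix α. In the generalized suffix tree, the leaves q+1 and n−q+1 meet at the node lca(q+1, n−q+1), whose path label is α.]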

⇒ Similarly for complemented palindromes or for palindromes separated by a bounded distance k.

slide-146
SLIDE 146

Seed

Seed: bio.site.uottawa.ca/software/seed

Mohammad Anwar, Truong Nguyen and Marcel Turcotte (2006) Identification of consensus RNA secondary structures using suffix arrays. BMC Bioinformatics, 7:244.

slide-147
SLIDE 147

The Nobel Prize in Physiology or Medicine 2006

In 2006, Fire and Mello received the Nobel Prize in Physiology or Medicine for their discovery of RNA interference, a cellular process by which the expression of a specific gene is inhibited; we say that the gene has been silenced.

Andrew Z. Fire
Stanford University School of Medicine, Stanford, CA, USA

Craig C. Mello
University of Massachusetts Medical School, Worcester, MA, USA

slide-148
SLIDE 148

RNA interference

RNA interference (RNAi) is a mechanism in molecular biology where the presence of certain fragments of double-stranded ribonucleic acid (dsRNA) interferes with the expression of a particular gene which shares a homologous sequence with the dsRNA. (Wikipedia)

See www.nature.com/focus/rnai/animations/ or bcove.me/k8cp9woy/

slide-150
SLIDE 150

Examination 2006

RNA silencing involves an RNAi element, which consists of a stem-loop secondary structure, where the stem (a double-stranded region) is 20 to 25 nucleotides long and the loop is at least 4 nucleotides long but no more than k. Moreover, one of the two strands of the stem is the reverse complement of a portion of the gene that it silences. RNAi elements are encoded by the genome.

Outline an algorithm in pseudo-code that finds all the RNAi elements of a given genome. Specifically, it finds stem-loop structures, with the above characteristics, such that one of their two strands is the complement of an existing gene. Assume that the location of all the protein-coding genes is known. Make sure to describe the necessary data structures and how they are initialized.

[Figure: an RNA stem-loop with its 5′ and 3′ ends, the stem and the loop labelled.]

SLIDE 152

Definition

  • Definition. A tandem repeat is a string α such that α = ββ, where β is a non-empty substring.

Finding tandem repeats in O(n^2) time, assuming each longest common extension query takes constant time:

    for i from 1 to n-1 do
        for j from i+1 to n do
            l = length of the longest common extension of i, j
            if i+l >= j then
                a tandem repeat of length 2(j-i) starts at i
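The quadratic search can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the function names `lce` and `tandem_repeats` are hypothetical, indices are 0-based, and `lce` scans characters directly instead of answering in constant time through a suffix tree with lowest common ancestor queries, so this version is slower than O(n^2) in the worst case.

```python
def lce(s, i, j):
    """Length of the longest common extension of s[i:] and s[j:].

    Naive character scan (an assumption made for brevity); a suffix tree
    with constant-time LCA queries would answer this in O(1).
    """
    k = 0
    while j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def tandem_repeats(s):
    """Return (start, length) for every tandem repeat of s, 0-based."""
    found = []
    n = len(s)
    for i in range(n - 1):
        for j in range(i + 1, n):
            # The block s[i:j] repeats at position j exactly when the common
            # extension of i and j covers the whole block of length j - i.
            if lce(s, i, j) >= j - i:
                found.append((i, 2 * (j - i)))
    return found


if __name__ == "__main__":
    print(tandem_repeats("mississippi"))
```

Each reported pair gives the start of the repeat and its total length, so the two copies of β are `s[start:start + length // 2]` and the block that follows it.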

SLIDE 153

k-mismatch tandem repeat

  • Definition. A k-mismatch tandem repeat is a substring that becomes a tandem repeat after k or fewer characters are changed.

Outline an algorithm which finds k-mismatch tandem repeats. What’s the complexity of your algorithm?

⇒ (Landau & Schmidt 1993) presents an algorithm running in O(kn log(n/k) + z) time, where z is the number of tandem repeats.
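One straightforward answer, sketched below under stated assumptions, extends the quadratic search: for each candidate pair (i, j), compare the two halves with at most k+1 longest common extension jumps, skipping one mismatched character after each jump. With constant-time LCE queries this would take O(kn^2) time; it is not the Landau and Schmidt algorithm. The names `lce` and `k_mismatch_tandem_repeats` are hypothetical, indices are 0-based, and `lce` is again a naive scan standing in for a suffix tree with LCA queries.

```python
def lce(s, i, j):
    """Length of the longest common extension of s[i:] and s[j:] (naive scan)."""
    k = 0
    while j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def k_mismatch_tandem_repeats(s, k):
    """Return (start, length) for substrings that become a tandem repeat
    after changing at most k characters (0-based start)."""
    found = []
    n = len(s)
    for i in range(n - 1):
        for j in range(i + 1, n):
            half = j - i
            if j + half > n:
                break  # the second half would run past the end of s
            pos = 0          # characters of the half compared so far
            mismatches = 0
            while pos < half and mismatches <= k:
                pos += lce(s, i + pos, j + pos)  # jump over a run of matches
                if pos < half:
                    mismatches += 1  # s[i + pos] differs from s[j + pos]
                    pos += 1         # step over the mismatch
            if mismatches <= k:
                found.append((i, 2 * half))
    return found


if __name__ == "__main__":
    print(k_mismatch_tandem_repeats("abcabd", 1))
```

With k = 0 the function reports exactly the ordinary tandem repeats, which gives a quick sanity check.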

SLIDE 154

Software using suffix trees/arrays

  • REPuter: bibiserv.techfak.uni-bielefeld.de/reputer
    S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, R. Giegerich (2001) REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Research, 29(22):4633-4642.

  • VMATCH: www.vmatch.de

  • MUMMER: mummer.sourceforge.net
    A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg (1999) Alignment of Whole Genomes. Nucleic Acids Research, 27(11):2369-2376.

SLIDE 155

Applications

Excerpts from www.vmatch.de:

  • Detecting unique substrings in large collection of DNA sequences that are used as signatures allowing for rapid and accurate diagnostics to identify pathogen bacteria and viruses;

  • Computing a non-redundant set from a large collection of protein sequences from Zea-Maize;

  • Finding sequence contamination errors in Arabidopsis thaliana;

  • Mapping clustered sequences to large genomes;

  • Pattern searches in plant sequences;

  • Computing repeats in complete genomes.

SLIDE 156

References

  • Gusfield D. (1997) Algorithms on strings, trees, and sequences. Cambridge University Press.

  • Landau G. M. and Schmidt J. P. (1993) An algorithm for approximate tandem repeats. Proc. 4th Symp. on Combinatorial Pattern Matching. Springer LNCS 684, pages 120–133.

SLIDE 158

Think about it!

Printing these notes is probably not necessary!
