Algorithms in Bioinformatics: A Practical Introduction Suffix tree - - PowerPoint PPT Presentation

algorithms in bioinformatics a practical introduction
SMART_READER_LITE
LIVE PREVIEW

Algorithms in Bioinformatics: A Practical Introduction Suffix tree - - PowerPoint PPT Presentation

Algorithms in Bioinformatics: A Practical Introduction Suffix tree Overview What is suffix tree? Simple application of suffix tree Linear time algorithm for constructing suffix tree Suffix array FM-index 1-mismatch search


slide-1
SLIDE 1

Algorithms in Bioinformatics: A Practical Introduction

Suffix tree

slide-2
SLIDE 2

Overview

 What is suffix tree?  Simple application of suffix tree  Linear time algorithm for constructing

suffix tree

 Suffix array  FM-index  1-mismatch search

slide-3
SLIDE 3

Suffix Trie

E.g. consider the string S = acacag$ Suffix Trie: a ties of all possible suffices of S $ a c a g c a g c g 7 c g 1 3 5 2 4 6 a g g a

$ 7 g$ 6 ag$ 5 cag$ 4 acag$ 3 cacag$ 2 acacag$ 1 Suffix

$ $ $ $ $ $

slide-4
SLIDE 4

Suffix Tree

Suffix tree for S=acacag$: merge nodes with only one child

1 2 3 4 5 6 7 a c a c a g $

S=

v $ a c a g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a

Path-label of node v is “aca” Denoted as α(v) “ca” is an edge label This is a leaf edge

slide-5
SLIDE 5

Size of Suffix Tree (I)

How big is a suffix tree?

Suffix tree has exactly n leaves and at most 2n-1 edges

The total length of all edge labels is O(n2).

Can we store suffix tree using o(n2) bit space?

1 2 3 4 5 6 7 a c a c a g $

S=

$ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a

slide-6
SLIDE 6

Size of Suffix Tree (II)

Suffix tree has exactly n leaves and at most 2n-1 edges Note that each edge label can be represented using 2 indices Thus, suffix tree can be represented using O(n log n) bits

1 2 3 4 5 6 7 a c a c a g $

S=

$ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $

6,7 6,7 6,7 6,7 4,7 4,7

a

7,7 1,1 2,3 2,3 Note: The end index of every leaf edge should be 7, the last index of S. Thus, for leaf edges, we only need to store the start index.

slide-7
SLIDE 7

Property of suffix tree

 Fact: For any internal node v in the suffix

tree, if the path label of v is α(v)=ap, then

 there exists another node w in the suffix tree such

that α(w)=p.

 Proof: Skip the proof.  Definition of Suffix Link:

 For any internal node v, define its suffix link sl(v)

= w.

slide-8
SLIDE 8

Suffix Link example

 S=acacag$

$ a c a g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a

slide-9
SLIDE 9

Generalized suffix tree

 Build a suffix tree for two or more strings  E.g. S1 = acgat#, S2 = cgt$

a c g g t a t t a t c 4 1 2 1 3 2 3 t a t $ $ $ # g t # 5 # # 4 6 # $ #

slide-10
SLIDE 10

Applications of Suffix Tree

slide-11
SLIDE 11

Exact string matching problem

To find all occurrences of Q in S (searching)

Search for the node x in the suffix tree which represent Q

All the leaves in the subtree rooted at x are the occurrences

Time: O(|Q| + occ) where occ is the total no. of occurrences

$ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a E.g. S = acacag$ Q = aca Occurrences: 1, 3

slide-12
SLIDE 12

Longest repeated substring problem

 To find the longest repeated substring in S

 Find the deepest internal node

 Time: O(n)

$ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a E.g. S = acacag$ The longest repeat is aca.

slide-13
SLIDE 13

Longest common substring problem

To find the longest common substring of two or more sequences

Note: 1970, Don Knuth conjectured that a linear time algorithm for this problem is impossible

Now, we know that it can be solved in linear time.

E.g. consider two string S1 and S2,

1.

Build generalized suffix tree for S1# and S2$

2.

Then, mark each internal node with leaves representing suffixes of both S1 and S2.

3.

Report the deepest marked node

slide-14
SLIDE 14

Example for the longest common substring

 E.g. S1 = acgat#, S2 = cgt$  The longest common substring is “cg”. Its length is 2.

a c g g t a t t a t c 4 1 2 1 3 2 3 t a $ $ $ # g t # 5 # # 4 6 # $ #t

slide-15
SLIDE 15

Longest common prefix (I)

 Given a string S. For any i, j,

 Denote lcp(i, j) be the length of the longest common prefix

  • f suffix i and j of S.

$ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a

1 2 3 4 5 6 7 a c a c a g $

S=

The longest common prefix

  • f suffix 1 and suffix 3 is aca!

lcp(1, 3) = 3

slide-16
SLIDE 16

Longest common prefix (II)

 Note that the lowest common ancestor(lca) of leaves

i and j identifies the longest common prefix.

 lcp(i, j) = |α(lca(i, j))|.  A well-know result:

 Consider a tree of size n, after an O(n) time preprocessing,

the lca for any two nodes can be returned in O(1) time.

 First obtained by Harel and Tarjan (SIAM J. Comp. 1984)  Simplified by Schieber and Vishkin (SIAM J. Comp. 1988)

 Based on the above result,

 After an O(n) time preprocessing,

 For any suffix i and suffix j, we can compute the longest

common prefix of them in O(1) time.

slide-17
SLIDE 17

Finding Palindrome (I)

Given a string S, palindrome is a substring u of S s.t. u = ur

E.g. ACAGACA

Consider a palindrome u=S[i..i+|u|-1], u is called a maximal palindrome if S[i’..j’] is not a palindrome for any [i’..j’]⊃[i..i+|u|-1].

Note that every palindrome is contained in a maximal palindrome.

Thus, maximal palindromes are a compact way to represent all palindromes.

Complemented Palindrome is a string u s.t. u = ūr

E.g. ACAUGU

Maximal complemented palindrome is defined similarly.

slide-18
SLIDE 18

Finding Palindrome (II)

 Recall that restriction enzyme usually is in the

form of complemented palindrome.

 This motivates the following two problems:

 The palindrome problem:

 Given a string S (representing the genome) of length n,

the problem is to locate all maximal palindromes in S.

 The complemented palindrome problem:

 Given a string S (representing the genome) of length n,

the problem is to locate all maximal complemented palindromes in S.

slide-19
SLIDE 19

Properties of palindrome (I)

 If S[i..i+k-1]=Sr[n-i+1..n-i+k], then

u=S[i-k+1..i+k-1] is an odd length palindrome

1 i n

S

a a a b a a a

1 n

Sr

a a a b a a a

i-k+1 i+k-1 n-i+1

slide-20
SLIDE 20

Properties of palindrome (II)

 If S[i..i+k-1]=Sr[n-i+2..n-i+k+1], then

u=S[i-k..i+k-1] is an even length palindrome

1 i-k i n

S

a a a b b a a a

1 n-i+2 n

Sr

a a a b b a a a

i+k-1

slide-21
SLIDE 21

Solution to the palindrome problem

 Preprocess S and Sr so that any longest

common prefix query can be answered in constant time.

 For i=1 to n,

 Find the longest common prefix for (Si, Sr

n-i+1). If

the longest prefix is k, we find an odd length maximal palindrome S[i-k+1..i+k-1].

 Find the longest common prefix for (Si, Sr

n-i+2). If

the longest prefix is k, we find an even length maximal palindrome S[i-k..i+k-1].

slide-22
SLIDE 22

Extracting embedded suffix tree from a generalized suffix tree

 Input: The generalized suffix tree T of K

strings S1, …, SK.

 Aim: Compute the suffix tree Ti of the string Si.

r

x y

a c g g t a t t a t c 4 1 2 1 3 2 3 t a t $ $ $ # g

z w

t # 5 # # 4 6 # $ #

r a c g g t a t a c 4 1 2 3 t a t # g

w

t # 5 # # 6 # #

S1 = acgat#, S2 = cgt$

T T1

slide-23
SLIDE 23

Extracting embedded suffix tree from a generalized suffix tree

Observation: Ti is a subtree of T such that

The leaves of Ti are the leaves of T corresponding to Si.

The internal nodes of Ti are the lowest common ancestors of some leaves for Si.

The edges of Ti can be inferred from the ancestor descendent relationship among those nodes.

r

x y

a c g g t a t t a t c 4 1 2 1 3 2 3 t a t $ $ $ # g

z w

t # 5 # # 4 6 # $ #

r a c g g t a t a c 4 1 2 3 t a t # g

w

t # 5 # # 6 # #

S1 = acgat#, S2 = cgt$

T T1

slide-24
SLIDE 24

Extracting embedded suffix tree from a generalized suffix tree

r a c g g t a t a c 4 1 2 3 t a t # g

w

t # 5 # # 6 # # r a c g g a t a c 4 1 2 3 t a t # g

w

t # # 6 # # r a c g a t c 4 1 2 a t g

w

t # # 6 # # r a c 4 1 a t g

w

t # 6 # # r a c 1 a t g 6 # # r 6 #

→ → → → →

slide-25
SLIDE 25

Common substrings of more than 2 strings (I)

 Given a set of strings (protein or DNA

sequences), we want to know what substrings are common to a large number of these strings?

 Why this question is important?

 DNA and protein sequences will evolve. If a

substring occur commonly in wide range of

  • species. This may mean that the substring is

critical for the correct functionality.

slide-26
SLIDE 26

Common substrings of more than 2 strings (II)

 Given K strings whose total length is n.  For every 2≤k≤K, define l(k) be the

length of the longest substring common to at least k of these strings.

 The problem is to compute l(k) for all k.

slide-27
SLIDE 27

Common substrings of more than 2 strings (III)

 Example:

 Consider a set of 5 strings { sandollar,

sandlot, handler, grand, pantry }

 Then, we have

k l(k) corresponding substring 2 4 sand 3 3 and 4 3 and 5 2 an

slide-28
SLIDE 28

Common substrings of more than 2 strings (IV)

Illustrating the solution by example:

S1 = aacg$, S2 = acgc#, S3 = cga%. (K=3)

1.

Build a generalized suffix tree T for the K strings in O(n) time. $ a c g c g g c 5 2 1 $ # 5 4 # % 3 % 1 g $ c a 4 # c 3 1 $ # 2 a % c 4 2 $ # 3 a %

slide-29
SLIDE 29

Common substrings of more than 2 strings (V)

2.

By traversing T, for each internal node v, compute its string depth. In total, O(n) time.

1 1 3

$ a c g c g g c 5 2 1 $ # 5 4 # % 3 % 1 g $ c a 4 #

2

c 3 1 $ # 2 a %

1

c 4 2 $ # 3 a %

slide-30
SLIDE 30

Common substrings of more than 2 strings (VI)

3.

By traversing T, for each internal node v, compute C(v). [C(v) is defined as the number of distinct termination symbols in the subtree rooted at v]

This step takes O(Kn) time. 3

3 3 2

$ a c g c g g c 5 2 1 $ # 5 4 # % 3 % 1 g $ c a 4 #

3

c 3 1 $ # 2 a %

3

c 4 2 $ # 3 a %

slide-31
SLIDE 31

Common substrings of more than 2 strings (VII)

4.

Traverse T and visit every internal node v. For each v, if V(C(v)) < string-depth of v, set V(C(v)) = string-depth of v. [After step 4, V(k) = the length of the longest substring common to exactly k of these strings.]

5.

l(k)=V(k). For i=k-1 downto 2, l(i)=max{l(i+1), V(i)}.

This two steps take O(n) time.

For our example, V(2) = 3, V(3) = 2.

Thus, l(3) = 2, l(2) = 3.

In total, this algorithm takes O(Kn) time.

Actually, we can improve this algorithm to O(n) time by mean

  • f lcp!
slide-32
SLIDE 32

Linear time algorithm for constructing suffix tree

slide-33
SLIDE 33

Straightforward construction

  • f suffix tree

 Consider S = s1s2…sn where sn=$  Algorithm:

 Initialize the tree with only a root  For i = n to 1

 Includes S[i..n] into the tree

 Time: O(n2)

slide-34
SLIDE 34

Example of construction

 S=acca$

  Init For-loop $ 5 I5 a 4 I4 5 $ $  a 4 I3 5 $ $ 3 a $ c I2 c c a a 3 2 a 4 5 $ $ $ $  I1 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $

slide-35
SLIDE 35

Construction of generalized suffix tree

 S’= c#

 I1 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $ Init For-loop J2 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $ 2 #  J1 c c a a 3 2 5 $ $ $ c c a a 4 1 $ $ 2 # # 1

slide-36
SLIDE 36

Can we construct a suffix tree in o(n2) time?

  • Yes. We can construct it in O(n) time.

Weiner’s algorithm [1973]

Linear time for constant size alphabet, but much space

McGreight’s algorithm [JACM 1976]

Linear time for constant size alphabet, quadratic space

Ukkonen’s algorithm [Algorithmica, 1995]

Online algorithm, linear time for constant size alphabet, less space

Farach’s algorithm [FOCS 1997]

Linear time for general alphabet

Hon,Sadakane, and Sung’s algorithm [FOCS 2003]

O(n) bit space O(n logen) time for 0<e<1

O(n) bit space O(n) time for suffix array construction

We will discuss Farach’s algorithm later.

slide-37
SLIDE 37

Idea

 Build Odd Suffix Tree and Even Suffix Tree  Then, merge odd and even suffix tree.

$ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a $ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a

Even Suffix Tree Odd Suffix Tree

slide-38
SLIDE 38

Idea

Input: a string S of length n

1.

Recursively compute the suffix tree To of all suffixes beginning at the odd positions.

To is of size n/2.

2.

From To, compute Te which is the suffix tree for all suffixes beginning at the even positions.

3.

Merge To and Te to form the suffix tree for S.

slide-39
SLIDE 39

Stage 1: Constructing odd suffix tree

 Given a string S[1..n], we generate a new

string S’[1..n/2] as follows.

 we map pairs of characters into single characters

as follows:

 S[1..2], S[3..4], S[5..6], …, S[n-1..n].

 Remove the duplicates from the pairs of

characters and sort them by radix sort.

 S’[i] = rank of S[2i-1..2i] in the sorted list, for

i=1, 2, …, n/2.

 By recursion, we get the suffix tree T’ for S’  Convert T’ to the odd suffix tree To.

slide-40
SLIDE 40

Example (I)

 S = aaabbbabbaba$  S[1..2]=aa, S[3..4]=ab, S[5..6]=bb,

S[7..8]=ab, S[9..10]=ba, S[11..12]=ba.

 By stable sort, aa < ab < ba < bb.

 Rank(aa)=1, Rank(ab)=2, Rank(ba)=3,

Rank(bb)=4.

 So, S’=124233$.

slide-41
SLIDE 41

Example (II)

 By recursion, construct the suffix tree

T’ for S’:

7 $ 1 4 2 2 3 6 5 $ 3

slide-42
SLIDE 42

Example (III)

 Convert T’ to the odd tree:

13 $ 1 7

a b

3 11 9 $ 5 This is not a suffix tree i  2i-1

slide-43
SLIDE 43

Example (IV)

 Refine the odd tree To:

13 $ 1 7

b b

3 11 9 $ 5

b a a

slide-44
SLIDE 44

Time complexity for building the

  • dd tree

 Let Time(n) be the time to build a suffix

tree for a string of length n.

 Stable sorting and refinement of the

  • dd trees take O(n) time.

 Build suffix tree for S’ takes Time(n/2).  So, Stage 1 takes Time(n/2)+O(n) time.

slide-45
SLIDE 45

Stage 2: Build the even tree

1.

Generate the lex-ordering of the leaves in Te.

2.

For any two adjacent leaves 2i and 2j, we find lcp(2i, 2j).

3.

Construct the even tree Te from left to right (according to the lex-ordering).

slide-46
SLIDE 46

Build the even tree (Step 1)

 We get the lex-ordering of the leaves in

To.

 Generate the lex-ordering of the leaves

in Te.

 For each leaf i in To, get the preceding

character c=S[i-1] and form a pair (c,i). Each pair represents a even suffix i-1.

 Perform stable sorting on those pairs. We

get the lex-ordering of the leaves in Te.

slide-47
SLIDE 47

Example

 Lex-ordering of the leaves in To:

 13 < 1 < 7 < 3 < 11 < 9 < 5

 The pairs are:

 (a,13), ($,1), (b,7), (a, 3), (a, 11), (b, 9), (b, 5).

 After stable sorting, we have

 ($, 1), (a, 13), (a, 3), (a, 11), (b, 7),

(b, 9), (b, 5).

 Hence, the lex-ordering of the leaves of Te:

 12 < 2 < 10 < 6 < 8 < 4

S = aaabbbabbaba$

slide-48
SLIDE 48

Build the even tree (Step 2)

 For any two adjacent leaves 2i and 2j, we

first find lcp(2i, 2j).

 Observation: lcp(2i, 2j) =

 lcp(2i+1, 2j+1)+1 if S[2i]=S[2j]  0 otherwise

 Proof:

 If S[2i]≠S[2j], lcp(2i,2j)=0.  Otherwise, lcp(2i,2j)=1+lcp(2i+1,2j+1).

slide-49
SLIDE 49

Example

 Recall that the lex-

  • rdering of leaves:

 12 < 2 < 10 < 6 < 8 < 4.

 By the previous

  • bservation, we have

 lcp(8,4)=lcp(9,5)+1=2

 Similarly, we have

 lcp(12,2)=1, lcp(2,10)=1,

lcp(10,6)=0, lcp(6,8)=1, lcp(8,4)=2 13 $ 1 7

b b

3 11 9 $ 5

b a a

slide-50
SLIDE 50

Build the even tree (Step 3)

 Construct the even tree Te from left to

right.

12

a

12 $ 2

a b b b a b b a b a $

a

12 $ 2

a b b b a b b a b a $

10

slide-51
SLIDE 51

Build the even tree (Step 3)

a

12 $ 2

a b b b a b b a b a $

10 6

a

12 $ 2

a b b b a b b a b a $

10 6

b

8

a

12 $ 2

a b b b a b b a b a $

10 6

b

8 4

b

slide-52
SLIDE 52

Time complexity for building the even tree

 Step 1: O(n) time  Step 2: O(n) time  Step 3: O(n) time

slide-53
SLIDE 53

Stage 3: Merge odd and even trees

We can merge To and Te by DFS. However, it takes O(n2) time.

13 $ 1

a a b b b a b b a b a $

7

a b a $ b b

3 11 9

b a $

$ 5

babbaba$

b b a b b a b a $

b a a

13 $ 1

a a b b b a b b a b a $

7

a b a $ b b

3 11 9

b a $

$ 5

babbaba$

b b a b b a b a $

b a a

Even tree Odd tree

a

12 $ 2

a b b b a b b a b a $

10 6

b

8 4

b a

12 $ 2

a b b b a b b a b a $

10 6

b

8 4

b

13

$,1

1 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,1

2

b,1

10

a,2 a,4 b,9

6

b,1 b,5

8 4

b,1 a,2

5

b,5 b,8 a,11 b,10 a,2 a,2 b,1

13

$,1

1 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,1

2

b,1

10

a,2 a,4 b,9

6

b,1 b,5

8 4

b,1 a,2

5

b,5 b,8 a,2 a,2 b,1

slide-54
SLIDE 54

Stage 3: Merge odd and even trees

We merge To and Te by DFS. We merge two edges as long as they start with the same character. The merge is ended when one edge is longer than the other.

13 $ 1 7 3 11 9 $ 5

b a a a

12 $ 2

a b b b a b b a b a $

10 6

b

8 4

b

slide-55
SLIDE 55

Merge odd and even trees

We merge To and Te by DFS. We merge two edges as long as they start with the same character. The merge is ended when one edge is longer than the other.

13

$,1

2 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,11

1

$,1 b,2

10

$,1 a,4 b,9

6

b,3 b,3

8 4

b,1 a,4

5

b,3 b,8

slide-56
SLIDE 56

Merge odd and even trees

The merging may over-merged some nodes.

To correct the tree, we need to unmerge some nodes.

13

$,1

2 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,11

1

$,1 b,2

10

$,1 a,4 b,9

6

b,3 b,3

8 4

b,1 a,4

5

b,3 b,8

slide-57
SLIDE 57

Definition of L() and d()

 For every node u which may be over-merged,

there exist two leaves 2i and 2j-1 such that u=lca(2i, 2j-1).

 Denote L(u) be the correct depth of u, that is,

lcp(2i,2j-1).

 Note that lcp(2i,2j-1) = 1+lcp(2i+1,2j) if

S[2i]=S[2j-1]; 0 otherwise.

 Let v be lca(2i+1,2j).  Denote d(u) = v.  Note that d() is equivalent to suffix link!

slide-58
SLIDE 58

Example of d()

13

$,1

2 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,11

1

$,1 b,2

10

$,1 a,4 b,9

6

b,3 b,3

8 4

b,1 a,4

5

b,3 b,8

d()

slide-59
SLIDE 59

Relationship between L() and d()

Suppose u = lca(2i, 2j-1).

Note1: if u is not the root, then S[2i]=S[2j-1].

Note2: lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0

  • therwise

Note3: d(u) = lcp(2i+1,2j)

Hence, L(u) = 1 + L(d(u)) if u is not the root. Otherwise, L(u)=0.

Lemma: L(u) = the length of the purple path from u to the root.

slide-60
SLIDE 60

Example of L()

13

$,1

2 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,11

1

$,1 b,2

10

$,1 a,4 b,9

6

b,3 b,3

8 4

b,1 a,4

5

b,3 b,8

L( )=2 L( )=3 L( )=4 L( )=2 L( )=2 L( )=2 L( )=1 L( )=1

slide-61
SLIDE 61

Unmerge the border nodes based

  • n L() (I)

13

$,1

2 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,11

1

$,1 b,2

10

$,1 a,4 b,9

6

b,3 b,3

8 4

b,1 a,4

5

b,3 b,8

L( )=2 L( )=3 L( )=4 L( )=2 L( )=2 L( )=2 L( )=1 L( )=1

slide-62
SLIDE 62

Unmerge the border nodes based

  • n L() (II)

13

$,1

1 7 3 11 9

$,1 b,1 a,1 a,1

12

$,1 a,1

2

b,1

10

a,2 a,4 b,9

6

b,1 b,5

8 4

b,1 a,2

5

b,5 b,8 a,11 b,10 a,2 a,2 b,1

slide-63
SLIDE 63

Time complexity for merging

 Merge the tree using DFS takes O(n)

time.

 Compute the links d() takes O(n) time.  Compute L() takes O(n) time.  Unmerge takes O(n) time.

slide-64
SLIDE 64

Total time complexity of Farach’s algorithm

 Stage 1: Time(n/2)+O(n)  Stage 2: O(n)  Stage 3: O(n)  Thus, Time(n) = Time(n/2)+O(n).  By solving the equation,

 Time(n)= O(n).

slide-65
SLIDE 65

Disadvantage of suffix tree

 Suffix tree is space inefficient. It

requires O(n|Σ|log n) bits.

 Manber and Myers (SIAM J. Comp 1993)

proposes a new data structure, called suffix array, which has a similar functionality as suffix tree. Moreover, it

  • nly requires O(n log n) bits.
slide-66
SLIDE 66

Suffix Array (I)

It is just sorted suffixes.

E.g. consider S = acacag$

Suffix array is an array of n indices. Thus, it takes O(n log n) bits.

Suffix Position SA[i] Suffix acacag$ 1 7 $ cacag$ 2 1 acacag$ acag$ 3 => 3 acag$ cag$ 4

Sort

5 ag$ ag$ 5 2 cacag$ g$ 6 4 cag$ $ 7 6 g$

slide-67
SLIDE 67

Observation

 The leaves of a suffix tree is in suffix

array order.

$ a ca g c a g c g c g 7 1 3 5 2 4 6 a g g $ $ $ $ $ $ a

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$

slide-68
SLIDE 68

Linear time construction of suffix array from suffix tree

 Recall that the suffix tree T of S[1..n] can be

constructed in O(n) time.

 Then, by “lexical” depth-first traversal of T,

the suffix array of S is obtained.

 This takes O(n) time.

 However, the space used during construction

is the same as that for suffix tree! This defeats the purpose of suffix array.

 Today, we can build suffix array using O(n)

bit space and O(n) time.

slide-69
SLIDE 69

range(T,Q)

 For a pattern Q, its

  • ccurrences in T form a

consecutive SA range.

 Example: For T= acacag$, ca

  • ccurs in SA[5] and SA[6].

 Definition:

 We called range(T,Q)=[st..ed]

if

 Q is a prefix of every Tj for

j=SA[st], SA[st+1], …, SA[ed]

 where Tj = j suffix of T = T[j..n].

 Example: range(T,ca)=[5..6]

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ 1 2 3 4 5 6 7

slide-70
SLIDE 70

Find occurrence of query Q in a string S using suffix array

Input: (1) the suffix array of a string T of length n and (2) a query Q of length m

Aim: check if Q occurs in T

Idea: binary search!

slide-71
SLIDE 71

Algorithm

slide-72
SLIDE 72

Example

 Consider T = acacag$  Pattern Q = acag  L=1  R=7  M=(L+R)/2=4

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-73
SLIDE 73

Example

 Consider T = acacag$  Pattern Q = acag  L=1  R=7  M=(L+R)/2=4  suffix-SA[M] > Q.  Set R=M=4.

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-74
SLIDE 74

Example

 Consider T = acacag$  Pattern Q = acag  L=1  R=4  M=(L+R)/2=2  suffix-SA[M] < Q.  Set L=M=2.

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-75
SLIDE 75

Example

 Consider T = acacag$  Pattern Q = acag  L=2  R=4  M=(L+R)/2=3  The pattern Q is found at SA[M]=3.

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-76
SLIDE 76

Can we do better?

 During each step of binary search,

 we need to compare Q with a suffix using O(m) time, which is

time consuming.

 Can we do better?  We have the following observation.

 Suppose LCP(Q, suffix-SA[L]) is l and LCP(Q, suffix-SA[R]) is r.  Then, LCP(Q, suffix-SA[M]) > min{l,r}.

 Below, we describe how to utilize this observation to

speedup the computation.

slide-77
SLIDE 77

Algorithm

slide-78
SLIDE 78

Example

 Consider T = acacag$  Pattern Q = acag  L=1, l=0  R=7, r=0  mlr = min(l,r)=0  M=(L+R)/2=4

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-79
SLIDE 79

Example

Consider T = acacag$

Pattern Q = acag

L=1, l=0

R=7, r=0

mlr = min(l,r)=0

M=(L+R)/2=4, m=1

The (m+1) char of suffix-SA[M] is g.

The (m+1) char of Q is c.

So, suffix-SA[M] > Q.

Set R=M=4 and r=m=1.

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-80
SLIDE 80

Example

Consider T = acacag$

Pattern Q = acag

L=1, l=0

R=4, r=1

mlr = min(l,r)=0

M=(L+R)/2=2, m=3

The (m+1) char of suffix-SA[M] is c.

The (m+1) char of Q is g.

So, suffix-SA[M] < Q.

Set L=M=2 and l=m=3.

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-81
SLIDE 81

Example

 Consider T = acacag$  Pattern Q = acag  L=2, l=3  R=4, r=1  mlr = min(l,r)=1  M=(L+R)/2=3, m=4  The pattern Q is found at SA[M]=3.

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-82
SLIDE 82

Time analysis

 Binary search will perform log n

comparisons

 Each comparison takes at most O(m)

time

 In the worst case, O(m log n) time.  Myers and Manber report that, in

practice, the time is O(m + log n).

slide-83
SLIDE 83

Suffix array and suffix tree

 We show one example of replacing

suffix tree by suffix array

 Note that most applications related to

suffix tree can be solved using suffix array with some time blow up!

 When space is limited, replacing suffix

tree by suffix array is a good choice.

slide-84
SLIDE 84

The size is still too big!

 Why?  DNA sequences can be very long!

E.g. Fly: ~100M bases, Human: ~3G bases, Tree: ~9G bases

 Storage to store indexing data structure for

human genome Suffix Tree: ~40G bytes Suffix Array: ~13G bytes

 Can we further reduce the space?

slide-85
SLIDE 85

Solution

 Grossi, Vitter (STOC2000)

 Compressed suffix array (CSA)

 Ferragine, Manzini (FOCS2000)

 FM-index

 Both of them can be stored in O(n) bit space  For Human Genome

 Both CSA and FM-index can be stored within 2G

bytes.

slide-86
SLIDE 86

FM-index

Consider a text T=acacag$

FM-index stores:

A.

The string BW=g$ccaaa

B.

C[x] = total no. of

  • ccurrences of each symbol

less than x. E.g. C[a]=1, C[c]=4, C[g]=6, C[t]=7

C.

A data-structure occ(x, i) which tells us the number of

  • ccurrences of x in BW[1..i]

using O(1) time.

SA[i] Suffix T[SA[i]-1] 7 $ g 1 acacag$ $ 3 acag$ c 5 ag$ c 2 cacag$ a 4 cag$ a 6 g$ a

slide-87
SLIDE 87

Data-structure for answering the

  • cc(x, i) query?

Given the text BW[1..n], we divide BW[1..n] into buckets of size log2n.

For each bucket i = 1, …, n/log2n, we store

P[i] = number of x’s in BW[1.. ilog2n].

Each bucket is further subdivided into log n sub-buckets of size log n.

For each sub-bucket j of the bucket i, we store

Q[i][j] = number of x’s in the first j sub-buckets.

……… BW

log2 n

………

log n

2

n/log2 n

3 4 1 1 2 3

log n

00x0xxx0x0

slide-88
SLIDE 88

Data-structure for answering the

  • cc(x, i) query?

 We also need a lookup table rank(b,k)

 b is any string of length (log n)/2  1 ≤ k ≤ (log n)/2

 rank(b,k) = number of x in the first k characters of b.

00…0 xx…x 1 2 3 ………………(log n)/2 x0x..0 … … 1 1 2 ………………………

bits ) ( ) log log log ( space Total bits ) log (log s entry take Each 2 log 2 log 2 entries

  • f

Number

2 log

n

  • n

n n O n O n n n

n

= = = =

slide-89
SLIDE 89

Space complexity of the occ() data-structure

 P[1..n/log2n] uses O(n/log n) bits  Q[1..n/log2n][1..log n] uses O(n log log

n / log n) bits

 rank(b,k) uses 2logn/2 (log n/2) = o(n)

bits

 In total, we use O(n log log n / log n)

bits.

slide-90
SLIDE 90

How to compute occ(x,i)?

Suppose log n = 10.

To compute occ(x, 327),

The result is P[3]+Q[4][2]+rank(00x0x,5)+rank(xx0x0,2)

Hence, O(1) time to compute occ(x,i).

……… BW

log2 n

………

log n

2

n/log2 n

3 4 1 1 2 3

log n

00x0xxx0x0

slide-91
SLIDE 91

Size of FM-index

 Structure A can be store in 2n bits  Structure B can be store in O(log n) bits  Structure C can be store in

O(n log log n/log n) bits

 In total, the size of FM-index is

O(n) bits.

slide-92
SLIDE 92

Observation

 C[x]+occ(x,i) is the number of

suffixes smaller than of xTSA[i].

 Example:

 C[c]=4  occ(c, 6)=2  Number of suffixes smaller than

cT[SA[6]..n]=ccag$ is 6.

SA[i] Suffix T[SA[i]-1] 7 $ g 1 acacag$ $ 3 acag$ c 5 ag$ c 2 cacag$ a 4 cag$ a 6 g$ a

slide-93
SLIDE 93

Lemma

 Suppose range(T,Q) is [st..ed].  Then, range(T,xQ) = [p..q] where

 p = C[x] + occ(x,st-1) + 1  q = C[x] + occ(x,ed)

 Proof:

 p = 1+number of suffixes strictly smaller than xQ.

The latter term = number of suffixes smaller than

  • r equal to xTSA[st-1].

 q = number of suffixes smaller than or equal to

xTSA[ed].

slide-94
SLIDE 94

Backward search

Given the text T and the FM-index, we want to determine if Q exists in T.

Algorithm BW_exist(Q[1..m])

1.

x=Q[m], i=m-1;

2.

/* find range(T,Q[m]) */ st = C[x]+1, ed = C[x+1];

3.

while (st≤ed and i>1) {

/* find range(T, Q[i-1..m]) */

x = Q[i-1];

st = C[x] + occ(x, st-1) + 1;

ed = C[x] + occ(x, ed);

i = i – 1; }

4.

if st > ed, then pattern not found else pattern found.

slide-95
SLIDE 95

Example

 T = acacag$  Q = aca  Q[3..3]=a  sp=C[a] +1

=1+1=2

 ep=C[c]

=4

SA[i] Suffix T[SA[i]-1] 7 $ g 1 acacag$ $ 3 acag$ c 5 ag$ c 2 cacag$ a 4 cag$ a 6 g$ a

g $ c c a a a

slide-96
SLIDE 96

Example

 T = acacag$  Q = aca  Q[2..3]=ca  st=C[c]+occ(c,stold-1)+1

=4+0+1=5

 ed=C[c]+occ(c,edold)

=4+2=6

SA[i] Suffix T[SA[i]-1] 7 $ g 1 acacag$ $ 3 acag$ c 5 ag$ c 2 cacag$ a 4 cag$ a 6 g$ a

g $ c c a a a

slide-97
SLIDE 97

Example

 T = acacag$  Q = aca  Q[1..3]=aca  st=C[a]+occ(a,stold-1)+1

=1+0+1=2

 ed=C[a]+occ(a,edold)

=1+2=3

SA[i] Suffix T[SA[i]-1] 7 $ g 1 acacag$ $ 3 acag$ c 5 ag$ c 2 cacag$ a 4 cag$ a 6 g$ a

g $ c c a a a

slide-98
SLIDE 98

Example

 T = acacag$  Q = aca  Q[1..3]=aca  st=C[a]+occ(a,stold-1)+1

=1+0+1=2

 ed=C[a]+occ(a,edold)

=1+2=3

SA[i] Suffix T[SA[i]-1] 7 $ g 1 acacag$ $ 3 acag$ c 5 ag$ c 2 cacag$ a 4 cag$ a 6 g$ a

g $ c c a a a

Q occurrences in T!

slide-99
SLIDE 99

Time complexity of Backward Search

 To find a pattern Q[1..m]

 Step 1, 2, and 4 can be computed in O(1)

time.

 For step 3,

 We need to iterate the loop for m-1 times.  Each iteration of the loop can be computed in

O(1) time.

 The loop takes O(m) time.

 In total, O(m) time for backward search.

slide-100
SLIDE 100

Conclusion

 Suffix tree is a powerful data-structure which

has a lot of applications in Computational Biology.

 Problems:

 Suffix tree is too big!

 Can be solved using CSA and FM-index

 Suffix tree can only solve exact match problem!

(Most of the biology problems are approximate match!)

 Many works have been done on this area! But still not

  • pratical. One of the important area to explore!
slide-101
SLIDE 101

1-mismatch problem

slide-102
SLIDE 102

1-mismatch problem

 Index: the suffix tree of a text T[1..n]  For any pattern P[1..m], the 1-mismatch

problem finds all occurrences of P in T that has hamming distance at most 1.

 Example:

P = ACGT T = AACGTGGCCAACTTGGA

slide-103
SLIDE 103

Naïve solution

 Index:

 create the suffix tree for T

 Algorithm for query:

 Generate all possible 1-mismatch patterns of P.  Find occurrences of every 1-mismatch pattern

 Running time:

 There are |Σ|m possible 1-mismatch patterns.  Using suffix tree, it takes O(m) time to find

  • ccurrences of each 1-mismatch pattern

 In total, O(m2+occ) time where occ is total

number of occurrences.

slide-104
SLIDE 104

Any other solutions?

Cobbs [CPM 1995]

O(n log n) bit index, O(m2+occ) query time.

Amir et al [Journal of Algorithm 2000]

O(n log3 n) bit index, O(m log n loglog n + occ) query time.

Buchsbaum et al [ESA 2000]

O(n log2 n) bit index, O(m log log n + occ) query time.

Cole et al [SODA 2004]

O(n log2 n) bit index, O(m + log log n + occ) query time.

Trinh et al [CPM 2004]

O(n log n) bit index, O(m log n + occ) query time.

Lam et al [ISAAC 2005]

O(n sqrt(log n)) bit index, O(m log log n + occ) query time.

Chan et al [ESA 2006]

O(n log n) bit index, O(m + log log n + occ) query time.

Today, we have a look of the solution of Trinh et al.

slide-105
SLIDE 105

Index

 Suffix array of T

 SA[1..n]

 Inversed suffix array of T

 SA-1[1..n] where SA[SA-1[i]] = i.

 Definition:

 We called range(T,P)=[st..ed] if

 P is a prefix of every Tj for j=SA[st], SA[st+1], …, SA[ed]  where Tj = j suffix of T = T[j..n].

 Example: For T= acacag$, range(T,ca)=[5..6]

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-106
SLIDE 106

Lemma 1 (Forward search)

 Assume [st..ed]=range(T,P).  We can compute [st’..ed’]=range(T,Pc)

in O(log n) time.

 Proof:  binary search on SA.

slide-107
SLIDE 107

Lemma 2

Assume [st1..ed1]=range(T,P1) and [st2..ed2]=range(T,P2).

We can compute [st..ed]=range(T,P1P2) in O(log n) time.

Proof:

Let the length of P1 be k.

Note that TSA[st1],TSA[st1+1],…,TSA[ed1] are lexicographically increasing.

Hence, TSA[st1]+k,TSA[st1+1]+k,…,TSA[ed1]+k are lexicographically increasing.

Thus, SA-1[SA[st1]+k] < SA-1[SA[st1+1]+k] <…< SA-1[SA[ed1]+k].

To find st and ed, we need to find

 the smallest st such that st2 < SA-1[SA[st]+k]<ed2 and  the largest ed such that st2 < SA-1[SA[ed]+k] < ed2.

This can be done by binary search.

slide-108
SLIDE 108

Example

 T=acacag$  P1=a, P2=ca.  range(T,P1)=[2..4], range(T,P2)=[5..6]  To find range(T,P1P2), we do the following:

 Note that TSA[2]<TSA[3]<TSA[4] are beginning with a.  So, TSA[2]+1<TSA[3]+1<TSA[4]+1.  Note that

 SA-1[SA[2]+1]=5, SA-1[SA[3]+1]=6, SA-1[SA[4]+1]=7.

 Hence, TSA[5]<TSA[6]<TSA[7].  Among the three suffixes, we need to identify suffix beginning

with P2.

 Since range(T,P2)=[5..6], both TSA[5] and TSA[6] contain P2.  As SA-1[SA[2]+1]=5 and SA-1[SA[3]+1]=6,

 we have range(T,P1P2) = [2..3].

SA[i] Suffix 7 $ 1 acacag$ 3 acag$ 5 ag$ 2 cacag$ 4 cag$ 6 g$ i 1 2 3 4 5 6 7

slide-109
SLIDE 109

Lemma 3 (Backward search)

 Assume [st..ed]=range(T,P).  We can compute [st’..ed’]=range(T,cP)

in O(log n) time.

 Proof:  Let P1=c, P2=P. By Lemma 2, range(T,P1P2)

can be computed in O(log n) time.

slide-110
SLIDE 110

Algorithm

1.

For j=m,m-1,…,1,

By backward search, find range(T,P[j..m]).

2.

For j=1,2,…,m,

By forward search, find range(T,P[1..j]).

3.

Report all occurrences of range(T,P[1..m])

4.

For j=1,2,…,m,

Let P1=P[1..j-1], P2=P[j+1..m]

For every character c≠P[j],

By forward search, find range(T,P1c)

By Lemma 2, find range(T,P1cP2)

Report all occurrences of range(T,P1cP2)

slide-111
SLIDE 111

Time analysis

 Ignoring the O(occ) reporting time!  Steps 1 and 2 take O(m log n) time.  Step 4 tries all possible mismatches.

 There are in total |Σ|m mismatches.  For each mismatch, it take O(log n) time.  So, Step 4 takes O(|Σ|m log n) time.

 In total, the running time is

O(|Σ|m log n + occ).

 Note that this solution can be generalized to handle

k-mismatch or k-difference.