SLIDE 1

On-line Construction of Compact Suffix Vectors and Maximal Repeats

Élise Prieur and Thierry Lecroq

elise.prieur@univ-rouen.fr

Laboratoire d’Informatique de Traitement de l’Information et des Systèmes.

Journées Montoises, August 30th, 2006, Rennes

SLIDE 2

Introduction Suffix Vectors Computing maximal repeats Conclusion

Plan

1 Introduction
2 Suffix Vectors
3 Computing maximal repeats
4 Conclusion

Élise Prieur Compact Suffix Vectors 2/24

SLIDE 3

1 Introduction
  • Motivation
  • Suffix trees
  • Ukkonen’s algorithm
2 Suffix Vectors
  • Introduction
  • Compact Suffix Vectors
  • On-line construction of a compact suffix vector
3 Computing maximal repeats
4 Conclusion

SLIDE 4

Motivation

Detecting repeats in long biological sequences. This calls for a suitable index structure.

SLIDE 5

Notations

  • y is a sequence of length n over the alphabet A.
  • $ is a terminator symbol.

Suffix tree

  • index structure;
  • all substrings are represented;
  • edges are labeled (begin position, length);
  • leaves represent suffixes.

[Figure: suffix tree of tata$, with edges labeled (begin position, length)]
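To make the (begin position, length) labelling concrete, here is a minimal Python sketch that builds the suffix tree of tata$ by inserting the suffixes one at a time. It is a quadratic-time illustration, not Ukkonen's algorithm; the names `Node`, `build_suffix_tree` and `edge_labels` are invented for this example, and the code assumes that y ends with a terminator that occurs nowhere else.

```python
class Node:
    """Suffix tree node; children maps a first letter to (begin, length, child)."""

    def __init__(self, suffix=None):
        self.children = {}
        self.suffix = suffix  # start position of the suffix, for leaves only


def build_suffix_tree(y):
    """Naive construction: insert y[i:] for i = 0, 1, ..., n - 1."""
    n = len(y)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            c = y[pos]
            if c not in node.children:
                # No edge starts with c: hang a new leaf here.
                node.children[c] = (pos, n - pos, Node(suffix=i))
                break
            begin, length, child = node.children[c]
            k = 0
            while k < length and y[begin + k] == y[pos + k]:
                k += 1
            if k == length:
                # Whole edge matched: continue from the child.
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it and add a leaf.
            mid = Node()
            node.children[c] = (begin, k, mid)
            mid.children[y[begin + k]] = (begin + k, length - k, child)
            mid.children[y[pos + k]] = (pos + k, n - pos - k, Node(suffix=i))
            break
    return root


def edge_labels(node):
    """All (begin, length) edge labels of the subtree rooted at node."""
    labels = []
    for begin, length, child in node.children.values():
        labels.append((begin, length))
        labels.extend(edge_labels(child))
    return labels


print(sorted(edge_labels(build_suffix_tree("tata$"))))
```

The seven labels obtained for tata$ are (0,2), (1,1), (2,3) twice and (4,1) three times, the same labels that appear on this slide.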

SLIDE 6

Ukkonen’s algorithm

  • On-line algorithm.
  • The construction is split into n phases, each of which is split into extensions.
  • During phase i, the implicit tree of y[0..i] is built from that of y[0..i − 1].
  • During extension j of phase i, the suffix y[j + 1..i] is added to the tree.
  • The substring added previously is w = y[j + 1..i − 1].
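The phase/extension structure can be sketched on an uncompacted suffix trie. This is only an illustration of the control flow, under the assumption that nested dictionaries stand in for the tree; it takes cubic time in the worst case and is not Ukkonen's linear algorithm. The variable `start` plays the role of j + 1 in the slide's notation, and the function names are invented for the example.

```python
def online_suffix_trie(y):
    """Build a suffix trie of y phase by phase, extension by extension."""
    root = {}
    for i in range(len(y)):              # phase i: go from y[0..i-1] to y[0..i]
        for start in range(i + 1):       # extension: add the suffix y[start..i]
            node = root
            for c in y[start:i]:         # walk w = y[start..i-1], already present
                node = node[c]
            node.setdefault(y[i], {})    # extend w with y[i] if it is not there yet
    return root


def count_nodes(node):
    """Number of non-root nodes, i.e. number of distinct non-empty substrings."""
    return sum(1 + count_nodes(child) for child in node.values())


def contains(node, x):
    """True if x labels a path from the root."""
    for c in x:
        if c not in node:
            return False
        node = node[c]
    return True


trie = online_suffix_trie("tata$")
print(count_nodes(trie), contains(trie, "ata"), contains(trie, "tt"))
```

After phase i the trie holds every substring of y[0..i], which is why the walk along w never fails.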

SLIDE 7

The 3 rules

Ukkonen’s algorithm is based on 3 rules expressed by Gusfield¹.

Rule 1

[Figure: Rule 1, before the extension: the path w = y[j + 1..i − 1]]

¹ Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997

SLIDE 8

The 3 rules

Rule 1

[Figure: Rule 1, after the extension: the path w y[i] = y[j + 1..i]]

SLIDE 9

The 3 rules

Rule 2

[Figure: Rule 2, before the extension: the path w continues with x]

SLIDE 10

The 3 rules

Rule 2

[Figure: Rule 2, after the extension: a new edge labeled y[i] branches off after w, beside x]

SLIDE 11

The 3 rules

Rule 3

[Figure: Rule 3: the path w already continues with y[i]x]
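The three rules can be told apart by looking at which letters follow w in the text read so far. The sketch below decides the rule by brute force on the string itself rather than on a tree, so it is a check of the case analysis and not part of the construction. The function name is invented, and `start` corresponds to j + 1 in the slide's notation.

```python
def rule_for_extension(y, i, start):
    """Rule (1, 2 or 3) that applies when y[start..i] is added in phase i."""
    t = y[:i]            # text already indexed: y[0..i-1]
    w = y[start:i]       # w = y[start..i-1]
    c = y[i]             # letter appended during phase i
    # Letters that follow an occurrence of w inside t.
    followers = {t[k + len(w)] for k in range(len(t) - len(w)) if t.startswith(w, k)}
    if w and not followers:
        return 1         # w ends at a leaf: the leaf edge just grows
    if c in followers:
        return 3         # w c is already in the tree: nothing to do
    return 2             # w continues, but not with c: a new leaf is created


y = "tata$"
for i in range(len(y)):
    print(i, [rule_for_extension(y, i, start) for start in range(i + 1)])
```

For tata$, the phase that adds the second t uses rules 1, 1, 3 and the phase that adds $ uses rules 1, 1, 2, 2, 2.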

SLIDE 12

Some properties

  • leaves are added in increasing order;
  • rule 1 does not need any treatment;
  • phase i begins at extension jℓ + 1, where jℓ is the number of the last created leaf;
  • phase i ends at the first extension j > jℓ such that rule 3 is applied.
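A small simulation shows how much work these properties save. The sketch below keeps only the extensions that are actually performed: each phase starts after the last created leaf and stops at the first rule 3. The rule is again decided by brute force on the string, the function names are invented, and suffixes are numbered from 0 by their start position (an assumption that shifts the slide's j by one).

```python
def rule(y, i, start):
    """Rule (1, 2 or 3) for adding y[start..i], decided on t = y[0..i-1]."""
    t, w, c = y[:i], y[start:i], y[i]
    followers = {t[k + len(w)] for k in range(len(t) - len(w)) if t.startswith(w, k)}
    if w and not followers:
        return 1
    return 3 if c in followers else 2


def explicit_extensions(y):
    """For each phase, the list of (start, rule) pairs that are really executed."""
    phases = []
    leaves = 0                       # suffixes 0..leaves-1 are already leaves
    for i in range(len(y)):
        done = []
        start = leaves               # earlier extensions are rule 1: skipped
        while start <= i:
            r = rule(y, i, start)
            done.append((start, r))
            if r == 3:               # rule 3 ends the phase
                break
            start += 1               # rule 2 created the leaf of suffix `start`
        leaves = start
        phases.append(done)
    return phases


phases = explicit_extensions("tata$")
print(phases)
print(sum(len(p) for p in phases), "explicit extensions instead of", 5 * 6 // 2)
```

On tata$ only 7 of the 15 extensions remain, and none of them is a rule 1 extension.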

SLIDE 13

2 Suffix Vectors
  • Introduction
  • Compact Suffix Vectors
  • On-line construction of a compact suffix vector

SLIDE 14

Introduction to suffix vectors

[Figure: suffix tree of tata$ and the corresponding suffix vector, drawn as boxes under the letters t a t a $, with a separate root box]

SLIDE 15

Introduction to suffix vectors

[Figure: suffix tree of y = aatttatttatta$ and the corresponding suffix vector]

y = a a t t t a t t t a t t a $

Root edge list: (0, 1) − (2, 1) − (13, 1)

Box lines (depth|natural edge|edge list): 1|2|(5, 1); 7|6|(12, 2); 6|6|(12, 2); 5|6|(12, 2); 4|6|(12, 2); 3|4|(12, 2); 2|4|(5, 1); 3|2|(13, 1); 2|2|(13, 1); 1|13|(2, 2) − (13, 1)

SLIDE 16

Introduction to suffix vectors

  • Alternative data structure to suffix trees;
  • same information in reduced space;
  • introduced by K. Monostori in 2001.

[Figure: suffix vector of y = aatttatttatta$]

SLIDE 17

Introduction to suffix vectors

Definition

A succession of boxes whose lines contain:

  • the depth of the node;
  • the natural edge;
  • the edge list.

The root is a special box.

Notations

  • Bj: box at position j in y;
  • the natural edge of a line in Bj is the end position of the edge beginning with y[j + 1].

[Figure: suffix vector of y = aatttatttatta$]

SLIDE 18

Introduction to suffix vectors

Example

Is tatt a substring of y? The root contains the edge (2, 1), beginning with t, which leads to B2. From there, the edge (5, 1), beginning with a, leads to B5. The natural edge of B5 begins with tt, so tatt is a substring of y.

[Figure: suffix vector of y = aatttatttatta$, with the edges followed for tatt highlighted]

SLIDE 19

Compacting a vector

Definition

A group of nodes is a set of nodes which are in the same box and have exactly the same edges.
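The definition of a group can be sketched as follows. This is a hypothetical representation, not the authors' data structure: a box is a list of (depth, natural edge, edge list) lines ordered by decreasing depth, and each group is replaced by its deepest line plus the number of lines it stands for. The sample lines are taken from the vector of aatttatttatta$ shown in the slides; which box they belong to (B7 and B3) is an assumption inferred from the depths.

```python
from itertools import groupby


def compact_box(lines):
    """Collapse runs of lines with identical edges into (depth, natural, edges, size).

    lines: list of (depth, natural_edge, edge_list) tuples, deepest first;
    edge_list is a tuple of (begin, length) pairs.
    """
    compacted = []
    for (natural, edges), run in groupby(lines, key=lambda line: (line[1], line[2])):
        run = list(run)
        deepest_depth = run[0][0]          # lines are ordered deepest first
        compacted.append((deepest_depth, natural, edges, len(run)))
    return compacted


# Assumed content of box B7: four lines with the same edges.
box_b7 = [(7, 6, ((12, 2),)), (6, 6, ((12, 2),)), (5, 6, ((12, 2),)), (4, 6, ((12, 2),))]
# Assumed content of box B3: two lines with different edge lists.
box_b3 = [(3, 4, ((12, 2),)), (2, 4, ((5, 1),))]

print(compact_box(box_b7))
print(compact_box(box_b3))
```

The four lines of the first box collapse into a single group, while the second box keeps both lines because their edge lists differ. Only the identical-edges case of the definition is covered here.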

SLIDE 20

Compact suffix vectors

3 rules of compaction of a box:

  • Rule A: the node with depth d − 2 has the same edges as the node with depth d − 1;
  • Rule B: the node with depth d − 1 has the same edges as the node with depth d and some extra edges;
  • Rule C: the node with depth d − 3 has different edges from the node with depth d − 2.

[Figure: nodes of depths d − 3, d − 2, d − 1 and d in a box, annotated with Rule A, Rule B and Rule C]

SLIDE 21

Compacting V(aatttatttatta$)

[Figure: the extended vector of y = aatttatttatta$ ⟹ its compact vector]

Box lines before compaction: 1|2|(5, 1); 7|6|(12, 2); 6|6|(12, 2); 5|6|(12, 2); 4|6|(12, 2); 3|4|(12, 2); 2|4|(5, 1); 3|2|(13, 1); 2|2|(13, 1); 1|13|(2, 2) − (13, 1)

Box lines after compaction: 1|2|(5, 1); 7|6|(12, 2) followed by 4; 3|4|(12, 2); 2|4|(5, 1); 3|2|(13, 1) followed by 2; 1|13|(2, 2) − (13, 1)

SLIDE 22

y → extended vector: Monostori, O(n)
extended vector → compact vector: Monostori, O(n)

SLIDE 23

On-line construction of a compact vector

y → extended vector: Monostori, O(n)
extended vector → compact vector: Monostori, O(n)
y → compact vector, directly: Prieur, Lecroq, O(n)

Faster and more space-efficient construction.

SLIDE 24

On-line construction of a compact vector

Proposition

When an edge is added to the node w of depth d in a box Bp, this edge will be added to all the nodes in Bp of depth smaller than d in the group of nodes of w.

[Figure: illustration of the proposition on y, showing occurrences of v, w and the letter a around positions j, p + 1, p′ + 1 and i]

SLIDE 25

On-line construction of a compact vector

Skip k − 1 extensions, where k is the number of nodes in the group to which the edge is added.

SLIDE 26

3 Computing maximal repeats

SLIDE 27

Definition

A maximal repeat in a string is a substring u such that there exist at least 2 occurrences a1ub1 and a2ub2 with a1 ≠ a2, b1 ≠ b2 and a1, a2, b1, b2 ∈ A.

Example

y = aatttatttatta$: tta is a maximal repeat, with occurrences ending at positions 5 and 12.
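The definition can be checked directly with a brute-force search. The sketch below is only a reference implementation of the definition for short strings, not the linear method based on compact suffix vectors. The function name is invented, and the start of the string is treated as a context different from every letter, which is an assumption made for the occurrence at position 0.

```python
from collections import defaultdict
from itertools import combinations


def maximal_repeats(y):
    """Substrings having two occurrences with different left and right letters."""
    n = len(y)
    contexts = defaultdict(list)     # substring -> list of (left, right) letters
    for i in range(n):
        for j in range(i + 1, n + 1):
            left = y[i - 1] if i > 0 else None    # None marks the start of y
            right = y[j] if j < n else None       # None marks the end of y
            contexts[y[i:j]].append((left, right))
    repeats = set()
    for u, ctx in contexts.items():
        for (l1, r1), (l2, r2) in combinations(ctx, 2):
            if l1 != l2 and r1 != r2:
                repeats.add(u)
                break
    return repeats


print(sorted(maximal_repeats("aatttatttatta$"), key=len))
```

For y = aatttatttatta$ the search returns a, t, tt, att, tta and atttatt.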

SLIDE 28

Applying to suffix vectors

Proposition The deepest node of each group of nodes represents a maximal repeat.

SLIDE 29

[Figure: compact suffix vector of y = aatttatttatta$]

Example

Boxes 0, 2, 5 and 7 are reduced: a, t, tta, atttatt are maximal repeats. Box B3 is extended, the 2 lines have different edges: att, tt are maximal repeats.

SLIDE 30

4 Conclusion

SLIDE 31

Conclusion

More economical construction of the compact suffix vector. Linear method to compute maximal repeats with a compact suffix vector.
