On-line Construction of Compact Suffix Vectors and Maximal Repeats
Élise Prieur and Thierry Lecroq
elise.prieur@univ-rouen.fr
Laboratoire d'Informatique de Traitement de l'Information et des Systèmes.
Journées Montoises August
Introduction Suffix Vectors Computing maximal repeats Conclusion
Plan
1 Introduction
2 Suffix Vectors
3 Computing maximal repeats
4 Conclusion
Élise Prieur Compact Suffix Vectors 2/24
1 Introduction
   Motivation
   Suffix trees
   Ukkonen’s algorithm
2 Suffix Vectors
   Introduction
   Compact Suffix Vectors
   On-line construction of a compact suffix vector
3 Computing maximal repeats
4 Conclusion
Motivation
Detecting repeats in long biological sequences.
This calls for an adapted index structure.
Notations
- y is a sequence of length n on the alphabet A.
- $ is a terminator symbol.
Suffix tree
- index structure;
- all substrings represented;
- edges labeled (begin position, length);
- leaves represent suffixes.
[Figure: suffix tree of tata$. The edges (0,2) ta, (1,1) a and (4,1) $ leave the root; the edges (2,3) ta$ and (4,1) $ leave both the node ta and the node a.]
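The edge labels of the figure can be reproduced with a short sketch. It inserts the suffixes one by one (a naive quadratic method, not Ukkonen's algorithm), and the names `Node`, `build_suffix_tree` and `list_edges` are illustrative choices, not taken from the slides. Leaves are numbered from 0 by the start of their suffix, which may differ from the numbering in the figure.

```python
class Node:
    def __init__(self):
        self.children = {}   # first letter -> (begin, length, child node)
        self.suffix = None   # start position of the suffix, for leaves only


def build_suffix_tree(y):
    """Naive suffix tree of y; y must end with a unique terminator such as '$'."""
    n = len(y)
    root = Node()
    for s in range(n):                      # insert the suffix y[s..n-1]
        node, pos = root, s
        while True:
            c = y[pos]
            if c not in node.children:      # no edge starts with c: new leaf edge
                leaf = Node()
                leaf.suffix = s
                node.children[c] = (pos, n - pos, leaf)
                break
            begin, length, child = node.children[c]
            k = 0                           # number of letters matched on this edge
            while k < length and y[begin + k] == y[pos + k]:
                k += 1
            if k == length:                 # whole edge matched: go down
                node, pos = child, pos + length
                continue
            mid = Node()                    # mismatch inside the edge: split it
            mid.children[y[begin + k]] = (begin + k, length - k, child)
            leaf = Node()
            leaf.suffix = s
            mid.children[y[pos + k]] = (pos + k, n - pos - k, leaf)
            node.children[c] = (begin, k, mid)
            break
    return root


def list_edges(node, y, depth=0):
    """Return (tree depth, (begin, length), label) for every edge, in letter order."""
    out = []
    for c in sorted(node.children):
        begin, length, child = node.children[c]
        out.append((depth, (begin, length), y[begin:begin + length]))
        out.extend(list_edges(child, y, depth + 1))
    return out


for depth, edge, label in list_edges(build_suffix_tree("tata$"), "tata$"):
    print("  " * depth, edge, label)
```

Storing (begin position, length) instead of the label itself keeps each edge at constant size, which is the convention used in the figure.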
Ukkonen’s algorithm
- On-line algorithm.
- Construction split into n phases, each of which is split into extensions.
- During phase i, the implicit tree of y[0..i] is built from the one of y[0..i − 1].
- During extension j of phase i, the suffix y[j + 1..i] is added to the tree.
- The previously added substring is w = y[j + 1..i − 1]; extending it by y[i] gives the new suffix.
The 3 rules
Ukkonen’s algorithm is based on 3 rules expressed by Gusfield [1]. In extension j of phase i, let w = y[j + 1..i − 1] be the substring already in the tree:
Rule 1: w ends at a leaf; y[i] is appended to the label of the leaf edge, which then spells y[j + 1..i].
Rule 2: w is followed only by letters x different from y[i]; a new leaf edge labeled y[i] is created after w.
Rule 3: w is already followed by y[i] (some path y[i]x continues w); nothing is done.
[1] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
Some properties
- leaves are added in increasing order;
- rule 1 does not need any treatment;
- phase i begins at extension jℓ + 1, where jℓ is the number of the last created leaf;
- phase i ends at the first extension j > jℓ such that rule 3 is applied.
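The rules and the properties above can be observed on a small example. The sketch below is not Ukkonen's algorithm: it uses an uncompressed trie of nested dictionaries, runs every extension of every phase, and only records which rule would apply. The function name `classify_extensions` and the 0-based convention (extension j inserts y[j..i]) are assumptions made for the illustration.

```python
def classify_extensions(y):
    """Return a list of (phase i, extension j, rule) using a naive trie."""
    root = {}
    log = []
    for i in range(len(y)):                # phase i: add the letter y[i]
        for j in range(i + 1):             # extension j: insert y[j..i]
            node = root
            for c in y[j:i]:               # walk w = y[j..i-1], already in the trie
                node = node[c]
            if y[i] in node:
                rule = 3                   # w is already followed by y[i]
            elif not node and j < i:
                rule = 1                   # w ends at a leaf: the leaf is extended
            else:
                rule = 2                   # w continues otherwise: a new leaf appears
            node.setdefault(y[i], {})
            log.append((i, j, rule))
    return log


for i, j, rule in classify_extensions("tata$"):
    print(f"phase {i}, extension {j}: rule {rule}")
```

In each phase the recorded rules appear in the order 1, then 2, then 3, and rule 2 is applied for increasing values of j across phases. This is what lets the real algorithm skip the rule 1 extensions and stop a phase at the first rule 3.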
2 Suffix Vectors
Introduction to suffix vectors
[Figure: suffix tree of tata$ and the corresponding suffix vector, laid out along t a t a $ with its root box.]
Introduction to suffix vectors
[Figure: suffix tree of aatttatttatta$, edges labeled (begin position, length), with the corresponding suffix vector.]
Suffix vector of y = aatttatttatta$ (each line reads depth|natural edge|edge list):
Root: (0, 1) − (2, 1) − (13, 1)
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13
y: a a t t t a t t t a t t a $
Lines: 1|2|(5, 1) 7|6|(12, 2) 6|6|(12, 2) 5|6|(12, 2) 4|6|(12, 2) 3|4|(12, 2) 2|4|(5, 1) 3|2|(13, 1) 2|2|(13, 1) 1|13|(2, 2) − (13, 1)
Alternative data structure to suffix trees:
- same information in reduced space;
- introduced by K. Monostori in 2001.
Definition
A suffix vector is a succession of boxes whose lines contain: the depth of the node; the natural edge; the edge list. The root is a special box.
Notations
- Bj: box at position j in y.
- The natural edge of a line in Bj is the end position of the edge beginning with y[j + 1].
Example
Is tatt a substring of y? The root contains the edge (2, 1), beginning with t, which leads to B2. The edge (5, 1), beginning with a, leads to B5. The natural edge there begins with tt, so tatt is a substring of y.
Compacting a vector
Definition A group of nodes is a set of nodes that are in the same box and have exactly the same edges.
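As an illustration of this definition, the sketch below groups the lines of a box that carry the same edges. The tuple layout (depth, natural edge, edge list) and the name `groups_of_nodes` are assumptions. The line values are copied from the example vector of aatttatttatta$, and their assignment to boxes B7 and B3 is inferred from the maximal-repeat example later in the talk.

```python
from itertools import groupby

# Assumed layout of a line: (depth, natural edge, edge list).
b7_lines = [
    (7, 6, ((12, 2),)),
    (6, 6, ((12, 2),)),
    (5, 6, ((12, 2),)),
    (4, 6, ((12, 2),)),
]
b3_lines = [
    (3, 4, ((12, 2),)),
    (2, 4, ((5, 1),)),
]


def groups_of_nodes(lines):
    """Group consecutive lines (by decreasing depth) that have exactly the same edges."""
    result = []
    for edges, members in groupby(lines, key=lambda line: (line[1], line[2])):
        members = list(members)
        result.append({"deepest": members[0][0], "size": len(members), "edges": edges})
    return result


print(groups_of_nodes(b7_lines))   # one group of 4 nodes, deepest node at depth 7
print(groups_of_nodes(b3_lines))   # two groups of 1 node each
```

A group can then be stored as its deepest line plus the number of nodes it contains, which is the idea behind compaction.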
Compact suffix vectors
3 rules of compaction of a box:
Rule A: the node with depth d − 2 has the same edges as the node with depth d − 1;
Rule B: the node with depth d − 1 has the same edges as the node with depth d, plus some extra edges;
Rule C: the node with depth d − 3 has different edges from the node with depth d − 2.
[Figure: nodes of depths d − 3 to d in a box, annotated with Rules A, B and C.]
Compacting V(aatttatttatta$)
Extended vector:
Root: (0, 1) − (2, 1) − (13, 1)
Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13
y: a a t t t a t t t a t t a $
Lines: 1|2|(5, 1) 7|6|(12, 2) 6|6|(12, 2) 5|6|(12, 2) 4|6|(12, 2) 3|4|(12, 2) 2|4|(5, 1) 3|2|(13, 1) 2|2|(13, 1) 1|13|(2, 2) − (13, 1)
⇒
Compact vector (same root, positions and y; the number after a line counts the nodes of its group):
Lines: 1|2|(5, 1) 7|6|(12, 2) 4 3|4|(12, 2) 2|4|(5, 1) 3|2|(13, 1) 2 1|13|(2, 2) − (13, 1)
y → extended vector (Monostori, O(n)) → compact vector (Monostori, O(n))
On-line construction of a compact vector
y → extended vector → compact vector: Monostori, O(n) for each step.
y → compact vector directly: Prieur, Lecroq, O(n).
Faster and more space-economical construction.
On-line construction of a compact vector
Proposition When an edge is added to the node w of depth d in a box Bp, this edge will also be added to all the nodes of Bp with depth smaller than d that belong to the group of nodes of w.
[Figure: illustration of the proposition on y, with the node w, its shorter suffixes v and the added letter a.]
On-line construction of a compact vector
Skip k − 1 extensions, where k is the number of nodes in the group into which the edge is added.
3 Computing maximal repeats
Definition A maximal repeat in a string is a substring u with at least 2 occurrences a1ub1 and a2ub2 such that a1 ≠ a2, b1 ≠ b2 and a1, a2, b1, b2 ∈ A.
Example y = aatttatttatta$: tta is a maximal repeat, with occurrences ending at positions 5 and 12.
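The definition can be checked directly on small strings. The following brute-force sketch enumerates every substring and looks for a pair of occurrences whose left letters differ and whose right letters differ. It is far from linear and is only meant to make the definition concrete. The name `maximal_repeats` and the treatment of the string boundaries as unique markers are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def maximal_repeats(y):
    """Return the set of maximal repeats of y by brute force (small inputs only)."""
    n = len(y)
    occurrences = defaultdict(list)        # substring -> list of start positions
    for start in range(n):
        for end in range(start + 1, n + 1):
            occurrences[y[start:end]].append(start)

    def left(start):
        # Letter before the occurrence, or a marker unique to the string start.
        return y[start - 1] if start > 0 else ("start", start)

    def right(start, length):
        # Letter after the occurrence, or a marker unique to the string end.
        end = start + length
        return y[end] if end < n else ("end", end)

    result = set()
    for u, starts in occurrences.items():
        for s1, s2 in combinations(starts, 2):
            if left(s1) != left(s2) and right(s1, len(u)) != right(s2, len(u)):
                result.add(u)
                break
    return result


print(sorted(maximal_repeats("aatttatttatta$"), key=len))
```

For tta, the occurrence ending at position 5 is preceded and followed by t, while the one ending at position 12 is preceded by a and followed by $, so the pair satisfies the definition.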
Applying to suffix vectors
Proposition The deepest node of each group of nodes represents a maximal repeat.
[Figure: compact suffix vector of aatttatttatta$.]
Example Boxes 0, 2, 5 and 7 are reduced: a, t, tta and atttatt are maximal repeats. Box B3 is extended and its 2 lines have different edges: att and tt are maximal repeats.
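The strings of this example can be recovered from the box positions and depths alone. The sketch below assumes that a node of depth d in box Bp spells the substring of length d ending at position p. This reading is inferred from the example rather than stated on the slides, and the names `deepest_nodes` and `node_string` are illustrative.

```python
y = "aatttatttatta$"

# (box position p, depth d) of the deepest node of each group in the example.
deepest_nodes = [(0, 1), (2, 1), (3, 3), (3, 2), (5, 3), (7, 7)]


def node_string(y, p, d):
    """Substring of length d ending at position p (assumed meaning of a node in Bp)."""
    return y[p - d + 1:p + 1]


repeats = [node_string(y, p, d) for p, d in deepest_nodes]
print(repeats)
```

Under this assumption, listing the maximal repeats only requires one pass over the groups of the compact vector.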
4 Conclusion
Faster and more space-economical on-line construction of the compact suffix vector.
Linear method to compute maximal repeats with a compact suffix vector.