SLIDE 1

Efficient Seeds Computation Revisited

Michalis Christou, Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, Tomasz Waleń

King’s College London & University of Warsaw

CPM Mondello, Palermo, June 29, 2011

SLIDE 5

Why quasiperiodicity?

Periodicity: a b a a a b a a a b a a a b a a a b

Quasiperiodicity: a b a a b a a b a a a b a a a b

SLIDE 8

Types of quasiperiodicity

a b a a b a a b a a a b a a

Cover: every letter of the string is covered by some occurrence of the cover.

SLIDE 10

Types of quasiperiodicity

a b a a b a a b a a a b a a a a b a

Seed: every letter of the string is covered by some occurrence of the seed; occurrences may be external, that is, they may extend beyond either end of the string.
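The two definitions can be checked by brute force. The sketch below is only an illustration, not the authors' method: the function names are made up here, and it assumes that a seed must itself be a factor (substring) of the string.

```python
def covered_positions(s: str, u: str, allow_overhang: bool) -> set:
    """Return the 0-based positions of u covered by occurrences of s.

    With allow_overhang=True an occurrence may stick out past either end
    of u, as long as the part overlapping u matches it.
    """
    n, k = len(u), len(s)
    # p is the 0-based position in u of s[0]; p < 0 overhangs the left end,
    # p + k > n overhangs the right end.
    starts = range(-(k - 1), n) if allow_overhang else range(0, n - k + 1)
    covered = set()
    for p in starts:
        lo, hi = max(p, 0), min(p + k, n)
        if all(s[q - p] == u[q] for q in range(lo, hi)):
            covered.update(range(lo, hi))
    return covered


def is_cover(s: str, u: str) -> bool:
    """True if occurrences of s lying inside u cover every letter of u."""
    return len(s) > 0 and covered_positions(s, u, False) == set(range(len(u)))


def is_seed(s: str, u: str) -> bool:
    """True if s is a factor of u and its (possibly external) occurrences cover u.

    Assumption: the factor requirement is part of the definition used here.
    """
    return (len(s) > 0 and s in u
            and covered_positions(s, u, True) == set(range(len(u))))


u = "abaabaabaaabaa"          # the 14-letter string shown on the slide
print(is_cover("abaa", u))    # abaa covers u with internal occurrences only
print(is_seed("aaba", u))     # aaba needs occurrences hanging over both ends
print(is_cover("aaba", u))    # so aaba is a seed but not a cover
```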

SLIDE 16

Main related problems

Problem: Cover computation. Find the shortest cover (all the covers) of a string u.
Solution: Apostolico, Farach, and Iliopoulos (1991), Moore and Smyth (1994), O(n) time algorithms.

Harder problem: Cover array. Compute C[1..n], where C[i] is the shortest cover of the string u[1..i].
Solution: Breslauer (1992), O(n) time algorithm.

Another problem: Seed computation. Find the shortest seed (all the seeds) of a string.
Solution: Iliopoulos, Moore & Park (1996), O(n log n) time algorithm.
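To make the cover array concrete, here is a minimal sketch of a linear-time computation based on the border array. It follows the commonly presented form of the online approach (the shortest cover of u[1..i] is either u[1..i] itself or the shortest cover of its longest border); the function names and the auxiliary array `reach` are choices made for this illustration, not taken from the slides.

```python
def border_array(u: str) -> list:
    """border[i] = length of the longest proper border of u[1..i] (1-based)."""
    n = len(u)
    border = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and u[i - 1] != u[k]:
            k = border[k]
        if u[i - 1] == u[k]:
            k += 1
        border[i] = k
    return border


def cover_array(u: str) -> list:
    """cover[i] = length of the shortest cover of u[1..i]; cover[0] is unused."""
    n = len(u)
    border = border_array(u)
    cover = [0] * (n + 1)
    # reach[c] = longest prefix length seen so far that u[1..c] covers.
    reach = [0] * (n + 1)
    for i in range(1, n + 1):
        c = cover[border[i]]          # candidate: shortest cover of the border
        if c > 0 and reach[c] >= i - c:
            cover[i] = c              # the candidate also covers u[1..i]
            reach[c] = i
        else:
            cover[i] = i              # u[1..i] is only covered by itself
            reach[i] = i
    return cover


print(cover_array("ababab")[1:])      # [1, 2, 3, 2, 3, 2]
```

Each position does a constant amount of work after the border array is built, so the whole computation is linear in the length of the string.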

SLIDE 17

Main contributions

1. Left seeds

We introduce a natural intermediate notion between seeds and covers and give O(n) time algorithms for computing the shortest left seed and the left seed array.

2. Seed array

We show how to compute the seed array in O(n^2) time.

3. New (simpler) seeds computation

We present a novel approach to seed computation. Our algorithm works in o(n log n) time for some cases.

SLIDE 20

Left/right seeds

Cover: a b a a b a a b a a a b a a

Seed: a b a a b a a b a a a b a a a a b a

Left seed: a b a a b a a b a a a b a a b a

A left seed is a prefix of the string; however, its occurrences may extend beyond the right end of the string.

SLIDE 22

Left/right seeds

Right seed: a a b a a b a a b a a a b a a

A right seed is a suffix of the string; however, its occurrences may extend beyond the left end of the string.

SLIDE 26

Left seeds computation

Problem: Left seed computation. Find the shortest left seed of a string u.

Harder problem: Left seed array. Compute LSeed[1..n], where LSeed[i] is the shortest left seed of the string u[1..i].

Solution: We present O(n) time algorithms solving both problems.

SLIDE 29

Left seeds computation

The period of a string

We say that a positive integer p is the (shortest) period of a string u = u[1..n] (notation: p = per(u)) if p is the smallest positive integer such that u[i] = u[i + p] for i = 1, ..., n − p.

Lemma. The length of the shortest left seed of u equals

min{C[j] : per(u) ≤ j ≤ |u|},

where C[1..n] is the cover array of u.

Corollary. The shortest left seed of a string can be computed in O(n) time.
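The lemma translates almost directly into code. The following is a sketch under the assumption that per(u) is obtained from the border array as |u| minus the longest border, and that the cover array is computed by the usual border-based linear scan; all function names are illustrative.

```python
def border_array(u: str) -> list:
    """border[i] = length of the longest proper border of u[1..i] (1-based)."""
    n = len(u)
    border = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and u[i - 1] != u[k]:
            k = border[k]
        if u[i - 1] == u[k]:
            k += 1
        border[i] = k
    return border


def cover_array(u: str, border: list) -> list:
    """cover[i] = length of the shortest cover of u[1..i]; cover[0] is unused."""
    n = len(u)
    cover = [0] * (n + 1)
    reach = [0] * (n + 1)   # reach[c]: longest prefix covered by u[1..c] so far
    for i in range(1, n + 1):
        c = cover[border[i]]
        if c > 0 and reach[c] >= i - c:
            cover[i], reach[c] = c, i
        else:
            cover[i], reach[i] = i, i
    return cover


def shortest_left_seed_length(u: str) -> int:
    """min{C[j] : per(u) <= j <= |u|}, as stated in the lemma."""
    n = len(u)
    if n == 0:
        return 0
    border = border_array(u)
    cover = cover_array(u, border)
    per = n - border[n]                 # per(u) + border(u) = |u|
    return min(cover[per:n + 1])


print(shortest_left_seed_length("abaabaabaaabaa"))  # 4
print(shortest_left_seed_length("abaab"))           # 3
```

Both arrays are built in linear time and the final minimum is a single scan, which matches the O(n) bound of the corollary.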

SLIDE 33

The proof

Proof. Assume that s is a left seed of u.

[Figure: the string u, positions 1 to n, covered by occurrences of s; position j is marked.]

Then s is a cover of u[1..j] for some j. The part of u after position j lies inside an occurrence of s that overhangs the right end, so it is a prefix of s and hence of u. The string u therefore has a border of length ≥ n − j, hence per(u) ≤ j. (Recall that per(u) + border(u) = |u|.)

SLIDE 34

The proof

Proof (cont.). We have proved that the shortest left seed of u corresponds to one of the covers C[j] for j ≥ per(u).

We need to show that each value C[j] for j ≥ per(u) corresponds to some left seed of u.

SLIDE 37

The proof

Proof (cont.). Assume that s is a cover of v = u[1..j] for some j ≥ per(u).

[Figure: the string u, positions 1 to n, with the prefix v = u[1..j], its cover s, and per(u) marked; copies of v are laid along u.]

Because j ≥ per(u), copies of v placed at positions 1, 1 + per(u), 1 + 2·per(u), ... agree with u and overlap or abut. Then v is a left seed of u. Hence, s is also a left seed of u.

SLIDE 41

Left seeds computation

Corollary 2. The left seed array can be computed as follows:

LSeed[i] = min{C[j] : P[i] ≤ j ≤ i},

where C[1..n] is the cover array of u and P[1..n] is the period array (recall that it can be computed in O(n) time).

The problem reduces to range minimum queries (RMQ) on the C array, which gives an O(n) time algorithm.
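The range minima in Corollary 2 have a convenient shape: the right end i grows by one each step, and the left end P[i] never decreases, because the longest border grows by at most one per added letter. Under that observation a monotone deque can stand in for a general RMQ structure. The sketch below swaps in this simpler technique for illustration; it is not claimed to be the construction used by the authors, and it takes the arrays C and P as given inputs (1-based, with an unused entry at index 0).

```python
from collections import deque


def left_seed_array(cover: list, period: list) -> list:
    """LSeed[i] = min{cover[j] : period[i] <= j <= i} for i = 1..n.

    Assumes period[i] <= i and that period is nondecreasing in i,
    which holds for the period array of a string.
    """
    n = len(cover) - 1
    lseed = [0] * (n + 1)
    window = deque()   # indices whose cover values increase front to back
    for i in range(1, n + 1):
        # Drop indices that can never be a minimum again.
        while window and cover[window[-1]] >= cover[i]:
            window.pop()
        window.append(i)
        # Drop indices that fell out of the window on the left.
        while window[0] < period[i]:
            window.popleft()
        lseed[i] = cover[window[0]]
    return lseed


# Arrays for u = "ababab" (index 0 unused).
C = [0, 1, 2, 3, 2, 3, 2]
P = [0, 1, 2, 2, 2, 2, 2]
print(left_seed_array(C, P)[1:])   # [1, 2, 2, 2, 2, 2]
```

Every index enters and leaves the deque at most once, so the total time is O(n).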

SLIDE 44

Seed array

Problem: Seed array. Compute Seed[1..n], where Seed[i] is the shortest seed of the string u[1..i].

A naive method yields O(n^2 log n) time. We present an O(n^2) time algorithm.

SLIDE 47

The solution

ALGORITHM SeedArray(u)

    Seed[1] := 1;
    for i := 2 to n do
        Seed[i] := Seed[i − 1];
        while u[1..i] does not have a seed of length Seed[i] do
            Seed[i] := Seed[i] + 1;
    return Seed[1..n];

We develop an O(n) time test, SeedsOfAGivenLength(u, k), which checks whether u has a seed of length k. (This test uses the suffix arrays of u.) Since the values Seed[i] never decrease and are bounded by n, the test is called O(n) times in total, which gives the O(n^2) bound.
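The control flow of SeedArray can be tried out with a brute-force stand-in for the test. In the sketch below, `has_seed_of_length` is a hypothetical, deliberately slow replacement for SeedsOfAGivenLength (it does not use suffix arrays and is far from O(n)); it assumes that a seed must be a factor of the string. Only the outer loop mirrors the algorithm on the slide.

```python
def is_seed(s: str, u: str) -> bool:
    """Brute force: do occurrences of s, possibly overhanging u, cover u?"""
    n, k = len(u), len(s)
    covered = [False] * n
    for p in range(-(k - 1), n):          # p = 0-based position of s[0] in u
        lo, hi = max(p, 0), min(p + k, n)
        if all(s[q - p] == u[q] for q in range(lo, hi)):
            for q in range(lo, hi):
                covered[q] = True
    return all(covered)


def has_seed_of_length(u: str, k: int) -> bool:
    """Slow stand-in for SeedsOfAGivenLength(u, k); tries every factor of length k."""
    factors = {u[p:p + k] for p in range(len(u) - k + 1)}
    return any(is_seed(s, u) for s in factors)


def seed_array(u: str) -> list:
    """seed[i] = length of the shortest seed of u[1..i]; seed[0] is unused."""
    n = len(u)
    seed = [0] * (n + 1)
    if n == 0:
        return seed
    seed[1] = 1
    for i in range(2, n + 1):
        seed[i] = seed[i - 1]             # the array never decreases
        while not has_seed_of_length(u[:i], seed[i]):
            seed[i] += 1
    return seed


print(seed_array("abaab")[1:])            # [1, 2, 2, 3, 3]
```

The inner loop always terminates because u[1..i] is a seed of itself, so the candidate length never has to exceed i.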

SLIDE 48

Seed computation

We present a new O(n log n) time algorithm for seed computation. It can be used to find the shortest seed among those of length ≥ m in O(n log (n/m)) time. Hence, for m = Θ(n), finding the shortest seed can be done in O(n) time.

SLIDE 54

Links between the notions

Seed: a a a b a a b a a b a a a b a a b a

[Figure: segments of the string above are labelled cover, left seed, and right seed.]

The string s is a cover of u[i..j] if maxgap(s) ≤ |s|.

SLIDE 56

The suffix tree

[Figure: suffix tree with edge labels a, b, $, ab$, aab$ and leaves numbered 1 to 6.]

It suffices to compute maxgaps for all the explicit nodes of the suffix tree. E.g., maxgap(ab) = maxgap({1, 4}) = 3, maxgap(a) = maxgap({1, 3, 4}) = 2.

It is easier to compute prefix maxgaps (i.e., the maxima of the maxgap values on the path from a node to the root).
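The maxgap values quoted above can be reproduced directly from occurrence lists. The sketch below makes several assumptions that the slides do not state: it uses the string abaab (which is consistent with the occurrence sets {1, 4} and {1, 3, 4}), it defines maxgap as the largest difference between consecutive occurrence positions, it returns 0 when there are fewer than two occurrences, and it takes u[i..j] to run from the first occurrence of s to the end of its last occurrence. It scans the string naively rather than using a suffix tree.

```python
def occurrences(u: str, s: str) -> list:
    """1-based start positions of s in u, in increasing order."""
    return [p + 1 for p in range(len(u) - len(s) + 1) if u.startswith(s, p)]


def maxgap(positions: list) -> int:
    """Largest difference between consecutive positions (0 if fewer than two)."""
    return max((b - a for a, b in zip(positions, positions[1:])), default=0)


def covers_its_span(u: str, s: str) -> bool:
    """True if s occurs in u and covers everything from its first to its last occurrence."""
    occ = occurrences(u, s)
    return bool(occ) and maxgap(occ) <= len(s)


u = "abaab"
print(occurrences(u, "ab"), maxgap(occurrences(u, "ab")))   # [1, 4] 3
print(occurrences(u, "a"), maxgap(occurrences(u, "a")))     # [1, 3, 4] 2
print(covers_its_span(u, "ab"))                             # False: gap 3 > |ab| = 2
```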

SLIDE 59

Prefix maxgaps

We show how to compute all the prefix maxgaps on a path down the suffix tree in linear time.

[Figure: a path from the root with subtrees T1, T2, T3, T4 hanging off it.]

We obtain a recursive algorithm. Each time we choose the heaviest path in the suffix tree. We obtain O(log n) levels of recursion and O(n log n) total time.

SLIDE 60

Prefix maxgaps

If we search for the shortest seed of length ≥ m, then it suffices to consider several subtrees of the tree, each of size O(n/m). We obtain an O(n log (n/m)) time algorithm.

SLIDE 62

Summary

Problem           Time
Cover             O(n)
Cover array       O(n)
Left seed         O(n)
Left seed array   O(n)
Seed              O(n log n) [O(n log(n/m))]
Seed array        O(n^2)

Thank you for your attention!
