
slide-1
SLIDE 1

A Linear Time Algorithm for Seeds Computation

Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

University of Warsaw

SODA 2012 Kyoto, January 18, 2012

Tomasz Kociumaka A Linear Time Algorithm for Seeds Computation 1/20

slide-7
SLIDE 7

Periodicity and quasiperiodicity

Periodicity:

[Figure: a periodic word over the letters a and b, with its period marked.]

One of the key concepts in text algorithms.

slide-9
SLIDE 9

Periodicity and quasiperiodicity

Periodicity:

[Figure: a periodic word over the letters a and b, with its period marked.]

Quasiperiodicity:

[Figure: a quasiperiodic word over the letters a and b, with the occurrences of its repeated subword marked.]

slide-11
SLIDE 11

Covers and seeds

Cover:

[Figure: a word over the letters a and b, with the occurrences of its cover marked.]

Each letter of the word is covered by an occurrence of the cover.
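The cover condition can be checked directly by scanning the occurrences from left to right. The following is a minimal brute-force sketch, not the linear-time method referred to in the talk; the function names `occurrences` and `is_cover` are hypothetical and chosen only for illustration.

```python
def occurrences(w: str, v: str) -> list[int]:
    """Start positions of all (possibly overlapping) occurrences of v in w."""
    return [i for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]


def is_cover(w: str, v: str) -> bool:
    """Return True if every letter of w lies inside some occurrence of v."""
    if not v:
        return False
    covered_up_to = 0  # every position < covered_up_to is already covered
    for start in occurrences(w, v):
        if start > covered_up_to:
            return False  # a letter before this occurrence is left uncovered
        covered_up_to = start + len(v)
    return covered_up_to == len(w)
```

For example, `is_cover("abaababaaba", "aba")` holds, because the occurrences at positions 0, 3, 5 and 8 overlap or touch, while `"ab"` leaves the letter at position 2 uncovered.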

slide-13
SLIDE 13

Covers and seeds

Seed:

[Figure: a word over the letters a and b, with the occurrences of its seed marked, some of them extending beyond the ends of the word.]

Each letter of the word is covered by an occurrence of the seed. The occurrences can be external.
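A brute-force check of the seed condition may make the role of external occurrences clearer. This is only a sketch under two assumptions that the slide does not spell out: a seed must occur at least once inside the word, and an external occurrence is one that overhangs the left or the right end of the word. The name `is_seed` is hypothetical.

```python
def is_seed(w: str, v: str) -> bool:
    """Brute-force seed test; runs in O(|w| * |v|) time, not linear time."""
    n, k = len(w), len(v)
    if k == 0 or v not in w:  # assumption: a seed is a subword of w
        return False
    covered = [False] * n
    # Occurrences lying fully inside w.
    for i in range(n - k + 1):
        if w.startswith(v, i):
            for j in range(i, i + k):
                covered[j] = True
    # External occurrences: v sticks out on the left or on the right by k - t letters.
    for t in range(1, k):
        if v[k - t:] == w[:t]:      # a suffix of v matches a prefix of w
            for j in range(t):
                covered[j] = True
        if v[:t] == w[n - t:]:      # a prefix of v matches a suffix of w
            for j in range(n - t, n):
                covered[j] = True
    return all(covered)
```

For the word `"baabaabaa"`, the subword `"aab"` is a seed but not a cover: the first letter is covered only by an occurrence overhanging on the left, and the last two letters only by one overhanging on the right.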

slide-16
SLIDE 16

The main problem

Problem (Shortest-Seed): Given a word w of length n over an alphabet Σ, compute the shortest seed of w.

Problem (All-Seeds): Given a word w of length n over an alphabet Σ, compute an O(n)-sized representation of all the seeds of w.

Theorem (Our result): The All-Seeds Problem for Σ = {0, 1, . . . , n^O(1)} can be solved in O(n) time.

slide-19
SLIDE 19

Background

Seeds were introduced in 1993 by Iliopoulos, Moore & Park. The same paper gives an O(n log n)-time algorithm for the All-Seeds Problem over a fixed-size alphabet.

Until now, no o(n log n)-time algorithm was known, even for the Shortest-Seed Problem over a binary alphabet.

In his survey (2000), W.F. Smyth stated finding a linear-time algorithm for the All-Seeds Problem as a hard open problem.

slide-22
SLIDE 22

Background

An O(log n)-time PRAM algorithm for n processors (Ben-Amram et al., SODA 1994).

For covers, linear-time algorithms for similar problems are known:

  • shortest covers of each prefix (Breslauer, 1992)
  • all covers (Moore & Smyth, SODA 1994)

Variants of seeds have been studied:

  • approximate seeds (Christodoulakis et al., 2003)
  • λ-seeds (Guo, Zhang & Iliopoulos, 2006)

slide-24
SLIDE 24

Constraints for seeds

Two different types of constraints:

  • Border constraints (easier)
  • Maxgap constraints (harder)

[Figure: a word over the letters a and b with consecutive occurrences of a subword marked; both gaps are labeled ≤ 5.]

Maxgap is the maximal distance between the starting positions of two consecutive occurrences of a given subword.
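The maxgap of a single subword can be computed naively from its list of occurrences. The sketch below is an illustration only; the function name `maxgap` and the convention of returning 0 when there are fewer than two occurrences are assumptions, not taken from the talk.

```python
def maxgap(w: str, v: str) -> int:
    """Largest distance between start positions of consecutive occurrences of v in w."""
    starts = [i for i in range(len(w) - len(v) + 1) if w.startswith(v, i)]
    if len(starts) < 2:
        return 0  # assumed convention: no pair of consecutive occurrences
    return max(b - a for a, b in zip(starts, starts[1:]))
```

For `w = "abaababaab"` and `v = "ab"`, the occurrences start at 0, 3, 5 and 8, so the maxgap is 3. Since 3 exceeds |v| = 2, some letter between two consecutive occurrences is not covered.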

slide-26
SLIDE 26

Quasiseeds

The All-Seeds Problem can be linearly reduced to computing the maxgaps of all subwords (encoded in a suffix tree). No o(n log n) algorithm is known for that.

Definition (Quasiseed): A subword v is a quasiseed of w if there are fewer than |v| letters both before its first occurrence and after its last one, and each letter between those two occurrences is covered by an occurrence of v.

[Figure: a word over the letters a and b with a quasiseed of length 5; the margins before the first and after the last occurrence are labeled < 5, and all letters in between are covered.]
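The definition translates into three simple checks on the list of occurrences. The following brute-force sketch is meant only to illustrate the definition; the name `is_quasiseed` is hypothetical, and "each letter between the occurrences is covered" is expressed here as "consecutive occurrences start at most |v| positions apart".

```python
def is_quasiseed(w: str, v: str) -> bool:
    """Brute-force test of the quasiseed definition for a subword v of w."""
    n, k = len(w), len(v)
    starts = [i for i in range(n - k + 1) if w.startswith(v, i)]
    if k == 0 or not starts:
        return False
    if starts[0] >= k:                # fewer than |v| letters before the first occurrence
        return False
    if n - (starts[-1] + k) >= k:     # fewer than |v| letters after the last occurrence
        return False
    # No uncovered letter between consecutive occurrences (maxgap <= |v|).
    return all(b - a <= k for a, b in zip(starts, starts[1:]))
```

For example, `"ab"` is a quasiseed of `"cabab"`: only one letter precedes the first occurrence and the two occurrences are adjacent. It is not a seed of that word, because no occurrence of `"ab"`, internal or external, can cover the leading `c`.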

slide-27
SLIDE 27

Useful properties of quasiseeds

An O(n) representation on the suffix tree.

[Figure: a suffix tree with a marked node v, showing which subwords are non-quasiseeds and which are quasiseeds.]

slide-29
SLIDE 29

Useful properties of quasiseeds

Lemma (Restricted-Quasiseeds): Given an integer d and a word w of length n, the representation of all quasiseeds of length in {d, d + 1, . . . , 2d} can be found in O(n) time.

The All-Seeds Problem can be linearly reduced to computing (the representation of) all quasiseeds.

slide-30
SLIDE 30

Main problem

Problem (All-Quasiseeds) Given a word of length n, compute the representation of all its quasiseeds.


slide-32
SLIDE 32

Recursive structure of the algorithm

Interval m-staircase

[Figure: the word w covered by a staircase of overlapping intervals, with the lengths m and 3m marked.]

Lemma (Short Quasiseeds): A subword v of length < m is a quasiseed of w if and only if it is a quasiseed of each subword corresponding to an m-staircase interval.
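One plausible reading of the figure labels is that the staircase consists of intervals of length 3m whose starting positions are m apart. The sketch below builds such intervals under that assumption; the exact boundaries used by the authors may differ, and the name `staircase` and the half-open, 0-based interval convention are hypothetical.

```python
def staircase(n: int, m: int) -> list[tuple[int, int]]:
    """Half-open intervals [start, end) of length 3m, shifted by m, covering [0, n)."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    intervals = []
    start = 0
    while True:
        end = min(n, start + 3 * m)
        intervals.append((start, end))
        if end == n:      # the last interval reaches the end of the word
            break
        start += m
    return intervals
```

With n = 100 and m = 2 this produces 48 intervals of length 6, for a total length of 288, which is close to 3n.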

slide-36
SLIDE 36

Recursive structure of the algorithm

The total length of the intervals in the staircase (the size of the staircase) is about 3n. If it were n/2, the recursion could yield a linear algorithm. We need to reduce the staircase.

[Figure: the word w with two equal fragments (marked "="); some stairs are labeled "not needed".]
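The reason a staircase of size n/2 would suffice can be seen from a standard recurrence argument. The derivation below is a sketch, assuming that the non-recursive work on a word of length n costs at most cn for some constant c and that the recursive calls are made on words of lengths n_1, n_2, ... whose total is at most n/2; the symbols T, c and n_i are introduced here for illustration only.

```latex
\begin{align*}
T(n) &\le cn + \sum_i T(n_i), \qquad \sum_i n_i \le \frac{n}{2}. \\
\intertext{Assuming inductively that $T(k) \le 2ck$ for all $k < n$:}
T(n) &\le cn + \sum_i 2c\,n_i \;\le\; cn + 2c \cdot \frac{n}{2} \;=\; 2cn .
\end{align*}
```

With a staircase of total size about 3n the same induction does not close, because the recursive calls alone would already cost more than the bound being proved.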

slide-39
SLIDE 39

Recursive structure of the algorithm

Outline:

1. Find an appropriate reduced staircase
2. Find the long quasiseeds (non-recursively)
3. Find the short quasiseeds (recursive calls)
4. Merge the results of those calls

Main issue: how to find an appropriate m, so that simultaneously:

  • the reduced staircase is small,
  • the long quasiseeds can be found in O(n) time.

Due to the Restricted-Quasiseeds Lemma, m = Θ(n) would suffice for the second part.

Merging is not as easy as it may seem (it uses RMQ and static find-union).

slide-41
SLIDE 41

f-factorization

A variant of the well-known LZ-factorization.

Definition (f-factorization): An f-factorization f_1 f_2 . . . f_k of w is constructed greedily: f_i is either just the first occurrence of a letter or the longest prefix of the remaining suffix that is a subword of f_1 . . . f_{i−1}.

[Figure: an example word over the letters a, b and c split into the factors of its f-factorization.]
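The greedy construction in the definition can be written down almost literally. The sketch below is a naive version that searches the already factorized prefix with Python's substring test, so it is far from linear time; it is meant only to illustrate the definition, and the name `f_factorization` is hypothetical.

```python
def f_factorization(w: str) -> list[str]:
    """Greedy f-factorization: each factor is a new letter or the longest
    prefix of the remaining suffix that occurs in the part factorized so far."""
    factors = []
    i = 0
    while i < len(w):
        prefix = w[:i]  # f_1 ... f_{i-1}, the part factorized so far
        length = 0
        while i + length < len(w) and w[i:i + length + 1] in prefix:
            length += 1
        if length == 0:
            length = 1  # first occurrence of a letter
        factors.append(w[i:i + length])
        i += length
    return factors
```

For example, `f_factorization("abaabab")` gives `["a", "b", "a", "aba", "b"]`. Unlike some LZ variants, a factor here must occur entirely inside the earlier factors, so `"aaaa"` is split as `["a", "a", "aa"]`.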

slide-42
SLIDE 42

f-factorization

Theorem (Crochemore, 1983; Crochemore et al., 2009): The f-factorization over a (constant) integer alphabet can be computed in O(n) time.

slide-44
SLIDE 44

Quasiseeds, staircase and factorization

Lemma: Let F be the f-factorization of w (|w| = n) and let v be a quasiseed of w with |v| < n/50. Then at most ⌊2n/|v|⌋ − 1 factors from F lie within [2n/50, 49n/50].

[Figure: the word w with margins of length 2n/50 and n/50 marked; not many middle factors lie between them.]

Stairs lying within a single factor are not necessary.

slide-47
SLIDE 47

Key lemmas

The algorithm does not know the quasiseed, but it can find the number of middle factors.

Let g be the number of middle factors of the word w, |w| = n > 200.

Lemma: There is no quasiseed v of w such that 2n/(g+1) < |v| ≤ n/50.

Lemma: If m ≤ n/(50(g+1)), then the size of the reduced staircase is < n/2.

slide-48
SLIDE 48

Final structure of the algorithm

1. Find an f-factorization and the number of middle factors (g)
2. m := ⌊n/(50(g+1))⌋
3. Compute the reduced staircase
4. Compute the long quasiseeds (belonging to two ranges of fixed ratio)
5. If m > 0, compute the short quasiseeds by recursive calls and merge the results
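The first two steps can be sketched with a naive factorization. This is an illustration under stated assumptions: a factor counts as a middle factor when it lies entirely within positions [2n/50, 49n/50], positions are 0-based with half-open spans, and the factorization is computed by brute force rather than in linear time. The names `factor_spans`, `count_middle_factors` and `choose_m` are hypothetical.

```python
def factor_spans(w: str) -> list[tuple[int, int]]:
    """Half-open spans [start, end) of the greedy f-factorization (naive version)."""
    spans = []
    i = 0
    while i < len(w):
        length = 0
        while i + length < len(w) and w[i:i + length + 1] in w[:i]:
            length += 1
        length = max(length, 1)  # a new letter forms a factor of length 1
        spans.append((i, i + length))
        i += length
    return spans


def count_middle_factors(w: str) -> int:
    """g: number of factors lying within [2n/50, 49n/50] (assumed convention)."""
    n = len(w)
    # Integer arithmetic avoids rounding: start >= 2n/50 and end <= 49n/50.
    return sum(1 for s, e in factor_spans(w) if 50 * s >= 2 * n and 50 * e <= 49 * n)


def choose_m(n: int, g: int) -> int:
    """m := floor(n / (50 (g + 1)))."""
    return n // (50 * (g + 1))
```

For the word consisting of 100 letters a, the factors have lengths 1, 1, 2, 4, 8, 16, 32 and 36; four of them lie in the middle range, so g = 4 and m = 0, which means that no recursive calls would be made in step 5.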

slide-50
SLIDE 50

Conclusions

We have presented a linear algorithm for the All-Quasiseeds Problem (over integer alphabet). This yields a linear algorithm for the All-Seeds Problem (over integer alphabet).

Thank you!
