

slide-1
SLIDE 1

Order-Preserving Incomplete Suffix Trees and Order-Preserving Indexes

Maxime Crochemore (3, 5), Costas S. Iliopoulos (3, 4), Tomasz Kociumaka (1), Marcin Kubica (1), Alessio Langiu (3), Solon P. Pissis (3, 4), Jakub Radoszewski (1), Wojciech Rytter (1, 2), Tomasz Waleń (1)

(1) University of Warsaw, Warsaw, Poland
(2) Copernicus University, Toruń, Poland
(3) King’s College London, London, UK
(4) University of Western Australia, Perth, Australia
(5) Université Paris-Est, France

SPIRE 2013, 2013–10–09

1/19

slide-5
SLIDE 5

Order preserving model

Relation ≈

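Two words x and y are called order-isomorphic, written as x ≈ y, iff |x| = |y| and for all i, j we have x[i] ≤ x[j] ⇔ y[i] ≤ y[j].

Example

1 3 2 4 ≈ 2 6 3 8 ≉ 3 7 4 5

The last word is not order-isomorphic to the first two: for i = 2, j = 4 we have x[i] < x[j] (3 < 4 and 6 < 8) but y[i] > y[j] (7 > 5).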
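
The definition of ≈ can be checked directly by comparing all pairs of positions. The following is a minimal quadratic-time sketch of that check, not the method used in the talk; the function name is_order_isomorphic is chosen here for illustration only.

```python
def is_order_isomorphic(x, y):
    """Direct O(n^2) test of the relation: same length and same
    outcome of every comparison x[i] <= x[j] versus y[i] <= y[j]."""
    if len(x) != len(y):
        return False
    n = len(x)
    return all(
        (x[i] <= x[j]) == (y[i] <= y[j])
        for i in range(n)
        for j in range(n)
    )


# The words from the example slide.
first = [1, 3, 2, 4]
second = [2, 6, 3, 8]
third = [3, 7, 4, 5]
print(is_order_isomorphic(first, second))
print(is_order_isomorphic(first, third))
```

The quadratic cost is what the encoding function introduced later in the talk avoids.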

2/19

slide-6
SLIDE 6

Applications

Motivation:

◮ melody matching of two musical scores,
◮ recognition of trends in the stock market,
◮ = is boring, ≈ has a nice combinatorial definition.

Related problems:

◮ suffix trees for quasi-suffix families,
◮ pattern avoidance (as subsequences, not as subwords!),
◮ parameterized matching,
◮ partial words.

3/19

slide-7
SLIDE 7

Previous results

Pattern matching in order-preserving model

For a pattern of length m and a text of length n, detect order-preserving occurrences.

Known results

◮ single pattern matching: O(n + m), Kubica et al., IPL 2013,
◮ multiple pattern matching: O(n + M), Kim et al., arXiv 2013,
◮ pattern matching with k mismatches: O(n(log log m + k log log k)), Gawrychowski, Uznański, arXiv 2013.

4/19

slide-8
SLIDE 8

Our results

Problem

Preprocess a text w of length n in such a way that occurrence queries can be answered efficiently.

Our results:

◮ O(n log log n) preprocessing time,
◮ O(m + Occ) query time (for a pattern of length m).

5/19

slide-9
SLIDE 9

Algorithm outline

◮ an encoding function Code that reduces testing of the ≈ relation to regular equality,
◮ a relaxation of the suffix tree definition to make the implementation easier,
◮ a modification of Ukkonen’s algorithm,
◮ an algorithmic toolbox for speeding up factor encoding and suffix tree navigation.

6/19

slide-10
SLIDE 10

Encoding function (1/2)

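For any i ∈ {1, . . . , n} define:

α_w(i) = distance to the predecessor of w[i] among the values from w[1..(i−1)]
β_w(i) = distance to the successor of w[i] among the values from w[1..(i−1)]

Example: for w = 6 3 2 5 1 4 we have α_w(6) = 4 and β_w(6) = 2 (w[6] = 4; its predecessor 3 is at position 2 and its successor 5 is at position 4).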

7/19

slide-11
SLIDE 11

Encoding function (2/2)

Code(w) = (α_w(1), β_w(1)), . . . , (α_w(|w|), β_w(|w|)).

Example

w       =  1       4       2       3
Code(w) =  (−, −)  (1, −)  (2, 1)  (1, 2)

Observation

x ≈ y ⇔ Code(x) = Code(y).
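
The encoding can be sketched naively as follows. This is an illustrative quadratic-time version, not the efficient computation from the talk. Two choices here are assumptions that the slides do not spell out: equal values count as both predecessor and successor, and ties are resolved in favor of the rightmost earlier position. Missing values (written − on the slide) are represented by None, and positions are 0-based.

```python
def code(w):
    """Naive O(n^2) sketch of Code(w).

    Element i is (alpha, beta): distances from position i back to the
    predecessor and the successor of w[i] among w[0..i-1].
    Assumption: ties go to the rightmost earlier position.
    """
    out = []
    for i, c in enumerate(w):
        pred = succ = None  # positions of predecessor / successor
        for k in range(i):
            if w[k] <= c and (pred is None or w[k] >= w[pred]):
                pred = k
            if w[k] >= c and (succ is None or w[k] <= w[succ]):
                succ = k
        out.append((
            i - pred if pred is not None else None,
            i - succ if succ is not None else None,
        ))
    return out


print(code([1, 4, 2, 3]))
print(code([1, 3, 2, 4]) == code([2, 6, 3, 8]))
print(code([1, 3, 2, 4]) == code([3, 7, 4, 5]))
```

With this encoding, testing ≈ becomes a plain equality test on code sequences, which is what allows a trie over codes to serve as an index.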

8/19

slide-12
SLIDE 12

How to compute Encoding function?

Lemma (Off-line Code computation)

For a string w of length n, Code(w) can be computed in O(n) time.

Lemma (Arbitrary factor Code computation)

For a string w of length n, after O(n) preprocessing any element of Code(v) for any factor v of w can be computed in O(log n) time.

Restricted case

If we restrict the computation of Code to a sliding window over w, the computation time drops to O(log log n) per code element.
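
One standard way to obtain a whole-word encoding in linear time after sorting is the linked-list deletion technique, sketched below. Whether the talk uses exactly this technique is not stated on the slides, so treat it as an assumption. The sketch also assumes that all values of w are distinct, and the sort makes it O(n log n) unless the alphabet can be sorted in linear time. The name code_offline is hypothetical.

```python
def code_offline(w):
    """Sketch: Code(w) via a doubly linked list over positions sorted
    by value. Assumes distinct values in w; positions are 0-based and
    a missing predecessor/successor is reported as None."""
    n = len(w)
    order = sorted(range(n), key=lambda i: w[i])  # positions by value
    prev = [None] * n
    nxt = [None] * n
    for a, b in zip(order, order[1:]):
        nxt[a] = b
        prev[b] = a

    out = [None] * n
    # Scan positions right to left. When position i is visited, every
    # later position has already been unlinked, so the list neighbours
    # of i are its predecessor and successor among w[0..i-1].
    for i in range(n - 1, -1, -1):
        p, s = prev[i], nxt[i]
        out[i] = (
            i - p if p is not None else None,
            i - s if s is not None else None,
        )
        if p is not None:
            nxt[p] = s
        if s is not None:
            prev[s] = p
    return out


print(code_offline([1, 4, 2, 3]))
```

Each position is linked and unlinked once, so the work after sorting is linear in n.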

9/19

slide-13
SLIDE 13

Order-preserving suffix trees

The order-preserving suffix tree of w (of length n) is a compacted trie of all the sequences in:

{Code(w[1..n])#, Code(w[2..n])#, . . . , Code(w[n..n])#}

Example

w = (1, 2, 4, 4, 2, 5, 5, 1)

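[Figure: the order-preserving suffix tree of w. Edges carry code labels such as (1, 1), (2, 4), (1, 2), (3, 3), (4, 3), (1, 3), (2, 1), (2, 3) and the terminator #; the eight leaves are numbered 1 to 8.]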

Additionally each explicit node stores a suffix link.

10/19

slide-14
SLIDE 14

Suffix links

◮ in standard suffix trees the suffix links always point to explicit nodes,
◮ in order-preserving suffix trees a suffix link may point to an implicit node.

11/19

slide-15
SLIDE 15

Incomplete suffix trees

Relaxed definition

The incomplete order-preserving suffix tree of w is an order-preserving suffix tree in which each explicit node v can have one outgoing edge that does not store its first character.

[Figure: nodes parent(v) and v with edges labeled (2, 5), (3, 2) and (5, 10), and one edge marked "?": this edge misses its label.]

12/19

slide-16
SLIDE 16

Why are incomplete edges not harmful?

Lemma

Let x and y be two strings of length t and x′ = x[1 . . t − 1], y′ = y[1 . . t − 1]. Then:

x ≈ y ⇔ x′ ≈ y′ ∧ (y[i] ≤ y[t] ≤ y[j]), where i = t − α_x(t), j = t − β_x(t).

[Figure: positions i, j and t marked in x and in y; α_x(t) and β_x(t) are the distances from t back to i and j.]

So we need Code only for x.
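
The lemma suggests a constant-time extension test: once the prefixes are known to match, only the last code element of x and three values of y are inspected. The sketch below illustrates this with 0-based positions and None for a missing predecessor or successor. The handling of repeated values (equality when both distances coincide, strict inequalities otherwise) is an assumption added here to make the test safe for words with ties; the slide states the condition only as y[i] ≤ y[t] ≤ y[j]. The function name extends_match is hypothetical.

```python
def extends_match(y, alpha, beta):
    """Given that y[:-1] is order-isomorphic to x[:-1], decide whether
    y is order-isomorphic to x, using only (alpha, beta), the last
    element of Code(x). Distances are None when undefined."""
    t = len(y) - 1
    lo = y[t - alpha] if alpha is not None else None
    hi = y[t - beta] if beta is not None else None
    if alpha is not None and alpha == beta:
        # Assumption: equal distances mean x[t] repeats an earlier value.
        return y[t] == lo
    return (lo is None or lo < y[t]) and (hi is None or y[t] < hi)


# For x = 1 3 2 4 the last code element is (2, None) in 0-based terms:
# the predecessor of 4 is 3, two positions back, and there is no successor.
print(extends_match([2, 6, 3, 8], 2, None))
print(extends_match([3, 7, 4, 5], 2, None))
```

Note that Code(y) is never computed, which is why a tree edge whose first label is missing can still be verified against a pattern.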

13/19

slide-19
SLIDE 19

Algorithm for constructing incomplete suffix tree

We basically re-implement Ukkonen’s algorithm. The suffix tree is constructed using two basic operations:

Branch(v, (p, q))

Create new branch starting in v with code (p, q). If v is an implicit node, then the existing edge becomes incomplete (that’s why the suffix tree can be incomplete).

Example

[Figure: a new branch labeled (p, q) leaves v; the existing edge through v is marked "this edge is incomplete".]

14/19

slide-21
SLIDE 21

How to implement Transition

Transition(v, (p, q))

Checks whether v has a child v′ such that the edge from v to v′ represents the code (p, q). Returns v′ if so, and nil otherwise.

Implementation

Node v has child edges labeled (1, 2) and (3, 1).
Case 1: (p, q) is present among the child edges, e.g. (p, q) = (3, 1). The matching edge leads to v′.

slide-22
SLIDE 22

How to implement Transition (continued)

Implementation

Node v has child edges labeled (1, 2) and (3, 1).
Case 2: (p, q) is not present among the child edges, e.g. (p, q) = (4, 2). We have to verify the (single) incomplete edge, which may lead to v′.
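
The two cases can be sketched as a dictionary lookup followed by a check of the single incomplete edge. Everything structural here is an assumption made for illustration: the Node layout, the representation of an incomplete edge as a pair (suffix start, child), and the naive recomputation of the missing label from w. The talk relies on the lemma about incomplete edges and on dedicated data structures for this step.

```python
from dataclasses import dataclass, field
from typing import Optional


def code_element(v, d):
    """Element d (0-based) of Code(v), computed naively in O(d) time.
    Assumption: ties go to the rightmost earlier position."""
    pred = succ = None
    for k in range(d):
        if v[k] <= v[d] and (pred is None or v[k] >= v[pred]):
            pred = k
        if v[k] >= v[d] and (succ is None or v[k] <= v[succ]):
            succ = k
    return (d - pred if pred is not None else None,
            d - succ if succ is not None else None)


@dataclass
class Node:
    depth: int                                    # codes read from the root
    children: dict = field(default_factory=dict)  # (p, q) -> Node
    # Hypothetical encoding of the incomplete edge: (suffix start in w, child).
    incomplete: Optional[tuple] = None


def transition(w, v, pq):
    """Return the child of v reached by code pq, or None."""
    child = v.children.get(pq)          # Case 1: stored first label
    if child is not None:
        return child
    if v.incomplete is not None:        # Case 2: verify the incomplete edge
        start, child = v.incomplete
        if start + v.depth < len(w) and \
                code_element(w[start:], v.depth) == pq:
            return child
    return None
```

A usage example under the same assumptions: for w = (1, 2, 4, 4, 2, 5, 5, 1), a node at depth 1 whose incomplete edge comes from the suffix starting at 0-based position 2 (values 4, 4, ...) is reached by the code (1, 1), even though that label is not stored in the node.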

15/19

slide-23
SLIDE 23

Algorithmic toolbox, continued

We also require the following data structures:

◮ Weak Character Oracle: a data structure based on y-fast trees (Willard 1983) for computing codes for newly created branches of the tree,
◮ Dynamic Weighted Ancestor data structure (Kopelowitz, Lewenstein 2007), used for fast navigation over the constructed suffix tree.

16/19

slide-24
SLIDE 24

Example usage

Theorem

Given word w of length n, the incomplete order-preserving suffix tree can be constructed in O(n log log n) expected time.

Theorem

Given the op-suffix tree T(w) and a pattern x, we can locate all order-preserving occurrences of x in the word w in time O(|x| + Occ).

◮ Compute Code(x) and traverse the tree T(w) using successive symbols of the code. At each step we use the function Transition.
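
The query procedure can be illustrated with a deliberately naive index: an uncompacted trie of the codes of all suffixes, with start positions stored at every node. This sketch takes at least quadratic time and space, so it is nowhere near the bounds stated in the theorems, and it uses neither incomplete edges nor suffix links. The class and function names are hypothetical, positions are 0-based, and ties in the encoding go to the rightmost earlier position (an assumption).

```python
def code(w):
    """Naive Code(w); missing predecessor/successor is None."""
    out = []
    for i, c in enumerate(w):
        pred = succ = None
        for k in range(i):
            if w[k] <= c and (pred is None or w[k] >= w[pred]):
                pred = k
            if w[k] >= c and (succ is None or w[k] <= w[succ]):
                succ = k
        out.append((i - pred if pred is not None else None,
                    i - succ if succ is not None else None))
    return out


class TrieNode:
    def __init__(self):
        self.children = {}  # code pair -> TrieNode
        self.starts = []    # start positions of suffixes through this node


def build_naive_index(w):
    """Insert Code(w[s:]) for every suffix start s."""
    root = TrieNode()
    for s in range(len(w)):
        node = root
        for sym in code(w[s:]):
            node = node.children.setdefault(sym, TrieNode())
            node.starts.append(s)
    return root


def find_occurrences(root, x):
    """Follow Code(x) from the root; report matching start positions."""
    node = root
    for sym in code(x):
        node = node.children.get(sym)
        if node is None:
            return []
    return list(node.starts)


index = build_naive_index([1, 2, 4, 4, 2, 5, 5, 1])
print(find_occurrences(index, [1, 2, 2]))
```

The traversal works because the code of a prefix of a word is a prefix of the code of the word, so a pattern occurs at position s exactly when Code(x) is a prefix of the code of the suffix starting at s.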

17/19

slide-25
SLIDE 25

Complete suffix trees for op-model

Theorem

The order-preserving suffix tree of a string of length n can be constructed in O(n log n/ log log n) expected time.

◮ This can be achieved by a slightly different encoding function that allows a character oracle with O(log n/ log log n) query time and o(n log n/ log log n) preprocessing.

18/19

slide-26
SLIDE 26

Thank you for your attention!

19/19