Sparse Compact Directed Acyclic Word Graphs - Shunsuke Inenaga


slide-1
SLIDE 1

Sparse Compact Directed Acyclic Word Graphs

Shunsuke Inenaga

(Japan Society for the Promotion of Science & Kyushu University)

Masayuki Takeda

(Kyushu University & Japan Science and Technology Agency)

slide-2
SLIDE 2

Traditional Pattern Matching Problem

• Given: text T in Σ∗ and pattern P in Σ∗
• Return: whether or not P appears in T
  - Σ: alphabet (set of characters)
  - Σ∗: set of strings over Σ
• A text indexing structure for T lets you solve the above problem in O(m) time (for fixed Σ).
  - m: the length of P

slide-3
SLIDE 3

Suffix Trie

• A trie representing all suffixes of T

T = aacb$
Suffixes of T: aacb$, acb$, cb$, b$, $
[Figure: suffix trie of T]
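A minimal Python sketch of the idea, assuming a trie stored as nested dictionaries; the function names `build_suffix_trie` and `occurs` are illustrative and do not come from the slides. It builds the trie naively in quadratic time, which is only meant to show why a query takes O(m) steps.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """Walk down from the root: one dict lookup per character of the pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aacb$")
print(occurs(trie, "ac"))  # True: "ac" is a prefix of the suffix "acb$"
print(occurs(trie, "ca"))  # False: no suffix starts with "ca"
```

Every substring of T is a prefix of some suffix of T, so a successful walk from the root is equivalent to an occurrence.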

slide-4
SLIDE 4

Introducing Word Separator #

• #: word separator, a special symbol not in Σ
• D = Σ∗#: dictionary of words
• Text T: an element of D+ (T is a sequence T1T2…Tk of k words in D)
• e.g., T = This#is#a#pen#
  - Σ = {A,…,z}
  - D = {…, This#, …, a#, …, is#, …, pen#, …}
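As a small illustration of these definitions, the following Python sketch splits a text in D+ into its words T1, …, Tk, each keeping its trailing separator. The helper name `split_words` is hypothetical and not part of the slides.

```python
def split_words(text, sep="#"):
    """Split a text in D+ into its words, each ending with the separator."""
    if not text or not text.endswith(sep):
        raise ValueError("a text in D+ must be non-empty and end with the separator")
    # Drop the final separator, split on the rest, then re-attach the separator
    # so that every piece is an element of D = Σ*#.
    return [word + sep for word in text[:-1].split(sep)]


words = split_words("This#is#a#pen#")
print(words)       # ['This#', 'is#', 'a#', 'pen#']
print(len(words))  # k = 4
```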

slide-5
SLIDE 5

Word-level Pattern Matching Problem

• Given: text T in D+ and pattern P in D+
• Return: whether or not P appears at the beginning of any word in T

e.g.
T = The#space#runner#is#not#your#good#pace#runner#
P = pace#runner#
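The difference from ordinary matching can be checked with a naive scan. This is only a sketch of the problem statement, not of any indexing structure; the names `word_starts`, `word_level_occurrences` and `all_occurrences` are illustrative.

```python
def word_starts(text, sep="#"):
    """Positions in text at which a word begins."""
    return [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]


def word_level_occurrences(text, pattern, sep="#"):
    """Positions where pattern occurs starting at the beginning of a word."""
    return [i for i in word_starts(text, sep) if text.startswith(pattern, i)]


def all_occurrences(text, pattern):
    """Positions of every character-level occurrence of pattern."""
    return [i for i in range(len(text)) if text.startswith(pattern, i)]


T = "The#space#runner#is#not#your#good#pace#runner#"
P = "pace#runner#"
# The occurrence inside "space#runner#" is a character-level match only,
# because it does not start at a word boundary.
print(len(all_occurrences(T, P)))         # 2
print(len(word_level_occurrences(T, P)))  # 1
```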

slide-7
SLIDE 7

Word Suffix Trie

• A trie representing the suffixes of T that begin at the start of a word.

T = aa#b#
Suffixes of T: aa#b#, a#b#, #b#, b#, #
[Figure: word suffix trie of T, with the two paths aa#b# and b#]
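A naive Python sketch of this definition, assuming the same nested-dictionary trie representation as an ordinary suffix trie; only the set of inserted suffixes changes. The function names are illustrative.

```python
def build_word_suffix_trie(text, sep="#"):
    """Insert only the suffixes of text that start at a word boundary."""
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    root = {}
    for start in starts:
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes in the trie rooted at node, including node itself."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = build_word_suffix_trie("aa#b#")
print(sorted(trie))       # ['a', 'b']: only aa#b# and b# were inserted
print(count_nodes(trie))  # 8 nodes: the root, 5 on the path aa#b#, 2 on b#
```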

slide-8
SLIDE 8

Normal and Word Suffix Tries

T = aa#b#
[Figure: the normal suffix trie and the word suffix trie of T]

slide-9
SLIDE 9

Normal and Word Suffix Trees

T = aa#b#
[Figure: the normal suffix tree and the word suffix tree of T]

slide-10
SLIDE 10

Sizes of Word Suffix Tries and Trees

• For a text T = T1T2…Tk of length n,
  - the word suffix trie of T requires O(nk) space, but
  - the word suffix tree of T requires only O(k) space,
  - because the word suffix tree has only k leaves and all of its internal nodes are branching.
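The gap between the two sizes can be observed on small inputs. The sketch below builds the word suffix trie naively and then counts which nodes would survive path compression (the root, the leaves, and the branching nodes). It assumes, as in the slide examples, that no word suffix is a prefix of another, so each of the k word suffixes ends at a leaf. All function names are illustrative.

```python
def build_word_suffix_trie(text, sep="#"):
    """Naive word suffix trie as nested dicts."""
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    root = {}
    for start in starts:
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_trie_nodes(node):
    """Every node of the trie counts."""
    return 1 + sum(count_trie_nodes(child) for child in node.values())


def count_tree_nodes(node, is_root=True):
    """Count only the root, leaves, and branching nodes (path compression)."""
    keep = 1 if is_root or len(node) != 1 else 0
    return keep + sum(count_tree_nodes(child, False) for child in node.values())


for text in ("aa#b#", "a#b#a#bab#"):
    trie = build_word_suffix_trie(text)
    k = text.count("#")
    print(text, "k =", k, "trie:", count_trie_nodes(trie),
          "tree:", count_tree_nodes(trie))
```

With k leaves and only branching internal nodes, the tree has fewer than 2k nodes, whereas the trie grows with the total length of the inserted suffixes.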

slide-11
SLIDE 11

Construction of Word Suffix Trees

• Algorithm by Andersson et al. (1996)
  - for a text T = T1T2…Tk of length n, constructs the word suffix tree in O(n) expected time with O(k) space.
• Our algorithm (CPM'06)
  - builds the word suffix tree in O(n) time in the worst case, with O(k) space.

slide-12
SLIDE 12

Our Construction Algorithm

• We modify Ukkonen's on-line algorithm for normal suffix tree construction by using the minimum DFA accepting the dictionary D.
  - We replace the root node of the suffix tree with the final state of the DFA.

slide-13
SLIDE 13

Minimum DFA

• The minimum DFA accepting D = Σ∗# clearly requires constant space (for fixed Σ).

[Figure: the DFA, with transitions labeled Σ and #]
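One way to see the constant-space claim is to write the automaton down. The sketch below assumes a partial DFA with two states, where state 0 is initial, state 1 is final, and any missing transition means rejection; a complete DFA would add an explicit dead state. The names `make_dictionary_dfa` and `accepts` are hypothetical.

```python
def make_dictionary_dfa(alphabet, sep="#"):
    """Transition table of a two-state partial DFA for Σ*#.

    State 0 loops on every character of the alphabet and moves to the
    final state 1 on the separator. The table has |Σ| + 1 entries.
    """
    delta = {(0, ch): 0 for ch in alphabet}
    delta[(0, sep)] = 1
    return delta


def accepts(delta, word, start=0, final=1):
    """Run the DFA on word; a missing transition rejects."""
    state = start
    for ch in word:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state == final


dfa = make_dictionary_dfa("ab")
print(accepts(dfa, "aa#"))   # True: a word of D
print(accepts(dfa, "aa"))    # False: no separator at the end
print(accepts(dfa, "a#b#"))  # False: two words, not a single element of D
```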

slide-14
SLIDE 14

On-line Construction of Word Suffix Trees

T = aa#b#
[Figure, animated over slides 14-22: step-by-step on-line construction of the word suffix tree of T, starting from the DFA with transitions labeled a,b and #]

slide-23
SLIDE 23

Pseudo-Code

[Figure: pseudo-code listing, with one part annotated "Just change here"]

slide-24
SLIDE 24

Compact Directed Acyclic Word Graphs

T = aa#b#
[Figure: the suffix tree of T and, obtained from it by minimization, the compact directed acyclic word graph (CDAWG) of T]

slide-25
SLIDE 25

Sparse CDAWGs

T = aa#b#
[Figure: the word suffix tree of T and, obtained from it by minimization, the sparse compact directed acyclic word graph (SCDAWG) of T]

slide-26
SLIDE 26

Sparse CDAWGs [cont.]

T = a#b#a#bab#
[Figure: the word suffix tree of T and the SCDAWG obtained from it by minimization]

slide-27
SLIDE 27

SCDAWG Construction

• SCDAWGs can be constructed by minimizing word suffix trees in O(k) time,
  - using Revuz's DAG minimization algorithm (1992).
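The effect of minimization, merging nodes whose subtrees are identical, can be sketched in Python. This is not Revuz's algorithm and it does not run in O(k) time: it is a simple bottom-up pass that hashes subtree signatures, and it works on the uncompacted word suffix trie rather than on the compact tree, so the result has single-character edges. The names `minimize` and `walk` are illustrative.

```python
def build_word_suffix_trie(text, sep="#"):
    """Naive word suffix trie as nested dicts."""
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    root = {}
    for start in starts:
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def minimize(trie):
    """Merge identical subtrees; return (root_id, states).

    states maps a state id to its outgoing transitions {char: state id}.
    """
    registry = {}  # subtree signature -> state id
    states = {}

    def visit(node):
        transitions = {ch: visit(child) for ch, child in node.items()}
        signature = tuple(sorted(transitions.items()))
        if signature not in registry:
            registry[signature] = len(registry)
            states[registry[signature]] = transitions
        return registry[signature]

    return visit(trie), states


def walk(states, root_id, string):
    """Follow string from the root; return the state reached, or None."""
    state = root_id
    for ch in string:
        state = states[state].get(ch)
        if state is None:
            return None
    return state


root_id, states = minimize(build_word_suffix_trie("aa#b#"))
print(len(states))  # 6 states, down from 8 trie nodes
print(walk(states, root_id, "aa#b#") == walk(states, root_id, "b#"))  # True
```

Both word suffixes now end in the same sink state, and the two occurrences of the path b# share their nodes.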

slide-28
SLIDE 28

SCDAWG Construction [cont.]

• Question: Can SCDAWGs be constructed directly?
• Answer: YES! Using the minimum DFA accepting the dictionary D, we can directly build SCDAWGs in O(n) time and O(k) space.
  - We modify the on-line CDAWG construction algorithm (Inenaga et al. 05) by using the above DFA.

slide-29
SLIDE 29

Pseudo-Code

[Figure: pseudo-code listing annotated "Just change here" and "Body is the same!!", together with diagrams for T = aa#b# annotated "Different structures"]

slide-30
SLIDE 30

Some Events

• On-line construction of SCDAWGs is basically similar to that of word suffix trees,
• except for the following two unique events:
  - Edge merging
  - Node splitting

slide-31
SLIDE 31

Edge Merging

T = a#b#a#bab#bc...
[Figure, animated over slides 31-34: an edge merging event during the on-line construction of the SCDAWG of T]

slide-35
SLIDE 35

Node Splitting

T = a#b#a#bab#bc...
[Figure, animated over slides 35-37: a node splitting event during the on-line construction of the SCDAWG of T]

slide-38
SLIDE 38

Conclusion

• We introduced a new text indexing structure, the sparse compact directed acyclic word graph (SCDAWG), for word-level pattern matching.
• We presented an on-line algorithm that constructs SCDAWGs directly, in O(n) time with O(k) space.
• The key is the use of the minimum DFA accepting the dictionary D.

slide-39
SLIDE 39

Related Work

• “Sparse Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda, accepted to SPIRE’06