SLIDE 1

Overview New Algorithm Conclusion

Succinct 2D Dictionary Matching with No Slowdown

Shoshana Neuburger and Dina Sokol

City University of New York

Shoshana Neuburger and Dina Sokol Succinct 2D Dictionary Matching with No Slowdown

SLIDE 2

Problem Definition

Dictionary Matching
Input: Dictionary D = {P1, P2, ..., Pd} containing d patterns. Text T of length n.
Output: All positions in the text at which a dictionary pattern occurs.
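To make the input/output contract concrete, here is a minimal brute-force sketch. It is only a baseline that searches for each pattern separately, not any of the algorithms discussed in the talk; the function name and the (position, pattern index) output format are assumptions made for illustration.

```python
def naive_dictionary_matching(patterns, text):
    """Return sorted (position, pattern_index) pairs for every occurrence."""
    hits = []
    for k, p in enumerate(patterns):
        start = text.find(p)
        while start != -1:
            hits.append((start, k))
            # Resume one character later so overlapping occurrences are found.
            start = text.find(p, start + 1)
    return sorted(hits)


occurrences = naive_dictionary_matching(["he", "she", "his", "hers"], "ushers")
```

Each pattern triggers its own pass over the text, which is the cost that dictionary matching algorithms are designed to avoid.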

SLIDE 3

Applications

Dictionary Matching
  • Search for specific phrases in a book
  • Scan a file for virus signatures
  • Network intrusion detection systems
  • Search a DNA sequence for a set of motifs

SLIDE 4

Small-Space 1D

In many devices, storage capacity is limited.
Goal: algorithms that are efficient in both time and space.

SLIDE 5

Small-Space 1D

In many devices, storage capacity is limited.
Goal: algorithms that are efficient in both time and space.
1D single pattern matching in linear time and O(1) working space:
  • Galil and Seiferas (1981)
  • Crochemore and Perrin (1991)
  • Rytter (2003)

SLIDE 6

Small-Space 1D

1D dictionary matching in small space:

Space (bits)                          Search Time                    Reference
O(ℓ log ℓ)                            O(n + occ)                     Aho-Corasick (1975)
O(ℓ)                                  O((n + occ) log^2 ℓ)           Chan et al. (2007)
ℓH_k(D) + o(ℓ log σ) + O(d log ℓ)     O(n(log^ε ℓ + log d) + occ)    Hon et al. (2008)
ℓ(H_0 + O(1)) + O(d log(ℓ/d))         O(n + occ)                     Belazzougui (2010)
ℓH_k(D) + O(ℓ)                        O(n + occ)                     Hon et al. (2010)
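For reference, the uncompressed baseline in the first table row can be sketched as follows. This is a plain textbook-style Aho-Corasick automaton with explicit goto, failure, and output tables; it is not one of the compressed variants in the later rows, and the function names are chosen here for illustration.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]   # goto[s][c] -> next state (trie edges)
    fail = [0]    # failure links
    out = [[]]    # indices of patterns recognized on reaching each state
    for k, p in enumerate(patterns):
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(k)
    # Breadth-first traversal: a state's failure target is always shallower,
    # so its output list is complete by the time it is inherited.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(patterns, text):
    """Return (start_position, pattern_index) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for k in out[s]:
            hits.append((i - len(patterns[k]) + 1, k))
    return hits
```

The text is read once, left to right, regardless of how many patterns the dictionary holds; the Python dictionaries and lists used here are far from the bit-level space bounds in the table.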

SLIDE 7

2D Dictionary Matching

Existing 2D dictionary matching algorithms:
  • Bird (1977) / Baker (1978)
  • Amir, Farach (1992)
  • Idury, Schaffer (1993)
All require working space proportional to dictionary size.

SLIDE 8

2D Dictionary Matching

Bird / Baker
  • Convert 2D data to 1D representation.
  • Name pattern rows.
  • Name text positions.
  • Use 1D dictionary matching to find pattern occurrences.

SLIDE 9

2D Dictionary Matching

Bird / Baker
  • Convert 2D data to 1D representation.
  • Name pattern rows.
  • Name text positions.
  • Use 1D dictionary matching to find pattern occurrences.
Text is processed once!
Our work: mimic the Bird / Baker algorithm in small space.
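The four steps above can be sketched as follows. This is a deliberately simplified illustration: dictionary lookups on substrings stand in for the Aho-Corasick automaton over pattern rows and for the 1D matching down columns, so it shows the naming idea but not the linear running time. The function name and the (row, column, pattern index) output format are assumptions.

```python
def bird_baker_simplified(patterns, text):
    """patterns: list of m x m grids (lists of m strings of length m); text: list of strings."""
    m = len(patterns[0])
    # Name pattern rows: identical rows share one name.
    names = {}
    for p in patterns:
        for row in p:
            names.setdefault(row, len(names))
    # 1D representation: each pattern becomes a tuple of m row names.
    by_names = {}
    for k, p in enumerate(patterns):
        by_names.setdefault(tuple(names[row] for row in p), []).append(k)
    # Name text positions: label[i][j] is the name of the pattern row that
    # starts at text[i][j], or -1 if no pattern row starts there.
    width = len(text[0])
    label = [[names.get(row[j:j + m], -1) for j in range(width - m + 1)]
             for row in text]
    # 1D matching down each column of labels.
    hits = []
    for i in range(len(text) - m + 1):
        for j in range(width - m + 1):
            key = tuple(label[i + r][j] for r in range(m))
            for k in by_names.get(key, []):
                hits.append((i, j, k))
    return hits
```

Note that the label table has one entry per text position, which is the space cost the small-space algorithm sets out to avoid.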

SLIDE 10

Bird / Baker Algorithm

SLIDE 11

Bird / Baker Algorithm

Pattern Preprocessing

SLIDE 12

Bird / Baker Algorithm

Text Scanning

SLIDE 13

Bird / Baker Algorithm

Text Scanning

SLIDE 14

Bird / Baker Algorithm

Text Scanning

SLIDE 15

Problem Definition

2D Dictionary Matching
Input: Dictionary of d patterns, each of size m × m. Text T of size n × n.
Output: All positions in the text at which a dictionary pattern occurs.

SLIDE 16

Preprocessing Space

Bird and Baker:
  • Aho-Corasick automaton of pattern rows.
  • O(dm^2 log dm^2) extra bits of preprocessing space.
New algorithm:
  • Compressed Aho-Corasick automaton of pattern rows.
  • Groups pattern rows into equivalence classes.
  • O(dm log dm) extra bits of preprocessing space.

SLIDE 17

Text Scanning Space

Bird and Baker:
  • Process entire text at once.
  • O(n^2 log dm) bits of space to label text.
To save space:
  • Small overlapping text blocks of size 3m/2 × 3m/2.
  • O(m^2 log dm) bits of space to label text.
  • Working space is independent of text size.
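One natural way to lay out such blocks is sketched below. It assumes m is even, n ≥ m, and that consecutive blocks start m/2 positions apart; the slides do not spell out the exact layout, so the stride and both function names are assumptions. With this stride, an occurrence whose top-left corner falls in the first m/2 rows and columns of a block lies entirely inside that block.

```python
def block_origins(n, m):
    """Top-left corners of overlapping blocks of side 3m/2 over an n x n text."""
    half = m // 2
    starts = range(0, n - m + 1, half)
    return [(r, c) for r in starts for c in starts]


def owning_block(i, j, m):
    """Origin of the block responsible for an occurrence with top-left corner (i, j)."""
    half = m // 2
    return (i - i % half, j - j % half)
```

Blocks near the right and bottom edges extend past the text and would be clipped in practice. Because each occurrence is assigned to exactly one block, blocks can be processed one at a time without reporting duplicates.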

SLIDE 18

Our Method

Compressed AC automaton [Hon et al. (2010)]:
  • Separates the three functions of the AC automaton.
  • Encodes each function differently.
  • Space complexity meets H_k(D), the kth-order empirical entropy.
Black-box replacement for AC automata in the Bird / Baker algorithm.

SLIDE 19

Dictionary Size

Using compressed AC automata in small blocks of text:
Theorem. If d > m, we can solve the 2D dictionary matching problem in linear O(dm^2 + n^2) time and ℓH_k(D) + O(ℓ) + O(dm log dm) bits of space.
ℓ is the number of states in the AC automaton of pattern rows.

SLIDE 20

Dictionary Size

Using compressed AC automata in small blocks of text:
Theorem. If d > m, we can solve the 2D dictionary matching problem in linear O(dm^2 + n^2) time and ℓH_k(D) + O(ℓ) + O(dm log dm) bits of space.
ℓ is the number of states in the AC automaton of pattern rows.
We focus on the case d < m.

SLIDE 21

1D Periodicity

Definition. A string p is periodic in u if p = u′u^k, where u′ is a suffix of u, u is primitive, and k ≥ 2.
We divide patterns into 2 groups based on 1D periodicity. Each case presents different difficulties to overcome.
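The definition can be checked with the standard KMP failure table, as in the sketch below. It assumes a non-empty string and relies on the fact that a string has the form u′u^k with k ≥ 2 and u primitive exactly when its smallest period is at most half its length. The function names are chosen here for illustration.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] wherever both positions exist."""
    fail = [0] * len(s)   # fail[i] = length of the longest proper border of s[:i + 1]
    k = 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return len(s) - fail[-1]


def is_periodic(s):
    """True if s = u'u^k with k >= 2, i.e. the smallest period is at most len(s) / 2."""
    return 2 * smallest_period(s) <= len(s)
```

For example, "abcabcab" has smallest period 3 and decomposes as u′ = "ab" followed by two copies of u = "cab".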

SLIDE 22

Types of Patterns

Case I: Patterns in which all pattern rows are periodic, with period ≤ m/4.
  • Problem: can have more candidates than the space we allow.
Case II: Patterns that contain an aperiodic row or a row with period > m/4.
  • Problem: several patterns can overlap in both directions.
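The split into the two cases can be sketched as below. The period is computed by brute force to keep the example short (a linear-time computation would normally be used), and the function names and the "I" / "II" return values are assumptions.

```python
def smallest_period(row):
    # Brute force: the first shift p at which the row matches itself.
    return next(p for p in range(1, len(row) + 1) if row[p:] == row[:len(row) - p])


def classify(pattern):
    """Return "I" if every row has period at most m/4, otherwise "II"."""
    m = len(pattern)
    if all(4 * smallest_period(row) <= m for row in pattern):
        return "I"
    return "II"
```

A single row with a long period is enough to move the whole pattern into Case II.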

SLIDE 23

Types of Patterns

Case I: Patterns in which all pattern rows are periodic, with period ≤ m/4.
  • Problem: can have more candidates than the space we allow.
  • Algorithm published in CPM 2010 for compressed data.
  • Use conjugacy of periods to group similar pattern rows in the same equivalence class.
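Two periods are conjugate when one is a cyclic rotation of the other, so rows can be grouped by a canonical rotation of their period. The sketch below uses the lexicographically least rotation as the class key and computes both the period and the rotation by brute force; the published algorithm may record more than this (for example the rotation offset), and the function names are assumptions.

```python
def smallest_period(row):
    # Brute force: the first shift p at which the row matches itself.
    return next(p for p in range(1, len(row) + 1) if row[p:] == row[:len(row) - p])


def canonical_period(row):
    """Lexicographically least rotation of the row's period."""
    u = row[:smallest_period(row)]
    return min(u[i:] + u[:i] for i in range(len(u)))


def group_rows(rows):
    """Map each canonical period to the indices of the rows that share it."""
    classes = {}
    for r, row in enumerate(rows):
        classes.setdefault(canonical_period(row), []).append(r)
    return classes
```

Rows such as "abcabcabcabc" and "bcabcabcabca" have the conjugate periods "abc" and "bca" and therefore land in the same class.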

SLIDE 24

Types of Patterns

Case I: Patterns in which all pattern rows are periodic, with period ≤ m/4.
  • Problem: can have more candidates than the space we allow.
Lemma. At most one maximal periodic substring of length ≥ m with period ≤ m/4 can occur in a text block row of size 3m/2.

SLIDE 25

Types of Patterns

Case II: Patterns that contain an aperiodic row or a row with period > m/4.
  • Problem: several patterns can overlap in both directions.

SLIDE 26

Types of Patterns

Case II: Patterns that contain an aperiodic row or a row with period > m/4.
  • Problem: several patterns can overlap in both directions.
  • Many 1D names can overlap in a text block row.
  • Identification of candidates is simpler.
  • Identify candidates with aperiodic row of each pattern.
  • Difficult to verify in single pass over text.

SLIDE 27

Pattern Preprocessing

Pattern Preprocessing:

1 Construct (compressed) AC automaton of the first aperiodic row of each pattern.
  • Store row number of each of these rows within the patterns.
2 Form a compressed AC automaton of the pattern rows.
3 Name pattern rows.
  • Index 1D patterns of names in suffix tree.
4 Construct witness tree of pattern rows.
  • Preprocess for LCA.

Time: O(dm^2)
Extra space: O(dm log m) bits

SLIDE 28

Searching Text

Text Scanning:

1 Identify candidates.
2 Eliminate inconsistent candidates.
3 Verify pattern occurrences.

SLIDE 29

Searching Text

Step 1: Identify candidates
  • 1D dictionary matching of a non-periodic row of each pattern.
  • O(dm) candidates in a text block.
  • Possibly several candidates at a single text position.

SLIDE 30

Searching Text

Text Scanning:

1 Identify candidates.
2 Eliminate inconsistent candidates.
3 Verify pattern occurrences.

SLIDE 31

Searching Text

Step 2: Eliminate inconsistent candidates in each column
Two candidates are consistent if all positions of overlap match.
Vertically consistent candidates:

  • In the same column.
  • Suffix/prefix match in 1D representations.

Overlapping segments of consistent candidates can be verified simultaneously ⇒ single pass verification.
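In the 1D representation, vertical consistency reduces to a suffix/prefix comparison on sequences of row names. A minimal sketch, assuming each candidate's pattern is given as a tuple of m row names and that `offset` is the number of rows by which the lower candidate starts below the upper one (both the function name and the parameters are assumptions):

```python
def vertically_consistent(upper, lower, offset):
    """upper, lower: tuples of m row names for two candidates in the same column."""
    m = len(upper)
    if offset >= m:
        return True   # the candidates do not overlap at all
    # The last m - offset rows of the upper pattern must equal
    # the first m - offset rows of the lower pattern.
    return upper[offset:] == lower[:m - offset]
```

This direct comparison takes time proportional to the overlap; the preprocessing described in the talk indexes the name sequences so that the same question can be answered with an LCP query instead.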

SLIDE 32

Searching Text

Step 2: Eliminate inconsistent candidates in each column
How to eliminate inconsistent candidates? Duels.
Dueling for 2D single pattern matching [Amir et al. (1994)]:
  • Store witness for all conflicting overlaps.
  • No witness ⇒ consistent candidates.
  • Duel: compare text location to witness, kill at least one candidate.
Dictionary matching: candidates for different patterns.
Too many witnesses to store? Dynamic dueling generates witnesses as needed.

SLIDE 33

Searching Text

Step 2: Eliminate inconsistent candidates in each column
  • Duels from top to bottom of rows.
  • Consistency is transitive.
  • Duel between vertically inconsistent candidates.

SLIDE 34

Searching Text

Step 2: Eliminate inconsistent candidates in each column

SLIDE 35

Searching Text

Step 2: Eliminate inconsistent candidates in each column

SLIDE 36

Searching Text

Step 2: Eliminate inconsistent candidates in each column

SLIDE 37

Searching Text

Step 2: Eliminate inconsistent candidates in each column

SLIDE 38

Searching Text

Step 2: Eliminate inconsistent candidates in each column

  • If last candidate wins duel

SLIDE 39

Searching Text

Step 2: Eliminate inconsistent candidates in each column

  • If new candidate wins duel

SLIDE 40

Searching Text

Step 2: Eliminate inconsistent candidates in each column
How to duel between candidates?

SLIDE 41

Searching Text

Step 2: Eliminate inconsistent candidates in each column
How to duel between candidates?

1 Consider 1D representation of pattern names.
  • Compute LCP of suffixes to find a row-witness.
2 Generate witness between row names.
  • LCA query in witness tree.
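The two steps can be illustrated with direct scans standing in for the indexed queries: a loop over the overlapping rows replaces the LCP query on name sequences, and a character-by-character comparison replaces the LCA query in the witness tree. The sketch works on the pattern rows themselves rather than on their names, and the function names, arguments, and 'a' / 'b' labels are assumptions.

```python
def find_witness(pat_a, pat_b, offset):
    """Candidate b starts `offset` rows below candidate a, in the same column.

    Return (row, col), relative to pat_b, of a position where the overlap
    disagrees, or None if the two candidates are consistent.
    """
    m = len(pat_a)
    for r in range(m - offset):                  # stand-in for the LCP query
        row_a, row_b = pat_a[offset + r], pat_b[r]
        if row_a != row_b:
            # Stand-in for the LCA query: first column where the rows differ.
            col = next(c for c in range(m) if row_a[c] != row_b[c])
            return r, col
    return None


def duel(text, top, left, pat_a, pat_b, offset):
    """Return the labels of the candidates that survive one duel.

    Candidate a has its top-left corner at text[top][left].
    """
    witness = find_witness(pat_a, pat_b, offset)
    if witness is None:
        return {"a", "b"}
    r, c = witness
    ch = text[top + offset + r][left + c]
    survivors = set()
    if ch == pat_a[offset + r][c]:
        survivors.add("a")
    if ch == pat_b[r][c]:
        survivors.add("b")
    return survivors
```

Because the two patterns disagree at the witness position, a single text character comparison eliminates at least one of the two candidates, and possibly both.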

SLIDE 42

Searching Text

Step 2: Eliminate inconsistent candidates in each column
How to generate witness?

SLIDE 43

Searching Text

Text Scanning:

1 Identify candidates.
2 Eliminate inconsistent candidates.
3 Verify pattern occurrences.

SLIDE 44

Searching Text

Step 3: Verify pattern occurrences.
  • Limited to vertically consistent candidates.
  • Single scan of text block.
  • Process one row at a time.
  • Mark text positions that expect pattern row.
  • Verify with compressed AC automaton of pattern rows.
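The row-by-row scan can be sketched as below. Direct string comparison stands in for the compressed AC automaton, and each candidate is checked independently, so the sketch does not share work between overlapping consistent candidates the way the single-pass verification does. The function name and the (row, column, pattern index) candidate format are assumptions.

```python
def verify(block, patterns, candidates):
    """Return the candidates (row, col, pattern_index) confirmed inside the text block."""
    m = len(patterns[0])
    # Keep only candidates that fit entirely inside the block.
    alive = {(r, c, k) for (r, c, k) in candidates
             if r + m <= len(block) and c + m <= len(block[0])}
    for i, text_row in enumerate(block):          # process one text row at a time
        for (r, c, k) in list(alive):
            # Text row i is expected to hold row i - r of pattern k at column c.
            if r <= i < r + m and text_row[c:c + m] != patterns[k][i - r]:
                alive.discard((r, c, k))
    return sorted(alive)
```

A candidate is dropped at the first row that fails to match, and whatever remains after the last row is reported as an occurrence.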

SLIDE 45

Searching Text

Text Scanning:

1 Identify candidates.
2 Eliminate inconsistent candidates.
3 Verify pattern occurrences.

Time: O(m^2), linear in the size of the text block
Extra space: O(dm log m) bits

SLIDE 46

Summary

New approach to 2D dictionary matching in small space:
  • Time complexity is linear.
  • Preprocess dictionary in time proportional to dictionary size.
  • Store dictionary in entropy-compressed space.
  • Scan text in time proportional to text size, independent of dictionary size.
  • Overall, O(dm log dm) extra bits of space.

SLIDE 47

Thank you!
