Fully compressed pattern matching by recompression, Artur Jeż (PowerPoint presentation)

SLIDE 1

Fully compressed pattern matching by recompression

Artur Jeż University of Wrocław 9 VII 2012

Artur Jeż FCPM by recompression 9 VII 2012 1 / 18

SLIDE 5

SLP

Definition (SLP: Straight Line Programme)

A CFG generating exactly one word, with rules $X_i \to X_j X_k$ or $X_i \to a$.

Example

$X_0 = a$, $X_1 = b$, $X_{n+1} = X_n X_{n-1}$: a, b, ba, bab, babba, babbabab, ...

Relations to LZ and LZW

• LZW: rules $X_i \to a X_j$, the text is $X_1 X_2 X_3 \ldots$
• LZ: going from LZ to SLP increases the size from $n$ to $O(n \log(N/n))$
• many algorithms for SLPs
• CPM for LZ [Gawrychowski, ESA'11]
• in theory (word equations, equations in groups, verification, ...)
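The SLP example above can be sketched in a few lines of Python. This is only an illustration: the dictionary representation and the helper names `fibonacci_slp`, `length` and `expand` are assumptions made here, not part of the talk. It shows that the length of the generated word can be computed from the rules alone, while the word itself grows exponentially in the number of rules.

```python
def fibonacci_slp(k):
    """Rules for X_0..X_k: X_0 -> a, X_1 -> b, X_{n+1} -> X_n X_{n-1}.

    A rule body is either a single letter (str) or a pair of rule indices.
    """
    rules = {0: "a", 1: "b"}
    for n in range(1, k):
        rules[n + 1] = (n, n - 1)
    return rules


def length(rules, i, memo=None):
    """Length of the word generated by X_i, without decompressing it."""
    memo = {} if memo is None else memo
    if i not in memo:
        rhs = rules[i]
        if isinstance(rhs, str):
            memo[i] = 1
        else:
            memo[i] = sum(length(rules, j, memo) for j in rhs)
    return memo[i]


def expand(rules, i):
    """Fully decompress X_i (exponential in general, fine for small i)."""
    rhs = rules[i]
    if isinstance(rhs, str):
        return rhs
    return "".join(expand(rules, j) for j in rhs)


if __name__ == "__main__":
    slp = fibonacci_slp(5)
    print([expand(slp, i) for i in range(6)])
    print(length(fibonacci_slp(60), 60))  # 61 rules, a very long word
```

Here `length` touches each rule once, which is the kind of work on the grammar (rather than on the decompressed word) that algorithms for SLPs aim for.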

SLIDE 8

This talk

Definition (CPM, FCPM)

• Compressed pattern matching (CPM): the text is compressed, the pattern is not.
• Fully compressed pattern matching (FCPM): both the text and the pattern are compressed.

Results

An $O((n + m) \log M)$ algorithm for FCPM for SLPs, where $n$ and $m$ are the sizes of the SLPs for the text and the pattern and $M$ is the length of the decompressed pattern. (Previously: $O(nm^2)$ [Lifshits, CPM'07].)

Different approach

A new technique: recompression.

• decompresses the text and the pattern
• compresses them again (in the same way)
• in the end, the pattern is a single symbol

SLIDE 10

Technique

Where it comes from

Mehlhorn, Gawry

Applicable to

• Fully Compressed Membership Problem [∈ NP]
• Word equations [alternative PSPACE algorithm]
• Fully Compressed Pattern Matching [SLPs, LZ, $O((n + m) \log M \log(n + m))$]
• construction of a grammar for a string [alternative $\log(N/n)$ approximation algorithm]
• Other?

SLIDE 18

Example

Equality of strings

How to test equality of strings?

a a a a b b a b c a b b a b c a b
a a a a b b a b c a b b a b c a b

[Animation over slides 11 to 18: the same replacements are applied to both strings. Blocks of a repeated letter become single symbols (a3, b2) and pairs of letters become fresh letters (d, e). The last cleanly recoverable intermediate row, identical for both strings, is: a3 a b b a b c a b2 a b c a b]

Iterate!

SLIDE 20

How to generalise?

Idea

For both strings:

• replace pairs of letters
• replace (maximal) blocks of the same letter

When every letter is compressed, the length is halved in one iteration.

TODO

• formalise
• for SLPs
• for pattern matching
• running time

SLIDE 26

Formalisation

In one phase

L ← list of letters, P ← list of pairs of letters
for every letter a ∈ L do
    replace (maximal) blocks $a^\ell$ with $a_\ell$
for every pair of letters ab ∈ P do
    replace pairs ab with c

It will shorten the strings by a constant factor.
Loop while nontrivial ($O(\log M)$ iterations).
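The phase above, applied to explicit (uncompressed) strings, might look roughly as follows. This is a minimal sketch under stated assumptions: the function names are invented here, fresh letters are integers drawn from a shared counter, and the strings are held in memory, which the actual algorithm avoids by working on the SLPs. Because every replacement is applied to both strings in the same way and can be undone, the strings are equal exactly when their compressed forms are equal.

```python
from itertools import count


def compress_blocks(word, names, fresh):
    """Replace each maximal block a^l (l >= 2) by one letter named a_l."""
    out, i = [], 0
    while i < len(word):
        j = i
        while j < len(word) and word[j] == word[i]:
            j += 1
        if j - i > 1:
            key = (word[i], j - i)
            if key not in names:
                names[key] = next(fresh)
            out.append(names[key])
        else:
            out.append(word[i])
        i = j
    return out


def replace_pair(word, a, b, c):
    """Replace every occurrence of the pair ab (a != b) by the letter c."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(c)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def phase(words, fresh):
    """One phase: block compression, then compression of each listed pair."""
    # P: pairs of different letters present at the start of the phase.
    pairs = list(dict.fromkeys(
        (w[i], w[i + 1])
        for w in words for i in range(len(w) - 1) if w[i] != w[i + 1]))
    names = {}
    words = [compress_blocks(w, names, fresh) for w in words]
    for a, b in pairs:
        c = next(fresh)
        words = [replace_pair(w, a, b, c) for w in words]
    return words


def equal_by_recompression(s, t):
    """Test equality by compressing both strings in the same way."""
    fresh = count()
    u, v = list(s), list(t)
    while len(u) > 1 and len(v) > 1:
        u, v = phase([u, v], fresh)
    return u == v
```

For example, `equal_by_recompression("aaaabbabcabbabcab", "aaaabbabcabbabcab")` returns `True`, and changing one letter in either argument makes it return `False`.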

SLIDE 29

SLPs

Grammar form

More general rules: $X_i \to u X_j v X_k w$, with $j, k < i$.

Lemma

There are at most $|G| + 4n$ different lengths of maximal blocks in G.

Proof.

• blocks contained in explicit words: assign to explicit letters
• blocks not contained in explicit words: at most 4 per rule

Lemma

There are at most $|G| + 4n$ different pairs of letters in G.

SLIDE 38

Blocks compression

Compression of a

• $X_1 \to baaba$, $X_2 \to aaX_1baX_1baa$ (no problem)
• $X_1 \to a$, $X_2 \to aX_1aX_1a$ (problem)
• $X_1 \to abaaba$, $X_2 \to aX_1aX_1a$ (problem)

Definition (Crossing block)

a has a crossing block if one of its maximal blocks is contained in $X_i$ but not in the explicit words of $X_i$'s rule.

When a has no crossing block

for all maximal blocks $a^\ell$ of a do
    let $a_\ell \in \Sigma$ be an unused letter
    replace each explicit maximal $a^\ell$ in the rules' bodies by $a_\ell$

SLIDE 43

What about crossing blocks?

Idea

• change the rules: when $X_i$ defines $a^{\ell_i} w a^{r_i}$, make it define $w$
• replace $X_i$ in rules by $a^{\ell_i} X_i a^{r_i}$

CutPrefSuff(a)

for i ← 1 to n do
    calculate and remove the a-prefix $a^{\ell_i}$ and the a-suffix $a^{r_i}$ of $X_i$
    replace each $X_i$ in the rules' bodies by $a^{\ell_i} X_i a^{r_i}$

Lemma

After CutPrefSuff(a), letter a has no crossing block.

So a's blocks can be easily compressed. In parallel for many letters!

SLIDE 44

What about crossing blocks?

Idea

• change the rules: when $X_i$ defines $a^{\ell_i} w b^{r_i}$, make it define $w$
• replace $X_i$ in rules by $a^{\ell_i} X_i b^{r_i}$

CutPrefSuff

for i ← 1 to n do
    let $X_i$ begin with a and end with b
    calculate and remove the a-prefix $a^{\ell_i}$ and the b-suffix $b^{r_i}$ of $X_i$
    replace each $X_i$ in the rules' bodies by $a^{\ell_i} X_i b^{r_i}$

Lemma

After CutPrefSuff, no letter has a crossing block. So all blocks can be easily compressed.
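A rough sketch of CutPrefSuff on a small grammar is given below. The representation is an assumption made for illustration: rules are indexed from 0, a rule body is a list whose items are either a run `(letter, count)` or an integer `j` standing for the nonterminal $X_j$, and the helper names are hypothetical. Runs are stored with their counts because the popped prefixes and suffixes can be exponentially long. The sketch also leaves the last rule (the start symbol) unpopped, since nothing refers to it.

```python
def normalize(items):
    """Merge adjacent runs of the same letter and drop empty runs."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            letter, cnt = item
            if cnt == 0:
                continue
            if out and isinstance(out[-1], tuple) and out[-1][0] == letter:
                out[-1] = (letter, out[-1][1] + cnt)
                continue
        out.append(item)
    return out


def make_body(*parts):
    """Build a rule body from strings (explicit letters) and ints (X_j)."""
    items = []
    for part in parts:
        if isinstance(part, int):
            items.append(part)
        else:
            items.extend((ch, 1) for ch in part)
    return normalize(items)


def cut_pref_suff(rules):
    """Pop the prefix and suffix block of every X_i into the rules using it."""
    n = len(rules)
    pref, suff, empty = [None] * n, [None] * n, [False] * n
    result = []
    for i, rhs in enumerate(rules):
        expanded = []
        for item in rhs:
            if isinstance(item, int):
                # X_j was shortened earlier: re-insert its popped blocks.
                if pref[item] is not None:
                    expanded.append(pref[item])
                if not empty[item]:
                    expanded.append(item)
                if suff[item] is not None:
                    expanded.append(suff[item])
            else:
                expanded.append(item)
        new = normalize(expanded)
        if i < n - 1:  # assumption: the start symbol is left untouched
            if new and isinstance(new[0], tuple):
                pref[i] = new.pop(0)
            if new and isinstance(new[-1], tuple):
                suff[i] = new.pop()
            empty[i] = not new
        result.append(new)
    return result


def expand(rules, i):
    """Decompress X_i; only for checking small examples."""
    return "".join(expand(rules, x) if isinstance(x, int) else x[0] * x[1]
                   for x in rules[i])
```

On the problematic example $X_1 \to abaaba$, $X_2 \to aX_1aX_1a$ (written 0-indexed as `[make_body("abaaba"), make_body("a", 0, "a", 0, "a")]`), the first rule becomes `baab` and the second becomes `aa X aaa X aa`, so every maximal block of a is now explicit in a rule body and can be replaced directly.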

SLIDE 48

Pair compression

$X_1 \to ababcab$, $X_2 \to abcbX_1abX_1a$

• compression of ab: easy
• compression of ba: problem
• pairs may overlap (problem: they must be handled sequentially, not in parallel)

SLIDE 51

Crossing pairs

When ab has a 'crossing' appearance: $aX_i$ (with $X_i$ beginning with b) or $X_ib$ (with $X_i$ ending with a)

• $X_i$ defines $bw$: make it define $w$, replace $X_i$ by $bX_i$
• symmetrically for ending a

LeftPop(b)

for i = 1 to n do
    if the first symbol in $X_i \to \alpha$ is b then
        remove this b
        replace $X_i$ in productions by $bX_i$

Lemma

After LeftPop(b) and RightPop(a), the pair ab is no longer crossing.

Can be done in parallel!

SLIDE 52

Crossing pairs

When $ab \in \Sigma_1\Sigma_2$ has a crossing appearance: $aX_i$ or $X_ib$

• $X_i$ defines $bw$: make it define $w$, replace $X_i$ by $bX_i$
• symmetrically for ending a

LeftPop

for i = 1 to n do
    if the first symbol in $X_i \to \alpha$ is $b \in \Sigma_2$ then
        remove this b
        replace $X_i$ in productions by $bX_i$

Lemma

After LeftPop and RightPop, the pairs from $\Sigma_1\Sigma_2$ are no longer crossing.

SLIDE 56

Running time

• blocks compression: $O(|G|)$ time
• non-crossing pairs: $O(|G|)$ time
• crossing pairs: $O(n + m)$ time per partition $(\Sigma_1, \Sigma_2)$

Lemma

There are $O(n + m)$ crossing pairs.

• crossing pairs: $O((n + m)^2)$ time

Running time

Running time: $O(|G| + (n + m)^2)$.

SLIDE 57

Shortening of the string

• consider a pair ab in the text
• if a = b: it is compressed
• if a ≠ b: it is compressed unless a or b was compressed already
• consider four consecutive symbols: something in them is compressed
• the text compresses by a constant factor in each phase
• $O(\log M)$ phases

SLIDE 58

Overall running time and grammar size

Grammar size

• in each phase the size of the grammar increases by $O((n + m)^2)$
  ◮ CutPrefSuff
  ◮ LeftPop, RightPop
• shortening of G: the same analysis as for the pattern
  ◮ shortens by a constant factor in a phase
• $|G|$ is $O((n + m)^2)$
• running time is $O((n + m)^2 \log M)$
• can be reduced to $O((n + m) \log M)$

SLIDE 60

Turning to the pattern matching

Problem with the ends

• text: abababab, pattern: baba, compression of ab
• text: abababab, pattern: aba, compression of ab
• text: aaaaaaaa, pattern: aaa, compression of a blocks

Fixing the ends

• compress the starting and ending pair, if possible (so ba in the first case)
• not possible when the first and last letter are the same, say a
• replace the leading a by $a_L$, the ending a by $a_R$
• spawn a into $a_R a_L$
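The problem with the ends can be reproduced on the first two examples with a short script. It is only an illustration on explicit strings, with helper names chosen here: compressing ab destroys all occurrences of baba, while compressing ba (the pattern's starting and ending pair) preserves them. For the pattern aba, whose first and last letters coincide, neither choice keeps all occurrences, which is the case the $a_L$, $a_R$ fix is meant for.

```python
def replace_pair(word, a, b, c):
    """Replace every occurrence of the pair ab (a != b) by the letter c."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(c)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def occurrences(text, pattern):
    """Number of positions at which pattern occurs in text."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def after_compression(text, pattern, a, b):
    """Occurrences left after compressing the pair ab in text and pattern."""
    t = replace_pair(list(text), a, b, "c")
    p = replace_pair(list(pattern), a, b, "c")
    return occurrences(t, p)


if __name__ == "__main__":
    text = "abababab"
    for pattern in ("baba", "aba"):
        print(pattern,
              occurrences(list(text), list(pattern)),
              after_compression(text, pattern, "a", "b"),
              after_compression(text, pattern, "b", "a"))
```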

SLIDE 61

Questions? Other applications?