Smallest grammar by recompression, Artur Jeż, Max Planck Institute for Informatics



slide-1
SLIDE 1

Smallest grammar by recompression

Artur Jeż

Max Planck Institute for Informatics

17.06.2013

slide-5
SLIDE 5

Grammar based-compression

Represent w as a CFG generating it.

Advantages

• it is usually small (at most quadratic vs. LZ)
• compression is fast
• the compression can be exponential on good data
• extracts hierarchical structure
• it is easy to work on
• related to LZW and LZ

17.06.2013 2/17

slide-7
SLIDE 7

Smallest grammar

Problem
Given w, return a smallest CFG Gw such that L(Gw) = {w}.
With at most a constant-factor (O(1)) increase in size, this is an SLP.

Definition (SLP: Straight Line Programme)

CFG with
• ordered nonterminals X1, X2, ...
• Chomsky normal form
• for Xi → XjXk we have j, k < i

17.06.2013 3/17
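A minimal sketch of the SLP definition, assuming a hypothetical encoding in which rule i is either a single terminal or a pair (j, k) of earlier nonterminal indices. The names `expand` and `slp` are illustrative only and do not come from the slides.

```python
from typing import Dict, Tuple, Union

# Hypothetical encoding: X_i -> terminal, or X_i -> X_j X_k given as (j, k).
Rule = Union[str, Tuple[int, int]]


def expand(slp: Dict[int, Rule], i: int) -> str:
    """Return the word generated by nonterminal X_i."""
    words: Dict[int, str] = {}
    for idx in sorted(slp):
        if idx > i:
            break
        rule = slp[idx]
        if isinstance(rule, str):
            words[idx] = rule                      # X_idx -> a
        else:
            j, k = rule
            assert j < idx and k < idx, "SLP ordering condition j, k < i"
            words[idx] = words[j] + words[k]       # X_idx -> X_j X_k
    return words[i]


# X1 -> a, X2 -> b, X3 -> X1 X2, X4 -> X3 X3, X5 -> X4 X4
slp = {1: "a", 2: "b", 3: (1, 2), 4: (3, 3), 5: (4, 4)}
print(expand(slp, 5))
```

Five rules generate a word of length eight here; each further doubling rule doubles the length, which is the sense in which grammars can compress exponentially.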

slide-12
SLIDE 12

What is known

Best approximation ratio

O(log(n/g)), where g is the size of the optimal grammar.
• Rytter
– represent w as LZ, size ℓ ≤ g
– translation of LZ into SLP, size O(ℓ log(n/ℓ)) ≤ O(g log(n/g))
– the intermediate grammar is balanced (AVL-type condition)
• Charikar et al.
– similar to Rytter
– different balance criterion (length of word)
• Sakamoto
– local replacement rules (plus a global partition): pairs and blocks
– analysis vs LZ
• Linear time.

17.06.2013 4/17

slide-17
SLIDE 17

This talk

Very simple linear-time algorithm, O(log(n/g)) approximation.
• analysis in the recompression framework, vs. SLP
– very robust
– good: easier to show better approximation?
– bad: might be in fact larger
• not balanced
– good: easier to show approximation?
– bad: worse for further processing
• height O(log n), when $a^\ell$ has height 1
• Algorithm similar to Sakamoto, different analysis.

17.06.2013 5/17

slide-18
SLIDE 18

Example

a a a a b b a b c a b b a b c a b

17.06.2013 6/17

slide-20
SLIDE 20

Example

$a_3$ a b b a b c a b b a b c a b

Rules: $a_3 \to a^3$

17.06.2013 6/17

slide-21
SLIDE 21

Example

$a_3$ a b b a b c a $b_2$ a b c a b

Rules: $a_3 \to a^3$, $b_2 \to b^2$

17.06.2013 6/17

slide-22
SLIDE 22

Example

$a_3$ b c a $b_2$ c a b d d d

Rules: $a_3 \to a^3$, $b_2 \to b^2$, d → ab

17.06.2013 6/17

slide-25
SLIDE 25

Example

$a_3$ b c a $b_2$ c e d d d

Rules: $a_3 \to a^3$, $b_2 \to b^2$, d → ab, e → ba

Intuition
• Phases: compress only the pairs and blocks present at the beginning of a phase.
• Treat nonterminals as letters.
• To speed up, compress several pairs simultaneously (partition Σ into Σℓ and Σr, compress the pairs from ΣℓΣr).

17.06.2013 6/17

slide-30
SLIDE 30

Algorithm

while |T| > 1 do
    L ← list of letters in T
    for each a ∈ L do                          ⊲ Block compression
        compress maximal blocks of a           ⊲ O(|T|)
    P ← list of pairs
    find partition of Σ into Σℓ and Σr         ⊲ Try to maximize the occurrences from ΣℓΣr in T.
    for ab ∈ P ∩ ΣℓΣr do                       ⊲ These pairs do not overlap
        compress pair ab                       ⊲ Pair compression
return the constructed grammar

17.06.2013 7/17
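The loop above can be sketched in Python as follows. This is an illustration under stated simplifications, not the talk's implementation: it uses hashing instead of RadixSort (so it does not claim the linear-time bound), stores a block rule as a plain repetition of the letter, and uses one possible greedy partition. The names `partition`, `recompress` and `expand`, and the encoding of nonterminals as `("N", k)` tuples, are assumptions made for this example; the input is assumed non-empty.

```python
from collections import Counter, defaultdict
from itertools import groupby


def partition(T):
    """Greedy split of the letters of T into (left, right)."""
    neighbours = defaultdict(Counter)      # neighbours[a][b] = occurrences of ab or ba
    for a, b in zip(T, T[1:]):
        neighbours[a][b] += 1
        neighbours[b][a] += 1
    left, right = set(), set()
    for a in list(neighbours):
        to_left = sum(c for b, c in neighbours[a].items() if b in left)
        to_right = sum(c for b, c in neighbours[a].items() if b in right)
        # Put a on the side that separates it from more already-placed neighbours.
        (left if to_right >= to_left else right).add(a)
    forward = sum(a in left and b in right for a, b in zip(T, T[1:]))
    backward = sum(a in right and b in left for a, b in zip(T, T[1:]))
    return (left, right) if forward >= backward else (right, left)


def recompress(text):
    """Return (start_symbol, rules) of a grammar generating the non-empty `text`."""
    T, rules, fresh = list(text), {}, 0

    def new_symbol(rhs):
        nonlocal fresh
        fresh += 1
        rules[("N", fresh)] = tuple(rhs)
        return ("N", fresh)

    while len(T) > 1:
        # Block compression: replace each maximal block a^l (l > 1) by one letter.
        names, out = {}, []
        for letter, run in groupby(T):
            length = len(list(run))
            if length == 1:
                out.append(letter)
            else:
                key = (letter, length)
                if key not in names:
                    names[key] = new_symbol([letter] * length)
                out.append(names[key])
        T = out
        if len(T) == 1:
            break
        # Pair compression: pairs from left x right cannot overlap.
        left, right = partition(T)
        names, out, i = {}, [], 0
        while i < len(T):
            if i + 1 < len(T) and T[i] in left and T[i + 1] in right:
                key = (T[i], T[i + 1])
                if key not in names:
                    names[key] = new_symbol(key)
                out.append(names[key])
                i += 2
            else:
                out.append(T[i])
                i += 1
        T = out
    return T[0], rules


def expand(symbol, rules):
    """Decompress: the word generated by `symbol`."""
    if symbol not in rules:
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


start, rules = recompress("aaaabbabcabbabcab")
print(expand(start, rules) == "aaaabbabcabbabcab")
```

After block compression no two neighbouring letters are equal, so the greedy partition always covers at least one pair and every phase shortens T.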

slide-32
SLIDE 32

Partition

1/4 appearances covered

A partition of Σ into Σℓ and Σr such that 1/4 of the pair appearances in T are covered (come from ΣℓΣr).
• After block compression aa does not appear.
• Random partition: 1/4 of the pairs can be covered.
• derandomise (expected value)
• we need the number of appearances of each ab: RadixSort, O(|T|).

17.06.2013 8/17
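The 1/4 bound for a random partition can be written out as a short expectation argument. The notation $T = t_1 t_2 \cdots t_m$ and the assumption that each letter is put into $\Sigma_\ell$ or $\Sigma_r$ independently with probability 1/2 are introduced here for illustration.

```latex
% After block compression t_i \neq t_{i+1}, so the two choices are independent:
\Pr\bigl[t_i \in \Sigma_\ell \wedge t_{i+1} \in \Sigma_r\bigr]
  = \tfrac12 \cdot \tfrac12 = \tfrac14 .
% Linearity of expectation over the m-1 positions:
\mathbb{E}\bigl[\#\{\, i : t_i t_{i+1} \in \Sigma_\ell \Sigma_r \,\}\bigr]
  = \sum_{i=1}^{m-1} \tfrac14 = \frac{m-1}{4} .
```

Since the expectation is $(m-1)/4$, some partition covers at least that many pair appearances; the derandomisation mentioned above finds such a partition deterministically.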

slide-36
SLIDE 36

Size reduction

Size drop

Consider every two consecutive letters ab in T. For 1/4 of them one letter is compressed in a phase.
– if a = b: it is compressed (block compression)
– if a ≠ b: 1/4 of those pairs is in ΣℓΣr

When we consider ab we replace it, unless one letter was already replaced.

Length drops by a constant factor.

Towards running time

It is enough to show that one round runs in O(|T|): the lengths of T in consecutive rounds form a geometric series, so the total is O(n).

17.06.2013 9/17

slide-37
SLIDE 37

Running time

Partition

O(|T|) time.

Block compression

By RadixSort, O(|T|) time.

Pair compression

By RadixSort, O(|T|) time.

17.06.2013 10/17
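A hedged sketch of how RadixSort yields the number of appearances of each pair in O(|T|) time. It assumes the letters are integers in `range(sigma)` with `sigma` of size O(|T|); the function names `counting_sort` and `count_pairs` are hypothetical and not taken from the talk.

```python
def counting_sort(items, key, universe):
    """Stable counting sort of `items` by key(item), keys in range(universe)."""
    buckets = [[] for _ in range(universe)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def count_pairs(T, sigma):
    """Return [((a, b), appearances)] for all consecutive pairs ab in T."""
    pairs = list(zip(T, T[1:]))
    # Two stable passes (least significant letter first) sort the pairs.
    pairs = counting_sort(pairs, lambda p: p[1], sigma)
    pairs = counting_sort(pairs, lambda p: p[0], sigma)
    counts = []
    for pair in pairs:                     # equal pairs are now adjacent
        if counts and counts[-1][0] == pair:
            counts[-1][1] += 1
        else:
            counts.append([pair, 1])
    return [(pair, n) for pair, n in counts]


print(count_pairs([0, 1, 0, 1, 2, 0, 1], 3))
```

Both passes and the final scan take O(|T| + sigma) time, which is O(|T|) under the stated assumption on the alphabet.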

slide-43
SLIDE 43

Number of nonterminals

Representation cost

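• when c replaces ab we add rule c → ab, representation cost 1
• when blocks $a^{\ell_1}, a^{\ell_2}, \dots, a^{\ell_k}$ are replaced with letters $a_{\ell_1}, a_{\ell_2}, \dots, a_{\ell_k}$ ($\ell_1 < \ell_2 < \dots < \ell_k$):
– first represent $a^{\ell_2-\ell_1}, a^{\ell_3-\ell_2}, \dots, a^{\ell_k-\ell_{k-1}}$ as $a_{\ell_2-\ell_1}, a_{\ell_3-\ell_2}, \dots, a_{\ell_k-\ell_{k-1}}$
– do this by binary expansion (make new rules $a_2 \to aa$, $a_4 \to a_2 a_2$, $a_8 \to a_4 a_4$, ...)
– $a_{\ell_{i+1}} \to a_{\ell_{i+1}-\ell_i}\, a_{\ell_i}$
– representation cost $O\left(\sum_{i=1}^{k-1} \log(\ell_{i+1} - \ell_i)\right)$

17.06.2013 11/17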
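The block rules can be sketched as below. This is an illustrative construction, not the talk's code: the nonterminal deriving $a^n$ is encoded as the tuple `(letter, n)`, the shortest block is built directly by binary expansion (as if $\ell_0 = 0$, an assumption of this example), and the names `block_rules` and `expand` are hypothetical.

```python
def block_rules(letter, lengths):
    """Binary rules for letter^l, for every l in `lengths` (all l >= 1)."""
    rules = {}

    def power(n):
        """Return a symbol deriving letter^n, adding rules by binary expansion."""
        if n == 1:
            return letter
        name = (letter, n)
        if name not in rules:
            if n & (n - 1) == 0:                     # n is a power of two
                half = power(n // 2)
                rules[name] = (half, half)           # a_n -> a_{n/2} a_{n/2}
            else:                                    # split off the highest bit
                high = 1 << (n.bit_length() - 1)
                rules[name] = (power(high), power(n - high))
        return name

    previous = 0
    for length in sorted(set(lengths)):
        if previous == 0:
            power(length)                            # shortest block
        else:
            diff = power(length - previous)
            # a_{l_{i+1}} -> a_{l_{i+1} - l_i} a_{l_i}
            rules[(letter, length)] = (diff, power(previous))
        previous = length
    return rules


def expand(symbol, rules):
    """The word generated by `symbol`."""
    if symbol not in rules:
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


rules = block_rules("a", [3, 5, 12])
print(len(expand(("a", 12), rules)))
```

Each difference d costs O(log d) new rules, and the powers of two are shared between differences, which is what the sum in the representation cost counts.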

slide-50
SLIDE 50

Analysis outline

• We begin with a G generating T (thought experiment)
• at each moment we keep G generating the current T
– we apply the compression to G
– it is changed so that this can be done
• representation cost is calculated using G
• G is of a more general form: Xi → uXjvXkw
• explicit letters have credit
• representation cost is paid by released credit:
– ab is replaced by c
– we need 1 representation cost
– each ab in G is replaced with c, 1 credit is released
– (bit more tricky for blocks)
• we only need to count the amount of created credit

17.06.2013 12/17

slide-55
SLIDE 55

Pair compression

X1 → ababcab, X2 → abcbX1abX1a
• compression of ab: easy
• compression of ba: problem

Definition (Non-crossing pairs)

ab is a non-crossing pair iff none of the below happens:
• aX appears in a rule, X begins with b
• Xb appears in a rule, X ends with a

When each pair from ΣℓΣr is non-crossing, replace all those pairs in G (no new credit).

17.06.2013 13/17
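A small sketch that checks the definition on the grammar from this slide. The encoding (a dict from nonterminal name to a list of symbols, listed so that each rule only uses earlier nonterminals, every word non-empty) and the names `first_last` and `crossing_pairs` are assumptions of this example. It also reports the case of two adjacent nonterminals, which the two conditions above do not list.

```python
def first_last(grammar):
    """Map each nonterminal to (first letter, last letter) of its word."""
    ends = {}
    for name, rhs in grammar.items():
        head, tail = rhs[0], rhs[-1]
        first = ends[head][0] if head in grammar else head
        last = ends[tail][1] if tail in grammar else tail
        ends[name] = (first, last)
    return ends


def crossing_pairs(grammar):
    """Return the set of pairs (a, b) that have a crossing appearance."""
    ends = first_last(grammar)
    crossing = set()
    for rhs in grammar.values():
        for s, t in zip(rhs, rhs[1:]):
            if s in grammar or t in grammar:      # a nonterminal is involved
                a = ends[s][1] if s in grammar else s
                b = ends[t][0] if t in grammar else t
                crossing.add((a, b))
    return crossing


grammar = {
    "X1": list("ababcab"),
    "X2": list("abcb") + ["X1"] + list("ab") + ["X1"] + ["a"],
}
print(crossing_pairs(grammar))
```

On this grammar only ba is crossing (from bX1 and X1a), matching the remark that compressing ab is easy while compressing ba is a problem.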

slide-58
SLIDE 58

Making pairs non-crossing

When ab has a crossing appearance: aXi or Xib
• Xi defines bw: change it to w, replace Xi by bXi
• symmetrically for ending a

LeftPop(b)

for i ← 1 .. g − 1 do
    if the first symbol in Xi → α is b then
        remove this b
        replace Xi in productions by bXi

Lemma

After LeftPop(b) and RightPop(a) the pair ab is non-crossing. Can be done in parallel for all ab ∈ ΣℓΣr.

17.06.2013 14/17

slide-60
SLIDE 60

Making pairs non-crossing

When ab ∈ ΣℓΣr has a crossing appearance: aXi or Xib
• Xi defines bw: change it to w, replace Xi by bXi
• symmetrically for ending a

LeftPop

for i ← 1 .. g − 1 do
    if the first symbol in Xi → α is b ∈ Σr then
        remove this b
        replace Xi in productions by bXi

Lemma

After LeftPop and RightPop the pairs from ΣℓΣr are non-crossing. Can be done in parallel for all ab ∈ ΣℓΣr.

Credit increases by O(g).

17.06.2013 14/17
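LeftPop and RightPop for a whole partition can be sketched together as follows. The grammar encoding (a dict from nonterminal name to a list of symbols, ordered so that rules use only earlier nonterminals, the last entry being the start symbol) and the names `pop_letters` and `expand` are assumptions made for this example.

```python
def pop_letters(grammar, left, right):
    """Pop leading letters from `right` and trailing letters from `left`."""
    names = list(grammar)
    rules = {name: list(rhs) for name, rhs in grammar.items()}
    for idx, name in enumerate(names[:-1]):       # the start symbol is not popped
        rhs = rules[name]
        prefix, suffix = [], []
        if rhs and rhs[0] in right:               # LeftPop: Xi defines bw
            prefix = [rhs.pop(0)]
        if rhs and rhs[-1] in left:               # RightPop: Xi defines wa
            suffix = [rhs.pop()]
        # Replace Xi by b Xi a in later rules (dropping Xi if it became empty).
        replacement = prefix + ([name] if rhs else []) + suffix
        for later in names[idx + 1:]:
            new_rhs = []
            for s in rules[later]:
                new_rhs.extend(replacement if s == name else [s])
            rules[later] = new_rhs
        if not rhs:
            del rules[name]
    return rules


def expand(symbol, rules):
    """The word generated by `symbol`."""
    if symbol not in rules:
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


grammar = {
    "X1": list("ababcab"),
    "X2": list("abcb") + ["X1"] + list("ab") + ["X1"] + ["a"],
}
# Make the pair ba non-crossing: b goes to the left part, a to the right part.
popped = pop_letters(grammar, left={"b"}, right={"a"})
print(expand("X2", popped) == expand("X2", grammar))
```

Each nonterminal contributes at most two popped letters per occurrence in a rule, which is consistent with the O(g) credit increase stated above for rules with at most two nonterminals.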

slide-63
SLIDE 63

Blocks & Wrap up

Idea

Similarly as for pairs:
• Xi defines $a^{\ell_i} w b^{r_i}$: change it to w
• replace Xi in rules by $a^{\ell_i} X_i b^{r_i}$
• analysis: more tricky but works, O(g)

In total

• O(g) per phase
• O(log n) phases
• O(g log n) credit in total (= size of created grammar)
• can be improved to O(g log(n/g))

17.06.2013 15/17

slide-65
SLIDE 65

Acknowledgments

• M. Lohrey: suggesting the analysis.
• P. Gawrychowski: introducing to the topic
• literature
– K. Mehlhorn, R. Sundar and Ch. Uhrig, Maintaining Dynamic Sequences under Equality Tests in Polylogarithmic Time, '97
– H. Sakamoto, A fully linear-time approximation algorithm for grammar-based compression, '05
– M. Lohrey and Ch. Mathissen, Compressed Membership in Automata with Compressed Labels, '11

17.06.2013 16/17

slide-67
SLIDE 67

Open problems, related research

Open problems

• better approximation
• simpler computational model (no RadixSort)
• addition chains ($O\left(\frac{\log n}{\log\log n}\right)$ approximation known)

Other applications: recompression

• compressed membership
• fully compressed pattern matching
• word equations

17.06.2013 17/17