Analysis of Lempel-Ziv 78 for Markov sources
SLIDE 1

Analysis of Lempel-Ziv 78 for Markov sources

Ph. Jacquet, W. Szpankowski, Inria – Purdue University

The material is made available under the CC BY-NC-ND 4.0 license: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode

SLIDE 2

Lempel-Ziv algorithm

  • Among the ten most heavily used algorithms in daily computing
  – Unix compress, GIF, PDF, etc.

SLIDE 3

Huge literature in the IT and algorithms communities

  • D. Aldous and P. Shields, A Diffusion Limit for a Class of Random-Growing Binary Trees, Probab. Th. Rel. Fields, 1988.
  • N. Merhav, Universal Coding with Minimum Probability of Codeword Length Overflow, IEEE Trans. Information Theory, 1991.
  • P. Jacquet and W. Szpankowski, Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 1995.
  • W. Schachinger, On the Variance of a Class of Inductive Valuations of Data Structures for Digital Search, Theoretical Computer Science, 1995.
  • N. Merhav and J. Ziv, On the Amount of Statistical Side Information Required for Lossy Data Compression, IEEE Trans. Information Theory, 1997.
  • R. Neininger and L. Rüschendorf, A General Limit Theorem for Recursive Algorithms and Combinatorial Structures, The Annals of Applied Probability, 2004.
  • J. Fayolle and M. D. Ward, Analysis of the Average Depth in a Suffix Tree under a Markov Model, DMTCS, 2005.
  • K. Leckey, R. Neininger and W. Szpankowski, Towards More Realistic Probabilistic Models for Data Structures: The External Path Length in Tries under the Markov Model, SODA, 2013.

SLIDE 4

LZ compression process

  • A text is fragmented into phrases (not grammatical ones).
  • Each phrase is replaced by a short code (#+symbol): the index of a previous phrase plus one extra symbol.

SLIDE 5

Phrase breaking process

  • The next phrase is the longest copy of a previously seen phrase, plus one extra symbol.
  • The code of the new phrase is the index of the copied phrase plus the extra symbol (#+symbol).
  • Final code sequence (example): 0+a 1+b 1+a 2+a
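To make the breaking rule concrete, here is a minimal Python sketch of LZ78 parsing (our illustration; the function name and sample text are not from the slides). Each emitted code is the index of the longest previously seen phrase plus one extra symbol, matching the 0+a 1+b 1+a 2+a example above.

```python
def lz78_parse(text: str):
    """Return the (index, symbol) codes of an LZ78 parse of `text`."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    codes, phrase = [], ""
    for symbol in text:
        if phrase + symbol in dictionary:
            phrase += symbol      # keep extending the current match
        else:
            codes.append((dictionary[phrase], symbol))
            dictionary[phrase + symbol] = len(dictionary)
            phrase = ""
    if phrase:                    # trailing run that repeats a known phrase
        codes.append((dictionary[phrase[:-1]], phrase[-1]))
    return codes

print(lz78_parse("aabaaaba"))     # [(0, 'a'), (1, 'b'), (1, 'a'), (2, 'a')]
```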

SLIDE 6

Breaking process via Digital Search Trees

  • Build the DST of the current phrases.
  • Use the path traced by the remaining text to find the next phrase.
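The same parse driven by an explicit digital search tree, as this slide describes, might look like the following sketch (ours; a binary alphabet is assumed): the remaining text walks down the tree until it falls off, and a node for the new phrase is inserted at that point.

```python
class DSTNode:
    def __init__(self):
        self.children = {}                    # symbol -> DSTNode

def next_phrase(root: DSTNode, text: str, start: int) -> int:
    """Walk `text` down the DST from `start` until falling off the tree;
    insert a node there and return the end position of the new phrase."""
    node, i = root, start
    while i < len(text) and text[i] in node.children:
        node = node.children[text[i]]         # longest previously seen phrase
        i += 1
    if i < len(text):
        node.children[text[i]] = DSTNode()    # new phrase = old phrase + 1 symbol
        i += 1
    return i

root, pos, phrases = DSTNode(), 0, []
text = "aabaaaba"
while pos < len(text):
    end = next_phrase(root, text, pos)
    phrases.append(text[pos:end])
    pos = end
print(phrases)                                # ['a', 'ab', 'aa', 'aba']
```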

SLIDE 7

Two models

  • The DST “m” model:
  – m independent infinite strings inserted in a DST
  – $L_m$: the path length
  • The LZ “n” model:
  – A text of length n broken into LZ phrases
  – $M_n$: the number of phrases

SLIDE 8

Equivalence of DST m and LZ n models

  • When the text source is memoryless, the two models are equivalent.
  – Backward independence: the current DST and the rest of the text are independent.

Jacquet, P., & Szpankowski, W. (1995). Asymptotic behavior of the Lempel-Ziv parsing scheme and digital search trees. Theoretical Computer Science, 144(1-2), 161-197.

SLIDE 9

The m model with a memoryless source

  • The infinite text comes from a memoryless source.
  • Tractable because the phrases are independent.
  – P. Jacquet, W. Szpankowski, Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 1995.
  • For m phrases, the length $L_m$ of covered text:
  – tends to a normal distribution as $m \to \infty$
  – Mean: $E[L_m] = \ell(m) = \frac{m}{h}\big(\log m + \gamma(m)\big)$ with $\gamma(m) = O(1)$
  – Variance: $\mathrm{Var}[L_m] = m\, w(m)$ with $w(m) = O(\log m)$

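A quick Monte Carlo sanity check of the mean (our sketch; the correction terms γ(m) and w(m) are ignored, so only the leading order is compared): insert m independent Bernoulli(p) strings into a DST, with the first string stored at the root, and compare the path length with (m/h) log m.

```python
import math, random

def dst_path_length(m: int, p: float) -> int:
    """Insert m independent Bernoulli(p) strings into a digital search tree
    (first string at the root); return the internal path length L_m."""
    root, total = {}, 0
    for _ in range(1, m):            # strings 2..m walk down from the root
        node, depth = root, 0
        while True:
            s = 'a' if random.random() < p else 'b'
            depth += 1
            if s not in node:
                node[s] = {}         # empty slot found: store the string here
                break
            node = node[s]
        total += depth
    return total

p, m = 0.3, 5000
h = -p * math.log(p) - (1 - p) * math.log(1 - p)    # entropy in nats
avg = sum(dst_path_length(m, p) for _ in range(20)) / 20
print(avg, m / h * math.log(m))                     # same leading order
```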
SLIDE 10

The probability generating function and the nonlinear differential equation

  • Let $Q(z,u) = \sum_{m,k} P(L_m = k)\, u^k \frac{z^m}{m!}$.
  • It satisfies $\frac{\partial}{\partial z} Q(z,u) = Q(p u z,\, u)\; Q(q u z,\, u)$, with $p, q$ the symbol probabilities.
  • Central limit theorem: $P\left( \frac{L_m - E[L_m]}{\sqrt{\mathrm{Var}(L_m)}} \in [x,\, x + dx) \right) \to \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$
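Extracting means from the functional equation: comparing coefficients of $z^m/m!$ gives $Q_{m+1}(u) = u^m \sum_j \binom{m}{j} p^j q^{m-j}\, Q_j(u)\, Q_{m-j}(u)$ for $Q_m(u) = E[u^{L_m}]$, and differentiating at $u = 1$ yields an exact recurrence for $\mu_m = E[L_m]$. A short sketch (ours) computes it and compares with the leading asymptotic:

```python
import math
from math import comb

p = 0.3
q = 1 - p
h = -p * math.log(p) - q * math.log(q)    # entropy in nats

# mu_{m+1} = m + sum_j C(m,j) p^j q^(m-j) (mu_j + mu_{m-j}),  mu_0 = mu_1 = 0
M = 400
mu = [0.0] * (M + 1)
for m in range(M):
    mu[m + 1] = m + sum(comb(m, j) * p**j * q**(m - j) * (mu[j] + mu[m - j])
                        for j in range(m + 1))

print(mu[M], M / h * math.log(M))         # exact mean vs (m/h) log m
```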

SLIDE 11

From phrase to text compression

  • Number of phrases $M_n$
  – Renewal duality: $P(M_n > m) = P(L_m < n)$
  • $M_n$ is asymptotically normal:
  – Mean: $E[M_n] = \ell^{-1}(n) + O(n^{\varepsilon})$ for any $\varepsilon > 1/2$, hence $E[M_n] \sim \frac{n h}{\log n}$
  – Variance: $\mathrm{Var}[M_n] \sim \frac{\ell^{-1}(n)\, w(\ell^{-1}(n))}{\big(\ell'(\ell^{-1}(n))\big)^2} = O\!\left(\frac{n}{\log^2 n}\right)$
  • Compression rate: $C_n = (\log n + \log A)\, \frac{M_n}{n}$, with $A$ the alphabet size
  • Average redundancy: $E[C_n] - h \sim h\, \frac{\log A - \gamma(\ell^{-1}(n))}{\log n} = O\!\left(\frac{1}{\log n}\right)$
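An empirical illustration of the compression rate (our sketch; the parameters are arbitrary, and we charge each pointer its actual cost $\log M_n$, asymptotically equivalent to $\log n$): parse a long memoryless text and form $C_n = M_n(\log M_n + \log A)/n$, which should land slightly above the entropy h, with an excess of order $1/\log n$.

```python
import math, random

def lz78_num_phrases(text: str) -> int:
    """Count the phrases in an LZ78 parse of `text`."""
    dictionary, phrase, count = {"": 0}, "", 0
    for s in text:
        if phrase + s in dictionary:
            phrase += s
        else:
            dictionary[phrase + s] = len(dictionary)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)

p, n, A = 0.3, 200_000, 2
text = "".join('a' if random.random() < p else 'b' for _ in range(n))
M = lz78_num_phrases(text)
C = M * (math.log(M) + math.log(A)) / n              # nats per source symbol
h = -p * math.log(p) - (1 - p) * math.log(1 - p)
print(C, h)                                          # C slightly above h
```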

SLIDE 12

DST m model and LZ n model no longer equivalent for Markovian text

  • Markovian generation induces dependencies both time-forward and time-backward.

  Example: bbaababbaababaaababbaababababbabbaababbbbaabaababbaaaabbabbbabbbbba (correlation)

SLIDE 13

Our results on LZ compression performance for a Markovian text

  • The number of phrases: for all $\varepsilon > 1/2$, $E[M_n] = \ell^{-1}(n) + O(n^{\varepsilon})$ with $\ell(m) \sim \frac{m \log m}{h}$, and $\mathrm{Var}[M_n] = O(n^{2\varepsilon})$.
  • The distribution of the first symbol of a phrase is determined and does NOT converge to the stationary distribution of the Markov chain.
  • The redundancy satisfies $E[C_n] - h = O\!\left(\frac{1}{\log n}\right)$.

SLIDE 14

The main difficulty

  • The DST m model and the LZ n model are no longer equivalent.
  • We need another way to connect the two models.
SLIDE 15

How far can we go with the m model

  • m Markovian sources
  • Classic Markovian source:
  – One must track the initial symbol
  – The path length is asymptotically normal
  • Jacquet, P., Szpankowski, W., & Tang, J. (2001). Average profile of the Lempel-Ziv parsing scheme for a Markovian source. Algorithmica, 31(3), 318-360.

  $P^a_{m,n} = P(L_m = n \mid \text{all strings start with } a) = P(L^a_m = n)$

  $\frac{\partial}{\partial z} Q^a(z,u) = Q^a(p_{aa} u z,\, u)\; Q^b(p_{ab} u z,\, u)$, with $p_{cd}$ the transition probabilities

  $E[L^a_m] = \frac{m}{h}\big(\log m + \gamma_a(m)\big), \qquad \mathrm{Var}[L^a_m] = m\, w_a(m) = O(m \log m)$
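A simulation sketch of this Markov m model (ours; the transition matrix is an arbitrary example): insert m strings, all starting with 'a', into a DST and compare the path length with (m/h) log m, where h is now the entropy rate of the chain.

```python
import math, random

P = {'a': {'a': 0.7, 'b': 0.3}, 'b': {'a': 0.4, 'b': 0.6}}   # example chain

def dst_path_length_markov(m: int, first: str) -> int:
    """Insert m Markov strings, all starting with `first`, into a DST
    (first string at the root); return the path length L_m^first."""
    root, total = {}, 0
    for _ in range(1, m):
        node, depth, s = root, 0, first
        while True:
            depth += 1
            if s not in node:
                node[s] = {}
                break
            node = node[s]
            s = 'a' if random.random() < P[s]['a'] else 'b'
        total += depth
    return total

pi_a = P['b']['a'] / (P['a']['b'] + P['b']['a'])     # stationary P(a)
h = -sum((pi_a if x == 'a' else 1 - pi_a) * P[x][y] * math.log(P[x][y])
         for x in 'ab' for y in 'ab')                # entropy rate in nats
m = 20_000
print(dst_path_length_markov(m, 'a'), m / h * math.log(m))   # same leading order
```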

SLIDE 16

m model basic results

  • Asymptotically indifferent to the first symbol:
  – $\gamma_a(m) = \gamma(m) + O(m^{-\delta})$
  – $\gamma(m) = \bar{\gamma} + P_1(\log m)$, with $P_1(\cdot)$ periodic, when the transition matrix is rational; otherwise $\gamma(m) = \bar{\gamma}$
SLIDE 17

Extended m model with tail symbol

  • The tail symbol is the next symbol after the insertion point in the DST.
  – It would be the first symbol of the next phrase in the n model.
  – $T_m$: the number of tail symbols equal to “a”

  $P^c_{m,k,n} = P\big(T_m = k,\ L_m = n \mid \text{all strings start with } c\big), \qquad c \in \{a, b\}$

  $Q^c(z,u,v) = \sum_{m,k,n} P^c_{m,k,n}\, u^n v^k \frac{z^m}{m!}$

  $\frac{\partial}{\partial z} Q^c(z,u,v) = \big(p_{ca} v + p_{cb}\big)\, Q^a(p_{ca} u z,\, u, v)\; Q^b(p_{cb} u z,\, u, v)$

SLIDE 18

Extended m model analytical results

  • Refining the techniques of the previous m models (limited to a binary alphabet):
  – $(L^c_m, T^c_m)$ is asymptotically (jointly) normal
  – $E[T^c_m] = m\, \tau_c(m)$ with $\tau_c(m) = \tau(m) + O(m^{-\delta})$
  – $\tau(m) = \bar{\tau} + P_1(\log m)$, with $P_1(\cdot)$ periodic, when the transition matrix is rational; otherwise $\tau(m) = \bar{\tau}$
  • Notice: the asymptotic tail-symbol distribution is NOT the Markov stationary distribution.
  – $\mathrm{Cov}\big(L^c_m, T^c_m\big) = O(m \log m)$

SLIDE 19

The remaining very hard nut to crack

  • Coming back to the n model.
  – Remember: the DST m model and the LZ n model are NOT equivalent for Markov sources.

SLIDE 20

What would happen if the m and n models were equivalent for Markov?

  • LZ n model: let $\mathcal{Q}_{m,n} = P(\text{the } m \text{ first phrases have total length } n)$.
  • With memoryless sources we have $\mathcal{Q}_{m,n} = P_{m,n}$, because the m and n models are equivalent.
  • For a Markov source, a convolution over the initial and tail symbols?

  $\mathcal{Q}_{m,n} = \sum_{m_1,\,k,\,n_1} P^a_{m_1,k,n_1}\; P^b_{m-m_1,\ m_1-k,\ n-n_1}$

  • But this is wrong!

SLIDE 21

What fails in the transition from DST to LZ?

  • Carving phrases in the text
  • Arranging the phrases in a DST

  Example: b b b a b a b a a b b b a a a a b b b, with tail-symbol sequence $\sigma = (a, b, a, b, b, b)$, splitting into $\sigma^a = (a, b, b)$ and $\sigma^b = (a, b, b)$.

SLIDE 22

Enumerating permutations in the n and m models

  • Let $\sigma$ be a sequence of m symbols.
  – $\sigma$ is the sequence of tail symbols in the text (n model).
  – $\sigma^c$ is the tail-symbol sequence in the DST c-subtree (m model).
  – We have $\mathcal{Q}_{m,n} = \sum_{|\sigma| = m} \mathcal{Q}_{\sigma,n}$ and $P^a_{m,k,n} = \sum_{|\sigma| = m,\ |\sigma|_a = k} P^a_{\sigma,n}$, where

  $\mathcal{Q}_{\sigma,n} = P(\text{the } m \text{ first tail symbols follow } \sigma \text{ and cover length } n)$

  $P^c_{\sigma,n} = P(\text{the DST tail symbols follow } \sigma \text{ and the path length is } n \mid \text{all strings start with } c)$

  – But we will see that the m–n convolution does not hold; in other words:

  $\mathcal{Q}_{m,n} \ne \sum_{\sigma^a,\,\sigma^b:\ |\sigma^a| + |\sigma^b| = m}\ \sum_{n_1} P^a_{\sigma^a,\,n_1}\; P^b_{\sigma^b,\ n-n_1} = \sum_{m_1,\,k,\,n_1} P^a_{m_1,k,n_1}\; P^b_{m-m_1,\ m_1-k,\ n-n_1}$

SLIDE 23

The lost permutations

  • The following case is not feasible

b a a b a a b b a a b a

SLIDE 24

From the DST model to the LZ model: an upper-bound convolution

  • Including the lost permutations (restricted to cyclic permutations):

  $\mathcal{Q}_{m,n} \le \sum_{\sigma^a,\,\sigma^b:\ |\sigma^a|+|\sigma^b| = m}\ \sum_{n_1} P^a_{\sigma^a,\,n_1}\; P^b_{\sigma^b,\ n-n_1} = \sum_{m_1,\,k,\,n_1} P^a_{m_1,k,n_1}\; P^b_{m-m_1,\ m_1-k,\ n-n_1}$

  • Extending to all permutations:

  $\mathcal{Q}_{m,n} \le \sum_{m_1,\,k,\,n_1} P^a_{m_1,k,n_1} \Big( P^b_{m-m_1,\ m_1-k,\ n-n_1} + P^b_{m-m_1,\ m_1-k-1,\ n-n_1} + P^b_{m-m_1,\ m_1-k+1,\ n-n_1} \Big)$

  – To simplify, just remember:

  $\mathcal{Q}_{m,n} \le 3 \sum_{m_1,\,k,\,n_1} P^a_{m_1,k,n_1}\; P^b_{m-m_1,\ m_1-k,\ n-n_1}$

SLIDE 25

From DST to LZ: a useful inequality (handwaving)

  • If the $P^a_{m,k,n}$ were exactly Gaussian, then $\sum_{m_1,k,n_1} P^a_{m_1,k,n_1}\, P^b_{m-m_1,\,m_1-k,\,n-n_1}$ would be a Gaussian convolution; thus, for some $C > 0$:

  $\mathcal{Q}_{m,n} \le \frac{3}{\sqrt{2\pi C\, m \log m}}\, \exp\left( - \frac{\Big( n - \frac{m}{h}\log m - \frac{m}{h}\gamma(m) - m\big(\frac{\bar{\tau}}{h} - 1\big) \Big)^2}{2 C\, m \log m} \right)$

  – The distribution would be sub-Gaussian, and the claimed results would hold.

SLIDE 26

From DST to LZ: a useful inequality (exact analysis)

  • For all $\delta > 1/2$ there exist $B, C > 0$ such that (in the irrational case):

  $\mathcal{Q}_{m,n} \le B\, m^{1+\delta} \exp\left( - \frac{\Big| n - \frac{m}{h}\log m - \frac{m}{h}\gamma(m) - m\big(\frac{\bar{\tau}}{h} - 1\big) \Big|}{C\, m^{\delta}} \right)$

  – The bound is two-sided exponential.
  – Here $h(y) = -y \log y - (1-y)\log(1-y)$.

SLIDE 27

Simulation m model versus n model
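A sketch of such a comparison (ours, using an arbitrary binary Markov chain): the n model parses one Markov text of length n into M_n phrases; the m model then inserts M_n independent Markov strings into a fresh DST and reports the total covered length, which fluctuates around n but, for a Markov source, does not follow the same distribution.

```python
import random

P = {'a': {'a': 0.7, 'b': 0.3}, 'b': {'a': 0.4, 'b': 0.6}}   # example chain

def markov_text(n: int, first: str = 'a') -> str:
    out, s = [], first
    for _ in range(n):
        out.append(s)
        s = 'a' if random.random() < P[s]['a'] else 'b'
    return "".join(out)

def n_model_phrases(text: str) -> int:
    """LZ n model: number of phrases M_n in the LZ78 parse of `text`."""
    root, pos, count = {}, 0, 0
    while pos < len(text):
        node = root
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
        if pos < len(text):
            node[text[pos]] = {}
            pos += 1
        count += 1
    return count

def m_model_covered_length(m: int, first: str = 'a') -> int:
    """DST m model: total text length covered by m independent strings."""
    root, total = {}, 0
    for _ in range(m):
        node, s, consumed = root, first, 0
        while True:
            consumed += 1
            if s not in node:
                node[s] = {}
                break
            node = node[s]
            s = 'a' if random.random() < P[s]['a'] else 'b'
        total += consumed
    return total

n = 100_000
M = n_model_phrases(markov_text(n))
print(M, m_model_covered_length(M), n)    # m-model length fluctuates around n
```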

SLIDE 28

Further work

  • The results will hold for finite alphabets larger than binary.
  – But this will produce a notational storm.
  – Infinite alphabet?
  • The results will hold for Markov sources with finite memory (order higher than 1).
  – But tail words may then overlap several phrases.
  – Infinite memory?