Analysis of Lempel-Ziv 78 for Markov sources


  1. Analysis of Lempel-Ziv 78 for Markov sources • Ph. Jacquet, W. Szpankowski • Inria – Purdue U • The material is made available under the CC BY-NC-ND 4.0 license: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode

  2. Lempel Ziv algorithm • Among the ten most heavily used algorithms in everyday computing – Unix compress, GIF, PDF, etc.

  3. Huge literature in the IT and algorithms communities
     • D. Aldous and P. Shields, A Diffusion Limit for a Class of Randomly-Growing Binary Trees, Probab. Th. Rel. Fields, 1988.
     • N. Merhav, Universal Coding with Minimum Probability of Codeword Length Overflow, IEEE Trans. Information Theory, 1991.
     • P. Jacquet and W. Szpankowski, Asymptotic behavior of the Lempel-Ziv parsing scheme and digital search trees, Theoretical Computer Science, 1995.
     • W. Schachinger, On the variance of a class of inductive valuations of data structures for digital search, Theoretical Computer Science, 1995.
     • N. Merhav and J. Ziv, On the Amount of Statistical Side Information Required for Lossy Data Compression, IEEE Trans. Information Theory, 1997.
     • R. Neininger and L. Rüschendorf, A General Limit Theorem for Recursive Algorithms and Combinatorial Structures, The Annals of Applied Probability, 2004.
     • J. Fayolle and M. D. Ward, Analysis of the average depth in a suffix tree under a Markov model, DMTCS, 2005.
     • K. Leckey, R. Neininger and W. Szpankowski, Towards More Realistic Probabilistic Models for Data Structures: The External Path Length in Tries under the Markov Model, SODA, 2013.

  4. LZ compression process
     • A text is fragmented into phrases (not grammatical phrases).
     • Each phrase is replaced by a short code (phrase index + extra symbol).

  5. Phrase breaking process
     • The next phrase is the longest copy of a previously seen phrase, plus one extra symbol.
     • The code of the new phrase is (index of the copied phrase) + (extra symbol), e.g. 2+a.
     • Final code sequence of the example: 0+a 1+b 1+a 2+a.
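
     A minimal sketch of this phrase-breaking rule in Python (the function name is illustrative, not from the talk); it reproduces the slide's example code sequence on the text aabaaaba, one text consistent with that example:

     ```python
     def lz78_parse(text):
         """LZ78 parsing: each phrase = longest previously seen phrase + one extra symbol.
         Returns the codes (phrase_index, extra_symbol); index 0 is the empty phrase.
         A trailing incomplete phrase is ignored in this sketch."""
         book = {"": 0}                 # phrase -> index
         codes, phrase = [], ""
         for symbol in text:
             if phrase + symbol in book:          # keep copying a known phrase
                 phrase += symbol
             else:                                # longest copy found: emit its code
                 codes.append((book[phrase], symbol))
                 book[phrase + symbol] = len(book)
                 phrase = ""
         return codes

     # Reproduces the slide's code sequence 0+a 1+b 1+a 2+a:
     print(lz78_parse("aabaaaba"))    # [(0, 'a'), (1, 'b'), (1, 'a'), (2, 'a')]
     ```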

  6. Breaking process via Digital Search Trees
     • Build the DST of the current phrases (one node per phrase, the root being the empty phrase).
     • Use the path traced in the DST by the remaining text to find the next phrase.

  7. Two models
     • The DST "m" model: m independent infinite strings inserted in a DST; L_m is the path length.
     • The LZ "n" model: a text of length n broken into LZ phrases; M_n is the number of phrases.
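
     A small simulation sketch contrasting the two models, assuming a binary memoryless source with P(a) = p (all names and parameter values are illustrative):

     ```python
     import math, random
     from itertools import islice

     def memoryless(p):
         """Infinite binary memoryless source with P('a') = p."""
         while True:
             yield 'a' if random.random() < p else 'b'

     def dst_insert(root, stream):
         """Insert one string into the DST; return the depth of the new node."""
         node, depth = root, 0
         for sym in stream:
             depth += 1
             if sym not in node:
                 node[sym] = {}
                 return depth
             node = node[sym]

     def num_phrases(text):
         """M_n: number of complete LZ phrases in a text."""
         book, phrase, m = {""}, "", 0
         for sym in text:
             phrase += sym
             if phrase not in book:
                 book.add(phrase)
                 m, phrase = m + 1, ""
         return m

     p = 0.6
     h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy in bits

     m, root = 2000, {}                                      # DST "m" model
     L_m = sum(dst_insert(root, memoryless(p)) for _ in range(m))
     print(L_m, m * math.log2(m) / h)                        # E[L_m] ~ (m/h) log m

     n = 200_000                                             # LZ "n" model
     M_n = num_phrases(islice(memoryless(p), n))
     print(M_n, h * n / math.log2(n))                        # E[M_n] ~ h n / log n
     ```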

  8. Equivalence of the DST m and LZ n models
     • When the text source is memoryless the two models are equivalent.
     – Backward independence: the current DST and the rest of the text are independent.
     • Jacquet, P., & Szpankowski, W. (1995). Asymptotic behavior of the Lempel-Ziv parsing scheme and digital search trees. Theoretical Computer Science, 144(1-2), 161-197.

  9. The memoryless source, m model
     • The infinite text comes from a memoryless source.
     • Tractable because the phrases are independent.
     – P. Jacquet, W. Szpankowski, Asymptotic behavior of the Lempel-Ziv parsing scheme and digital search trees, Theoretical Computer Science, 1995.
     • For m phrases, the length of covered text L_m:
     – tends to a normal law when m → ∞;
     – mean E[L_m] = ℓ(m) = (m/h)(log m + γ(m)) with γ(m) = O(1);
     – variance Var[L_m] = (m/h) v(m) with v(m) = O(log m).

  10. The probability generating function and the nonlinear differential equation
     • Let P(z, u) = Σ_{m,k} P(L_m = k) u^k z^m / m!.
     • It satisfies ∂/∂z P(z, u) = P(q_a u z, u) · P(q_b u z, u), where q_a, q_b are the symbol probabilities.
     • Local limit theorem: P( (L_m − E[L_m]) / √Var(L_m) ∈ [x, x + dx[ ) → exp(−x²/2) / √(2π) dx.
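
     Extracting the coefficient of z^m/m! in this equation and differentiating at u = 1 gives a recurrence for the mean, L_{m+1} = m + L_K + L'_{m−K} with K ~ Binomial(m, q_a). A numerical sketch of that recurrence (binary alphabet, q_a = p assumed; illustrative values):

     ```python
     import math

     def mean_path_length(M, p):
         """E[L_m], m = 0..M, from the coefficient recurrence of the equation
         dP/dz (z,u) = P(q_a u z, u) P(q_b u z, u):
         L_{m+1} = m + L_K + L'_{m-K},  K ~ Binomial(m, q_a)."""
         q = 1 - p
         E = [0.0] * (M + 1)
         for m in range(M):
             E[m + 1] = m + sum(
                 math.comb(m, k) * (p**k * q**(m - k) + q**k * p**(m - k)) * E[k]
                 for k in range(m + 1))
         return E

     p = 0.6
     h = -(p * math.log(p) + (1 - p) * math.log(1 - p))
     m = 400
     print(mean_path_length(m, p)[m], (m / h) * math.log(m))  # leading term (m/h) log m
     ```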

  11. From phrase to text compression
     • Number of phrases M_n, via the renewal relation P(M_n ≥ m) = P(L_m ≤ n):
     – asymptotically normal;
     – mean E[M_n] = ℓ⁻¹(n) + O(n^ε) for any ε > 1/2, and E[M_n] ~ ℓ⁻¹(n) ~ h n / log n;
     – variance Var[M_n] ~ n v(ℓ⁻¹(n)) / (ℓ′(ℓ⁻¹(n)))², with a denominator of order log² n.
     • Compression rate: C_n = (M_n / n)(log M_n + log A), with A the alphabet size.
     • Average redundancy: E[C_n] − h ~ h (log A − γ(ℓ⁻¹(n))) / log n = O(1/log n).
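
     Both the phrase count asymptotics and the slow O(1/log n) redundancy can be observed in a quick Monte Carlo sketch (memoryless source; parameter values illustrative):

     ```python
     import math, random

     def num_phrases(text):
         """M_n: number of complete LZ phrases."""
         book, phrase, m = {""}, "", 0
         for sym in text:
             phrase += sym
             if phrase not in book:
                 book.add(phrase)
                 m, phrase = m + 1, ""
         return m

     p, n, A = 0.6, 1_000_000, 2
     text = random.choices("ab", weights=[p, 1 - p], k=n)
     M = num_phrases(text)
     h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

     print(M, h * n / math.log2(n))                   # E[M_n] ~ l^{-1}(n) ~ h n / log n
     print(M * (math.log2(M) + math.log2(A)) / n, h)  # C_n -> h, slowly: O(1/log n) gap
     ```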

  12. DST m model and LZ n model no longer equivalent for Markovian text
     • A Markovian source creates dependencies both forward and backward in time: the current DST and the remaining text are correlated.
     • Example text: bbaababbaababaaababbaababababbabbaababbbbaabaababbaaaabbabbbabbbbba

  13. Our result on LZ compression performance for a Markovian text
     • The number of phrases: for all ε > 1/2, E[M_n] = ℓ⁻¹(n) + O(n^ε), with ℓ(m) ~ (m/h) log m, and Var[M_n] = O(n^{2ε}).
     • The distribution of the first symbol of the phrases is determined and does NOT converge to the stationary distribution of the Markov chain.
     • The average redundancy satisfies E[C_n] − h = O(1/log n).

  14. The main difficulty
     • The DST m model and the LZ n model are no longer equivalent.
     • We need a way to carry the m-model analysis over to the n model despite the dependencies.

  15. How far can we go with the m model on Markovian sources
     • Classic Markovian source: one must track the initial symbol.
     – P^a_{m,n} = P(L_m = n | all strings start with a), and similarly P^b.
     – ∂/∂z P^a(z, u) = P^a(q_aa u z, u) · P^b(q_ab u z, u), with q_cd the transition probabilities.
     • Path length asymptotically normal:
     – E[L^a_m] = (m/h)(log m + γ_a(m));
     – Var[L^a_m] = m v_a(m) = O(m log m).
     • Jacquet, P., Szpankowski, W., & Tang, J. (2001). Average profile of the Lempel-Ziv parsing scheme for a Markovian source. Algorithmica, 31(3), 318-360.
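
     The coupled system translates into coupled recurrences for the conditional means. A sketch below, assuming the companion equation for P^b has the symmetric form ∂/∂z P^b = P^a(q_ba u z, u) · P^b(q_bb u z, u) (not shown on the slide) and illustrative transition probabilities:

     ```python
     import math

     def markov_dst_means(M, qaa, qba):
         """Conditional means E[L_m | all strings start with a] (Ea) and with b (Eb),
         from the coupled recurrences implied by the PGF system:
         L^a_{m+1} = m + L^a_K + L^b_{m-K}, K ~ Binomial(m, q_aa)  (symmetrically for b)."""
         qab, qbb = 1 - qaa, 1 - qba
         Ea, Eb = [0.0] * (M + 1), [0.0] * (M + 1)
         for m in range(M):
             wa = [math.comb(m, k) * qaa**k * qab**(m - k) for k in range(m + 1)]
             wb = [math.comb(m, k) * qba**k * qbb**(m - k) for k in range(m + 1)]
             Ea[m + 1] = m + sum(w * (Ea[k] + Eb[m - k]) for k, w in enumerate(wa))
             Eb[m + 1] = m + sum(w * (Ea[k] + Eb[m - k]) for k, w in enumerate(wb))
         return Ea, Eb

     qaa, qba = 0.7, 0.4
     qab, qbb = 1 - qaa, 1 - qba
     pa = qba / (qab + qba)                       # stationary P(a)
     h = -(pa * (qaa * math.log(qaa) + qab * math.log(qab))
           + (1 - pa) * (qba * math.log(qba) + qbb * math.log(qbb)))
     Ea, Eb = markov_dst_means(300, qaa, qba)
     m = 300
     print(Ea[m], Eb[m], (m / h) * math.log(m))   # both means ~ (m/h) log m
     ```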

  16. m model basic results
     • Asymptotically indifferent to the first symbol: γ_a(m) = γ(m) + O(m^{−α}) for some α > 0.
     • γ(m) = γ̄ + P₁(log m), with P₁(·) periodic when the transition matrix is rational; γ(m) = γ̄ otherwise.

  17. Extended m model with tail symbol
     • The tail symbol is the next symbol after the insertion point in the DST.
     – It would be the first symbol of the next phrase in the n model.
     – T_m = number of tail symbols equal to "a".
     • P^c_{m,k,n} = P(T_m = k & L_m = n | all strings start with c), c ∈ {a, b}.
     • P^c(z, u, w) = Σ_{m,k,n} P^c_{m,k,n} u^n w^k z^m / m!.
     • ∂/∂z P^c(z, u, w) = (q_ca w + q_cb) · P^a(q_ca u z, u, w) · P^b(q_cb u z, u, w).

  18. Extended m model analytical results
     • Refining the techniques of the previous m models (limited to a binary alphabet):
     – (L^c_m, T^c_m) is asymptotically normal;
     – E[T^c_m] = m ν_c(m), with ν_c(m) = ν(m) + O(m^{−α});
     – ν(m) = ν̄ + P₂(log m), with P₂(·) periodic when the transition matrix is rational; ν(m) = ν̄ otherwise;
     – cov(L^c_m, T^c_m) = O(m log m).
     • Notice: the asymptotic tail-symbol distribution is NOT the Markov stationary distribution.
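
     The non-stationarity of the tail-symbol distribution can be observed empirically: parse a long Markov text and compare the frequency of phrase-initial symbols (the tail symbols shifted by one phrase) with the stationary law. A sketch with illustrative parameters:

     ```python
     import random

     def markov_text(n, qaa, qba, start='a'):
         """Binary Markov source: qaa = P(a -> a), qba = P(b -> a)."""
         sym, out = start, []
         for _ in range(n):
             out.append(sym)
             pa = qaa if sym == 'a' else qba
             sym = 'a' if random.random() < pa else 'b'
         return out

     def phrase_initials(text):
         """First symbol of each LZ phrase; initials[1:] are the tail symbols."""
         book, phrase, initials = {""}, "", []
         for s in text:
             if not phrase:
                 initials.append(s)
             phrase += s
             if phrase not in book:
                 book.add(phrase)
                 phrase = ""
         return initials

     qaa, qba = 0.9, 0.2
     ini = phrase_initials(markov_text(1_000_000, qaa, qba))
     freq_a = ini.count('a') / len(ini)
     pi_a = qba / (1 - qaa + qba)     # stationary P(a) = 2/3 here
     print(freq_a, pi_a)              # typically visibly different
     ```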

  19. The remaining very hard nut to crack
     • Coming back to the n model.
     – Remember: the DST m model and the LZ n model are NOT equivalent for Markov sources.

  20. How would it be if the m and n models were equivalent for Markov sources
     • LZ n model: let 𝒬_{m,n} = P(the m first phrases have total length n).
     • With memoryless sources we have 𝒬_{m,n} = P_{m,n}, because the m and n models are equivalent.
     • For a Markov source, a convolution over the initial symbol and the tail symbols?
       𝒬^a_{m,n} = Σ_{m₁,k,n₁} P^a_{m₁,k,n₁} · P^b_{m−m₁, m₁−k, n−n₁}
     • But this is wrong!

  21. What is failing in the transition from DST to LZ?
     • Carving phrases in the text yields the tail-symbol sequence τ = (a, b, a, b, b, b).
     • Arranging the phrases in a DST splits it into per-subtree sequences τ_a = (a, b, b) and τ_b = (a, b, b).
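
     The split of τ into subtree sequences is mechanical once one notes that phrase i+1 lands in the subtree named by τ_i, and the first phrase in the subtree of the text's first symbol. A sketch reproducing the slide's example, assuming the example text starts with a:

     ```python
     def split_tails(first, tails):
         """Split the text's tail-symbol sequence into the DST subtree sequences.
         Phrase 1 sits in subtree `first`; phrase i+1 sits in subtree tails[i-1]."""
         subtree = {'a': [], 'b': []}
         where = first                  # subtree holding the current phrase
         for t in tails:
             subtree[where].append(t)   # its tail symbol, in insertion order
             where = t                  # the next phrase starts with t
         return subtree['a'], subtree['b']

     # Slide's example: tau = (a,b,a,b,b,b) gives tau_a = tau_b = (a,b,b)
     print(split_tails('a', list('ababbb')))   # (['a','b','b'], ['a','b','b'])
     ```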

  22. Enumerating permutations in the n and m models
     • Let σ be a permutation of m symbols.
     – In the n model, σ indicates the sequence of tail symbols in the text:
       𝒬_{σ,n} = P(the m first tail symbols follow σ & the covered length is n).
     – In the m model, σ_c indicates the tail-symbol sequence in the DST c-subtree:
       P^c_{σ,n} = P(the DST tail symbols follow σ & the path length is n | the sequences start with c).
     • We have 𝒬_{m,n} = Σ_{|σ|=m} 𝒬_{σ,n} and P^c_{m,k,n} = Σ_{|σ|=m, |σ|_a=k} P^c_{σ,n}.
     • But we will see that the m-n convolution fails at the level of sequences:
       𝒬_{σ,n} ≠ Σ_{n₁} P^a_{σ_a,n₁} · P^b_{σ_b,n−n₁} with |σ_a| + |σ_b| = m.
     • In other words, 𝒬^a_{m,n} ≠ Σ_{m₁,k,n₁} P^a_{m₁,k,n₁} · P^b_{m−m₁, m₁−k, n−n₁}.

  23. The lost permutations
     • The following case is not feasible: [figure: a pair of subtree tail-symbol sequences (τ_a, τ_b) that no parsed text can produce]
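
     Which pairs (σ_a, σ_b) are actually reachable can be brute-forced for small m; most pairs are "lost", which is exactly why the convolution fails. A sketch (tuple-returning variant of the previous split function; binary alphabet, m = 3, first symbol a, all assumptions illustrative):

     ```python
     from itertools import product

     def split_tails(first, tails):
         """Split a tail sequence into (sigma_a, sigma_b); tuples so results are hashable."""
         sub, where = {'a': (), 'b': ()}, first
         for t in tails:
             sub[where] += (t,)
             where = t
         return sub['a'], sub['b']

     m, first = 3, 'a'
     reachable = {split_tails(first, tau) for tau in product('ab', repeat=m)}
     all_pairs = {(sa, sb)
                  for k in range(m + 1)
                  for sa in product('ab', repeat=k)
                  for sb in product('ab', repeat=m - k)}
     print(len(reachable), 'reachable pairs out of', len(all_pairs))   # 8 out of 32
     ```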
