
Some Theory and Practice of Greedy Off-Line Textual Substitution
Alberto Apostolico, Stefano Lonardi
Purdue University and Università di Padova

Lossless Compression by Textual Substitution
The optimum encoding problem for most macro schemes is NP-complete [Storer, Szymanski 82].
❏ steepest descent OFF-LINE scheme
❏ LZ macro schemes ([Ziv, Lempel 77], [Ziv, Lempel 78])
    ➭ have linear time implementations (e.g., [Rodeh, Pratt, Even 81])
    ➭ are highly constrained (unidirectional pointers, ...)

Findings
❏ uniform improvement over PACK (Huffman) and COMPRESS (LZ-78)
❏ improvement over GZIP (LZ-77) and BZIP [Burrows, Wheeler 94] for highly random inputs (e.g., genetic sequences)
❏ computationally intensive
❏ amenable to parallel implementation where advantageous
❏ some unexpected tradeoffs
❏ some interesting algorithmic and programming problems

Overall structure of OFF-LINE

    x = <read the original text>;
    repeat
        D = <build a data structure containing, for every substring of the text x, the number of its non-overlapping occurrences>;
        s = <choose from D the substring that maximizes the compression>;
        x = <substitute all the occurrences of s in x>;
    until <no further compression of x can be obtained>;
    <run Huffman on the encoding>;

Example text x (positions 1-21): a b a a b a b a a b a a b a b a a b a b a
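The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the substring statistics are gathered by brute force instead of an augmented suffix tree, the gain is a toy measure counted in symbols, `max_len` is a hypothetical cap on candidate length, each chosen substring is replaced by a fresh private-use code point (assumed absent from the input), and the final Huffman pass is omitted.

```python
from typing import Dict, Tuple


def best_substring(text: str, max_len: int) -> Tuple[int, str]:
    """Return (gain, w) for the candidate with the largest toy gain."""
    best_gain, best_w = 0, ""
    seen = set()
    for m in range(2, min(max_len, len(text)) + 1):
        for i in range(len(text) - m + 1):
            w = text[i:i + m]
            if w in seen:
                continue
            seen.add(w)
            f = text.count(w)  # str.count counts non-overlapping occurrences
            # Toy gain (assumption): f*m symbols disappear from the text;
            # f placeholders, m dictionary symbols and 1 separator are added.
            gain = f * m - (f + m + 1)
            if gain > best_gain:
                best_gain, best_w = gain, w
    return best_gain, best_w


def offline_greedy(text: str, max_len: int = 8) -> Tuple[str, Dict[str, str]]:
    """Repeatedly substitute the best substring until nothing is gained."""
    dictionary: Dict[str, str] = {}
    next_code = 0xE000  # private-use area; assumed unused by the input
    while True:
        gain, w = best_substring(text, max_len)
        if gain <= 0:
            break
        symbol = chr(next_code)
        next_code += 1
        dictionary[symbol] = w
        text = text.replace(w, symbol)  # left-to-right, non-overlapping
    return text, dictionary


def expand(text: str, dictionary: Dict[str, str]) -> str:
    """Undo the substitutions, newest first (later entries may use earlier symbols)."""
    for symbol in reversed(list(dictionary)):
        text = text.replace(symbol, dictionary[symbol])
    return text


if __name__ == "__main__":
    x = "abaababaabaababaababa"
    packed, d = offline_greedy(x)
    print(len(x), len(packed), d)
    print(expand(packed, d) == x)
```

Rebuilding the statistics from scratch on every iteration is what makes the scheme computationally intensive; the slides on the augmented suffix tree and on the variants address exactly that cost.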

Compression in bits per input character (lower is better)

File      Size (bytes)   PACK        COMPRESS   OFF-LINE   GZIP      BZIP2
                         (Huffman)   (LZ-78)               (LZ-77)   (BWT)
bib          111,261     5.23        3.34       2.98       2.52      1.97
book1        768,771     4.56        3.45       3.43       3.26      2.42
book2        610,856     4.82        3.28       2.88       2.70      2.06
geo          102,400     5.69        6.07       5.57       5.35      4.44
news         377,109     5.22        3.86       3.26       3.07      2.51
obj1          21,504     6.07        5.22       4.45       3.84      4.01
obj2         246,814     6.30        4.17       3.50       2.64      2.47
paper1        53,161     5.03        3.77       3.29       2.79      2.49
paper2        82,199     4.64        3.51       3.19       2.89      2.43
pic          513,216     1.66        0.96       0.96       0.87      0.77
progc         39,611     5.25        3.86       3.29       2.68      2.53
progl         71,646     4.81        3.03       2.50       1.81      1.73
progp         49,379     4.91        3.11       2.70       1.82      1.73
trans         93,695     5.57        3.26       2.40       1.62      1.52
average      224,402     4.98        3.63       3.17       2.70      2.36

mito          78,521     1.84        1.82       1.73       1.97      1.84
chrI         230,195     2.19        2.18       2.16       2.30      2.16
chrVI        270,148     2.19        2.18       2.17       2.33      2.18

How to ...
❏ ... count the number of non-overlapping occurrences of each substring ➠ augmented suffix tree
❏ ... search and substitute all the occurrences of a particular substring ➠ balanced tree of text fragments

[Figure: suffix tree of x$ = abaababaabaababaababa$ (positions 1-22); each leaf is labeled with the starting position of its suffix.]

Augmented Suffix Tree
❏ collects compactly all the suffixes of a string and counts the number of non-overlapping occurrences of each substring
❏ construction: brute force O(n^2); clever O(n log^2 n) [Apostolico, Preparata 96]
❏ query: O(m)
❏ space: O(n log n) (probably O(n) [Mignosi, Breslauer p.c.])
❏ brute force construction: on average O(n log n)
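To make the statistic concrete, here is a brute-force Python baseline that tabulates the number of non-overlapping occurrences of every substring. It is a sketch for small inputs only and does not reproduce the [Apostolico, Preparata 96] construction: it stores the counts in a plain dictionary rather than a tree, and the function name and the `max_len` parameter are hypothetical.

```python
from typing import Dict, Optional


def nonoverlapping_counts(x: str, max_len: Optional[int] = None) -> Dict[str, int]:
    """Map each substring of x (up to max_len) to its number of
    non-overlapping occurrences, found by a greedy left-to-right scan."""
    n = len(x)
    limit = n if max_len is None else min(max_len, n)
    counts: Dict[str, int] = {}
    last_end: Dict[str, int] = {}  # end index of the last counted occurrence
    for m in range(1, limit + 1):
        for i in range(n - m + 1):
            w = x[i:i + m]
            # Count this occurrence only if it starts at or after the end
            # of the previously counted occurrence of the same substring.
            if last_end.get(w, 0) <= i:
                counts[w] = counts.get(w, 0) + 1
                last_end[w] = i + m
    return counts


if __name__ == "__main__":
    stats = nonoverlapping_counts("abaababaabaababaababa")
    print(stats["a"], stats["ab"], stats["aba"], stats["abaab"])
```

Taking the leftmost occurrence each time maximizes the number of non-overlapping copies of a single word, since all its occurrences have the same length. The table costs far more time and space than the tree bounds listed above.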

[Figure: augmented suffix tree of x$ = abaababaabaababaababa$ (positions 1-22); nodes additionally carry the number of non-overlapping occurrences of the corresponding substring.]

Balanced Tree of Text Fragments
[Figure: balanced tree over the fragments of the text abaababaabaababaababa$ (positions 1-22).]

Choosing and Computing a Gain Measure (1/2)
Leave one of the f_w non-overlapping occurrences of w in the text and substitute the other f_w - 1 with a pointer to the original one.
Assume an integer z can be encoded with l(z) bits; m_w = |w|; B = the average length of a symbol in bits; n = the length of the text.

Example with w = aba (f_w = 5, m_w = 3); each pointer (9,3) refers to the occurrence kept at position 9:
    before: a b a a b a b a a b a a b a b a a b a b a $ (positions 1-22)
            cost: B f_w m_w
    after:  (9,3) (9,3) b a a b a (9,3) b a (9,3) b a $
            cost: B m_w + (f_w - 1)(1 + l(n) + l(m_w)) + 1

Choosing and Computing a Gain Measure (2/2)
Remove all the f_w non-overlapping occurrences of w from the text; save w, f_w, m_w = |w| and the list of occurrences; compact the text.

Example with w = aba:
    before: a b a a b a b a a b a a b a b a a b a b a $ (positions 1-22)
            cost: B f_w m_w
    after:  saved w = a b a, m_w = 3, f_w = 5, occurrences 1 4 9 12 17; compacted text b a b a b a $ (positions 1-7)
            cost: B m_w + l(m_w) + l(f_w) + f_w l(n)
Remark: compute G only at explicit nodes of the tree
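The two cost expressions can be evaluated on the running example w = aba (f_w = 5, m_w = 3). The sketch below rests on assumptions the slides do not state: l(z) is taken to be the length of the binary representation of z, B is set to 8 bits, n is 21, and the gain G is read as the cost of the f_w plain copies minus the cost after substitution. The function names are illustrative only.

```python
def code_len(z: int) -> int:
    """Hypothetical l(z): bits in the binary representation of z."""
    return max(1, z.bit_length())


def cost_plain(B: int, f_w: int, m_w: int) -> int:
    """Cost of leaving all f_w copies of w in the text: B f_w m_w."""
    return B * f_w * m_w


def cost_pointers(B: int, f_w: int, m_w: int, n: int) -> int:
    """Scheme (1/2): keep one copy and point to it from the other f_w - 1."""
    return B * m_w + (f_w - 1) * (1 + code_len(n) + code_len(m_w)) + 1


def cost_dictionary(B: int, f_w: int, m_w: int, n: int) -> int:
    """Scheme (2/2): save w, m_w, f_w and the list of f_w positions."""
    return B * m_w + code_len(m_w) + code_len(f_w) + f_w * code_len(n)


def gain(before: int, after: int) -> int:
    """Assumed reading of G: bits saved by the substitution."""
    return before - after


if __name__ == "__main__":
    B, f_w, m_w, n = 8, 5, 3, 21
    before = cost_plain(B, f_w, m_w)
    print(gain(before, cost_pointers(B, f_w, m_w, n)))
    print(gain(before, cost_dictionary(B, f_w, m_w, n)))
```

With these assumed values the plain copies cost 120 bits, the pointer scheme 57 bits and the dictionary scheme 54 bits. Both substitutions therefore pay off, and the second does slightly better on this input.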

OFF-LINE variants
❏ Q substring selections/substitutions are performed between two consecutive updates of the tree ➠ OFF-LINE-SLOPPY
❏ all the prefixes of the current selection are also replaced if they yield compression ➠ OFF-LINE-PREF
❏ only substrings of length less than H are considered ➠ OFF-LINE-PRUNED

OFF-LINE-SLOPPY on mito (size 78521)

Heap size (Q)   Substitutions   Trees   Ratio (subst./tree)   Time      Size     %
        1             787        788           1.0           100.0%    32,798   100.00%
       10             799         83           9.6            12.11%   32,837   100.11%
      100             910         13          70.0             4.21%   33,113   100.96%
    1,000           1,174          4         293.5             4.44%   33,688   102.71%

OFF-LINE-SLOPPY on paper2 (size 82201)

Heap size (Q)   Substitutions   Trees   Ratio (subst./tree)   Time      Size     %
        1             165        165           1.0           100.0%    17,074   100.0%
       10             170         22           7.7            14.8%    17,141   100.4%
      100             303          7          43.3             7.06%   17,440   102.1%
    1,000             619          3         206.3             8.86%   17,861   104.6%

File      Size (bytes)   Iterations (OFF-LINE)
bib          111,261         927
book1        768,771       5,255
book2        610,856       4,193
geo          102,400         764
news         377,109       2,902
obj1          21,504         215
obj2         246,814       1,751
paper1        53,161         663
paper2        82,199         811
pic          513,216         113
progc         39,611         537
progl         71,646         611
progp         49,379         453
trans         93,695         616
mito          78,521         170
chrI         230,195          77
chrVI        270,148          35

Final remarks
❏ Data structures and algorithms
    ➭ parallel implementation
    ➭ update the (pruned) augmented suffix tree
❏ Empirical studies
    ➭ fine-tune the function G
    ➭ reiterate the compression on the substrings removed
    ➭ experiment with other encodings (arithmetic, move to front)
    ➭ hybrid with other schemes

Suffix Tree
❏ collects compactly all the suffixes of x$
❏ construction: brute force O(n^2); clever O(n) [Weiner 73], [McCreight 76], [Ukkonen 95]; in parallel O(log n) using n processors [AILSV 83]
❏ query time: O(m)
❏ space: O(n)
❏ brute force construction: on average O(n log n) (e.g. [Aho, Hopcroft, Ullman 74], [Apostolico, Szpankowski 92], [Chang, Lawler 94])
❏ occurrences of a substring w = leaves reachable from the node rooted at w
But ... we need the statistic of non-overlapping occurrences
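The gap between the two statistics shows up on the example string. The Python sketch below stands in for the suffix tree with a naively sorted list of suffixes (an assumption made for brevity, with quadratic space): the suffixes that start with w play the role of the leaves below the node of w, so their number is the count of all occurrences, overlaps included. The helper names are hypothetical.

```python
from bisect import bisect_left
from typing import List


def sorted_suffixes(x: str) -> List[str]:
    """Naive stand-in for a suffix tree: all suffixes in sorted order."""
    return sorted(x[i:] for i in range(len(x)))


def count_occurrences(suffixes: List[str], w: str) -> int:
    """Number of suffixes having w as a prefix (all occurrences of w).

    Assumes the text does not contain the code point U+10FFFF, which is
    used here as an upper sentinel for the range of matching suffixes.
    """
    lo = bisect_left(suffixes, w)
    hi = bisect_left(suffixes, w + chr(0x10FFFF))
    return hi - lo


if __name__ == "__main__":
    x = "abaababaabaababaababa$"
    suffixes = sorted_suffixes(x)
    print(count_occurrences(suffixes, "aba"))  # all occurrences
    print(x.count("aba"))                      # non-overlapping occurrences
```

For w = aba the suffix-based count is 8, while only 5 occurrences can be replaced without overlap (those starting at positions 1, 4, 9, 12 and 17), which is why the plain suffix tree needs to be augmented.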

[Figure: suffix tree of x$ = abaababaabaababaababa$ (positions 1-22) with internal nodes annotated with occurrence counts.]

Allocating the Augmented Suffix Tree
❏ structure of the node (array, linked list, balanced search tree, global hash table)
❏ space considerations: linked list, 20 bytes per node; avg 1.5 nodes per symbol ➟ avg 30 bytes per symbol (bps)
    ➭ two indices [i, j]
    ➭ one pointer to list of children
    ➭ one pointer to list of siblings
    ➭ one counter for the number of non-overlapping occurrences
❏ variations (Patricia 12 bps [Morrison 68], suffix array 6 bps [Manber, Myers 93], suffix cactus 9 bps [Kärkkäinen 95], level-compressed trie 11 bps [Andersson, Nilsson 95])
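The 20-byte figure is consistent with five 4-byte fields per node. The Python sketch below checks that arithmetic under the assumption that indices, pointers and the counter are all 32-bit integers; the slide gives only the total, so the field widths, the field names and the use of -1 as a null pointer are illustrative.

```python
import struct

# Assumed layout: i, j, first_child, next_sibling, count as 32-bit integers.
NODE_FORMAT = "<5i"
NODE_BYTES = struct.calcsize(NODE_FORMAT)


def pack_node(i: int, j: int, first_child: int, next_sibling: int, count: int) -> bytes:
    """Serialize one linked-list node; -1 stands for a null pointer."""
    return struct.pack(NODE_FORMAT, i, j, first_child, next_sibling, count)


def bytes_per_symbol(nodes_per_symbol: float = 1.5) -> float:
    """Average space per text symbol given the average node count."""
    return NODE_BYTES * nodes_per_symbol


if __name__ == "__main__":
    node = pack_node(1, 3, -1, -1, 5)
    print(len(node), bytes_per_symbol())
```

Packing the nodes into one flat array of such records, with array offsets in place of machine pointers, is one way to hold the per-node cost at 20 bytes.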

Dynamic text and statistics indexing problem
❏ the augmented suffix tree is a suitable data structure for our needs
❏ how is the tree modified if we delete a character from the text?
❏ what happens if we delete all the occurrences of a substring?
❏ is there an algorithm to efficiently "update" the tree and its statistics?
    ➭ dynamic text problem [McCreight 76], [Fiala, Green 89], [Gu, Farach, Beigel 94], [Ferragina 97]
