SLIDE 1

© Performance Engineering Laboratory
SPIRE 2012, Cartagena

Improved Address-Calculation Coding of Integer Arrays

Jyrki Katajainen^{1,2}, Amr Elmasry^{3}, Jukka Teuhola^{4}

1 University of Copenhagen
2 Jyrki Katajainen and Company
3 Alexandria University
4 University of Turku

SLIDE 2

Problem formulation

Given: an array of integers {xi | i ∈ {1, 2, . . . , n}}
Wanted: a compressed representation with fast random access

Operations:
  • access(i): retrieve xi
  • insert(i, v): insert v before xi
  • delete(i): remove xi

Other operations (omitted in this talk):
  • sum(j): retrieve the prefix sum x1 + · · · + xj
  • search(p): find the rank of the given prefix sum p
  • modify(i, v): change xi to v

Many solutions are known; see the list of references in the paper.

Theoretical approaches:
  • O(1) worst-case-time access
  • overhead of o(n) bits with respect to some measure of compactness
  • complicated

Practical approaches:
  • slower access
  • O(n) bits of overhead
  • implementable
  • fast in practice
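The intended semantics of the operations can be pinned down with a plain uncompressed array. This is only a reference sketch (1-based indices, as in the talk), not any of the paper's compressed structures:

```python
class IntArray:
    """Uncompressed reference implementation fixing the semantics of
    the operations above (1-based indices, as in the talk)."""

    def __init__(self, xs):
        self.xs = list(xs)

    def access(self, i):            # retrieve xi
        return self.xs[i - 1]

    def insert(self, i, v):         # insert v before xi
        self.xs.insert(i - 1, v)

    def delete(self, i):            # remove xi
        del self.xs[i - 1]

    def modify(self, i, v):         # change xi to v
        self.xs[i - 1] = v

    def sum(self, j):               # prefix sum x1 + ... + xj
        return sum(self.xs[:j])

    def search(self, p):            # smallest rank j with sum(j) >= p
        total = 0                   # (linear scan; the compressed
        for j, x in enumerate(self.xs, 1):  # structures do better)
            total += x
            if total >= p:
                return j
        return len(self.xs) + 1
```

The compressed representations discussed in the talk support the same interface while using close to the information-theoretically necessary number of bits.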
SLIDE 3

Measures of compactness

What is optimal?
  n: # integers
  x̂ = max_{i=1}^{n} xi
  s = Σ_{i=1}^{n} xi

Data-aware measure
  Raw representation: Σ_{i=1}^{n} ⌈lg(1 + xi)⌉ bits
  Overhead: in order to support random access we expect to need some more bits

Data-independent measures
  Compact representation: n lg(1 + s/n) + O(n) bits; apply Jensen's inequality to the raw representation and accept a linear overhead
  Lower bound 1: ⌈lg x̂ⁿ⌉ bits, where x̂ⁿ is the number of sequences of n positive integers whose values are at most x̂
  Lower bound 2: ⌈lg (s−1 choose n−1)⌉ bits, where (s−1 choose n−1) is the number of sequences of n positive integers that add up to s
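These measures are easy to evaluate mechanically. The following sketch computes the raw size, the compact bound (with its O(n) term omitted), and lower bound 2 for a given sequence:

```python
import math

def raw_bits(xs):
    # data-aware measure: sum of ceil(lg(1 + xi)) over the sequence
    return sum(math.ceil(math.log2(1 + x)) for x in xs)

def compact_bits(xs):
    # data-independent bound n * lg(1 + s/n); O(n) term omitted
    n, s = len(xs), sum(xs)
    return n * math.log2(1 + s / n)

def lower_bound2(xs):
    # ceil(lg C(s-1, n-1)): log of the number of sequences of
    # n positive integers that add up to s
    n, s = len(xs), sum(xs)
    return math.ceil(math.log2(math.comb(s - 1, n - 1)))
```

Since lg(1 + x) is concave, Jensen's inequality gives raw_bits(xs) ≤ compact_bits(xs) + n for any sequence of n positive integers (the extra n absorbs the ceilings).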

SLIDE 4

Two trivial “solutions”

Uncompressed array a:
  w: size of a machine word
  Space: w · n + O(w) bits
  access(i): a[i]

Access times on my computer (ns per operation):

    n      sequential   random
    2^10   0.89          1.1
    2^15   0.74          1.4
    2^20   0.89          7.1
    2^25   0.74         10.9

  – no compression
  + fast

Fixed-length coding:
  x̂ = max_{i=1}^{n} xi
  β = ⌈lg(1 + x̂)⌉
  Space: β · n + O(w) bits
  access(i):
    • compute the word address
    • read one or two words
    • mask the bits needed

  – one outlier ruins the compactness
  + relatively fast

Q: How would you support insert and delete for these structures?
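A sketch of fixed-length coding with w = 64-bit words: each value occupies exactly β bits of a packed word array, and access performs the three steps above (word address, one-or-two-word read, mask). Indices are 0-based here; the layout is illustrative:

```python
W = 64  # word size in bits

def pack(xs, beta):
    """Pack each x into beta bits; value i occupies the bit range
    [i * beta, (i + 1) * beta) of the word array."""
    n_bits = len(xs) * beta
    words = [0] * ((n_bits + W - 1) // W)
    for i, x in enumerate(xs):
        assert 0 <= x < (1 << beta)
        j, off = divmod(i * beta, W)       # word address + bit offset
        words[j] |= (x << off) & ((1 << W) - 1)
        if off + beta > W:                 # value straddles two words
            words[j + 1] |= x >> (W - off)
    return words

def access(words, beta, i):
    """Read one or two words and mask out the beta bits of x_i."""
    j, off = divmod(i * beta, W)           # compute the word address
    chunk = words[j] >> off                # read the first word
    if off + beta > W:                     # second word needed
        chunk |= words[j + 1] << (W - off)
    return chunk & ((1 << beta) - 1)       # mask the bits needed
```

One outlier ruins the compactness here exactly as the slide says: β is dictated by the maximum value x̂, not by the typical value.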

SLIDE 5

Two examples

Example 1: x1 = n, xi = 1 for i ∈ {2, . . . , n}
  Raw representation: n + O(lg n) bits
  Fixed-length coding: n⌈lg(1 + n)⌉ bits
  Lower bound 1: ⌈n lg n⌉ bits

Example 2: x1 = n², xi = 1 for i ∈ {2, . . . , n}
  Raw representation: n + O(lg n) bits
  Compact representation: n lg n + Θ(n) bits
  Lower bound 1: ⌈2n lg n⌉ bits
  Lower bound 2: n lg n + Θ(n) bits

N.B. All our representations are compact, but we do not claim them to be optimal.
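The effect of the single outlier in the first example is easy to check numerically; a quick sanity check (our own, not from the paper) for n = 1024:

```python
import math

n = 1024
xs = [n] + [1] * (n - 1)                 # example 1: x1 = n, rest 1

# raw representation: ceil(lg(1+n)) bits for x1, 1 bit for each other xi
raw = sum(math.ceil(math.log2(1 + x)) for x in xs)

# fixed-length coding: every value gets beta = ceil(lg(1 + max)) bits
beta = math.ceil(math.log2(1 + max(xs)))
fixed = len(xs) * beta
```

With n = 1024 the raw size is 11 + 1023 = 1034 bits, while fixed-length coding spends 1024 · 11 = 11264 bits: the one outlier inflates the size by roughly a lg n factor.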

SLIDE 6

Our contribution

Teuhola 2011: Interpolative coding of integer sequences supporting log-time random access, Inform. Process. Manag. 47, 5, 742–761
  Space: n lg(1 + s/n) + O(n) bits, i.e. compact
  access: O(lg(n + s)) worst-case time
  insert, delete: not supported

This paper:
  Space: n lg(1 + s/n) + O(n) bits, i.e. compact
  access: O(lg lg(n + s)) worst-case time in the static case, O(lg n) worst-case time in the dynamic case
  insert, delete: O(lg n + w²) worst-case time

Notation:
  n: # integers (assume n ≥ w)
  s: sum of the integers
  w: size of a machine word

SLIDE 7

Address-calculation coding

[Figure: encoding tree over the sequence 5 2 5 2 3 5 14 7 2 9 4 4 1 21, with stored code words 10101, 01110, 1001, 0100, 010, 010, 10, 100]

  • encoding in depth-first order
  • yellow nodes not stored
  • skip subtrees using the formula

Space: compact, by the magical formula
access: O(lg n) worst-case time (assuming that the position of the most significant one bit in a word can be determined in O(1) time)
insert, delete: not supported

Magical formula, with t = ⌈lg(1 + s)⌉:

  B(n, s) = n(t − lg n + 1) + ⌊s(n−1)/2^{t−1}⌋ − t − 1,        if s ≥ n/2
  B(n, s) = 2t + ⌊s(2 − 1/2^{t−1})⌋ − t − 1 + s(lg n − t),     otherwise
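Transcribed directly from the slide (the exact form should be checked against the paper), the magical formula can be written down as:

```python
import math

def B(n, s):
    """Space bound, in bits, of address-calculation coding for n
    positive integers summing to s; transcribed from the slide,
    so treat the exact constants as an assumption."""
    t = math.ceil(math.log2(1 + s))
    if s >= n / 2:
        return (n * (t - math.log2(n) + 1)
                + math.floor(s * (n - 1) / 2 ** (t - 1)) - t - 1)
    return (2 * t + math.floor(s * (2 - 1 / 2 ** (t - 1)))
            - t - 1 + s * (math.log2(n) - t))
```

During access the coder evaluates B for a subtree's (n, s) pair to compute the exact bit address where that subtree's encoding ends, which is what lets it skip subtrees without decoding them.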

SLIDE 8

Indexed address-calculation coding

  c: a tuning parameter, c ≥ 1
  si: sum of the numbers in the ith chunk
  chunk size: k = ⌊c · lg(n + s)⌋
  # chunks: t = ⌈n/k⌉
  root: ⌈lg(1 + s)⌉ bits
  pointer: lg n + lg lg(1 + s/n) + O(1) bits
  chunks: address-calculation coding
  index: fixed-length coding

Analysis:
  roots: ⌈n/k⌉ · ⌈lg(1 + s)⌉ ≤ n/c + O(w) bits
  pointers: ⌈n/k⌉ · (lg n + lg lg(1 + s/n) + O(1)) ≤ n/c + O(w) bits
  chunks: Σ_{i=1}^{t} [k · lg(1 + si/k) + O(k)] ≤ n lg(1 + s/n) + O(n) bits
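The two-level idea can be sketched with a stand-in chunk coder (unary here, purely for illustration; the real structure uses address-calculation coding per chunk): the fixed-length index gives the bit offset of the right chunk in O(1) time, after which only that one chunk is decoded.

```python
import math

def build(xs, c=1):
    """Cut xs into chunks of k = floor(c * lg(n + s)) integers and
    record a pointer (bit offset) per chunk; unary-code each chunk
    as a stand-in for the address-calculation coder."""
    n, s = len(xs), sum(xs)
    k = max(1, math.floor(c * math.log2(n + s)))
    pieces, index, bits = [], [], 0
    for i in range(0, n, k):
        index.append(bits)                     # pointer to chunk i // k
        enc = "".join("0" * x + "1" for x in xs[i:i + k])
        bits += len(enc)
        pieces.append(enc)
    return "".join(pieces), index, k

def access(encoded, index, k, i):              # 0-based access(i)
    pos = index[i // k]                        # O(1) jump via the index
    for _ in range(i % k):                     # decode within the chunk
        pos = encoded.index("1", pos) + 1
    end = encoded.index("1", pos)
    return end - pos                           # unary: value = run of 0s
```

Because a chunk holds only Θ(lg(n + s)) integers, the work after the index jump is bounded by the chunk size rather than by n.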

SLIDE 9

Other applications of indexing

Indexed Elias delta coding
  c: a tuning parameter, c ≥ 1
  chunk size: k = ⌊c · (lg n + lg lg s)⌋
  # chunks: t = ⌈n/k⌉
  pointer: lg n + lg lg(1 + s/n) + O(1) bits
  chunks: Elias delta coding
  index: fixed-length coding
  Space: raw + O(Σ_{i=1}^{n} lg lg xi) bits
  access: O(lg n + lg lg s) worst-case time

Indexed fixed-length coding
  c: a tuning parameter, c ≥ 1
  x̂ = max_{i=1}^{n} xi
  chunk size: k = ⌊c · (lg n + lg lg x̂)⌋
  # chunks: t = ⌈n/k⌉
  pointer: lg n + lg lg(1 + x̂) + O(1) bits
  offsets: fixed-length coding (landmark + offset)
  index: fixed-length coding
  data: raw coding
  Space: raw + O(n lg lg(n + x̂)) bits
  access: O(1) worst-case time
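For reference, Elias delta coding itself (the chunk coder above, not the indexing) can be sketched as follows: a value x ≥ 1 is written as the Elias gamma code of its bit length N, followed by the N − 1 lower-order bits of x. Bit strings stand in for real bit streams here:

```python
def delta_encode(x):
    """Elias delta code of a positive integer x, as a bit string."""
    assert x >= 1
    n = x.bit_length()                  # N = floor(lg x) + 1
    # gamma code of N: (len(N)-1) zeros, then N in binary
    gamma = "0" * (n.bit_length() - 1) + bin(n)[2:]
    return gamma + bin(x)[3:]           # the N-1 low-order bits of x

def delta_decode(bits, pos=0):
    """Decode one value starting at bits[pos]; returns (x, new_pos)."""
    l = 0
    while bits[pos + l] == "0":         # count leading zeros
        l += 1
    n = int(bits[pos + l:pos + 2 * l + 1], 2)   # N, stored in l+1 bits
    pos += 2 * l + 1
    x = (1 << (n - 1)) | int(bits[pos:pos + n - 1] or "0", 2)
    return x, pos + n - 1
```

The code of x takes ⌊lg x⌋ + 2⌊lg(⌊lg x⌋ + 1)⌋ + 1 bits, i.e. raw plus the O(lg lg x) per-integer overhead stated on the slide.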

SLIDE 10

Dynamization

  c: a tuning parameter, c ≥ 1
  w: size of a machine word
  chunk size: k = cw/2 .. 2cw
  # chunks: t = ⌈n/(2cw)⌉ .. ⌈2n/(cw)⌉
  root: w bits
  pointer: w bits
  chunks: address-calculation coding
  index: balanced search tree

Use the zone technique:
  • align chunks to word boundaries
  • keep chunks of the same size in separate zones
  • only w zones
  • maintain zones as rotated arrays (one chunk may be split)

Space: still compact
access: O(lg n) worst-case time (n ≥ w)
insert, delete: O(lg n + w²) worst-case time
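One way to see why same-size chunks help: a zone can be kept dense with a move-last-into-hole trick, so adding or removing a chunk costs O(k) word moves. The class below is an illustrative simplification (the paper's rotated arrays refine this), not the paper's actual structure:

```python
class Zone:
    """A zone stores chunks that all occupy the same number of words
    k, so it can be kept dense: deleting a chunk moves the last chunk
    into the hole in O(k) time."""

    def __init__(self, k):
        self.k = k                     # words per chunk in this zone
        self.buf = []                  # backing array of words

    def add(self, chunk):              # O(k); returns the chunk's slot
        assert len(chunk) == self.k
        self.buf.extend(chunk)
        return len(self.buf) // self.k - 1

    def get(self, slot):
        return self.buf[slot * self.k:(slot + 1) * self.k]

    def remove(self, slot):
        """O(k): fill the hole with the last chunk; report which slot
        moved so the search-tree index can update its pointer."""
        last_slot = len(self.buf) // self.k - 1
        if slot != last_slot:
            self.buf[slot * self.k:(slot + 1) * self.k] = \
                self.buf[last_slot * self.k:]
        del self.buf[last_slot * self.k:]
        return last_slot
```

With only w zones and chunks of O(w) words, rewriting a whole chunk on insert or delete stays within the O(lg n + w²) budget stated above.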

SLIDE 11

Experimental setup

Benchmark data: n integers
  – uniformly distributed
  – exponentially distributed
Repetitions: each experiment repeated r times for sufficiently large r
Reported value: measurement result divided by r × n
Processor: Intel® Xeon® CPU, 1.8 GHz × 2
Programming language: C
Compiler: gcc with optimization -O3
Source code: available from Jukka’s home page

SLIDE 12

Experimental results: Overhead

[Two plots of bits per source integer: one against the range size (2 .. 1024, uniform data), one against λ (1/64 .. 8, exponential data). Series: Indexed modifiable array, Indexed static array, Basic AC-coded array, Entropy.]

– entropy of xi: expected information content of xi
– for a random floating-point number yi, yi ≥ 0: xi = ⌈−ln(1 − yi)/λ⌉
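The exponentially distributed inputs can be regenerated by inverting the CDF as on the slide. Reading the slide's bracket as a ceiling is our assumption, as are the seed and parameter names:

```python
import math
import random

def exp_ints(n, lam, seed=42):
    """n positive integers xi = ceil(-ln(1 - yi) / lam) for uniform
    yi in [0, 1): exponentially distributed benchmark data.
    ('lam' stands for the slide's lambda; the seed is arbitrary.)"""
    rng = random.Random(seed)
    return [math.ceil(-math.log(1 - rng.random()) / lam)
            for _ in range(n)]
```

Smaller λ yields larger integers on average, which is why the overhead plots sweep λ over 1/64 .. 8.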

SLIDE 13

Experimental results: access, search, modify

[Two plots of time per operation (in microseconds) against the number of source integers (1 000 .. 1 000 000). Left: access and search for the basic AC-coded array and the indexed static array. Right: access and modify for the indexed modifiable array.]

– uniformly-distributed integers drawn from [0..63]

SLIDE 14

Further work

Theory
  • Try to understand better the trade-off between the speed of access and the amount of overhead in the data-aware case.

Applications
  • Can some of you convince me that compressed arrays are useful, or even necessary, in some information-retrieval application(s)?

Practice
  • As to the speed of access, we showed that O(lg lg(n + s)) is better than O(lg(n + s)). Can you show that O(1) is better than O(lg lg(n + s))?
  • Independent of the theoretical running time, can one get the efficiency of access closer to that provided by uncompressed arrays?

To do
  • A thorough experimental comparison!