 
Index compression and efficient query processing
COMP90042 LECTURE 3, THE UNIVERSITY OF MELBOURNE
by Matthias Petri, Tue 12/3/2019

Index compression

Inverted Index - Recap
Columns: term t, f_t, postings list for t
Terms: in, where, sleep, house, night, the, old, big, and

f_t   Postings list for t
6     ⟨1, 6, 7, 8, 9, 12⟩, ⟨1, 2, 1, 3, 1, 2⟩
3     ⟨2, 5, 42⟩, ⟨1, 1, 1⟩
1     ⟨32⟩, ⟨4⟩
7     ⟨2, 3, 5, 6, 8, 14, 25⟩, ⟨1, 1, 4, 1, 5, 3, 1⟩
52    ⟨1, 2, 3, 4, 5, 7, 8, 9, ...⟩, ⟨10, 21, 10, 42, 12, 14, 12, 4, ...⟩
4     ⟨1, 12, 13, 14⟩, ⟨2, 2, 1, 3⟩
5     ⟨6, 21, 32, 33, 43⟩, ⟨2, 3, 4, 2, 1⟩
3     ⟨1, 51, 53⟩, ⟨1, 2, 3⟩
4     ⟨1, 3, 4, 6⟩, ⟨1, 1, 2, 1⟩

Index Compression - Motivation
- Inverted index mostly stored in RAM (query performance)
- Companies run 1000s of machines to answer search queries
- Space reduction can lead to substantial cost reductions
- Saving 5% means shutting down 50/1000 machines!

Inverted index size for 420 GB of web data (tiny):
Documents     Terms         Postings     Uncompressed storage cost
25 Million    35 Million    6 Billion    ≈ 32 GB

Index compression
Benefits of index compression:
- Reduce storage requirements
- Keep larger parts of the index in memory
- Faster query processing

Example: A state-of-the-art inverted index of 25 million websites (420 GB) requires only 5 GB (1.2%) and can answer queries in ≈ 10 milliseconds.
32 GB → 5 GB makes the index about 6.4 times smaller, a space reduction of roughly 84%!

Compression Principles
- Compressibility is bounded by the information content of a data set.
- Information content of a text T is characterized by its entropy H:

  $H(T) = -\sum_{s \in \Sigma} \frac{f_s}{n} \log_2 \frac{f_s}{n}$

  where f_s is the frequency of symbol s in T and n is the length of T.
- Intuition: Spend fewer bits on items that occur often.
- For example, H(abracadabra) = 2.040373 bits with n = 11, f_a = 5, f_b = 2, f_c = 1, f_d = 1, f_r = 2.

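The entropy definition can be checked numerically. The following is a minimal sketch; the function name `entropy` is a choice made for this illustration and does not come from the slides.

```python
from collections import Counter
from math import log2


def entropy(text: str) -> float:
    """Zero-order entropy of `text` in bits per symbol."""
    n = len(text)
    counts = Counter(text)  # f_s for every symbol s that occurs in the text
    return -sum((f / n) * log2(f / n) for f in counts.values())


# The slide's example: n = 11, f_a = 5, f_b = 2, f_c = 1, f_d = 1, f_r = 2
print(round(entropy("abracadabra"), 6))  # 2.040373
```

Frequent symbols such as `a` contribute few bits each, which is the intuition behind spending fewer bits on items that occur often.
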
Posting list Compression
- Minimize storage costs
- Fast sequential access
- Support GEQ(x) operation: return the smallest item in the list that is greater than or equal to x

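To make the GEQ(x) operation concrete, here is a minimal sketch on an uncompressed, sorted Python list. The name `geq` and the choice to return `None` when no such item exists are assumptions of this illustration; the slides do not specify that case.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def geq(postings: Sequence[int], x: int) -> Optional[int]:
    """Smallest item in the sorted list `postings` that is >= x, or None."""
    i = bisect_left(postings, x)  # first index whose item is >= x
    return postings[i] if i < len(postings) else None


postings = [2, 3, 5, 6, 8, 14, 25]
print(geq(postings, 7))   # 8
print(geq(postings, 14))  # 14
print(geq(postings, 26))  # None
```

Binary search is possible here only because the list is stored uncompressed; this sketch says nothing about how GEQ is supported on a compressed list.
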
Posting list Compression - Concepts
- Each integer can be in [1, N], requiring log2(N) bits
- Postings list corresponds to an increasing sequence of integers
- Idea: Gaps between two adjacent integers can be much smaller

the        ids:   25 26 29 … 12345 12347
           gaps:  25 1 3 … 1 2
house      ids:   5123 5234 5454 5591
           gaps:  5123 111 220 137
aeronaut   ids:   251235 251239 …
           gaps:  251235 4 34 …

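The gap idea can be sketched in a few lines. The helper names `to_gaps` and `from_gaps` are hypothetical, and the first gap is taken relative to 0 so that the first id is stored unchanged, which matches the slide's examples.

```python
from itertools import accumulate
from typing import List, Sequence


def to_gaps(ids: Sequence[int]) -> List[int]:
    """Turn an increasing id sequence into gaps (the first id is kept as is)."""
    gaps = []
    prev = 0
    for doc_id in ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps: Sequence[int]) -> List[int]:
    """Recover the ids as the running sum of the gaps."""
    return list(accumulate(gaps))


ids = [5123, 5234, 5454, 5591]
print(to_gaps(ids))             # [5123, 111, 220, 137]
print(from_gaps(to_gaps(ids)))  # [5123, 5234, 5454, 5591]
```

Decoding needs a running sum, so reaching the k-th id means adding up the first k gaps.
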
Variable Byte Compression
Idea: Use variable number of bytes to represent integers. Each byte contains 7 bits “payload” and one continuation bit.

Examples:
Number    Encoding
824       00111000 10000110
5         10000101

Storage cost:
Number range        Number of bytes
0 - 127             1
128 - 16383         2
16384 - 2097151     3

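The storage cost ranges follow from the 7-bit payload: k bytes cover values up to 2^(7k) - 1. A small sketch, with the hypothetical helper name `vbyte_length`, reproduces the ranges:

```python
def vbyte_length(x: int) -> int:
    """Number of bytes needed to store x with 7 payload bits per byte."""
    if x < 0:
        raise ValueError("variable byte coding expects a non-negative integer")
    length = 1
    while x >= 128:  # more than 7 bits left, so another byte is needed
        x >>= 7
        length += 1
    return length


for value in (5, 127, 128, 824, 16383, 16384, 2097151):
    print(value, vbyte_length(value))
```
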
Variable Byte Compression - Example
Compress number 512312, or 1111101000100111000 in binary.
How many bytes? 11111 | 0100010 | 0111000. 3 bytes!
How do we compress the number?
1. Extract the lowest 7 bits. 512312 mod 128 = 56 = 0111000
2. Discard lowest 7 bits. 512312 ÷ 128 = 4002 (or 512312 >> 7)
3. Extract the lowest 7 bits. 4002 mod 128 = 34 = 0100010
4. Discard lowest 7 bits. 4002 ÷ 128 = 31 (or 4002 >> 7)
5. Number smaller than 128. Write in lowest 7 bits and set top bit to 1. 31 = 11111, so we write 10011111, which is 31 + 128 = 159

Variable Byte - Algorithm

Encoding
function ENCODE(x)
    while x >= 128 do
        WRITE(x mod 128)
        x = x ÷ 128
    end while
    WRITE(x + 128)
end function

Decoding
function DECODE(bytes)
    x = 0, s = 0
    y = READBYTE(bytes)
    while y < 128 do
        x = x ^ (y << s)
        s = s + 7
        y = READBYTE(bytes)
    end while
    x = x ^ ((y − 128) << s)
    return x
end function

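The two routines can be sketched in Python as follows. The function names `vbyte_encode` and `vbyte_decode_stream` are choices made for this illustration, not part of the lecture. The decoder here reads a whole byte string holding several numbers rather than a single number, and it uses bitwise OR where the slide writes `^`; the two agree because the shifted 7-bit groups never overlap.

```python
from typing import List


def vbyte_encode(x: int) -> bytes:
    """Encode one non-negative integer, lowest 7 bits first."""
    if x < 0:
        raise ValueError("variable byte coding expects a non-negative integer")
    out = bytearray()
    while x >= 128:
        out.append(x % 128)  # top bit 0: more bytes follow
        x //= 128
    out.append(x + 128)      # top bit 1: last byte of this number
    return bytes(out)


def vbyte_decode_stream(data: bytes) -> List[int]:
    """Decode every integer stored back to back in `data`."""
    numbers = []
    x, s = 0, 0
    for y in data:
        if y < 128:                  # continuation byte
            x |= y << s
            s += 7
        else:                        # final byte: strip the marker bit
            x |= (y - 128) << s
            numbers.append(x)
            x, s = 0, 0
    return numbers


print(list(vbyte_encode(512312)))  # [56, 34, 159]
stream = vbyte_encode(512312) + vbyte_encode(824) + vbyte_encode(5)
print(vbyte_decode_stream(stream))  # [512312, 824, 5]
```

Because the top bit marks the last byte of each number, the encoded numbers can be concatenated without separators and still be decoded unambiguously.
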